A “recovering data analyst,” Matthew Lynley is the founding writer of the AI newsletter Supervised, which helps readers understand the implications of new technologies. Previously, Matt was Business Insider's lead reporter on AI and big data, and lead blogger for The Wall Street Journal's technology blog Digits.
As the Co-founder and CEO of Alation, Satyen lives his passion of empowering a curious and rational world by fundamentally improving the way data consumers, creators, and stewards find, understand, and trust data. Industry insiders call him a visionary entrepreneur. Those who meet him call him warm and down-to-earth. His kids call him “Dad.”
Producer: (00:01) Hello and welcome to Data Radicals. In this episode, Satyen sits down with Matthew Lynley, founding writer of the AI newsletter Supervised. The goal of Supervised is to help readers understand the implications of new technologies and the team building it. Prior to Supervised, Matt was Business Insider's lead reporter on AI and big data. In this episode, Matt dives into LLMs, vector databases, and the rivalry between Databricks and Snowflake.
Producer: (00:30) This podcast is brought to you by Alation. We bring relief to a world of garbage-in, garbage-out with enterprise data solutions that deliver intelligence-in, intelligence-out. Learn how we fuel success in self-service analytics, data governance, and cloud data migration at alation.com.
Satyen Sangani: (00:56) Today on Data Radicals, we have the journalist Matt Lynley. Matt has spent the last decade reporting on the tech industry at publications like Business Insider, the Wall Street Journal, BuzzFeed News, and TechCrunch. Most recently, he created Supervised, a newsletter that covers AI and big data and helps folks navigate the complex space. Matthew has extensive experience with Python, SQL, C++, and supervised modeling, so he's writing from a place of knowledge. Matt, welcome to Data Radicals.
Matthew Lynley: (01:22) Thank you very much.
Satyen Sangani: (01:24) So tell us a little bit about Supervised. That's the thing that you just started — hard to do, to build a newsletter.
Matthew Lynley: (01:30) So they say.
Satyen Sangani: (01:32) I think it is. I mean, it seems like it is. It looks so from the outside like it is. So what is it? Why did you do it?
Matthew Lynley: (01:38) Basically, we're going back to November now, which feels like 10 years ago at this point, when ChatGPT came out and I was taking a break from my work at Business Insider, where I was covering big data and AI. Again, this predates ChatGPT, so we're talking Hugging Face, Weights & Biases, Snowflake, Databricks, like all the original machine learning, I guess we called it MLOps at the time.
Satyen Sangani: (01:56) The legacy crew.
Matthew Lynley: (01:57) The legacy crew, yeah.
Satyen Sangani: (01:58) Yeah.
Matthew Lynley: (01:58) We need a name for it, right? We need like the “pre era.” And while I was away, I was looking for something to just kind of keep up to speed. Something that was a little bit more technical, because I have some experience doing all of this stuff, but not so deep in the weeds as a lot of these other publications and newsletters that go hyper-technical and concern themselves with the intricacies of a loss function and things like that. I couldn't really find something in the middle. So of course I asked a hundred people what they thought and what they were reading, trying to get some sort of feedback. Generally, it seemed there wasn't a lot happening in that space. When I came back to journalism from working in analytics, the idea was to write about the things I used when I was working in analytics and doing some work in data science. So I thought I could potentially lean on my experience having been a practitioner, to a certain extent (obviously I'm not the expert here), and try to go one step deeper, to focus on the people who are making the decisions: founders, buyers, practitioners, the people that know enough to be dangerous.
Matthew Lynley: (02:57) Someone runs up to me and says, “Hey, I need 150 Weights & Biases licenses.” My first question is not “What's Weights & Biases?” It's “Why do you need 150? Tell me more about that.” That's the direction that I was thinking about when I first launched it.
Satyen Sangani: (03:10) On one level you've got sort of headline news, where you get reporting like, Hugging Face just raised at a $4.5 billion valuation from Salesforce Ventures. And then you've got the other level, which is product-oriented blogs that can be super detailed and highly specific. And you are trying to cater to the person that wants to understand what the company is and what the product does, but perhaps not live the product experience in the process of having to do that.
Matthew Lynley: (03:39) Right. It's: what are people talking about that no one's writing about? So… not how to use LangChain, but what do people think about LangChain? Talk to as many people as you possibly can, and you'll find out that LangChain is actually a very divisive subject. And the kind of story you see on Supervised is not “Is LangChain good or bad?” The story is, it's complicated, and here's why it's complicated.
Satyen Sangani: (03:57) Right. That's the kind of information that, frankly, you'd really only get with one-on-one conversations or perhaps with an industry event or meeting where people can get into a room and talk about actual deployment success or the like. And you're capturing this by talking to actual practitioners and users.
Matthew Lynley: (04:16) Hopefully I'm doing your work for you. I'm going to a conference so you don't have to and save you like $1,000 or something.
Satyen Sangani: (04:21) And so is that where the name comes from? It's like a sort of supervision of the market, in addition to talking about supervised learning.
Matthew Lynley: (04:28) It's kind of a play on a joke from some interview loops with data scientists. They're PhDs that are mind-blowingly smarter than me, but you'll walk through a scenario or a problem and they'll just jump straight to “...and then I'll build a neural net,” and it's like, “Hold on. That didn't need to be a neural net. It turns out it's literally just a really easy supervised learning problem.” And then you can just use that; it's five lines of code with scikit. Even as we've gone into this great era of LLMs and sort of modern AI, supervised learning does a lot really well, and it's much cheaper and much easier to put together. Hence the name Supervised, where the idea is that a lot of these problems are a lot simpler than we're making them out to be, and maybe we're hitting a nail with a sledgehammer here. Although if you do look at the logo, it's kind of a play on gradient descent, which is one of the first things we all learn when we're going through this stuff.
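For readers who want to see the "five lines of scikit" idea concretely, here is a minimal sketch: a plain supervised classifier on a stand-in tabular dataset instead of a neural net. The dataset and model choice are illustrative assumptions, not anything Lynley specifically recommends.

```python
# A small supervised learning problem solved with scikit-learn,
# roughly the "five lines of code" described above.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)          # stand-in tabular dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
print(f"held-out accuracy: {model.score(X_test, y_test):.3f}")
```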
Satyen Sangani: (05:11) Very cool. It's interesting. The last podcast that I just did was an interview with Anupam Jena, who's basically a Harvard medical professor. He's done a lot of work on medical practices and studying patterns. Apparently the best week to have a heart attack is the week that the American Academy of Cardiologists is having its conference, because the doctors who tend to overprescribe procedures are away at it. So this idea that expertise can be overapplied, I think it also extends to the context of tools and algorithms: when do you need this advanced thing versus when do you need just a very basic thing? That's a hard thing to figure out. It's a hard thing to figure out for people who are practitioners. It's also probably a hard thing to figure out, I think, especially for the people who are trying to buy products in this space. Because a lot of people who are not using the data products or the data itself don't have an idea of where their spend is going. Do you speak to a lot of people who are trying to figure that out?
Matthew Lynley: (06:05) Oh, absolutely. I mean, one of the biggest issues now, which again goes back three or four years (I'm sure you've dealt with this as well), is that data scientists end up doing the work of data analysts, and vice versa sometimes, depending on which company you're at and which title you're given. The reality is that most companies don't need intense deep learning operations. They need to know: is this group of customers gonna churn? That's a problem that gives me an immediate return on my investment, because I know I can stick my customer success team on this cohort, which is at risk of churning. And you don't need a PhD in order to build a churn model. The challenge was that we had these incredibly smart, intelligent data scientists coming in, but the problems that needed to be solved in most organizations were pretty straightforward.
Matthew Lynley: (06:54) Which was basically a little bit of exploratory data analysis; putting together a model; most of the time collecting the correct data set and cleaning it and curating it, which is the universal problem that we all face and will never go away, I expect, and we're gonna continue to spend all of our time on it; and then putting the model in production and monitoring it, seeing if it drifts, seeing if it starts redlining, along those lines. And that's not a Transformers-based approach necessarily. It's this funny thing where for a long time we had a lot of technical firepower, but not everyone was building a crazy recommendation algorithm. The actual answer most people needed was often just, I need to know if my customers are gonna churn.
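As a rough illustration of the "seeing if it drifts" step mentioned above, here is a hedged sketch of a basic drift check: compare a feature's training-time distribution against what the deployed model is seeing in production. The synthetic arrays and the 0.01 threshold are placeholders for illustration, not anything from the conversation.

```python
# Minimal drift check: a two-sample KS test between training data
# and recent production data for one feature.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_feature = rng.normal(loc=0.0, scale=1.0, size=5_000)  # values seen at training time
live_feature = rng.normal(loc=0.4, scale=1.0, size=5_000)   # values seen in production (shifted)

result = ks_2samp(train_feature, live_feature)
if result.pvalue < 0.01:  # illustrative threshold
    print(f"possible drift detected (KS statistic = {result.statistic:.3f}); consider retraining")
```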
Satyen Sangani: (07:28) It makes complete sense. I think that 80/20 rule applies to analytics and tooling just as much as it applies to everything else in the world. It's also why real revolutionary technologies are really hard to find because there’s sexy and then there's actually effective. Who are you writing Supervised for? What does the community of users look like? Do you know who's reading the newsletter? Do you have a sense for that?
Matthew Lynley: (07:47) I love this question because people assume I have like a grand master plan, and I'm just putting it together while it's in the air. Although to be fair, I do have a target audience in mind, which is the kinds of people that have to make these decisions, these important decisions: what tools am I gonna use, what products am I gonna use at some point? Do I go with an open source or closed source model? And when you're evaluating a contract for some of these things, they're not in prod for like a year, sometimes generously a year.
Matthew Lynley: (08:11) So when I'm signing this contract, I'm signing away the first year without really getting any real return. That's not true for everyone, obviously, but it is when you're looking at a lot of data analytics and machine learning products. Who I wanted to write for is someone with whom you can be a little flippant with the terminology because they know enough to be dangerous. They probably are able to run Llama on their laptop because it's really easy to do. It's shockingly easy to do. They know how to use the Hugging Face Transformers module. They're aware of the differences between GPT-4 and GPT-3.5 Turbo, things along those lines. But they need to go one step deeper, right? So when you look at GPT-3.5 Turbo and GPT-4, what are the actual differences from an operations perspective?
Matthew Lynley: (08:53) Like, one's much more expensive than the other. How much more expensive is it? And just trying to understand what is the business impact of the decision that I'm gonna make? Which is a technical decision, but often there's a pretty substantial dollar value associated with it.
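To make the "run Llama on their laptop" point concrete, here is a hedged sketch of running an open text-generation model locally with the Hugging Face Transformers pipeline. The model name is a small stand-in so the example runs on modest hardware; a Llama checkpoint would be loaded the same way but needs more memory and an accepted license, so treat this as illustrative only.

```python
# Running an open text-generation model locally via the Hugging Face Transformers pipeline.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # small stand-in; swap in a Llama checkpoint if you have the RAM
out = generator("One operational difference between GPT-4 and GPT-3.5 Turbo is", max_new_tokens=40)
print(out[0]["generated_text"])
```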
Satyen Sangani: (09:05) Yeah. So you're writing for technological practitioners. And I read the newsletter; it's a pretty advanced newsletter. If you didn't know anything about data or analytics, you might have a hard time digesting it, but if you are somebody who's in the weeds, a practitioner, someone who understands how to build these tools, or, frankly, someone who just has some casual association, it can be massively valuable. The nuggets you get out of the newsletter are pretty hard to discover.
Matthew Lynley: (09:30) That's the hope. Hopefully someone from Databricks or Snowflake is listening and then the entire MLE team starts subscribing to it, that'd be beautiful. [laughter] That's the target audience that I think about.
Satyen Sangani: (09:39) Yeah. I think in some sense we create that which we would wish existed. So I presume this is something that you wish you had.
Matthew Lynley: (09:46) Oh yeah. That's the other thing: I couldn't find it and I wanted to read it.
Satyen Sangani: (09:50) Have you been able to get feedback from people who have actually listened to it?
Matthew Lynley: (09:53) Yeah. I live and die by feedback and specifically brutal feedback. I love brutal feedback.
Satyen Sangani: (10:00) What's the worst feedback so far that you've received? Or it should be the “hardest” feedback. Because the worst feedback's like the feedback that doesn't actually affect you.
Matthew Lynley: (10:04) The most surprising feedback that I've received is that when I first launched it, I wanted to do three a week and I was worried that it was gonna spam people. It turns out actually people like that because it's sort of, again, it's like what are people talking about that no one's writing about? It's understanding like the meta discussion around all of these tools, which is a kind of conversational thing. And so it's following the conversation along. That was surprising. I think a lot of the more challenging feedback is someone will subscribe and read it and say, “Hey, you're not really defining these terms, this is a little too technical. Like can you go a little bit deeper into this? What does it mean when you're talking about this score?” I look at that and I sort of think, “Okay, where's that balance to strike where it's great that someone who's not necessarily as technical and doesn't fully understand these concepts is reading it. And they sort of thought it would be valuable, but it went over their heads.”
Matthew Lynley: (11:00) I hear that a lot. It's like, “Oh yeah, a lot of it goes over my head, but it's really interesting.” And it's like, “Well, it's difficult because I really love that you're reading it and I like that you thought it was interesting, but you're not necessarily my target reader when I think about where am I truly trying to add dollar value.” It's hard to strike that balance, right? And it's a little devastating when you get a message like that and you're like, I can't do that. I can't serve this customer, I can't serve this reader.
Satyen Sangani: (11:24) You can't do 101 on LLMs every single time you write an article.
Matthew Lynley: (11:28) Right.
Satyen Sangani: (11:29) Otherwise you'd be spending 50% of the air space doing that.
Matthew Lynley: (11:32) Oh, probably closer to 75.
Satyen Sangani: (11:34) Yeah. [laughter] Yeah, for sure. But this moment has been a particularly interesting and transformational moment in terms of just the amount of attention, the amount of venture investment in the ecosystem, the number of companies that have exploded, the amount of imagination that has been kindled, given all of this innovation. How do you explain it to people? I mean, that problem still remains, which is that there are people who are sort of living in the world of, “Oh, I have an LLM that I'm running on my laptop” or “I deeply understand the difference between different language models.” But then there's this entire world of people who are just like, “What? What's a language model? What is GenAI?” Maybe we can use a little bit of the time here, because I think there are probably more listeners of this podcast in that latter category than in the former. If I'm just coming into this space, what do I need to know? And what do I need to care about? If I'm not a practitioner but somebody who just cares about how AI's gonna affect the world, how do I think about this space?
Matthew Lynley: (12:27) I think part of why this caught the level of fire it has is that anyone can use it. You can fire up ChatGPT, you don't even have to explain what an LLM is to someone. It's like, oh, just go to chat.openai.com. It works right away and it's cool as hell. It's the first time you use it, you're like, “Oh my God, I'm literally talking to something.” And so that kind of like aha customer moment that you think about in the enterprise sales cycle, which usually comes at a webinar or a sales call or a demo or something like that, where the light bulb goes off, it's immediate. They use it for the first time and they're like, “Holy crap.” And so this idea of, how does an LLM work?
Matthew Lynley: (13:04) I think the second you touch one for the first time, you get it right away. Now there's an enormous level of intricacy and complication once you go a single step deeper, which is the differences between the LLMs, how do you think about crafting the right prompt and knowing that they can go off the rails really fast if you're not careful? And the whole network of tools that are associated on top of it. But when you think about it from an education perspective, the education really only starts when you are talking to people that are like, “Okay, this is really cool. I've tried it, it's awesome, it's cool as hell, but how can I use it to improve my business?” Because I can improve my daily life. I can have it write my emails for me. It's fine. It does some things really well. I'm sure like there's tons of unsanctioned ChatGPT usage inside companies everywhere.
Matthew Lynley: (13:46) But it's like, okay, how do I build a business on top of this? And then it starts to get complicated. Then you have to start understanding how expensive OpenAI is, how you integrate it, and whether you go closed source or open source. And so the learning curve starts off very, very easy because you get it right away, and then it quickly becomes one of the hardest possible products to understand once you start trying to dig into it.
Satyen Sangani: (14:10) Even that first question: What is an LLM? There are different types of LLMs. Give us a sense for: What are the different types of LLMs that might exist and how much differentiation is there out there? Are there hundreds? Are there thousands?
Matthew Lynley: (14:26) There's 50 ways you can answer that, obviously. The big delineator is that there's closed source and open source. The closed source ones are the ones that you find from OpenAI, Anthropic, Cohere, some of these others, where they handle the development, build the models, and manage the data around them, all of that. They provide either a front end for an average user or an API that's pretty alarmingly easy to implement. You can put together a chatbot in two seconds if you wanted to with Streamlit and the OpenAI API. And then when you go over to the open source side, it's hard to even count or keep up at this point. Hugging Face, speaking of super valuable companies, has a leaderboard that keeps track of the most “powerful” models. There are hundreds of models, and there are new ones every single day.
Matthew Lynley: (15:12) But part of the promise of those open source models is a process called fine tuning, which is effectively a way of saying, “Hey, I want to teach this a little bit more about how my company operates.” Whether that's using my emails or my Slack messages, I wanna teach it how to tell me about our Slack etiquette internally: if I'm filing a bug report, what is the format for that bug when I'm posting it in an incident channel? Or something along those lines. That process starts to get a little hairy pretty quickly. But what you get from that is this whole universe of hyper-specialized LLMs. You can have ones that are really good for coding. You can have ones that are really good for evaluating legal contracts; they're trained to focus specifically on legal language and things like that.
Matthew Lynley: (15:55) You can get ones that can be an onboarding buddy for me, because I have my own personal data that's in Confluence and Jira and Slack and God knows how many other places internally. But I can sort of train it to basically do these specific things that I want it to do. The rabbit hole gets deep very, very, very quickly, as you can see.
Satyen Sangani: (16:07) These LLMs, these models, what are the priors or the parameters for them? You've basically got GPT-3 or -4, right? You've got the dataset on which the models are being trained. I think you mentioned this thing called prompts, like prompt engineering, which is the other element that is an input to most of these models. I would imagine, if we take your contracts use case, that for the model we take a set of algorithms and then train those algorithms on this set of contracts, of which you might have hundreds, thousands, millions, whatever it might be. What is this prompt thing? How does that work, and how do people actually interact with the models to train them or tune them?
Matthew Lynley: (16:45) Prompting is much more complicated than it seems. You can ask GPT-4 a question and you'll get an answer, and your mileage may vary on its accuracy, obviously. But there's a whole set of techniques that go into crafting a prompt when you're literally just asking a question. Specifically, say I'm asking a question where I have 50,000 contracts that I've put together from my sales team for advertising or whatever. I don't wanna just say, “Hey, help me craft a new contract for this type of customer.” It might not do a good job with that. But what you could do is give it a couple of examples. So it's like, “We did this contract for this customer and it was this way.”
Matthew Lynley: (17:24) “We did this contract for this customer and it was this way, and we did this contract for this customer and it was this way. Following this pattern, can you please write a contract for a customer that's like that?” It's called few-shot learning. You're giving it a few examples, and you can give it instructions on how to do it, because these models are actually surprisingly flexible and somewhat reasonable to work with. The challenge there is they're very sensitive, right? You can put a space at the end of a prompt and it'll completely change the outcome if you're not careful. A lot of it is coaxing out the best possible response. You look at companies that are built on top of products like OpenAI's, like GPT-3.5 Turbo or whatnot. What they've done is they've spent their time crafting the exact right prompts for the end user.
Matthew Lynley: (18:05) It's sort of abstracted out and they don't have to see it. Whereas, if I'm working with an LLM internally, I may have to put together some examples and things like that myself. There are whole strategies around how it works, which makes it almost like a kind of new “coding paradigm,” you could call it. It's not a perfect analogy, but it's: how can I create a logical structure for coaxing this thing to tell me what I need to know, without it lying to me like some of these models accidentally do?
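Here is a minimal sketch of the few-shot pattern described above: show the model a couple of worked examples, then ask for one more in the same shape. The contract snippets, customer names, and the ask_llm() call are hypothetical placeholders, not anything from the episode.

```python
# Assembling a few-shot prompt: worked examples first, then the new request.
examples = [
    ("Acme Co, 12-month advertising contract",
     "CONTRACT: Acme Co agrees to a 12-month advertising term, net-30 payment..."),
    ("Globex, 6-month advertising contract",
     "CONTRACT: Globex agrees to a 6-month advertising term, net-45 payment..."),
]

parts = ["Draft advertising contracts in the same style as these examples.\n"]
for request, contract in examples:
    parts.append(f"Request: {request}\nContract: {contract}\n")
parts.append("Request: Initech, 3-month advertising contract\nContract:")
prompt = "\n".join(parts)

# response = ask_llm(prompt)  # hypothetical call to whatever model you're using
print(prompt)
```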
Satyen Sangani: (18:30) Yeah. You need to know, generally speaking, English or some language, you need to have some intuition on how the model works, and you need to know what you're trying to get out of that domain. It actually requires some interdisciplinary expertise to be a good prompt engineer in a given domain.
Matthew Lynley: (18:41) Yeah.
Satyen Sangani: (18:42) I wouldn't be able to walk into a model for medicine and treat it appropriately. Because I know...
Matthew Lynley: (18:46) Nothing about...
Satyen Sangani: (18:48) Practicing medicine. [laughter] So that, which I think is super, super interesting. You've got this kind of space of LLMs, and there's lots of people developing lots of different types of LLMs, some closed source, some open source. What are the other vectors of innovation where battles are being fought in this war for dominance in the GenAI world?
Matthew Lynley: (19:06) Everything comes back to data. Shocker. We literally both work in the data industry, right? Everything comes back to the data. Probably one of the most closely tied emerging fields is vector databases. The idea is: I have this huge corpus of data, like contracts (we'll go back to the contract example), and I want to be able to access it intelligently using an LLM. What I can do is embed this in a vector format, so it basically creates a long vector with varying values which correspond to some varying concept, and you can make it as granular as you want. You could go all the way up to a page if you really wanted to, which is not particularly well advised, or all the way down to sentences or even parts of sentences. Part of the reason that's really interesting is because there's a lot of work being done in a field called retrieval, which is essentially: I have a prompt, and I'm trying to fish this information out of this LLM.
Matthew Lynley: (20:00) It could just be GPT-3 or -3.5, or it could be some custom model that I've built internally, and there's actually a surprising number of those, not necessarily open source but more customized versions. And I want to put together these examples that we talked about earlier to get a better response out of it. What I can do is, for this question I'm about to ask, why don't I fetch some similar examples from my database? It turns out that doing a vector search is actually probably the fastest and most efficient way of doing that. You're sort of yanking those examples out to stick in the prompt. It's called RAG, retrieval augmented generation, which is kind of an old concept, but it's picking up a lot of traction now because it's a way to improve the accuracy of those prompts. That's one area of vector databases. I mean, like every part of the AI stack, it's very divisive. Everyone's got opinions on: is it a product, is it a feature, is it real, is it not? That's one of the ones that seems to be coalescing.
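A minimal sketch of the retrieval step behind RAG as described here: embed the documents once, embed the question, score similarity, and keep the closest chunks to stuff into the prompt. The embed() function below is a fake placeholder so the sketch is self-contained; a real system would call an embedding model and store the vectors in a vector database rather than brute-force numpy search.

```python
# Brute-force vector retrieval: the core operation a vector database optimizes.
import numpy as np

def embed(texts):
    # Placeholder embeddings (random per text) so the example runs on its own.
    vecs = np.array([np.random.default_rng(abs(hash(t)) % (2**32)).normal(size=8) for t in texts])
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

docs = [
    "Contract with Acme Co, 12-month advertising term, net-30",
    "Contract with Globex, 6-month advertising term, net-45",
    "Office snack policy and kitchen etiquette",
]
doc_vecs = embed(docs)

query_vec = embed(["draft a net-30 style advertising contract"])[0]
scores = doc_vecs @ query_vec              # dot product = cosine similarity on unit vectors
top = np.argsort(scores)[::-1][:2]         # the two most similar chunks to put in the prompt
print([docs[i] for i in top])
```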
Satyen Sangani: (20:55) Do you think it's a product or do you think it's a feature?
Matthew Lynley: (20:56) I wish I had an answer for that. [laughter] But they said the same thing about graph databases, right? It's like, "Okay. Is it a product or a feature?" It turns out that high-performance graph databases are really useful. Maybe everything needs a killer use case. It may be that RAG is actually just the killer use case for vector databases when you're looking for what do I really need this high performance system for? I want to have really good prompt responses, RAG is really helpful. It means I don't have to go through the rigamarole of customizing this model and teaching it new things, which would necessitate the vector database to go with it. Like I said, they said the same thing about graph and graph is still here. So everything is still tossed up in the air at this point.
Satyen Sangani: (21:35) Yeah. The thing about a database is that ultimately it is optimized for a particular type of data. So, graph databases for graphs, NoSQL for semi-structured data, and obviously vector databases for vectors, or data that's formatted as a vector. It's funny: way back when I was at Oracle, Oracle would just build these specialized algorithms into the underlying relational database, and they would just constantly pull in new types of data. At the time, XML was the most interesting one. One could see that also happening here, but at the same time, there's also so much innovation based on the cloud in how you process and retrieve data just in a standard format. It'll be interesting to see how all of that transpires. Do you have a view? Do you hear Snowflake folks and Databricks folks saying, "Ah. Time for us to go build a vector database?" Is that...
Matthew Lynley: (22:27) Databricks has a sort of vector search product with its Unity Catalog for information retrieval. You look at something like Snowflake, and they've sort of handed off that responsibility to partners, like they've done with a lot of things. Pinecone, I believe, is their main partner on that one. But you look at something like MongoDB, where everything is stored in MongoDB, and it turns out MongoDB's products are somewhat well suited for vector search and vector similarity. In the most basic of base cases, you could say, "Oh, if you can do a dot product in a SQL query, technically you can do vector search," which is kind of accurate, I guess, but you're losing a little bit of the nuance there.
Matthew Lynley: (23:07) It's a tossup for who's gonna dive into what. Postgres has a vector search format. MongoDB has a vector search format. Databricks has a vector search format for Unity Catalog. Snowflake has decided, "We're gonna go let someone else do it." There's a lot of different interpretations in terms of how you want to tackle that problem. What is increasingly clear is that if you want to sidestep a lot of the complexities of using these models, you really need to go through a vector database. And that's a recent-ish development. From the developers and customers and other people I talk to, it's probably the last two or three months or so that the interest has turned around on RAG, which again piggybacks on the growth of all of these vector databases, Pinecone and Weaviate and Chroma and Lance, and whatever Snowflake decides to do or otherwise, or it sticks with Pinecone or...
Satyen Sangani: (23:57) And how quickly are these companies growing?
Matthew Lynley: (24:00) Fast. [laughter] When the Pinecone round happened, and I reported this in an issue of the newsletter, they had something around $2-$3 million ARR, and it's well past that at this point. I think that the pickup for these has been a little surprising for a lot of people, and you haven't really seen a mega round like the Pinecone one recently, but obviously we're in the process of reaching that practicality moment of an industry shift, where an enterprise is like, "Oh, if I wanted to apply an LLM, how do I make this actually affordable?" It turns out RAG is the most affordable way of doing it. So we're getting to that point where, is it an inflection point? You could argue, right? But if the trend sticks in that direction and everyone kind of moves towards RAG, you're gonna see a lot of interest growing in vector DBs.
Satyen Sangani: (24:50) Yeah. I would imagine you would. There is the vector database space, and there's obviously the LLM space. Are there any other spaces that you're actively watching that you think are at an inflection point, or similarly going to experience one?
Matthew Lynley: (25:00) Yeah. If you remember AutoGPT at the beginning of the year, it was like, "Oh, you can use this toy on your laptop to build a $50,000-a-year T-shirt business by doing these five steps." It turns out, okay, you can't really do that with a general-purpose agent. But if you restrict the scope of what those agents are trying to do, like, say, "Create me a SQL query for this," it turns out to be pretty effective, because you're just creating this sort of logical architecture. When you're writing a SQL query, what do you do? Well, you write the most primitive select star, limit 500, order by random, and build it out from there until you get a specific result from this table, and then you move it to a CTE and then you do the next thing.
Matthew Lynley: (25:45) You move that to a CTE and join those two CTEs. Then you do the next one, and you move that to a CTE and join those three CTEs. And then by the end of it, it's a sort of spaghetti code that you evaluate and say, "Okay, well maybe this can be a window function and maybe this can be a little faster," and so on and so forth. But it follows a kind of chain of logic. If you just say, "Hey, write me this SQL query," it jumps straight to something like, "Let's use window functions on this." But if you instead say, "Okay, go through the process of writing this query for me," it goes through that same sort of iterative step. It turns out that agents, which are essentially a way of chaining together prompts, are pretty well suited for that.
Matthew Lynley: (26:24) And that extends beyond SQL. SQL's a good example, but it extends beyond SQL. You can think about, okay, it could be JSON. That's another one, right? A data structure format. Or it could be some really repetitive workflow, KYC or fraud, right? I haven't seen anything like that use agents in that situation yet. But you see those repetitive workflows that are step A, step B, step C, step D. Part of the reason is the creation and growth of what you increasingly hear referred to as orchestration frameworks for large language models, LlamaIndex and LangChain. The creation and emergence of those has made the use of these types of agents very practical. Instead of having a one-size-fits-all agent that does everything for me, if I have a SQL problem, I can just sic my SQL agent on it. Or if I have a JSON problem, I can sic my JSON one on it. Or if I have a fraud problem, I can sic my fraud agent on it, and so on and so forth.
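Here is a hedged sketch of that restricted-scope, prompt-chaining idea: each step feeds the previous draft of the query back to the model along with one new instruction. The step wording, table names, and the ask_llm() stub are all hypothetical; a real version would call an actual chat-completion API and would likely use an orchestration framework like the ones mentioned above.

```python
# Chaining prompts so the model builds a SQL query iteratively rather than in one shot.
def ask_llm(prompt: str) -> str:
    # Placeholder: a real implementation would call a chat-completion API here.
    return "SELECT * FROM orders LIMIT 500  -- (model output would refine this draft)"

steps = [
    "Write the simplest possible SELECT over the orders table, limited to 500 rows.",
    "Wrap the previous query in a CTE and add revenue per customer.",
    "Join that CTE to the customers table and keep only at-risk customers.",
    "Review the full query and simplify it, e.g. replace repeated joins with a window function.",
]

draft = ""
for step in steps:
    prompt = f"Current SQL draft:\n{draft}\n\nInstruction: {step}\nReturn only SQL."
    draft = ask_llm(prompt)  # each step refines the previous draft
print(draft)
```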
Satyen Sangani: (27:21) So it's: how can we build agents to essentially approximate what otherwise would've been manual business processes? It would be a subspace of, or an extension of, many of those process modeling tools and process automation tools. That makes a ton of sense. And that would feel like an area where you could have sustainable differentiation. I just spoke, for example, with a company actually doing almost exactly that: taking English and turning it into SQL. And the CEO in this case, who happened to have a PhD from a very prestigious school, said, "The entire LLM space is kind of just a race to zero." Obviously, there's lots of funding going into this set of technologies to develop and build LLMs. How do you see it? Do you think there's sustainable differentiation and these companies are gonna be worth billions and trillions of dollars in the future, or do you see a world in which all of it goes to the open source world? Some of both? Any observations on the space?
Matthew Lynley: (28:11) I think what we saw with the kind of emergence of APIs originally is there are always gonna be individuals and companies that are willing to trade money and control for convenience. So that's always gonna be the case, right? I'll hand off having to build this entire thing to an API because I can get to my minimum viable product tomorrow.
Satyen Sangani: (28:32) And if I've got a business problem, why do I care about having to reinvent all of this stuff?
Matthew Lynley: (28:35) Right. What the closed source providers like OpenAI and Anthropic and all these other guys offer is an API that takes two seconds to put together a chatbot with. Literally, it's six lines of code. If you wanted to, you could literally tell GPT-4 to write the six lines of code to create a GPT-4 chatbot using the OpenAI chat completions API. And that trade-off is always gonna be there. What will be more challenging for the one-size-fits-all API model providers is: if everything is trained on the same data, what's the differentiation? OpenAI has this colossal corpus of data, and they have their own web crawler now to collect more data for their current or future models. But if it's all trained on the same Common Crawl and C4 and Wikipedia and arXiv and the rest of that stuff, where's the differentiation? Because a lot of the value of the model comes from having access to unique data.
Matthew Lynley: (29:26) That's the whole appeal of fine tuning in the first place. Now, you can build experiences around that, you can build really good experiences around that, and you can compete on experience if you really wanted to. There's a history of two or three developer companies pointed toward slightly different users. You look at the vector database space: it's still completely up in the air, and obviously there's cross-pollination everywhere, but Chroma's really geared toward data scientists, Weaviate is really geared toward developers, and Pinecone does a lot of things all at once. We've seen that kind of thing play out a lot of times, catering to specific audiences. They're all vector database products, but you can build different experiences around them. So if you're an API provider, it's like, "Maybe I can build a unique fine-tuning experience that differentiates me from X." That's a product problem, right?
Matthew Lynley: (30:14) That's not an "Is my model better than yours?" problem. When you get to open source, it gets a little bit different, because anyone can fine tune Llama, since it's "open source." But the thing is, you can literally do anything with these, right? If I'm remembering correctly, Llama 2 came out and by the end of the day someone had already ripped the guardrails off of it and put it on Hugging Face. It happens in hours, right? These changes happen in hours. That's where you start to see some interesting stuff, because they're so freely available. Maybe I have something unique that I've collected: "Oh, I'm a kind of prolific writer, but I never publish my work, so I'm gonna train it on my writing." And so then you start to see proprietary data get a lot more interesting. And so you could say, "Oh, Facebook..."
Matthew Lynley: (31:00) I believe they say they don't use user data in it, but maybe they've collected some really interesting data from PyTorch usage or something along those lines. I'm making this up, obviously. They're uniquely suited to put out a custom model that no one else can put out because they have their own data. And you could see a lot of companies potentially doing something like that. You could see a lot of companies saying, "I have this interesting proprietary data, I'm gonna make a model." Snowflake has an LLM, I believe, because they probably have more SQL queries than anyone on the planet. So it's like, "If I have visibility into everyone's SQL and how it's optimized, I can make the single best SQL generator that no one else can make, because no one posts their SQL on Stack Overflow," for obvious reasons, right? No one wants to expose their tables or their schemas. If you look at raw performance, this score versus this score, it will be a race to the bottom. But I think a lot of the interesting stuff is gonna be crafting the experiences around it and what the productized LLM looks like.
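Earlier in this answer Lynley mentions the hosted-API route being "six lines of code." Here is a hedged sketch of roughly that, assuming the OpenAI Python SDK and its chat-completions interface; the exact client interface has changed across SDK versions, so treat this as illustrative rather than a drop-in snippet.

```python
# A minimal terminal chatbot on top of a hosted chat-completions API.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
history = [{"role": "system", "content": "You are a helpful assistant."}]
while True:
    history.append({"role": "user", "content": input("you: ")})
    reply = client.chat.completions.create(model="gpt-4", messages=history)
    message = reply.choices[0].message.content
    history.append({"role": "assistant", "content": message})
    print("bot:", message)
```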
Satyen Sangani: (31:55) How quickly are the best open source LLMs closing the gap with the dominant proprietary models?
Matthew Lynley: (32:05) If you subscribe to the idea that, one, OpenAI was honest about the scores that it released (we kind of have to take their word for it), and two, that those scores are the correct way of evaluating those models, which again, you have to buy into "this is the correct evaluation framework," which again gets to a very divisive subject, the whole leaderboard thing. You have three people in a room and there's 15 opinions on it, I think. Then, yes, there are models that technically, from a score perspective, outperform GPT-3.5. It's a Platypus-tuned model or something along those lines at the top of the Hugging Face leaderboard. That being said, the evaluation space changes just as fast as the LLM space. And so, the way we think about “Does model X outperform GPT-3.5, broadly speaking?”: we don't have a lot of visibility into the performance of GPT-3.5 because we don't work at OpenAI. We can throw prompts at it, we can throw a bunch of questions at it and test them, make them fight each other to see which one is better at really specific things. But can you say this model is better than that model broadly?
Satyen Sangani: (33:10) Yeah. You'd have to have a very strong and rigid set of criteria for understanding the performance of a model within a domain.
Matthew Lynley: (33:18) And everyone agree on that.
Satyen Sangani: (33:18) And everybody agree that is in fact the right thing. Yeah. Which is really tough to do. But it is, I think, heartening and interesting that these open source models are so capable so quickly.
Matthew Lynley: (33:28) Yeah. You can literally download one on a MacBook Pro and it works relatively well, for really basic stuff, right?
Satyen Sangani: (33:37) Yeah. All of this model development will take its course. We're gonna watch that for the next year, two years, five years, seven years. What do you think things are gonna look like a year from now? Any crystal ball? Where does the space go?
Matthew Lynley: (33:52) I think one thing, and I talked about this in one of the issues I put out, is that people love to use the phrase “iPhone moment.” I'm guilty of this as well, right? When the first iPhone came out, it was really clear. It was like, "Holy crap! There's something here." The mobile web was crap at the time, but it technically has Safari. I can text message, YouTube is on the original one, Maps is on it, I can listen to music...
Satyen Sangani: (34:20) I can store my contacts and make phone calls.
Matthew Lynley: (34:21) Yeah, yeah, yeah. And it was one of those, "Oh my God, this is immediately useful to me. And it's really freaking cool." Which is essentially the same thing that ChatGPT is. It's really freaking cool and it's actually immediately useful, which is part of the reason why you can look at it and be like, "Oh my God. I think this is real." We needed the App Store, obviously, but it wasn't until we got to a point where developers were creating this ramshackle web of all these apps talking to each other and cross-posting to each other and pulling in data from each other through APIs. Specifically REST: the popularization of REST, of JSON-enabled REST, is probably one of the most underrated, influential moments in modern web development.
Satyen Sangani: (35:08) Because it provided a standard protocol.
Matthew Lynley: (35:11) Getting back to that: everyone agreed on it, which very rarely, very rarely happens, right? It took a couple of years and a couple of extra developments to come into focus before it was like, "This is actually life-changing technology. This has actually changed the way that I go about my daily life. I don't remember phone numbers anymore." I post photos of my mille-feuille so some of my friends can see it, and all that kind of stuff. I call a cab from my phone instead of trying to hail one. What ChatGPT did right out of the gate, aside from whatever unsanctioned usage people are doing, like writing performance reviews or emails or whatever, was solve this really cool use case, which is, "I'm bored and I wanna talk to something."
Matthew Lynley: (35:54) And it's a really amazing experience around that. And there's Character AI and some other offshoots that kind of double-down and triple-down on that, "I'm bored and I want to talk to someone or talk to something, and have an interesting discussion." But we are still missing a couple pieces, I think, to start to get to whatever those killer use cases are. The onboarding buddy I think is probably one of the cooler ones that I've heard in a while.
Satyen Sangani: (36:22) What does it do?
Matthew Lynley: (36:23) You join a company and it's like, "Who do I message for X, and what's the etiquette for reporting a bug in the Slack channel? What's the actual table schema?" Because I've got 50 tables with the same name, and there's like an underscore one, underscore two, and 49 of them are deprecated, but it turns out number 39 is the one that's actually live, right? You start to see little glimpses of interesting things like that come up, and it's a byproduct of some of those one-level-down techniques, right? For something like that, you need to fine tune it. You need to introduce your own company data into it in order to make it operational. It needs to know what the Slack etiquette is; you need my Slack channels, you need my Slack information. Which gets to the whole question of what ETL for LLMs looks like. That's a minefield that we're not even gonna walk into right now. But I think there are still a couple of missing pieces that really do boil down to practicality, in the same way that APIs just made apps talking to each other really practical.
Matthew Lynley: (37:18) That's part of the reason why RAG is really interesting because in addition to making it cheaper and more efficient, it's just easier to do, conceptually speaking. I mean assuming you're able to spin up a vector database and you know how to store this stuff efficiently and chunk it efficiently. The idea of using RAG is literally just like, "Let me get better information and stick it in this prompt." Instead of having to train it and fine-tune it and have experts that know how to use these LLMs on staff and things like that. You have to start to see the difficulty of managing these things come down a bit more before we start to see those really interesting killer use cases.
Satyen Sangani: (37:52) Do you think the APIs and the underlying protocols are advancing quickly enough?
Matthew Lynley: (38:00) If you asked me three months ago, I probably would have said no. From everyone I've talked to about RAG, I would say a soft maybe now. I think we'll call it a soft yes because of the pace of innovation. When we say innovation, it's hacking around the limitations to make these things actually practical to use, which started almost right from the get-go, right? You can rattle off 50 terms, like low-rank adaptation or quantization or things like that, and all of these terms are basically saying, “Let's make this thing easier to train and run on less powerful hardware.” A lot of developers have been hacking around the limitations of these large language models for a couple of months now, and it happened really fast, and then it kind of slowed down a little bit, and then RAG came around, which again is old, but it's saying, okay, maybe I can hack around the limitations of LLMs here and get a lot of gains out of it.
Matthew Lynley: (38:49) The biggest issue that these LLMs have, obviously, is they don't have access to recent data. They hallucinate, which is probably the biggest issue. They can be really expensive; if you look at the cost of using a fine-tuned GPT-3.5 Turbo, it's very not-cheap. And there are regulatory requirements in some situations. If you can knock out all four of those, plus whatever kind of problems are in the periphery, then you can start to really rapidly experiment in the way that we saw in Web 2.0, which was basically the App Store, web APIs, and 3G. Suddenly, those API calls were actually really snappy and fast because we had 3G on phones, right?
Satyen Sangani: (39:31) The one that seems to me to be the most difficult is the hallucination effect of these models. There's obvious hallucination, like a lot of what happened on Microsoft...
Matthew Lynley: (39:44) Bing Chat.
Satyen Sangani: (39:45) Bing Chat. You don't have a girlfriend, I'm your girlfriend, I love you, whatever it was. But then there are just slight errors, which are much more insidious, much harder to detect.
Matthew Lynley: (39:55) Like missing a decimal point on an ARR number.
Satyen Sangani: (40:00) Oh, for sure.
Matthew Lynley: (40:00) No one wants to be the first person to put a wrong number in a board meeting.
Satyen Sangani: (40:03) For sure, for sure. And/or making an incorrect recommendation on either important things or even unimportant things, like pricing of airline fares. Not catastrophic, but it's also just wrong, right? That one seems to me to be one of the toughest. I think in some ways it's almost like self-driving cars: the thing about driving a car is that it's life and death, and people don't want to resign themselves to death by automated or autonomous vehicle. But I think it's got a lot of the same characteristics. It is a really interesting moment, but I'm just curious to see how long it takes for people to both lose trust and develop trust over time.
Matthew Lynley: (40:36) I think with a lot of business processes, there's no “yes or no”; it's stochastic in a lot of these places. What an LLM provides you is a stochastic answer, which may or may not be accurate. It probably is more likely to be accurate than not. I think that, never say never, obviously, but in the current state of development, you can't ever say that there will never be a hallucination. But you can do a lot of work to limit those and get that as close to zero as physically possible. If you can get it within the bounds of human error, then it starts to get really interesting. I personally would not do that for AVs. But when you talk about maybe I'm in a quarterly business review or something along those lines, and I realize that using an LLM I can get this really advanced metric that's actually really interesting, and it's a much better directional indicator for our success than this other metric that we've been using for a long time. Is it off by a point? Maybe, but is it the end of the world if it's off by a point, if that metric is actually way more accurate than the one that we've been using prior?
Satyen Sangani: (41:35) Oh, super interesting. And then, you know, I mean, we have friends who are developing companies or building companies that are trying to develop LLMs for health advice in non-critical care settings. Super, super interesting. Because there's a lot of work that primary care doctors do, some of which can be rote and mechanical. Let's switch gears a bit and talk about Databricks and Snowflake. You've done a ton of reporting on them, obviously two of the biggest companies in data. How much overlap is there, in your experience, between what the companies are up to?
Matthew Lynley: (42:04) I'm sure if you talk to your customers, you ask them, like, which one are you using, they're probably gonna say both more likely than not. And part of the reason is because Databricks has always been a very data science-native product, and Snowflake has always been a very analytics-native product and Snowflake, as a data warehouse, is an amazing data warehouse. It was an industry-altering product in the same way that Databricks is an amazing data management product, right? And it's amazing for doing a lot of very complicated machine learning problems. Now, granted, both companies are data companies, and the way you win as a data company is I have all of your data, and you're doing that stuff on me, right? As long as you're firing addition operations on my servers, I don't care, right? You look at the valuation of the company, like, Databricks was $38 billion in its last round or something along those lines, and Snowflake, I think, today I checked, it was a $55 billion company or something along those lines.
Matthew Lynley: (43:00) So clearly, they're both enormous markets, and it would make sense for them to go into each other's markets, right? Snowflake, they play around and toy with machine learning and try and invest in it. They were dragging their feet on Python for the longest time, right? And then they finally put out Snowpark for Python in 2022 to finally support Python, which is a requirement in machine learning, flat out. And then they bought Streamlit for $800 million, which is an incredibly easy-to-use product. And obviously, Databricks has its own warehouse product, and they look at the Lakehouse paradigm as a way to unify that. They're two very different data types, which makes things a little bit weird, right? Tabular data and unstructured data lend themselves to very different use cases. There's some overlap, obviously.
Matthew Lynley: (43:39) But I just want all of your stuff on my servers. And if I have to go into warehouses to get your tabular data onto my servers, then I'll go into warehouses, right? Or if I have to go into ML to get your stuff on Iceberg in my servers, fine, I'll go into ML to a certain extent. When you think about the rivalry, it's not Celtics–Lakers, because again, you talk to a lot of people and they're like, oh, I use Databricks for X and Snowflake for Y, right? But at the same time, if your goal is, you just signed a contract, I want that contract on my servers, stick it in this data lake here. Both of them have an incentive to get that, because you get compute there, and maybe you have Fivetran and a dbt cycle, and maybe there's a Hightouch cycle on the other end, and all this kind of stuff. And so suddenly you have 55 SQL queries going on in there for one event. There's obviously an incentive to get that there. When you think about it, if you take a step back and say, okay, they're both data layer companies, then they're obviously going after the same customer. They've crafted very different experiences around it, and they're good at different things, but it obviously makes sense: even if they can snap up 10% or 15% of each other's business, that's actually a very big market.
Satyen Sangani: (44:48) Yeah, it's a very material amount of spend. And do you talk to CIOs or customers who are buying and trying to consolidate these workloads, or is this a fight that's really happening at the user level more than at the budget level?
Matthew Lynley: (44:56) If there are concerns around budget, it's probably because the data stack has sprawled out a little too aggressively, where you have Fivetran and dbt and all these interstitials here, and then you have an orchestration layer on top of it, and then you're getting to your output, like where you're putting your output, does it go into a Looker, and all those things, right? When you look at the two companies, Snowflake has a very partner-centric approach, where they hand off a lot of the complexities to companies that they connect with and work with really closely. Pinecone's a great example, right? Pinecone's vector DB is one of their partners. Whereas Databricks has crafted, or is in the process of crafting, this holistic, horizontal data management platform for ML, and then, if the Lakehouse paradigm turns out as expected, structured data will also be a part of that. Databricks also partners with a lot of companies, but they have their own "competing products" for some of their partners, right? I think Alation maybe is a good example, right? Where technically Databricks offers a catalog product, but they also work really closely with Alation, because a lot of people prefer Alation, right? If you were to put the two strategies side by side, they represent different philosophies, I think, for...
Satyen Sangani: (46:17) How to build a company.
Matthew Lynley: (46:18) How to build a company...
Satyen Sangani: (46:19) That's right. One's trying to vertically integrate and the other one is trying to be very specialized.
Matthew Lynley: (46:24) And the jury's still out on that, right? Realistically, they're probably going to coexist and they're probably going to continue snapping away at little bits and pieces of each other's market. There's not going to be a single winner and people are still going to use both of them in some way or another. But again, it gets back to the whole, most problems that companies need to solve, they're not actually that complicated. It's like, what is the lifetime value of this customer or this set of customers? That's not an incredibly complicated mathematical problem to solve. As long as you satisfy that business use case, as long as you're awesome at making it easy to address that specific question or that specific problem, you don't have to go all the way down into LLMs if you really didn't want to.
Satyen Sangani: (47:05) But I think the reality of this case is that what you're describing, and I think what people — when they get caught up in the competitive dynamic or the competitive point of view — see is that there's sort of winner-take-all markets or that these markets are sort of zero-sum, but actually these markets are quite distinct, even though they're positioned as rivals, they're selling very different things and often to very different people. That's also what makes these markets so interesting and dynamic and fun because there's so much to go do. Matt, it's been fun to speak with you, as always. Thank you for taking the time and look forward to having you back on again.
Matthew Lynley: (47:40) Thanks.
Satyen Sangani: (47:47) In today's chat, Matt delivered a crash course on LLMs. He started from the fundamental concept of what an LLM is, delved into vector databases, and discussed the trends and companies that are driving the market forward. The rise of GenAI shows that LLMs are accessible to everyone. However, not all models are created equal. Crafting effective prompts is crucial for keeping the model focused, and the context we initially feed the model drives how useful it is in solving different problems. Matt also helped us understand more about the Snowflake–Databricks rivalry. In his view, one is taking a more vertically integrated approach while the other is leaning more on a broad ecosystem. With its Unity Catalog, Databricks offers its own vector search product, while Snowflake relies on partners like Pinecone to provide vector databases. Thank you for listening to this episode, and thank you, Matt, for joining us and helping us understand words and trends that we hear a ton about but don't always fully understand. I'm your host, Satyen Sangani, CEO of Alation. And Data Radicals, stay the course, keep learning and sharing. Until next time.
Producer: (48:50) Data-mature organizations can effectively find and leverage high volumes of data. They're also more likely to acquire and retain customers while outperforming peers. For a framework for benchmarking and advancing your data management capabilities, download the white paper The Path to Data Excellence: The Alation Data Maturity Model at alation.com/dmm.