Data Quality Is a Risky Business

With Kyle Kirwan, Co-Founder and CEO of Bigeye

Kyle Kirwan
Co-Founder and CEO of Bigeye

Kyle Kirwan wants to help the world make magic with data. He is the co-founder and CEO of Bigeye, a data reliability engineering platform that helps data teams build trust in the data their organizations depend on. As one of the first data scientists, data analysts, and product managers at Uber, he helped launch teams like Experimentation Platform, and products like Databook.

Satyen Sangani
Co-founder & CEO of Alation

As the Co-founder and CEO of Alation, Satyen lives his passion of empowering a curious and rational world by fundamentally improving the way data consumers, creators, and stewards find, understand, and trust data. Industry insiders call him a visionary entrepreneur. Those who meet him call him warm and down-to-earth. His kids call him “Dad.”


Satyen Sangani: (00:03) Think of the different games we play. In Monopoly, you win by accumulating the most properties. It’s a game with increasing returns to scale. The more you buy, the more you win. The more you win, the more you can buy.

Chess is another game. The rules here are a little bit different. Scale doesn’t really matter. You can lose every other piece, but your king, and still win.

So now think of data intelligence as a game. If you’re trying to comply with laws and regulations, the goal is to prevent any non-compliant activities. In this game, violations should be black and white. You always know where you stand, and losing means getting caught in non-compliance.

If you’re trying to create business opportunities, then you might be playing a different game. (00:47) In this world, you spend less time caring about a singular downside event, and instead, you plan for multiple upside opportunities.

What’s weird about data intelligence is that a lot of people and companies bring the strategy of compliance to the goals of creating business opportunity. And most of the time it doesn’t work. You never sacrifice all your properties in Monopoly just so you can buy Boardwalk.

The reality of data management is that you’re playing multiple games. You’re trying to comply, you’re trying to drive operational efficiency, and you’re trying to optimize for business opportunity. And often, the software you use reflects the game you and your organization think you’re trying to play.

Satyen Sangani: (01:27) So today on Data Radicals, I’m excited to welcome Kyle Kirwan. Kyle has excelled at the game of data management, and is now creating tools to help others play the game better. Today, he is the CEO and co-founder of Bigeye, a data quality engineering platform. And before launching Bigeye, Kyle worked at Uber as one of the company’s first data scientists and data analysts.

As both a founder and a former employee of one of the world’s biggest data aggregators, Kyle has a unique perspective on the issues facing data radicals today and how you can excel at the game of data, too.

Producer Read: (02:09) Welcome to Data Radicals, a show about the people who use data to see things that nobody else can. This episode features an interview with Kyle Kirwan, co-founder and CEO of Bigeye. In this episode, he and Satyen discuss Kyle’s experiences working at Uber, creating the data culture at Bigeye, and the modern data stack. This podcast is brought to you by Alation. What if we told you that data governance can drive real business results? This white paper from Gartner shows you how. Go to alation.com/dag to get your free copy of Gartner’s guide. It’s called Adaptive Data and Analytics Governance to Achieve Digital Business Success. And it’s yours for the downloading. Check it out today.


Driving Uber with Data

Satyen Sangani: (02:54) We began our conversation by discussing Kyle’s time at Uber. As you can imagine, investing in data was a major priority.

Kyle Kirwan: (03:00) Data was always, I think, a pretty big priority. So thinking back to even when I joined, I don’t ever recall running into issues, for example, like I wanted to go get a Tableau Desktop license — I just went out and got it. There wasn’t really a roadblock there. And then same thing when it came to the data warehouse, and when it came to going and hiring folks to work on the data tools as well.

So I want to say when I left in 2018, I think there were between 300 and 500 people working on the data platform organizations. So that included managing Kafka, Hadoop, building our internal tools. So the data products that I managed on my team, our internal query tool, dashboarding tool, machine learning platform. So all that together was, I think, 300 to 500 people, something like that, which is pretty sizable, right? That’s a big organization just to work on an internal data platform.

Kyle Kirwan: (03:53) I think that data was always a priority for the business, or at least, ever since I joined. And so there was lots of investment and appetite for pushing forward on what technology we were using, having enough personnel to do the work. And I think that came down really from leadership, as there was a big appetite to use data as a competitive advantage. And that meant using it for operations, for marketing, obviously in the application itself, price setting and ETAs is a pretty big thing. And the marketplace team even had their own separate data stack separately just to take care of unique needs that were specific to marketplace. So, plenty of investment in data from the get go. And I think it probably is still that way, at least to a great extent today.

Satyen Sangani: (04:37) Yeah. And there’s an interesting phenomenon, both at Uber and at a lot of other tech companies where… Sounds like you bought Vertica, you bought Tableau, but in a lot of cases, and in your case, you ended up building your own tools. It was so strategic on some level that basically not only was the idea of data important, but also building these data platforms important, and so important that you felt like you couldn’t trust an external vendor to do that. What was the thinking there?

Kyle Kirwan: (05:08) The primary product that I was responsible for when I was a product manager on the data platform was Uber’s internal data catalog. So in many ways, serving a function there that Alation would serve for Alation’s customers, but not just the catalog though. Our lineage system, our ETL orchestrator, the query editor, the query pad, the dashboarding tools, the machine learning platform, these were all in-house products.

Maybe the one nitpick there is, the ETL system was a very, very early fork of Airflow. But at this point, I’m sure they’re totally separate. But yeah, most of the tools got their own in-house teams of, in my case up to 10 or 15 engineers, just to build a small cluster of tools around things like incident management or change management, tracking ownership, trying to identify GDPR-sensitive data.

Kyle Kirwan: (05:56) And I think that was probably a combination of the general culture at the company around building things in-house and that extended to operations tools, that extended to marketplace stuff. So there was just a very, very healthy appetite for building in-house. I think if you asked many folks who worked there, it was probably taken too far. I think there were a couple tools that I won’t name in particular, but I think they’ve since been decommissioned and they brought in vendors to serve those functions. And I think that in some cases that was viewed as a big win internally, because the internal effort just wasn’t keeping pace with what the vendors were able to provide.

Kyle Kirwan: (06:35) On data though, I think because it was viewed as such a competitive advantage, there was more of an appetite than usual to fund teams to build in-house products or to manage just very large pieces of infrastructure. I think Uber had 100,000 tables in the data lake if I remember correctly. Their Kafka installation was one of the largest ones in the world. So just the scale and complexity of it made it easier to justify. But I think it can’t be ignored that there was just a very strong cultural push to build in-house, and that came from the top.


The Proliferation of Tools & Rise of the Modern Data Stack

Satyen Sangani: (07:07) And loaded question because you run a data software company, and obviously so do I: Would you have done it again in that exact same way? And would you advocate for people to have gone down the same journey that you and Uber went down, buying or building a lot of tools internally? And, I guess, how should somebody going through that journey think about that decision?

Kyle Kirwan: (07:29) I think it’s very different today and maybe it’s gone to an extreme, but I feel like we’re kind of living in this Renaissance era when it comes to data tools. And I say, it’s gone to an extreme because I think if anybody’s been on LinkedIn or any major conference recently, it seems like there are maybe too many data tools flying around right now. And it’s a bit of a Cambrian explosion, but there’s a lot of opportunity there. In that sort of plethora of tools, you can find things that already fit your needs or you can find a startup that is willing to partner with you as a design partner and build toward use cases that really matter to you.

Kyle Kirwan: (08:06) [Bigeye co-founder] Egor [Gryaznov] and I frequently talk about how, today, if you wanted to stand up what I would call a vanilla or a pretty standard data stack, if you gave me a credit card, we could probably get it done in a week or less. A few days, if you really want to strip it down to the basics. I’m simplifying obviously, but I think that’s a wildly different situation than it was five years ago, certainly 10 years ago. And so I think that the quality and the availability of tools now, because it is so different, I think that changes the calculus.

Satyen Sangani: (08:39) I think that could have been said, even… This is ancient history: 10 years ago when I was founding Alation and I often get this question around, why are there so many data tools? And my response just tends to be, “Well, because people think in different ways and a lot of what analytics is all about is automating or accelerating the way in which humans think.” And that is super-context-sensitive and super-specific to that individual and their background and their training. And so in some ways it’s hard to see an end to this explosion. Do you think there’s some rationalization or reckoning to come where there’s fewer tools in the future or do you think it just continues? I’m just curious to get your perspective on that.

Kyle Kirwan: (09:22) I tend to look to the cycle that software engineering went through; in particular, the layer of the data stack, if you will, that we find ourselves in at Bigeye, I would say is pretty much in sort of this DevOps type of parallel within the data stack, right. So Bigeye and the folks that we’re working with are not, for example, doing analytics, they’re the ones that are operating the data platform. And so we frequently look to the history and the evolution of the DevOps space, site reliability engineering, et cetera, to see what happened, and how did that evolve over time? And I think that looking to that parallel, I don’t know that there will be a cooling in that there won’t always be new products coming out and new ways of thinking and new solutions.

Kyle Kirwan: (10:08) That’s obviously going to happen all the time, there’s new companies entering the DevOps space now that are doing innovative and interesting and creative things, and they’re building on top of the shoulders of what was done previously in the space. So I think that will always continue, I’d think it’d be silly to say otherwise. That said, I think the difference is that if we look to some other categories: I know Datadog gets brought up pretty frequently, at least with folks that I talk to, as a north star in the data ops realm. If you look at Datadog and if you wind the clock back seven years, eight years, something like that, I’m sure it was not very clear who were going to be the preeminent leaders in the various categories that they were in, versus these days: when we were starting Bigeye, we set up Datadog, we set up Sentry, we set up AWS, et cetera.

Kyle Kirwan: (11:00) There is some general pattern for how to build the foundational layers of our tech stack. And I think that in data, I would expect to see the same thing. And I think that we’re starting to see this with the modern data stack. And there’s still a lot of debate about what exactly does that mean, what does it include or not include, but I think the zeitgeist behind the idea of a modern data stack is, can we get to a sort of standard template? And then from there you may come into a new company as a data engineer or the head of data, and you might say, “Well, here’s the standard template. Here are the things that don’t work based on this particular business, the particular data culture that’s here, or that we want to promote here. And so I’m going to change out these specific parts.” But I think you could come in with that template at least as a reasonable starting point. And I think we’re just starting to see that template come into focus right now.

Data Reliability Engineering

Satyen Sangani: (11:53) You mentioned this notion that you’re really trying to bring on some level to data, what happened in the DevOps space, and you’ve coined this term “data reliability engineering.” Tell us a little bit about that term, what does it mean, why is it applicable, what historical precedents are you taking it from and how is it different perhaps from software development, if it is at all?

Kyle Kirwan: (12:14) The main idea behind it, or the biggest thing that might be different about approaches in the past is really starting to think from a systems level about what does reliability mean for data? What does quality mean? How do we define that? And what are the things that matter? How do we measure it and then how do we systematize it? And I think this is the biggest thing to be done, is to systematize. And so what that means is having structure, process, organization processes. So how do we measure data quality? Who depends on the data and what does fitness for use look like from their perspective? And then having processes and structures in place to support that. So for example, SLAs, this is a concept that’s been pretty battle tested, but an SLA in essence is not specific to software engineering.

Kyle Kirwan: (13:07) And in fact, SLAs predate software engineering. They were used in telecom even earlier than that. So I think the SLA, for example, is a concept that helps describe what reliability looks like, how that can be measured. And that has been battle tested heavily in software engineering, but it’s not unique to, or special to software engineering. It can be carried forward into another space. And I think that’s what we’re seeing with the adoption of data reliability engineering.

Incident management, the idea of risk being a variable; it’s on a spectrum. So not aiming for zero risk, but instead saying, “Okay, given where we are in our journey in data…” I think I remember talking to Colin Zima from Looker a while ago. And he talked about this shifting of the level of risk that a data organization is willing to take on over time and about how that starts out is, the balance is really on one end of the spectrum where you’re willing to take on a lot of risk.

Kyle Kirwan: (14:05) It’s okay if the data’s fluctuating and changing frequently, maybe you have breakage because of that frequently, but the evolution and the velocity of change in your data model is really the thing to prioritize. And then there’s sort of a cooling off that comes over time. And that tradeoff starts to move towards preferring reliability over velocity and saying, “Okay, we’re going to have some stewardship, we’re going to have some controls in place for the way our data evolves. It needs to be documented” and that type of thing.

I think data reliability engineering is trying to bring some structure and process to thinking about how we manage our data platform from a risk perspective. And so SLAs is one tool in that tool belt that we can import from site reliability engineering, other things like having very broad monitoring at all times, having automated alerting systems and things like that. These are all concepts that have been proven in a software engineering context, but they’re not unique to it. And so we can adopt those and with a little bit of massaging, they apply well to data platform teams.
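Editor's sketch: the SLA idea described above can be made concrete with a few lines of Python. This is only an illustration under stated assumptions, not Bigeye's implementation or API. The names `FreshnessSLA`, `check_passed`, `attainment`, and `sla_met` are hypothetical, and "freshness" (time since a table was last loaded) is just one possible measure of fitness for use.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class FreshnessSLA:
    """Hypothetical data SLA: a table must have been loaded recently enough
    in at least `target` of the periodic checks (0.99 means 99%)."""
    table: str
    max_staleness: timedelta
    target: float


def check_passed(sla: FreshnessSLA, last_loaded_at: datetime, checked_at: datetime) -> bool:
    # A single check passes if the data is no older than the SLA allows.
    return checked_at - last_loaded_at <= sla.max_staleness


def attainment(results: list[bool]) -> float:
    # Fraction of checks that passed; with no checks there is nothing to violate.
    return sum(results) / len(results) if results else 1.0


def sla_met(sla: FreshnessSLA, results: list[bool]) -> bool:
    # The SLA is met when measured attainment reaches the agreed target.
    return attainment(results) >= sla.target


# Example: four checks at noon against loads that landed at different times.
sla = FreshnessSLA(table="trips", max_staleness=timedelta(hours=1), target=0.75)
noon = datetime(2024, 1, 1, 12, 0)
loads = [
    datetime(2024, 1, 1, 11, 30),
    datetime(2024, 1, 1, 11, 10),
    datetime(2024, 1, 1, 10, 30),  # 90 minutes stale, so this check fails
    datetime(2024, 1, 1, 11, 59),
]
results = [check_passed(sla, loaded, noon) for loaded in loads]
```

Expressing the target as a fraction below 1.0 reflects the point about risk being a spectrum: the team agrees on how much unreliability is tolerable instead of aiming for zero.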

Three Axes of Data Operations

Satyen Sangani: (15:06) There’s two things that you said that I want to dig into there, one around this concept of risk, but then there’s another which is, why now? Quality controls have existed for a reasonably long period of time. What’s changed to cause this shift in the technology landscape and the need for a more advanced solution that seems to have a lot higher resolution?

Kyle Kirwan: (15:28) I think that these three trends here that I’m about to describe are not specific to data quality. I think they apply in general to anything that has to do with data operations or data management.

In the discovery and governance space as well, I think the appetite in the market for catalogs, for example… And you might remember I blogged about catalogs a few years ago before you and I had a chance to meet, even. And I was talking about the fact that I see catalogs as going from being a thing that a mature data organization adopts to being one of the first tools that a data team should have in place in the data stack. And I still think that shift is coming over time.

Kyle Kirwan: (16:09) And what is driving that pattern change, and also in my part of the market, the desire for better data reliability for better data monitoring, data observability, et cetera, is three things. I think we’re moving forward on all three of these axes at the same time as an industry.

So the first one is the level of business value that’s riding on data. And I think this is what we experienced in a very positive, but also sometimes stressful way at Uber, which is if you’re going to just say, “Okay, as an organization, we are going to put data directly into our business processes and we will figure out everything else afterward.” That’s what we did there. So we said, “We are going to set dynamic prices. We’re going to predict ETAs, we’re going to tune our ad spend on a continuous basis, based on the ROI we see from different channels.” That was automatic.

Kyle Kirwan: (17:03) And so what that meant is, if we had a data reliability issue, the denominator in some number would be too low or too high. And that would translate directly into an adjustment in ad spend on that channel. So what this means is that the company’s able to leverage data in a really strong, competitive way, but it also means that you take on a dependency on it as well. Because you have more business value riding directly on a data pipeline. And so that means that your data operations, your data management has to evolve as well. That’s the first axis, the increased surface area in the business where data’s being applied. And when people talk about being data driven, I think this is what they actually mean: how do we take the data we have and use it to create business value? In doing so you’re creating a dependency on the data at the same time. Two sides of the same coin.
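Editor's sketch: the "denominator too low or too high" failure described above is the kind of problem a simple guard can catch before a number feeds an automated decision. The following Python is a minimal illustration under stated assumptions, not a description of Uber's or Bigeye's systems. The function names, the three-sigma threshold, and the sample figures are all hypothetical.

```python
from statistics import mean, pstdev


def is_anomalous(history: list[float], value: float, max_sigma: float = 3.0) -> bool:
    """Flag a value that sits more than max_sigma standard deviations
    away from the mean of recent history."""
    if len(history) < 2:
        return False  # not enough history to judge
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > max_sigma


def safe_ratio(numerator: float, denominator: float, denominator_history: list[float]):
    """Return numerator / denominator, or None when the denominator looks
    unreliable, so a downstream job can hold its last known-good value."""
    if denominator <= 0 or is_anomalous(denominator_history, denominator):
        return None
    return numerator / denominator


# Hypothetical daily conversion counts for one ad channel.
recent_conversions = [1000, 1020, 980, 1010, 990]
normal_day = safe_ratio(5000, 1000, recent_conversions)
broken_day = safe_ratio(5000, 400, recent_conversions)  # a pipeline dropped rows
```

Returning `None` instead of a number forces the consumer to decide what to do when the input is suspect, which is safer than silently adjusting spend on a bad ratio.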

Satyen Sangani: (17:54) The second component? The needs of the people you’ll be working with.

Kyle Kirwan: (17:28) Second axis is the people involved. I think back to when I started, and a product manager would come to me and they would ask me questions and then I would go off and do some analysis; I’d find the data I needed. If I had Alation, for example, I’d go do a search for a data set and go find the right one and build my analysis on it.

Nowadays, a lot of organizations, the product managers are just going to go into a self-service tool and they’re just going to put the dashboard together themselves. And so you’ve got this wider range of personas around the business with different types of questions they’re asking, different tasks that they have, different levels of understanding of data, different technical skill levels, but they all need to use the data to get things done. So this means that in addition to having more business value riding on it, you’ve got more hands in the pie because there’s just a broader range of people involved in data as consumers or as producers as well.

Kyle Kirwan: (18:51) So somebody traditionally entering data into, like HubSpot or something like that, for example, they’re a data producer and in a very material way these days. So just the number of people involved at all stages of the pipeline has increased as well. So that’s the second axis that we’re moving on in my mind, is who is involved in data is expanding.

And then the third is the easy one. It’s just the complexity of the data. The scale of it just continues to increase. I think we’re over the fascination with big data, perhaps, and now we’ve moved into, “Okay, how do we actually get value from it? How many dollars of business value can we get per gig?” That would be one way to boil it down.

Kyle Kirwan: (19:33) But even in the face of that, the number of data sources that might be in use at a given company is continuing to grow, the velocity of it might be continuing to increase as the appetite for streaming data continues to pick up. So I think that’s the third axis, the complexity and the scale and the volume of the physical data itself is not really slowing down. At least as far as I can see.


The Rewards and Risks of a Wild West Data Landscape

Satyen Sangani: (19:55) Maybe switching gears a little bit then. You’ve lived in a culture that is super interesting and instructive in both obviously super positive ways and the change and the scale that you were driving at Uber, but also somewhat notorious in terms of how it was managed and how much risk they took and how many norms they flouted. And so how do you think about that experience informing data culture?

Kyle Kirwan: (20:20) It was a super unique culture. And like you said, first it had some real serious positives and I think everybody has seen a lot of the ugly side of it as well. But if I wanted to scope that down to data specifically, I think there was a very Wild West attitude, especially from the beginning and that has advantages and it has weaknesses. And so the advantage is speed and the advantage is innovation. It was not uncommon to see a query doing something super interesting, somebody would come up with a new way to get something done. Later on, when we had a modeling framework and the pipeline orchestrator was available to everybody, and basically anybody at the company who could write SQL could go and spin up a pipeline, you got this explosion of interesting things that were happening and being done with the data.

Kyle Kirwan: (21:08) And when you just drop all the data from all your service-oriented application databases into a giant data lake, the whole company can go and do whatever they want with it. So you get a lot of rapid innovation, a lot of quick evolution, and that’s awesome from the perspective of what we can do with our data when we give everybody access to it. That’s the good part.

Kyle Kirwan: (21:23) The drawback was a couple things. So the obvious one is just that today I would be a little bit uncomfortable with almost everybody in an organization having access to my data. There were not really access controls, at least not for a long time. There are now, from what I understand. When I was managing the data catalog, there were 2,000 or 3,000 weekly active users of the catalog, and there were probably some people internally who maybe were not using it, they were just going directly to the query editor.

Kyle Kirwan: (21:59) But you could take that as a lower limit out of an org of 15,000, 20,000 people. You had 2,000 to 3,000 people using the catalog each week. It’s a lot of people with access to the entire data warehouse. And that includes private information about individuals, where they’re taking trips to, sometimes that can be sensitive. That’s a scary thing. So that’s the tradeoff that we made at the time, for better or for worse, was speed, innovation, no one was blocked, you could get done whatever you needed to get done at all times.

And then the ugly side of that is if somebody really wanted to go and look up some sensitive information, there wasn’t really a whole lot there to stop them, except the threat that every single query was logged. And if somebody did see that you had gone and looked up private information about an individual, you would almost certainly be terminated. But that was really the main stick there. There was not a control in place. It was just, “If you do this, you will be fired.”

Satyen Sangani: (22:57) Do you think that Uber could have been as successful as it was without having that freedom of exploration in the positive sense, or maybe that cavalier treatment of data in the negative sense?

Kyle Kirwan: (23:11) I don’t think so. I think that you could make the argument that some controls, at least some basic controls, would have produced more net benefit than the cost to innovation. I think that would be a defensible argument to make. But if you had to choose between relatively fine-grained or strict controls, to be sure, at the beginning, no. I think that totally would’ve put a lid on the speed of innovation and the rate at which people could do things.

Satyen Sangani: (23:40) So you are now talking to a CDO and ostensibly you are, because those are the folks who listen to this podcast and they say, “Yeah, but Kyle, look, I’ve got this existing scaled, highly regulated Global 2000 that has lots of data. Is the only way for me to build a data culture to free the data and to essentially allow so many people to use it? What can I do? Help me with my conundrum.”

Kyle Kirwan: (24:09) At the end of the day, the easy cop out is, “It’s above my pay grade.” I’ve never been a CDO. That said, I think from my perspective, anyway, the easier the tools are to apply in the right way, the better a balance we can strike. So for example, if you have some basic control, if you know that the schema over here contains a bunch of tables that really just the finance team should have access to, that’s a fairly straightforward control to have in place, I would think.

Now, if it comes down to, “We need to have row-level access and we need some very, very complex hierarchy of controls, and we need to have a team of people who are going to review suspicious queries, and we’re getting into some pretty heavy-handed stuff at this point,” which may make sense in a highly regulated environment or a very risk-averse environment.

Kyle Kirwan: (25:02) And if you’re in that environment, then that seems like a perfectly defensible thing to do. But this goes back to what we were talking about earlier, a younger organization may value velocity over risk. And in that case, they are going to make trades in favor of innovation, and they may run the risk of some problems that come out of that. A more mature organization that has a CDO can afford a CDO, can afford people to think about these types of problems. Maybe they’re just at a different place on that risk-velocity spectrum.

Kyle Kirwan: (25:34) And so they’re going to make the optimal choices for where they are. But I think that the availability of good tools for doing this pushes the efficient frontier of that spectrum out a little bit further. And so we can make fewer tradeoffs to achieve the level of control we need when it comes to innovation. And I think the tools are really the best way to help drive that forward.
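Editor's sketch: the "basic control" mentioned in this answer (a schema that only the finance team can query), combined with the query logging described earlier, could look roughly like the Python below. It is a simplified illustration, not any particular warehouse's permission model. The class `AccessGuard`, the team names, and the schema names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class AccessGuard:
    """Hypothetical schema-level access control with an audit trail.

    schema_acl maps a restricted schema name to the set of teams allowed
    to query it. Schemas that are not listed are open to everyone.
    """
    schema_acl: dict[str, set[str]]
    audit_log: list[tuple[str, str, bool]] = field(default_factory=list)

    def can_query(self, team: str, schema: str) -> bool:
        allowed_teams = self.schema_acl.get(schema)
        permitted = allowed_teams is None or team in allowed_teams
        # Every attempt is recorded, whether or not it was permitted.
        self.audit_log.append((team, schema, permitted))
        return permitted


guard = AccessGuard(schema_acl={"finance": {"finance"}})
finance_ok = guard.can_query("finance", "finance")
marketing_blocked = guard.can_query("marketing", "finance")
marketing_open = guard.can_query("marketing", "trips")
```

Keeping unlisted schemas open by default sits toward the velocity end of the risk-velocity spectrum discussed here. A more risk-averse organization might invert the default and deny access unless a schema is explicitly opened.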

Satyen Sangani: (25:56) I appreciate that advice, because I think it’d be so easy to say, “Well, if you just buy all these tools and you just invest in data, then it’s all going to get better.” And I think the reality is that you are going to have to build a data culture. You’re going to have to take on some level of risk with trusting people with data and to do the right thing and to have judgment. And you can control for that risk, maybe by clustering off organizations, by building in data access controls, and by doing a whole bunch of other stuff. But the reality is that there is some risk.

Satyen Sangani: (26:30) And for these folks, often they have to weigh taking those risks against taking existential business model risk, which may be less imminent, but very real because lots of companies go out of business because they don’t know how to use data. And you’ve now started a new company. What have you learned about building a data culture, and how have you tried to set up your company’s culture in purposeful and maybe implicit ways?


Launching a Company, Building a Data Culture

Kyle Kirwan: (26:56) The number one thing is, maybe this goes without saying, but at a company that builds tools for data reliability engineering for data observability, we have a group of people who are pretty excited about using hard numbers to make or inform decisions. For example, I was just having a conversation with our head of sales engineering this morning about our customer success score. So looking at indicators from customers of, are they being successful with Bigeye? Are they rolling it out on the amount of data that they plan to? Are they getting the coverage that they want? Are they creating SLAs? Are they keeping those SLAs up? Are we helping them to defend those SLAs and keep their data reliable? Are they adding team members to the platform over time? Are they taking advantage of new features that we ship?

Kyle Kirwan: (27:39) And so I think that rather than go to an individual and ask them, “Hey, is this customer doing okay?” subjectively — now that can be useful input and that can add detail and color to the picture — but we try to start the conversation from, “What can we measure about what’s happening in this customer’s use of Bigeye?

Kyle Kirwan: (27:59) “And are they going to be successful relative to why they made the purchase and decided to bring Bigeye into their data stack?” And so this applies to a lot of other parts of the business as well.

Defining the Modern Data Stack

Satyen Sangani: (28:10) Yeah. I think so much of leadership, particularly in a data context in general, is about just being able to communicate context and the why, which gets back to some of that Simon Sinek “people don’t buy the what, they buy the why” stuff.

You mentioned dbt, and they are emblematic of this idea that you mentioned earlier, which was this thing called the modern data stack. For people who are not familiar with the modern data stack or that term, educate us about what that is and what it means. Is it just the data stack, but more modern? What is this thing?

Kyle Kirwan: (28:48) If I had to explain the modern data stack in a nutshell, it’s like a starting point or a blueprint of primarily managed tools. So think of an elastic warehouse like Snowflake, or think of a SaaS version or a cloud-hosted version of Alation, or dbt’s cloud offering, for example, which is how we use it, or Astronomer when it comes to an ETL system or an orchestrator. And it’s taking these managed tools, usually available as subscription services or consumption-based services, and they tend to snap together pretty conveniently to give you a stack that ranges from data ingestion and landing it into a warehouse, to a warehouse that can handle a bunch of different types of workloads and is elastic to different levels of load that need to be placed on it, to transformation and the orchestration of those transformations: how do we model the data, how do we blend it or hydrate it, et cetera.

Kyle Kirwan: (29:43) And then all the way through to reverse ETL: how do we pump it out of the warehouse and into some other destination, back into our HubSpot or Salesforce, for example. And then analytics and machine learning: how do we consume the data out of the warehouse and apply it to the business. So it describes a template or a blueprint of relatively modern tools (Fivetran and Snowflake have been around for a while, but by some definition they’re modern), and it’s that blueprint for how one would snap these different services together into a relatively managed cloud-native data stack from bottom to top.

Satyen Sangani: (30:18) The two words that came across pretty clearly are both elasticity and on some level integration with the snapping together. Is this only applicable to digital-native, technology-first companies? Or what does the modern data stack have to do with people who maybe don’t already have a modern data stack, or haven’t been born having a modern data stack?

Kyle Kirwan: (30:43) We might expect to see the modern data stack more often at a younger company, but I don’t think there’s any restriction there. I buy my coffee from a subscription service called Trade. And if I think about Trade, they’re like a few years old, so just by nature of it being a younger company, they have the opportunity to start from this green field like, “What should we do? What’s the fastest thing to do?” And so it’s very easy to go grab relatively modern things and look for what’s that blueprint that would help us get off the ground quickly.

Kyle Kirwan: (31:14) If you’re a multi-decade-old company, maybe you’ve made acquisitions, you’ve got a bunch of different technologies that you’ve got to manage. Some of them haven’t been decommissioned or sunsetted yet because there’s a bunch of critical stuff riding on those, there’s just more legacy there. And there’s nothing wrong with that. But any company of any size could adopt the modern data stack. It may just start out as a new initiative or a younger initiative. And then I would expect to see some of that legacy data imported into it over time. So that maybe some of those legacy technologies you could transition off of.

Satyen Sangani: (31:45) There’s been a lot of writing out there about the modern data stack. And it’s certainly helpful to understand, as you’re designing technology internally and data stacks internally to understand it. There are other maybe competing or complementary constructs like data fabric and data mesh, which have overlapping paradigms. Do you see these things all converging into a single thing? Maybe an unfair question, because I don’t know how much you’ve read about those concepts, but how do you think about this modern data stack, its longevity, the construct relative to other potential ways of thinking about building?

Kyle Kirwan: (32:23) I think a lot of things in life are, over the long run, probably cyclical. So I wouldn’t be surprised if we start to see some bundling or if we start to see some movement onto this sort of singular data stack thing. One of the things that was really advantageous about the data platform at Uber is we were able to bottleneck and centralize things. And so for example, every single query run on Presto or on Vertica, either from a user or a service account or the ETL system, all went through one proxy. And so what that meant is we were able to construct lineage from parsing out all the queries run by the ETL system, or we could do statistics about which users were running the heaviest or the least-performant queries. And we could go reach out to them.

Kyle Kirwan: (33:11) We had a dashboard called Opportunity Champions and it had the folks with the total highest number of aggregate query hours at the top. And we could go reach out to them and say, “Hey, do these queries need to be this heavy? Or is there a way we could optimize them?” But those were advantages we got from bottlenecking and centralizing and saying all queries run on our systems are going to go through this one proxy. But when I go talk to a company, they have Snowflake, they have Redshift, they have BigQuery, they have multiple Redshift warehouses, they’ve got Teradata and all these things are in active use. And so I can see it being very challenging to say, “Well, just centralize everything and get it put through a bottleneck.” That’s a pretty tall order in that type of environment.
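To make the single-proxy idea concrete, here is a minimal sketch of what a central query log allows. It is an illustration, not Uber's implementation: the log format, table names, and user names are hypothetical, and the regex-based parsing is a deliberately naive stand-in for a real SQL parser (it ignores subqueries, CTEs, and quoted identifiers). It derives table-level lineage from `INSERT INTO ... SELECT` statements and ranks users by aggregate query hours, in the spirit of the dashboard Kyle describes.

```python
import re
from collections import defaultdict

# Hypothetical proxy log records: (user, runtime_hours, sql).
# The schema and every name below are assumptions made for illustration.
QUERY_LOG = [
    ("etl_service", 0.50,
     "INSERT INTO analytics.trips_daily "
     "SELECT * FROM raw.trips t JOIN raw.cities c ON t.city_id = c.id"),
    ("alice", 2.00, "SELECT count(*) FROM analytics.trips_daily"),
    ("bob", 0.25, "SELECT * FROM raw.cities"),
    ("alice", 1.50,
     "SELECT * FROM raw.trips JOIN analytics.trips_daily USING (trip_id)"),
]

# Naive patterns: the table written by a statement, and the tables it reads.
TARGET_RE = re.compile(r"\binsert\s+into\s+([\w.]+)", re.IGNORECASE)
SOURCE_RE = re.compile(r"\b(?:from|join)\s+([\w.]+)", re.IGNORECASE)


def lineage_edges(log):
    """Return (source_table, target_table) pairs for statements that write a table."""
    edges = set()
    for _user, _hours, sql in log:
        target = TARGET_RE.search(sql)
        if target is None:
            continue  # read-only query: contributes no lineage edge
        for source in SOURCE_RE.findall(sql):
            edges.add((source, target.group(1)))
    return edges


def query_hours_by_user(log):
    """Return (user, total_hours) pairs, heaviest users first."""
    totals = defaultdict(float)
    for user, hours, _sql in log:
        totals[user] += hours
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for source, target in sorted(lineage_edges(QUERY_LOG)):
        print(f"{source} -> {target}")
    for user, hours in query_hours_by_user(QUERY_LOG):
        print(f"{user}: {hours:.2f} query hours")
```

Both functions depend on every query passing through one place. With several warehouses in active use and no shared proxy, the same analysis would first require collecting and normalizing each system's query history.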

Kyle Kirwan: (33:53) So I think that there will probably be some movement toward centralization, toward modern tools over time. But I think it would be a little bit of hubris for me to say, “Oh yeah, it’s going to go that way for sure. And then we’re going to stay there and it’ll be steady state.”

Satyen Sangani: (34:07) Obviously the technologies, and certainly the big step-function technologies like Snowflake and cloud computing, tend to have a ripple effect on other things. I guess there’s this idea of what the template looks like. And then there’s the idea of: how do I go about implementing the template? And one thing that you said to me, I think it was when we first met, which I found interesting, was this concept around what you should implement first. And I think you said to me something like, “Oh, of course you should implement a data catalog first.” And I think you even said that earlier in this discussion, and people always ask me that question.

And some people are like, “You should do governance first.” And some people are like, “You should do quality first.” And some people are like, “You should do integration first.” How do you think about that problem and why, and when should a catalog be first and when should something else be first? And how would you counsel people given your experience in using these tools?

How Stakeholders Inform Structure

Kyle Kirwan: (35:00) That’s a great question. I think to answer it, I actually want to step back a little bit, because earlier you asked me, just a minute ago, about data mesh or data fabric. We didn’t have a name for it at Uber. We didn’t call it data mesh or fabric or anything specific like that. If anything, I’d call it kind of hub-and-spoke, but we had this central platform, and that platform was tasked with providing data tools to the rest of the org.

So my team built and maintained and managed freshness, lineage, quality, incident management, discovery, all these types of challenges for data folks. But we were not the ones going into the data catalog and creating documentation. This was a self-service thing. And so you’d have these embedded analysts, data engineers, et cetera, in each major team around the company.

Kyle Kirwan: (35:51) The rider team, for example, would have a few data engineers, a few data scientists. They would use the platform’s tools. So they would come into the data catalog and document their models, or they would use the pipeline system to create new pipelines and build their own star schema. And then of course that would be available to the rest of the company to query if they wanted to. They could go into the catalog and look up the star schema that was put out by the rider team and see documentation about it.

Kyle Kirwan: (36:18) So the reason I come back to that and I don’t know if that fits the data mesh model or not — I don’t know enough about it to say — but I think that probably informs what you need in your data stack as well. If you are going to take this stakeholder-oriented model and say, “Okay, as a data platform team, who are the most important orgs within the company for us to serve?” At Uber, the rider and driver team are pretty, pretty important. Marketing’s pretty important. Growth is pretty important. So we would treat those as high-priority stakeholders and whatever the deficiencies were in meeting the needs of those stakeholders, that would help inform our prioritization as an org.

So if, for example, documentation is actually pretty good, maybe you’re lucky to work in an organization where even in the absence of a data catalog, you’ve got teams really, really thoroughly documenting their models. Maybe the models don’t evolve so rapidly that the documentation can’t be kept up-to-date. In that case, maybe data quality is really the highest order bit.

Kyle Kirwan: (37:20) And so you go buy a data observability product or something just to make sure everything’s working. Maybe in that environment, the documentation or discovery aspect is adequate for the needs of the organization. I don’t see that happening very frequently. That is why I advocate for a catalog early and often. But as an example, that is letting your stakeholders’ needs and the reality of the organization drive prioritization on the data platform. Because at the end of the day as a data platform team, we are there to serve the needs of the rest of the organization, trying to use data for a business purpose.

What’s the Future of Data?

Satyen Sangani: (37:56) I think the insight there is that on some level, yeah, as a data platform team, you’re trying to build or make available data. But really what you’re trying to do is produce knowledge. And the data without context is really just a bunch of numbers inside of tables. And I think it’s something that people tend to forget because you get so focused on the physical thing versus what you’re really trying to affect in people’s minds.

I would love for you to take us out with a prediction, where do you see change evolving? What’s going to be different in five years? And give us a prediction about how the world is going to be different maybe tomorrow than it is today.

Kyle Kirwan: (38:37) This might be selfish or myopic to my area, but I think one of the biggest things that I see changing is the application of intelligence and automation and stochastic processes and data science to infrastructure. There was a dedicated data science team on the data platform org at Uber. And this was not a trivially small group of data scientists, specifically focused on applying science to our infrastructure. So how can we automatically identify similar or related data sets even in the absence of explicit lineage? Or how can we make suggestions for how to remap or lay out a data model? Where are we missing indices or where could we partition something differently to improve performance? Or where can we apply batching methodologies to queries or things like that?

I’ve seen plenty of research on this recently that’s super fascinating around translating a SQL query into an English description of what the query does, or going the reverse route: taking a human sentence (a “query,” with query meaning question) and generating a SQL query from it that is actually executable.

So these are all fairly new frontiers, I think. Because historically, so much of the emphasis has just been on getting the basic machinery to work in the first place. So there just hasn’t been capacity available to ask questions about, “Well, what can we do with intelligence on top of that infrastructure?” But when I look at Bigeye, for example, at the end of the day, what are we doing? We’re doing signal processing on characteristics of the data.

Kyle Kirwan: (40:19) And so what are the things that you can do with a year’s worth of four-times-a-day granularity time series about literally thousands of attributes across your entire data warehouse? And the answer is you can do some pretty interesting stuff. And that’s just the one sector of infrastructure that I’m paying attention to, but there’s tons of others. There’s queries and analytics and database performance and all these other things. I think over the next 10 years, we’ll start to see a much stronger investment of intelligence and automation and data science thinking applied to the infrastructure itself. And I think that’ll unlock a lot of new possibilities.
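As a rough illustration of what "signal processing on characteristics of the data" can mean, the sketch below flags anomalies in a time series of one table attribute sampled four times a day. It is not Bigeye's method: a trailing-window z-score is swapped in as a simple stand-in, and the metric (a row count), the window of 28 samples (seven days), the threshold of three standard deviations, and the synthetic data are all assumptions.

```python
from statistics import mean, stdev

# Synthetic row counts for one table, sampled four times a day for ten days.
# Sample 36 simulates a partial load; all values are made up for illustration.
ROW_COUNTS = [1000 + (i % 4) * 10 for i in range(40)]
ROW_COUNTS[36] = 200


def flag_anomalies(values, window=28, threshold=3.0):
    """Return indices whose value sits more than `threshold` standard
    deviations from the mean of the preceding `window` samples."""
    flagged = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Flat history: any change at all counts as a deviation.
            if values[i] != mu:
                flagged.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged


if __name__ == "__main__":
    print(flag_anomalies(ROW_COUNTS))
```

A production system would also need to handle seasonality, trend, and the way a single outlier inflates the variance of later windows. The point here is only that a long history of cheap per-attribute measurements is enough to start asking such questions.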

Satyen Sangani: (40:57) There’s always going to be tension between risk and velocity. As I reflect on my conversation with Kyle, it’s my hope that we can move at the speed of disruption while also minimizing risk. The vendors in our field today have created a wealth of tools that allow you to make more intelligent, data-driven decisions. And, with so many tools available, everyone has the freedom to find what best fits their organization.

Satyen Sangani: (41:21) I believe we’re still living in the Dark Ages of Data, but this conversation with Kyle gives me the confidence that the Data Renaissance is just around the corner. Thank you to Kyle for joining us for this episode of Data Radicals. This is Satyen Sangani, co-founder and CEO of Alation. Thank you for listening.

Producer Read: (32:40) This podcast is brought to you by Alation. Catch us at Snowflake Summit this June. Learn how folks use Alation with Snowflake to wield data powerfully. Innovate ahead of the curve and build data products that give you a competitive edge. Snowflake Summit runs from June 13th to the 16th. Attend in person in Las Vegas or virtually from home. See you there. Learn more at snowflake.com/summit.
