We Spent 4 Years Building AI Products: Here's What Actually Mattered

By Christopher Aberger

Published on October 17, 2025


A reflection on 4 years of building in an AI startup

After four years of riding the AI rollercoaster at Numbers Station, through pivots, iterations, and eventually our acquisition by Alation, I've had the luxury of stepping back and asking myself: What the hell actually mattered?

The AI space moves at warp speed. Remember when GPT-3 was mind-blowing? That feels like a decade ago. In this environment, the only superpower that matters is your ability to iterate and adapt faster than your competition. Most of what we built is now obsolete (RIP to our beautifully engineered multi-agent framework, you were too pure for this world). But the three foundations we built around have not only survived, they've become even more critical.

Here's what lived, what died, and why you should care.

Pillar 1: Metadata is the air your AI breathes 💨

What lived on: The metadata layer, or as we called it at Numbers Station, the "knowledge layer" (sounds fancy, right?).

Great agents aren’t “smart” in the abstract; they’re well briefed. You can have the most sophisticated LLM in the world, but if it doesn't know that your "cust_rev_q3_final_FINAL_v2" table actually contains quarterly revenue data broken down by customer segment, it's about as useful as a chocolate teapot. 

When you’re dealing with structured data, that brief is your metadata: table purpose, column semantics, lineage, governance policies, owners, quality signals, and relationships. At Numbers Station, we called the combination of a semantic layer plus a knowledge graph our knowledge layer, a system that describes not just what your data is, but how to talk about it.

A semantic layer provides the common language to interpret data consistently across tools and teams. It defines business terms, metrics, and rules in a human-readable way. The knowledge graph, meanwhile, maps relationships among data assets, users, and policies. Together, they form an agentic knowledge layer that gives AI context: it’s how agents understand what data means, how it should be used, and what decisions can be made from it.
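
To make that concrete, here's a minimal sketch of what a single entry in a knowledge layer like this might look like. The shape and field names are purely illustrative, not Numbers Station's or Alation's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class ColumnSemantics:
    name: str                  # physical column name, e.g. "cust_seg"
    meaning: str               # human-readable definition
    metric: str | None = None  # business metric this column feeds, if any

@dataclass
class TableMetadata:
    physical_name: str                  # e.g. "cust_rev_q3_final_FINAL_v2"
    description: str                    # what the table actually contains
    owner: str                          # accountable team or person
    columns: list[ColumnSemantics] = field(default_factory=list)
    lineage: list[str] = field(default_factory=list)   # upstream tables
    policies: list[str] = field(default_factory=list)  # governance rules

# One node in the knowledge graph: the table plus its relationships.
quarterly_revenue = TableMetadata(
    physical_name="cust_rev_q3_final_FINAL_v2",
    description="Quarterly revenue broken down by customer segment.",
    owner="finance-analytics",
    columns=[
        ColumnSemantics("cust_seg", "Customer segment (SMB, MM, ENT)"),
        ColumnSemantics("rev_usd", "Recognized revenue in USD", metric="quarterly_revenue"),
    ],
    lineage=["raw.billing_events"],
    policies=["mask rev_usd for non-finance users"],
)
```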

We spent, and continue to spend, an enormous number of engineering hours figuring out how to build and maintain this knowledge layer. How do you automatically extract meaningful descriptions from cryptically named database columns? How do you incorporate user feedback so the system gets smarter over time? How do you make it a living, breathing organism that evolves with your organization instead of becoming another piece of technical debt?

The answer involves a delicate dance between AI automation and human curation. You need AI to do the heavy lifting of initial metadata extraction and maintenance, but you need humans in the loop to validate, correct, and enrich it. It's not as sexy as full agentic automation, but it actually works.
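
As a rough sketch of that dance: the AI drafts descriptions, and a human steward approves or corrects them before anything is published. Everything below is illustrative; `draft_description` is a stand-in for an LLM call, and the review queue is deliberately simplistic:

```python
def draft_description(column_name: str, sample_values: list[str]) -> str:
    """Stand-in for an LLM call that proposes a description from the
    column name and a few sampled values."""
    return f"(draft) {column_name}: values look like {sample_values[:3]}"

def propose_and_queue(column_name: str, sample_values: list[str],
                      review_queue: list[dict]) -> str:
    """AI does the heavy lifting; a human steward validates, corrects,
    and enriches the draft before it is published."""
    draft = draft_description(column_name, sample_values)
    review_queue.append({"column": column_name, "draft": draft, "status": "pending_review"})
    return draft

review_queue: list[dict] = []
propose_and_queue("cust_seg", ["SMB", "MM", "ENT"], review_queue)
```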

A former exec at Tableau dropped a truth bomb on me that I haven’t been able to shake: “Data catalogs were built for AI before AI was a thing.” And the more I’ve thought about it, the more I’ve realized they were right. That insight ultimately led us to Alation.

At Numbers Station, we were working to get enterprise AI into production, and in the process, we found ourselves rebuilding what was, in essence, a data catalog from the ground up: a metadata layer with governance, security controls, and all the enterprise-grade bells and whistles our customers expected. At some point, we had to ask ourselves: why reinvent the wheel when we could join forces with the people who've been perfecting it for years? That's ultimately why we joined Alation. In the AI era, metadata is king.

What didn't matter: That beautiful multi-agent orchestration framework we spent months building.

In early 2024, we bought into the vision that the future belonged to armies of specialized AI agents working together through complex orchestration. To make this a reality, we built an entire agent framework to coordinate these digital workers. It had message passing! State management! Elegant error handling! We had no other choice at the time; nothing on the market had all the features our customers needed.

Then reality hit us with a one-two punch.

First, frameworks like Pydantic AI and LangChain blew past us. When we started, they couldn't handle our requirements. But while we were busy being a startup, maintaining the framework on top of solving our customers' end-to-end problems, they were laser-focused on this one problem. The lesson: specialized vendors will always outpace you on features that aren't your core business. Every time.

The knockout blow came when OpenAI dropped o1 and basically said, "Or... you could just give one really smart model a bunch of tools and let it figure things out." For most of the use cases we'd been targeting, orchestration complexity became yesterday's problem. Today's reasoning models can handle tasks that previously required complex multi-agent choreography. Our elegant framework became technical debt overnight.
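
The pattern that replaced all that choreography is almost embarrassingly simple: one capable model, a bag of tools, and a loop. Here's a hedged sketch; `call_model` is a placeholder for whichever reasoning model you use, not any specific vendor's API:

```python
# Tools the model may call; each is an ordinary function.
TOOLS = {
    "list_tables": lambda args: ["cust_rev_q3_final_FINAL_v2", "raw.billing_events"],
    "run_sql": lambda args: f"(rows returned for: {args['query']})",
}

def call_model(messages: list[dict]) -> dict:
    """Stand-in for a reasoning-model call. Wire this to your provider; it should
    return either {"tool": ..., "args": ...} or {"final": True, "content": ...}."""
    return {"final": True, "content": "(plug a real model in here)"}

def agent_loop(task: str, max_steps: int = 10) -> str:
    """One smart model, a handful of tools, a loop: no orchestration framework."""
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        step = call_model(messages)
        if step.get("final"):
            return step["content"]
        result = TOOLS[step["tool"]](step.get("args", {}))
        messages.append({"role": "tool", "name": step["tool"], "content": str(result)})
    return "stopped after too many steps"
```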

Pillar 2: Evaluations are your north star ⭐

What lived on: A robust evaluation framework for AI agents.

Here's a dirty secret about AI development: Most teams are flying blind. They tweak prompts based on vibes, deploy "improvements" based on cherry-picked examples, and pray their changes don't break something else. We were guilty of this too… until we learned a bitter lesson.

Evaluations are to AI agents what unit tests are to traditional software. But evaluations are infinitely more complex because you're not dealing with deterministic outputs. You need sophisticated judges (sometimes other LLMs, sometimes custom logic) to determine if an agent's response is "good enough." You need to handle edge cases, measure performance across diverse scenarios, and track regression over time.
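
For a feel of what this looks like in practice, here's a stripped-down sketch of an eval harness with an LLM-style judge. The `judge_llm` rubric below is a toy; in reality it would call a judge model or custom logic:

```python
from dataclasses import dataclass

@dataclass
class EvalCase:
    question: str
    reference: str  # known-good answer, or a rubric the judge checks against

def judge_llm(question: str, answer: str, reference: str) -> float:
    """Toy judge: in practice this calls a judge model or custom logic
    and returns a score between 0.0 and 1.0."""
    return 1.0 if reference.lower() in answer.lower() else 0.0

def run_evals(agent, cases: list[EvalCase], threshold: float = 0.8) -> float:
    """Run every case through the agent and report the pass rate;
    track this number over time to catch regressions."""
    scores = [judge_llm(c.question, agent(c.question), c.reference) for c in cases]
    return sum(s >= threshold for s in scores) / len(scores)

# Illustrative usage with a stand-in "agent".
pass_rate = run_evals(
    lambda q: "Q3 revenue was $42M, led by the enterprise segment.",
    [EvalCase("What was Q3 revenue?", "revenue")],
)
```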

We became so obsessed with this that we ended up productizing our internal evaluation tools as part of our Agent Builder release at Alation. If we suffered through figuring this out, why should everyone else have to?

The beautiful thing about solid evaluations is that they give you superpowers. New model drops? Run your eval suite and know within hours whether it's worth upgrading. Want to try a different prompting strategy? Your evals will tell you if you're actually improving or just fooling yourself. The LLM landscape changes every month, but good evaluations let you surf those waves instead of being crushed by them.

What didn't matter: The agents themselves.

I know, I know, this sounds insane. We spent years perfecting our SQL query agent, agonizing over every word in our prompts. We had countless debates about its inner workings, how it should interact with the semantic layer, and what kinds of data models it should work with.

None. Of. It. Mattered.

The truth is, writing a prompt for a SQL agent is now trivial. What really matters is the metadata that provides context and the evaluations that tell you whether it's working. The agent itself has become a commodity.
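
To see just how trivial, here's roughly what the entire "agent" can reduce to once the metadata exists. The names below are illustrative, not our actual implementation; notice that nearly every line of the prompt is metadata, not cleverness:

```python
def sql_agent_prompt(question: str, tables: list[dict]) -> str:
    """The whole 'agent' prompt, assembled almost entirely from metadata."""
    table_lines = "\n".join(
        f"- {t['name']}: {t['description']} Columns: {', '.join(t['columns'])}"
        for t in tables
    )
    return (
        "You write SQL against the warehouse described below.\n"
        f"Tables:\n{table_lines}\n\n"
        f"Question: {question}\n"
        "Return only SQL."
    )

# Illustrative usage with the infamous table from earlier.
prompt = sql_agent_prompt(
    "What was Q3 revenue by customer segment?",
    [{
        "name": "cust_rev_q3_final_FINAL_v2",
        "description": "Quarterly revenue by customer segment.",
        "columns": ["cust_seg (segment)", "rev_usd (recognized revenue, USD)"],
    }],
)
```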

Here's the key insight: a one-size-fits-all agent will never be ideal. Every customer needs their own SQL agent, tuned to their specific needs and quirks. That's exactly why we made it dead simple to build one in Agent Builder.

The magic is no longer in prompt engineering. The real innovation lies in the infrastructure that makes these agents actually work: the metadata layer that understands your data and the evaluation framework that keeps them honest.

Pillar 3: Integration is everything 🔌

What lived on: Meeting users where they already work.

After countless customer conversations, we had our 'come to Jesus' moment. We realized we shouldn't be building yet another SaaS app. People view SaaS apps as walled gardens with fixed, one-size-fits-all functionality. So what did they actually want?

The answer: AI capabilities that plug into their world. Sometimes that's existing tools like Slack. Sometimes it's the custom data apps they're building. Yes, even that janky internal React app held together with duct tape and prayers.

We pivoted hard toward embedded AI analytics. Instead of forcing users to come to our walled garden, we brought the garden to them. This philosophy got turbocharged with the emergence of protocols like A2A and MCP, which made integration stupidly easy.
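
As one example of how low the integration bar has become, here's a minimal sketch of exposing an agent as an MCP tool using the reference Python SDK. `answer_with_agent` is a placeholder for whatever governed, metadata-aware agent sits behind it:

```python
# pip install "mcp[cli]"  -- the reference MCP Python SDK
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("data-agent")

def answer_with_agent(question: str) -> str:
    """Placeholder: route the question to your governed, metadata-aware agent."""
    return f"(answer for: {question})"

@mcp.tool()
def ask_data(question: str) -> str:
    """Answer a data question using the organization's data agent."""
    return answer_with_agent(question)

if __name__ == "__main__":
    mcp.run()  # any MCP-aware host (chat app, IDE, custom app) can now call ask_data
```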

Here's the future as we see it: AI democratizes building. Everyone becomes a developer. Nobody wants one-size-fits-all SaaS apps anymore. Instead, organizations want to build custom apps on top of well-governed AI capabilities. Our job isn't to build the perfect UI; it's to provide the AI infrastructure and get out of the way. We're selling a better way to build, not another app.

What didn't matter: Fancy UIs for data visualization and workflow orchestration.

We learned two lessons about what didn’t matter to us:

First, on fancy data visualization UIs: Nobody wants another SaaS provider to build dashboards. The world doesn't need another walled-garden dashboarding tool. Thankfully, we never fell into this trap; we always focused on embedding into existing tools like Tableau and Power BI rather than competing with them.

Second, on workflow orchestration UIs: We knew these would become commoditized, just like multi-agent frameworks. Still, we couldn't resist trying. We built one, only to kill it unceremoniously in June. Why? Because we saw the writing on the wall, and OpenAI proved us right this week with AgentKit. Building a workflow UI is now table stakes (shoutout to n8n, which is still the best-of-breed tool here in my opinion).

The lesson? Don't build for things that will inevitably become commoditized. Focus on the hard problems that actually create value. In our case, that's making it brain-dead simple for users to build custom agents on their metadata and embed them anywhere.

The meta-lesson: Build for change, not perfection

These three pillars (metadata, evaluations, and integrations) are now core design principles in Alation's Agent Builder. But the real insight isn't about these specific technologies. It's about building products that can evolve as fast as the AI landscape changes.

The graveyard of AI startups is littered with companies that built beautiful, perfect solutions for problems that stopped existing six months later. The winners are those who built platforms for adaptation, tools that help enterprises deploy agents quickly, evaluate them ruthlessly, and evolve them continuously.

The space is moving too fast for any other strategy. You can't build a moat around a specific AI capability anymore. You build a moat around your ability to help customers adapt to whatever comes next.

So here's my advice: Stop trying to build the perfect AI app or agent. Start building the platform that lets others build what they need. Stop obsessing over prompt optimization. Start obsessing over knowing whether your prompts actually work. Stop forcing users into your product. Start embedding your capabilities into theirs.

For most of us, the AI revolution is not going to be about building the smartest model or the cleverest agent. It's about building the infrastructure that lets everyone else build what they need, when they need it, and evolve it as fast as the technology allows.

And maybe, just maybe, have enough metadata to know what the hell "cust_rev_q3_final_FINAL_v2" actually means.

Curious to learn more? Explore our new AI capabilities.
