Building Data App Infrastructure: The 4 Critical Layers for 2026

Published on January 27, 2026


In this article, weʼll explore the four layers — catalog, lineage, pipeline & agent — of the technology stack expected to play a key role in defining data apps in 2026 (and beyond). Join us for a pragmatic, step-by-step look at the advantages, tools, and implementation strategy for each element. 

Traditional ETL systems were rigid monoliths that forced centralized teams to manage entire pipelines end-to-end. Modern data applications embrace a more flexible, modular architecture where each stage — extract, load, transform (ELT) — has its own dedicated toolset. For example, Fivetran handles data ingestion, cloud data warehouses store data, and dbt manages transformations. With this setup, individual teams can handle their part of the data pipeline without relying on others, speeding up innovation and reducing dependencies. According to Gartner, CDOs should shift their operating model from project-centric to product-centric. Embracing this data-product mindset will set us up nicely to dive into the world of tiered stacks!

The data catalog aggregates all assets (datasets), their ownership details, licensing constraints, and usage metrics. Data users — including developers, analysts, and even AI agents — can use this central source to find relevant data quickly and learn more about it. Different types of people will have different needs within the catalog, but behind each personaʼs unique experience lies a set of powerful API endpoints designed explicitly for governing and automating processes related to data management.

When thereʼs no central place to find trustworthy data assets, folks resort to finding answers wherever they can — often digging through Slack conversations, Google Sheets, or wiki articles. For engineers who need specific tables, this lack of visibility leads to time wasted on searches and growing backlogs of requests. When those responsible for building products donʼt have access to reliable data, product releases get pushed back, and data quality suffers.

The growing demand for composable solutions using real-time analytics and genAI underscores a larger transformation at play within organizations. Itʼs indicative of a growing interest in embracing product thinking, which demands evolving both software & mindset.

Yet even if people processes are optimized, organizations still need to select appropriate technologies at each level — catalog, lineage, ELT/dbt pipelines, and AI agent tooling. Each choice has implications for cost and complexity; choosing wisely sets folks up for success later on when they start building AI agents. 

•  Catalog — centralized place to find and manage data

•  Lineage — comprehensive data traceability

•  Pipelines (ELT + dbt) — composable data transformations

•  AI Agents — independent creators and users

Additionally, weʼll dive into how agents can be orchestrated (supervised vs. unsupervised), governed (centralized vs. decentralized), integrated across cloud & on-premises landscapes, and adopted incrementally through pilot programs, federation, and eventual productization. This comprehensive approach helps minimize both cultural and technological barriers. 

Decoding the 4 layers of modern data application infrastructure

Now that weʼve identified all the layers within a data app framework, letʼs dive deeper into each one. We will unpack where these pieces fit into the overall puzzle and offer practical use cases along the way.

Layer #1: Data catalog

When people use personal spreadsheets, instant messaging, and wikis to share important work details, organizational knowledge becomes fragmented. This creates a challenging environment for data teams who struggle to find the correct table or metric definition amidst numerous inconsistent sources. When folks interpret data differently due to these discrepancies, distrust in analytics arises, leading to time-consuming reconciliations and reluctance to embrace data-driven decisions. 

This new foundation assumes one simple thing: A centralized place to find all data, alongside its owners and popular uses. Data catalogs emerged to play this role. They index databases, BI reports, ETL processes & more, pulling in both technical (schema) and business (description) metadata. Who owns what? Answering this question becomes trivial when viewing a list of resources enriched with owner details. Which data is used by whom? Catalogs often display popularity metrics reflecting how many people use specific data elements within a given time frame. In this way, they guide folks towards the most trustworthy and widely adopted options.

Maintaining static spreadsheets of cloud resources is time-consuming and error-prone. As soon as source details change, the accuracy of those spreadsheets diminishes. Todayʼs modern organizations often struggle to manage their ever-growing list of cloud assets (which now total in the tens of thousands) across multiple clouds, which shift and evolve at breakneck speed. 

Ideally, those closest to the data — i.e., the domain team — should be responsible for creating and updating its metadata. Otherwise, valuable business context around the data will fade away over time, leading to mistrust (and more help desk requests). A collaborative approach ensures that everyone has access to accurate, trustworthy information.

Take the example of a marketing analytics team that used a catalog to resolve ambiguity around campaign metrics. When their organization was rapidly growing, people were increasingly unclear on which metrics existed (and what those metrics meant). To address this challenge, the team documented definitions, staleness policies, and dependency diagrams for critical tables within the catalog. In doing so, they empowered users across domains — including product and finance — to find and use data confidently, without relying on time-consuming service requests. This demonstrates how embracing a catalog facilitates seamless communication across teams and speeds up access to actionable intelligence.

When designed properly, a catalog layer will route inquiries to the correct owners, speeding up troubleshooting and reducing support ticket volumes. Each entry should map a dataset to its steward(s) and contract(s), so folks know who to contact with questions or bug reports. This transparency around ownership details accelerates problem resolution and minimizes time spent on vague customer support requests.

An effective catalog layer shares a few defining characteristics:

•  Scalable — exposes open API endpoints that support automated ingestion and egress of metadata via scripting languages.

•  Interoperable — stores metadata in its natural format (e.g., using open table formats like Apache Iceberg), eliminating duplicative work and optimizing performance for modern analytics use cases; enables federated search across data catalogs, data meshes, and BI tools.

•  Secure — integrates with existing security policies; role-based access control (RBAC) enforces appropriate authorization at all levels (row, column), supporting compliant self-service analytics across diverse environments.
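To make these behaviors concrete, here is a minimal in-memory sketch of a catalog that tracks ownership and popularity, and ranks search results accordingly. The class and field names are illustrative, not from any particular product; real catalogs expose equivalent capabilities through API endpoints.

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    name: str
    owner: str
    description: str = ""
    tags: set = field(default_factory=set)
    query_count: int = 0  # popularity metric

class DataCatalog:
    def __init__(self):
        self._entries = {}

    def register(self, entry: CatalogEntry):
        self._entries[entry.name] = entry

    def record_usage(self, name: str):
        self._entries[name].query_count += 1

    def owner_of(self, name: str) -> str:
        # "Who owns what?" becomes a trivial lookup
        return self._entries[name].owner

    def search(self, keyword: str):
        # Rank matches by popularity so users find trusted assets first
        kw = keyword.lower()
        hits = [e for e in self._entries.values()
                if kw in e.name.lower() or kw in e.description.lower()]
        return sorted(hits, key=lambda e: e.query_count, reverse=True)

catalog = DataCatalog()
catalog.register(CatalogEntry("marts.campaign_metrics", "marketing-analytics@corp",
                              "Canonical campaign KPIs"))
catalog.register(CatalogEntry("scratch.campaign_tmp", "unknown"))
catalog.record_usage("marts.campaign_metrics")
print(catalog.search("campaign")[0].name)  # → marts.campaign_metrics (most-used first)
```

The popularity-weighted ranking is what nudges users toward the widely adopted, trustworthy asset instead of the scratch copy.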

Some argue there are drawbacks to this approach. Potential disadvantages include time spent curating data catalogs, inconsistent tagging, and siloed business glossaries. When catalog adoption stalls due to overwhelming manual curation efforts or divergent tagging practices, organizations struggle with incomplete datasets and lack of confidence in their findings. Yet fear not! In the following sections, weʼll explore how features such as lineage, quality rules, and additional metadata layers play a crucial role in reinforcing standards and reducing the burden of maintaining them over time.

Scaling catalogs: common pitfalls & how to avoid them

Yet scaling a data catalog introduces new challenges — stale metadata, lack of curator involvement, etc. To mitigate these risks, implement automated crawlers alongside defined data contract expectations and regular quality checks. Otherwise, distrust will become commonplace as duplicative listings emerge and ownership remains opaque.

The role of a catalog in the data-app stack

Data catalogs form the base layer of this architecture. By centralizing information on data assets — including their technical details and business context — they support data discovery, facilitate communication around data, and enforce accountability (who owns what data, and under what guidelines?). Data catalogs also play a key role in driving automation across the data landscape. Serving as active metadata repositories, these catalogs supply valuable context to higher levels of the technology stack. 

Cataloging versus ad-hoc documentation

Unlike catalog-first approaches, which automatically discover metadata across systems, ad-hoc solutions like spreadsheets and wikis require time-consuming manual curation, making them impossible to scale.

Layer #2: Lineage

Weʼve talked about finding data so far. Now, letʼs dive into why folks who use data need transparency around its origin, transformation history, and trustworthiness. When people donʼt know where data came from or how it changed, their faith in that data erodes. Changes made upstream (but hidden downstream) can break dashboards, leading to unexpected errors — and expensive rework! Regulations like GDPR, SOX, and HIPAA impose strict requirements, including an end-to-end view of data lineage for audits. Knowing where data comes from helps reduce time spent debugging issues and builds trust across teams.

Just like retailers track produce back to its source to handle potential contaminations efficiently, data users have traceability needs too! With Monte Carlo, if a dashboard metric has been impacted by a compromised upstream job, folks get notified via Slack so they can inform others accordingly. Additionally, when new issues arise within a pipeline, those who use affected models will be alerted through Slack messages, empowering them to address problems quickly and prevent further complications. 

Adoption comes down to three simple questions:

•  Graph completeness. Lineage should represent all dependencies (upstream & downstream), preferably at the finest granularity possible (down to columns). This ensures accurate impact assessment of changes.

•  Timeliness. Lineage should be delivered rapidly (within minutes) so folks can identify problems immediately upon pipeline completion. 

•  Easy to visualize. Interactive, clickable diagrams enable developers and non-developers alike to navigate dependencies within a system. This empowers folks across domains — engineering, product, etc. — to comprehend data lineage visually, fostering broader accessibility and understanding.

For trust, provenance vendors need to manage collection agents, share metadata, and flag issues directly within the catalog. Secure, lightweight producers built into common data engineering tools (like Spark, Airflow, or dbt) can emit lineage using limited permissions; meanwhile, the catalog displays job status warnings alongside owner contact info, empowering folks to address challenges efficiently.

Soon, weʼll be able to map these properties to standard formats like OpenLineage and Marquez. OpenLineage provides a standardized format for datasets, tasks, and executions so different programs can use the same definitions; Marquez is a cataloging tool for storing and visualizing this metadata. With native support for popular frameworks like Airflow, Spark, dbt, Flink, and Dagster, users will gain consistent lineage visibility across their entire stack without needing to build custom integrations.
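To illustrate what a lineage event actually carries, here is a sketch that assembles a minimal OpenLineage-style RunEvent as a plain dictionary. The top-level field names follow the OpenLineage spec (eventType, eventTime, run, job, inputs, outputs, producer); the namespace and producer URI are hypothetical, and a production system would typically use the openlineage-python client rather than hand-built dicts.

```python
import json
import uuid
from datetime import datetime, timezone

def make_lineage_event(event_type, job_name, inputs, outputs,
                       namespace="analytics", run_id=None):
    """Build a minimal OpenLineage-shaped RunEvent as a plain dict."""
    return {
        "eventType": event_type,                        # START / COMPLETE / FAIL
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id or str(uuid.uuid4())},  # immutable run ID
        "job": {"namespace": namespace, "name": job_name},
        "inputs": [{"namespace": namespace, "name": n} for n in inputs],
        "outputs": [{"namespace": namespace, "name": n} for n in outputs],
        "producer": "https://example.com/my-pipeline/v1",  # hypothetical producer URI
    }

event = make_lineage_event("COMPLETE", "dbt.orders_daily",
                           inputs=["raw.orders"], outputs=["marts.orders_daily"])
print(json.dumps(event, indent=2))
```

Because every framework emits this same shape, a consumer like Marquez can stitch events from Airflow, Spark, and dbt into one graph without custom integrations.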

Essential components of data lineage

Comprehensive: Ingests lineage data from various sources, including data warehouses, data lakes, BI tools, and ETL pipelines, using open standards like OpenLineage to visualize all upstream & downstream dependencies within each system. For example, enterprise metadata management solutions like Alation consume lineage events emitted by Snowflake, BigQuery, dbt models, etc., to give users a comprehensive view of dependencies without requiring manual configuration.

Version-aware: Stores immutable run IDs and complete lineage/schema history so users can examine past states of data products & jobs. For example, Alation displays prior versions of a given tableʼs schema; similarly, OpenLineage event payloads contain unique IDs referencing distinct runs of pipelines and processes.

Real-time: Emits lineage events during job execution so consumers have up-to-date views without needing to play catchup later. For example, if using OpenLineage, data lineage agents will emit events as tasks execute; this allows systems like Marquez and DataHub to consume those events and update their own representations of lineage accordingly. 

Queryable: Provide API & UI access to lineage graphs so people can search, filter, and export lineage details for on-the-spot investigation. Today, many modern tools will give you both a GraphQL endpoint alongside a visual interface where folks can navigate dependencies and download lineage data for their own unique use cases (reporting, root cause analysis, etc.).

Contextualized: Enrich nodes & edges with owner details, SLA links, security classifications, and more. This empowers folks to grasp the business significance behind data transformations, rather than getting lost in tech jargon! With OpenLineage features, publishers can augment base lineage with valuable metadata, which systems will then use to share important info across dependencies.

Secure: Ensure data privacy through encrypted communication (in transit) and storage (at rest). Implement role-based access control (RBAC) and policy-based masking. Look for features like Transport Layer Security (TLS) encryption, Zero Trust architecture, and Attribute-Based Access Control (ABAC), which offer robust security measures without compromising usability or regulatory compliance.

These characteristics reflect the growing need for software supply chain security and traceability, as codified in frameworks like SLSA (Supply-chain Levels for Software Artifacts). Just as SLSA provides different levels of assurance around software components — ranging from simple attestation to cryptographically secured builds — so too will data lineage offer varying degrees of trustworthiness based on its properties. Indeed, frameworks like Atlas already borrow this thinking from the world of software development to help establish trust and transparency within machine learning pipelines. Embracing all six dimensions of lineage discussed above prepares data teams to satisfy the demands of DevSecOps professionals, who seek the same assurances around data assets as they do around software packages. In doing so, organizations can automate processes and ensure compliance requirements are met.

 This tight integration supports data discovery and automates lineage-to-owner mapping (more on this later). For example, tools like Marquez consume OpenLineage events to display lineage diagrams along with metadata details such as descriptions, owners, and governance rules. This empowers users to perform root cause analyses and route alerts automatically to the appropriate owner.

Lineage increases data reliability in enterprises

How does lineage support trustworthy AI? By driving data trustworthiness! With automated impact analysis, responsibility assignment, and accelerated problem solving, organizations using lineage benefit from fewer errors making their way into models. When implementing new schemas, users gain valuable previews of potential consequences, allowing them to proactively address problems before they affect live systems. Additionally, end-to-end traceability connects data assets to their custodians, who receive automatic notifications when something goes wrong, encouraging ownership and quick action. And if an issue arises, interactive diagrams help trace back to the source, speeding up diagnosis and repair.

Best practices for automated lineage capture

Key capabilities include:

•  Using small, standardized SDKs (like OpenLineage) to monitor tasks without needing heavyweight agents

•  Incorporating database event listeners (like Debezium CDC topics) to detect when tables have changed

•  Encrypting lineage data at rest, with customizable storage duration settings

•  Establishing data ownership, backed by automatic regular reviews

•  Versioning lineage assets via CI/CD pipelines for dependable upkeep

Layer #3: Data pipelines & transformation (ELT + dbt)

A challenge has haunted software development teams for decades: how can we build flexible, adaptable ETL processes? Traditional ETL solutions often use rigid SQL procedures chained together to process data sequentially. When a database changes its format (schema) or a new data source appears, every step after that initial one needs to be updated and tested again. This creates a fragile system where small upstream changes have large ripple effects across the entire pipeline. Such systems increase time-to-value for new data products, lengthen troubleshooting times, and obscure true failure points within opaque code blocks.

“The mission for dbt was to take all of these software-engineering best practices and bring them to the data analytics workflow.” - Drew Banin, co-founder, dbt Labs (June 2025)

Compliance demands will only grow stricter — alongside growing data volume. To meet this challenge, organizations need to separate their ELT processes into three distinct steps: extract, load, and transform. This way, they can ensure traceability without compromising efficiency. With ELT, all source data remains stored within the cloud data warehouse (immutable), so regulators have full visibility into how decisions were made around sensitive data subject to GDPR, HIPAA, etc. Furthermore, if someone exercises their “right to be forgotten” or asks you to redact all personally identifiable information (PII) related to them, youʼll have the ability to rewind time and fulfill those requests. Each stage of the process becomes traceable, repeatable, and individually testable.
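The extract/load/transform separation can be sketched end-to-end in a few lines. Here SQLite stands in for the cloud warehouse and hard-coded rows stand in for a source system; the point is that raw data lands untouched, and the derived table is rebuilt from it by a SQL transformation, much as a dbt model would be.

```python
import sqlite3

# Extract: rows as they arrive from a (hypothetical) source system.
raw_rows = [("u1", "signup", "2026-01-01"),
            ("u2", "signup", "2026-01-02"),
            ("u1", "purchase", "2026-01-03")]

conn = sqlite3.connect(":memory:")  # stands in for the cloud warehouse

# Load: land source data untouched, so every downstream result is traceable.
conn.execute("CREATE TABLE raw_events (user_id TEXT, event TEXT, event_date TEXT)")
conn.executemany("INSERT INTO raw_events VALUES (?, ?, ?)", raw_rows)

# Transform: derive a model from the immutable raw layer (what dbt does with SQL).
conn.execute("""
    CREATE TABLE events_per_user AS
    SELECT user_id, COUNT(*) AS n_events
    FROM raw_events GROUP BY user_id
""")
print(conn.execute("SELECT * FROM events_per_user ORDER BY user_id").fetchall())
# → [('u1', 2), ('u2', 1)]
```

Because the transform reads only from the raw layer, it can be dropped and rebuilt at any time — which is exactly what makes each stage individually testable.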

A key benefit of this architecture comes down to scalability and flexibility at each individual step within the pipeline. For example, using cloud storage solutions like Snowflake, teams can ensure that their ELT processes (extraction, loading, and transformation) have independent scaling capabilities based on demand. This prevents bottlenecks and optimizes resource utilization across different stages of the process. Additionally, organizations gain the freedom to choose the most appropriate engine for each specific task without compromising the integrity of the entire system.

Remember how we discussed catalogs & lineage? Those systems consume pipeline metadata (dbt outputs manifest.json and run_results.json, which OpenLineage translates into standardized lineage events). Services like Alation use these events to enrich their respective knowledge graphs. Tools like Datafold leverage this lineage context to drive automated quality checks and prevent broken builds. And finally, policy-as-code comes into play! When data fails to meet certain standards or comply with policies, pipelines get blocked at the source. Itʼs worth noting that this mirrors traditional software development practices, where code changes that fail unit tests are blocked from merging.
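As a rough illustration of how tools mine dbt artifacts for lineage, the sketch below walks a toy excerpt of manifest.json and emits upstream-to-downstream edges. The node names are invented; only the nodes / depends_on.nodes structure mirrors the real artifact.

```python
# A tiny excerpt mimicking dbt's manifest.json structure (real files are far larger).
manifest = {
    "nodes": {
        "model.shop.orders_daily": {
            "depends_on": {"nodes": ["model.shop.stg_orders"]},
        },
        "model.shop.stg_orders": {
            "depends_on": {"nodes": ["source.shop.raw.orders"]},
        },
    }
}

def model_edges(manifest: dict):
    """Yield (upstream, downstream) pairs from a dbt manifest — the raw
    material that translators turn into standardized lineage events."""
    for name, node in manifest["nodes"].items():
        for parent in node.get("depends_on", {}).get("nodes", []):
            yield parent, name

for up, down in model_edges(manifest):
    print(f"{up} -> {down}")
```

These edges are what catalogs render as interactive dependency diagrams and what quality tools traverse to find everything affected by a change.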

Which policies should be applied globally (e.g., PII masking)? Which should be specific to a given model? Answering these questions takes careful consideration — and communication! — between platform and domain teams. Whatever decisions are made around global policies, those choices need to be enforced consistently across every model that uses them. With dbt, organizations can enforce column-level masking through YAML policy tags. Automated daily reconciliation ensures that tagging remains compliant over time. Importantly, some roles will still require non-masked data, while others will have fully masked views. This demonstrates how different types of policies can coexist within the same system. Only one masking policy can ever be active on a given column at a time. Therefore, successful implementation demands strong alignment across teams.
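The daily reconciliation mentioned above can be approximated with a simple check: scan column metadata and flag PII columns that lack a masking policy tag. The metadata layout below is a simplified stand-in for what would be parsed from dbt schema.yml files, not dbtʼs actual config schema.

```python
# Column metadata as it might look after parsing model YAML (simplified).
models = {
    "customers": {
        "email":     {"pii": True,  "policy_tag": "mask_email"},
        "signup_ts": {"pii": False, "policy_tag": None},
        "phone":     {"pii": True,  "policy_tag": None},  # violation
    }
}

def unmasked_pii(models: dict):
    """Return (model, column) pairs where PII lacks a masking policy tag —
    the check a daily reconciliation job would run and alert on."""
    return [(m, c) for m, cols in models.items()
            for c, meta in cols.items()
            if meta["pii"] and not meta["policy_tag"]]

print(unmasked_pii(models))  # → [('customers', 'phone')]
```

Running a check like this in CI turns the global masking policy into something enforced automatically rather than by convention.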

 Once more, this leverages the wisdom of crowds principle – the system taps into collective knowledge to drive changes (contextualized within the realm of business operations), yet remains easily maintained. With code, governance rules, and supporting documents all residing in the same repository, folks across roles (data users & developers) engage in joint peer reviews of proposed updates. Such shared responsibility guarantees alignment with ever-changing business needs without compromising quality software development practices. In this way, both agility and sustainability are achieved.

The future of pipelines and transformations

Data pipelines have traditionally been managed manually at the level of individual tasks. Increasingly, such tedious work will be automated by software agents capable of translating abstract objectives (e.g., “materialize table X whenever Y changes”) into concrete schedules and actions. An example of this shift is Dagsterʼs new feature called Auto-Materialize Policies. With these policies, users can specify how fresh their data needs to be; the system then figures out when to execute each asset based on those constraints. And looking ahead, we see even more potential for automation through models trained to convert human language directly into runnable DAGs (see Google DeepMindʼs recent paper, Prompt2DAG).
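In pseudocode terms, a freshness-driven policy reduces to a small decision function: rebuild an asset when it is staler than its SLA, or when an upstream input has changed since its last run. This is a deliberately simplified sketch; real schedulers such as Dagsterʼs also weigh cost, priority, and partitioning.

```python
from datetime import datetime, timedelta

def should_materialize(last_run: datetime,
                       upstream_changed: datetime,
                       max_staleness: timedelta,
                       now: datetime) -> bool:
    """Decide whether an asset needs rebuilding, in the spirit of
    freshness / auto-materialize policies (simplified)."""
    stale = now - last_run > max_staleness          # SLA breached?
    outdated = upstream_changed > last_run          # inputs moved on?
    return stale or outdated

now = datetime(2026, 1, 27, 12, 0)
print(should_materialize(
    last_run=datetime(2026, 1, 27, 9, 0),
    upstream_changed=datetime(2026, 1, 27, 10, 30),
    max_staleness=timedelta(hours=6),
    now=now))  # → True: upstream changed after the last run
```

The engineer declares the constraint (max staleness); the scheduler evaluates this predicate continuously and decides when work actually happens.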

Todayʼs modern orchestrators have access to valuable metadata around data age expectations (freshness), budget limitations, and job priorities. For example, Google Cloud Dataflowʼs FlexRS will delay running lower-priority ETL pipelines if more affordable spot resources arenʼt immediately available; similarly, CAST AI continually chooses the cheapest cloud instances on which to execute workloads, scaling them up or down in response to demand. Furthering this trend, Dagster offers a FreshnessPolicy feature that leverages upstream dependency changes to materialize views only when needed, thereby minimizing wasted storage costs. And on the observability front, OpenLineage integrates with tools like dbt and Dagster to track key metrics — including query cost, runtime, and success rate — so users gain insight into both model lineage and potential performance degradation. Thus far weʼve described a system driven by declarative intentions (what should be done), executed by a controller (how it gets done), and observed through lineage data (did it work? at what cost?). This creates a powerful closed-loop system for optimizing costs and SLAs.

Yet traditional batch job scheduling software has become a bottleneck for many modern data pipelines. While time-driven workflow builders like Apache Airflow have been used for years to process batches of data, they often struggle with the demands of continuous stream processing. Challenges include inflexible crontab schedules, wasteful polling sensors, and slow DAG initialization times that undermine low-latency use cases. To address this challenge, the market has responded with event-driven orchestrators, which can trigger workflows based on events rather than fixed timelines. A prime example is Airflow 3.0, which now offers built-in support for Google Cloudʼs Pub/Sub messaging service.
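The difference from cron-style scheduling can be shown with a toy publish/subscribe loop: the workflow below runs because a message arrives, not because a timer fired. This is a conceptual sketch, not the API of Airflow or any particular broker.

```python
# Event-driven triggering in miniature: handlers fire when a message
# announces new data, instead of a cron schedule polling for it.
handlers = {}

def on_event(topic):
    """Register a workflow to run whenever `topic` receives a message."""
    def register(fn):
        handlers.setdefault(topic, []).append(fn)
        return fn
    return register

@on_event("raw.orders.updated")
def refresh_orders_mart(payload):
    return f"rebuilding orders mart for batch {payload['batch_id']}"

def publish(topic, payload):
    # The broker's job: fan a message out to every subscribed workflow.
    return [fn(payload) for fn in handlers.get(topic, [])]

print(publish("raw.orders.updated", {"batch_id": 42}))
```

No polling sensors, no crontab: latency is bounded by message delivery rather than by the next scheduled tick.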

In the ideal world, engineers would be able to specify their desired data outcome, and have the platform take care of the rest – scheduling, scaling, enforcing policies, etc. Features like Dagsterʼs concurrency limits, fenced executions, and automatic materialization flags demonstrate how policy & traceability concerns can become embedded within each individual job. Similarly, lineage integrations (like those offered by dbt-ol and openlineage-dagster) will generate metadata around each pipeline run so that all actions taken on behalf of automation requests remain auditable.

Layer #4: AI agents

Weʼve discussed discovery, observability, and pipelines. Now we turn our attention to the technology surface needed to support self-governing AI agents that both produce and consume data. Such agents will require more than simple plumbing (data ingestion & egress). They will need specialized resources like knowledge stores, compute engines, message brokers, guardrails, monitoring tools, and container orchestrators to give them context, agency, and control.

What personas benefit from AI agents?

•  For data engineers: pipeline automation, schema management, quality testing

•  ML engineers: data exploration, feature creation, model auto-retraining

•  Platform engineers: governance (policy enforcement), provenance tracking, service level agreement (SLA) monitoring

•  Business analysts: natural language search (SQL), data visualization

•  Executives: key metric tracking, outlier detection, actionable narratives

•  Support teams: conversation topic mining, customer attrition & revenue impact assessment

•  For all roles: natural conversation, awareness of current situation, compliance with rules

The three capability planes of an agent-powered platform:

•  Data-Infrastructure Plane: A secure, scalable execution environment that packages up all the resources needed to support agent operations (Kubernetes pods, vector/SQL databases, messaging services, GPUs, cost & policy aware schedulers, etc.) so they can be deployed anywhere. Provides reliable computing power alongside controlled access to organizational data; this prevents agents from becoming fragile “glue” code due to inconsistent deployments, lack of auto scaling, or insecure communication channels. 

•  Agent Development Experience (ADX) Plane: A unified environment for building, testing, and managing AI Agents. Features include prompt & graph builders, evaluation frameworks, CI/CD pipelines, and an Agent Registry. ADX enables rapid iteration on agent designs within a standardized format shared across dev, ops, and governance teams. It bridges the gulf between prototyping and productionizing AI Agents.

•  Governance Layer: Uses tools like Open Policy Agent (OPA) to enforce policies around budgets, approved software, prohibited content, and auditing requirements. This governance layer helps ensure that AI agents behave securely, compliantly, and efficiently within defined boundaries. It also provides transparency to security, compliance, and financial teams who need to oversee these systems. 
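As a conceptual stand-in for such a policy gate, the sketch below checks an agentʼs requested action against budget, model allow-lists, and blocked actions before execution. In practice this decision would be expressed in Rego and delegated to OPA; the request and policy fields here are hypothetical.

```python
# A toy pre-execution policy gate; real deployments delegate this
# decision to a policy engine such as Open Policy Agent (OPA).
POLICY = {
    "max_monthly_budget_usd": 500,
    "allowed_models": {"gpt-safe-1", "in-house-llm"},  # hypothetical names
    "blocked_actions": {"delete_table"},
}

def authorize(request: dict, policy: dict = POLICY):
    """Return (allowed, reason) for an agent's proposed action."""
    if request["projected_cost_usd"] > policy["max_monthly_budget_usd"]:
        return False, "over budget"
    if request["model"] not in policy["allowed_models"]:
        return False, "model not approved"
    if request["action"] in policy["blocked_actions"]:
        return False, "action prohibited"
    return True, "ok"

print(authorize({"projected_cost_usd": 120,
                 "model": "in-house-llm",
                 "action": "run_query"}))  # → (True, 'ok')
```

Every denial carries a reason, giving security, compliance, and finance teams the audit trail they need to oversee agent behavior.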

Conclusion

This four-part system — Catalog, Lineage, Pipelines (ELT + dbt), and AI Agents — provides a human-and-tech framework for delivering trusted, AI-ready data products. Active metadata connects these components, creating a governance layer that supports discoverability, trustworthiness, and process automation. Itʼs worth noting that success hinges on both expertise and mindset around domains & products. Not only does this modular architecture reduce time-to-value, improve quality outcomes, and establish an organizationʼs ability to rapidly prototype using AI; it also positions those organizations ahead of their competition.

Book a demo with us to learn more.
