By Mendelsohn Neil Chan
Published on March 18, 2024
As a recognized market leader in data governance and active metadata management, Alation aims to empower everyone in an organization to find, understand, and trust data in an easy and intuitive way. As a single source of reference, Alation plays a pivotal role in centralizing data quality and data observability insights in one unified platform. This centralized hub gives technical teams of data engineers and analytics engineers a comprehensive resource for monitoring and enhancing data quality, while communicating these insights to non-technical business stakeholders and fostering a shared understanding and trust in the data across the entire organization.
This walkthrough will demonstrate how Alation’s Open Data Quality Framework provides the flexibility to connect best-of-breed data transformation, data quality, and data observability tools from the modern data stack to Alation, enabling seamless collaboration between technical data teams and business data consumers.
Alation, Snowflake, dbt Labs, and Anomalo are proud members of the modern data stack. This collaborative ecosystem brings together best-of-breed solutions to efficiently handle data through ingestion, storage, processing, and insights generation on a seamlessly integrated, cloud-native data architecture.
This solution walkthrough will consist of the following five steps:
Step 1: Data Ingestion and Storage (Snowflake)
Step 2: Data Transformation and Orchestration (dbt Cloud)
Step 3: Data Quality and Observability (Anomalo)
Step 4: Data Curation and Cataloging (Alation)
Step 5: Data Governance (Alation + Snowflake)
In this first step, we will load our raw data sets from an external stage (e.g. AWS S3, ADLS Gen2, GCS) into Snowflake using the COPY INTO command. The goal is to incrementally transform our data across three major zones or layers in our data warehouse architecture.
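As a sketch, the load for one of these raw tables might look like the following; the stage, file format, and table definitions here are illustrative assumptions rather than the exact objects used in this walkthrough:

```sql
-- Hypothetical example: load raw CSV files from an external S3 stage
-- into a raw (bronze) table. All object names are illustrative.
CREATE TABLE IF NOT EXISTS raw_payments (
    payment_id     INTEGER,
    order_id       INTEGER,
    payment_method VARCHAR,
    amount         NUMBER(10, 2)
);

COPY INTO raw_payments
FROM @my_s3_stage/payments/
FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1)
ON_ERROR = 'ABORT_STATEMENT';
```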
Figure 1. Snowflake interface showing three raw tables for this use case
Now that our raw data is in Snowflake, we will use dbt Cloud to transform and shape our data for analysis. One key benefit of using dbt for transforming data is its ability to streamline and modularize complex SQL scripts into smaller chunks that are easier to maintain, debug, and productionize.
As part of this modern and modular approach to data modeling, we have created six discrete dbt models organized in three folders representing the major zones in the data architecture: bronze, silver, and gold.
Figure 2. File structure of the dbt Cloud development repo
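Within a structure like this, a minimal silver-layer model might look like the following sketch (the model, source, and column names are assumptions for illustration):

```sql
-- models/silver/stg_payments.sql (hypothetical model)
-- Cleans the raw bronze table and exposes it to downstream
-- gold models through dbt's source()/ref() dependency graph.
select
    payment_id,
    order_id,
    lower(payment_method) as payment_method,
    amount
from {{ source('bronze', 'raw_payments') }}
where payment_id is not null
```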
Another key capability of dbt is its ability to incorporate data validation tests into your data pipeline via its dbt test command. These data tests are assertions you make about the tables, fields, and other components in your dbt project to ensure the reliability and integrity of your data.
In dbt, there are two main types of tests: generic and singular. Generic tests are pre-defined tests that can easily be applied to your dbt models, while singular tests enforce specific business rules based on a SQL query that you define. The great news is that both types of tests can be seamlessly integrated with Alation, which surfaces and displays data quality rules from integration partners in the Data Health tab!
As shown below, generic tests can easily be configured by adding a few lines of YAML.
Figure 3. Example of a schema.yml file used to define generic schema tests in dbt
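A schema.yml along those lines might contain something like this (the model and column names are illustrative assumptions):

```yaml
# models/silver/schema.yml (illustrative example)
version: 2

models:
  - name: stg_payments
    columns:
      - name: payment_id
        tests:
          - unique
          - not_null
      - name: payment_method
        tests:
          - accepted_values:
              values: ['credit_card', 'bank_transfer', 'gift_card']
```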
Another way to validate data in dbt is by writing singular data tests: SQL queries that return failing records. For example, the test below flags records with a negative payment amount, since the amount field should always be greater than zero for a sales transaction in this example.
Figure 4. The name of this test is the name of the file: assert_amount_is_positive.sql
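A sketch of what such a test file could contain, assuming a hypothetical stg_payments model:

```sql
-- tests/assert_amount_is_positive.sql (sketch)
-- dbt treats every row returned by this query as a test failure.
select
    payment_id,
    amount
from {{ ref('stg_payments') }}
where amount < 0
```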
Lastly, Alation consolidates these validation and test results into a comprehensive Data Health tab that summarizes the findings. Alation serves as a pivotal bridge, effectively communicating data quality insights to a wider organizational audience, including those outside of data engineering or analytics engineering functions. Additionally, Alation’s Open Data Quality API framework allows technical users to drill into the data quality tests surfaced in Alation by clicking embedded hyperlinks that take them to the dbt Cloud IDE for deeper technical analysis.
Figure 5. Alation Data Health tab showing data quality rules defined by dbt on a Snowflake table
Complementing dbt Cloud in building a holistic data quality and data observability solution is Anomalo, a complete data monitoring platform that leverages machine learning and AI to automatically detect and explain issues in your enterprise data. Anomalo’s unique value proposition is its ability to continuously monitor thousands of tables and billions of records in your Snowflake data warehouse without you having to write a single line of code. The example below demonstrates how Anomalo tracks a critical data observability metric: data freshness. There are two primary aspects to consider in this use case:
Frequency of Ingestion: This focuses on how often new records are loaded into the data warehouse to gauge the timeliness of the data. For example, in a common scenario where new records arrive daily, monitoring data freshness ensures the expected volume of inserts and updates is occurring as intended every day.
Refresh Interval: This ensures that recently updated data adheres to the anticipated refresh schedule. For instance, many companies aim to complete the transformation of yesterday's data and make it accessible by 9:00 AM on the current day, coinciding with the typical start of the workday. If new data does not arrive before this deadline, an alert is sent to the data engineering team so they can remediate the issue.
As shown below, Anomalo provides an intuitive interface to allow data practitioners to configure Data Freshness checks along with other key data observability metrics including data volume, missing data, table anomalies, key metrics, and validation rules:
Figure 6. Anomalo table overview interface to configure data monitoring checks
Once again, Alation seamlessly integrates with the data monitoring checks configured in Anomalo and displays them on the Data Health tab, in this case communicating that no new data arrived on February 5th (the “status” shows this data freshness check ran on the morning of February 6th, in sync with the dbt Cloud data pipeline running on Snowflake).
Figure 7. Alation Data Health tab displaying a data freshness validation from Anomalo
Once a data observability issue has been identified, the next step is to examine it, identify its root cause, and remedy the problem. By clicking on the “new data has arrived” validation rule, technical users like data engineers and analytics engineers can triage the problem directly in the Anomalo workspace.
Figure 8. Detailed analysis of a data monitoring rule in Anomalo
Once the integration between Alation, Snowflake, dbt Cloud, and Anomalo has been configured and productionized, everything runs seamlessly behind the scenes in an automated way. The next step in this process is to further curate the Alation catalog with technical metadata and business context. Alation’s role as your single source of reference provides invaluable insights to stakeholders, enabling them to swiftly locate and understand relevant data assets. This enhances data-driven decision-making, boosts operational efficiency, and supports compliance with regulatory requirements by ensuring accurate, consistent, and reliable data usage across the organization. In the illustration below, everything comes together in Alation as technical metadata from dbt Cloud, Snowflake, and Anomalo is displayed on the catalog page. Moreover, business context is added by domain owners and data stewards, along with PII and security-related classification fields.
Figure 9. An Alation catalog page showing the various metadata associated with a specific table
Last but certainly not least, data governance, a critical part of any holistic data strategy, is seamlessly facilitated through the integration between Alation and Snowflake. There are two common use cases that both platforms jointly support to accelerate data governance for organizations:
Object tagging in Snowflake facilitates data governance by enabling data stewards to track sensitive data systematically. This has applications in data discovery, data protection, regulatory compliance, and resource tracking. Using Alation, you can easily assign a tag value to a Snowflake object (e.g. table, view, column) and it will synchronize the tag value with Snowflake automatically.
Figure 10. Alation enables association of Snowflake tags with Snowflake objects from within Alation
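On the Snowflake side, the tag that Alation synchronizes corresponds to standard object tagging DDL along these lines (a sketch; all tag and object names are illustrative assumptions):

```sql
-- Create a tag and attach it to a table; Alation's sync keeps the
-- tag value consistent. All names here are illustrative.
CREATE TAG IF NOT EXISTS governance.tags.data_classification
    ALLOWED_VALUES 'public', 'internal', 'confidential', 'pii';

ALTER TABLE analytics.gold.customer_payments
    SET TAG governance.tags.data_classification = 'pii';

-- Check the tag value currently applied to the table.
SELECT SYSTEM$GET_TAG('governance.tags.data_classification',
                      'analytics.gold.customer_payments', 'table');
```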
In addition to object tagging, Alation complements Snowflake by serving as an abstraction layer to propagate data policies at scale to Snowflake objects using a no-code user interface. This allows non-technical business users to take advantage of Snowflake’s powerful data governance features in an intuitive, easy-to-use interface.
Figure 11. Alation and Snowflake working together to dynamically mask sensitive PII columns
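As a sketch of what such a policy translates to in Snowflake, dynamic data masking DDL looks roughly like this (the role, schema, and column names are assumptions):

```sql
-- Mask the email column for everyone except privileged roles.
-- All names are illustrative, not taken from this walkthrough.
CREATE MASKING POLICY IF NOT EXISTS governance.policies.mask_email
    AS (val STRING) RETURNS STRING ->
    CASE
        WHEN CURRENT_ROLE() IN ('DATA_STEWARD', 'SYSADMIN') THEN val
        ELSE '***MASKED***'
    END;

ALTER TABLE analytics.gold.customers
    MODIFY COLUMN email
    SET MASKING POLICY governance.policies.mask_email;
```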
Congratulations on completing the steps above! By combining the strengths of Alation, Snowflake, dbt Cloud, and Anomalo, you have just implemented a powerful, seamlessly integrated solution with real business impact. We hope you enjoyed this hands-on walkthrough and found it helpful. Don't hesitate to contact your Alation account manager if you have any questions, and keep an eye out for more cool Alation blog tutorials coming your way soon.