Understanding Alation Data Quality Architecture

Applies to Alation Cloud Service instances of Alation

Alation Data Quality leverages a distributed microservices architecture designed specifically for Alation Cloud Service environments. The system integrates with the Alation Data Catalog and employs various specialized services to enable comprehensive data quality checking and reporting.

Alation Data Quality Monitoring Concepts

Understanding these fundamental concepts is essential for effective use of Alation Data Quality:

  • Check: A validation rule applied to specific data metrics, returning pass, fail, or error status based on defined thresholds. Examples include accuracy checks (validating numeric ranges), completeness checks (ensuring no missing values), or validity checks (format validation).

  • Monitor: A container grouping one or more checks tied to tables and their attributes, executing on a defined schedule.

  • Asset: A table and its columns from a data source or BI report actively monitored for quality metrics.

  • Health Score: An aggregated indicator reflecting the proportion of passing checks across all checks within a monitor or asset (illustrated in the sketch after this list).
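The following minimal Python sketch shows how these concepts fit together: it models individual check results and derives a health score as the share of evaluated checks that pass. The class, field names, and aggregation formula are illustrative assumptions, not Alation's internal implementation.

    from dataclasses import dataclass

    # Hypothetical model of one check result; Alation's internal
    # representation is not documented here.
    @dataclass
    class CheckResult:
        check_name: str
        status: str  # "pass", "fail", or "error"

    def health_score(results: list[CheckResult]) -> float:
        """Illustrative health score: share of evaluated checks that pass.

        Errored checks are excluded from the denominator in this sketch;
        the aggregation Alation actually uses may differ.
        """
        evaluated = [r for r in results if r.status in ("pass", "fail")]
        if not evaluated:
            return 0.0
        passed = sum(1 for r in evaluated if r.status == "pass")
        return passed / len(evaluated)

    # A monitor with two passing checks and one failing check: 2/3 ≈ 0.67
    results = [
        CheckResult("orders.id_not_null", "pass"),
        CheckResult("orders.amount_range", "pass"),
        CheckResult("orders.email_format", "fail"),
    ]
    print(f"Health score: {health_score(results):.2f}")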

System Components and Architecture

The Alation Data Quality architecture brings together services for cataloging and user interaction, monitor configuration, check execution, AI-powered check recommendations, result ingestion and storage, and alerting. It consists of the following key components:

  • Alation Core Container: Houses the Data Catalog and coordinates user interactions.

  • Data Quality Manager Service: Stores and manages monitor configurations.

  • Airflow DQ Pod: Executes data quality checks using pushdown SQL queries.

  • Amazon Bedrock: Provides AI-powered check recommendations via Claude 3 Sonnet.

  • Data Quality Ingestion Service: Processes and stores check results.

  • Amazon Timestream: Time-series database for historical quality metrics.

  • Data Quality Read Service: Fetches results from Amazon Timestream and displays them in the user interface.

  • Data Quality Notification Service: Manages alerts and notifications.

Architecture

[Diagram: Alation Data Quality architecture (Alation_Data_Quality_Architectural.png)]

Monitor Creation

  1. The user creates a monitor in the Data Quality interface of the Alation Data Catalog (Alation Core Container), which contains the metadata of the cataloged data assets. The catalog populates this metadata through metadata extraction (MDE), lineage for BI reports, and query log ingestion (QLI) from the data source connectors.

  2. The Alation Core Container interacts with the Data Quality Manager Service:

    1. For manual checks, the Alation Core Container sends the monitor information to the Data Quality Manager Service, which stores it in the Data Quality Manager database.

    2. For AI-driven checks (Recommend Checks), the Alation Core Container sends the column names and column types to Amazon Bedrock, fetches the recommended checks, and then stores the monitor information in the Data Quality Manager database. (Both paths are sketched below.)
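To make the two paths concrete, here is a sketch of what a stored monitor definition might look like, with a stub standing in for the Amazon Bedrock recommendation call. The payload shape, the field names, and the recommend_checks helper are hypothetical; Alation does not expose this internal schema.

    # Hypothetical monitor definition as the Data Quality Manager Service
    # might persist it. All field names are illustrative only.
    manual_monitor = {
        "name": "orders_daily_quality",
        "asset": "snowflake_prod.sales.orders",  # hypothetical table
        "schedule": "0 6 * * *",                 # run daily at 06:00
        "checks": [
            # Manual checks defined by the user in the UI.
            {"type": "completeness", "column": "order_id", "max_null_pct": 0},
            {"type": "accuracy", "column": "amount", "min": 0, "max": 100000},
        ],
    }

    def recommend_checks(column_name: str, column_type: str) -> list[dict]:
        """Stand-in for the Amazon Bedrock (Claude 3 Sonnet) call.

        In the real flow, the Alation Core Container sends the column
        name and type to Bedrock and receives recommended checks back;
        this stub just returns a plausible fake response.
        """
        if column_type in ("INT", "DECIMAL"):
            return [{"type": "accuracy", "column": column_name, "min": 0}]
        return [{"type": "validity", "column": column_name, "format": "email"}]

    # AI-driven path: append recommended checks before storing the monitor.
    manual_monitor["checks"] += recommend_checks("customer_email", "VARCHAR")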

Running and Scheduling Monitors

  1. When a user runs a monitor, the Alation Core Container creates a Celery job and calls the Airflow DQ Pod.

  2. The Airflow DQ Pod fetches:

    1. The monitor definition from the Data Quality Manager Service.

    2. The credentials from the data source connectors in the Alation Core Container.

  3. The Airflow DQ Pod then creates a pushdown SQL query based on the check rules defined by the user or recommended by Amazon Bedrock and passes it to the Query Service of the data source connectors for execution. The Query Service executes the data quality checks against the target data source, authenticating with the default service account configured for that data source. For a monitor to run successfully, this service account must have SELECT privileges on all monitored tables. (See the sketch after this list for an example of a pushdown query.)

  4. The Airflow DQ Pod then fetches the results of the monitor run and sends them to the Data Quality Ingestion Service.

  5. The Data Quality Ingestion Service stores the results in Amazon S3.

  6. The Data Quality Ingestion Service calls the Ephemeral Data Quality Scheduler Service.

  7. The Ephemeral Data Quality Scheduler Service:

    1. Creates an on-demand pod that stores the results, keyed by timestamp, in the Amazon Timestream database (also shown in the sketch below).

    2. Sends the data to the Data Quality Notification Service.
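The sketch below ties steps 3 through 7 together in Python: it renders a completeness check as pushdown SQL, evaluates the returned counts against a threshold, and writes the outcome to Amazon Timestream the way the on-demand pod might. The generated SQL, the threshold logic, the pass/fail encoding, and the Timestream database, table, dimension, and measure names are all illustrative assumptions; only the boto3 timestream-write API is real.

    import time

    import boto3

    # Step 3 (illustrative): render a completeness check as pushdown SQL
    # so the aggregation runs on the target data source, not in Alation.
    def completeness_check_sql(table: str, column: str) -> str:
        return (
            f"SELECT COUNT(*) AS total_rows, "
            f"SUM(CASE WHEN {column} IS NULL THEN 1 ELSE 0 END) AS null_rows "
            f"FROM {table}"
        )

    # The Query Service would run this with the data source's default
    # service account, which must hold SELECT on the monitored table.
    sql = completeness_check_sql("sales.orders", "order_id")

    # Evaluate the counts the query returns (threshold handling assumed).
    def evaluate(total_rows: int, null_rows: int, max_null_pct: float) -> str:
        if total_rows == 0:
            return "error"
        return "pass" if 100.0 * null_rows / total_rows <= max_null_pct else "fail"

    status = evaluate(total_rows=10_000, null_rows=0, max_null_pct=0.0)

    # Step 7 (illustrative): the on-demand pod writes the result to
    # Amazon Timestream. Database, table, and field names are assumptions.
    client = boto3.client("timestream-write", region_name="us-east-1")
    client.write_records(
        DatabaseName="dq_results",
        TableName="check_runs",
        Records=[{
            "Dimensions": [
                {"Name": "monitor", "Value": "orders_daily_quality"},
                {"Name": "check_name", "Value": "orders.order_id.completeness"},
            ],
            "MeasureName": "check_status",
            "MeasureValue": "1" if status == "pass" else "0",
            "MeasureValueType": "BIGINT",
            "Time": str(int(time.time() * 1000)),
            "TimeUnit": "MILLISECONDS",
        }],
    )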

Data Quality Score Presentation

  1. The Data Quality Read Service queries Amazon Timestream for the stored results.

  2. The Data Quality Read Service returns the results to the user interface (see the sketch after this list).

  3. The Data Quality Notification Service sends the notifications and alerts to the user.
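A minimal sketch of this read path, assuming the same hypothetical Timestream schema as the previous sketch: the Data Quality Read Service would run a query along these lines and aggregate the stored statuses into the health score shown in the UI. Only the boto3 timestream-query API is real; the schema and the aggregation formula are assumptions.

    import boto3

    # Hypothetical read-path query; database, table, and column names
    # match the assumed schema above, not Alation's internals.
    query_client = boto3.client("timestream-query", region_name="us-east-1")

    response = query_client.query(
        QueryString="""
            SELECT check_name, measure_value::bigint AS status
            FROM "dq_results"."check_runs"
            WHERE monitor = 'orders_daily_quality'
              AND time > ago(7d)
        """
    )

    # Each row's Data list aligns with the selected columns:
    # index 0 = check_name, index 1 = status (1 = pass, 0 = fail).
    statuses = [int(row["Data"][1]["ScalarValue"]) for row in response["Rows"]]
    if statuses:
        print(f"Health score: {sum(statuses) / len(statuses):.2f}")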