Configure Sampling and Profiling

Alation Cloud Service Applies to Alation Cloud Service instances of Alation

Customer Managed Applies to customer-managed instances of Alation

Enhanced Connector Enhanced connectors add extended capabilities and require a separate entitlement in addition to your Alation platform license.

The MaxCompute connector supports data sampling and profiling to provide insights into your data directly in the Alation catalog.

Overview

  • Data Sampling: Retrieves sample rows from tables to display in the catalog.

  • Data Profiling: Calculates column statistics such as min, max, value distributions, and null counts.

Both features use JDBC connections to execute queries against MaxCompute tables.

Prerequisites

Before configuring sampling and profiling:

  • The MaxCompute data source must be configured with valid Access Key credentials in General Settings.

  • The RAM user must have permissions to read data from the target tables.

Note

Sampling and profiling use the service account credentials (Access Key ID / Access Key Secret) configured in General Settings. No per-user credential setup is needed. This is different from Compose, which requires each user to provide their own Access Key credentials.

The following permissions are required:

  • odps:Read - Read access to projects and tables

  • odps:List - List access to project resources

  • odps:CreateInstance - Permission to submit SQL jobs for sampling and profiling queries

  • odps:Select on INFORMATION_SCHEMA.PARTITIONS - Required for partitioned tables so the connector can resolve the latest partition automatically. Grant this per MaxCompute project that contains partitioned tables:

    GRANT Select ON TABLE <project_name>.INFORMATION_SCHEMA.PARTITIONS TO USER <ram_user>;
    

At the project level, grant the RAM user one of the following roles: Admin, Super_Administrator, or role_project_reader. Alternatively, grant specific permissions:

GRANT Read, CreateInstance ON PROJECT <project_name> TO USER <ram_user>;

For more information, see the Prerequisites page.

Additionally, metadata extraction must be completed for the tables you want to profile.

Configure Sampling and Profiling

  1. On the MaxCompute data source Settings page, go to the Sampling and Profiling tab.

    ../../../_images/MaxComputeSamplingTab.png
  2. Configure the following settings:

    ../../../_images/MaxComputeSamplingConfig.png

    Setting

    Description

    Enable Sampling

    Turn on to enable data sampling for tables.

    Sample Size

    Number of rows to retrieve for sampling (default: 100).

    Enable Profiling

    Turn on to enable column profiling.

  3. Click Save.

Running Profiling

To run profiling on a table:

  1. Navigate to the table’s catalog page in Alation.

  2. Click the Samples tab to view sample data.

  3. Click Profile to run column profiling.

    ../../../_images/MaxComputeProfilingResults.png

Alternatively, you can schedule profiling to run automatically:

  1. On the Sampling and Profiling tab, configure the profiling schedule.

  2. Select the frequency (daily, weekly) and time.

  3. Click Save.

Profiling Metrics

Note

To view enhanced profiling metrics (Min, Max, Mean, Null Count, Null Percentage), the Profiling V2 feature flag must be enabled in Alation.

  • Alation Cloud Service (ACS): This flag is enabled by default. No action required.

  • Customer-Managed Alation: Enable the flag by running the following command from the Alation shell (requires backend access):

    alation_conf alation.feature_flags.enable_profiling_v2 -s True
    

    Restart Alation for the change to take effect.

    If you do not have backend access, raise an SRE ticket to enable this flag.

The connector collects the following profiling metrics for each column:

For All Column Types:

Metric

Description

Row Count

Total number of rows sampled for profiling.

Value Distribution

Top values and their counts (histogram). Shows the most frequent values in the column. For example, a STRING column might show “Active”: 500, “Inactive”: 200.

Null Count

Number of NULL values in the column.

Null Percentage

Percentage of NULL values in the column.

Additional Metrics for Numerical Columns (INT, BIGINT, FLOAT, DOUBLE, DECIMAL):

Metric

Description

Min

Minimum value in the column.

Max

Maximum value in the column.

Mean

Average (mean) value in the column.

Example: Numerical Column Profiling

For numerical columns, the profile shows Min, Max, Mean, Null Count, Null Percentage, and Value Distribution:

../../../_images/MaxComputeProfilingNumerical.png

Example: Non-Numerical Column Profiling

For non-numerical columns (STRING, DATE, etc.), the profile shows Null Count, Null Percentage, and Value Distribution:

../../../_images/MaxComputeProfilingValueDistribution.png

Note

For partitioned tables, the connector queries INFORMATION_SCHEMA.PARTITIONS to find the most recently created partition per table. This requires the RAM user to have odps:Select on INFORMATION_SCHEMA.PARTITIONS in each MaxCompute project that contains partitioned tables. See the odps:Select entry in Prerequisites for the exact grant.

If the RAM user does not have this permission, sampling and profiling for partitioned tables will fail with ODPS-0130013: Authorization exception, followed by the ODPS-0130071 full-scan error as the fallback sample query runs without a partition predicate. Grant the permission and retry.

Dynamic Sampling (Per-User Profiling) (Beta)

Available from MaxCompute OCF Connector version 1.0.8.

The MaxCompute connector supports dynamic sampling, which allows profiling and sampling queries to run using individual user credentials instead of the service account. The profiling results are scoped to the user’s permissions: they can only profile tables they have access to.

../../../_images/MaxComputeDynamicSampling.png

When Dynamic Sampling is disabled, sampling uses the service account credentials configured in General Settings. The connector automatically connects to the correct project for each table, so no extra setup is needed. Dynamic sampling is disabled by default.

When Dynamic Sampling is enabled, sampling uses a user-selected connection from the Choose a Connection option on the table page. Users can either use an existing saved connection or click + Add New to create one.

If Compose is enabled, users can use a saved connection from the Compose tab. In that connection, users should provide:

  • Username: Access Key ID

  • Password: Access Key Secret

If Compose is not enabled, users can still click + Add New from the Choose a Connection option on the table page and add a JDBC connection in this format:

odps:http://service.<region>.maxcompute.aliyun.com/api?project=<project_name>

Replace <region> with your Alibaba Cloud region (for example, us-east-1) and <project_name> with the target MaxCompute project. Users should provide their own credentials:

  • Username: Access Key ID

  • Password: Access Key Secret

Note

Dynamic sampling requires that the user has a configured connection with valid Access Key credentials. If the user has not configured credentials, sampling fails with an error. The connector does not fall back to the service account.

Custom Profiling (Beta)

Available from MaxCompute OCF Connector version 1.0.8.

Custom profiling allows users to provide their own SQL query for profiling a table. This is useful when you want to:

  • Profile only a subset of rows (for example, SELECT * FROM my_table WHERE region = 'US' LIMIT 100)

  • Profile computed or derived columns (for example, SELECT price * quantity AS total FROM my_table)

  • Profile joined or aggregated data

Custom Query-Based Sampling for Tables

As a Data Source Admin, you can specify custom queries for sampling individual tables:

  1. On the data source Settings page, go to the Per-Object Parameters tab.

  2. Navigate to the table level by selecting the schema (project) that contains the table you want.

  3. Locate the table and click the edit icon in the Sampling Query column to open the query editor.

  4. Add the custom query in the query editor and click Save. The next sampling job will use this query to sample the table.

Custom Query-Based Profiling for Columns

To specify custom queries for profiling individual columns:

  1. On the Per-Object Parameters tab, navigate to the column you want to profile by selecting the table that contains it.

  2. Locate the column and in the Profiling Query column, click the edit icon to open the query editor.

  3. Provide a custom query and click Save. The next profiling job you run for this column will use the saved query to profile the column.

Note

  • Custom profiling queries are sanitized before execution (trailing semicolons removed). The query timeout and max rows settings from the profiling request are applied to the custom query.

  • There is no SQL validation for custom queries. Any errors will be logged in the Sampling Job Status table on the Data Sampling tab. Test-run custom queries in Compose before saving them to verify correctness.

  • Individual column profiling queries on the Per-Object Parameters tab take precedence over default profiling queries.

For more information about custom query-based sampling and profiling, see Configure Sampling and Profiling for OCF Data Sources.

Per-Project Connections

The MaxCompute connector uses per-project JDBC connections for profiling. When a table is profiled, a connection is created to that table’s specific project. This eliminates the need for cross-project permissions.

Limitations

  • Profiling large tables may take significant time and consume MaxCompute resources.

  • Profiling queries use full scan mode and may scan all partitions.

  • Complex data types (ARRAY, MAP, STRUCT) have limited profiling support:

    Data Type

    Supported Metrics

    Not Supported

    ARRAY

    Null Count, Null Percentage, Row Count

    Min, Max, Average, Histogram (value distribution)

    MAP

    Null Count, Null Percentage, Row Count

    Min, Max, Average, Histogram (value distribution)

    STRUCT

    Null Count, Null Percentage, Row Count

    Min, Max, Average, Histogram (value distribution)

    Note

    To profile data within complex types, consider flattening the data into separate columns using views with EXPLODE or LATERAL VIEW statements, then profile the view instead.

Troubleshooting

Profiling Failed

  • Ensure the RAM user has read permissions on the table.

  • Check MaxCompute quota and resource availability:

    How to check:

    1. Log in to the Alibaba Cloud MaxCompute Console.

    2. Navigate to Management > Quota Management to view your project’s quota usage.

    What to check:

    Resource

    Description

    Compute Unit (CU) quota

    Verify sufficient CUs are available. Profiling queries consume compute resources. If CU quota is exhausted, queries will be queued or fail.

    Concurrent query limit

    Check if the project has reached its maximum concurrent query limit (default: 100 per project).

    Storage quota

    Ensure the project has not exceeded its storage quota, which may affect query execution.

    SQL job status

    In the MaxCompute console, go to Job Management to check if profiling jobs are queued, running, or failed.

    For more information on quotas and limits, see MaxCompute Quotas and Limits in the Alibaba Cloud documentation.

Slow Profiling

  • Reduce the sample size for large tables.

  • Profile tables during off-peak hours.

  • Consider profiling only critical tables.