Data Profiling: What It Is and How to Perfect It
By Jim Barker
Published on April 18, 2023
For any data user in an enterprise today, data profiling is a key tool for resolving data quality issues and building new data solutions. In this blog, we’ll define data profiling, cover its top use cases, and share important techniques and best practices for data profiling today.
What is data profiling?
Definition and purpose of data profiling
Data profiling is the process of analyzing and assessing the quality, structure, and content of data. Data profiling technology examines data values, formats, relationships, and patterns to identify data quality issues, dependencies, and relationships.
This is not to say it’s a tedious task for the modern data user. On the contrary, data profiling today describes an automated process, where a data user can “point and click” to return key results on a given asset, like aggregate functions, top patterns, outliers, inferred data types, and more.
How do other thought leaders define it? DAMA defines data profiling as: An approach to data quality analysis, using statistics to show patterns of usage, and patterns of contents in an automated manner.
Gartner defines data profiling as: A technology for discovering and investigating data quality issues, such as duplication, lack of consistency, and lack of accuracy and completeness. This is accomplished by analyzing one or multiple data sources and collecting metadata that shows the condition of the data and enables the data steward to investigate the origin of data errors. The tools provide data statistics, such as degree of duplication and ratios of attribute values, both in tabular and graphical formats.
What do you learn about data through profiling? Most would say the following values form the basics of profiling:
Number of Null Records – a Record Count
Inferred Data Type
Pattern Analysis – with Record Count
Range of Values – with Record Count
Matches Against a List of Values (LOV) – a Record Count
Details on these values empower data users to get a high level of understanding of the data. This, in turn, helps them to build new data pipelines, solutions, and products, or clean up the data that’s there.
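To make these basics concrete, here is a minimal sketch of a single-column profile in plain Python. It is an illustration only, not how any particular profiling tool works: the function names (`infer_type`, `to_pattern`, `profile_column`), the simple type labels, and the sample state codes are all assumptions made for this example.

```python
import re
from collections import Counter

def infer_type(value: str) -> str:
    """Guess a simple type label for one non-empty string value."""
    if re.fullmatch(r"[+-]?\d+", value):
        return "integer"
    if re.fullmatch(r"[+-]?\d+\.\d+", value):
        return "decimal"
    if re.fullmatch(r"\d{4}-\d{2}-\d{2}", value):
        return "date"
    return "string"

def to_pattern(value: str) -> str:
    """Map digits to 9 and letters to A, keeping other characters as-is."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A")
        else:
            out.append(ch)
    return "".join(out)

def profile_column(values, allowed=None):
    """Return null count, inferred type, patterns, range, and LOV matches."""
    non_null = [v for v in values if v is not None and v.strip() != ""]
    types = Counter(infer_type(v) for v in non_null)
    profile = {
        "record_count": len(values),
        "null_count": len(values) - len(non_null),
        "inferred_type": types.most_common(1)[0][0] if types else None,
        "patterns": Counter(to_pattern(v) for v in non_null),
        # Range here is lexicographic because the values are strings.
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
    }
    if allowed is not None:
        profile["lov_matches"] = sum(1 for v in non_null if v in allowed)
    return profile

# Hypothetical sample data: a column of US state codes.
state_codes = ["NY", "CA", None, "ny", "TX", "", "C4"]
result = profile_column(state_codes, allowed={"NY", "CA", "TX"})
print(result)
```

In this sample, the pattern counts surface the odd value "C4" (pattern A9) right away, and the LOV match count shows that only three of the five populated values are valid codes.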
It bears mentioning that data profiling has evolved tremendously. In the past, experts would need to write dozens of queries to extract this information over hours or days. Today, data profiling tools have made this process nearly instantaneous, accelerating people’s ability to deliver new data products and troubleshoot quality issues.
Top use cases for data profiling
Data governance
Data governance describes how data should be gathered and used within an organization, impacting data quality, data security, data privacy, and compliance. Data profiling helps organizations understand the data they possess with an eye to its quality level, which is vital for effective data governance.
Data profiling helps organizations holistically comprehend their data landscapes, as it documents data assets and dependencies, including data lineage, metadata, and data relationships. This documentation is crucial for effective data governance, as it enables organizations to understand how data moves through their systems, who owns the data, and how data is used across the organization.
Business data stewards benefit from having a breakdown of the patterns that exist in the data. With these details, a data steward focused on fixing a data quality issue can drill into it, find the issues, and then partner with the technical steward to leverage or create the tools needed to improve data quality efficiently and effectively. Data profiling details can be used to build out more extensive data policies and data quality rules, as well, to support more robust governance.
In summary, data profiling is a critical component of a comprehensive data governance strategy.
Data quality issue resolution
When things break, experts often reach for data profiling as the first step to finding a fix.
“This is one of the use cases that I’ve used across my career,” shares Marlene Simon, Lead Professional Services Consultant at Alation. “If you have records that are failing on a data load, the easiest thing to do is profile the data and then figure out what’s causing it. It may have an unexpected character in it, or something breaks the pattern; maybe a target should only have three values but someone upstream introduced a fourth. Data profiling allows you to pinpoint anomalies very quickly as opposed to going on a hunting expedition.”
Modern data profiling will also gather all the potential problems in one quick scan. It can locate the ten things that may cause a problem instead of just one thing. This enables data engineers to re-run the profile and troubleshoot as they go, ultimately saving them time.
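As a rough illustration of the "one quick scan" idea, the sketch below checks every record against a few expectations and collects all of the issues it finds, rather than stopping at the first failure. The field names, the expected ID pattern, and the three allowed status values are hypothetical and exist only for this example.

```python
import re
from collections import defaultdict

# Hypothetical expectations for a load target (assumptions for this sketch).
EXPECTED_STATUS = {"OPEN", "CLOSED", "PENDING"}
ID_PATTERN = re.compile(r"[A-Z]{2}\d{4}")

def scan_records(records):
    """Return every issue found, grouped by issue type, in a single pass."""
    issues = defaultdict(list)
    for row_num, rec in enumerate(records, start=1):
        rec_id = rec.get("id") or ""
        status = rec.get("status") or ""
        # The value breaks the expected pattern.
        if not ID_PATTERN.fullmatch(rec_id):
            issues["id_breaks_pattern"].append((row_num, rec_id))
        # The value contains a non-ASCII or non-printable character.
        if not rec_id.isascii() or not rec_id.isprintable():
            issues["id_unexpected_character"].append((row_num, rec_id))
        # Someone upstream introduced a value outside the expected set.
        if status not in EXPECTED_STATUS:
            issues["status_not_in_expected_values"].append((row_num, status))
    return dict(issues)

records = [
    {"id": "AB1234", "status": "OPEN"},
    {"id": "AB12\t34", "status": "CLOSED"},    # hidden tab character
    {"id": "CD5678", "status": "ARCHIVED"},    # a fourth, unexpected status
    {"id": "ef9012", "status": "PENDING"},     # lowercase breaks the pattern
]
issues = scan_records(records)
for issue_type, rows in issues.items():
    print(issue_type, rows)
```

Because the scan keeps going after the first problem, an engineer can fix several causes of a failed load in one round and then re-run the scan to confirm.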
“I come from a background where I didn’t have profiling tools to start,” reveals Simon. “Current profiling tools are point and click, which free up my time for analysis. It’s a huge time saver, and it gives you so much information. You can focus on other things instead of writing the code.”
Data pipelines and DataOps
For data engineers seeking to build new data products, profiling can help them understand data conditions, including:
BI Data Validation Reports
Data Cleaning Needs – programmatic
Data Cleaning Needs – manual
The details data profiling provides can empower engineers to work more efficiently, halving the time it takes to build out new data products.
Digital transformation is ongoing. Modern developers need to upgrade systems, be it bespoke, ERP, SaaS, or some other platform type. To achieve this, these developers need to build data pipelines that migrate data.
Yet data migration is often viewed as: (1) the Achilles heel of new software implementation; (2) the long pole to completion; or (3) the project headwind.
However, by using data profiling (and borrowing pipeline techniques), developers can lead effective migrations. This process involves conducting an early assessment, predicting the level of effort needed, and planning accordingly. This approach increases the confidence of the project team and enables them to deliver within a reliable timeline.
Master data management (MDM)
The rollout of a new MDM application is very similar to that of an ERP system. However, it often takes much longer, as integrations need to take place across many applications.
With data profiling, teams can see all the systems being integrated by the MDM project. They can then scope out the level of effort needed, and use the profile results to accelerate building cross-reference (XREF) tables, otherwise known as crosswalk tables.
This allows rapid discovery of the uniform set of comparative values, and helps to reduce the time to deploy. Using profiling proactively for MDM can also catalog enterprise information before the MDM tools are purchased, which helps teams align staffing needs early.
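One way to picture how profile results feed a crosswalk table: take the distinct values each system reports for the same attribute and group them under a shared matching key. The sketch below assumes a deliberately simple key (case-insensitive, alphanumeric characters only); the system names, the sample unit-of-measure values, and the `normalize` and `build_xref` helpers are hypothetical.

```python
from collections import defaultdict

def normalize(value: str) -> str:
    """Hypothetical matching key: ignore case, spaces, and punctuation."""
    return "".join(ch for ch in value.casefold() if ch.isalnum())

def build_xref(distinct_values_by_system):
    """Group each system's distinct values under a shared normalized key."""
    xref = defaultdict(dict)
    for system, values in distinct_values_by_system.items():
        for value in values:
            xref[normalize(value)].setdefault(system, []).append(value)
    return dict(xref)

# Distinct unit-of-measure values, as a profile of each system might report.
profiled = {
    "crm": ["Each", "BOX", "Pallet"],
    "erp": ["EACH", "Box ", "CASE"],
}
xref = build_xref(profiled)

# Keys that are missing from at least one system need a steward's decision.
unmatched = {key: found for key, found in xref.items() if len(found) < len(profiled)}
print(xref)
print(unmatched)
```

Values that line up automatically go straight into the crosswalk, and the unmatched remainder gives a first estimate of the manual mapping effort.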
How to conduct data profiling
Data profiling will reveal data quality issues and patterns. After first analyzing profiling results, start thinking about the next steps. From there you can ask intelligent questions of the data producers and other stakeholders.
5 best practices for effective data profiling
The key first step is to use data profiling as a tool, period. Too often, data experts overlook the benefits of profiling when troubleshooting quality challenges or building a new data product. Once you’ve determined to use profiling, you should:
1. Define Data Profiling Scope
Define the scope of your profiling needs and identify the data assets to be analyzed. This ensures that data profiling efforts are focused.
2. Establish Rules
Establish clear data profiling rules that specify what data elements to analyze and how to analyze them. This can include rules for data completeness, consistency, accuracy, and validity.
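Rules like these can be written down as small, named checks so that everyone agrees on what "complete" or "valid" means. The sketch below is one possible shape, not a standard: the rule names, the loose email pattern, and the 0 to 120 age range are assumptions chosen for illustration.

```python
import re

# A deliberately loose email shape, assumed for this sketch only.
EMAIL_SHAPE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")

# Hypothetical rules: each maps a name to a check on a single row.
RULES = {
    # Completeness: the field is populated.
    "email_complete": lambda row: bool((row.get("email") or "").strip()),
    # Validity: the field matches the expected shape.
    "email_valid": lambda row: EMAIL_SHAPE.fullmatch(row.get("email") or "") is not None,
    # Accuracy proxy: the value falls inside a plausible range.
    "age_in_range": lambda row: isinstance(row.get("age"), int) and 0 <= row["age"] <= 120,
}

def apply_rules(rows, rules):
    """Count how many rows pass and fail each rule."""
    results = {}
    for name, check in rules.items():
        passed = sum(1 for row in rows if check(row))
        results[name] = {"passed": passed, "failed": len(rows) - passed}
    return results

rows = [
    {"email": "ana@example.com", "age": 34},
    {"email": "", "age": 29},
    {"email": "bob-at-example.com", "age": 41},
    {"email": "cy@example.org", "age": 212},
]
summary = apply_rules(rows, RULES)
for name, counts in summary.items():
    print(name, counts)
```

Keeping each rule as a separate named check makes it easier to review the rules with data stewards and to add or retire them over time.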
3. Use Multiple Data Profiling Techniques
Use a range of techniques to analyze data. Statistical analysis and pattern matching are good places to start.
4. Validate Results
Compare your profiling results with the expected results and assess whether they align with the business requirements. Use this process to identify errors or inconsistencies in your data profiling workflow.
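A simple way to validate results is to write the expectations down as ranges and compare each observed metric against them. The following sketch assumes hypothetical metric names and thresholds; in practice, the expected values would come from business requirements.

```python
def validate_profile(observed, expected):
    """Return (metric, observed value, expected range) for each mismatch."""
    mismatches = []
    for metric, (low, high) in expected.items():
        value = observed.get(metric)
        # A missing metric counts as a mismatch too.
        if value is None or not (low <= value <= high):
            mismatches.append((metric, value, (low, high)))
    return mismatches

# Hypothetical observed profile metrics for one table.
observed = {"record_count": 9_870, "null_count": 412, "distinct_status_values": 4}

# Hypothetical expectations, expressed as inclusive (low, high) ranges.
expected = {
    "record_count": (9_500, 10_500),
    "null_count": (0, 100),
    "distinct_status_values": (3, 3),
}

mismatches = validate_profile(observed, expected)
for metric, value, allowed in mismatches:
    print(f"{metric}: observed {value}, expected between {allowed[0]} and {allowed[1]}")
```

A mismatch does not always mean the data is wrong. It can also mean the expectation is out of date or the profile was run against the wrong asset, which is why this step is also a check on the profiling workflow itself.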
5. Incorporate Data Steward Feedback
Share results with your team, and incorporate feedback from data stewards and other stakeholders to improve the accuracy and reliability of your results.
Ensuring the accuracy and reliability of data profiling results is critical for building trust in data. By defining the data profiling scope, establishing data profiling rules, and using multiple data profiling techniques, leaders can build trust and fortify their governance framework with a pressure-tested profiling pillar.
Techniques for data profiling
Your profiling technique will largely depend on your use case and what you seek to accomplish. Marlene Simon recommends you look at the basic profile to determine your next steps, which can be broken down into questions:
Do you need to explore cross-column profiling?
Do you need to define a data quality rule and add that to the profile?
Do you need to find overlap between two tables – and use cross-table profiling?
Do you need to add metadata to information to put it in a data lake?
Do you need to migrate data from one system to another?
“You run the profiling, then you do the analysis on the results,” Simon points out. “Then based on the type of data, you can start asking questions. If it’s product data, you may ask, do these dimensions make sense? Does it have a color? Should it only have a specific color? Or should there be 15 different versions of how you spell blue? Once you run the profiling and get the results, you can ask key questions to help your analysis and recommendations.”
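Two of these questions lend themselves to a short sketch: finding overlap between two tables, and spotting multiple spellings of the same value. The example below is a simplified illustration with hypothetical table names and sample data; the variant check only catches differences in case and surrounding whitespace, not true misspellings.

```python
from collections import Counter

def key_overlap(left_keys, right_keys):
    """Cross-table profile: compare the distinct key values of two tables."""
    left, right = set(left_keys), set(right_keys)
    shared = left & right
    return {
        "shared": len(shared),
        "left_only": sorted(left - right),
        "right_only": sorted(right - left),
        "left_coverage": len(shared) / len(left) if left else 0.0,
    }

def spelling_variants(values):
    """Group values that differ only by case or surrounding whitespace."""
    groups = {}
    for value in values:
        groups.setdefault(value.strip().casefold(), Counter())[value] += 1
    # Keep only the groups that contain more than one distinct spelling.
    return {key: counts for key, counts in groups.items() if len(counts) > 1}

# Hypothetical data: customer IDs referenced by orders vs. the customer table.
orders_customer_ids = ["C001", "C002", "C002", "C004", "C009"]
customers_ids = ["C001", "C002", "C003", "C004"]
overlap = key_overlap(orders_customer_ids, customers_ids)

# Hypothetical data: a product color column.
colors = ["Blue", "blue", "BLUE ", "Red", "red", "Green"]
variants = spelling_variants(colors)

print(overlap)
print(variants)
```

Here the overlap result shows one order pointing at a customer ID that does not exist in the customer table, and the variant check shows three spellings of blue, which are the kinds of findings that prompt the follow-up questions Simon describes.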
Addressing data quality issues found through data profiling
How to resolve data quality issues with a team
Successful data profiling projects demand collaboration with data stewards and data owners. Here are some best practices to make data profiling for superior data quality a team sport:
Define Roles and Responsibilities
What are the duties of each stakeholder? Establish who is responsible for which aspects of the project.
Establish Communication Channels
Support collaboration between data stewards, owners, and other stakeholders with regular updates. This may include recurring meetings, email updates, or progress reports.
Share Results and Findings
Share data profiling results and findings to gather feedback and insights. Use your team’s wisdom to refine data quality rules and standards as you mature.
Prioritize Data Quality Issues
Work with stakeholders to prioritize critical data quality issues. Which have the greatest impact on the organization’s goals? Address high-priority data quality issues first to mitigate risks and improve governance practices.
Develop Actionable Recommendations
These recommendations may include new data quality standards, remediation strategies, or new data quality monitoring processes.
In summary, effective collaboration with data stewards and data owners is critical for data profiling. By defining roles and responsibilities, establishing communication channels, sharing results, prioritizing data quality issues, and developing actionable recommendations, organizations can improve their data governance framework to better support decision-making.
Benefits of data profiling with a data catalog
Data profiling with a data catalog provides several benefits, like:
Enhances metadata management
A data catalog allows users to manage and store metadata, like data definitions, lineage, and ownership. By linking metadata with data profiling details, data users can better understand their data assets.
Improves collaboration
By enabling key stakeholders to share information about data assets and data quality, organizations can establish a more comprehensive data governance framework.
Increases efficiency
A data intelligence platform can streamline the data profiling process by automating certain tasks, such as data discovery and metadata management. This frees up data professionals to focus on more strategic initiatives.
Supports better decision-making
Data profiling makes it easier to find and fix data quality issues, like missing or inconsistent data. This ability improves the accuracy of analytics, resulting in better business decisions.
In conclusion, by supporting better metadata management, collaboration, efficiency, and decision-making, data profiling in a data catalog can help organizations use data as a strategic asset more efficiently.
Learn how Alation enables data quality partners to customize solutions on our open framework.