What is Data Lineage?
By Jason Rushin
Published on August 20, 2021
When you think of lineage, what typically comes to mind is one’s ancestry or pedigree. It’s a family tree that traces a path back past your parents, grandparents, and more, showing from whom you descended and how you’re related.
The same can be said for data, too. Data lineage shows the history of the data you’re looking at today, detailing where it originated and how it may have changed over time. It reflects the data life cycle: the source, the processes or systems that may have altered the data, and how the data arrived at its current location and state.
Data lineage helps you understand the data you’re using, what may have happened to change it over time, and who was involved in any processing.
There are many reasons a data asset may change. It could have been transformed through filters or calculations. Records may have been added or deleted along the way, or existing values may have been updated.
In this way, data lineage is a powerful tool for understanding. Knowing the data lineage helps you find the source of errors, determine the data’s suitability for a given use, understand how processes can be optimized or improved, shorten the time to insight, and much more.
Why is Data Lineage Important?
Data lineage helps you holistically understand the data you intend to use. Was it updated? Did it go through a particular process? Has the quality been evaluated or improved? Was it used by a specific team or in a specific report? The more you know about data lineage, the better decisions you can make about data usage.
Here are a few ways data lineage can improve data usage:
Data assets are bound to change over time. New data might be added, old data deleted, or alterations made. The way data changes can affect how you use the data, and the value of the data to your business. Understanding changes to data ensures you get the most benefit from the data.
Data is required for strategic business success. Today, every company and every department needs to become more data driven and build a data culture. But data must be understood before it can be used well to make better decisions and keep the business viable. Data lineage helps you understand how and when data should be used.
Data migrations rely on understanding the data. You wouldn’t want to migrate bad or misunderstood data to new systems. Similarly, you need to understand how data has been sourced, stored, and altered to know how to migrate it and where to move it to. Data lineage provides these details so data migrations are more efficient and successful.
Understanding data life cycles is critical to data governance. Data lineage adds to the value of audits and highlights areas of potential risk. Lineage can even reveal areas of noncompliance and conflicts with internal policies or external regulatory requirements.
Tracking data as it moves through your organization informs IT of applications, gaps, and usage patterns so they can be more responsive to business needs. This view of data lineage also helps IT understand the scope of data assets so they can optimize their own efforts and better prepare for future requirements.
The Future of Data Lineage
The future will include more data. Every day brings more technology and devices generating more data points. Businesses are also relying more on cloud-based services and technologies to generate, process, store, access, and analyze that data. Winning organizations will be those that quickly find ways to understand the data and then put it to good use; this will require more insight into data lineage.
And as businesses migrate to the cloud, new capabilities emerge. Today, businesses are learning they can scale data governance and management efforts using the cloud. Tomorrow, they’ll seize on opportunities drawn from machine learning, automation, and AI. Many already are, as these technologies can automatically process and alter data. But they can also log every transformation in the form of data lineage, making both the data and its lineage more valuable.
Organizations need to understand data lineage before data can be effectively used. That’s already a fact today, and the future will only expand the gap between the data in front of you and the original data, making data lineage even more important tomorrow.
Data Lineage Techniques
While data lineage may seem straightforward, capturing and documenting data lineage can be done in many different ways. Here are just a few data lineage techniques.
Pattern-Based Data Lineage
Pattern-based lineage relies on metadata to determine data lineage, looking for patterns in the metadata and related data to classify or define different data. Pattern-based data lineage does not involve adjacent systems or tools, but only looks at the data itself. This provides the advantage of flexibility, as this method can determine the lineage of any data across any system or technology. But since it is a relatively simplistic method, it might miss or misunderstand some data complexities or nuances.
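To make the pattern-based idea concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not how any particular lineage tool works: the helper names (normalize, value_overlap, infer_links), the 0.8 overlap threshold, and the sample crm and report datasets are all hypothetical. It guesses column-level links using only column names and values, without looking at the code that moved the data.

```python
def normalize(name: str) -> str:
    """Lower-case a column name and drop separators, so 'Customer_ID' matches 'customerid'."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def value_overlap(source_values, target_values) -> float:
    """Fraction of distinct target values that also appear in the source column."""
    target = set(target_values)
    if not target:
        return 0.0
    return len(target & set(source_values)) / len(target)


def infer_links(source: dict, target: dict, min_overlap: float = 0.8):
    """Guess lineage links from names and values alone (hypothetical heuristic)."""
    links = []
    for t_col, t_vals in target.items():
        for s_col, s_vals in source.items():
            same_name = normalize(s_col) == normalize(t_col)
            overlap = value_overlap(s_vals, t_vals)
            if same_name or overlap >= min_overlap:
                links.append((s_col, t_col, round(overlap, 2)))
    return links


# Made-up sample data: a source system and a downstream report.
crm = {"Customer_ID": [1, 2, 3, 4], "signup_region": ["EU", "US", "US", "APAC"]}
report = {"customerid": [1, 2, 3], "region": ["EU", "US"], "revenue": [10.0, 20.5, 7.25]}

links = infer_links(crm, report)
for source_col, target_col, overlap in links:
    print(f"{source_col} -> {target_col} (overlap {overlap})")
```

In this sample, the two copied columns are linked, but the calculated revenue column gets no link at all. That is the kind of nuance a pattern-only approach can miss.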
Data Lineage by Tagging
Data lineage by tagging is sometimes called self-contained data lineage. It’s applicable when data is stored, processed, and managed within a single, self-contained system that tags data as it moves through its life cycle. The advantage is that the tagging is built into the system; the disadvantage is that it only works when you’re using a single system.
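As a rough sketch of tagging inside a single system, the Python below wraps each transformation so it stamps every record it touches. The TaggedRecord class, the step decorator, and the step names are invented for illustration; a real self-contained platform would have its own tagging mechanism.

```python
from dataclasses import dataclass, field


@dataclass
class TaggedRecord:
    """A record that carries its own lineage tags (hypothetical structure)."""
    data: dict
    tags: list = field(default_factory=list)


def step(name):
    """Wrap a transformation so it appends its name to each record it handles."""
    def decorator(func):
        def wrapper(record: TaggedRecord) -> TaggedRecord:
            new_data = func(dict(record.data))  # work on a copy of the payload
            return TaggedRecord(new_data, record.tags + [name])
        return wrapper
    return decorator


@step("ingest:orders_csv")
def ingest(data):
    # No change to the payload; the tag records where the data entered.
    return data


@step("convert:cents_to_dollars")
def to_dollars(data):
    data["amount"] = data.pop("amount_cents") / 100
    return data


record = to_dollars(ingest(TaggedRecord({"order_id": 7, "amount_cents": 1250})))
print(record.data)
print(record.tags)
```

The tags travel with the record only while it stays inside code that uses the wrapper. Once the data is exported to another tool, the trail stops, which mirrors the single-system limitation described above.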
Data Lineage by Parsing
Data lineage by parsing is a more advanced method of data lineage. It employs software and automation to understand the logic used to process or transform data. This technique is good for capturing change across systems because it tracks data as it moves. The drawback is that this approach is more complex, and it requires an advanced understanding of the tools, programming languages, and systems used across the entire data life cycle.
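A toy Python example can show what reading the transformation logic means in practice. The sketch below assumes the logic is a simple SQL statement and uses two regular expressions to pull out the target table and its source tables. The function name parse_lineage, the patterns, and the table names are illustrative assumptions; production parsers need a full grammar for each SQL dialect and language involved, which is where the complexity comes from.

```python
import re

# Simplified patterns: they handle plain INSERT INTO / CREATE TABLE ... SELECT
# statements, not subqueries, CTEs, quoted identifiers, or vendor-specific syntax.
TARGET_RE = re.compile(r"\b(?:insert\s+into|create\s+table)\s+([\w.]+)", re.IGNORECASE)
SOURCE_RE = re.compile(r"\b(?:from|join)\s+([\w.]+)", re.IGNORECASE)


def parse_lineage(sql: str):
    """Return (source_table, target_table) edges found in one SQL statement."""
    target = TARGET_RE.search(sql)
    if target is None:
        return []
    sources = sorted(set(SOURCE_RE.findall(sql)))
    return [(source, target.group(1)) for source in sources]


# Made-up statement for illustration.
sql = """
INSERT INTO analytics.daily_revenue
SELECT o.order_date, SUM(o.amount)
FROM sales.orders o
JOIN sales.customers c ON c.id = o.customer_id
GROUP BY o.order_date
"""

edges = parse_lineage(sql)
for source, target in edges:
    print(f"{source} -> {target}")
```

Because the edges come from the code that moves the data, they can connect tables that live in different systems, provided every system's logic can be parsed.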
How to Use Data Lineage to Ensure Data Quality
Given the number of data assets and the complexity of data infrastructure in most enterprises, and even in midsize businesses, it’s nearly impossible to manage data lineage manually. A dedicated data lineage tool is a more prudent choice, because it can automatically and intelligently map data lineage from the point of origin to its current location. But regardless of which data lineage tool you choose, the process for getting started is similar.
Steps to Use Data Lineage to Support Data Quality
Identify your data landscape with a list of data assets, systems, and locations. Consider surveying workers to ensure no data elements or related processes are overlooked.
Trace data back to its origin to create a connection between where data is now and where it was generated.
Pinpoint data sources and interactions along the way, including details on links and systems.
Generate a data lineage map at the individual system level and at the macro organization level.
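The steps above can be sketched in a few lines of Python. The example treats the lineage map as a directed graph and walks it backward from one asset to its points of origin. The asset names and the helpers build_upstream_map and trace_origins are hypothetical, and a real map would usually be produced by a lineage tool rather than typed in by hand.

```python
from collections import defaultdict


def build_upstream_map(edges):
    """Index the lineage map so each asset points to its direct sources."""
    upstream = defaultdict(set)
    for source, target in edges:
        upstream[target].add(source)
    return upstream


def trace_origins(asset, upstream):
    """Walk the map backward and return the assets that have no source of their own."""
    origins, seen, stack = set(), set(), [asset]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        parents = upstream.get(node, set())
        if not parents and node != asset:
            origins.add(node)
        stack.extend(parents)
    return origins


# Made-up inventory of flows: (source, target) pairs.
edges = [
    ("crm.customers", "warehouse.customers"),
    ("erp.orders", "warehouse.orders"),
    ("warehouse.customers", "mart.revenue_by_region"),
    ("warehouse.orders", "mart.revenue_by_region"),
    ("mart.revenue_by_region", "dashboard.quarterly_revenue"),
]

upstream = build_upstream_map(edges)
origins = trace_origins("dashboard.quarterly_revenue", upstream)
print(sorted(origins))
```

The same edge list can serve both levels of the map: filtering it to one system gives the system-level view, while the full list gives the organization-wide view.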
Building your data lineage map is not a trivial task. This project may take a considerable amount of time and effort, depending on how well your data is currently documented and the breadth of your data landscape. Again, a data lineage tool can be very helpful in accelerating your data lineage efforts by using automation and AI.
Alation’s Data Lineage Tools
Effective data lineage requires accuracy, completeness, and detail. Alation provides advanced data lineage tools for comprehensive visibility and understanding of the data life cycle across your enterprise. Capabilities include automated, table-to-table lineage, as well as lineage APIs for manual augmentation of data flow graphs.
In partnership with Manta, Alation automates the process of generating enriched, column-level lineage across data sources and reduces the need for technical resources to develop end-to-end lineage. The integrated Alation+Manta solution offers advanced, cross-system, column-level lineage to ease regulatory compliance, perform impact analysis, and notify stakeholders of upstream data changes in real time.
To learn more about Alation for data lineage, sign up today to get a demo.