Features of a Modern Data Catalog
As big data grows exponentially, data catalogs offer a modern way to find, understand, and trust your data
What is a Data Catalog?
Data catalogs make it easier for your analysts to find, understand and trust the data they need to perform self-service analytics.
As big data grows exponentially, new approaches to finding, understanding and establishing trust in that data are emerging. A data catalog is a reference application available to any user of data as the first stop on a self-service data discovery task. Data catalogs aggregate metadata on the datasets that are available for analysis. This metadata describes standard database objects such as tables, queries, and schemas stored in systems such as a data warehouse or data lake. It can be enriched with annotations and sample projects created in business intelligence (BI) or analytic applications and shared by users of the data through the catalog. In its physical form, a data catalog is a cloud-based or on-premises server that automatically indexes data systems and provides an inventory of those data assets that can be accessed from a single place. Much as Google crawls the web, a data catalog crawls databases and business intelligence systems and provides a single point of reference for enterprise data.
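To make the idea of crawling a data system for metadata concrete, here is a minimal sketch in Python. It assumes an in-memory SQLite database as a stand-in for a data warehouse; the function name crawl_sqlite, the entry layout and the source name demo_warehouse are hypothetical choices made for illustration, not part of any particular catalog product. Note that only metadata (table and column names and types) is collected, never the rows themselves.

```python
import sqlite3


def crawl_sqlite(conn, source_name):
    """Return one catalog entry per table, holding metadata only (no row data)."""
    entries = []
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table_name,) in tables:
        # Skip SQLite's own internal bookkeeping tables.
        if table_name.startswith("sqlite_"):
            continue
        # PRAGMA table_info returns (cid, name, type, notnull, dflt_value, pk)
        # for each column of the table.
        columns = conn.execute(f'PRAGMA table_info("{table_name}")').fetchall()
        entries.append({
            "source": source_name,          # hypothetical label for the crawled system
            "table": table_name,
            "columns": [{"name": c[1], "type": c[2]} for c in columns],
            "annotations": [],              # to be enriched later by users
        })
    return entries


# Demo: a tiny stand-in "warehouse" with two tables.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL)")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")

catalog = crawl_sqlite(conn, "demo_warehouse")
for entry in catalog:
    column_names = ", ".join(c["name"] for c in entry["columns"])
    print(f'{entry["source"]}.{entry["table"]}: {column_names}')
```

A real catalog would run crawlers like this against many systems and merge the results into one searchable inventory; the empty annotations list marks where user-contributed context would be attached.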
Offering a data catalog as a common point of reference enables users, including business analysts, data analysts, data scientists, and data stewards, to find, understand and collaborate on the data in a data warehouse or data lake through annotation that enriches the data with context. Some data catalogs rely on machine learning to provide additional behavioral context around how the data is being used. By analyzing usage logs, a data catalog can make certain inferences about the usefulness or quality of the data being accessed. You can see such things as how often a particular table or schema is being accessed, how recently it was used, and by whom. In this way, a data catalog adds context that can’t be determined from the data alone.
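The kind of log analysis described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how any specific catalog works: the log format (an ISO timestamp, a user name and a table name per entry), the TableUsage class and the summarize_usage function are all hypothetical, and a plain aggregation stands in for the machine learning some catalogs apply.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional, Set


@dataclass
class TableUsage:
    """Behavioral context for one table: how often, how recently, and by whom."""
    access_count: int = 0
    last_accessed: Optional[datetime] = None
    users: Set[str] = field(default_factory=set)


def summarize_usage(log_entries):
    """Aggregate (timestamp, user, table) log entries into per-table usage stats."""
    usage = defaultdict(TableUsage)
    for timestamp, user, table in log_entries:
        stats = usage[table]
        stats.access_count += 1
        when = datetime.fromisoformat(timestamp)
        # Keep the most recent access time seen so far.
        if stats.last_accessed is None or when > stats.last_accessed:
            stats.last_accessed = when
        stats.users.add(user)
    return dict(usage)


# Demo with made-up log entries (assumed format, for illustration only).
sample_log = [
    ("2024-01-02T10:00:00", "ana", "sales.orders"),
    ("2024-01-05T09:30:00", "ben", "sales.orders"),
    ("2024-01-03T12:00:00", "ana", "hr.staff"),
]

for table, stats in sorted(summarize_usage(sample_log).items()):
    print(table, stats.access_count, stats.last_accessed, sorted(stats.users))
```

Sorting tables by access count or recency is one simple way such statistics could feed a popularity signal shown next to each data asset.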
Instead of requiring users to know a connection string or path to connect to a data source, a data catalog provides a client application that is purpose-built for the consumption of data and employs conventions adopted from widely used online consumer catalogs such as Yelp, Pinterest, Wikipedia and Spotify. Users find data by browsing, searching or following recommendations rather than by typing obscure commands.
Collaboration features, such as the ability to annotate data assets or hold threaded discussions, allow for a grassroots approach to data governance and let every user contribute their knowledge to the data catalog.
A data catalog makes it easier for non-technical users to consume data productively. The goal is not only to inventory data but also to help non-technical users find the right data through natural language search, saved queries and the ability to browse data assets in a catalog format. In addition, a data catalog surfaces recommendations from other users working with the same data sets.
Traditional database systems require the user to know where a data source’s documentation lives in order to understand its intended use. A data catalog is self-documenting: the documentation resides side by side with the data it describes, not in a separate system. Since a data catalog can also access wiki pages, the experience of reading documentation is the same as that of examining the data. In addition, instead of having to track down the expert or team responsible for the data in order to get a question answered, a user can immediately see who is engaging with the data and reach out to them for help. In this way, tribal knowledge is made available for discovery and reuse.
Another big advantage for those working with many different kinds of data is that, unlike with extract, transform, load (ETL) tools, the data in a data catalog remains in its native format, so it is simple to go back to the original application that created it if necessary.
A data catalog enables data discovery and exploration for self-service analytics by providing a single source of reference and a simple way for data consumers to access the data they need to perform their jobs. A data catalog can aid in data quality and data governance by allowing users to collaborate in a single self-service environment. A data catalog can help your company go from data-rich to data-driven.