Data catalogs have quickly become a core component of modern data management. Organizations with successful data catalog implementations see remarkable changes in the speed and quality of data analysis, and in the engagement and enthusiasm of people who need to perform data analysis. By contrast, organizations without a data catalog often have these questions: What is a data catalog? Why do we need a data catalog? What does a data catalog do? These are all good questions and a logical place to start your data cataloging journey.
What Is a Data Catalog?
A Data Catalog is a collection of metadata, combined with data management and search tools, that helps analysts and other data users to find the data that they need, serves as an inventory of available data, and provides information to evaluate fitness data for intended uses.
This brief definition makes several points about data catalogs—data management, searching, data inventory, and data evaluation—but all depend on the central capability to provide a collection of metadata.
Data catalogs have become the standard for metadata management in the age of big data and self-service analytics. The metadata that we need today is more expansive than metadata in the BI era. A data catalog focuses first on datasets (the inventory of available data) and connects those datasets with rich information to inform people who work with data. Figure 1 illustrates the typical metadata subjects contained in a data catalog.
Datasets are the files and tables that data workers need to find and access. They may reside in a data lake, warehouse, master data repository, or any other shared data resource. People metadata describes those who work with data—consumers, curators, stewards, subject matter experts, etc. Search metadata supports tagging and keywords to help people find data. Processing metadata describes transformations and derivations that are applied as data is managed through its lifecycle. Supplier metadata is especially important for data acquired from external sources, informing about sources and subscription or licensing constraints. I’ll dive deeper into catalog metadata in an upcoming blog.
Why Do We Need a Data Catalog?
The data management benefits of a data catalog become apparent by reflecting on the value of metadata and the capabilities that are created with comprehensive metadata. The greatest value, however, is often seen in the impact on analysis activities. We work in an age of self-service analytics. IT organizations can’t provide all of the data needed by the ever-increasing numbers of people who analyze data. But today’s business and data analysts are often working blind, without visibility into the datasets that exist, the contents of those datasets, and the quality and usefulness of each. They spend too much time finding and understanding data, often recreating datasets that already exist. They frequently work with inadequate datasets resulting in inadequate and incorrect analysis. Figure 2 illustrates how analysis processes change when analysts work with a data catalog.
Without a catalog, analysts look for data by sorting through documentation, talking to colleagues, relying on tribal knowledge, or simply working with familiar datasets because they know about them. The process is fraught with trial and error, waste and rework, and repeated dataset searching that often leads to working with “close enough” data as time is running out. With a data catalog the analyst is able to search and find data quickly, see all of the available datasets, evaluate and make informed choices for which data to use, and perform data preparation and analysis efficiently and with confidence. It is common to shift from 80% of time spent finding data and only 20% on analysis to 20% finding and preparing data with 80% for analysis. Quality of analysis is substantially improved and organizational analysis capacity increases without adding more analysts.
What Does a Data Catalog Do?
A modern data catalog includes many features and functions that all depend on the core capability of cataloging data—collecting the metadata that identifies and describes the inventory of shareable data. It is impractical to attempt cataloging as a manual effort. Automated discovery of datasets, both for initial catalog build and ongoing discovery of new datasets is essential. Use of AI and machine learning for metadata collection, semantic inference, and tagging, is important to get maximum value from automation and minimize manual effort.
With robust metadata as the core of the data catalog, many other features and functions are supported, the most essential including:
- Dataset Searching—Robust search capabilities include search by facets, keywords, and business terms. Natural language search capabilities are especially valuable for non-technical users. Ranking of search results by relevance and by frequency of use are particularly useful and beneficial features.
- Dataset Evaluation—Choosing the right datasets depends on ability to evaluate their suitability for an analysis use case without needing to download or acquire data first. Important evaluation features include capabilities to preview a dataset, see all associated metadata, see user ratings, read user reviews and curator annotations, and view data quality information.
- Data Access—The path from search to evaluation and then to data access should be a seamless user experience with the catalog knowing access protocols and providing access directly or interoperating with access technologies. Data access functions include access protections for security, privacy, and compliance sensitive data.
A robust data catalog provides many other capabilities including support for data curation and collaborative data management, data usage tracking, intelligent dataset recommendations, and a variety of data governance features.
Managing data in the age of big data, data lakes, and self-service is challenging. Data catalogs help to step up to those challenges. Active data curation is a core element of data catalog success and a critical practice for modern data management. In my next blog I’ll answer the question: What Is Data Curation?