By Ibby Rahmani
Published on September 23, 2021
A data catalog is a centralized storage bank of metadata on information sources from across the enterprise, such as:
Datasets
Business intelligence reports
Visualizations
Conversations
The metadata a catalog stores (data about data, such as a conversation about a dataset) gives users context on how to use each asset. A data catalog supports a broad range of data intelligence solutions, including analytics, data governance, privacy, and cloud transformation.
Data catalogs come with various features, but the core capabilities needed for a well-functioning data catalog are:
Data search and discovery
Curation
Governance
Collaboration
Analytics
With these features, a data catalog can address a range of data intelligence needs within an enterprise.
When considering a data catalog solution for an enterprise, there are many potential features to consider. Some data catalogs are built for a specific infrastructure and some are built for governance suites. But only a data catalog built as a platform can empower all employees who work with data across your organization.
With a data catalog as a platform, users will have the confidence to find, understand, and govern data. There are many other features and functions, but to verify that a data catalog functions as a platform, look for these five features:
Intelligence – A data catalog should leverage AI/ML-driven pattern detection, including popularity, pattern matching, and provenance/impact analysis. It should also use automation to build a lexicon of your enterprise’s terminology while allowing for manual business glossary suggestions.
Collaboration – Data people need each other’s expertise. Collaborative catalog features, like ratings, reviews, and conversations, can crowdsource that expertise so folks learn from each other and build on one another’s work. This crowdsourcing allows for a fully searchable Q&A so knowledge isn’t lost in email or other communications.
Guided Navigation – Guided navigation provides intelligent suggestions, which guide correct usage of data. Behavioral intelligence, embedded in the catalog, learns from user behavior to enforce best practices through features like data quality flags, which help folks stay compliant as they use data.
Active Governance – Active data governance creates usage-based assignments, which prioritize and delegate curation duties. It also allows for deeper analytics and visibility into people, data, and documentation. Active governance should include a stewardship dashboard, designed to make stewardship fast and easy.
Deep Connectivity – If you’re implementing a data catalog, it should utilize pre-built connectors to a wide variety of sources, as well as an open connector SDK to connect to any other source. Lastly, Query Log Ingestion (QLI) provides deep insights into how queries are used, surfacing handy advice for query writers in real time.
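To make the idea of usage-based assignments more concrete, here is a minimal Python sketch. It is not how any particular catalog works: the Dataset class, the assign_curation function, and the (user, dataset) event format are all hypothetical, chosen only to show how usage data could prioritize curation work and suggest a steward.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Dataset:
    # Hypothetical catalog entry; real catalogs track far more metadata.
    name: str
    documented: bool  # True once someone has written a description


def assign_curation(datasets, query_events):
    """Rank undocumented datasets by query volume and suggest a steward.

    query_events is an iterable of (user, dataset_name) pairs, an assumed
    stand-in for whatever usage signal a real catalog collects.
    """
    events = list(query_events)  # allow generators; we iterate twice
    usage = Counter(name for _, name in events)

    # Per-dataset count of queries by each user.
    users_by_dataset = {}
    for user, name in events:
        users_by_dataset.setdefault(name, Counter())[user] += 1

    assignments = []
    for ds in datasets:
        # Skip datasets that are already documented or never queried.
        if ds.documented or usage[ds.name] == 0:
            continue
        # Suggest the most frequent user as the likely expert.
        steward, _ = users_by_dataset[ds.name].most_common(1)[0]
        assignments.append((ds.name, usage[ds.name], steward))

    # Most-queried datasets first, so curation effort goes where usage is.
    assignments.sort(key=lambda item: item[1], reverse=True)
    return assignments
```

Called with a list of datasets and a list of usage events, assign_curation returns (dataset, query count, suggested steward) tuples, most-used first. A real catalog would likely weigh more signals than raw query counts, so treat the ranking rule here as an illustration only.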
Who will use the catalog? With what tools must the catalog integrate? There are different uses for a data catalog tool and different skill levels required to make good use of it. In most cases data catalogs are used in three distinct ways.
Data Catalogs for Data Science & Engineering – Data catalogs that are primarily used for data science and engineering are typically used by very experienced data practitioners. These catalogs classify information in data lakes and provide tons of information to data users, but due to their complexity, these catalogs usually don’t result in self-service business intelligence.
Data Catalogs for Specific Vendors or Tools – Data catalogs built for specific vendors or tools help data users find and analyze data, but are still limited by the lack of connectivity to other data systems. In this case, the data catalog has one function and does not address the full scope of enterprise data efficiency.
Enterprise Data Catalogs – An enterprise data catalog is designed to empower the widest class of users. This catalog-as-platform connects to all data sources (and their assets) within the enterprise. This allows for enhanced data efficiency and accuracy, as well as the fostering of a strong data culture.
All data consumers stand to benefit from the right data catalog. Encourage your various teams, including IT, analytics, and data governance leads, to evaluate each data catalog’s suitability to their purpose, as it relates to role.
When looking for a data catalog that will fit your needs, it’s important to know not just how it will apply to the workforce of an enterprise, but also how it’s built and what it’s meant to accomplish. There are four main data catalog types that offer different functions based on the needs of your enterprise:
Standalone – A standalone data catalog allows for cataloging data sets and operations and for searching and evaluating data sets. It requires a high level of interoperability for a seamless user experience.
Integrated with Data Preparation – A data catalog integrated with data preparation creates a conducive environment for finding, evaluating, and preparing data. It also catalogs datasets and operations and includes data preparation features and functions.
Integrated with Data Analysis – A data catalog integrated with data analysis supports basic data preparation; it allows data users to easily find and analyze data and catalog operations. It also provides a catalog of datasets, including data analysis and visualization features. If users desire advanced data preparation, a high level of interoperability with data preparation tools will deliver preparation functions.
Fully Integrated Solution – A fully integrated data catalog is what most data catalogs strive to be or mature into. These catalogs are fully connected to all enterprise data assets providing data preparation, analysis, visualization, governance, and security. Integrated catalogs provide ease of use throughout the analytics lifecycle.
If you’re like most data consumers, you’re up to your ears in data! Automation can ease the burden of manual cataloging efforts with AI and ML. Here are the features of a data catalog that stand out in the top catalog solutions.
Data Curation – Data curation in the catalog empowers crowdsourcing within a data team and across the enterprise at large. Curation supports data quality and reuse, and it can power key insights. Your data catalog should pull and display usage information around top users, popular data, and joins/filters. Sharing these details builds trust in data, enforces smarter usage, and supports data culture.
Native Integrations – Native integration uses data catalog connectors to consolidate what was once an overwhelming number of data sources into a single place to access all data sets. If you select a catalog that includes native integration, there will be no extra tools or setup required. In this sense, native integration capability is the bare minimum requirement to connect to data sources.
SQL Query Log Analysis – SQL query log analysis provides insight into usage and into the behavior of people accessing data sets. At a minimum, a data catalog should provide metrics that show who is querying a data set and how often.
Simple SQL Access – Simple SQL access enables self-service analytics for users who don’t necessarily have highly technical knowledge or training. Data analysts or stewards can write and save queries, which even less-trained users can then run.
Search and Discovery – Search and discovery allows all enterprise users to derive insights from their data. The main goal of this feature is to create a system that makes data discovery and search efficient and comprehensive.
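As an illustration of the minimum query log metric described above (who is querying a data set and how often), here is a small Python sketch. It assumes the log is already available as (user, SQL text) pairs and uses a deliberately naive regular expression to find table names. The function name usage_by_table and the log format are hypothetical, and a production catalog would rely on a real SQL parser rather than pattern matching.

```python
import re
from collections import Counter, defaultdict

# Naive assumption: a table name is the identifier right after FROM or JOIN.
# This misses subqueries, CTEs, and quoted identifiers.
TABLE_PATTERN = re.compile(r"\b(?:from|join)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)


def usage_by_table(query_log):
    """Count, per table, how many queries each user ran against it.

    query_log is an iterable of (user, sql_text) pairs.
    Returns a dict mapping lowercase table name -> Counter of users.
    """
    stats = defaultdict(Counter)
    for user, sql in query_log:
        # Use a set so a table referenced twice in one query counts once.
        tables = {name.lower() for name in TABLE_PATTERN.findall(sql)}
        for table in tables:
            stats[table][user] += 1
    return stats


if __name__ == "__main__":
    log = [
        ("ana", "SELECT * FROM sales.orders o JOIN sales.customers c ON o.cid = c.id"),
        ("ben", "select count(*) from sales.orders"),
    ]
    for table, users in usage_by_table(log).items():
        print(table, users.most_common())
```

The per-user counters answer both halves of the question at once: the keys show who queried a table, and the values show how often.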
We return to the title: What do you actually need from a data catalog tool?
You need speed, and you need each other. You need the ability to collaborate around data. When questions arise, you need to find experts with answers quickly. You need to know which data is the most trusted for reports, especially when those reports inform major business decisions. And you need the guiding hand of active governance, so you’re not living in fear of violating the GDPR and incurring an eight-figure fine!
Convinced? Then it’s time to go catalog shopping. To find a data catalog solution that works for your enterprise, consider three key steps: identify a solution that functions as a platform, confirm that it supports your team’s needs, and check that it includes the features common in catalogs that drive success. Start from your business goals, then work through catalog types and team roles, and finally settle on the features you need to fix pain points.
Hungry for further information about how to evaluate a data catalog? Check out our evaluation guide for a step-by-step evaluation walkthrough.