5 Features to Consider When Evaluating Data Catalogs
As organizations learn to deal with information overload, they are finding that the problem isn’t so much the volume of the data as their ability to find the data they need and use it effectively.
While a data catalog can make it easier to find, understand, and establish trust in data, not all data catalogs are created equal. Beyond just providing a data inventory that helps you trust your data, they can help you achieve unprecedented insights into the behavioral context of your users. Though data catalogs have a lot of benefits, there are many considerations to keep in mind when determining which catalog is right for your organization.
Understanding behavioral context
Context is king in understanding the data you have in your organization. Deriving meaning from your data is a function of the ability to enrich data with context, understand the behavioral context of your data, and make the data accessible to everyone in your organization. A typical data catalog pulls in all of the metadata associated with your data assets, providing additional context.
By definition, a data catalog inventories everything from tables, queries, and schemas to business intelligence (BI) workbooks, articles, conversations, and more, and lets you query that data from within the catalog itself. Additionally, a data catalog may provide the ability to annotate and rate the value of the data.
Here are five things to consider when choosing a data catalog:
To derive the most insight from your data, you’ll need usage information about top users, popular schemas, tables, and columns, and common joins and filters, so that you can harness the power of the community for data curation. Much of this information can be crowdsourced, just as in consumer catalogs. Look for a data catalog that applies the lessons learned from Yelp, giving end users the ability to upvote or downvote data assets and a sense of where the data came from, which indicates whether it can be trusted. Wikipedia’s ability to be self-documenting, improving the quality of sourced data over time, can also help you achieve more trust in your assets. Pinterest and YouTube, meanwhile, provide a model for saving and sharing a list of items with like-minded individuals.
Data curation is the process of managing data in a way that enables data discovery and retrieval, helps maintain data quality, allows users to derive insights, and provides for re-use across a broad spectrum and number of users.
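As a rough illustration of Yelp-style voting on data assets, the sketch below ranks assets by the share of votes that are upvotes. The `DataAsset` class, the asset names, and the scoring rule are all hypothetical, chosen only for this example; a real catalog would likely weigh votes with more care.

```python
from dataclasses import dataclass


@dataclass
class DataAsset:
    """A catalog entry with crowdsourced votes (hypothetical structure)."""
    name: str
    upvotes: int = 0
    downvotes: int = 0

    def vote(self, up: bool) -> None:
        """Record one user's upvote (True) or downvote (False)."""
        if up:
            self.upvotes += 1
        else:
            self.downvotes += 1

    @property
    def approval(self) -> float:
        """Share of votes that are upvotes; 0.0 when nobody has voted."""
        total = self.upvotes + self.downvotes
        return self.upvotes / total if total else 0.0


def rank_assets(assets):
    """Order assets from highest to lowest approval share."""
    return sorted(assets, key=lambda a: a.approval, reverse=True)


# Made-up assets and votes, for illustration only.
clean = DataAsset("orders_clean")
legacy = DataAsset("orders_legacy")
for up in (True, True, True, False):
    clean.vote(up)
for up in (True, False, False):
    legacy.vote(up)

for asset in rank_assets([legacy, clean]):
    print(asset.name, round(asset.approval, 2))
```

A plain approval share favors assets with very few votes, so treat this as a starting point rather than a recommended ranking method.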
- Native Integrations (BI & Data Sources)
Today’s organizations are challenged with having to keep up with what can sometimes feel like an overwhelming number of data sources. Your data may be coming from a data warehouse, a data lake, or any number of additional inputs. A data catalog can normalize all of these sources and provide a single source of reference for all of your enterprise data, creating an inventory of your data and a single place to access all of your data assets.
A data catalog should also be able to connect to the business intelligence (BI) tools you are using. Make sure that the data catalog you choose has native connectors built into the platform and supported by the platform itself. Otherwise, you’ll be stuck building, testing, and validating these integrations yourself. Native integration means that no additional tools are required to set up or maintain the connection or to ensure compatibility with different versions. Support comes from a single vendor, and there is typically no additional cost to add integrations.
The ability to connect to these sources and ingest data is the bare minimum needed. Support for native workflows and tight integration between products is essential, and you should consider whether the data catalog provides additional features such as the ability to surface error notifications or provide lineage.
- SQL Query Log Analysis
Part of the context that a data catalog can provide comes from the ability to parse a usage log and track the behavior of the people who access the data sets. At a bare minimum, a data catalog should provide metrics on how many times a data set has been queried and by whom. Ideally, you’ll choose a data catalog that provides behavioral statistics such as which schemas, tables, columns, filters, joins, and queries are the most popular. Some data catalogs also surface machine-learned usage patterns along with the technical metadata to provide a more complete picture. Look, too, for the ability to trace the lineage of the data. Lineage is important because it can show you all of the steps in the data pipeline that might have an impact on analysis. A data catalog that surfaces lineage can contribute greatly to the speed and accuracy of the analysis.
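The bare-minimum metrics described above (how often a data set is queried, and by whom) can be sketched in a few lines of Python. This is a minimal illustration, not how any particular catalog works: the log format, user names, and table names are made up, and the regular expression is a naive stand-in for a real SQL parser.

```python
import re
from collections import Counter

# Hypothetical query log of (user, SQL text) pairs. A real catalog would
# read these from the database's own query history.
QUERY_LOG = [
    ("ana", "SELECT id, total FROM sales.orders WHERE total > 100"),
    ("ben", "SELECT o.id FROM sales.orders o "
            "JOIN sales.customers c ON o.customer_id = c.id"),
    ("ana", "SELECT name FROM sales.customers"),
]

# Naive pattern: the identifier that follows FROM or JOIN.
TABLE_PATTERN = re.compile(r"\b(?:FROM|JOIN)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)


def table_usage(log):
    """Count how many queries touch each table and which users ran them."""
    counts = Counter()
    users = {}
    for user, sql in log:
        # A set ensures a table referenced twice in one query counts once.
        tables = {name.lower() for name in TABLE_PATTERN.findall(sql)}
        for table in tables:
            counts[table] += 1
            users.setdefault(table, set()).add(user)
    return counts, users


counts, users = table_usage(QUERY_LOG)
for table, n in counts.most_common():
    print(table, n, sorted(users[table]))
```

The same pass over the log could be extended to count columns, filters, and joins, though doing that reliably calls for a proper SQL parser rather than pattern matching.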
- Simple SQL Access
To enable self-service analytics for business users, it is necessary to provide them with tools that don’t require a lot of training or technical knowledge. A modern data catalog lets users immediately begin to write queries in Structured Query Language (SQL) and can make “smart suggestions” that guide users to the best filters and joins, popular columns, and more. Data analysts or stewards can write and save queries for use by less technical users, scaling data discovery across an organization.
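One simple way to picture a “smart suggestion” for joins is to count which tables have appeared together in past queries and recommend the most frequent partners. The sketch below assumes that approach; the history data and function names are hypothetical, and commercial catalogs may use quite different techniques.

```python
from collections import Counter
from itertools import combinations

# Hypothetical history: the set of tables each past query used together.
PAST_QUERIES = [
    {"orders", "customers"},
    {"orders", "customers"},
    {"orders", "products"},
    {"customers"},
]


def build_pair_counts(queries):
    """Count how often each pair of tables appears in the same query."""
    pair_counts = Counter()
    for tables in queries:
        # Sorting gives each pair one canonical (a, b) ordering.
        for a, b in combinations(sorted(tables), 2):
            pair_counts[(a, b)] += 1
    return pair_counts


def suggest_joins(table, pair_counts, limit=3):
    """Return the tables most often queried together with `table`."""
    partners = Counter()
    for (a, b), n in pair_counts.items():
        if a == table:
            partners[b] += n
        elif b == table:
            partners[a] += n
    return [name for name, _ in partners.most_common(limit)]


pair_counts = build_pair_counts(PAST_QUERIES)
print(suggest_joins("orders", pair_counts))
```

With this made-up history, `customers` ranks ahead of `products` as a join partner for `orders` because the pair appears in two past queries rather than one.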
- Search and Discovery
One of the intended goals of any data catalog is to make it easier for users to derive insights from their data. Data discovery is more valuable when it is available to everyone in an organization who needs to work with data. When a data catalog is developed as an add-on to an existing governance product, it may not give business users the true benefits of data discovery. Data discovery may be limited to users in governance roles rather than the rank-and-file users who are performing self-service analytics. A modern data catalog doesn’t just support compliance; it enables governance-for-insight, a more agile approach that doesn’t limit users.
As organizations learn to deal with information overload, they are finding that the problem isn’t so much the volume of the data as their ability to find the data they need and use it effectively. A data catalog solves these problems and others by providing a single point of reference that allows everyone from chief data officers and data stewards to less-technical business users to collaborate on data. A crowdsourced, grassroots approach to data governance is a perfect fit with the move to self-service analytics enabled by visualization and BI tools.