How does a data catalog work?
By Claudia Imhoff
Published on March 27, 2020
The architectures of the past for BI and analytics – the Corporate Information Factory or the Bus Architecture – are now only one part of a complete analytical environment. We now have multiple areas of analysis and a plethora of technologies to support and perform these analyses. Figure 1 gives you a good idea of the massive complexity found in most enterprises’ analytical environments.
Without the red box in the middle of Figure 1, the entire assortment of data sources and analytical components becomes chaotic and uncontrollable. Specifically, the data management box – governance, security, and of course, location and access of data, reports, visualizations, and analytics – is critical to the success and maintenance of all analytic environments. Sadly, this data management function is often less than optimal. I think Tim Berners-Lee, inventor of the World Wide Web, put it best:
“Any enterprise CEO really ought to be able to ask a question that involves connecting data across the organization, to be able to run a company effectively, and especially to be able to respond to unexpected events. Most organizations are missing this ability to connect all the data together…”
So, what’s an organization to do if these data assets are to be easily located and accessed for critical decision-making? Here are some of the solutions I’ve seen:
Manual “Hunt and Peck” Method: The person seeking information and analytics must personally search through multiple databases, data management technologies, and analytic and reporting technologies. They spend more time searching than they do analyzing, which means wasted time, redundant effort, and lost productivity.
Creation of “Reference” Capabilities: Reference resources are created in spreadsheets of metadata, system data dictionaries, internal wikis or ontologies, and even enterprise search engines. These capabilities consist of multiple sources that fall out of sync almost instantly, leaving most information workers frustrated.
Implementation of a Cross-Platform Metadata Management System, better known as a Data Catalog: The data catalog manages the metadata about data and analytics in one central place. By scanning and mapping all metadata for each data system (EDW, data lake, or other analytic data sources), data manipulation technology (ETL, data prep, etc.), and analytical or BI technology (data visualization, reporting, etc.), the data catalog creates a living, single source of reference for finding and understanding data. This consolidation means that businesspeople can quickly locate and access data and analytics through a simple interface that supports all skill levels, which ultimately helps to drive successful self-service analytics.
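To make the “scan, consolidate, search” idea in the third option concrete, here is a minimal Python sketch. It is only an illustration under my own assumptions: the CatalogEntry and DataCatalog classes, their fields, and the sample assets are all hypothetical, and a real catalog product would harvest metadata through connectors rather than from hand-written dictionaries.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """One data asset as the catalog sees it (all fields are illustrative)."""
    name: str
    system: str          # e.g. "EDW", "data lake", "BI tool"
    kind: str            # e.g. "table", "report", "ETL job"
    description: str = ""
    tags: set[str] = field(default_factory=set)


class DataCatalog:
    """A toy in-memory catalog: one central index over many source systems."""

    def __init__(self) -> None:
        self._entries: dict[tuple[str, str], CatalogEntry] = {}

    def scan(self, system: str, assets: list[dict]) -> None:
        """Register (or refresh) the metadata reported by one source system."""
        for asset in assets:
            entry = CatalogEntry(
                name=asset["name"],
                system=system,
                kind=asset["kind"],
                description=asset.get("description", ""),
                tags=set(asset.get("tags", [])),
            )
            # Keyed by (system, name) so a rescan overwrites stale metadata.
            self._entries[(system, entry.name)] = entry

    def search(self, term: str) -> list[CatalogEntry]:
        """Case-insensitive keyword match on name, description, and tags."""
        needle = term.lower()
        hits = [
            e for e in self._entries.values()
            if needle in e.name.lower()
            or needle in e.description.lower()
            or any(needle in t.lower() for t in e.tags)
        ]
        return sorted(hits, key=lambda e: (e.system, e.name))


catalog = DataCatalog()
catalog.scan("EDW", [
    {"name": "sales_fact", "kind": "table", "description": "Daily sales by store"},
])
catalog.scan("data lake", [
    {"name": "clickstream_raw", "kind": "file set", "tags": ["web", "sales funnel"]},
])
catalog.scan("BI tool", [
    {"name": "Quarterly Sales Dashboard", "kind": "report"},
])

# One search spans the warehouse, the lake, and the BI layer.
for hit in catalog.search("sales"):
    print(f"{hit.system}: {hit.name} ({hit.kind})")
```

The point of the sketch is the single index: because every system reports into the same structure, one search replaces three separate hunts, and rescanning a system refreshes its entries instead of creating a second, out-of-sync copy.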
Let’s dive deeper into how the data catalog works by defining it and illustrating the benefits an enterprise gains from its implementation.
So, what is a data catalog?
David Stodder, TDWI Research Director, defined the data catalog as:
“A kind of Rosetta stone that enables users, developers, and administrators to find and learn about data – and for information professionals to properly organize, integrate, and curate data for users.”1
The data catalog acts as the enterprise’s guide to the oftentimes mystifying world of data, analytics, visualizations, and reports. The data catalog shows businesspeople how to navigate the multiple analytic components, all while giving them necessary information about each analytic offering to ensure that they select the appropriate data asset(s) for their immediate needs. It assists BI developers by giving them immediate access to what data is available and accessible, what reports have been created, who is affected by changes in the upstream processing of the data, and what assets are no longer being used.
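The two developer questions above, who is affected by an upstream change and which assets are no longer used, come down to lineage and usage metadata. The following Python sketch shows one way to answer them, assuming the catalog exposes lineage as a simple parent-to-children mapping and usage as read counts. The asset names, the 90-day window, and both function names are hypothetical.

```python
from collections import deque

# Hypothetical lineage: each asset maps to the assets built directly from it.
downstream = {
    "crm.customers": ["edw.dim_customer"],
    "edw.dim_customer": ["report.churn", "report.revenue_by_segment"],
    "edw.sales_fact": ["report.revenue_by_segment"],
    "report.churn": [],
    "report.revenue_by_segment": [],
    "report.legacy_kpis": [],
}

# Hypothetical usage log: asset -> number of reads in the last 90 days.
reads_last_90_days = {
    "crm.customers": 1500,
    "edw.dim_customer": 412,
    "edw.sales_fact": 958,
    "report.churn": 37,
    "report.revenue_by_segment": 120,
}


def impacted_by(asset: str) -> list[str]:
    """Return every asset downstream of `asset`, found by breadth-first search."""
    seen: set[str] = set()
    queue = deque([asset])
    while queue:
        current = queue.popleft()
        for child in downstream.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return sorted(seen)


def unused_assets() -> list[str]:
    """Return assets with no recorded reads in the usage log."""
    return sorted(a for a in downstream if reads_last_90_days.get(a, 0) == 0)


print(impacted_by("crm.customers"))
print(unused_assets())
```

In this toy data, a change to crm.customers reaches the customer dimension and both reports built on it, while report.legacy_kpis shows no reads and becomes a candidate for retirement.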
The data catalog classifies and organizes all data assets, whether they live in the cloud (private, public, or hybrid) or in on-premises storage. And it supports all types of data consumers, from highly technical data scientists to business users who might never have written a query in their lives. At the same time, by centralizing the information about analytical assets, the data catalog supports governance, data security, and compliance across the enterprise.
Benefits of a Data Catalog
Easier, faster search capabilities: These lead to better, more accurate, and more reliable analytic assets. Businesspeople can quickly find the information they need and get better business insights. This leads to more trust in the environment and, therefore, higher adoption rates of analytic assets. It also enables first-time and future users to ramp up quickly.
Better compliance with internal policies, like security and privacy, and external regulations, like GDPR and CCPA: AI and ML capabilities can detect “sensitive” data such as PII or HIPAA-protected health fields, while usage tracking can reveal suspicious access or illegal usage patterns and create an audit trail for compliance.
Cost savings: A significant amount of effort and time is spent creating analytical assets that ALREADY exist. Business users often reinvent the wheel when they can’t easily find a report. Data scientists create redundant data sets when it is not obvious that suitable ones already exist. The inability to find what you want causes wasted time and effort in addition to redundancy of data and data assets. These all cost significant amounts of money. A data catalog highlights redundancy and inconsistencies and supports streamlining this overly complex environment.
Collaboration and annotation features: Context is critical to understanding and trusting these assets, making collaboration and annotation significant capabilities in data catalogs. These provide mechanisms for businesspeople to determine if the analytical asset is appropriate for their needs, to find like-minded people creating assets they need, and to form collaborative units with these like-minded people.
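Two of the benefits above, flagging sensitive fields and highlighting redundant data sets, can be illustrated with a small Python sketch. Please read it as a deliberately simplified stand-in: it uses plain column-name patterns where a real catalog would apply AI and ML to the data itself, and it treats two data sets as redundant only when their column names match exactly. The data sets, patterns, and function names are all hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical column metadata pulled from a catalog: data set -> column names.
datasets = {
    "edw.customers": ["customer_id", "email", "ssn", "signup_date"],
    "lake.customers_copy": ["signup_date", "ssn", "email", "customer_id"],
    "edw.sales_fact": ["sale_id", "customer_id", "amount"],
}

# Simple name patterns standing in for a trained classifier (an assumption).
SENSITIVE_PATTERNS = {
    "PII": re.compile(r"(ssn|email|phone|birth)", re.IGNORECASE),
    "health": re.compile(r"(diagnosis|patient)", re.IGNORECASE),
}


def flag_sensitive(columns: list[str]) -> dict[str, list[str]]:
    """Map each sensitivity label to the column names that match its pattern."""
    flagged: dict[str, list[str]] = {}
    for label, pattern in SENSITIVE_PATTERNS.items():
        matches = [c for c in columns if pattern.search(c)]
        if matches:
            flagged[label] = matches
    return flagged


def redundant_groups(catalog: dict[str, list[str]]) -> list[list[str]]:
    """Group data sets whose column sets are identical, ignoring column order."""
    by_schema: dict[frozenset[str], list[str]] = defaultdict(list)
    for name, columns in catalog.items():
        by_schema[frozenset(columns)].append(name)
    return [sorted(names) for names in by_schema.values() if len(names) > 1]


for name, columns in datasets.items():
    print(name, flag_sensitive(columns))
print(redundant_groups(datasets))
```

Matching column names is a weak signal on its own, so a result like this is best treated as a list of candidates for a data steward to review rather than as proof of duplication.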
I hope this blog has been useful in giving you a solid foundation for what a data catalog is and the numerous benefits you will garner from its implementation. I’ve only listed four here, but to hear more about data catalogs, including more detail on what they are and their benefits as well as what to look for in a catalog offering, I hope you watch my webcast with Alation’s Aaron Kalb, titled “Taming BI and Analytics Chaos: The Data Catalog to the Rescue!”
1 “Data Cataloging Comes of Age.” David Stodder, TDWI Research