Finding the right data asset in a modern enterprise can be like trying to find a book in a massive library. In the 20th century, libraries leveraged card catalogs to make it easier for book seekers to search by title, author, or category. In the digital age, Amazon and Google eclipsed their predecessors, largely by developing superior catalogs.
A 21st century data catalog should:
- Go beyond a mere inventory to be a truly consumable catalog, with sufficient context to help end-users find and utilize data assets relevant to their business needs
- Not simply attempt to document prescriptive rules, but rather offer rich, descriptive information about how data assets are used by different individuals and teams
- Combine machine learning, crowd-sourcing, and stewardship by experts to optimize both quantity and quality of information
Inventory vs. Catalog
Context is King
Suppliers think about inventory. Consumers look at catalogs.
Like an inventory, a catalog should list everything available for consumption (and nothing that isn’t), but that’s not enough. An Amazon product page includes pictures, specs, reviews, and recommendations. Together, these bits of information help the user decide what to buy. Consuming data also requires rich context: before embarking on a research project, an analyst needs to understand the shape of the data set, where it came from, whether it is up-to-date, who else has used it, and how it was used. To address these needs, a catalog should provide data samples and statistical profiles, lineage, lists of users and stewards, and tips on how the data should be interpreted.
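To make the difference concrete, here is a minimal sketch of what a single catalog entry might carry beyond a bare inventory listing. The class name, field names, and sample values are all hypothetical, chosen only to mirror the kinds of context listed above (samples, a statistical profile, lineage, users and stewards, interpretation tips); a real catalog would likely store far more.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class CatalogEntry:
    """One data asset plus the context a consumer needs (hypothetical schema)."""
    name: str                                   # an inventory would stop here
    lineage: list[str]                          # where the data came from, in order
    last_updated: str                           # is it up-to-date?
    stewards: list[str] = field(default_factory=list)
    recent_users: list[str] = field(default_factory=list)
    sample_rows: list[dict] = field(default_factory=list)
    usage_tips: str = ""                        # how the data should be interpreted

    def profile(self, column):
        """Return a small statistical profile of one numeric column in the samples."""
        values = [row[column] for row in self.sample_rows
                  if row.get(column) is not None]
        if not values:
            return {"count": 0}
        return {
            "count": len(values),
            "min": min(values),
            "max": max(values),
            "mean": mean(values),
        }


# Illustrative data only; none of these names come from a real system.
entry = CatalogEntry(
    name="orders",
    lineage=["crm.raw_orders", "warehouse.orders"],
    last_updated="2015-04-30",
    stewards=["Yvonne"],
    recent_users=["Al", "Bonnie"],
    sample_rows=[{"amount": 10}, {"amount": 30}, {"amount": None}],
    usage_tips="Amounts are in US dollars (illustrative note).",
)
print(entry.profile("amount"))
```

The point of the sketch is simply that the name of the asset is one field among many: everything else on the entry is context for the consumer.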
Prescriptive vs. Descriptive
Following the trail of usage…
Some catalogs may try to be a source-of-truth for the right table to consult for a given purpose, the right categorization of a given value, or the right way to calculate a given metric. If universally consulted and respected, such prescriptive catalogs could hypothetically help everyone in an organization align, reducing disparities and confusion. In practice, however, we find that prescriptivism is hard to sustain in large enterprises, where different teams often have legitimately different needs.
Here’s a customer example: For tax and legal purposes, Hawaii is clearly part of the US. However, the operations team cares less about legal boundaries and more about how the islands are inaccessible by rail, so they use a taxonomy which groups the other 49 states together with Canada, and buckets Hawaii with Puerto Rico and Guam.
If a data catalog were to pick a side, it would lose support from folks in Finance, or Legal, or Operations. And if any one of these groups abstained from using it, then the catalog, by definition, could not be the “single source of truth.”
A more useful and achievable goal is to carefully document all the ways data is actually used. If a catalog describes the various regional taxonomies available in some detail, and includes usage information (e.g. that Al, Bonnie, and Clyde from the supply chain management team have historically used definition #1, whereas Xavier, Yvonne, and Zelda from Accounting have instead used definition #2), visitors can figure out which version best suits their needs.
Of course, while it’s possible to have many different versions of “right,” there are many more versions that are just plain wrong. But taking a descriptive approach helps handle those as well. If I see a metric in a report, I want to be able to look it up, even if it shouldn’t have been used in the first place. When I observe that it hasn’t been used in a long time, or that it is used only by a small or less trusted group compared with the approved versions, that sends a strong signal.
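As a rough illustration of how usage can be described rather than prescribed, the sketch below summarizes a usage log per definition: who used it, which teams they belong to, when it was last used, and whether it looks stale. The log format, the definition names (`region_v1`, `region_v2`, `region_old`), the extra user "Mallory", the dates, and the one-year staleness threshold are all assumptions made up for this example.

```python
from collections import defaultdict
from datetime import date

# Hypothetical usage log: (user, team, definition, date of use).
USAGE_LOG = [
    ("Al", "Supply Chain", "region_v1", date(2015, 3, 2)),
    ("Bonnie", "Supply Chain", "region_v1", date(2015, 4, 10)),
    ("Clyde", "Supply Chain", "region_v1", date(2015, 4, 28)),
    ("Xavier", "Accounting", "region_v2", date(2015, 4, 15)),
    ("Yvonne", "Accounting", "region_v2", date(2015, 4, 30)),
    ("Zelda", "Accounting", "region_v2", date(2015, 2, 1)),
    ("Mallory", "Unknown", "region_old", date(2013, 6, 1)),
]


def summarize_usage(log, today, stale_after_days=365):
    """Describe how each definition is actually used, without picking a winner."""
    users = defaultdict(set)
    teams = defaultdict(set)
    last_used = {}
    for user, team, definition, used_on in log:
        users[definition].add(user)
        teams[definition].add(team)
        if definition not in last_used or used_on > last_used[definition]:
            last_used[definition] = used_on
    return {
        definition: {
            "users": sorted(users[definition]),
            "teams": sorted(teams[definition]),
            "last_used": last_used[definition],
            # Assumed rule of thumb: no use within the threshold means "stale".
            "stale": (today - last_used[definition]).days > stale_after_days,
        }
        for definition in last_used
    }


summary = summarize_usage(USAGE_LOG, today=date(2015, 5, 1))
for definition, info in summary.items():
    print(definition, info)
```

Nothing here declares one taxonomy correct. A visitor who sees that `region_old` has a single user and no recent activity can draw the conclusion on their own.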
Crowd-Sourcing vs. Stewardship vs. Machine-Learning
Optimizing for Quality and Quantity
When large groups of people are able to contribute to a data catalog—or when more content is automatically imported or learned from computer systems—broader coverage can be achieved, but accuracy may suffer. For data consumers, some information is better than no information, but when reliable and unreliable information are indistinguishable, all information becomes unreliable.
As a consequence, traditional data documentation systems limit contribution permissions to a small, trusted group. The result is more accurate documentation for a few data assets, but far less breadth of coverage. This method is also slow, and the documentation often gets stale.
A well-designed and engineered system should make it possible to achieve both coverage and accuracy. LinkedIn provides a good model. Instead of automatically adding connections to your profile, LinkedIn’s “People You May Know” algorithm suggests connections that you can confirm or reject (and your feedback, in turn, improves the algorithm). Similarly, entries in the “Skills & Endorsements” section are first crowd-sourced and then confirmed.
In a 21st century data catalog, advanced algorithms, supplemented by crowd-sourcing techniques, create a comprehensive portrait of data assets and how they’re used in an organization, with “guesses” clearly marked as such. Experts can then confirm, reject, or amend the guesses, both to teach the computers and to provide gold-standard knowledge for end-users.
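One way to picture this suggest-then-review loop is as a small state machine on each piece of documentation. The sketch below is an assumption-laden illustration, not a description of any particular product: the `Annotation` class, the `review` and `display` functions, the status names, and the sample texts are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Annotation:
    """A piece of catalog documentation and where it stands (hypothetical model)."""
    asset: str
    text: str
    source: str                        # e.g. "algorithm" or "crowd"
    status: str = "guess"              # "guess", "confirmed", or "rejected"
    reviewed_by: Optional[str] = None


def review(annotation, expert, decision, amended_text=None):
    """Apply an expert's decision: confirm, reject, or amend a guess."""
    if decision not in ("confirm", "reject", "amend"):
        raise ValueError(f"unknown decision: {decision!r}")
    if decision == "amend":
        if amended_text is None:
            raise ValueError("an amendment needs replacement text")
        annotation.text = amended_text
    # An amended annotation counts as confirmed, since the expert wrote it.
    annotation.status = "rejected" if decision == "reject" else "confirmed"
    annotation.reviewed_by = expert
    return annotation


def display(annotations):
    """Render annotations for end-users, with guesses clearly marked as such."""
    lines = []
    for a in annotations:
        if a.status == "rejected":
            continue  # rejected guesses are hidden from consumers
        label = "[confirmed]" if a.status == "confirmed" else "[unverified guess]"
        lines.append(f"{label} {a.asset}: {a.text}")
    return lines


# Illustrative data only.
notes = [
    Annotation("orders.region", "Looks like a US state code", source="algorithm"),
    Annotation("orders.amount", "Probably in cents", source="crowd"),
    Annotation("orders.ts", "Order creation time", source="algorithm"),
]
review(notes[1], expert="Yvonne", decision="amend", amended_text="Amount in US dollars")
review(notes[2], expert="Yvonne", decision="reject")
print("\n".join(display(notes)))
```

In a fuller system, each expert decision would presumably also be fed back as a training signal to the algorithm that produced the guess; that part is left out here.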
The Bottom Line
It’s 2015. Yesterday’s challenge was collecting relevant data for analysis and producing relevant reports. By now, many organizations possess the data and computational resources to answer almost any analytical question. However, finding the most relevant, trustworthy data sets and metrics can be like finding a needle in a haystack.
An inventory can support data storage and management, but only a true catalog can provide end users with the context to put that data to use.