Best Practices for Data Curation
In the consumer world, the growing need to make sense of a seemingly endless and confusing world of information has led to the adoption of curation techniques that our forebears would have recognized. There’s arguably even more need for this in the enterprise, where data comes at us from all directions simultaneously and in different forms: some structured, some semi-structured, and some still raw.
Until recently, the priesthood (in other words, the IT managers who have long controlled the dissemination of data) determined which data was of value and which data assets would be made available for use. But we are now seeing a shift from the old way of doing things (single-source-of-truth data warehouses with top-down data governance) to a more modern form of data management (distributed, self-service analytics with grassroots management). This approach, commonly referred to as self-service analytics, allows data analysts, data scientists, and even less technical business users to freely discover and explore their own data.
In the past, the data warehouse was considered the single source of truth for enterprise data. But that no longer works when data is also coming from files, streams, wikis, data dictionaries, metadata management tools, raw web content, emails, chats, and many other forms of data communication. What’s needed is a tool that can draw on all of these sources and deliver the right data to the right people at the right time.
There’s certainly a wealth of data available. But to know which data can be trusted, you need to understand how it maps to your business processes, how recently it has been accessed, and how it is being used. You also need assurance that the data is of high quality and is useful for the kind of analysis you are performing.
Lessons from Consumer Catalogs
Just as Yelp serves as a guide to all of the restaurants in a given place, a data catalog indexes all of the data assets that are spread across a company’s various systems. It also documents tribal knowledge and best practices by presenting the data in context.
A crowdsourced approach to curation means that data analysts can move at the speed of business. Consider Wikipedia. By many accounts, Wikipedia is more accurate and up to date than the Encyclopaedia Britannica because it is constantly being updated by a community of people, many of whom are subject matter experts and professionals within their given domain.
Another example is Pinterest, where you follow people whose interests are similar to yours and can add their pins to your own list of saved pins. Or Amazon, which has built a complex algorithm that predicts what you’ll purchase in the future based on what you’ve purchased in the past.
As with these consumer catalogs, the value of a data catalog comes from its ability to surface the connections and context around different sets of data. A data catalog may let you upvote or downvote specific data assets, annotate or deprecate them, follow particular users, and have conversations around the data.
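To make these curation signals concrete, here is a minimal Python sketch of how a catalog entry might record votes, annotations, and deprecation, and how those signals could drive ranking. It is an illustration only, not the design of any particular product: the `CatalogEntry` class, the `rank` function, the asset names, and the user names are all hypothetical, and following users and threaded conversations are left out for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """One data asset in a hypothetical catalog, plus its curation signals."""
    name: str
    votes: dict[str, int] = field(default_factory=dict)  # user -> +1 or -1
    annotations: list[tuple[str, str]] = field(default_factory=list)  # (user, note)
    deprecated: bool = False

    def vote(self, user: str, up: bool) -> None:
        # One vote per user; voting again replaces the earlier vote.
        self.votes[user] = 1 if up else -1

    def annotate(self, user: str, note: str) -> None:
        self.annotations.append((user, note))

    def deprecate(self, user: str, reason: str) -> None:
        # Keep the reason alongside the other annotations so it stays visible.
        self.deprecated = True
        self.annotate(user, f"DEPRECATED: {reason}")

    @property
    def score(self) -> int:
        # Net votes: upvotes minus downvotes.
        return sum(self.votes.values())


def rank(entries: list[CatalogEntry]) -> list[CatalogEntry]:
    """Active assets first, then by net votes, highest first."""
    return sorted(entries, key=lambda e: (e.deprecated, -e.score))


# Hypothetical assets and users, for illustration only.
orders = CatalogEntry("sales.orders_clean")
legacy = CatalogEntry("sales.orders_2014_export")

orders.vote("ana", up=True)
orders.vote("raj", up=True)
orders.annotate("ana", "Refreshed nightly; use for revenue reports.")

legacy.vote("ana", up=True)
legacy.vote("ana", up=False)  # ana changes her mind; only the latest vote counts
legacy.deprecate("raj", "Superseded by sales.orders_clean")

for entry in rank([legacy, orders]):
    status = "deprecated" if entry.deprecated else "active"
    print(f"{entry.name}: score={entry.score}, {status}")
```

Ranking deprecated assets last rather than hiding them is a deliberate choice in this sketch: analysts who still find an old table can see why it was retired and what replaced it.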
Is Data Curation the Same as Data Governance?
Top-down data governance, with policies imposed by IT, has been the solution to ensuring legal compliance. But it has been at odds with the need for businesses to extend data access to a broader set of users and to move with more agility. This tension only gets worse as companies need to become more data-driven.
As companies adopt self-service analytics, data curation can help resolve this disconnect between strict data governance and business requirements. With data curation, any analyst in the organization can be a curator of data knowledge, and subject matter experts are encouraged to do so (not unlike Wikipedia). Data curation is part of a new approach, Governance-for-Insight, that balances the need for compliance with the need to deliver faster insight.
Data curation promotes reuse of the data knowledge that already exists in the organization and results in higher analyst productivity. With the ability to build on existing knowledge, analyst time is freed up to focus on ideation and new analysis. Collaboration comes more naturally as analysts have more time to share and document their work, so the coverage and accuracy of data knowledge in the organization also increase. Finally, data curation creates broader awareness throughout an organization of how data can be applied to decisions. Ultimately, investing in data knowledge can inspire individuals to be more data-driven.