Data curation is a term that has recently become a common part of data management vocabulary. Data curation is important in today’s world of data sharing and self-service analytics, but I think it is a frequently misused term. When speaking and consulting, I often hear people refer to data in their data lakes and data warehouses as curated data, believing that it is curated because it is stored as shareable data. Curating data involves much more than storing data in a shared database.
What Is Curation?
Let’s set data aside for a moment and consider the meaning and the activities of curating. The traditional use of the word is associated with collections of artifacts in a museum and works of art in a gallery. More recently we’ve started to use the term to describe managed collections of many kinds such as curated content at a website, curated music and videos available through streaming services, and curated apps through download services. Wired.com has described Apple’s App Store as “curated computing.”
Curation is the work of organizing and managing a collection of things to meet the needs and interests of a specific group of people. Collecting things is only the beginning. Organizing and managing are the critical elements of curation—making things easy to find, understand, and access.
What Is Data Curation?
If curated describes collections of things that are selected and managed to meet the needs of a specific group, then curated data is a collection of datasets that is selected and managed to meet the needs and interests of a specific group of people. Note that the focus here is datasets – files, tables, etc. – that can be accessed and analyzed. The distinction between “collections of data” and “collections of datasets” is subtle but significant.
Data curation, then, is the work of organizing and managing a collection of datasets to meet the needs and interests of a specific group of people. Collecting datasets is only the beginning. That is what we do when we store data in data warehouses or data lakes. But organizing and managing are the essence of data curation. Making datasets easy to find, understand, and access is the purpose of data curation, and that purpose demands well-described datasets. Data curation is a metadata management activity, and data catalogs are essential data curation technology. Data catalogs are rapidly becoming the new “gold standard” for metadata management, making metadata accessible and informative for non-technical data consumers.
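To make the idea of a "well-described dataset" a little more concrete, here is a minimal sketch of what a catalog entry might hold. Everything in it is hypothetical: the `CatalogEntry` fields, the `find_datasets` helper, and the sample dataset names are illustrative assumptions, not the design of any particular data catalog product.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Hypothetical metadata record describing one dataset."""
    name: str
    description: str   # helps consumers understand the dataset
    location: str      # tells consumers where to access it
    tags: set = field(default_factory=set)  # helps consumers find it


def find_datasets(catalog, keyword):
    """Return names of datasets whose description or tags mention the keyword."""
    needle = keyword.lower()
    return [
        entry.name
        for entry in catalog
        if needle in entry.description.lower()
        or needle in {tag.lower() for tag in entry.tags}
    ]


# Illustrative entries only; the names and locations are made up.
catalog = [
    CatalogEntry(
        name="customer_master",
        description="One row per customer with contact details",
        location="warehouse.crm.customer_master",
        tags={"customer", "crm"},
    ),
    CatalogEntry(
        name="web_clicks_raw",
        description="Unprocessed clickstream events",
        location="lake/raw/web_clicks/",
        tags={"web", "clickstream"},
    ),
]

print(find_datasets(catalog, "customer"))
```

The point of the sketch is that the datasets themselves are untouched. What makes them findable, understandable, and accessible is the metadata recorded about them.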
Who Are the Data Curators?
A typical organization has many people doing data curation work with varying degrees of responsibility and corresponding time commitment. Everyone who works with data has the opportunity to curate by sharing their knowledge and experiences. Crowdsourcing of tribal knowledge is an important part of curation practice. Collaborative data management is a necessity in the self-service world, and knowledge sharing is the first step in creating a collaborative culture. Curation collaborators will be large in number, each with a modest level of responsibility and time commitment.
Domain curators have subject expertise in specific data domains such as customer, product, and finance. Domain curators record and share data domain knowledge that helps data analysts understand the nature of the data that they work with. The number of domain curators is substantially smaller than the number of curation collaborators, and each has a greater level of responsibility and time commitment.
Most organizations will have one or very few lead curators who are responsible for moderating data catalog content much as wiki moderators manage content. Lead curators have a high level of responsibility for metadata and catalog quality – responsibilities that require substantial time commitment.
What about Data Stewards?
I am frequently asked about the differences between data curators and data stewards: Are they two names for the same role? Can data stewards be your data curators? Why do we need both stewards and curators? These are good questions that are important when considering how to fit data curation into your organization. It is practical for the same individual to have both curation and stewardship responsibilities, especially at the level of domain curators. It is important, however, to recognize curation and stewardship as distinctly different roles, each with a unique perspective about managing data. Some of the key differences are shown in the table below.
The roles of data steward and data curator are related and somewhat overlapping. Stewards and curators working together is a combination that maximizes the value of data across all use cases from enterprise reporting to analytics and data science. Stewardship and curation are both metadata management activities and data governance roles. Data curation and data cataloging are important elements of modern data governance. They are complementary disciplines that are both essential in the age of self-service analytics.
Data curation is a metadata management activity and data cataloging is metadata management technology. But both approach metadata very differently from the metadata management practices of the past. In my next blog I’ll address the question: Where Do Data Catalogs Fit in Metadata Management?