What is Data Curation?
Data curation is a way of managing data that makes it more useful to people engaged in data discovery and analysis.
According to the University of Illinois’ Graduate School of Library and Information Science, “Data curation is the active and on-going management of data through its lifecycle of interest and usefulness to scholarship, science, and education; curation activities enable data discovery and retrieval, maintain quality, add value, and provide for re-use over time.”
What is the Role of a Data Curator?
In practice, data curation is more concerned with maintaining and managing metadata than with the database itself. To that end, a large part of the process revolves around ingesting metadata such as schemas, table and column popularity, usage frequency, and the most common joins, filters, and queries. Data curators not only create, manage, and maintain data, but may also help determine best practices for working with that data. They often present the data in a visual format such as a chart, dashboard, or report.
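As a rough illustration of what ingesting popularity metadata can look like, the sketch below counts how often each table appears in a query log and which pairs of tables are queried together. The log contents, the `tables_in`, `table_popularity`, and `top_joins` helpers, and the regular expression are all assumptions made for this example; a real catalog would rely on a proper SQL parser and the database's own query history.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical query log; in practice this would come from the
# database's query history rather than a hard-coded list.
QUERY_LOG = [
    "SELECT * FROM orders o JOIN customers c ON o.customer_id = c.id",
    "SELECT id FROM customers WHERE region = 'EU'",
    "SELECT * FROM orders o JOIN customers c ON o.customer_id = c.id",
    "SELECT sku FROM products",
]

# Naive pattern: the identifier that directly follows FROM or JOIN.
# It ignores subqueries, CTEs, and quoted identifiers.
TABLE_PATTERN = re.compile(
    r"\b(?:FROM|JOIN)\s+([A-Za-z_][A-Za-z0-9_.]*)", re.IGNORECASE
)


def tables_in(query):
    """Return the lower-cased table names referenced by one query."""
    return [name.lower() for name in TABLE_PATTERN.findall(query)]


def table_popularity(queries):
    """Count the number of queries that touch each table."""
    counts = Counter()
    for query in queries:
        counts.update(set(tables_in(query)))
    return counts


def top_joins(queries):
    """Count how often each pair of tables appears in the same query."""
    counts = Counter()
    for query in queries:
        tables = sorted(set(tables_in(query)))
        counts.update(combinations(tables, 2))
    return counts


popularity = table_popularity(QUERY_LOG)
joins = top_joins(QUERY_LOG)
print(popularity.most_common())
print(joins.most_common(1))
```

Counting each table once per query keeps a single query with repeated self-joins from inflating a table's popularity; whether that is the right choice depends on how the organization defines "popular."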
Data curation starts with the “data set.” These data sets are the atoms of data curation. Determining which of these data sets are the most useful or relevant is the job of the data curator. Being able to present the data in an effective manner is also extremely important. While some rules of thumb and best practices apply, the data curator must make an educated decision about which data assets are appropriate to use.
It’s important to know the context of the data before it can be trusted. Data curation uses such arbiters of modern taste as lists, popularity rankings, annotations, relevance feeds, comments, articles and the upvoting or downvoting of data assets to determine their relevancy.
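One way such signals might be combined is sketched below: each data asset gets a score built from its net votes and a damped usage count, and assets are ranked by that score. The `DataAsset` fields, the `relevance_score` formula, and the sample assets are assumptions for illustration only; no particular catalog is known to use this weighting.

```python
import math
from dataclasses import dataclass


@dataclass
class DataAsset:
    """Hypothetical record of the curation signals for one data asset."""
    name: str
    upvotes: int = 0
    downvotes: int = 0
    query_count: int = 0


def relevance_score(asset):
    """Net votes plus log-damped usage (an assumed, illustrative formula)."""
    net_votes = asset.upvotes - asset.downvotes
    # log1p keeps heavily queried assets from drowning out the votes.
    return net_votes + math.log1p(asset.query_count)


def rank_assets(assets):
    """Return the assets ordered from most to least relevant."""
    return sorted(assets, key=relevance_score, reverse=True)


assets = [
    DataAsset("sales_raw", upvotes=0, downvotes=3, query_count=1000),
    DataAsset("sales_clean", upvotes=5, downvotes=1, query_count=100),
    DataAsset("sales_certified", upvotes=10, downvotes=0, query_count=0),
]

ranked_names = [asset.name for asset in rank_assets(assets)]
print(ranked_names)
```

Here the heavily queried but downvoted asset ranks last, which reflects the idea that context and community judgment, not raw usage alone, should inform trust.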
The New Way to Curate Data
Data curation was much more manageable when enterprises had only a few data sources to draw on. Today, with the proliferation of big data, enterprises have many more disparate sources, which makes it much harder to curate data consistently. Further complicating the problem, much of today’s data is created in an ad hoc way that can’t be anticipated by the people who intend to use it for analysis.
In a modern data catalog, this metadata is collected along with information about the assets themselves and organized within a catalog interface that is easier to search and browse than legacy systems.
Data curation can also be described as the process of adding value to data. A data-driven organization will naturally want to maximize the value of that data. Therefore, establishing people, processes and tools for data curation should be a part of any technical manager’s plans. This may mean establishing strict rules about which data can and should be used, as well as putting business rules or other metrics in place that apply to all data sets no matter where they physically reside.
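A minimal sketch of location-independent business rules follows: every rule is a small function over a data set's metadata, and the same list of rules is applied to every data set regardless of where it is stored. The rule names, the metadata keys (`owner`, `description`, `last_refreshed`, `location`), and the 30-day freshness threshold are assumptions chosen for this example.

```python
from datetime import date

MAX_AGE_DAYS = 30  # assumed freshness threshold


def has_owner(meta, today):
    return bool(meta.get("owner"))


def has_description(meta, today):
    return bool((meta.get("description") or "").strip())


def is_fresh(meta, today):
    return (today - meta["last_refreshed"]).days <= MAX_AGE_DAYS


# The same rules apply to every data set, whatever its location.
RULES = [
    ("has_owner", has_owner),
    ("has_description", has_description),
    ("is_fresh", is_fresh),
]


def failed_rules(meta, today):
    """Return the names of the rules that this data set violates."""
    return [name for name, rule in RULES if not rule(meta, today)]


today = date(2024, 6, 1)  # fixed date so the example is reproducible
datasets = [
    {"name": "orders", "location": "warehouse", "owner": "finance",
     "description": "Confirmed orders", "last_refreshed": date(2024, 5, 20)},
    {"name": "clickstream", "location": "data_lake", "owner": "",
     "description": "Raw web events", "last_refreshed": date(2024, 1, 1)},
]

report = {d["name"]: failed_rules(d, today) for d in datasets}
print(report)
```

Because the rules read only metadata, the check never needs to know whether a data set lives in a warehouse, a lake, or a spreadsheet.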
Data curation is a necessary endeavor for any organization attempting to enable self-service analytics because it provides data consumers with a faster on-ramp to the data that they need to make intelligent business decisions that impact the enterprise.