The New World of Data Curation
As important as the role of data curator is to self-service analytics, data curation best practices are still in their infancy
In part one of this article, we looked at the role curation has played in the evolution of our society. Advances in communications technology created an environment ripe for changing how curation is done. As new technology allowed for more publishers and created a higher volume of content, information curation thrived.
We are now seeing a similar transformation in the world of data, marked by tension between the old world (single-source-of-truth data warehouses with top-down data governance) and the new world (distributed, self-service analytics with grassroots management). In organizations of all sizes, self-service reporting and analysis is becoming the norm. Where people were previously given data in the form of a packaged report, today they are free to discover and explore data on their own.
Exploring the New World of Data Curation
In today’s data-driven world, struggling with high volumes of often redundant data, we are searching for our Wikipedia. Self-service analytics is a fragmented reality. There is no single source of truth. The data warehouse, which was once considered to be that source of truth, now shares the stage with data from files, streams, wikis, data dictionaries, metadata management tools, raw web content, emails, chats and many other forms of data communication.
Organizations know that to make trust-based decisions, these sources of data must be used with data knowledge. Data knowledge is an understanding of the nuances of the underlying physical data assets. It comprises business descriptions and explanations of how the data has been used historically. It includes an understanding of the quality of data and how applicable the data might be for different use cases.
The goal of a self-service analytics organization, to give employees a one-stop shop for data knowledge, requires that the organization be able to control the quality of its data. Data curation is a technique for documenting and ensuring the quality of data so the business can make decisions it trusts.
Finding the Path to Successful Data Curation
Getting started with data curation, however, can be a challenging endeavor due to the broad distribution of data knowledge across an organization. Pieces of data knowledge are often spread across wiki pages, data dictionaries, email, chat, social and raw web content, which the data curator needs to identify, understand and propagate. Some challenges for the data curator include:
- Documentation: the priorities of what data to document aren’t initially obvious. Data knowledge is distributed in too many places, and the data is changing too rapidly. The velocity of data growth outpaces the rate at which people can be assigned to document knowledge of the data.
- Propagation: it is hard to make data knowledge easily discoverable at the right time. How often a data asset will be used is not always predictable, and data assets are often replicated in different formats and in multiple storage locations.
- Data Quality: it is hard to distinguish high-quality data from resources that are inaccurate or stale. Doing so often requires subject matter expertise in the business function associated with that data, so it can be nearly impossible for a non-expert to know which data source is the accurate one.
- Data Definitions: even an accurate, up-to-date data asset can be used in different ways by different teams. For instance, a product team might analyze clickstream data in two-minute intervals whereas the marketing team considers two-day sessions. Both methods are valid but result in different numbers for the same metric.
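The data-definitions challenge can be made concrete with a small sketch. It assumes, purely for illustration, that each team groups clicks into sessions using an inactivity gap (two minutes for the product team, two days for marketing). The `count_sessions` helper and the sample timestamps are hypothetical and not taken from any particular tool.

```python
from datetime import datetime, timedelta


def count_sessions(timestamps, gap):
    """Count sessions, starting a new one whenever the time since
    the previous event is longer than `gap`."""
    ordered = sorted(timestamps)
    if not ordered:
        return 0
    sessions = 1
    for previous, current in zip(ordered, ordered[1:]):
        if current - previous > gap:
            sessions += 1
    return sessions


# Hypothetical click timestamps for a single visitor.
clicks = [
    datetime(2024, 1, 1, 9, 0),
    datetime(2024, 1, 1, 9, 1),
    datetime(2024, 1, 1, 9, 10),
    datetime(2024, 1, 2, 14, 0),
    datetime(2024, 1, 5, 8, 0),
]

# Same data, same metric name, two valid definitions.
product_sessions = count_sessions(clicks, timedelta(minutes=2))
marketing_sessions = count_sessions(clicks, timedelta(days=2))

print("Product team sessions:  ", product_sessions)    # 4
print("Marketing team sessions:", marketing_sessions)  # 2
```

Neither number is wrong. What a curator can add is the documented definition behind each one, so that a reader of either report knows which window was applied.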
Humans and Machines Working Together
The consumer world of the Internet faced similar challenges. Initially we expected machines to do the lion’s share of the work: organize all Internet content automatically, ensure that it was accurate, and propagate it to the masses. What we now know is that curation needs human input, especially when it comes to evaluating and labelling the quality of content. We can’t completely automate curation. Organizations on a self-service analytics path need to identify where humans must offer input and where computers can document automatically.
Here are four steps to finding your organization’s optimal balance between humans and machines in data curation:
- Browse the Data. Machines can be effectively trained to pattern match and find the most important data. Utilizing machine learning can save a data curator, and an organization, a tremendous amount of time.
- Create Context for Data Knowledge. The key to creating context is to document the data effectively and provide the most useful information possible to enable appropriate use. This is not just about documenting technical information (such as columns, labels and tables), but about creating context that conveys how people should use the information. There may be hundreds of different uses of one data source, and definitions can differ between them. For example, when defining what constitutes a “US state,” the Shipping Department might exclude Hawaii because it is a shipping exception, while the Finance Department would include it in a list of states as a revenue source.
- Share the Data Knowledge. You also need to make the data discoverable via push methods such as emails and alert notifications, as well as just-in-time methods such as a suggestion-oriented query tool, and pull methods such as data catalogs.
- Update the Data Knowledge. Finally, you need to propagate changes to the data knowledge; that is, you need to stay on top of technical changes to the data. For example, as a data curator updates a column label, it should be automatically updated within the other tables and sources that use that same data source. This is difficult to do without technology.
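As a rough illustration of the last step, the sketch below keeps one curated label per source column and records which downstream assets use that column, so a single update by the curator is visible everywhere the column appears. The `Catalog` class, its methods, and the column and asset names are all hypothetical. A real data catalog would persist this information and notify the owners of affected assets.

```python
from dataclasses import dataclass, field


@dataclass
class Catalog:
    """A toy in-memory catalog (hypothetical, for illustration only)."""
    labels: dict = field(default_factory=dict)  # source column -> curated label
    usages: dict = field(default_factory=dict)  # source column -> downstream assets

    def register_usage(self, source_column, asset):
        """Record that a report, table or dashboard uses a source column."""
        self.usages.setdefault(source_column, set()).add(asset)

    def update_label(self, source_column, new_label):
        """Store the new label once and return the assets that now show it."""
        self.labels[source_column] = new_label
        return sorted(self.usages.get(source_column, set()))

    def label_for(self, source_column):
        """Look up the curated label, falling back to the raw column name."""
        return self.labels.get(source_column, source_column)


catalog = Catalog()
catalog.register_usage("orders.cust_id", "weekly_sales_report")
catalog.register_usage("orders.cust_id", "churn_dashboard")

# The curator edits the label in one place...
affected = catalog.update_label("orders.cust_id", "Customer ID")

# ...and every asset that looks the label up sees the change.
print(affected)                             # ['churn_dashboard', 'weekly_sales_report']
print(catalog.label_for("orders.cust_id"))  # Customer ID
```

The design choice here is to store the label by reference rather than copying it into each asset, which is one simple way to avoid stale copies. Other approaches, such as pushing updates to each copy, would also fit the description above.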
As important as the role of data curator is to self-service analytics, data curation best practices are still in their infancy. Organizations are experimenting with how to integrate machines into the data curation process, yet still give data curators the appropriate amount of control.