What Is Data Curation? Why Is It Important?
By Dave Wells
Published on February 13, 2020
This blog was last updated in September 2023
Data curation is a term that has recently become a common part of data management vocabulary. Data curation is important in today’s world of data sharing and self-service analytics, but I think it is a frequently misused term. When speaking and consulting, I often hear people refer to data in their data lakes and data warehouses as curated data, believing that it is curated because it is stored as shareable data. Curating data involves much more than storing data in a shared database.
What Is Curation?
Let’s set data aside for a moment and consider the meaning and the activities of curating. The traditional use of the word is associated with collections of artifacts in a museum and works of art in a gallery. More recently we’ve started to use the term to describe managed collections of many kinds such as curated content at a website, curated music and videos available through streaming services, and curated apps through download services. Wired.com has described Apple’s App Store as “curated computing.”
Curation is the work of organizing and managing a collection of things to meet the needs and interests of a specific group of people. Collecting things is only the beginning. Organizing and managing are the critical elements of curation—making things easy to find, understand, and access.
What Is Data Curation?
Data curation, then, is the process of organizing and managing a collection of datasets to meet the needs and interests of a specific group of people. If curated describes collections of things that are selected and managed to meet the needs of a specific group, then curated data is a collection of datasets that is selected and managed to meet the needs and interests of a specific group of people. Note that the focus here is datasets – files, tables, etc. – that can be accessed and analyzed. The distinction between “collections of data” and “collections of datasets” is subtle but significant. Collecting datasets is only the beginning. That is what we do when we store data in data warehouses or data lakes. But organizing and managing are the essence of data curation. Making datasets easy to find, understand, and access is the purpose of data curation, and that purpose demands well-described datasets. Data curation is a metadata management activity and data catalogs are essential data curation technology. Data catalogs are rapidly becoming the new “gold standard” for metadata management, making metadata accessible and informative for non-technical data consumers.
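To make the idea of a "well-described dataset" concrete, here is a minimal sketch of what a catalog record and a keyword search over it might look like. The `DatasetEntry` fields, the `find_datasets` helper, and the sample datasets are all hypothetical, invented for illustration; real data catalogs store far richer metadata.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetEntry:
    """One catalog record describing a dataset (hypothetical schema)."""
    name: str
    description: str
    owner: str
    location: str  # where to access it, e.g. a table name or file path
    tags: list[str] = field(default_factory=list)


def find_datasets(catalog: list[DatasetEntry], keyword: str) -> list[DatasetEntry]:
    """Return entries whose name, description, or tags mention the keyword."""
    needle = keyword.lower()
    return [
        entry for entry in catalog
        if needle in entry.name.lower()
        or needle in entry.description.lower()
        or any(needle in tag.lower() for tag in entry.tags)
    ]


# Illustrative sample records, not real datasets.
catalog = [
    DatasetEntry(
        name="customer_orders",
        description="One row per order, refreshed nightly from the order system.",
        owner="sales-analytics",
        location="warehouse.sales.customer_orders",
        tags=["customer", "sales"],
    ),
    DatasetEntry(
        name="web_clickstream_raw",
        description="Unprocessed page-view events.",
        owner="web-team",
        location="lake/raw/clickstream/",
        tags=["web", "raw"],
    ),
]

matches = find_datasets(catalog, "customer")
print([entry.name for entry in matches])  # ['customer_orders']
```

The point of the sketch is that both datasets are stored either way; only the description, owner, location, and tags make them findable and understandable.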
Why Is Data Curation Important?
Data curation involves managing, labeling, and organizing data to ensure its quality, accessibility, and usability. Organizations deal with a continuous influx of internal and external data — from traditional business applications or cutting-edge IoT devices. It also arrives in a mix of structured, unstructured, and semi-structured formats.
Without proper data curation, organizations risk drowning in data, making it difficult to track datasets and impeding users' access to critical information. This leads to wasted time and resources spent on data searches, compromised analytics, erroneous decision-making, missed opportunities, and overall suboptimal business performance. Data curation unites disparate data sources to make them accessible and usable, which safeguards against the pitfalls of data overload and ensures that data remains a valuable asset rather than a potential liability.
What Are the Main Steps of Data Curation?
While the specific steps in data curation can vary, depending on the organization and its data needs, there are several common key steps that form the foundation of this practice:
Data Collection: Gather data from various sources, ensuring it aligns with organizational goals and standards.
Data Ingestion: Process and load collected data into a central repository or data warehouse.
Data Quality Assessment: Evaluate data for accuracy, completeness, consistency, and reliability; clean and transform it as needed.
Metadata Creation: Develop comprehensive metadata to provide context and understanding of the data.
Data Cataloging: Create an organized catalog or inventory of available datasets for easy discovery and access.
Data Access Control: Implement security measures to control and restrict data access based on roles and permissions.
Documentation: Develop documentation, including data dictionaries and transformation logic, to aid users in understanding and using the data effectively.
Data Governance: Establish policies and practices to ensure data compliance with regulations and alignment with organizational goals.
Data Maintenance and Updates: Regularly maintain, update, and refresh data to keep it accurate and relevant.
These are the primary steps in the data curation process. They help organizations manage their data assets effectively, ensure data quality, and make informed decisions based on trusted data.
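As a rough illustration of the data quality assessment step, the sketch below checks two of the properties named above: completeness (how many rows have a value for each required field) and consistency (whether a key that should be unique is repeated). The `assess_quality` function, the field names, and the sample rows are assumptions made for this example, not part of any particular tool.

```python
def assess_quality(rows, required_fields, key_field):
    """Summarize completeness and key uniqueness for a list of dict records."""
    total = len(rows)

    # Completeness: share of rows with a non-empty value for each required field.
    completeness = {}
    for name in required_fields:
        filled = sum(1 for row in rows if row.get(name) not in (None, ""))
        completeness[name] = filled / total if total else 0.0

    # Consistency: collect key values that appear more than once.
    seen, duplicates = set(), set()
    for row in rows:
        key = row.get(key_field)
        if key in seen:
            duplicates.add(key)
        seen.add(key)

    return {
        "row_count": total,
        "completeness": completeness,
        "duplicate_keys": sorted(duplicates),
    }


# Illustrative sample records with one empty email, one missing country,
# and one repeated customer_id.
rows = [
    {"customer_id": 1, "email": "a@example.com", "country": "US"},
    {"customer_id": 2, "email": "", "country": "DE"},
    {"customer_id": 2, "email": "b@example.com", "country": None},
    {"customer_id": 3, "email": "c@example.com", "country": "FR"},
]

report = assess_quality(rows, required_fields=["email", "country"], key_field="customer_id")
print(report)
```

A report like this could be stored alongside the dataset's other metadata, so that consumers see quality information at the moment they discover the data.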
Data Curation vs. Data Management
Data curation and data management are related but distinct processes when it comes to handling data:
Data Curation
Focus: Ensures data quality and usability with activities that improve data’s value and usefulness.
Activities: Includes data cleaning, data transformation, metadata creation, and documentation to make data more understandable and accessible.
Goal: The primary goal is to prepare data for analysis, decision-making, and broader use by improving its quality, context, and relevance.
Scope: Data curation often applies to specific datasets or collections within an organization, emphasizing thorough management of selected data assets.
Role: Data curators maintain and enhance the quality of data, ensuring it aligns with organizational objectives.
Data Management
Focus: As a broader discipline, data management encompasses the entire lifecycle of data, from creation and storage to retrieval and disposal. It deals with data as a strategic asset.
Goal: The primary goal is to establish a comprehensive framework and processes for handling data efficiently and effectively across an organization.
Scope: Addresses all data-related aspects within an organization, covering data governance policies, data infrastructure, and data strategy.
Role: Data managers oversee the strategic aspects of data within an organization, ensuring that data is used strategically to support business objectives.
In short, data curation enhances specific dataset quality and usability, while data management encompasses all data-related activities and assets in an organization, taking a broader strategic perspective. Both are crucial for organizations to extract value from their data while ensuring data integrity and compliance.
Who Are the Data Curators? What Do They Do?
A typical organization has many people doing data curation work with varying degrees of responsibility and corresponding time commitment. Everyone who works with data has the opportunity to curate by sharing their knowledge and experiences. Crowdsourcing of tribal knowledge is an important part of curation practice. Collaborative data management is a necessity in the self-service world, and knowledge sharing is the first step in creating a collaborative culture. Curation collaborators will be large in number, each with a modest level of responsibility and time commitment.
Domain curators have subject expertise in specific data domains such as customer, product, or finance. They record and share domain knowledge that helps data analysts understand the nature of the data they work with. The number of domain curators is substantially smaller than the number of collaborative curators, with a greater level of responsibility and time commitment.
Most organizations will have one or very few lead curators who are responsible for moderating data catalog content much as wiki moderators manage content. Lead curators have a high level of responsibility for metadata and catalog quality – responsibilities that require substantial time commitment.
What about Data Stewards?
I am frequently asked about the differences between data curators and data stewards: Are they two names for the same role? Can data stewards be your data curators? Why do we need both stewards and curators? These are good questions that are important when considering how to fit data curation into your organization. It is practical for the same individual to have both curation and stewardship responsibilities, especially at the level of domain curators. It is important, however, to recognize curation and stewardship as separate roles.
The roles of data steward and data curator are related and somewhat overlapping. Stewards and curators working together is a combination that maximizes the value of data across all use cases from enterprise reporting to analytics and data science. Stewardship and curation are both metadata management activities and data governance roles. Data curation and data cataloging are important elements of modern data governance. They are complementary disciplines that are both essential in the age of self-service analytics.
Data Curation vs Data Governance: A Comparison
Data stewards and data curators play pivotal roles in effective data management, but their work is also closely connected to data governance. Data curation and data governance work in tandem to improve overall data management.
Data Curation Supports Understanding
Data curation involves carefully improving data quality and making it useful for decision-makers. Data curators are like data caretakers, working to clean, enrich, and organize data for better use, saving time for those who need data-driven insights.
Data Governance Defines Data Management Structure
On the other hand, data governance sets the overall guidelines and policies on how data is managed, protected, and leveraged. It's the framework that ensures data is handled in compliance with regulations and aligns with business objectives. Data governance defines roles, responsibilities, and standards for data management, including those of data stewards and curators. Collaboration between curation and governance ensures data is curated in line with organizational rules, boosting data quality, security, and compliance.
When data curation and data governance work well together, they create a strong system for managing data. This system enhances data reliability and accessibility while ensuring that data remains compliant with legal and regulatory requirements. This synergy between curation and governance propels organizations toward better-informed decision-making and improved business outcomes.
Challenges in Data Curation
Data curation is valuable but not without challenges that can hinder seamless data management and use. Here are five common challenges and strategies to overcome them:
1. Data Quality Assurance
Challenge: Maintaining data accuracy and quality is essential but can be demanding, especially with data from various sources.
Solution: Enforce data quality standards, implement data profiling tools, and regularly audit data for inconsistencies and errors.
2. Data Security and Privacy
Challenge: Protecting sensitive data and ensuring compliance with data privacy regulations is crucial and complex.
Solution: Develop robust data governance policies, and implement access controls, encryption, and monitoring tools to safeguard data.
3. Data Volume and Variety
Challenge: Handling vast volumes of diverse data types, from structured to unstructured data, poses challenges in categorization and organization.
Solution: Implement data cataloging tools and automated tagging systems, and prioritize data based on its relevance and value.
4. Data Accessibility
Challenge: Balancing easy access for authorized users with security requirements can be tricky.
Solution: Establish a centralized data repository with role-based access controls and user-friendly data discovery interfaces.
5. Metadata Management
Challenge: Managing comprehensive metadata for all curated datasets can become overwhelming.
Solution: Employ metadata management tools to automate metadata capture and updates, ensuring consistent adherence to metadata standards.
Addressing these five core challenges strategically will significantly enhance an organization's data curation efforts and maximize the value derived from curated data assets.
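The role-based access controls mentioned under data accessibility can be sketched in a few lines. This is a simplified illustration only: the roles, sensitivity labels, dataset names, and the `can_read` and `visible_datasets` helpers are all hypothetical, and a production system would rely on the access controls of its database or catalog platform.

```python
# Hypothetical mapping from role to the sensitivity levels it may read.
ROLE_PERMISSIONS = {
    "analyst": {"public", "internal"},
    "steward": {"public", "internal", "restricted"},
}

# Hypothetical sensitivity label recorded for each dataset in the catalog.
DATASET_SENSITIVITY = {
    "product_catalog": "public",
    "sales_summary": "internal",
    "patient_records": "restricted",
}


def can_read(role: str, dataset: str) -> bool:
    """Return True if the role may read the dataset; unknown names are denied."""
    sensitivity = DATASET_SENSITIVITY.get(dataset)
    allowed = ROLE_PERMISSIONS.get(role, set())
    return sensitivity is not None and sensitivity in allowed


def visible_datasets(role: str) -> list[str]:
    """List the datasets a role can discover, in alphabetical order."""
    return sorted(name for name in DATASET_SENSITIVITY if can_read(role, name))


print(visible_datasets("analyst"))             # ['product_catalog', 'sales_summary']
print(can_read("analyst", "patient_records"))  # False
```

Denying unknown roles and unknown datasets by default is a deliberate choice in this sketch: it keeps discovery easy for authorized users without exposing anything that has not been explicitly labeled.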
Best Practices in Data Curation
Data curation best practices are crucial for maintaining high-quality data assets. These practices involve defining clear objectives, assessing data quality, establishing metadata systems, ensuring security and compliance, promoting collaboration, and adapting to evolving technology.
If you want to know more about how people and machines work together in data curation, read the blog "New Age of Data Curation: Challenges, Best Practices, and Solutions." It's a valuable guide for organizations seeking to optimize their data management efforts.
Real-Life Examples of Data Curation & Its Use Cases
Data curation plays a vital role in improving data management practices and is widely applicable across various industries. Here are some real-life scenarios showcasing its significance:
Scientific Research Advancements
Data curation is vital in scientific research, ensuring data preservation, management, and access. In fields like genomics, climate studies, and particle physics, researchers use curated data repositories for collaboration, faster discoveries, and scientific innovation.
Enhanced Healthcare Services
In healthcare, data curation is important for managing patient information, including medical histories, diagnoses, treatments, and outcomes. By meticulously curating this sensitive data, healthcare providers ensure its accuracy, security, and accessibility. This not only helps patients receive better care but also helps healthcare professionals make well-informed decisions, ultimately improving quality of care and even saving lives.
Robust Financial Integrity
In finance, data curation plays a critical role in managing data about transactions, investments, and loans. Curation activities ensure financial data is secure, well managed, and auditable. That minimizes the risk of fraudulent activities, and financial markets benefit from greater transparency and reliability.
Preservation of Government Records
Within the public sector, data curation is instrumental in preserving essential government records. This includes census data, legal documents, and historical records. By curating and maintaining these records meticulously, governments ensure their availability and usability for future generations, contributing to historical continuity and informed decision-making.
These examples show how useful data curation is across industries. Well-curated data is a catalyst for progress, innovation, and responsible data management.
What’s Next? Future Trends in Data Curation
As data curation continues to evolve, it's crucial to keep up with the emerging trends and technologies that are shaping its future. Artificial Intelligence (AI), automation, blockchain, and advanced metadata management tools are at the forefront, revolutionizing data curation practices. To gain a deeper understanding of this evolving landscape, read the blog post "Where Do Data Catalogs Fit in Metadata Management?"