The Data Scientist’s Guide to the Data Catalog
By Joe Hilleary
Published on July 19, 2022
These days, data scientists are in high demand. A shortage of talent, especially at legacy firms, has left those firms struggling to compete with new digital disrupters. Across the country, data scientists have an unemployment rate of 2% and command an average salary of nearly $100,000. At the same time, organizations have struggled to realize all of the promised benefits of these expensive experts.
As they attempt to put machine learning models into production, data science teams encounter many of the same hurdles that plagued data analytics teams in years past:
Finding trusted, valuable data is time-consuming
Obstacles such as user roles, permissions, and approval requests prevent speedy data access
Understanding data takes time, and demands answers to questions like:
Why was this data set created in the first place? Who made it?
Is this data trustworthy? How do I know it can be trusted?
How can I use this data? Who has used it in the past? What is their advice?
For these reasons, finding and evaluating data is often time-consuming. The result? Instead of spending most of their time leveraging their unique skillsets and algorithmic knowledge, data scientists are stuck sorting through data sets, trying to determine what’s trustworthy and how best to use that data for their own goals.
Fortunately, just as data catalogs help solve the problems of discovery and exploration for data analysts, they can aid data science teams.
The Data Science Workflow
The traditional data science workflow, as defined by Joe Blitzstein and Hanspeter Pfister of Harvard University, contains five key steps:
Ask a question
Get the data
Explore the data
Model the data
Communicate and visualize the results
A data catalog can assist directly with every step except model development. Even then, information from the data catalog can be transferred to a model connector, allowing data scientists to benefit from curated metadata within those platforms.
How Data Catalogs Help Data Scientists Ask Better Questions
A data scientist begins by asking a big question. This question captures what the data scientist wants to model or predict. It requires them to frame their project and articulate their goals. Corporate data scientists rely on the data their organizations possess, so their questions are limited by the data that’s available to them. That context is important: It’s no good asking a question that they don’t have the means to answer.
A data catalog provides a holistic, navigable view of an organization’s data. As a result, a data scientist can quickly browse through curated data to determine which questions they can answer, and which questions would require the acquisition of additional data.
For data scientists, a data catalog can also serve as a source of inspiration. Because a catalog allows domain experts to comment on and describe the data assets they know best, those experts can suggest potential questions a given data set might be able to answer. In this way, a data scientist benefits from business knowledge that they might not otherwise have access to. The catalog combines the domain experts’ subject matter expertise with the data scientists’ statistical and coding expertise.
Finally, a data catalog can help data scientists find answers to their questions (and avoid re-asking questions that have already been answered). Modern data catalogs surface a wide range of data asset types. For instance, Alation can return wiki-like articles, conversations, and business intelligence objects, in addition to traditional tables.
For example, a new data scientist who is curious about which customers are most likely to be repeat buyers might search for customer data, only to discover an article documenting a previous project that answered their exact question. Documenting data science projects this way in a data catalog saves time and prevents data scientists from wasting resources on duplicate projects. It also empowers newcomers to onboard more quickly, with curated expertise to aid rapid understanding.
Get the Data
Increasingly, data catalogs provide not only the location of data assets, but also the means to retrieve them. A data scientist can use the catalog to identify several assets that might be relevant to their question and then use a built-in SQL query editor, like Compose, to access the data.
Query editors embedded directly into data catalogs have a few advantages for data scientists. They leverage the metadata the catalog stores about the assets being queried, allowing them to suggest column names, surface warnings or endorsements for particular data assets, or even pull up the relevant data usage policies. This creates a second layer of governance to ensure the data scientist is using the right data in ways that are permitted.
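To make the idea of a metadata-aware query editor concrete, here is a minimal sketch in Python. It is not how Compose or any real catalog is implemented: the `CATALOG` dictionary, the table names, and the `review_query` and `suggest_columns` helpers are all hypothetical, and the table detection is a naive regular expression rather than a real SQL parser.

```python
import re

# Hypothetical catalog metadata keyed by table name (illustrative only).
CATALOG = {
    "sales.customers": {
        "columns": ["customer_id", "signup_date", "region"],
        "flag": "endorsed",
        "note": "Maintained by the CRM team.",
    },
    "sales.customers_old": {
        "columns": ["id", "created"],
        "flag": "deprecated",
        "note": "Use sales.customers instead.",
    },
}

# Naive assumption: table names are whatever follows FROM or JOIN.
TABLE_PATTERN = re.compile(r"\b(?:from|join)\s+([\w.]+)", re.IGNORECASE)


def review_query(sql, catalog=CATALOG):
    """Return one catalog note per table referenced after FROM or JOIN."""
    notes = []
    for table in TABLE_PATTERN.findall(sql):
        entry = catalog.get(table.lower())
        if entry is None:
            notes.append(f"{table}: not found in catalog")
        else:
            notes.append(f"{table}: {entry['flag']} ({entry['note']})")
    return notes


def suggest_columns(table, prefix, catalog=CATALOG):
    """Suggest column names of a table that start with the typed prefix."""
    columns = catalog.get(table.lower(), {}).get("columns", [])
    return [c for c in columns if c.startswith(prefix.lower())]


query = "SELECT id FROM sales.customers_old JOIN sales.customers ON 1=1"
notes = review_query(query)
for note in notes:
    print(note)
print(suggest_columns("sales.customers", "s"))
```

The point of the sketch is only that warnings, endorsements, and column suggestions can all come from metadata the catalog already stores, so the person writing the query sees them without leaving the editor.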
Explore the Data
Though most data scientists will ultimately want to plot the data directly in a Python or R notebook to play around with it, data catalogs give them a jump start on the exploration phase.
Data catalogs provide much more than business knowledge about data assets; they also compile a range of other information, including statistical summaries. Rather than having to export the data to determine attributes such as the cardinality, mean, and number of nulls, a data scientist can see all of that information right on the asset’s profile. So instead of spending time recalculating that information (using pandas, NumPy, or other statistics packages), the data scientist can move on to the interesting parts of modeling their data more quickly.
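For reference, the kind of column profile described above can be sketched in a few lines of plain Python. This is an illustrative, dependency-free example, not a catalog's actual profiling logic; the `profile_column` helper and the sample values are hypothetical. With pandas, the equivalent figures would typically come from `Series.nunique()`, `Series.isna().sum()`, and `Series.mean()`.

```python
from statistics import mean


def profile_column(values):
    """Summarize one column: row count, null count, cardinality, and mean."""
    non_null = [v for v in values if v is not None]
    numeric = [
        v for v in non_null
        if isinstance(v, (int, float)) and not isinstance(v, bool)
    ]
    return {
        "rows": len(values),
        "nulls": len(values) - len(non_null),
        # Cardinality here means the number of distinct non-null values.
        "cardinality": len(set(non_null)),
        # Report a mean only when every non-null value is numeric.
        "mean": mean(numeric) if numeric and len(numeric) == len(non_null) else None,
    }


# Hypothetical order amounts with one missing value.
order_amounts = [20.0, 35.0, None, 20.0, 45.0]
profile = profile_column(order_amounts)
print(profile)
```

A catalog that stores these figures on the asset's profile spares each data scientist from rerunning this step against every table they are evaluating.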
Modern data catalogs also facilitate data quality checks. Historically restricted to the purview of data engineers, data quality information is essential for all user groups to see. Alation’s Open Data Quality Initiative is a great example of a data catalog partnering with purpose-built data quality tools to surface data quality metrics for all users, freeing engineers from playing data middlemen and gatekeepers.
Communicate and Visualize Results
Finally, data catalogs can help data scientists promulgate the results of their projects. As referenced above, modern data catalogs often support an asset type, like an article or a project page, which allows data scientists to capture their work as its own discoverable entity. The data scientist can write text, copy and paste code, and embed visualizations into these assets, creating a living document others can reference in the future.
Cataloging data science projects in this way is critical to helping them generate value for the company. Instead of findings disappearing into thin air after a presentation, the results and the methodology will be available and discoverable indefinitely – an invaluable asset to the corpus of knowledge about data in an organization.
Closing Thoughts
Data scientists often have different requirements for a data catalog than data analysts. Although there is significant overlap between their workflows, data scientists often rely on raw data stored in a data lake rather than the highly structured data in a data warehouse, which is the realm of the analyst. Thus, it’s important to select a data catalog that can meet the technical needs of both groups.
A modern data catalog that extracts metadata from a wide range of sources and supports many different data and asset types allows a single tool to serve both data scientists and analysts, along with a range of other users. In fact, by bringing together and documenting the expertise of all groups in a shared platform, a data catalog can help data analysts, engineers, and scientists learn from one another.
See how data scientists and analysts collaborate in the data catalog to deliver customer-centric marketing, like the team at Albertsons did, building a personalized digital flyer in days. Read the case study.