Since we launched the company, Alation has delivered a unique way to catalog data for the enterprise. Inspired by Google’s ability to automatically catalog the Internet by observing consumer behavior, we built Alation 1.0 to catalog enterprise data by observing analyst behaviors. Our approach was contrasted with the traditional manual wiki of notes and documentation and labeled as a modern data catalog.
Early on, we envisioned, and then learned from our first production customer implementations, that cataloging data alone wasn’t enough. What you need in order to decide how to use data, and whether a particular data asset is fit for your purpose, is captured in every query an analyst has run against a data set, not just in the data or its description.
Queries reveal an analyst’s assumptions, calculations and methods, so cataloging those queries captures more of the tribal knowledge about data consumption than cataloging data alone. With an aggregate view of patterns in the decisions made by many analysts running queries against the same data, you can gain deeper insight into the intent behind the analysis and promote greater reproducibility, transparency and productivity with data.
This usage context is critical to answer data consumers’ and stewards’ questions.
For data consumers like analysts, data scientists and data engineers, this means answering:
- Where can I find data to answer my question?
  - The data you need may be used by several different teams within an enterprise. It’s essential to know where that data lives and whether you can access it.
- Can I trust this data?
  - Knowing who or what is using this data, including applications and users, and how often and how recently it is updated, helps you decide whether to trust it.
- What do I need to know about the data’s semantics in order to use it?
  - Contextual information about how other applications and users have used this data paints a much clearer picture of its semantics.
- Who can answer my questions about this dataset?
  - Knowing who uses the data most often is very important, because it points you to someone with relevant knowledge, often outside of your immediate network.
For data stewards, this means answering:
- Where should I focus my data management efforts?
  - Contextual information about which datasets are used regularly helps you prioritize.
- How do I communicate semantic changes to consumers who rely on this data?
  - Knowing who uses a particular dataset enables targeted communications and avoids sending noise to everyone else.
Experts who understand certain datasets often play the stewardship role of ensuring that data consumers can make accurate and effective use of data. More recently, data governance initiatives have started to assign formal stewardship responsibility.
But it is the usage context around data that is critical to answering data consumers’ and stewards’ questions. Cataloging both data and queries provides insight into:
- How recently the data was used
- How recently the data was updated
- What the mapping is of technical metadata to business descriptions
We decided to address these needs for SQL engines over Hadoop in Alation 4.0. Alation already inventories data and enriches that inventory by cataloging human interaction and behavior around data usage, which it derives by parsing and analyzing database and Hive query logs. On that same technology foundation, we are extending our ability to parse SQL syntax more deeply by connecting directly to the most popular database and Hadoop compute engines on the market. We call this extended capability Alation Connect.
Alation Connect synchronizes metadata, sample data, and query logs into the Alation Data Catalog. Alation’s learning engine then automatically learns the usage context from all of the queries. This rich usage context is what makes our Data Catalog a powerful point of reference for data consumers and data stewards. It is also used across Alation’s applications, such as our SQL query writing interface, Compose, which produces SmartSuggestions. SmartSuggestions are SQL snippets recommended inline as you write a query, making it easier for query writers to find and use data efficiently and accurately.
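To make the idea of usage context concrete, here is a minimal sketch in Python of how a query log could be turned into per-table usage statistics: how often a table is queried, how recently, and by whom. It is an illustration only, not Alation’s learning engine. The log records, the table names, and the helper names `tables_in` and `usage_context` are all hypothetical, and the regular expression is a deliberately naive stand-in for real SQL parsing.

```python
import re
from collections import Counter, defaultdict
from datetime import datetime

# Hypothetical query-log records: (user, timestamp, SQL text).
QUERY_LOG = [
    ("ana", datetime(2016, 11, 1, 9, 0),
     "SELECT region, SUM(amount) FROM sales.orders GROUP BY region"),
    ("ben", datetime(2016, 11, 3, 14, 30),
     "SELECT o.id FROM sales.orders o JOIN sales.customers c ON o.cust_id = c.id"),
    ("ana", datetime(2016, 11, 5, 8, 15),
     "SELECT COUNT(*) FROM sales.orders WHERE amount > 100"),
]

# Naive assumption: a table name is whatever identifier follows FROM or JOIN.
# A real SQL parser would also handle subqueries, CTEs, quoting, and so on.
TABLE_REF = re.compile(r"\b(?:FROM|JOIN)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)


def tables_in(sql):
    """Return the set of (lowercased) table names referenced in one query."""
    return {name.lower() for name in TABLE_REF.findall(sql)}


def usage_context(log):
    """Aggregate query count, last-used time, and per-user counts per table."""
    stats = defaultdict(
        lambda: {"query_count": 0, "last_used": None, "users": Counter()}
    )
    for user, timestamp, sql in log:
        for table in tables_in(sql):
            entry = stats[table]
            entry["query_count"] += 1
            if entry["last_used"] is None or timestamp > entry["last_used"]:
                entry["last_used"] = timestamp
            entry["users"][user] += 1
    return dict(stats)


context = usage_context(QUERY_LOG)
orders = context["sales.orders"]
print(orders["query_count"], orders["last_used"], orders["users"].most_common(1))
```

Even this toy aggregation speaks to several of the questions above: the query count suggests where a steward might focus, the last-used timestamp shows how recently the data was used, and the most frequent user is a candidate to ask about the dataset.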
Alation Connect previously synced metadata and query logs from data storage systems including the Hive Metastore on Hadoop and databases from Teradata, IBM, Oracle, SQL Server, Redshift, Vertica, SAP HANA and Greenplum. We also recently announced support for the Teradata Unified Data Architecture, including QueryGrid and Teradata Aster Analytics.
With the release of Alation 4.0, we are adding connectivity for popular SQL query processing engines over Hadoop. This includes new connectivity to Spark SQL and Presto, as well as Impala and Hive. All connections allow the Alation Data Catalog to automatically inventory and catalog queries, and these engines may be hosted and operated on-premises or in the cloud.
To expand our support for cloud platforms, Alation 4.0 also adds support for IBM Watson DataWorks, in addition to our existing support for Altiscale’s Hadoop-as-a-Service offering.
Further, Alation Compose now benefits from the usage context derived from the query catalogs over Hadoop. Data analysts can now use Alation Compose to write queries against their preferred SQL engine over Hadoop, and SQL queries written in Alation Compose take advantage of the unique optimizations for the chosen engine.
We’re excited to get this new release into the hands of our customers before the end of this calendar year.