A data catalog provides the capability to find, understand, and trust the data in your data lake, resulting in greater data discoverability and quicker insight.
A data lake solves one of the basic problems of data management by providing a single source of data that can be queried and managed. But most implementations of a data lake stop there: they lack the tools users need to manipulate the data in the ways necessary to derive insight from it.
What is a Data Lake?
Instead of storing data in multiple silos, a data lake is a single repository. Most data lakes are built using Hadoop, an open source framework for distributed storage and processing of big data. Hadoop runs on clusters of commodity machines in which each node of the cluster contributes its own storage. Thanks to its scalable distributed storage layer, general-purpose parallel processing layer, and ability to be deployed on commodity hardware, Hadoop is in many situations less expensive than competing solutions.
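To make the storage side concrete, here is a minimal sketch of landing a raw file in HDFS using the Python `hdfs` (WebHDFS) client; the namenode URL, user, and paths are placeholder assumptions for your own cluster.

```python
from hdfs import InsecureClient

# Connect to the cluster's WebHDFS endpoint (host and port are placeholders;
# Hadoop 3 exposes WebHDFS on port 9870 by default).
client = InsecureClient("http://namenode.example.com:9870", user="hadoop")

# Land a raw file in the lake exactly as it arrived; no schema is applied yet.
client.upload("/data/raw/events/2024-01-15.json", "events.json")

# HDFS manages the file's blocks and replicas across the commodity nodes.
print(client.list("/data/raw/events"))
```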
A data lake stores data in its natural format and allows it to be transformed into whatever format is required for reporting, visualization, analytics, or machine learning. The data stored within a data lake can include structured data from relational databases (rows and columns), semi-structured data (CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs), and binary data (images, audio, and video). A defining characteristic of a data lake is that the schema and data requirements are not fixed upfront but applied at the time the data is queried, an approach often called schema-on-read. The Hadoop Distributed File System (HDFS) manages the various files in the data lake.
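As a sketch of schema-on-read, the PySpark snippet below reads raw JSON files straight out of HDFS and only derives a schema at query time; the HDFS path and the field names are assumptions for illustration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

# The JSON files were stored in their natural format; Spark infers a
# schema only now, at read time (schema-on-read).
events = spark.read.json("hdfs:///data/raw/events/*.json")
events.printSchema()

# Shape the same raw data into the form this particular report needs
# (event_date and event_type are hypothetical field names).
daily_counts = events.groupBy("event_date", "event_type").count()
daily_counts.show()
```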
A data lake may rely on one or more different approaches to processing its data. The first version of Hadoop did batch processing with MapReduce, which required users to have the technical expertise to write jobs and manage the resources stored in the data lake. MapReduce also works in a linear fashion, writing intermediate results to disk before each subsequent stage can read them. Spark is typically faster because it keeps intermediate data in memory instead of round-tripping through disk drives. Spark can act upon multiple operations simultaneously across partitions (data parallelism) and provides fault tolerance through its resilient distributed datasets (RDDs), which can recompute lost partitions from their lineage.
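The difference is easiest to see in code. In the hedged PySpark sketch below, one dataset is cached in memory and then reused by several independent operations, where a chain of MapReduce jobs would re-read from disk at each step; the paths and column names are illustrative.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("in-memory-pipeline").getOrCreate()

# Read once, then keep the dataset in memory across operations.
logs = spark.read.csv("hdfs:///data/raw/logs/*.csv", header=True).cache()

# Several operations reuse the cached, partitioned data in parallel;
# a MapReduce equivalent would write and re-read disk between jobs.
errors = logs.filter(F.col("level") == "ERROR").count()
by_host = logs.groupBy("host").count()
slowest = logs.orderBy(F.col("latency_ms").cast("long").desc()).limit(10)

by_host.show()
slowest.show()
print(f"errors: {errors}")
```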
Apache Hive is an open source software project originally developed by Facebook and now used by companies such as Netflix, the Financial Industry Regulatory Authority (FINRA), and Amazon. Apache Hive is built on top of Apache Hadoop and provides a SQL-style query language (HiveQL) for querying the data that resides in distributed databases and file systems that integrate with Hadoop. With HiveQL, analysts no longer need to write queries against the low-level MapReduce Java API. By implementing HiveQL within your data lake, you can leverage existing knowledge of SQL as well as bring SQL applications over to Hadoop. Likewise, Presto is an open source distributed SQL query engine designed for running interactive analytic queries against data sources of any size, from gigabytes to petabytes.
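As a hedged sketch of what this looks like in practice, the snippet below issues a HiveQL query from Python via PyHive; the HiveServer2 host, port, username, and table name are assumptions for your environment.

```python
from pyhive import hive

# Connect to HiveServer2 (host and port are placeholders; 10000 is the default).
conn = hive.connect(host="hiveserver.example.com", port=10000, username="analyst")
cursor = conn.cursor()

# Plain SQL-style HiveQL instead of hand-written MapReduce Java code.
cursor.execute("""
    SELECT event_type, COUNT(*) AS events
    FROM raw_events
    GROUP BY event_type
    ORDER BY events DESC
    LIMIT 10
""")
for row in cursor.fetchall():
    print(row)
```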
Governing a Data Lake on Hadoop
Over time, improvements to Hadoop have made data lakes more responsive. By itself, though, a data lake does little to make data useful across your organization. In fact, many companies are finding that it makes sense to treat the data lake as just one system among many in a broader technical infrastructure, alongside business analytics or data mining tools. In particular, adding a data catalog (a means of crawling and indexing data assets stored across different physical repositories) makes it easier to find, understand, and work with the data you have. If you want to organize your data lake, consider a data catalog: it simplifies data access and makes data collaboration possible. Not only can you trace the lineage behind a particular BI report or visualization, but you can also participate in an ongoing dialog around the value of individual data sets and perform natural language search across all your data assets.
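To illustrate the crawl-and-index idea (not any particular product), here is a hypothetical sketch: walk the lake's directories, record lightweight metadata per data set, and keep an index that simple searches can run against. Every name and field here is invented for illustration, and a real catalog would add schemas, lineage, and richer search.

```python
from hdfs import InsecureClient

# Placeholder connection; see the earlier HDFS sketch.
client = InsecureClient("http://namenode.example.com:9870", user="hadoop")

def crawl(root):
    """Walk the lake and collect lightweight metadata per file (hypothetical)."""
    catalog = []
    for dirpath, _dirs, files in client.walk(root):
        for name in files:
            path = f"{dirpath}/{name}"
            status = client.status(path)
            catalog.append({
                "path": path,
                "format": name.rsplit(".", 1)[-1],
                "bytes": status["length"],
                "modified": status["modificationTime"],
            })
    return catalog

def search(catalog, term):
    """Naive keyword search over indexed paths (a stand-in for NL search)."""
    return [e for e in catalog if term.lower() in e["path"].lower()]

index = crawl("/data")
for entry in search(index, "events"):
    print(entry["path"], entry["bytes"])
```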
Data lakes are becoming increasingly central to enterprise data strategy and can help address the challenges of today’s big data landscape. To optimize the benefits of a data lake, add a data catalog: it provides the capability to find, understand, and trust the data in your data lake, resulting in greater data discoverability and quicker insight.