It is a little like a data version of the Cambrian Explosion: data-centricity is giving rise to a rich variety of practices, each distinct in its own way. Some of these practices will succeed and develop, while others will no doubt prove to be blind alleys and be forgotten.
Even so, it is already possible to discern some practices that will become new disciplines. And among these, we find data acquisition.
Data Acquisition Defined
What is data acquisition? We define it as follows:
Data acquisition is the set of processes for bringing data created by a source outside the organization into the organization for production use.
Prior to the Big Data revolution, companies were inward-looking in terms of data. Data-centric environments like data warehouses dealt only with data created within the enterprise. But with the advent of data science and predictive analytics, many organizations have realized that enterprise data must be fused with external data to enable and scale a digital business transformation.
This means that processes for identifying, sourcing, understanding, assessing and ingesting such data must be developed.
This brings us to two points of terminological confusion. First, “data acquisition” is sometimes used to refer to data the organization produces itself, rather than (or as well as) data that comes from outside the organization. This usage is misleading, because data the organization produces is, by definition, already acquired.
Second, the term “ingestion” is often used in place of “data acquisition.” Ingestion is merely the process of copying data from outside an environment to inside it, and is much narrower in scope than data acquisition. The term is probably more commonplace because mature ingestion tools exist in the marketplace. (These tools are extremely useful, but ingestion is not data acquisition.)
The Data Acquisition Process
What is exciting about data acquisition to data professionals is the richness of its process.
Consider a basic set of tasks that constitute a data acquisition process:
- A need for data is identified, perhaps with use cases
- Prospecting for the required data is carried out
- Data sources are disqualified, leaving a set of qualified sources
- Vendors providing the sources are contacted and legal agreements entered into for evaluation
- Sample data sets are provided for evaluation
- Semantic analysis of the data sets is undertaken, so they are adequately understood
- The data sets are evaluated against originally established use cases
- Legal, privacy and compliance issues are understood, particularly with respect to permitted use of data
- Vendor negotiations occur to purchase the data
- Implementation specifications are drawn up, usually involving Data Operations who will be responsible for production processes
- Source onboarding occurs, such that ingestion is technically accomplished
- Production ingest is undertaken
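The tasks above form an ordered pipeline that each candidate data source moves through, with notes accumulating along the way. As a minimal sketch (the stage names, `AcquisitionCandidate` class, and example source name are illustrative assumptions, not a standard model), the workflow could be tracked like this:

```python
from dataclasses import dataclass, field
from enum import Enum, auto

class Stage(Enum):
    """Hypothetical stages mirroring the task list above."""
    NEED_IDENTIFIED = auto()
    PROSPECTING = auto()
    QUALIFICATION = auto()
    VENDOR_EVALUATION = auto()
    SEMANTIC_ANALYSIS = auto()
    USE_CASE_EVALUATION = auto()
    COMPLIANCE_REVIEW = auto()
    PURCHASE = auto()
    IMPLEMENTATION_SPEC = auto()
    ONBOARDING = auto()
    PRODUCTION_INGEST = auto()

@dataclass
class AcquisitionCandidate:
    """Tracks one external data source through the pipeline."""
    source_name: str
    stage: Stage = Stage.NEED_IDENTIFIED
    notes: list = field(default_factory=list)

    def advance(self, note: str = "") -> Stage:
        """Move to the next stage, recording an audit note."""
        stages = list(Stage)
        i = stages.index(self.stage)
        if i + 1 < len(stages):
            self.stage = stages[i + 1]
        if note:
            self.notes.append((self.stage.name, note))
        return self.stage

candidate = AcquisitionCandidate("example_weather_feed")
candidate.advance("Prospecting external weather providers")
print(candidate.stage.name)  # PROSPECTING
```

Even a sketch like this makes a key point concrete: each step produces metadata (the audit notes) that must be captured somewhere, not lost in emails.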
There are several things that stand out about this list. The first is that it consists of a relatively large number of tasks. The second is that many different groups will clearly be involved: Analytics or Data Science will likely come up with the need and use cases, whereas Data Governance, and perhaps the Office of General Counsel, will have to give an opinion on legal, privacy and compliance requirements.
An even more important feature of data acquisition is that the end-to-end process sketched out above is only one of a number of possible variations. Other approaches to data acquisition may involve using “open” data sources or configuring tools to scan internet sources, or hiring a company to aggregate the required data. Each of these variations will amount to a different end-to-end process.
The Need For Metadata Tools
Given the characteristics of data acquisition, how should it be handled?
A fairly obvious conclusion is that because it consists of so many tasks and involves so many different organizational units, some form of tooling is required. The variety of metadata that is produced by the overall process — and the need to utilize it both within the process and after acquisition has been completed — makes it difficult to see how spreadsheets, emails and other end-user computing solutions will work.
Remember also that legal, privacy and compliance constraints will be discovered and evaluated in data acquisition. These need to be made available to the enterprise as a whole to prevent accidental misuse of the acquired data.
What we need are tools capable of storing the wide range of metadata that is produced during data acquisition, and a defined data governance process that ensures the process is followed in a standard way and metadata is captured appropriately. Such tools are beginning to appear in the marketplace, and data professionals engaged in data acquisition would do well to implement their processes in such tools.
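To make the idea of capturing and enforcing such metadata concrete, here is a minimal sketch of what a tool might record for each acquired data set, and how permitted-use constraints could be checked before the data is reused. All field names and the example values are illustrative assumptions, not the schema of any particular product:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AcquiredDatasetMetadata:
    """Illustrative metadata record produced by a data acquisition process."""
    dataset_id: str
    vendor: str
    license_terms: str          # summary of the legal agreement
    permitted_uses: frozenset   # uses explicitly allowed by the contract
    renewal_date: str           # when the agreement must be renegotiated

def check_use(meta: AcquiredDatasetMetadata, proposed_use: str) -> bool:
    """Allow a proposed use only if it is explicitly permitted."""
    return proposed_use in meta.permitted_uses

meta = AcquiredDatasetMetadata(
    dataset_id="ext-001",
    vendor="ExampleVendor",
    license_terms="internal analytics only",
    permitted_uses=frozenset({"internal_analytics"}),
    renewal_date="2025-01-01",
)
check_use(meta, "internal_analytics")   # True
check_use(meta, "marketing_outreach")   # False — not a permitted use
```

The design choice here is deliberate: the default answer is "no" unless a use is explicitly permitted, which is exactly how enterprise-wide visibility of compliance constraints prevents accidental misuse of acquired data.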
Modernizing Data Governance With Data Catalogs
Organizations embarking on a data journey to leverage the business value of data across the information supply chain will need to navigate the unique challenges of self-service analytics. And the criticality of metadata management and data catalogs cannot be overstated.
We’re proud to partner with Alation to deliver new methodologies, including one focused on data acquisition, to power a more modern and agile approach to governance. Learn more about this strategic partnership and our commitment to meet the needs of Chief Data Officers and analytics leaders seeking to bring more trust to data-driven decision-making.
This was first posted on First San Francisco Partners blog. Access the original article here.