By Bill Franks
Published on July 28, 2021
As data science processes become operationalized and embedded within business processes, the importance of governing those processes continues to rise. While governance has been a major focus for many years when it comes to managing data, governance focused on data science processes is still far less mature. That needs to change. This blog will discuss a couple of distinct areas of governance that organizations should consider.
When defining governance procedures and guidelines, it is necessary to account for ethical considerations up front. The reason is that once governance policies are put in place, they will incentivize and disincentivize various behaviors. Without accounting for the ethics of those behaviors, there is a risk of creating a terrifically managed and tightly governed process that does horribly unethical things.
Imagine that a company creates a process 1) using well-governed data on people’s behavior that has been 2) prepared with a well-defined and consistent set of computations to 3) generate summary metrics to feed into a model. Furthermore, the company monitors the performance, bias, and consistency of the model while also tightly controlling who has access to the output and what it is used for. Sounds like a very well governed process, doesn’t it? Now imagine that the resulting model uses that behavioral data to predict who is likely to commit a crime so that law enforcement can intervene, as in the movie Minority Report.
Such a process may be well governed, but it is horribly unethical. That is why you cannot separate ethics from governance. To be truly effective, governance must be ethically sound as well as technically rigorous.
It is often necessary to perform audits to prove that a data science process is working appropriately. A common concern is that providing a complete audit requires revealing the “secret sauce” behind the process. This concern is especially common when a 3rd party will perform the audit, but it does not have to be the case.
Consider beverage giant Coca-Cola. Only a couple of people in the entire world know the full recipe for a bottle of Coke, and none of those people have a regulatory oversight role. Yet, people are still comfortable that Coke products are safe to enjoy. Why is that? First, while the exact mix of ingredients in the recipe may not be known, they are all standard food products. So, both the company and oversight agencies can confirm that any given ingredient going into a Coke is safe and approved. Second, the final product can be checked for toxins, chemical composition, and the like to ensure that the ingredients were not somehow mixed in a way that caused unforeseen problems. In other words, it is possible to audit that a Coke is safe to drink without having to know the secret formula.
The same is true with machine learning and artificial intelligence. To validate that a process accurately predicts what it is intended to predict, that its predictions are free from bias, and that those predictions are stable over time, it is not necessary to unveil the exact formulation of the underlying model. By passing a wide range of data to the model, we can demonstrate its accuracy, consistency, and level of bias while still maintaining the confidentiality of the secret sauce behind it. It is possible to have algorithms that deliver a competitive advantage, and to govern and audit the process rigorously, without revealing the core IP that has been developed. Therefore, there is no reason to argue against auditing. I am actually a fan of having 3rd party auditors involved, much as is done in the accounting space. We may soon see a company rise to prominence by providing such services.
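To make the idea concrete, here is a minimal sketch of what such a black-box audit might look like in Python. It is only an illustration under assumed names (the `audit_model` function, the `group` attribute, and the stand-in `black_box_predict` model are all hypothetical, not part of any particular product or standard): the auditor never inspects the model’s internals, only a prediction interface and labeled audit data, yet can still report accuracy and a simple bias gap across groups.

```python
# A minimal sketch of a black-box audit: the auditor only needs labeled audit
# data and the ability to call the model's prediction interface. All names
# here (audit_model, the "group" attribute, black_box_predict) are
# illustrative assumptions, not a specific standard or library API.

def audit_model(predict, audit_records):
    """Compute overall accuracy and per-group statistics without
    inspecting the model.

    predict       -- opaque callable: features dict -> 0/1 prediction
    audit_records -- list of dicts with 'features', 'label', and 'group' keys
    """
    overall_correct = 0
    per_group = {}  # group -> [correct, total, positive predictions]
    for record in audit_records:
        prediction = predict(record["features"])
        correct = int(prediction == record["label"])
        overall_correct += correct
        stats = per_group.setdefault(record["group"], [0, 0, 0])
        stats[0] += correct
        stats[1] += 1
        stats[2] += int(prediction == 1)

    accuracy = overall_correct / len(audit_records)
    # Positive-prediction rate per group: a simple proxy for disparate impact.
    positive_rates = {g: s[2] / s[1] for g, s in per_group.items()}
    bias_gap = max(positive_rates.values()) - min(positive_rates.values())
    return {"accuracy": accuracy,
            "positive_rates": positive_rates,
            "bias_gap": bias_gap}


if __name__ == "__main__":
    # Stand-in model whose internals the auditor never sees.
    def black_box_predict(features):
        return int(features["score"] > 0.5)

    audit_data = [
        {"features": {"score": 0.9}, "label": 1, "group": "A"},
        {"features": {"score": 0.2}, "label": 0, "group": "A"},
        {"features": {"score": 0.7}, "label": 1, "group": "B"},
        {"features": {"score": 0.6}, "label": 0, "group": "B"},
    ]
    print(audit_model(black_box_predict, audit_data))
```

The same pattern extends naturally: rerun the audit on data from different time periods to check stability, or swap in whatever accuracy and fairness metrics the governance policy calls for, all without the model owner handing over the model itself.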
One thing those of us in the data science field are often guilty of is trying to build things ourselves, even when something close to what we need is already available. Rather than tweaking the existing approach to our needs, we start from scratch. The urge to do this should be resisted!
When it comes to governance as it relates to safety, quality, and audits, there are highly mature approaches in other disciplines that can be borrowed. Traditional product development and engineering teams have strong protocols that have been developed over many decades. While it is certainly true that engineering protocols for safety assurance will not translate directly to data science processes, it is also true that tweaking an engineering approach to fit within a data science context is probably a faster path to progress than developing and testing protocols from scratch.
One terrific example of protocols that data science teams have adapted successfully comes from agile software development. While the agile protocols originally developed for software developers do not translate exactly to a data science context, many require little change. Data science teams now follow agile analytics protocols that take full advantage of the principles originally created to support agile software development. Sure, there are some differences and additions, but the data science community is certainly better off for borrowing a proven approach from a related discipline than it would be if we had tried to start a new grassroots approach on our own.
Governance is not nearly as interesting and engaging as creating awesome data science processes, but it is necessary. Do not assume that tackling data science governance has to be a long and painful effort requiring a lot of totally new protocols. The data science community can borrow and adapt much of what has been done by others in the areas of data governance, quality control, safety, and auditability. By resisting the urge to create bespoke approaches from scratch, we will not only accelerate our efforts but also avoid learning the same hard lessons that others learned as they built the governance processes we are borrowing from.
Originally published by the International Institute for Analytics