In prior blog posts, "Challenges beyond the 3V’s" and "Understanding data", I discussed some issues that hinder the efficiency of data analysts and drastically raise the bar for them to begin working with new data. Here, I want to drill into a few more experiences around the use and management of data.
Which upstream data semantics changed?
Most engineers who have maintained production data pipelines have had to debug unexpected changes in data. Once a problem is noticed, it is often exasperating to diagnose the root cause. Is it because of some change in my team’s code producing the data, or is it due to changes in one of the upstream datasets? Sometimes, analyzing a sample of the result yields some insight. Other times, it turns into a much longer ordeal.
Like many Googlers, I owned and maintained production pipelines for data used by my team and by dependent teams. One day, a key metric that we closely monitored almost doubled. That started an intense fire drill, which lasted weeks. We analyzed all potential causes, such as our own code changes and potentially relevant ads system changes that could have disturbed the distribution of our type of ads. At the end of the drill, we finally found that a semantic change in an upstream dataset had caused our metric to deviate. So why did we not investigate changes in that dataset first, and why did we get to it so late?
- First, we were not a direct consumer of the upstream data that changed. We were not on the high-traffic mailing lists where semantic changes were communicated.
- Second, we knew that our colleagues in the AdWords team relied on the same dataset. Our hypothesis was that if a change in that dataset affected us, it should also have affected similar metrics in other Ads teams. It turned out that our colleagues in AdWords had been notified directly ahead of time so they could prepare for the upcoming change, and our small team was left in the dark. Whew!
Imagine if such a semantic change didn’t cause a huge deviation in one shot but shifted the metric by just a few points each day. We probably would not have even noticed the semantic change and would have continued operating normally. Similar issues often impact analysts and data scientists even if they are just consuming data and not producing it.
If only there were automated and targeted ways to communicate such changes to all downstream consumers of data, just in time when they are investigating. Alternatively, if only there were a way to easily trace all relevant upstream data changes. Either of these mechanisms would, I think, help a lot of data consumers, in Google and elsewhere, debug a broad class of data issues more efficiently.
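To make the second idea concrete, here is a minimal sketch of tracing upstream changes, assuming two things that I did not have at the time: a lineage map recording which datasets each dataset reads directly, and a change log with one entry per semantic change. All dataset names, dates, and descriptions below are hypothetical, made up purely for illustration.

```python
from collections import deque
from datetime import date

# Hypothetical lineage: each dataset maps to the datasets it reads directly.
LINEAGE = {
    "team_metric": ["team_ads_joined"],
    "team_ads_joined": ["ads_clicks", "ads_impressions"],
    "ads_clicks": ["raw_click_logs"],
    "ads_impressions": [],
    "raw_click_logs": [],
}

# Hypothetical change log: (dataset, date of change, description).
CHANGES = [
    ("ads_impressions", date(2020, 1, 10), "added a new column"),
    ("raw_click_logs", date(2020, 3, 2), "meaning of a click field changed"),
    ("unrelated_dataset", date(2020, 3, 3), "backfill"),
]


def transitive_upstreams(dataset, lineage):
    """Return every direct and indirect upstream of `dataset`, nearest first."""
    seen = set()
    order = []
    queue = deque(lineage.get(dataset, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # a dataset can be reachable through several paths
        seen.add(current)
        order.append(current)
        queue.extend(lineage.get(current, []))
    return order


def upstream_changes(dataset, lineage, changes, since):
    """List changes on or after `since` in any upstream, newest first."""
    upstreams = set(transitive_upstreams(dataset, lineage))
    hits = [c for c in changes if c[0] in upstreams and c[1] >= since]
    return sorted(hits, key=lambda c: c[1], reverse=True)


if __name__ == "__main__":
    for name, day, what in upstream_changes(
        "team_metric", LINEAGE, CHANGES, since=date(2020, 2, 1)
    ):
        print(day, name, what)
```

The point of walking the lineage transitively is that it surfaces changes in datasets we did not consume directly, which is exactly the case that cost us weeks.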
How do I set Data Retention Policies?
Most data and IT engineers who maintain databases or data pipelines that produce data periodically face the same question: how much data do I retain? There is an important trade-off in most cases. Retaining more data can help consumers analyze trends and sometimes debug subtle bugs in data pipelines. On the other hand, we pay in terms of storage and maintenance.
It would obviously be better to make an informed choice about the retention period. At many places, including Google, I have seen the retention period determined by a) just the resource cost, or b) ad-hoc choices. It would be useful to understand how analysts actually use the data in order to make a more informed decision. Is most analysis focused on the last 10 months of data? If so, we could set the retention period to around 10 months. Today, where can we find such usage information? I couldn’t find it in any data environment that I have worked in so far.
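If such usage information were available, the decision could be as simple as the sketch below. It assumes a hypothetical access log from which we can derive, for each query, the age in days of the oldest data it touched, and it picks the smallest retention window that would still have served a chosen percentage of those queries. The function name, the parameter names, and the sample ages are my own illustrative choices, not taken from any real system.

```python
def retention_days(access_ages_days, coverage_percent=99):
    """Smallest age in days that covers `coverage_percent` of observed accesses.

    `access_ages_days` holds one number per query: the age of the oldest
    data that query read (query date minus data date), in days.
    """
    if not access_ages_days:
        raise ValueError("no access records to base a decision on")
    if not 0 < coverage_percent <= 100:
        raise ValueError("coverage_percent must be in (0, 100]")
    ages = sorted(access_ages_days)
    # Nearest-rank percentile, computed with integer ceiling division.
    rank = (coverage_percent * len(ages) + 99) // 100
    return ages[rank - 1]


if __name__ == "__main__":
    # Hypothetical ages (in days) of the data touched by ten queries.
    ages = [3, 5, 7, 10, 15, 20, 25, 30, 300, 400]
    print(retention_days(ages, coverage_percent=80))
    print(retention_days(ages, coverage_percent=100))
```

A percentile rather than the maximum is a deliberate choice in this sketch: a single one-off query over very old data would otherwise dictate the retention period for everyone. The resource cost still matters, but now it can be weighed against how many real queries a shorter window would break.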
In summary, there are many issues beyond the 3V’s (Volume, Velocity, Variety) that significantly hinder analysts. With these posts, I tried to illustrate those issues based on my experiences as a data analyst and a data engineer. Besides delaying analysts significantly, these issues can deter them from even getting started on important questions, because they know that working with new data involves a steep learning curve. I think the most important benefit of reducing the friction to find, understand, and use data is that it encourages analysts to take on all of the data their company owns.