Using Data to Fight for Human Rights

With Tarak Shah, Data Scientist at the Human Rights Data Analysis Group

Tarak Shah
Data Scientist at HRDAG

Tarak Shah is a Data Scientist at the Human Rights Data Analysis Group (HRDAG), a RAFTO award-winning non-profit organization. His work has contributed to multiple systems estimation, a groundbreaking family of techniques for statistical inference; this tool is used to estimate human rights violations despite a lack of data.

Satyen Sangani
Co-founder & CEO of Alation

As the Co-founder and CEO of Alation, Satyen lives his passion of empowering a curious and rational world by fundamentally improving the way data consumers, creators, and stewards find, understand, and trust data. Industry insiders call him a visionary entrepreneur. Those who meet him call him warm and down-to-earth. His kids call him “Dad.”


Satyen Sangani: (00:03)
These days, we take data for granted. After all, there are more data points on planet Earth than there are stars in the sky. On this show, we spend a lot of time thinking about how to harness this wealth effectively. Many data professionals struggle with an overabundance of information, and if you ask me, that’s a great problem to have.

Today, we’re going to talk to a data scientist with the opposite problem: An incredible lack of data. Our guest is Tarak Shah, a data scientist at the Human Rights Data Analysis Group, also known as HRDAG. HRDAG is a nonprofit, nonpartisan organization that applies rigorous science to the analysis of human rights violations around the world. Their work takes them from war-torn conflict zones to the inner cities of the United States.

Satyen Sangani: (00:53)
But no matter their focus, they’re often dealing with incomplete datasets, looking for information that no one is collecting. Over their 30+ years of existence, the group has developed fascinating methodologies to address these issues. Today, we’re going to discuss this important work, and how they do it. With Tarak as our guide, we’re exploring how data experts uncover human rights violations and how this field has evolved. It’s a super thought-provoking conversation, filled with valuable lessons for data radicals.

Producer: (01:32)
Welcome to Data Radicals, a show about the people who use data to see things that nobody else can. This episode features an interview with Tarak Shah, Data Scientist at HRDAG. In this episode, he and Satyen discuss HRDAG’s work in Guatemala, police brutality, and the origins of HRDAG’s pioneering methods. This podcast is brought to you by Alation. Our platform makes data easy to find, understand, use, and govern, so analysts are confident they’re using the best data to build reports the C-suite can trust. The best part? Data governance is woven into the interface, so it becomes part of the way you work with data. Learn more about Alation at A-L-A-T-I-O-N.com.

 


 

How HRDAG Helped Convict a War Criminal

Satyen Sangani: (02:14)
One of HRDAG’s earliest projects was in Guatemala. The country was recovering from a 36-year civil war that ended in 1996. In its aftermath, HRDAG helped to collect records of human rights violations. Their work paved the way for the conviction of Guatemala’s former president, General Efraín Ríos Montt, who was found guilty of genocide and war crimes. It was the first time a former head of state was found guilty of war crimes in their own country.

Tarak Shah: (02:46)
My colleague, Dr. Patrick Ball, was in Guatemala, working with human rights groups at the time. Guatemala had had an armed civil conflict that had lasted 36 years. So around the end of that time, he was working with groups who were trying to document and preserve evidence of human rights violations that had occurred during the conflict. So Patrick and his team were doing a few different things, and a couple of them, I think were important, both for our work and for the way human rights data is collected in general.

Tarak Shah: (03:18)
In particular, they were interviewing survivors, which is a very common way of doing human rights work, so people describe events that they witnessed or experienced, and annotators will read through those testimonies and extract incidents that are grave human rights violations, so things like killings, tortures, disappearances, and so forth.

The things that I mentioned that were somewhat novel about that work, one in particular was that at that time, they were relying on a number of different data sources, rather than just using one as kind of the official data source. So in addition to these interviews that they were conducting, they were collecting similar data from other human rights organizations in the area. They were also reading through newspaper articles from throughout the conflict, and extracting narratives of different human rights violations and encoding those too.

Tarak Shah: (04:12)
The other, I guess, useful thing about that work at that time was that they used the individual violation as the unit of analysis. A common thing, then and now, is to record human rights violations at the level of the individual victim, and an individual victim can be the victim of multiple human rights violations. For instance, a person can be abducted, tortured, and then killed. And when annotators are trying to roll that up to a single record and a single human rights violation, a common practice is to roll that up to what they consider the most serious violation, so in this example I’ve given, the killing, and the fact that this person was tortured kind of gets dropped by the time this story has become data. That has the effect of distorting patterns if you’re particularly interested in understanding the use of torture during the conflict.
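
To make that distortion concrete, here is a minimal Python sketch. It is not HRDAG's code: the three victim records and the severity ranking are invented for illustration. It counts the same testimonies twice, once with each violation as its own record and once rolled up to the most serious violation per victim.

```python
from collections import Counter

# Hypothetical severity ranking, assumed only for this illustration.
SEVERITY = {"killing": 3, "torture": 2, "abduction": 1}

# Invented records: each victim maps to every violation found in testimony.
victims = {
    "victim_a": ["abduction", "torture", "killing"],
    "victim_b": ["torture"],
    "victim_c": ["abduction", "killing"],
}

def count_by_violation(records):
    """Count every violation separately (violation as the unit of analysis)."""
    return Counter(v for violations in records.values() for v in violations)

def count_by_most_serious(records):
    """Keep only the most serious violation per victim (victim as the unit)."""
    return Counter(
        max(violations, key=SEVERITY.get) for violations in records.values()
    )

per_violation = count_by_violation(victims)
per_victim = count_by_most_serious(victims)

print("torture, violation-level count:", per_violation["torture"])
print("torture, victim-level count:   ", per_victim["torture"])
```

In this toy data, torture appears in two testimonies, but the rolled-up table reports it only once, and abduction disappears entirely.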

 


 

The Birth of Multiple Systems Estimation

Satyen Sangani: (05:09)
So what did data analysis look like in a war-torn country in 1996?

Tarak Shah: (05:14)
It was not like a big data effort. It was not a lot of machine learning. It was a lot of handwork. It was databases on floppy disks, that were backed up on multiple computers. Eventually, they found that because they were incorporating data from so many different data sources, they found a lot of instances where the same human rights violation had been reported by multiple different sources of data, of documentation. At that time, they dealt with that by writing some code to do database deduplication, but through that work, they started analyzing the pattern of overlaps, and eventually, they had the insight that this data that they had collected could be useful in a family of statistical methods known as multiple systems estimation.

Tarak Shah: (06:00)
These tools are also known as capture-recapture methods, and the reason this is so important is because in most conflict situations, the data that we have access to is a convenience sample. It is never a complete enumeration of all of the violations that occurred, because it’s extremely difficult to collect that kind of data in the context of conflict. It is also not necessarily a representative sample of all that has occurred, and the reason that’s such an important observation is that the questions that we seek to answer in human rights tend to be about patterns and comparisons.

Tarak Shah: (06:42)
Think about like what you do when you’re collecting evidence of human rights violations. You’re doing interviews. Perhaps you’re looking up information. You’re doing various things to collect evidence. When does that process become the most difficult? When the conflict is the most extreme, when it’s not safe to go outside, when the power is out, and so forth. So we see in many cases, that when the violence is at its worst, the documentation of the violence can decrease or stay flat in any given source, right?

Tarak Shah: (07:13)
So they knew about this issue, and they were concerned, because again, in human rights, the question when we’re talking about accountability is about what actually happened, not just what was documented, right? So they discovered that the way they had collected data enabled this series of methods called multiple systems estimation. What that allows you to do is use the patterns of overlaps from observed data sources to make estimates about the number of incidents that occurred but were not documented. This was a huge step forward. It is kind of the birth of a lot of the methods that we continue to use throughout our work today.
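
The idea of estimating the undocumented from overlaps can be sketched with the simplest capture-recapture case: two lists. The Python below uses the Chapman form of the classic Lincoln-Petersen estimator. It is a textbook simplification, not the method HRDAG uses in practice, which per the interview draws on several sources. It assumes the two lists are independent, every incident is equally likely to be recorded, and records have already been deduplicated and matched perfectly. The incident IDs are invented.

```python
def chapman_estimate(list_a, list_b):
    """Two-list capture-recapture estimate of the total number of incidents.

    Chapman form of the Lincoln-Petersen estimator:
        N_hat = (n_a + 1) * (n_b + 1) / (m + 1) - 1
    where n_a and n_b are the list sizes and m is the overlap.
    """
    a, b = set(list_a), set(list_b)
    overlap = len(a & b)
    return (len(a) + 1) * (len(b) + 1) / (overlap + 1) - 1

# Invented, already-matched incident IDs from two hypothetical sources.
interviews = {"i01", "i02", "i03", "i04", "i05", "i06"}
press = {"i04", "i05", "i06", "i07", "i08"}

documented = len(interviews | press)           # incidents seen at least once
estimated_total = chapman_estimate(interviews, press)
estimated_undocumented = estimated_total - documented

print("documented:", documented)
print("estimated total:", estimated_total)
print("estimated undocumented:", estimated_undocumented)
```

The intuition: the smaller the overlap between the sources relative to their sizes, the larger the implied number of incidents that neither source captured.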

Satyen Sangani: (07:56)
Which is such an interesting lesson for a lot of people who are doing analytical work, because on some level, one presumes all of this rich data collected, and then people, sometimes engineers, people who are not very familiar with the analytical process, are making decisions about which features to extract in order to be able to understand the data better. And in your case, I mean one, the data is really scarce and scant, and then the other part of it is you’re now making really, really critical decisions about which features to extract, and then based upon that extraction mechanism, trying to figure out how that tells you about what you don’t know and what you don’t have.

Satyen Sangani: (08:42)
In my interview with General Stanley McChrystal, he made a great point. Sometimes, we have data we don’t know we have. Stanley and his team discovered trash bags full of electronics seized from enemy combatants. This was valuable intelligence that illuminated hidden behavior. For HRDAG, the discovery of a warehouse in Guatemala did the exact same thing.

Tarak Shah: (09:05)
I’ll go back in time a little bit for another example: the use of kind of administrative data. I think of it almost as like the exhaust generated by bureaucracies, in order to identify evidence not only of human rights abuses, but also of command responsibility. Again in Guatemala, years after the conflict had ended, in fact in I think 2005 or 2006, the government accidentally found an abandoned warehouse that turned out to be the historic archives of the Guatemalan National Police. This was a series of buildings. Inside, there was all of the paperwork generated by the Guatemalan National Police, going back for almost a century. These papers were not super organized. They were tied in bundles of string. Because this building had been abandoned, some of them were covered in rat or bat feces. They were kind of molding, et cetera.

Tarak Shah: (10:05)
As soon as human rights workers realized what it was, they knew it was going to be just a super important source of data for the kind of ongoing effort for accountability for the crimes of that conflict. So one of the things that happened after that is HRDAG and other researchers went down, and the archivists there estimated that it would take decades to read through all of these documents. I think there were so many they didn’t measure them in pages. They measured them in kilometers, so it was like eight kilometers of paper or something, like stacked up, you know.

Tarak Shah: (10:43)
So one of the first things that our team did was set up a sampling strategy, a topological sampling strategy, to get a kind of representative sample of documents from this collection. What that looked like for the people who were doing the sampling and review was they would get instructions that say like, “Go into building two, walk forward 15 feet, take a left turn, and go 12 feet, and then the pile that you find in front of you, look up 17 inches and pull the documents out of that, and that’s the sample,” and so on and so forth. And there’s actually like numerous papers written about all the details that went into this sampling. But what that allowed them to do was to start to get a sense of what types of documents that this collection contained.
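
As a rough illustration of how walking instructions like these could be generated, the Python sketch below draws random physical locations. Everything in it is an assumption made for the example: the building names and dimensions are invented, and weighting buildings by floor area is a simplification. The actual design is documented in the papers mentioned above and is not reproduced here.

```python
import random

# Hypothetical dimensions per building:
# (depth in feet, width in feet, maximum stack height in inches).
BUILDINGS = {
    "building 1": (60, 40, 72),
    "building 2": (80, 30, 60),
}

def draw_sample_points(n, seed=0):
    """Draw n random physical locations, weighting buildings by floor area."""
    rng = random.Random(seed)  # seeded so the sampling plan is reproducible
    names = list(BUILDINGS)
    areas = [BUILDINGS[name][0] * BUILDINGS[name][1] for name in names]
    points = []
    for _ in range(n):
        name = rng.choices(names, weights=areas, k=1)[0]
        depth, width, height = BUILDINGS[name]
        points.append({
            "building": name,
            "forward_ft": rng.randint(0, depth),
            "left_ft": rng.randint(0, width),
            "height_in": rng.randint(0, height),
        })
    return points

for p in draw_sample_points(3):
    print(
        f"Go into {p['building']}, walk forward {p['forward_ft']} feet, "
        f"turn left and go {p['left_ft']} feet, then look up "
        f"{p['height_in']} inches and pull the documents there."
    )
```

Choosing locations at random, rather than letting reviewers pick the piles that look interesting, is what lets the pulled documents say something about the collection as a whole.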

Satyen Sangani: (11:34)
This Herculean effort paid off. Hidden in these records was the proof that HRDAG needed to solve the mystery of a murder that had taken place 20 years earlier.

Tarak Shah: (11:45)
A couple of things happened. One, historians identified documents that were describing a police raid that resulted in the disappearance of a student union leader named Edgar Fernando Garcia. The reason this was important was because since his disappearance, there had been no kind of official word about what had happened to him. The police denied any responsibility, and said that he had probably been killed in some kind of gang violence or something, and they had nothing to do with it. His family continued to look for him.

So these documents suddenly shed light on what had happened to him, because the researchers had been doing this kind of larger analysis of the documents and flow of documents. They extracted, manually, metadata from each document, who it was from, who it was to, when was it sent, that kind of thing, as well as encoded any human rights violations that occurred within the body of the documents, in a way similar to what I talked about earlier.

Tarak Shah: (12:44)
Through that analysis, researchers were able to show not only that the police had this operation in the same exact place and time as Mr. Garcia was abducted, but also that this operation was not unique, that it was part of a pattern, and that operations like this had been part of the normal flow of communication between superiors and subordinates, and were, bureaucratically, part of the normal pattern of things.

Because of that, they were able to bring this evidence against, not only the police who had been immediately responsible for the abduction, but at a later trial, they presented that evidence against the former chief of the Guatemalan National Police, Héctor Bol de la Cruz.

Tarak Shah: (13:32)
In that trial, they were able to establish command responsibility for the disappearance of Edgar Fernando Garcia, based on this document analysis, and he was also convicted for that crime, which is a pretty… I don’t know. It’s not easy to establish command responsibility.

 


 

Data to Hold Corrupt Police Accountable

Satyen Sangani: (13:50)
So what does HRDAG’s work look like in 2022? Tarak and I discussed a project he’s currently working on, an ongoing investigation of gender-based violence at the hands of Chicago police.

Tarak Shah: (14:02)
I’m going to rewind a little bit to the early 2000s. There was a woman named Diane Bond, who lived in a housing complex known as Stateway Gardens, and Diane Bond was… One day, she and her son were attacked and abused by a crew of Chicago police officers known as the Skullcap Crew. This crew of officers was well known among their victims for abusing the residents of Stateway Gardens in a variety of ways, and with impunity. They had not been punished or taken off the force.

Tarak Shah: (14:37)
So Diane Bond decided to take her experience to what was then the agency that did police oversight in Chicago, which was called the Office of Professional Standards. This complaint eventually grew into one lawsuit and then two lawsuits, known as Bond v. Utreras and Kalven v. Chicago. For our purposes, one important outcome of those trials was that the courts ruled that police misconduct data in Chicago is public information.

Tarak Shah: (15:08)
That was a very big moment. What it led to… Well, it led to a large number of things, but one of the outcomes of that was the development of a website called CPDP. This is created and hosted by The Invisible Institute, and that is a website where you can look up police officer misconduct histories for Chicago PD officers. It was a huge step forward.

Tarak Shah: (15:30)
Okay, so what’s the problem? If you go to CPDP, one of the things you will see is that for each complaint, there’s a little button that allows you to submit a FOIA request, and that was a really cool feature that the developers thought to add. It allows anybody to submit FOIA requests for additional documentation about any given incident that’s described in the database.

Satyen Sangani: (15:52)
What do you mean by a FOIA request? Because I’m not familiar with that language and term, so help define that too.

Tarak Shah: (15:59)
Freedom of Information Act request. Both the federal government and most states have various laws that empower citizens to request data from their government about important things. The reason we would want to make a FOIA request about one of these incidents is that the data that became public through Kalven v. Chicago and Bond v. Utreras are these spreadsheets, basically. They have one row per incident, and all of the information about the allegation is, in a way similar to what I described my colleagues doing in Guatemala so many years ago, encoded into a data format, right?

Because the allegation begins as, for instance, somebody walking into an office or going onto a web form and explaining what happened to them, and that explanation kind of becomes data through some various choices, including this thing called the primary category. The entire allegation gets summarized down to one category, such as improper search of person, or operations and personnel violation, conduct unbecoming of an officer, et cetera.

Tarak Shah: (17:11)
So you might wonder what happens when there are multiple violations, which we see a lot of in the database. An example I remember seeing when I first started working with this data was officers entering a home unsolicited, pointing guns at all the people who were there, and shouting racial slurs at them, and taking their property, taking their wallets and stuff like this. That entire incident got summarized to one category, which was theft of property, right? And that’s just one example.

Tarak Shah: (17:41)
So what these FOIA requests enable us to do is get the supporting documentation: basically, the original testimony and any kind of investigative files that were opened by the oversight agency or the police about this incident. This can include, for instance, interviews with witnesses, and can include things like health records if somebody went to the hospital, and stuff like that.

Satyen Sangani: (18:04)
Even with the law in their favor, The Invisible Institute and HRDAG realized the power of a FOIA request could only go so far, but as they persisted with their work, an unexpected opportunity came up.

Tarak Shah: (18:17)
Our colleagues, or our partners at The Invisible Institute, in particular Trina Reynolds-Tyler, who is currently the director of data there, had already begun this investigation into gender-based violence. They were doing pretty deep reporting, and interviewing people who had experienced some kind of sexual violence at the hands of police. This included, but was not limited to, things like public strip searches, or groping and penetration that occurred under the kind of justification of a search, you know? Various other kind of sexual harassment and so forth.

Tarak Shah: (18:50)
But what they were finding in this process, when they were comparing kind of the documentation that they had put in records requests for with the data that was public, was that there wasn’t a clear way to know, just from the structured data, which incidents were of this kind of gender-based or sexual nature. For instance, a lot of the types of incidents that I just alluded to were getting coded as an improper search of person, and so the fact that… So if you’re trying to do this deep analysis of sexual violence, you run into this block.

Tarak Shah: (19:28)
Additionally, they were limited in terms of how many documents they could request from the city. There’s an important exception in the FOIA law, which is called undue burden, meaning if it would be too much of a burden for the public agency to produce these records, that is one reason they can deny it. So roughly, what we were able to do at that time was get documents for around 15 incidents per week. There’s tens of thousands of these incidents a year, and we were kind of butting our heads against this problem, of trying to figure out which ones to request, which… Like, if you think about it too long, it’s just like a deeply interesting problem.

Satyen Sangani: (20:13)
Yeah, it’s a sampling problem, right? I mean, it’s like being an auditor, and you’re basically looking at this, in this case, crooked account of… So you basically develop this competency in forensics and auditing, which is crazy.

Tarak Shah: (20:26)
Exactly, so meanwhile, while we were trying to figure out this tough problem, and we were kind of trying different strategies, we were trying to make sure that we… There’s like the explore and the exploit as you do the sampling, and so forth. But while we were doing that, there was another case that was kind of working its way up the Illinois court system. This case is also a Freedom of Information lawsuit, and it was brought by a man named Charles Green. Charles Green was wrongfully incarcerated as a teenager, and he was incarcerated, I think, for 20 or 30 years. He has been released, and continues to… This lawsuit that he had brought against the city was to make all of these… Well, one of the things he had requested was that all of this misconduct information, including the investigatory files, should be fully public information. We shouldn’t have this kind of 15-a-week bottleneck.
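
The explore-and-exploit trade-off under a fixed weekly budget can be sketched as follows. This Python example is hypothetical throughout: the relevance scores are random numbers standing in for whatever prior signal a team might have, and the 12/3 split between exploiting and exploring is an arbitrary choice, not the strategy the team actually used. Only the budget of roughly 15 incidents per week comes from the interview.

```python
import random

WEEKLY_BUDGET = 15  # from the interview: roughly 15 incidents per week

def choose_requests(scores, budget=WEEKLY_BUDGET, n_explore=3, seed=0):
    """Pick incident IDs to request: mostly top-scored, partly random.

    scores: dict mapping incident ID -> hypothetical relevance score.
    Returns (exploit, explore) lists whose combined size is at most budget.
    """
    rng = random.Random(seed)
    n_exploit = budget - n_explore
    ranked = sorted(scores, key=scores.get, reverse=True)
    exploit = ranked[:n_exploit]            # best guesses under current knowledge
    remaining = ranked[n_exploit:]
    explore = rng.sample(remaining, min(n_explore, len(remaining)))
    return exploit, explore

# Invented scores for 100 hypothetical incidents.
_score_rng = random.Random(42)
scores = {f"incident_{i:03d}": _score_rng.random() for i in range(100)}

exploit, explore = choose_requests(scores)
print("exploit:", exploit)
print("explore:", explore)
```

The random picks matter because the scores themselves may be wrong: requesting only the incidents that already look relevant would never reveal relevant incidents hiding under an unrelated category.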

Tarak Shah: (21:26)
So we knew about this case, and Trina and I in particular, along with Patrick, my colleague, had been having conversations about what we would do if we had all the documents, you know? Like, that would really expand our ability to do this investigation. And Patrick and I had had some success using machine learning, to kind of sift through large amounts of kind of text data in the past, on different projects, and had told Trina about this. And she was like, “Could we use that to do this kind of work?” And we were like, “Yeah, we should definitely try it,” but this was all hypothetical.

Tarak Shah: (21:59)
In 2020, in early 2020, as the Green v. Chicago case was making its way up the courts, the city, for whatever reason, just missed a filing deadline. They were supposed to file a kind of piece of paperwork that’s supposed to give them extra time to prepare their defense or whatever, and they just didn’t. They missed the deadline. As a result of that, the judge, kind of by default, made the documents in question, from 2011 through 2015, public. The documents that Green has requested go back to 1967, and that case is still in court right now, headed towards the Illinois Supreme Court. We don’t have those, but what we do have is this slice from 2011 to 2015.

Tarak Shah: (22:47)
That became the source for this investigation. Because we suddenly had all of these documents, we expanded kind of the scope of the investigation to a variety of types of violations that disproportionately affect women and nonbinary people, so this includes kind of the sexual violations that I had described before, as well as things like neglect, home investigation, policing of parents and children when they’re together, as well as targeting of people based on… of LGBTQIA people or disabled people. So we kind of set out to build this machine learning classifier.

Tarak Shah: (23:24)
A small problem before we were able to do that was that what we received were hundreds and hundreds of thousands of pages of scanned documents. Some of them included like embedded audio in the PDF. It was just very unstructured.

Satyen Sangani: (23:41)
Embedded audio in the PDF?

Tarak Shah: (23:43)
Yeah.

Satyen Sangani: (23:43)
A dream for the analyst.

Tarak Shah: (23:46)
And they included like pictures, and it’s a bunch of scanned administrative documents from a couple of different large bureaucracies. You can imagine kind of how messy it is. So we actually went through a pretty detailed process, and this has been ongoing work that I’ve been involved in, as well as folks from The Invisible Institute, as well as, importantly, some volunteers, including from an organization called Lucy Parsons Lab in Chicago.

We have been working for some time on processing these documents, extracting bits of structured information that we can use to index and search through them. So before we could really do anything in terms of this gender-based violence investigation, we had to figure out how to get from this mess of documents to something usable. What we ended up focusing on was that several of the different types of documents that exist in that collection contain a section that has a narrative description of the allegations. In some cases, these are in the first person. These are allegations that the person submitted via web form, for instance. In some cases, they are short summaries of the allegations written by the person doing the intake when the complaint happens. But in any case, these proved to be richer sources of information about what was alleged to have happened than those primary category codes that we had access to before.
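
A classifier that reads narrative text instead of the primary category code can be sketched in a few lines. The interview does not say which model the team built, so the Python below swaps in a tiny multinomial naive Bayes written from scratch. The six training narratives and the two labels ("neglect" and "other") are invented for illustration and are far too few for real use.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

class NaiveBayes:
    """Tiny multinomial naive Bayes with add-one smoothing."""

    def fit(self, texts, labels):
        self.label_counts = Counter(labels)
        self.word_counts = {}
        for text, label in zip(texts, labels):
            self.word_counts.setdefault(label, Counter()).update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.label_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n / total)  # log prior for this label
            for word in tokenize(text):
                score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Invented training narratives with hypothetical labels.
narratives = [
    ("officer refused to take my report", "neglect"),
    ("police would not help after I called", "neglect"),
    ("officer laughed and refused to file a report", "neglect"),
    ("officer searched my car without a warrant", "other"),
    ("officer took my wallet during the stop", "other"),
    ("police entered my home without permission", "other"),
]
texts, labels = zip(*narratives)
model = NaiveBayes().fit(texts, labels)

print(model.predict("the officer refused to file my report"))
```

The point of the sketch is the input, not the model: the classifier sees the full narrative, so an allegation that was filed under an unrelated primary category can still be surfaced by its wording.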

 


 

How Data Insights Can Heal a Community

Satyen Sangani: (25:12)
What are the insights that all of the work has generated, and how prevalent is this? I mean, what percentage of officers in the Chicago PD, for example, have violations reported, and how distributed is this problem?

Tarak Shah: (25:29)
We’ve created a website called btsurface.com, and that’s where we are, in an ongoing way, presenting as we find them and confirm them, some of the findings from this work. As we work our way through them, we are kind of reporting our findings from each of these categories. What we’ve been able to report on to start with are specifically, we’ve done some deep reporting on allegations of neglect. These are incidents where people have come to police for help for something and have been denied that help.

In particular, we’ve done a lot of reporting on instances of neglect following a sexual assault and following domestic violence. These are people coming to the police and saying, “I was assaulted or I am at risk because of this domestic violence,” and who, in the cases of rape, are blamed for what happened to them or told that they are lying or making this up. Similarly, with the incidents of domestic violence, people who are… Again, police are treating them poorly, laughing at them, and we found, importantly, many instances of these.

Tarak Shah: (26:36)
So we talk about patterns of violations. One of the things that folks who’ve been reading these documents have either complained about or just observed is that over time, they start to feel like they’re reading the same allegation over and over again. And in fact, we go and check, and they’re different allegations. It’s just that similar things happen to different people in different places. So it’s important for us to talk about these patterns of violation, because they point to shared experiences that people in Chicago are having across time.

Satyen Sangani: (27:07)
As the team worked through this incredible quantity of data, they began to see how widespread these systemic issues were. They also discovered some surprising findings.

Tarak Shah: (27:18)
When we started this work, I at least, I can’t speak for others, thought about those kinds of sexual violations that I described at the beginning as specifically targeting women. What we found through… because we used these search tools to search through them, was that they were happening often to men, to black men in particular.

So Trina, one of the things that she’s been really big on from the get-go was constantly having this work be in conversation with the community, in particular with black people who live in Chicago and people who are most affected by this type of violence. So she’d been having, even before we had any real findings, like community meetings and events to talk about the work, to talk about what we had seen so far, and ask people things like, “Are we asking the right types of questions? What are we missing? What part of the story are we not getting?” That kind of thing.

So when we started to notice those reports that this kind of violence was affecting black men, she had a meeting that was specifically only for black men, so that people who came could speak freely about their own experiences, without maybe feeling ashamed or worried.

Satyen Sangani: (28:34)
Yeah, but to be clear, that’s just good science, right? I mean, on some level, you had a hypothesis, and you found data that changed your hypothesis, which is sort of exactly what we would hope for.

What strikes me about your work is that you’re a data scientist, and I think from the outside-in, you hear the words data scientist, and you think about statistical techniques, and mathematics, and neural nets, and all of these really high-brow math and computing tools that people would use on a day-to-day basis. And your work, what percentage of your time are you spending doing that kind of work? Like, is that where you’re spending your time and effort? Because it strikes me that nothing you’re talking about lives in that world. Most of it lives in the world of just trying to establish what is fact.

Tarak Shah: (29:35)
I don’t have a good percentage breakdown for you, but my role in particular actually is definitely on the nerdier side of the spectrum for the work that’s involved overall in human rights cases. I personally do spend a lot of time doing the kind of things that you might imagine. I write a lot of code. I write a lot of machine learning code these days.

That said, it is different in a variety of ways. Like you were saying, we don’t start with the assumption that we have all of the relevant data, which I think is like… In some of my previous training before this role, there’s kind of a baseline assumption, like, “How do I fit a neural network,” and you’re assuming that the training data is representative of the data on which this model is going to be deployed at some point. That’s less true for me, but yeah, I don’t… I spend most of my time writing code. I should be fully clear about that.

Satyen Sangani: (30:34)
It is spent doing the things that we consider as data science, which is great, and yet the way in which you’re able to do that, or the data gathering process through which you’re able to do that, is super manual. And you know, what’s also, I think really interesting, is it requires a lot of judgment, and I imagine you are, in that feature-definition process, fairly vocal and active in at least framing the questions for what people should be looking for and what the findings are, and that probably, it sounds like it’s a super iterative process. Is that fair to say?

Tarak Shah: (31:11)
It’s definitely iterative. Our work is really dependent on kind of relationships with our partners, and one of the skills that we try to cultivate among ourselves is kind of, in those meetings, listening and being able to tease out questions that can be addressed through quantitative analysis, so our partners may not necessarily come with a fully fleshed out like, here is the exact data analysis question we have, and here is how we will know that we have the right answer, or something like that. So it’s definitely like iterative, like you said, going back and forth, “This is what we’re seeing. Are we looking for the right things? Does this make sense?”

Satyen Sangani: (31:50)
It’s so much fun to listen to your story, because there are so many themes that I think apply to data science work writ large, and a lot of times, and many of the data science jobs were obviously born in fully digital companies, where you’re watching people on the web, and they’re clicking things, and they’re gathering this massive stream of data, and you have so much data that it’s easy to formulate any question.

But in your case, you have the exact opposite circumstance. You are looking for patterns in things that are really, really hard to prove, where the datasets themselves are perhaps faulty in many cases, and certainly understated in other cases. So there’s an aspect of just, with the organization, getting out of the building, and not being satisfied with what you’ve got in order to satiate your curiosity, and that, I think is really powerful.

Satyen Sangani: (32:48)
The other theme that I thought was really interesting was that you’ve got… Any photograph, any audio file, is sort of a partial representation of the truth. You don’t have full resolution on a 360-degree view of anything that happened in the world. Like, you can’t play back what happened, so there is a choice there, which you have nothing to do with, which is, “What evidence do I have?”

And then even within that, there’s another analytical choice, which is, “What features am I choosing to extract? What questions am I choosing to ask?” And depending on what your line of inquiry is, you could say, “Oh, well people wearing red shirts are…” I mean, there’s a million features that you can ask for, so that is a judgment. I think often, people are given datasets, and structured datasets, and they are like, “Well, this is the data that I got, and this is reality, this is what it says,” and what they don’t realize is that structuring of data, that assembly of that dataset, is a judgment call by somebody farther upstream, and that itself is kind of some form of an analytical exercise.

Satyen Sangani: (33:47)
And you’re very actively realizing that your line of inquiry starts at the event, and you’re trying to get sort of at full resolution of truth from the lens you see it, which I think is really exciting. So much of what you’re describing is forensics work, and I think it’s just super powerful, to hear your story of getting out there, and doing it, and really being actively engaged with the subject matter, and not taking no for an answer, for the entire organization.

The work HRDAG does is incredibly important, and I know many of you may feel motivated to donate money or volunteer your time. To close out the conversation, Tarak suggested a few ways you could help HRDAG and their partners.

Tarak Shah: (34:33)
Yeah. I guess in general, money is always welcome. I mean, definitely do give to HRDAG if that’s what you’re excited to do, but also, for the types of organizations that I’ve been talking about today that we partner with, there’s small nonprofits and human rights defenders in every community who are doing this kind of work to expose and hold accountable the powerful for their crimes, and so in terms of like finding… If this work kind of inspires you or these goals kind of inspire you, finding who’s doing that kind of work in your own community. The volunteer question is an interesting one. It’s super important, and some organizations that we work with have really good capacity to intake and manage volunteers.

Tarak Shah: (35:18)
I mentioned the Lucy Parsons Lab in Chicago, and that’s just one that I happen to have personal experience with, and I just can’t say enough good things about the volunteers there, and they’re just doing really amazing work. I know that there are other groups like that, so if you’re looking to plug in, kind of trying to plug into networks like those, because it’s through those projects that you kind of learn what the work is about, and you develop the relationships that will help you kind of get really involved.

 


 

When War Is at Its Worst, Documentation Disappears

Satyen Sangani: (35:48)
As Tarak mentioned, when war is at its worst, documentation disappears. In other words, the most egregious human rights violations are the least likely to be recorded. The fog of war leaves hazy memories.

In this field, lack of data can be a data point in a larger pattern too, and thanks to pioneering work from folks like Tarak, we now have a system to fill in those blanks.

Multiple systems estimation empowers researchers to predict the number of undocumented incidents that occurred. It’s a crucial foundation in data work for human justice today. Who knows what other hidden patterns it will illuminate?

To learn how you can support HRDAG’s work, please see the linked show notes below. This is Satyen Sangani, CEO and Co-Founder of Alation. Thank you, Tarak and HRDAG, for sharing your powerful story, and thank you for listening.
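Editor's note: the episode does not walk through the math behind multiple systems estimation, so the following is only a minimal sketch of the simplest related idea, the two-list capture-recapture (Lincoln-Petersen) estimator. It is not HRDAG's actual method, which works with more lists and more careful statistical models. The sketch assumes the two lists were collected independently and that records have already been matched across lists; the function name and the identifiers are hypothetical and made up for illustration.

```python
def two_list_estimate(list_a, list_b):
    """Lincoln-Petersen estimate from two lists of already-matched record IDs.

    Returns (estimated total, estimated number missing from both lists).
    Assumes the lists are independent samples of the same population.
    """
    a, b = set(list_a), set(list_b)
    overlap = len(a & b)  # records that appear on both lists
    if overlap == 0:
        raise ValueError("the lists share no records, so the estimate is undefined")
    estimated_total = len(a) * len(b) / overlap
    documented = len(a | b)  # records that appear on at least one list
    return estimated_total, estimated_total - documented


# Made-up identifiers for illustration only; this is not real data.
list_a = {"p1", "p2", "p3", "p4", "p5", "p6"}
list_b = {"p4", "p5", "p6", "p7", "p8", "p9"}

total, undocumented = two_list_estimate(list_a, list_b)
print(f"estimated total: {total:.0f}, estimated undocumented: {undocumented:.0f}")
```

With these made-up lists, each list holds 6 records and 3 appear on both, so the sketch estimates 12 records in total, of which 3 appear on neither list. The intuition is that a small overlap between independent lists suggests that many records were missed by both.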

Producer: (36:47)
Alation empowers people in large organizations to make data-driven decisions. It’s like Google for enterprise data, but smarter. Hard to believe, I know, but Alation makes it easy to find, understand, use, and trust the right data for the job. Learn more about Alation at A-L-A-T-I-O-N.com.
