“It’s easier to program bias out of a machine than out of a mind.”
That’s the emerging conclusion of research – including my own – suggesting that AI-enabled decision-making systems can be made less subject to bias and better able to promote equality. This possibility is critical, given our growing reliance on AI-based systems to render evaluations and decisions in high-stakes human contexts: court decisions, hiring, access to credit, and more.
It’s been well-established that AI-driven systems are subject to the biases of their human creators – we unwittingly “bake” biases into systems by training them on biased data or with “rules” created by experts with implicit biases.
Consider the Allegheny Family Screening Tool (AFST), an AI-based system that predicts the likelihood a child is in an abusive situation using data from Allegheny County, Pennsylvania’s Department of Human Services – including records from public agencies related to child welfare, drug and alcohol services, housing, and others. Caseworkers use reports of potential abuse from the community, along with whatever publicly available data they can find for the family involved, to run the model, which produces a risk score from 1 to 20; a sufficiently high score triggers an investigation. Predictive variables include factors such as receiving mental health treatment, accessing cash welfare assistance, and others.
Sounds logical enough, but there’s a problem – a big one. By multiple accounts, the AFST has built-in human biases. One of the largest is that the system heavily weights past calls about families – from healthcare providers, for example – to the community hotline, and evidence suggests such calls are over three times more likely to involve Black and biracial families than white ones. Though many such calls are ultimately screened out, the AFST relies on them in assigning a risk score, resulting in potentially racially biased investigations if callers to the hotline are more likely to report Black families than non-Black families, all else being equal. The result can be an ongoing, self-fulfilling prophecy in which an AI system’s training data reinforces its misguided predictions, influencing future decisions and institutionalizing the bias.
It doesn’t have to be this way. More strategic use of AI systems – through what I call “blind taste tests” – can give us a fresh chance to identify and remove decision biases from the underlying algorithms even if we can’t remove them completely from our own habits of mind. Breaking the cycle of bias in this way has the potential to promote greater equality across contexts – from business to science to the arts – on dimensions including gender, race, socioeconomic status, and others.
The Value Of Blind Taste Tests
Blind taste tests have been around for decades.
Remember the famous Pepsi Challenge from the mid-1970s? When people tried Coca-Cola and Pepsi “blind” – no labels on the cans – the majority preferred Pepsi over its better-selling rival. In real life, though, simply knowing it was Coke created a bias in favor of the product; removing the identifying information – the Coke label – removed the bias so people could rely on taste alone.
In a similar blind test from the same time period, wine experts preferred California wines over their French counterparts, in what became known as the “Judgment of Paris.” Again, when the label is visible, the results are very different, as experts ascribe more sophistication and subtlety to the French wines – simply because they’re French – indicating the presence of bias yet again.
So it’s easy to see how these blind taste tests can diminish bias in humans by removing key identifying information from the evaluation process. A similar approach can work with machines.
That is, we can simply deny the algorithm the information suspected of biasing the outcome, just as the labels were removed in the Pepsi Challenge, to ensure that it makes predictions blind to that variable. In the AFST example, the “blind taste test” could work like this: train the model on all data, including referral calls from the community. Then re-train the model on all the data except the referral calls. If the model’s predictions are equally good without referral-call information, the model makes predictions that are blind to that factor. But if the predictions differ when those calls are included, it indicates that either the calls represent a valid explanatory variable in the model, or there may be bias in the data (as has been argued for the AFST) that should be examined further before relying on the algorithm.
This process breaks the self-perpetuating, self-fulfilling prophecy that existed in the human system without AI, and keeps it out of the AI system.
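The train-and-retrain comparison can be sketched in a few lines. This is a minimal illustration with synthetic data and a from-scratch logistic regression; the feature names are hypothetical, and nothing here comes from the actual AFST.

```python
# A minimal sketch of the "blind taste test": train a model with and
# without a suspect feature, then compare predictive performance.
# All data here is synthetic; feature names are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
n = 4000

# Two legitimate predictors and one suspect feature (e.g., a count of
# prior hotline referrals) that is unrelated to the true outcome here.
legit_a = rng.normal(size=n)
legit_b = rng.normal(size=n)
referrals = rng.poisson(1.0, size=n).astype(float)

# In this simulation the outcome depends only on the legitimate predictors.
true_p = 1 / (1 + np.exp(-(0.8 * legit_a + 0.6 * legit_b)))
y = rng.binomial(1, true_p).astype(float)

def fit_logistic(X, y, lr=0.5, steps=2000):
    """Fit logistic regression by gradient ascent on the log-likelihood."""
    X1 = np.column_stack([np.ones(len(X)), X])  # add intercept column
    w = np.zeros(X1.shape[1])
    for _ in range(steps):
        pred = 1 / (1 + np.exp(-X1 @ w))
        w += lr * X1.T @ (y - pred) / len(y)
    return w

def auc(scores, labels):
    """Rank-based AUC: probability a positive case outscores a negative one."""
    order = scores.argsort()
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Hold out a test set, then train once with and once without the suspect feature.
idx = rng.permutation(n)
tr, te = idx[:3000], idx[3000:]

def heldout_auc(X):
    w = fit_logistic(X[tr], y[tr])
    X1 = np.column_stack([np.ones(len(te)), X[te]])
    return auc(X1 @ w, y[te])

X_full = np.column_stack([legit_a, legit_b, referrals])
auc_full = heldout_auc(X_full)
auc_blind = heldout_auc(X_full[:, :2])  # suspect feature withheld

# Similar scores mean the predictions are effectively blind to the suspect
# feature; a large gap means it should be scrutinized before deployment.
print(f"AUC with referrals: {auc_full:.3f}; without: {auc_blind:.3f}")
```

In this synthetic setup the two scores come out nearly identical, because the suspect feature carries no real signal; with real data, a meaningful gap is the cue to investigate before trusting the model.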
My research with Kellogg collaborators Yang Yang and Youyou Wu demonstrated a similar anti-bias effect in a different domain: the replicability of scientific papers.
Unbiased Prediction Of Replicability
What separates science from superstition is that a scientific fact that is found in the lab or a clinical trial replicates out in the real world again and again. When it comes to evaluating the replicability – or reproducibility – of published scientific results, we humans struggle.
Some replication failure is expected and even desirable, because science involves experimenting with unknowns. However, an estimated 68% of studies published in medicine, biology, and the social sciences do not replicate. Replication failures continue to be unknowingly cited in the literature, driving up R&D costs by an estimated $28 billion annually and slowing discoveries of vaccines and therapies for Covid-19 and other conditions.
The problem is related to bias: when scientists and researchers review a manuscript for publication, they focus on a paper’s statistical and other quantitative results in judging replicability. That is, they rely on the numbers in a scientific paper much more than on its narrative, which describes those numbers. Human reviewers are also influenced by institutional labels (e.g., Cambridge University), disciplinary stereotypes (“physicists are smart”), journal names, and other status biases.
To address this issue, we trained a machine-learning model to estimate a paper’s replicability using only the paper’s reported statistics (typically used by human reviewers), only its narrative text (not typically used), or a combination of the two. We studied 2 million abstracts from scientific papers and over 400 manually replicated studies from 80 journals.
The model using only the narrative predicted replicability better than the model using only the statistics. It also beat the base rate of individual reviewers, and performed as well as “prediction markets” – a costly approach in which the collective intelligence of hundreds of researchers is used to assess a paper’s replicability. Importantly, we then applied the blind-taste-test approach and showed that our model’s predictions weren’t biased by factors including topic, scientific discipline, journal prestige, or persuasion words like “unexpected” or “remarkable.” The AI model provided predictions of replicability at scale and without known human biases.
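One simple way to audit a trained model for this kind of bias is to test whether its scores differ systematically across levels of a suspect factor. The toy sketch below uses synthetic data and a permutation test; the “prestige” flag and the scores are hypothetical stand-ins, not our actual dataset.

```python
# Toy audit: are a model's scores associated with a suspect label?
# Synthetic data; "prestige" and the scores are hypothetical stand-ins.
import numpy as np

rng = np.random.default_rng(1)
n = 2000

# Scores driven by a text signal only; prestige labels are assigned
# independently, so an unbiased audit should find no systematic gap.
text_signal = rng.normal(size=n)
scores = 1 / (1 + np.exp(-text_signal))   # stand-in replicability scores
prestige = rng.integers(0, 2, size=n)     # 0 = lower-, 1 = higher-prestige journal

def gap(groups):
    """Mean score difference between high- and low-prestige groups."""
    return scores[groups == 1].mean() - scores[groups == 0].mean()

# Permutation test: shuffle the labels to build the gap's null distribution.
observed = gap(prestige)
perm = np.array([gap(rng.permutation(prestige)) for _ in range(2000)])
p_value = (np.abs(perm) >= abs(observed)).mean()

print(f"observed gap = {observed:.4f}, p = {p_value:.3f}")
# A large, statistically significant gap would flag prestige as a
# potential bias channel worth investigating before trusting the model.
```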
In a subsequent extension of this work (in progress), we again used an AI system to reexamine the papers in the study, some of which had inadvertently published numbers and statistics containing mistakes that reviewers hadn’t caught – likely due to our general tendency to believe figures we are shown. Again, a system blind to variables that can promote bias when over-weighted in the review process – in this case, quantitative evidence – rendered a more objective evaluation than humans alone could, catching mistakes missed due to bias.
Together, the findings provide strong evidence for the value of creating blind taste tests for AI systems, to reduce or remove bias and promote fairer decisions and outcomes across contexts.
Applications In Business And Beyond
The blind-taste-test concept can be applied effectively to reduce bias in multiple domains well beyond the world of science.
Consider earnings calls led by business C-suite teams to explain recent and projected financial performance to analysts, shareholders, and others. Audience members use the content of these calls to predict future company performance, which can have large, swift impact on share prices and other key outcomes.
But again, human listeners are biased to use the numbers presented – just as in judging scientific replicability – and to pay excessive attention to who is sharing the information (a well-known CEO like Jeff Bezos or Elon Musk versus someone else). Moreover, companies have an incentive to spin the information to create more favorable impressions.
An AI system can look beyond potentially bias-inducing information to factors such as the “text” of the call (words rather than numbers) and the emotional tone it detects, to render more objective inputs for decision-making. We are currently examining earnings-call data with this hypothesis in mind, including specific questions such as whether alignment between the numbers presented and the verbal description of those numbers affects analysts’ evaluations equally when the speaker is male or female. Will human evaluators give men more of a pass in the case of misalignment? If we find evidence of bias, it will indicate that denying gender information to an AI system can yield more equality-promoting judgments and decisions related to earnings calls.
We are also applying the ideas here to the patents domain, where patent applications involve a large investment and rejection rates are as high as 50%. Here, current models used to predict a patent application’s success or a patent’s expected value don’t perform much better than chance, and tend to use factors like whether an individual or team filed the application, again suggesting potential bias. We are studying the value of using AI systems to examine patent text, to yield more effective, fairer judgments.
There are many more potential applications of the blind-taste-test approach. What if job interviews or assessments for promotion or tenure took place with some kind of blinding mechanism in place, preventing the biased use of gender, race, or other variables in decisions? What about decisions about which startup founders receive funding, where gender bias has been evident? What if judgments about who receives experimental medical treatments were stripped of potentially bias-inducing variables?
To be clear, I’m not suggesting that we use machines as our sole decision-making mechanisms. After all, humans can also intentionally program decision-making AI systems to manipulate information. Still, our involvement is critical to form hypotheses about where bias may enter in the first place, and to create the right blind taste tests to avoid it. Thus, an integration of human and AI systems is the optimal approach.
In sum, it’s fair to conclude that the human condition inherently includes the presence of bias. But increasing evidence suggests we can minimize or overcome that by programming bias out of the machine-based systems we use to make critical decisions, creating a more equal playing field for all.
originally posted on hbr.org by Brian Uzzi
About Author: Brian Uzzi is a professor at the Kellogg School of Management at Northwestern University. He co-directs the Northwestern Institute on Complex Systems (NICO) and teaches in the new MBAi program, focused on business innovation and the complex technologies that drive it forward.