Subjecting models to humans
Before a machine learning model is released, it is evaluated against performance metrics. For many such models, those metrics include an F-score, which combines two quantities into a single number: precision, the percentage of the model’s positive predictions that are correct, and recall, the percentage of actual positives in a dataset that the model correctly identifies.
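As a rough illustration, and not code from the study, the sketch below shows how precision, recall, and the F1 score are typically computed for a binary classifier; the example labels are invented:

```python
# Minimal sketch: computing precision, recall, and the F1 score
# from parallel lists of true and predicted binary labels.
# The example data are illustrative, not from the study.

def precision_recall_f1(y_true, y_pred, positive=1):
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t != positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if p != positive and t == positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Example: 1 = spam, 0 = legitimate
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]
print(precision_recall_f1(y_true, y_pred))  # (0.666..., 0.666..., 0.666...)
```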
“Traditionally, what you look for is a balance of high precision and high recall,” says Eric Baumer (pictured below, left), an associate professor of computer science and engineering. “The problem is that all false positives and all false negatives are treated equally and contribute to precision and recall in the same way.”
It’s a problem, Baumer says, because some false positives are more egregious than others. Take a model built to classify spam: it should flag the emails that really are spam while leaving legitimate messages untouched. But imagine that your power company writes to say there is a problem with your payment and your electricity is about to be shut off, and that something in the wording triggers the model to label the note as spam.
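One simple way to capture that idea, offered here only as a hypothetical sketch rather than anything the researchers built, is to weight each mistake by how much harm it causes; the categories and cost values below are invented:

```python
# Hedged sketch: weighting classification errors by severity instead of
# counting them all equally. Categories and costs are invented for illustration.

COSTS = {
    ("spam", "not_spam"): 1.0,       # spam that slips through: mildly annoying
    ("newsletter", "spam"): 2.0,     # low-stakes legitimate mail flagged as spam
    ("utility_bill", "spam"): 20.0,  # a shutoff notice flagged as spam: serious harm
}

def weighted_error(mistakes):
    """Sum severity-weighted costs over (true_category, predicted_label) pairs."""
    return sum(COSTS.get((true, pred), 0.0) for true, pred in mistakes)

mistakes = [("newsletter", "spam"), ("utility_bill", "spam")]
print(weighted_error(mistakes))  # 22.0 -- two errors, very different consequences
```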
“When an email like that is incorrectly labeled as spam, a false positive, the consequences can be serious,” he says. “So my students and I wanted to develop ways of assessing machine learning performance that went beyond metrics like precision and recall. That was what motivated us to come up with a multi-item, multi-dimensional human assessment.”
Baumer, Lehigh PhD student Amin Hosseiny Marari, and Joshua Levine, who took part in the CSE department’s Research Experiences for Undergraduates (REU) summer program, ran the human subject studies. The team showed participants the results of a classifier, i.e., how the computer model labeled certain documents from three sources: news coverage from the Associated Press, blogs by parents with a child on the autism spectrum, and the diary entries of an 18th-century playwright and poet. They then asked specific questions about how effectively the model labeled those documents: Is the label confusing? Surprising? Offensive? Objectionable? They wanted to compare how people assessed the model versus the standard performance metrics.
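A hedged sketch of what such a multi-item assessment might look like as data appears below; the item names echo the questions above, but the rating scale, document IDs, and structure are assumptions for illustration:

```python
# Hedged sketch of multi-item human assessment records. Item names echo the
# article's questions; the 1-5 scale and the document IDs are assumptions.

from statistics import mean

ITEMS = ["confusing", "surprising", "offensive", "objectionable"]

# Each participant rates one (document, predicted label) pair on every item.
responses = [
    {"doc_id": "ap_news_014", "label": "politics",
     "ratings": {"confusing": 1, "surprising": 2, "offensive": 1, "objectionable": 1}},
    {"doc_id": "parent_blog_207", "label": "complaint",
     "ratings": {"confusing": 2, "surprising": 4, "offensive": 5, "objectionable": 5}},
]

def item_means(responses):
    """Average each assessment item across all rated label assignments."""
    return {item: mean(r["ratings"][item] for r in responses) for item in ITEMS}

print(item_means(responses))
```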
“What we found is that there were basically two underlying dimensions to the ways that people assessed the classifier’s performance,” he says. “One was just, ‘Is this a suitable label or not?’ The other was, ‘Is this a biased or offensive or objectionable label?’ Usually you would never ask a human how well your classifier performs, but what we found is that these human-centered approaches can identify labels that seem to be perfectly fine when you look at them from a standard performance metric, but are in fact egregious errors in some way.”
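The article does not say how those dimensions were identified, but a common way to look for underlying dimensions in multi-item ratings is principal component analysis; the sketch below uses random placeholder data purely to show the mechanics:

```python
# Hedged sketch: principal component analysis on item ratings as one generic way
# to look for underlying dimensions. This is not the authors' stated method,
# and the data here are random placeholders.

import numpy as np

rng = np.random.default_rng(0)
# Rows: rated (document, label) pairs; columns: the four assessment items.
ratings = rng.integers(1, 6, size=(50, 4)).astype(float)

centered = ratings - ratings.mean(axis=0)
# Singular value decomposition of the centered rating matrix.
_, singular_values, components = np.linalg.svd(centered, full_matrices=False)

explained = singular_values**2 / np.sum(singular_values**2)
print("variance explained per component:", np.round(explained, 2))
print("item loadings of first two components:\n", np.round(components[:2], 2))
# Two dominant components would suggest two underlying dimensions, e.g.,
# label suitability and perceived bias/offensiveness.
```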
Obviously, you can’t run a human-subject study for every parameter tweak. But this approach could be used at key moments, such as just before a model’s release. Another interesting possibility, says Baumer, is creating computational analogs that correlate with what a human would say. Those proxy metrics could then be used to make incremental adjustments to the model.
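As a hypothetical sketch of that idea, one could check whether a candidate automatic metric tracks human-assessment scores across model variants; the metric and the numbers below are invented:

```python
# Hedged sketch: testing whether a candidate proxy metric correlates with
# human-assessment scores across model variants. All values are invented.

from statistics import correlation  # Pearson correlation, Python 3.10+

# One aggregate human-assessment score and one proxy score per model variant.
human_scores = [0.82, 0.64, 0.91, 0.55, 0.73]
proxy_scores = [0.79, 0.60, 0.88, 0.62, 0.70]

r = correlation(human_scores, proxy_scores)
print(f"Pearson r = {r:.2f}")
# A high correlation would suggest the proxy can stand in for human review
# between full human-subject studies.
```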
“Once you’ve got two or three variants, you run those with a human subject study,” he says. “And the hope is that you’re able to identify biased or offensive models before they get incorporated into areas like image labeling, spam detection, or the criminal justice system.”
Main image: Yingyaipumi/Adobe Stock; inset: Douglas Benedict/Academic Image