By Steve Neumann, Christine Fennessy, and Emily Collins
If you’re like most people, when you have a question, you get your answer from an algorithm.
Internet search. Your email and social media. The navigation, shopping, and news apps on your phone. They all run on algorithms—sets of rules for solving a problem or completing a task—that are incredibly powerful in analyzing data when paired with machine learning, a form of artificial intelligence.
And, if you’re like most people, although you interact with algorithms from the moment you wake up (and check your weather app) until your head hits the pillow (and late-night Twitter scrolling ensues), you’re basically in the dark about what’s going on behind the scenes.
While most of us mindlessly click through the latest software update or privacy agreement, computer science and engineering researchers in the Rossin College are raising questions—and working toward answers—as to why algorithms make certain decisions, what impact they have on privacy, and how to improve our digital literacy.
Increasing transparency in decision-making
Imagine that you and a colleague are active on a social network of job-seekers, akin to LinkedIn. You’re both in the same field, and equally qualified, but as you discuss your prospects over a cup of coffee, it’s clear that your friend has been seeing more high-quality job postings than you have.
It makes you wonder: What information did the site use to generate the recommendations in the first place?
Algorithms that can learn from data and make predictions are still somewhat of a black box to the end user, says Sihong Xie, an assistant professor of computer science and engineering.
If you’re not seeing some postings because of your age or your gender or something in your past experience, he explains, the algorithm that produced the recommendations is sub-optimal because its results are unfairly discriminatory.
Xie, who is a 2022 recipient of the prestigious NSF CAREER award, and his team are investigating the transparency and fairness of machine learning models. He’s one of a number of Lehigh Engineering researchers applying an optimization technique called the stochastic gradient method in their work (see “Optimizing Machine Learning”).
Algorithms are everywhere, but what’s behind their decision-making remains a mystery to internet users, says Xie. (ISTOCK/COFOTOIS)
In machine learning, “these optimization algorithms are becoming more and more important, for two reasons,” he says. “One is that, as the datasets become larger and larger, you cannot process all the data at the same time; and the second reason is that we can formulate an optimization problem that will analyze why the machine learning algorithm makes a particular decision.”
The latter is particularly important when applying machine learning to make decisions involving human users.
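To make the first point concrete, the sketch below shows the basic loop of the stochastic gradient method: rather than computing a gradient over an entire dataset at once, the model is nudged along the gradient of a small random batch at each step. The data, model, and step size here are invented for illustration and are not drawn from Xie’s work.

```python
# Minimal sketch of the stochastic gradient method for least-squares regression.
# The dataset, batch size, and learning rate are illustrative, not from Xie's research.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))                 # 10,000 examples, 5 features
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ true_w + 0.1 * rng.normal(size=10_000)

w = np.zeros(5)                                  # model parameters to learn
learning_rate, batch_size = 0.05, 32

for step in range(2_000):
    # Work on a small random batch instead of the full dataset at once.
    idx = rng.choice(len(X), size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    grad = 2 * Xb.T @ (Xb @ w - yb) / batch_size  # gradient of mean squared error on the batch
    w -= learning_rate * grad                     # take a small step downhill

print("recovered weights:", np.round(w, 2))
```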
In late 2021, Xie’s PhD students Jiaxin Liu and Chao Chen presented the results of their work at two renowned meetings on information retrieval and data mining. The team found that the current state-of-the-art model, known as a “graph neural network,” can actually exacerbate bias in the data that it uses in its decisions.
In response, they developed an optimization algorithm that finds optimal trade-offs among competing fairness goals, allowing domain experts to select the trade-off that is least harmful to all subpopulations.
For instance, the algorithm could be used to help ensure that selection for a specific job was unaffected by the applicant’s sex, while potentially still allowing the company’s overall hiring rate to vary by sex if, say, women applicants tended to apply for more competitive jobs.
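A toy example suggests why such trade-offs arise. In the sketch below, two common group-fairness measures, demographic parity (equal selection rates) and equal opportunity (equal true-positive rates among qualified applicants), are computed on synthetic hiring data; adjusting one group’s selection threshold to close one gap widens the other. The data and metrics are illustrative stand-ins, not the specific objectives in Xie’s papers.

```python
# Toy illustration of two competing fairness goals on synthetic hiring data.
# The data, thresholds, and metrics are illustrative, not Xie's exact objectives.
import numpy as np

rng = np.random.default_rng(1)
n = 5_000
group = rng.integers(0, 2, size=n)             # protected attribute, e.g. applicant sex
base_rate = np.where(group == 0, 0.5, 0.3)     # group 1 applies for more competitive jobs
qualified = rng.random(n) < base_rate          # ground-truth suitability for the job
score = 0.4 * qualified + 0.6 * rng.random(n)  # model score, noisily tracking suitability

def fairness_gaps(thresholds):
    """Return (demographic parity gap, equal opportunity gap) for per-group thresholds."""
    selected = score > thresholds[group]
    dp_gap = abs(selected[group == 0].mean() - selected[group == 1].mean())
    tpr = [selected[(group == g) & qualified].mean() for g in (0, 1)]
    return dp_gap, abs(tpr[0] - tpr[1])

# Hold group 0's threshold fixed and sweep group 1's: as selection rates
# equalize (parity improves), the true-positive rates drift apart.
for t1 in (0.50, 0.46, 0.43, 0.40):
    dp, eo = fairness_gaps(np.array([0.50, t1]))
    print(f"group-1 threshold {t1:.2f}: parity gap {dp:.3f}, opportunity gap {eo:.3f}")
```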
“Our work on ‘explainable graph neural networks’ seeks to find human-friendly explanations of why the machine learning model makes favorable or unfavorable decisions over different subpopulations,” says Xie. “We want to promote the accountability of machine learning as our society is rapidly adopting the techniques.”
Examining privacy through ‘social computing’
It’s no secret that internet heavyweights like Google, Facebook, and Amazon collect all kinds of data from users.
However, “these companies don’t release a lot of information about how their [algorithmic] programs work,” says Patrick Skeba, a fifth-year doctoral candidate, who is advised by Eric Baumer, an associate professor of computer science and engineering.
When Skeba originally decided to attend graduate school, he figured he’d pursue machine learning, but he soon realized he could satisfy his wide-ranging curiosity working in a field he’d never really considered: social computing.
“All computing is social, in that computers are used by humans and humans are social beings,” says Baumer. “While it is easy to forget that fact when you are tuning a model’s hyperparameters or benchmarking your system’s throughput, computers are mostly interesting in so far as the kinds of human interactions they enable.”
Baumer’s research examines the interrelations between the technical implementation details of algorithmic systems and the social and cultural contexts within which they operate.
“My first day in the office, [Baumer] was giving me books on computers, on philosophy, on sociology, on all kinds of things,” says Skeba, who was intrigued by Baumer’s interdisciplinary approach. “And so I got less and less interested in building machine learning models, and more interested in asking, ‘What are these models doing to us?’”
Social computing research by Baumer (right) and Skeba (left) explores human interactions with algorithmic systems. (Douglas Benedict/Academic Image)
In other words, what are they doing to our privacy?
Skeba’s research uses two different approaches to answer that question. The first is what he and his team call their “folk theories of algorithms” study. Skeba uses interview and survey methods to query regular internet users on their understanding of how their personal data is collected.
The dearth of information, he says, leads people to make all sorts of assumptions about how vulnerable their privacy is when they do certain things, like post comments online. “And so what we see sometimes are guesses that are quite far off.”
Some people think algorithmic systems can’t infer much, if anything, from their comments, so they post without concern. Others, however, are convinced that the algorithms can derive all sorts of information about them, and are too paranoid to post anything at all.
“You end up in a situation where there’s a disconnect between how these systems work and how people understand them,” says Skeba. “So figuring out how people are imagining these systems can help us better understand their behaviors. And that, in turn, can help us educate users and give them the tools to understand how their information is being used.”
Skeba’s second research project involves evaluating the privacy risk of an online forum. The forum is run by a nonprofit dedicated to helping drug users minimize the harm associated with drug use.
“These are people posting anonymously about things that are stigmatized, or illegal, or dangerous, and so there’s a lot of fear that law enforcement, family, or employers might try to figure out who these people are,” he says. “So, from a privacy perspective, this was an important issue to look at.”
Users of the site generate thousands of words of content, he says. So the question was: How much of a privacy risk did that pose?
He and his team built a model called a stylometric classifier that can identify the author of a piece of writing based on their style. Then, using algorithms known to be effective at identity matching, the researchers attempted to link specific pieces of content on the forum to accounts on websites like Reddit. If a link was made, the Reddit account could potentially expose the identity of the forum user.
“We found that the stylometric classifier did a really good job. We could get around 80 percent of users on two different websites linked just through this writing style,” says Skeba. The purpose of the study was to highlight that the simple act “of posting online introduces certain risks, and this is something we need to consider much more, moving forward.”
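The article doesn’t detail Skeba’s pipeline, but a bare-bones stylometric classifier can be assembled from character n-gram features and an off-the-shelf linear model, as in the sketch below; the posts, author names, and anonymous query are invented placeholders.

```python
# A bare-bones stylometric classifier: character n-gram frequencies feed a linear
# model that learns to attribute text to an author. The texts and authors here are
# placeholders; Skeba's actual pipeline and features are not described in the article.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Posts with known authors (e.g., gathered from a public site like Reddit).
train_posts = [
    "honestly i think the dosage advice here is way off, be careful...",
    "Has anyone tried combining these? Asking for harm-reduction purposes.",
    "ngl that trip report was wild, stay safe out there folks",
    "Please read the sticky before posting; interactions can be dangerous.",
]
train_authors = ["user_a", "user_b", "user_a", "user_b"]

# Character n-grams capture writing style (punctuation, casing, word endings)
# better than topic words alone.
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
model.fit(train_posts, train_authors)

# An anonymous post from the forum: does its style match a known account?
anonymous_post = ["ngl i think folks here underestimate how risky that combo is"]
print(model.predict(anonymous_post), model.predict_proba(anonymous_post))
```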
We already knew that we needed to protect our passwords. But now, algorithms could potentially mine the thoughts, opinions, and advice we share online to uncover our personal information. And that could affect anyone who spends time on the internet.
“If you create enough content, you could become a potential target for these kinds of analyses,” he says. “We wanted to highlight that there’s a need to critically analyze the algorithms that are being developed and deployed to ostensibly stop things like cybercrime and terrorism, and make sure they aren’t also harming people who rely on anonymity to do things that are acceptable and beneficial to themselves.”
DiFranzo is part of a team that received an award from the NSF's Convergence Accelerator for its project, "A Disinformation Range to Improve User Awareness and Resilience to Online Disinformation." (Ryan Hulvat/Meris)
Building digital literacy
And it’s not just our personal information at stake. The algorithms designed to keep us engaged with—and boost the bottom line of—social platforms, search engines, and websites don’t necessarily have our best interests at heart.
Recommendation systems can lead people into echo chambers of misinformation and make those echo chambers hard to notice and escape, says Dominic DiFranzo, an assistant professor of computer science and engineering.
“Watching a video that questions the efficacy of vaccines can get you more extreme anti-vax recommendations,” he says. “It’s a downward spiral that can be hard to see when you’re in it. These algorithms weren’t designed to radicalize people per se. They’re designed to give you more of what they think you want, regardless of what that content is.”
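The spiral DiFranzo describes can be simulated in a few lines: a recommender that simply reinforces whatever gets clicked will, over time, concentrate its suggestions on one topic. The topics, weights, and click probabilities below are made up for illustration.

```python
# A toy engagement-driven recommender: it recommends more of whatever got clicked,
# so a small tilt in clicks skews every future recommendation toward that topic.
# Topics, weights, and the click model are invented for illustration.
import random

random.seed(42)
topics = ["cooking", "sports", "vaccine skepticism"]
weights = {t: 1.0 for t in topics}            # recommender's belief about user interest

# Suppose the user is merely curious and clicks one topic slightly more often.
click_prob = {"cooking": 0.30, "sports": 0.30, "vaccine skepticism": 0.40}

for step in range(200):
    # Recommend in proportion to accumulated engagement ("more of what it thinks you want").
    recommended = random.choices(topics, weights=[weights[t] for t in topics])[0]
    if random.random() < click_prob[recommended]:
        weights[recommended] += 1.0           # engagement reinforces the recommendation

total = sum(weights.values())
for t in topics:
    print(f"{t}: {weights[t] / total:.0%} of future recommendations")
```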
DiFranzo is part of a multi-university team, supported by a $750,000 NSF grant, that is developing digital literacy tools to counter this online threat. Experts in computer science, cyber-security, psychology, economics, the humanities, and education will share advanced techniques and timely materials to increase disinformation awareness and improve user resilience.
The team also includes community collaborators from K-12 schools, senior citizen centers, and nonprofits promoting disinformation literacy in the Global South. Researchers will consider the psychological and cultural differences in the standards of trust and the distinct vulnerabilities of these populations.
DiFranzo’s focus is on developing the digital tools, primarily assisting in designing, building, and deploying the platform. The project will use a number of the technologies his lab has built, like the Truman platform—a system that creates interactive social media simulations for large-scale online experiments.
“There has never been a greater time to educate the public on how to discern and address online misinformation,” he says. “But it’s not enough to just inform them about these challenges. We need to provide them with the tools, training, and experience in how to navigate this new informational environment.”