Mining medical data

Computer science and engineering professor Mooi Choo Chuah employs deep learning and data mining to improve systems for diagnosis, prognosis, and cybersecurity

By Richard Laliberte

“How long will I live?” This may be the most pressing question in health care, especially if you’ve been diagnosed with a devastating disease like amyotrophic lateral sclerosis (ALS).

Symptoms and clinical history alone may not reveal a prognosis. But data gathered from thousands of other patients may help. That, in part, is the concept behind research by Mooi Choo Chuah, a professor of computer science and engineering, that explores the use of clinical data mining to uncover patterns and better predict disease.

In the case of ALS, Chuah developed tools that improved forecasts of how rapidly the disease will progress and how long a patient will live. That information can be used, in part, to guide treatment decisions.

But first, a person must determine what his or her health condition actually is, and that is a puzzle that depends on patterns in symptoms and other factors. Chuah has refined ways of applying deep learning methods to come up with diagnoses or treatment recommendations. Deep learning is a form of machine learning in which computers using artificial neural networks, loosely akin to the human brain, become progressively more insightful in analyzing large amounts of data.

Better diagnostics could improve the accuracy of websites that interpret symptoms. They could also be applied to clinical settings. “Prediction models could help extend good diagnostics into rural areas that don’t have a lot of expert physicians,” Chuah says. “And they could help cut health care costs associated with overdiagnosis.”

Cleaning is key

Chuah began studying ALS after coming across a dataset that had been used in a crowdsourced competition designed to improve predictions of the disease’s progression. She wrote to the challenge organizer and received permission to use the data.

After reviewing the set, Chuah and her then-PhD student Qinghan Xue (now at Samsung Research) found that improving the quality of the data was key to improving the predictions drawn from it. “Often, crowdsourced data is noisy,” she says. There may be typos, missing fields or values, non-numeric values, and inconsistent testing from one hospital to another. According to Chuah, there are two remedies: clean the data after it has been collected, or improve the raw information at its clinical source. In a 2017 paper, she and Xue proposed how to do both.

First, they removed inconsistent or noisy data, and imputed missing data. After that, they conducted experiments using a variety of data mining research tools. They were able to improve the efficiency and accuracy of previous prediction models that had been derived from the same data. “We got better results only because we had more time than teams in a competition did,” Chuah says. Cleaning the data had indeed been key.
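
To make the two cleaning steps concrete, here is a minimal sketch in Python. It is not the procedure from the 2017 paper: the field names, the rule for what counts as a noisy record, and the choice of the median for imputation are all assumptions made for illustration.

```python
from statistics import median


def clean_records(records, fields):
    """Drop records with non-numeric values, then fill missing values.

    Assumptions (not from the article): a record is "noisy" if any field
    holds a value that cannot be read as a number, and missing values are
    imputed with the median of the values observed for that field.
    """

    def parse(value):
        # None or an empty string is treated as "missing", not as noise.
        if value is None or (isinstance(value, str) and value.strip() == ""):
            return None
        return float(value)  # raises ValueError for text such as "n/a"

    parsed = []
    for record in records:
        try:
            parsed.append({f: parse(record.get(f)) for f in fields})
        except (TypeError, ValueError):
            continue  # step 1: discard the noisy record

    # Step 2: impute each missing value with the per-field median.
    medians = {}
    for f in fields:
        observed = [r[f] for r in parsed if r[f] is not None]
        medians[f] = median(observed) if observed else None

    return [
        {f: (r[f] if r[f] is not None else medians[f]) for f in fields}
        for r in parsed
    ]


# Hypothetical records: one has a non-numeric age, one is missing a score.
raw = [
    {"age": "61", "score": "38"},
    {"age": "n/a", "score": "40"},
    {"age": "55", "score": ""},
    {"age": 70, "score": 30},
]
cleaned = clean_records(raw, ["age", "score"])
```

In this toy run the record with the unreadable age is dropped, and the empty score is filled with the median of the two scores that remain.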

As for improving data at its source, Chuah proposed an incentive model for hospitals to share their information. “If you can pool data together, you get a bigger dataset with more diverse scenarios, and the prediction model you develop will be better.” Yet hospitals are often reluctant to release patient data, partly to protect privacy and partly to guard what amounts to intellectual property.

Chuah developed a model that rewards hospitals for sharing high-quality data, with greater rewards going to institutions that contribute more patients (and thus incur greater costs), report more accurately, and offer more data from especially useful cases. The study demonstrated that applying such an incentive model could facilitate better predictions.
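
The article does not give the formula behind the incentive model, so the sketch below is only a toy illustration of its stated properties: the reward grows with the number of patients contributed, with reporting accuracy, and with the share of especially useful cases. The function name, the parameters, and the multiplicative form are all hypothetical.

```python
def sharing_reward(num_patients, accuracy, useful_fraction,
                   rate_per_patient=1.0, useful_bonus=0.5):
    """Toy reward for a hospital that shares data (hypothetical formula).

    num_patients    -- how many patient records the hospital contributes
    accuracy        -- estimated reporting accuracy, between 0 and 1
    useful_fraction -- share of records from especially useful cases, 0 to 1
    """
    if num_patients < 0:
        raise ValueError("num_patients must be non-negative")
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    if not 0.0 <= useful_fraction <= 1.0:
        raise ValueError("useful_fraction must be between 0 and 1")

    # More patients, higher accuracy, and more useful cases each raise the reward.
    base = rate_per_patient * num_patients * accuracy
    return base * (1.0 + useful_bonus * useful_fraction)
```

Any function that increases in all three inputs would illustrate the same idea; the multiplicative form is just one simple choice.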

Confusion to clarity

In theory, electronic medical records and crowdsourced health information from websites can both be used to diagnose symptoms or offer personalized advice. In practice, making accurate assessments presents challenges, according to a 2019 study by Chuah and Xue.

For one, the way people talk about their health can throw off algorithms looking for keywords. And that can cause medical websites to offer information that isn’t helpful.

For example, if a patient wants information about depression, but also mentions suffering from severe hypoglycemia, the health forum website might offer advice on diabetes, but not on depression.

Chuah has found a number of data mining approaches that can improve results. “When the user types their information, we look for patterns and remove noisy sentences or transition words that are not relevant,” Chuah says.
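
As a rough illustration of the filtering Chuah describes, the Python sketch below drops transition words from a user's text before any further analysis. The word list and the function name are hypothetical, and the actual system presumably uses far richer pattern matching.

```python
import re

# Hypothetical list of transition words that carry no medical meaning.
TRANSITION_WORDS = {"well", "also", "however", "anyway", "so", "basically", "actually"}


def strip_transition_words(text):
    """Lowercase the text, split it into words, and drop transition words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in TRANSITION_WORDS]


tokens = strip_transition_words("Well, I feel depressed; also, my blood sugar is low.")
# tokens: ['i', 'feel', 'depressed', 'my', 'blood', 'sugar', 'is', 'low']
```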

She designed a deep-learning-based medical diagnostic system that utilizes reliable medical information to generate diagnoses. She uses tools such as convolutional neural network (CNN) models and recurrent neural network (RNN) models that capture the underlying structure in sequential data. “It’s like gathering expertise from multiple physicians,” Chuah says. “As you reach an increasing level of opinion, you can do a better job of diagnosing a condition or personalizing care.”

She also gave greater weight to risk factors. “I know, for example, that Asian females typically have more calcium spots on mammograms that can generate scary false reports that lead to biopsy,” Chuah says. Knitting such information into a diagnostic model adds power to its predictions—and could potentially reduce health care costs.

Yet patients and doctors can be skeptical about diagnostics derived from artificial intelligence methods. “Physicians don’t want to use deep learning if they don’t understand how the model comes to a conclusion,” Chuah says. She made explanations a part of her system’s design so users can see which keywords led to specific predictions. “If you provide an explanation, it’s more likely a physician will trust the results,” she says.
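
The article does not describe how the system produces its explanations. As a stand-in, the Python sketch below uses a simple linear keyword scorer, not a deep learning model, to show what surfacing the keywords behind a prediction can look like. The weights and names are invented for illustration.

```python
def explain_prediction(tokens, keyword_weights, top_k=3):
    """Score a text and report which keywords contributed most.

    keyword_weights maps a keyword to a hypothetical weight; each occurrence
    of a keyword in tokens adds its weight to the overall score.
    """
    contributions = {}
    for token in tokens:
        if token in keyword_weights:
            contributions[token] = contributions.get(token, 0.0) + keyword_weights[token]

    score = sum(contributions.values())
    # Rank keywords by the size of their contribution, largest first.
    ranked = sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return score, ranked[:top_k]


# Hypothetical weights for a "depression" label.
weights = {"hopeless": 2.0, "tired": 1.0, "sleep": 0.5}
score, top_keywords = explain_prediction(
    ["i", "feel", "hopeless", "and", "tired"], weights
)
# score is 3.0; top_keywords is [('hopeless', 2.0), ('tired', 1.0)]
```

A physician looking at this output would see not only the score but also that "hopeless" and "tired" drove it, which is the kind of transparency Chuah argues builds trust.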

Protecting health data

Putting trust in AI also raises concerns about cybersecurity. Chuah has investigated potential cyberattacks on these deep learning models and what might be done to detect them. Neural networks used in health care are most often RNNs, yet the bulk of research on attacking neural networks focuses on CNNs. “Few have done studies on how to attack RNN, so we decided to look at that,” Chuah says.

Using both synthetic and real-world datasets, Chuah and Xue introduced a new attack on an RNN machine learning system. They added random noise that affected how the system weighted data, causing the system to misclassify a group of inputs into a wrong output class. “Typically in a successful attack, you look for loopholes that cause a small perturbation of true values,” Chuah says. Without necessarily introducing malicious data, they led the system to make wrong predictions and reduced its accuracy.
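
The attack itself is not spelled out in the article, so the Python sketch below shows only the general idea behind Chuah's remark about small perturbations. A toy recurrent model with one hidden value and hand-picked weights (both assumptions, unrelated to the models in the study) changes its answer when every input is nudged by a small amount.

```python
import math


def predict(sequence, w_in=1.0, w_rec=0.5):
    """Toy recurrent classifier: returns 1 if the final hidden value is positive."""
    hidden = 0.0
    for x in sequence:
        # Standard recurrent update: mix the new input with the previous state.
        hidden = math.tanh(w_in * x + w_rec * hidden)
    return 1 if hidden > 0 else 0


def smallest_flipping_shift(sequence, step=0.01, max_shift=0.5):
    """Search for the smallest uniform shift of the inputs that changes the class."""
    original = predict(sequence)
    k = 1
    while k * step <= max_shift:
        for sign in (1, -1):
            shift = sign * k * step
            if predict([x + shift for x in sequence]) != original:
                return shift
        k += 1
    return None  # no flip found within max_shift


readings = [0.2, -0.1, 0.05]  # hypothetical input sequence
shift = smallest_flipping_shift(readings)
```

For this input, a shift of a few hundredths is enough to flip the prediction, which is why inputs near a model's decision boundary are attractive targets.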

“The importance of demonstrating this is for researchers to take a deeper look at whatever model they come up with, and introduce techniques that make the model more robust,” Chuah says. She and Xue have also shown how features in a genuine RNN-based model differ from those in a maliciously modified one, and have designed a low-cost detection scheme that could identify an attack such as the one they executed.

It’s exciting work with a far reach. “That’s what makes research interesting,” says Chuah. “You add to the progress in developing more robust deep learning models. Whether you use them for health care or autonomous cars, it’s essentially the same thing. The wisdom applies everywhere.”
