Can the flashes of color in thousands of functional MRI (fMRI) scans reveal an invisible coordination between distant regions of the brain? Can patterns in your web clicks give advertisers a better way to target messages to you?
The era of big data is presenting no shortage of machine learning problems. Katya Scheinberg, the Harvey E. Wagner Endowed Chair Professor of Industrial and Systems Engineering, is working to provide faster, more efficient tools to sift through vast quantities of data to reveal nuggets of insight.
“Learning with large data sets requires an optimization problem,” says Scheinberg. Optimization is a numerical method that steps through data to uncover links and patterns.
Large data problems—predicting weather patterns, mapping communities within huge social networks or detecting the synchronized firing of neurons—all rely on a known group of mathematical functions, Scheinberg says. Using the data, algorithms repeatedly evaluate the function under study, crunching through the data set until the calculations converge to the function’s minimum value.
Optimization tools seek to accurately model data with the simplest process. “I want to explain the data the best I can,” says Scheinberg, “and with the fewest iterations.
“Any data can be explained well with complex models,” she says. But those models impose a high cost in computational time and resources, and can be tied so tightly to specific data that they break when applied to other data. “They can’t be generalized,” she says.
Optimization approaches data like a climber proceeding step-by-step down a valley to find the bottom—in the fog, Scheinberg says. At each step, the algorithm computes from the data which direction to move next.
“In the beginning, crude methods will give you progress,” she says. “As you converge on the solution you have to work harder,” which is where many algorithms grind to a halt.
But what if you could clear away the fog? “If you can see more, you can make adjustments to get to the bottom faster,” Scheinberg says.
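The climber-in-the-fog picture can be sketched as plain gradient descent on a toy objective. This is an illustrative assumption, not Scheinberg's actual algorithm: the loss function, starting point, and fixed step size here are all invented for the example, and each step uses only local slope information, just as the climber can only feel the ground underfoot.

```python
# A minimal sketch of the "climber in the fog": gradient descent on a
# one-dimensional toy loss. All specifics here are illustrative.

def loss(w):
    # Toy objective: a valley whose bottom sits at w = 3.
    return (w - 3.0) ** 2

def gradient(w):
    # Derivative of the loss: the local slope the "climber" can feel.
    return 2.0 * (w - 3.0)

def descend(w, step=0.1, iterations=50):
    # Each iteration takes one step downhill, guided only by the
    # gradient at the current point -- no global view of the valley.
    for _ in range(iterations):
        w = w - step * gradient(w)
    return w

w_final = descend(w=10.0)  # converges close to the minimum at w = 3
```

In practice the step size would shrink or adapt as the iterates approach the minimum, which is where, as Scheinberg notes, the work gets harder.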
The tools she is working on today, in the third year of a DARPA-funded project, exploit the results gained in previous steps so that as calculations approach the lower limit of the function, “the problem space of models and variables gets smaller,” which reduces the time penalty associated with each step.
Scheinberg’s open-source tool is one of the fastest known that uses sequential data processing. Consider a case analyzing brain scans that contain 10,000 volumetric pixels, or voxels, each representing a 3-D region containing thousands of neurons. Each scan offers roughly 50 million voxel pairs, each a chance that activity in one region corresponds to activity in another.
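The 50 million figure follows from simple counting: with 10,000 voxels, every unordered pair of regions is a candidate link. A quick check of the arithmetic:

```python
from math import comb

# With 10,000 voxels, every unordered pair of regions is a
# potential correlation to test.
n_voxels = 10_000
n_pairs = comb(n_voxels, 2)  # "10,000 choose 2" = 10,000 * 9,999 / 2
# n_pairs is 49,995,000 -- roughly 50 million candidate links per scan.
```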
“We are solving problems of 10,000 voxels in an hour,” Scheinberg says. “We’ve achieved a good balance with relatively ‘cheap’ steps and with relatively rapid progress toward the solution.”