Story by Steve Neumann
“The stochastic gradient method is the workhorse of machine learning—and thus most of artificial intelligence,” says Luis Nunes Vicente, chair of the Department of Industrial and Systems Engineering (ISE).
It’s an optimization algorithm that essentially fine-tunes models used in large-scale applications of machine learning, whether that’s a streaming service suggesting what movie to watch next, or a credit card company monitoring online transactions to detect fraud.
The main idea behind the stochastic gradient method comes from a seminal 1951 paper by mathematician Herbert Robbins and his graduate student Sutton Monro, who later joined the Lehigh ISE faculty. Today, ISE professor Frank E. Curtis, an expert in continuous mathematical optimization, is leading the charge to advance the use of the stochastic gradient method in machine learning.
A machine learning algorithm builds a model from the historical data it has been fed, then uses that model to make predictions about new data it receives.
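To make that concrete, here is a minimal sketch of the learn-then-predict pattern in Python; the data and the choice of model are illustrative assumptions, not taken from the article.

```python
from sklearn.linear_model import LogisticRegression
import numpy as np

# Historical data: four examples with two features each, and known labels.
X_train = np.array([[0.2, 0.7], [0.9, 0.1], [0.4, 0.5], [0.8, 0.3]])
y_train = np.array([0, 1, 0, 1])

model = LogisticRegression()
model.fit(X_train, y_train)      # learn a model from the historical data

X_new = np.array([[0.6, 0.4]])   # new, previously unseen data
print(model.predict(X_new))      # predict its label
```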
The process usually goes poorly at first, which is to be expected, says Curtis, who is a founding member of Lehigh’s OptML Research Group. For instance, an image classification algorithm might initially misinterpret a picture of a cat as a dog. The typical solution, he says, is to feed the machine more and more data so the algorithm can update its model to make better predictions.
Machines learn through what’s called a “loss function,” which measures the predictive accuracy of a model for a given set of data. If the model’s predictions deviate far from the actual results, the loss function takes on a very large value.
The job of the data scientist is to minimize the loss function—i.e., to reduce the degree of error in those predictions. That’s what optimization means in the context of machine learning, Curtis says, and it’s critical to advancing the technology that is driving innovation in areas such as medical diagnostics, autonomous vehicles, and speech recognition, to name a few.
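One common loss function, offered here purely as an illustration, is the mean squared error: the further the predictions drift from the actual results, the larger it grows.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Average squared gap between actual results and predictions."""
    return np.mean((y_true - y_pred) ** 2)

actual = np.array([1.0, 0.0, 1.0])
print(mean_squared_error(actual, np.array([0.9, 0.1, 0.8])))  # accurate model: 0.02
print(mean_squared_error(actual, np.array([0.1, 0.9, 0.2])))  # inaccurate model: ~0.75
```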
With the stochastic gradient method, the loss function acts as a kind of barometer: at each iteration it gauges the model’s accuracy, and the algorithm keeps adjusting the model to drive the error as low as possible.
In the image classification example, it would be a very slow process for a computer to go through all of the images in a dataset, Curtis says, and then update its model (a full-on “wash, rinse, repeat” approach).
The magic of the Robbins-Monro algorithm, he says, is that it employs a sampling method.
“Rather than going through all the data first and then updating its model,” Curtis says, “the algorithm randomly chooses a small bit of it and uses that to update it; and then it grabs another random bit of data, and does it over and over again until it has gone through all the data.”
With the stochastic gradient method, the machine ends up processing the same amount of data as with other methods, but it performs many more updates of the model along the way, each one improving its predictive accuracy. Essentially, you get a better result more quickly with the same amount of computation.
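Here is a minimal sketch of that sampling loop, fitting a least-squares model with minibatch stochastic gradient steps; the dataset, batch size, and step size are illustrative assumptions, not Curtis’s code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 1,000 examples whose labels follow a known linear rule.
X = rng.normal(size=(1000, 3))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=1000)

w = np.zeros(3)      # initial model parameters
step_size = 0.05     # the kind of setting that later needs "tuning"

for epoch in range(5):                                # passes over the data
    for idx in np.split(rng.permutation(1000), 100):  # 100 random batches of 10
        residual = X[idx] @ w - y[idx]                # errors on this small bit
        grad = 2 * X[idx].T @ residual / len(idx)     # gradient on the batch
        w -= step_size * grad                         # update the model immediately
print(w)  # after a few passes, w lands close to w_true
```

Note that the model is updated after every small batch, so a single pass through the data yields a hundred improvements rather than one.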
In a new project funded by a $250,000 NSF grant, Curtis and his team will develop modern improvements to the stochastic gradient method.
According to the project summary: “Despite the successes of certain optimization techniques, large-scale learning remains extremely expensive in terms of time and energy, which puts the ability to train machines to perform certain fundamental tasks exclusively in the hands of those with access to extreme-scale supercomputing facilities.”
One downside of the stochastic gradient method is that it requires a lot of “tuning,” Curtis explains. For example, “Google may need to run the algorithm a very large number of times with different parameter settings in order to find a setting in which the algorithm actually gives a good result.”
This tuning can essentially waste hours or weeks—or even months—of computation, which translates into a lot of wasted electrical power. It’s one of the reasons that only the big internet companies can afford to train very large-scale models for complicated tasks.
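In code, that tuning often amounts to a brute-force search in which each candidate setting costs an entire training run. In this sketch, `train_with_sgd` and `validation_loss` are hypothetical helpers standing in for a full training pipeline.

```python
# Hypothetical helpers: train_with_sgd runs a complete training loop with the
# given step size; validation_loss measures how good the resulting model is.
candidate_step_sizes = [1.0, 0.1, 0.01, 0.001, 0.0001]

best_loss, best_step = float("inf"), None
for step in candidate_step_sizes:
    model = train_with_sgd(step_size=step)   # one full (expensive) training run
    loss = validation_loss(model)
    if loss < best_loss:
        best_loss, best_step = loss, step
# Five candidates means five full runs; finer grids multiply the cost further.
```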
“We are designing ‘adaptive’ algorithms that adjust their parameter settings during the training process, so that they might be able to offer equally good solutions, but without all the wasted effort for tuning,” Curtis says.
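One long-standing example of this adaptive idea, shown here for illustration rather than as the group’s new method, is an AdaGrad-style rule that shrinks the step size automatically as gradient information accumulates.

```python
import numpy as np

def adagrad_step(w, grad, grad_sq_sum, base_step=0.1, eps=1e-8):
    """One AdaGrad-style update: accumulate squared gradients, then scale
    each coordinate's step down accordingly, reducing the need for hand-tuning."""
    grad_sq_sum = grad_sq_sum + grad ** 2
    w = w - base_step * grad / (np.sqrt(grad_sq_sum) + eps)
    return w, grad_sq_sum
```

Because the effective step size adapts on its own during training, a single run can behave reasonably across problems that would otherwise each demand a separate, costly parameter search.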