Cache, as the saying goes, is king
As you waste your lunch hour scrolling through cat videos, snarky celebrity-bashing memes, and videos from your niece's third birthday party, for just a moment consider the majesty of the system behind the screen that enables such lightning-fast access to literally everything under the sun.
When users request files, images, and other data from the internet, a computer's memory system retrieves the data and stores it locally. To keep processing fast, a relatively small piece of hardware storage called a cache sits close to the central processing unit, or CPU. The CPU can pull data from the cache far more quickly than it can fetch it from main memory, let alone over the open airwaves of the internet.
Data is often gathered into the cache in chunks. Based on a principle called locality, when a user accesses one piece of data, the system fetches the whole neighboring block along with it. The idea is to improve efficiency: if nearby data is requested again, or you decide you need to watch that cat falling into the fish tank one more time, the cache can serve it up almost instantaneously.
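To see why locality usually pays off, here is a minimal sketch in C (the matrix size and the timing method are illustrative assumptions, not drawn from Guo's work): touching values in the order they sit in memory reuses each cached block many times, while jumping across memory forces a fresh block to be fetched for nearly every access.

    #include <stdio.h>
    #include <time.h>

    #define N 4096          /* 4096 x 4096 ints, roughly 64 MB */

    static int grid[N][N];

    /* Walk the matrix row by row: consecutive elements share cache blocks,
       so each block fetched from memory is used many times. */
    long long sum_row_major(void) {
        long long sum = 0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                sum += grid[i][j];
        return sum;
    }

    /* Walk the matrix column by column: successive accesses land in
       different blocks, so most of each fetched block goes unused. */
    long long sum_col_major(void) {
        long long sum = 0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                sum += grid[i][j];
        return sum;
    }

    int main(void) {
        clock_t t0 = clock();
        long long a = sum_row_major();
        clock_t t1 = clock();
        long long b = sum_col_major();
        clock_t t2 = clock();
        printf("row-major: %.2f s, col-major: %.2f s (sums %lld, %lld)\n",
               (double)(t1 - t0) / CLOCKS_PER_SEC,
               (double)(t2 - t1) / CLOCKS_PER_SEC, a, b);
        return 0;
    }

On a typical laptop, the second loop can take several times longer than the first, even though both loops touch exactly the same data.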
But as applications become ever more complex and user behavior becomes increasingly random, says Xiaochen Guo, P.C. Rossin Assistant Professor of Electrical and Computer Engineering at Lehigh University, fetching a large chunk of data becomes a waste of energy and bandwidth.
"If the blocks of data being stored in cache are too large, the amount of available storage on a piece of computer hardware needs to increase," she says. "In this case, breaking the blocks into smaller pieces doesn't help very much. Such an approach requires a significant amount of metadata, which keeps track of the data contained in cache, but eats away at available storage as well."
As she explores ways to solve this conundrum, Guo will be supported by a five-year, $500,000 Faculty Early Career Development (CAREER) Award from the National Science Foundation (NSF). This prestigious award is granted annually to rising academic researchers and leaders who have demonstrated the potential to serve as role models in research and education.
According to Guo, the goal of the NSF-supported project, Revamping the Memory Systems for Efficient Data Movement, is to improve data movement efficiency by redesigning memory systems to proactively create and redefine locality in hardware. The new memory system designs hold the potential to unlock fundamental improvements, which she believes may in turn prompt a "complete rethinking of programming language, compiler, and run-time system designs."
Identifying patterns and reducing overhead
Most previous attempts at solving this issue, Guo says, have come at the cost of increased memory overhead. Eventually, such a system devotes more and more of its storage to metadata, which does nothing more than identify what is already stored.
"Reducing metadata is a key focus of our work," says Guo, who is also affiliated with Lehigh's new Interdisciplinary Research Institute for Data, Intelligent Systems, and Computation. "Memory overhead translates directly to cost for users, which is why it's a major concern for hardware companies who might be considering new memory subsystem designs."
Guo's new design will resolve the metadata issue by identifying patterns in memory access requests.
"In a conventional design, the utilization of the large cache block is low—only 12 percent for even the most highly optimized code," she says. "With this work, we are looking more closely at correlations among different data access requests. If we find there's a pattern, we can reduce the metadata overhead and enable fine-granularity caches."
Reading a picture pixel by pixel is generally a simple enough task for a computer, and locality turns out to be fairly efficient. But when a system runs multiple tasks at once or performs something more complex, like the random scrolling and clicking that can define your social media habits, Guo's designs can enable hardware to learn and eventually predict patterns based on past and present behavior. Once those predictions are accurate enough, the system fetches and caches only the most useful data.
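In software terms, a highly simplified sketch of that idea might look like the following. The tiny table and the stride-based rule are assumptions made for illustration; real designs such as Guo's live in hardware and can track far richer correlations among requests.

    #include <stdio.h>
    #include <stdint.h>

    /* Toy stride predictor: for each access stream, remember the last
       address and the last stride; once the same stride repeats, predict
       the next address and prefetch only that line instead of a large
       neighboring block.  Table size and policy are illustrative only. */

    #define STREAMS 16

    typedef struct {
        uint64_t last_addr;
        int64_t  last_stride;
        int      confident;   /* stride has repeated at least once */
    } stream_entry;

    static stream_entry table[STREAMS];

    /* Returns the predicted next address, or 0 if there is no confident prediction. */
    uint64_t observe(int stream_id, uint64_t addr) {
        stream_entry *e = &table[stream_id % STREAMS];
        int64_t stride = (int64_t)(addr - e->last_addr);

        e->confident   = (e->last_addr != 0 && stride == e->last_stride);
        e->last_stride = stride;
        e->last_addr   = addr;

        return e->confident ? addr + stride : 0;
    }

    int main(void) {
        /* A regular stream (every 64 bytes) quickly becomes predictable. */
        uint64_t regular[] = {0x1000, 0x1040, 0x1080, 0x10c0, 0x1100};
        for (int i = 0; i < 5; i++) {
            uint64_t p = observe(0, regular[i]);
            printf("access 0x%llx -> %s\n", (unsigned long long)regular[i],
                   p ? "prefetch predicted line" : "no prediction");
        }
        return 0;
    }

Fed the regular stream above, the toy predictor starts issuing predictions after just two repeated strides; fed random addresses, it stays silent and avoids wasting bandwidth on data that will never be used.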
She says her team's preliminary results have been encouraging. "For our preliminary work on fine-grained memory, the proposed design achieves 16 percent better performance with a 22 percent decrease in the amount of energy expended compared to conventional memory," she says.
The next generation of computer hardware
Guo says this is an especially important topic when it comes to machine learning. "The entire community is looking at how to accelerate deep learning applications. Big companies are investing very heavily in this technology and related research," she says.
That's because machine learning applications are extremely data-intensive. They also move a lot of data, and Guo says much of that movement ends up being redundant.
Current deep learning models address this by compressing data to reduce their memory footprint. But this makes their memory access patterns significantly more complex, so Guo says improving the way that memory systems recognize these complex patterns should improve the performance and scalability of deep learning applications.
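To make that concrete, consider a compressed sparse row (CSR) multiply, a common way to store compressed weights; the format is a textbook technique, not something specific to Guo's project. The compression saves space, but each read of the input vector now depends on stored indices rather than on simple, predictable neighborhoods.

    #include <stdio.h>

    /* Sparse matrix-vector multiply in compressed sparse row (CSR) form.
       Compression shrinks the memory footprint, but the access to
       x[col_idx[k]] now depends on stored indices, so the hardware can no
       longer assume that neighboring data will be needed next. */
    int main(void) {
        /* A 3x4 matrix with only five nonzero weights. */
        double values[]  = {2.0, 1.0, 4.0, 3.0, 5.0};
        int    col_idx[] = {0,   3,   1,   0,   2  };
        int    row_ptr[] = {0, 2, 3, 5};   /* row r spans [row_ptr[r], row_ptr[r+1]) */

        double x[4] = {1.0, 2.0, 3.0, 4.0};
        double y[3] = {0.0, 0.0, 0.0};

        for (int r = 0; r < 3; r++)
            for (int k = row_ptr[r]; k < row_ptr[r + 1]; k++)
                y[r] += values[k] * x[col_idx[k]];   /* irregular, index-driven read */

        printf("y = [%.1f, %.1f, %.1f]\n", y[0], y[1], y[2]);
        return 0;
    }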
"Essentially, this will enable larger models with higher accuracy to run on smaller devices and be calculated faster," she says.
For this and other reasons, Guo's broader body of research has attracted interest from industrial and academic partners, including IBM, Intel, and Lawrence Berkeley National Laboratory, where she's spending time this summer.
"I hope, through some collaboration, we can be the group that influences the next generation of computer hardware," she says.
And the CAREER Award is a very good start on that road.
"It's encouraging that my vision is supported," she says," but there's a lot of work to do, and we need to do it right if we want to have the impact we think we can."