Helping AI make the leap from transcribing a document to interpreting it.

Dan Lopresti has spent years teaching computers to extract text and meaning from handwritten documents. His latest dare could change the paradigm by which certain difficult artificial intelligence (AI) problems are solved.

“Researchers want to build algorithms that approach human levels of performance,” says Lopresti, professor of computer science and engineering. But the model for doing this hasn’t changed in nearly half a century.

To tackle a pattern recognition problem, a researcher identifies a promising algorithm, “trains” it on text and handwriting samples that human experts have already labeled, and then turns it loose on new documents.

The media may have evolved from punch cards to CD-ROMs to websites, and data sets have grown much larger, he says, “but nothing else has changed about the way we do this research.”

Experiments in machine perception require lots of data, mostly in the form of scanned images of printed and handwritten pages. Some well-known data sets favored by researchers date back to 1972 and are “getting a little stale,” he says.

Even if new computer code has never “seen” well-known data, often the human author has, which subtly influences how algorithms are built. Experiments are usually run on a subset of a larger data collection, and researchers rarely cite exactly which documents were used.

Lopresti’s solution, the DARE paradigm, for Document Analysis Research Engine, is a collaboration with Bart Lamiroy of Nancy Université in France. It uses a powerful server computer at Lehigh that supports the development of algorithms and the running of experiments while serving data to researchers, providing an integrated platform for exploring pattern recognition problems. The idea emerged while Lamiroy was a visiting research scientist at Lehigh in 2010-11.

The 2008 Minnesota Senate race

One data set on the server consists of scanned ballots from the contested 2008 U.S. Senate race in Minnesota between Al Franken and Norm Coleman. The recount and six-month court battle, in which Franken was declared the winner by 312 votes out of 3 million cast, hinged on how humans interpreted ballots that had not been marked according to specified guidelines, Lopresti says.

“I can show you ballots where you can reasonably disagree with me as to a voter’s intent,” he says. The algorithms that processed these ballots thus had to explore the possibilities of multiple interpretations, rather than try to distill a single truth. “That is a leap from simply transcribing a document to extracting intelligence from it.”
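One way to picture "multiple interpretations" is to keep every plausible reading of a ballot, each with a weight, instead of collapsing them into one answer. The following is a minimal sketch of that idea only, not the method used on the Minnesota ballots: the function names, the 0.8 agreement threshold and the sample readings are all invented for illustration.

```python
from collections import Counter

def interpret(readings):
    """Summarize several readings of one ballot as weighted alternatives.

    `readings` is a list of labels, one per human or algorithmic reader.
    The result maps each label to the share of readers who chose it.
    """
    counts = Counter(readings)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.most_common()}

def is_contested(distribution, threshold=0.8):
    """Flag a ballot when no single reading reaches the agreement threshold.

    The threshold is an arbitrary illustrative value, not one from the article.
    """
    return max(distribution.values()) < threshold

# Invented readings of a single ambiguous ballot by four readers.
ambiguous = interpret(["candidate A", "candidate A", "candidate B", "no vote"])
print(ambiguous)
print(is_contested(ambiguous))
```

Keeping the full distribution means a later step, or a human reviewer, can still see that reasonable readers disagreed.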

The DARE server lets researchers tackle complex questions by breaking apart data collections and making elements as small as a single character on a page individually queryable in a database. The database is huge, but the ability to query elements directly has the potential to change the game for researchers.
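To make the idea of element-level queries concrete, here is a small sketch that stores page elements as rows in an in-memory SQLite database and looks them up directly. The table layout, column names and sample rows are assumptions made for this example; the article does not describe DARE's actual schema.

```python
import sqlite3

def build_index(elements):
    """Load (document, page, kind, content) tuples into an in-memory table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE element ("
        " id INTEGER PRIMARY KEY,"
        " document TEXT, page INTEGER, kind TEXT, content TEXT)"
    )
    conn.executemany(
        "INSERT INTO element (document, page, kind, content) VALUES (?, ?, ?, ?)",
        elements,
    )
    return conn

def find_elements(conn, kind, content):
    """Return (document, page) for every element matching kind and content."""
    cursor = conn.execute(
        "SELECT document, page FROM element"
        " WHERE kind = ? AND content = ?"
        " ORDER BY document, page",
        (kind, content),
    )
    return cursor.fetchall()

# Invented sample rows: individual characters and a ballot mark.
SAMPLE = [
    ("letter-0001", 1, "character", "a"),
    ("letter-0001", 1, "character", "b"),
    ("letter-0002", 3, "character", "a"),
    ("ballot-0001", 1, "mark", "oval"),
]

conn = build_index(SAMPLE)
print(find_elements(conn, "character", "a"))
# prints [('letter-0001', 1), ('letter-0002', 3)]
```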

Assessing a reputation

Querying the database can produce a truly random selection of documents from multiple data sets, so researchers don’t have to keep working with the same samples. Each query takes the form of a URL, so other researchers can see which documents were used and test their own algorithms against the identical data to verify the results. Because the selection is random, the content of the samples doesn’t influence a researcher’s design. The server also tracks the documents an algorithm has “seen,” allowing objective tests of the code’s ability to interpret previously unseen data.
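The three properties described above (random selection, a citable URL, and tracking of what an algorithm has seen) can be sketched in a few lines. This is an illustration under stated assumptions, not DARE's interface: the host name `dare.example.org`, the parameter names, the catalogue contents and the seeded-sampling approach are all hypothetical.

```python
import random
from urllib.parse import parse_qs, urlencode, urlparse

# Hypothetical catalogue: document IDs grouped by data set (names are invented).
CATALOGUE = {
    "ballots": [f"ballot-{i:04d}" for i in range(200)],
    "letters": [f"letter-{i:04d}" for i in range(200)],
}

def build_query_url(datasets, n, seed):
    """Encode a sampling request as a URL that can be cited in a paper."""
    params = {"datasets": ",".join(sorted(datasets)), "n": n, "seed": seed}
    return "https://dare.example.org/sample?" + urlencode(params)

def resolve_query_url(url):
    """Return the documents a URL refers to; the same URL gives the same list."""
    params = parse_qs(urlparse(url).query)
    datasets = params["datasets"][0].split(",")
    n = int(params["n"][0])
    seed = int(params["seed"][0])
    # Sort the pool so the result depends only on the URL, not on dict order.
    pool = sorted(doc for name in datasets for doc in CATALOGUE[name])
    return random.Random(seed).sample(pool, n)

# Record of which documents each algorithm has already been given.
seen = {}

def fetch_for(algorithm, url, unseen_only=False):
    """Serve documents to an algorithm and remember what it has seen."""
    docs = resolve_query_url(url)
    already = seen.setdefault(algorithm, set())
    if unseen_only:
        docs = [doc for doc in docs if doc not in already]
    already.update(docs)
    return docs

url = build_query_url(["ballots", "letters"], n=10, seed=42)
training_docs = fetch_for("my-recognizer", url)
fresh_docs = fetch_for("my-recognizer", url, unseen_only=True)
print(url)
print(len(training_docs), len(fresh_docs))
```

Because the seed travels inside the URL, a second researcher who resolves the same address gets the identical documents, and a repeated request with `unseen_only=True` returns nothing the algorithm has already processed.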

“Normally,” says Lopresti, “algorithms can be pushed to the point where they do well on training data. The proof of the pudding is in how it does with unseen data.”

Lopresti is also investigating ways of having the system evaluate the reputation of data sources, algorithms and even individual researchers.

“When you move from transcribing a document to interpreting it, the question becomes, who is interpreting it?” he says. “Nobody knows how to quantify this yet. I have an opinion of the reputation of my colleagues, of algorithms, of data, but it’s in my head. The question is how to get it out in a way that can be used.”

By combining specificity, objectivity and reproducibility, DARE can quickly identify algorithms that are tweaked for specific data but not generally useful, thus helping researchers make faster progress. A paper given by Lopresti and Lamiroy at the 2011 International Conference on Document Analysis and Recognition in Beijing was cited for its potential to change research in the field.