NSF supports effort to enable academic researchers of every variety to find the data they need

There was a time, not that long ago, when the phrases "Google it" or "check Yahoo" would have been interpreted as sneezes, or perhaps as symptoms of an oncoming seizure, rather than as coherent thoughts.

Today, these searches are key to answering all of life's questions.

It's one thing to use the Web to keep up with a Kardashian, shop for ironic T-shirts, argue with our in-laws about politics, or do any of the myriad other things we do online. But if you are a serious researcher looking for real data that can help you advance your ideas, how useful are the underlying technologies that support the search engines we've all come to take for granted?

"Not very," says Brian Davison, associate professor of computer science at Lehigh University. "They understand web pages, not datasets. And existing dataset search services are cumbersome, focusing on searching descriptions instead of data, and they cater to researchers looking within their own discipline."

Davison and his Lehigh research team envision a "dataset search engine" that can ultimately help many kinds of scientists locate data they can use to perform exploratory analysis and test hypotheses. The team has won more than $500,000 in support from the National Science Foundation (NSF) for this endeavor, which formally launched on August 1, 2018, with an estimated completion date of July 31, 2021.

The interdisciplinary Domain-Agnostic Dataset Search team at Lehigh includes Davison as principal investigator (PI) and co-PIs Jeff Heflin, associate professor of computer science, and Haiyan Jia, assistant professor of journalism and communication in Lehigh's College of Arts and Sciences. Together, they are developing techniques that enable the discovery of relevant datasets, regardless of the searcher's area of expertise.

According to the group, the number of public dataset collections now available has grown so large that it is difficult for researchers to track them within their own discipline, and simply impossible to do so across disciplines. To help researchers find data in a discipline-agnostic manner, this NSF-backed project will investigate promising new approaches to full-content dataset search, utilizing what the team calls "user-centric methods to develop dataset search tools and novel methods of indexing a dataset's contents."

While some disciplines have carefully curated dataset collections with search capabilities, these collections are limited in scope and require researchers to know which collection to search. This makes it more difficult for researchers in other disciplines to find these datasets.

"By investigating domain-agnostic search techniques," says Davison, "we hope to enable the creation of a worldwide dataset search service, much like today's web search engines."

Through this project, the team hopes to provide technology and develop a prototype tool that can ultimately help many kinds of scientists locate data they can use to perform exploratory analysis and test hypotheses.

"Our goal," says Davison, "is that this work will one day help enable public dataset discovery and reuse, regardless of who produced the data or where it is stored—a way for researchers from all fields to organize, distribute, and access hard-won knowledge effectively, avoiding duplication of effort and enabling overall progress."

According to Heflin, data and data analytics are now an integral part of academic discovery across all areas of research and learning.

"We hope to help research communities be more efficient in their use of data to solve problems and create new knowledge," he says. "We envision a system as easy and powerful to use as Google, but used to explore datasets instead of Web pages, photos, and videos. This will be especially beneficial to research endeavors undertaken by social, physical, and data scientists."

Jia says that the design and development of the prototype will also involve professionals and practitioners in observational, interview, and experimental studies that inform and guide the process. This work includes developing a set of instruments for evaluating the dataset search technology and interface from the user's perspective.

"A dataset search engine using these methods benefits society by helping researchers accelerate their work and reduce duplication of effort," she says. "We intend for the end result of this project to help any analyst locate and utilize relevant datasets. It will benefit others in 'research-adjacent' pursuits as well, such as journalists seeking ways to improve their reporting, and financial managers forecasting trends in the marketplace."

All three of the primary researchers on the team are affiliated with Lehigh's new Interdisciplinary Research Institute for Data, Intelligent Systems, and Computation (I-DISC), one of three new Institutes launched by the University to create communities of scholars and catalyze crucial research in areas in which Lehigh can take a leading position on the national and international stage and make lasting societal contributions.

"I-DISC was formed to support teams of researchers that combine fundamental data and computational approaches with those focused on critical applications," says Davison. "With its potential for broad impact across the research world, this project is a perfect fit for that vision."

The team's project formally kicked off on August 1, 2018, and extends through July of 2021. The researchers intend to incorporate results of this effort into Lehigh courses that delve into data science, search engines, data journalism, and semantic Web technologies.

About the team

Brian Davison is an associate professor of computer science and engineering and director of Lehigh's undergraduate minor in data science, and teaches courses on data science, web search engines, and data mining, among others. He heads the University's Web Understanding, Modeling, and Evaluation (WUME) laboratory, and serves as editor-in-chief of the Association for Computing Machinery (ACM) journal Transactions on the Web. He was program co-chair for the ACM SIGIR 2018 conference, is currently chair of the ACM Web Search and Data Mining conference steering committee, and spent his most recent sabbatical with the Core Data Science group at Facebook.

Davison is an NSF Faculty Early Career Development (CAREER) award winner and one of twelve Microsoft Live Labs "Accelerating Search" award recipients. His research has been supported by the National Science Foundation, the Defense Advanced Research Projects Agency, Microsoft, and Sun Microsystems. As a graduate student, he led development in the Rutgers DiscoWeb search engine project, which was later spun out as an internet startup called Teoma (and was subsequently purchased by Ask Jeeves).

Jeff Heflin leads Lehigh's Semantic Web and Agent Technologies (SWAT) lab. His specific research interests include establishing semantic interoperability between heterogeneous information systems, scalable ontology reasoning, and developing formal theories of distributed ontology systems. He is one of the pioneers of Semantic Web research and wrote the first Ph.D. dissertation on the subject.

Heflin, an associate professor of computer science and engineering, has been involved in the design of many important Semantic Web languages "since before the semantic Web was its own field." This includes work on SHOE, DAML+OIL, and OWL. In 2004, he received an NSF CAREER award to study the theory and algorithms of distributed ontologies. He has served on the editorial boards of Artificial Intelligence Journal and Journal of Web Semantics, as guest editor for three other journals in his field, and currently serves as vice president of the Semantic Web Science Association.

Haiyan Jia is an assistant professor of journalism and communication within Lehigh University's College of Arts and Sciences. Her research is highly interdisciplinary, drawing from various fields and domains including mass communication, media effects, psychology, sociology, human-computer interaction, and information science. Specifically, her primary interest focuses on the psychological and social effects of communication technology, ranging from the Internet and social media to robots and the Internet of Things. Jia also investigates the social and collective aspects of privacy in an increasingly technology-rich world.

Jia has published her work in Communication Research, Human-Computer Interaction, and numerous conference proceedings, including those of the ACM SIGCHI Conference on Human Factors in Computing Systems (CHI) and the ACM Conference on Computer Supported Cooperative Work (CSCW). At Lehigh, she teaches data journalism and data visualization, with an emphasis on the role of technology in reshaping the landscape of journalism, transforming readership, and enabling and empowering the new generation of journalists and citizens.

Brian Davison

Brian Davison, Associate Professor of Computer Science and Engineering at Lehigh University, is principal investigator of an NSF-backed project to develop a search engine intended to help scientists and others locate meaningful research datasets.

Jeff Heflin

Jeff Heflin, Associate Professor of Computer Science and Engineering at Lehigh University, is a co-principal investigator of an NSF-backed project to develop a search engine intended to help scientists and others locate meaningful research datasets.

Haiyan Jia

Haiyan Jia, Assistant Professor of Journalism and Communication at Lehigh University, is a co-principal investigator of an NSF-backed project to develop a search engine intended to help scientists and others locate meaningful research datasets.