Searching the Internet, Brian Davison likes to joke, begins with a popularity contest. You enter a topic, and the search engine produces a list of websites. The pages listed first have received the most votes, or “recommendations,” with each link to a listed page from a credible site representing one vote.
This simple concept underpins how search engines function. But each web page’s links are made in different contexts, Davison says, and recognizing those contexts can improve the quality of web search results.
Davison, assistant professor of computer science and engineering, recently received a five-year NSF CAREER Award to study this approach.
“Our goal is to improve web searches – from eliminating search engine spam to improving search engine ranking functions,” he says. “One way to begin to identify a good page is to determine how many credible sites link to it.”
A search engine might count each link to a site as a recommendation, Davison says, but recommendations can’t be weighted equally. “While doing a search for ‘home improvement,’ you may discover a page on plumbers,” he says. “While they may be credible plumbers, your objective is to redecorate your house. We’re working to filter out the ‘recommendations’ that cause such a page to be viewed as authoritative on one topic but are not relevant to your desired topic.”
Davison and his students are examining the topical context of each link and determining how to rank pages by their authoritativeness within the topic of the query, thereby improving the authority calculation. They are also seeking new ways to determine context beyond this topical approach and to estimate why a link was created; this will help identify links that exist, say, for advertising purposes and do not match the searcher’s intent.
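As a rough illustration of this kind of topic-weighted “voting” (a simplified sketch, not Davison’s actual method; the page names, topic scores, and function below are hypothetical), each incoming link can be scaled by how relevant the linking page is to the query topic, so that off-topic recommendations contribute little authority:

```python
from collections import defaultdict

def topical_authority(links, page_topics, query_topic):
    """Score pages by topic-weighted incoming links.

    links       -- list of (source_page, target_page) pairs
    page_topics -- dict mapping each page to {topic: relevance} weights
    query_topic -- the topic of the current search

    Each link still acts as a "vote," but the vote is scaled by how
    strongly the *linking* page relates to the query topic, so
    off-topic recommendations add little authority.
    """
    scores = defaultdict(float)
    for source, target in links:
        relevance = page_topics.get(source, {}).get(query_topic, 0.0)
        scores[target] += relevance
    return dict(scores)


# Toy data: the plumber directory receives more raw links, but few of
# them come from pages about home decorating.
links = [
    ("pipes-forum", "plumber-directory"),
    ("tools-review", "plumber-directory"),
    ("trade-news", "plumber-directory"),
    ("diy-blog", "decorating-guide"),
    ("interiors-mag", "decorating-guide"),
]
page_topics = {
    "pipes-forum": {"plumbing": 1.0},
    "tools-review": {"plumbing": 0.75},
    "trade-news": {"plumbing": 0.5, "home decorating": 0.25},
    "diy-blog": {"home decorating": 0.75},
    "interiors-mag": {"home decorating": 1.0},
}

print(topical_authority(links, page_topics, "home decorating"))
# {'plumber-directory': 0.25, 'decorating-guide': 1.75}
```

A real ranking system would propagate such weights iteratively over the entire link graph rather than in a single pass, but the principle of discounting off-topic links is the same.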
After finishing its analysis, Davison’s team will be able to identify a site’s topical content and the communities in which it is well-respected.
The NSF award will also help Davison purchase storage equipment.
“We’ll begin with 12 terabytes and expand to approximately 50 terabytes,” he says, estimating his team’s eventual capacity at 500 million to 1 billion web pages. “This is a small fraction of Google’s capacity, but it is substantially more than the typical research trial.”