How and Why
Google, Yahoo, others work to make search engines better at scanning the Web
Imagine yourself a monk in the Dark Ages. The abbot comes to you after vespers and asks you to report on the Bible's treatment of, say, sloth. "For the love of God!" you'd think. "How am I going to find every occurrence of sloth in the Vulgate?" Unless you had the Bible memorized, you would have started at Page 1 and read more than 700,000 words.
A French cardinal named Hugues de Saint-Cher solved this problem. He gathered 500 colleagues and, in the year 1230, completed the first concordance of the Bible.
It was the original search engine.
Fast-forward to 1994. The World Wide Web had grown from a few hundred pages to tens of thousands in one year, and a concordance was badly needed. But even an army of monks wouldn't have had a prayer against the rapidly expanding Internet.
In response, software engineers developed "spiders," programs that crawled across the Web and recorded the content of more than 100 pages per second. The spiders started at a major site and followed all its links, then followed all the links from those sites, and so on. In the 1990s spiders went on periodic strolls. Today, a search company's spiders never rest.
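The crawl described above, starting at one site and following links outward, can be sketched in a few lines of Python. This is a toy illustration, not any search company's crawler: the `TOY_WEB` dictionary and its `.example` page names are invented stand-ins for real pages, and no network requests are made.

```python
from collections import deque

# Hypothetical miniature "Web": each page maps to the pages it links to.
TOY_WEB = {
    "portal.example": ["news.example", "shop.example"],
    "news.example": ["portal.example", "blog.example"],
    "shop.example": ["blog.example"],
    "blog.example": [],
    "island.example": ["portal.example"],  # no page links here
}

def crawl(seed, links):
    """Visit pages breadth-first from the seed, following each link once."""
    seen = {seed}
    queue = deque([seed])
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)
        for target in links.get(page, []):
            if target not in seen:  # skip pages already queued or visited
                seen.add(target)
                queue.append(target)
    return order

visited = crawl("portal.example", TOY_WEB)
print(visited)
```

In this sketch the spider never reaches `island.example`, because no page it visits links there. A crawler that works only by following links has the same blind spot.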
The information they gather is stored in compressed form on the company's computers. This index includes the URLs, the words that appeared on the pages, and a few other details. When you conduct a search, you're not really searching the Web; you're searching the index.
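One common way to build such an index is an "inverted index," a table that maps each word to the pages containing it, much like a concordance. The sketch below assumes that structure; the article does not say how any particular company stores its index, and the `PAGES` data is invented for illustration.

```python
import re
from collections import defaultdict

# Hypothetical crawled pages: URL -> page text.
PAGES = {
    "a.example": "Sloth is a deadly sin",
    "b.example": "The sloth is a slow mammal",
    "c.example": "Deadly mammals of the world",
}

def tokenize(text):
    """Split text into lowercase words, dropping punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(pages):
    """Map every word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index

def search(index, query):
    """Return the URLs that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return []
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return sorted(results)

index = build_index(PAGES)
print(search(index, "sloth"))
print(search(index, "deadly sloth"))
```

The search never rereads the pages themselves. It only looks words up in the table, which is why a query can be answered without touching the live Web.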
The first search engines presented pages in the order they were found. Once searches started turning up thousands of matches, that approach was of no use, since the best match may have been buried in pages of junk. The next generation incorporated the idea of a search algorithm, a method to evaluate how well a page matched the query. Results were scored based on whether the search terms were included in the title, how high on the page they appeared, and whether they were capitalized or bolded.
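A scoring rule of the kind described above might look like the following sketch. The weights (3 points for a title match, up to 2 for position, 1 for emphasis) are arbitrary assumptions chosen for illustration, as is the page layout with `title`, `body` and `bold` fields. "Capitalized" is read here as written in all capital letters.

```python
def match_score(page, term):
    """Score how well a page matches one search term (toy weights)."""
    term = term.lower()
    score = 0.0

    # Signal 1: the term appears in the title.
    if term in page["title"].lower().split():
        score += 3.0

    words = page["body"].split()
    lowered = [w.lower() for w in words]
    if term in lowered:
        first = lowered.index(term)
        # Signal 2: the earlier the first occurrence, the higher the score.
        score += 2.0 * (1 - first / len(words))
        # Signal 3: the term is emphasized (all caps or marked bold).
        if words[first].isupper() or term in page["bold"]:
            score += 1.0
    return score

# Hypothetical pages, invented for this example.
concordance = {
    "title": "Sloth in the Vulgate",
    "body": "sloth appears in many verses",
    "bold": ["sloth"],
}
zoo = {
    "title": "Animal facts",
    "body": "many animals are slow including the sloth",
    "bold": [],
}

print(match_score(concordance, "sloth"))
print(match_score(zoo, "sloth"))
```

Because every signal here is something the page's own author controls, a rule like this is easy to manipulate.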
But the search engines still produced lots of bad results, and clever Web designers could easily game the system by putting misleading terms in their titles. (The pornography site Whitehouse.com, for instance, shocked many unsuspecting patriots during its seven-year existence, which ended in 2004 when the founder decided it might pose problems for his kindergartner son.)
Enter Google founders Sergey Brin and Larry Page with their PageRank system in 1997. The idea, according to Google's current vice president of search products and user experience, Marissa Mayer, was that "each page is as important as the pages that point to it." PageRank determines how many pages link to the page being analyzed and how many sites link to those pages. It all goes into an equation with more than a trillion variables, one for every page of the Internet. This algorithm generates a numerical assessment of each page's quality. Google combines that number with the traditional match-strength algorithm to prioritize its results.
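The idea that a page is as important as the pages pointing to it can be illustrated with a small iterative calculation. The sketch below is a simplified version of the published PageRank method, not Google's production system. The damping factor of 0.85 comes from the original PageRank paper rather than from this article, and the four-page `LINKS` graph is invented.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Repeatedly pass each page's score along its outgoing links.

    `links` maps every page to the list of pages it links to; every
    linked page must also appear as a key.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start with equal scores
    for _ in range(iterations):
        # Every page keeps a small baseline score.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, targets in links.items():
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score evenly.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank

# Hypothetical link graph: three pages point at "hub"; nothing points
# at "obscure".
LINKS = {
    "hub": ["a", "b"],
    "a": ["hub"],
    "b": ["hub"],
    "obscure": ["hub"],
}

ranks = pagerank(LINKS)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph "hub" scores highest because three pages link to it, while "obscure," which nothing links to, keeps only the baseline score. Pages "a" and "b" land in between because their single inbound link comes from the high-scoring hub.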
Brin and Page's billion-dollar realization was that users would rather see a reputable page that matched their query reasonably well than an obscure page that matched perfectly.
These innovations remain the backbone of today's search engines, from Google and Yahoo to Bing and others. But the Web is changing at a staggering pace. The 1994 index for Lycos, one of the Web's first search engines, had only 54,000 pages. To put the proliferation of electronic data in perspective, humankind had generated 5 trillion megabytes of data by the year 2003. We now produce that volume every two days.
Maintaining cutting-edge techniques for filtering out junk is crucial. Google changes its algorithm two to three times every day in response to user behavior. Its engineers monitor which results users select after their searches. If the highest-ranking results aren't the top selections, they adjust the algorithm to promote the more popular pages.
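How click data feeds back into rankings is not public, so the following is only a guess at the general shape of such an adjustment. It blends each page's original score with its share of user clicks; the function name, the 50-50 `weight` and all the numbers are assumptions made up for this example.

```python
def rerank_with_clicks(scores, clicks, weight=0.5):
    """Blend original ranking scores with each page's share of clicks."""
    total = sum(clicks.values())
    blended = {}
    for url, score in scores.items():
        share = clicks.get(url, 0) / total if total else 0.0
        blended[url] = (1 - weight) * score + weight * share
    # Highest blended score first.
    return sorted(blended, key=blended.get, reverse=True)

# Hypothetical data: the algorithm's top pick is not what users choose.
scores = {"top.example": 0.9, "second.example": 0.8, "third.example": 0.4}
clicks = {"top.example": 10, "second.example": 70, "third.example": 20}

print(rerank_with_clicks(scores, clicks))
```

Here the second-ranked page draws 70 percent of the clicks, so the blend moves it to the top. With no click data at all, the original order is unchanged.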
Personalization is a promising if sometimes controversial technique. With access to your browsing history, e-mail records and the documents on your computer, a search engine can guess what pages you want to see. It can tailor results to your home town, your shopping preferences or even your political leanings. How thoroughly a search engine should be permitted to probe your computing habits is an emerging policy question.
A related technique is having an engine use other people's recent queries to suggest similar searches. If you enter the word "Obama" into Microsoft's Bing search engine, it suggests that you add the word "biography" or "speeches" to your search.
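A bare-bones version of this kind of suggestion can be built by counting which word most often follows the typed term in other people's queries. The sketch below assumes that approach; it is not a description of how Bing works, and the `RECENT_QUERIES` log is invented.

```python
from collections import Counter

# Hypothetical log of other users' recent queries.
RECENT_QUERIES = [
    "obama biography",
    "obama speeches",
    "obama biography",
    "obama speeches",
    "obama biography",
    "obama birthplace",
    "bing maps",
]

def suggest(term, log, limit=2):
    """Suggest the words that most often follow `term` in logged queries."""
    term = term.lower()
    counts = Counter()
    for query in log:
        words = query.lower().split()
        if len(words) > 1 and words[0] == term:
            counts[words[1]] += 1
    return [word for word, _ in counts.most_common(limit)]

print(suggest("Obama", RECENT_QUERIES))
```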
Sophisticated natural language processing -- teaching search engines to interpret words the way people do -- is a holy grail in the search-engine industry. The original programs didn't treat search terms as words, but as strings of letters. The only words they truly recognized were connectors such as "and," "or" and "not." Even today, "obtaining good search results depends on getting the words right," according to Jamie Callan, a Carnegie Mellon professor who studies search architecture. "In the next generation," he says, "search engines will jump over vocabulary mismatches or omissions."
Search programs have, indeed, taken baby steps toward speaking our language. They recognize misspellings and include related forms of a search term in their query -- a search for "run" will return pages with "ran." Some even search for close synonyms.
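Both steps, correcting a misspelling and adding related word forms, can be approximated with Python's standard `difflib` module and a lookup table. This is a rough sketch under stated assumptions: the tiny `VOCABULARY` and `RELATED_FORMS` tables are hand-made for the example, and real engines draw on far larger resources.

```python
import difflib

# Hypothetical known words and their related forms.
VOCABULARY = ["run", "search", "engine", "monk"]
RELATED_FORMS = {"run": ["ran", "running", "runs"]}

def expand_term(term):
    """Correct a likely misspelling, then add related word forms."""
    term = term.lower()
    if term not in VOCABULARY:
        # Pick the most similar known word, if any is similar enough.
        close = difflib.get_close_matches(term, VOCABULARY, n=1, cutoff=0.7)
        if close:
            term = close[0]
    return [term] + RELATED_FORMS.get(term, [])

print(expand_term("run"))
print(expand_term("serach"))
```

A term that resembles nothing in the vocabulary passes through unchanged, so the sketch never replaces a word it has no basis for correcting.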
But this is still pretty rudimentary stuff. Search engines don't handle sentences or paragraphs very well, because they simply search for every word in the query rather than interpreting the actual meaning.
According to Elizabeth D. Liddy, dean of the School of Information Studies at Syracuse University, tomorrow's search engines will understand how subjects, verbs and objects interact in a sentence, and they will distinguish active from passive voice. They will be able to differentiate pages about Washington, D.C., from those about Washington state based on context.
Callan thinks future programs will also use "searchonyms," helpful search terms that the user didn't think of. For example, if you were a Washingtonian searching for "professional football team Washington," you'd probably be looking for Redskins information. The search engine would automatically add "Redskins" to your query.
Language is only one frontier. Google's Mayer is driving the search engine toward what she calls an omnivorous search box. "Today, a search box eats only key words," she says. To interface properly with people, however, "it should eat concepts, images and voice streams." Google is already experimenting with a program called Google Goggles, which can analyze UPCs and wine labels, and recognize iconic images.
If you feed it a picture of Notre Dame in Paris, it might even tell you a story about a 13th-century cardinal named Hugues de Saint-Cher, who witnessed its construction.
Palmer, a freelance writer living in New York, is a regular contributor to Slate.com's Explainer column.