washingtonpost.com
AOL Search Queries Open Window Onto Users' Worlds

By Ellen Nakashima
Washington Post Staff Writer
Thursday, August 17, 2006

Out of more than 36 million search queries that hundreds of thousands of AOL users typed into AOL's Internet search engine from March to May, here is the term most queried: Google.

That so many customers would use one search engine to find another is among the odd truths being mined from AOL's public release of search data. The company last week called the incident involving 658,000 users' queries a "screw-up" and apologized. But for better or worse, the data offer the first widespread public glimpse of how people search the Internet, of what they are interested in. Of how people think.

In just a week, the breach has spawned a cottage industry of Web sites and online commentary devoted to analyzing and parsing the data, which include Social Security numbers and potentially embarrassing searches, such as "bad breath could it be an infection in one of my teeth."

While acknowledging concerns about privacy, researchers said the release is an opportunity to study how people search for information in a limitless universe of data.

Web sites have devoted themselves to combing through the information. There's http://www.dontdelete.com, which includes a feature offering "hours of entertainment" that will, lottery style, pull up a random search from the AOL information.

SEOSleuth.com shows which Web sites were most visited by those AOL searchers: Google and MySpace were tops. There's even a site in German, Sistrix.com, that looks at search-term frequency.

Even privacy advocates who were outraged by the breach have analyzed the search strings, mainly so they can provide evidence to back up their claims about how invasive the data are. Even though AOL assigned random ID numbers to each user, some search strings provide enough clues that anyone with access to databases of phone or Social Security numbers or addresses could try to link that data to a person.

A Washington Post analysis turned up at least 190 searches in the data set that appeared to contain a Social Security number and at least several thousand that contained possible telephone numbers.
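
The article does not say how The Post ran its analysis. As a rough illustration only, a scan of this kind could be sketched with simple pattern matching. The two patterns, the function name flag_queries and all but the first sample query below are assumptions made up for this example, not details from the AOL data or from The Post's method. Strings that merely look like a Social Security or telephone number will also match, which is why such counts can only describe what a search "appeared to contain."

```python
import re

# Assumed formats: Social Security numbers written as 3-2-4 digits with hyphens,
# and U.S. phone numbers written as 3-3-4 digits with common separators.
SSN_RE = re.compile(r"(?<!\d)\d{3}-\d{2}-\d{4}(?!\d)")
PHONE_RE = re.compile(r"(?<!\d)\(?\d{3}\)?[ .-]?\d{3}[ .-]\d{4}(?!\d)")


def flag_queries(queries):
    """Return (ssn_like, phone_like): the queries that match each pattern."""
    ssn_like = [q for q in queries if SSN_RE.search(q)]
    phone_like = [q for q in queries if PHONE_RE.search(q)]
    return ssn_like, phone_like


# Sample queries: the first is quoted in the article, the rest are invented.
sample = [
    "bad breath could it be an infection in one of my teeth",
    "records for 123-45-6789",
    "call 555-867-5309 tonight",
    "google",
]
ssn_like, phone_like = flag_queries(sample)
print(len(ssn_like), "SSN-like;", len(phone_like), "phone-like")
```

A match is only a candidate. Order numbers, ZIP+4 codes and similar digit strings can produce false positives, so a real review would still need a person to check each hit.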

JoAnn Whitman, a 55-year-old retired grocery store worker from Grand Junction, Colo., accidentally typed an order confirmation from Bed, Bath & Beyond into the AOL search engine on May 3. The entry included her name and address. Contacted by The Washington Post, she expressed dismay.

"They say, 'Oh, we'll protect it, but it's not secure,' " she said of the data. "I don't think that it's anybody else's business."

She said that she had not heard of the AOL data disclosure and that she was thankful there was nothing really embarrassing in her searches, which included queries to "www.mervynsboys shoes .com" and "www.Wellfargobank.com."

Paul Boutin, a technology columnist for the online magazine Slate, owned by The Washington Post Co., has created his own user typology with the data. In an article titled "You Are What You Search," he grouped users into seven categories, including the Pornhound, who shifts from "poems about a red rose" before midnight to "sexy dogs and hot girls" a half-hour later; the Newbie, including folks who type in http://www.google.com; and the Basket Case, which includes the person who came up with the search query: "I hurt when I think too much I love roadtrips I hate my weight I fear being alone for the rest of my life."

Boutin also works for a start-up in Silicon Valley that makes software to sift through computer-generated search logs. "If you look at search logs for even a few weeks, you've seen every crazy search term you can imagine," he said. "The only interesting thing left is what are the large-scale and long-term patterns in this data."

For instance, he said, do ad campaigns actually drive people to search for things? Did Chrysler's "Ask Dr. Z" sweepstakes move people to look up Dr. Z online? It's useful stuff for marketers, he said.

"My first reaction was horror at the privacy implications," said Matthew Hindman, a political science professor at Arizona State University. "And then I got excited about all the fun things we could learn from the data."

For instance, he said, which search terms are people using to find Web sites with political content such as gun control, abortion or campaigns? That's usually difficult to determine because political Web sites generate a small percentage of all Internet traffic.

"Having this great mass of raw data from average users really is a great opportunity to find out about how citizens search," he said.

The data are also useful for researching the claim, he said, that the Internet is part of a communications shift from broadcast to narrowcast, from the era of Walter Cronkite to an era of podcasts and online news. Both Republican and Democratic political consultants, he said, have argued that a changing media environment necessitates a change in campaign tactics -- giving more power to small-scale news producers.

AOL's own researchers published a paper, posted online before the data breach, called "A Picture of Search." They found, among other things, that the most frequently searched subject category -- about 15 percent of queries -- was "other." That means the largest single group of searches was so diverse it could not be categorized. Entertainment was next, followed by shopping, with pornography at about 7 percent.

AOL declined to make its researchers available for comment.

The Electronic Frontier Foundation this week filed a complaint with the Federal Trade Commission alleging that AOL violated its privacy policy and deceived users about how their data were being used. The San Francisco advocacy group provided examples of search strings containing sensitive personal terms that could be linked to individuals. It was unclear whether the individuals were the users themselves.

Asked whether the research benefits outweigh the privacy implications, Hindman said no. "This sets a terrible precedent, and the hope is that other companies will learn from this mistake and put much stricter guidelines on how search data such as this is handled," he said.

But he said he is still going to use the data because "it seems pretty clear the damage is already done."

Research database editor Derek Willis and news researcher Madonna Lebling contributed to this report.


© 2006 The Washington Post Company