By Peter Whoriskey
Washington Post Staff Writer
Thursday, December 11, 2008
Google's professed corporate mission is "to organize the world's information."
But for years, the U.S. government, one of the world's largest depositories of data, has been unwilling or unable to make millions of its Web pages accessible.
"The vast majority of information is still not searchable or findable either because it's not published or it's on Web sites which the government has put up which no one can index," Google chief executive Eric Schmidt said during a recent presentation at the New America Foundation.
Now Schmidt has a unique opportunity to change that as an informal adviser to President-elect Barack Obama, a tech booster who dubbed his first Senate law "Google for government" because it aimed to make federal information more accessible.
Today, a wide array of public information remains largely invisible to the search engines, and therefore to the general public, because it is held in such a way that the Web search engines of Google, Yahoo and Microsoft can't find it and index it. Not surprisingly, Yahoo and Microsoft officials agree that people would be better served if more public information became accessible to their search engines.
A person using one of the search engines, for example, can't find Environmental Protection Agency enforcement actions against a given company, can't discover the picture of a specific ancient Egyptian artifact at the Smithsonian and can't search by name for the details of a Vietnam War casualty.
And for many Web users, if an online item can't be found with a Web search engine, then for all practical purposes it doesn't exist.
"Unfortunately, too much of the public information provided on government Web sites just doesn't show up when the average American does a Google search," said J.L. Needham, Google's manager of public-sector content partnerships. "As a result, information that is intended for the public's use is effectively invisible."
To be sure, much of the information that the search engines are asking for is already digitized and available on the Web. EPA enforcement actions can be found through a portal on the agency's site, details on Egyptian artifacts can be found through a search of the National Museum of Natural History's site and details of a Vietnam War casualty may be found by searching the National Archives site.
The trouble, as the search engines see it, is that most Web users have become accustomed to finding information by typing queries into one of the engines -- and if they don't find it there, they give up.
Needham estimates that 1,000 federal government Web sites are inaccessible to search engine "crawlers," the programs that are run to discover what information is available on the Web.
Much of the inaccessibility stems from the fact that so much federal government data, while public, can be accessed only after users fill out an online form. The search engines' crawlers generally can't look into such databases.
For example, Google notes that details on an Environmental Protection Agency enforcement action against Anheuser-Busch can't be found by entering a simple search query such as "EPA enforcement Anheuser-Busch." Instead, a person needs to know to go to a particular EPA enforcement Web site and enter "Anheuser-Busch."
Making those databases visible to search engines would require the federal government to turn each item into its own Web page and then to provide a list of those Web page addresses to the search engines.
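One common way to hand search engines such a list is an XML file in the Sitemaps protocol format, which the major search engines read. The sketch below is only an illustration of the idea, not a description of what any agency actually runs: the base address `https://example.gov/enforcement`, the record identifiers and the helper names `record_url` and `build_sitemap` are all hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote

# Namespace defined by the Sitemaps protocol (sitemaps.org).
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def record_url(base_url, record_id):
    """Build one page address per database record (hypothetical URL scheme)."""
    return base_url.rstrip("/") + "/" + quote(str(record_id), safe="")


def build_sitemap(urls):
    """Return a sitemap document listing every address in `urls`."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical record identifiers standing in for rows of a form-only database.
records = ["anheuser-busch-2008", "case 42"]
urls = [record_url("https://example.gov/enforcement", r) for r in records]
sitemap_xml = build_sitemap(urls)
print(sitemap_xml)
```

A crawler that reads such a file can fetch each listed page directly, without having to fill out a search form first.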
Microsoft is working with more than 25 federal agencies to make their Web sites "crawlable" by search engines.
"I do agree with Google," said Molly O'Neill, chief information officer of the EPA, which has more than 200 Web sites. "When people search, they should be able to find the data."
But information technology officials in the federal bureaucracy said that the transition may require significant manpower and that the costs could be large.
"We have been working very closely with Google," said Francisco Camacho of the Web services division of the Smithsonian. "With limited resources as always, it's a little bit hard."
The National Archives expects that its entire database containing descriptions of its holdings will be available to Google by January, said Pamela Wright, a program manager for the National Archives and Records Administration. The EPA has made some sites accessible, too, and the Smithsonian has sent Google the links for 78,000 pages, Camacho said.
Some federal officials have grumbled, however, that Google is making this push purely for financial reasons: The more that is available to search engines, the more people will use search engines, letting Google show advertising to more people.
"The more information is available, the more people are likely to use Google," said Danny Sullivan, editor in chief of http://SearchEngineLand.com. "It does help Google in the end."
But Needham said the company's motive in the federal Web site effort isn't the money; it's making sure customers find what they want.
"We don't care because there is monetization value," Needham said. "It's because if we fail to answer a question, then our users are disappointed with us, not their government."