Gareth Millward is a research fellow at the Centre for History in Public Health at the London School of Hygiene and Tropical Medicine.


Since the mid-’90s, the Internet Archive has been trawling the information superhighway. Every two months, its robots crawl the Internet and copy every webpage they can find. So far, it has archived more than 430,000,000,000 web pages.

It’s a rich and fantastic resource for historians of the near-past. Historians like me.

History relies on evidence. Medieval manuscripts, letters, modern government documents, novels, films, interviews, objects, buildings, even the layout of farmland. We’ll use anything we can to get a view of how humans behaved in the past. In the 21st century, the web gives us a unique window onto society. Never before has humanity produced so much data about public and private lives – and never before have we been able to get at it in one place.

Until very recently, this was just a curiosity, a theoretical possibility. Now, however, we have the computing power and a deep enough archive to try to use it. But it’s a lot more difficult to understand than we thought.

I was part of a research project organized by the British Library and Institute of Historical Research. We were among the first in the world to use the web archive for academic research. Of course, we knew that there were a lot of pages in the database. But since we could navigate Google reasonably easily, we thought we could use the archive in the same way. Do a search. Get a group of webpages on a particular subject. Read them. Draw some conclusions. How hard could it be?


Let’s take my own research as an example. I wanted to look at how disability organizations used the web. So I took one prominent sight loss charity in the United Kingdom, RNIB. If I Google “RNIB” right now, I get the organization’s home page right at the top. I get a map showing my local branches.

If I put “RNIB” into the British Library’s prototype search engine, I get half a million results, spread across three decades. They’re not sorted by date, or by relevance. I get RNIB press releases alongside random results on employment websites. Oh, and this is just the British web (addresses ending in “.uk”). I’d get hundreds of thousands more on a worldwide search.

The ways in which we attack this archive, then, are not the same as they would be for, say, the Library of Congress. There (and elsewhere), professional archivists have sorted and cataloged the material. We know roughly what the documents are talking about. We also know there are a finite number. And if the archive has chosen to keep them, they’re probably of interest to us. With the Internet, we have everything. Nobody has read through it, and nobody can. And so what is “relevant” is completely in the eye of the beholder.

Take for example one result that kept repeating. RNIB had advertised a talking watch in the early 2000s. This meant that on news stories about soccer, international conflict and fashion – not directly and obviously related to the work done by the charity – I had thousands of repeated references to a timepiece. Now, if I were researching web advertising, this would be absolutely vital information. As someone looking at how RNIB behaved on the web, it was interesting as an isolated artifact. Not something I needed to see on page after page of search results.

Historians must take new approaches to the data. First, we’re going to have to realize that we can’t read everything. We already do this with printed documents, but we need to be more explicit about it and more willing to admit it. Smaller samples of websites, specifically chosen for their historical importance, can give us a much better understanding. We can begin to ask questions about how sites are constructed and what information people and organizations chose to reveal. Similarly, much more focused searches on smaller time periods, more marginal topics, or specific cultural groups can produce a more manageable “corpus” for reading and manipulating in the same way we would on our trips to traditional archives.
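To make the idea of a smaller, deliberately chosen corpus concrete, here is a minimal sketch in Python. It assumes search results arrive as simple records with a URL and a crawl date. The record layout, the sample data and the `narrow` function are hypothetical illustrations, not the British Library's actual search interface.

```python
from datetime import date
from urllib.parse import urlparse

# Hypothetical search results: the field names and values are invented
# for illustration and do not come from any real archive export.
RESULTS = [
    {"url": "http://www.rnib.org.uk/news/press1", "crawled": date(2003, 5, 1)},
    {"url": "http://www.example-jobs.co.uk/listing", "crawled": date(2003, 6, 1)},
    {"url": "http://www.rnib.org.uk/news/press2", "crawled": date(2009, 1, 15)},
    {"url": "http://www.rnib.org.uk/", "crawled": date(2004, 12, 31)},
]


def narrow(results, hostname, start, end):
    """Keep only pages from one host crawled between start and end (inclusive)."""
    kept = [
        r for r in results
        if urlparse(r["url"]).hostname == hostname
        and start <= r["crawled"] <= end
    ]
    # Sort by crawl date so the sample can be read in chronological order.
    return sorted(kept, key=lambda r: r["crawled"])


corpus = narrow(RESULTS, "www.rnib.org.uk", date(2003, 1, 1), date(2004, 12, 31))
for page in corpus:
    print(page["crawled"], page["url"])
```

The point of the sketch is only that the selection criteria are written down explicitly, so another historian can see exactly what was left out and why.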

But the really exciting stuff will come by looking beyond the text of the documents, because the web is – by definition – an interlinked mass of information. By working with computer scientists, we’re beginning to analyze how links between websites are formed and what these relationships between organizations tell us. Do we find, for example, that certain organizations choose to link to other institutions in their wealth bracket? Or can we trace the importance of a charity over time by the number of websites that link to it? What can we find out from unusual links between groups that, ostensibly, shouldn’t have that much in common?
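One of those questions, how many websites link to a charity over time, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the method our project used: it assumes the archive's links have already been extracted as (crawl year, source URL, target URL) records, and the sample data and function names below are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlparse

# Hypothetical link records: (crawl year, linking page, linked-to page).
# The data is invented purely to show the shape of the calculation.
LINKS = [
    (2001, "http://www.example-news.co.uk/story1", "http://www.rnib.org.uk/"),
    (2001, "http://www.example-news.co.uk/story2", "http://www.rnib.org.uk/shop"),
    (2001, "http://www.example-council.gov.uk/help", "http://www.rnib.org.uk/"),
    (2005, "http://www.example-news.co.uk/story3", "http://www.rnib.org.uk/"),
    (2005, "http://www.example-charity.org.uk/links", "http://www.rnib.org.uk/"),
    (2005, "http://www.example-blog.co.uk/post", "http://www.rnib.org.uk/"),
    (2005, "http://www.rnib.org.uk/about", "http://www.rnib.org.uk/"),
]


def host(url):
    """Return the hostname of a URL with any leading 'www.' removed."""
    name = urlparse(url).hostname or ""
    return name[4:] if name.startswith("www.") else name


def inbound_hosts_by_year(links, target_host):
    """Count the distinct external websites linking to target_host, per year."""
    sources = defaultdict(set)
    for year, src, dst in links:
        src_host = host(src)
        # Ignore the organization linking to its own pages.
        if host(dst) == target_host and src_host != target_host:
            sources[year].add(src_host)
    return {year: len(hosts) for year, hosts in sorted(sources.items())}


counts = inbound_hosts_by_year(LINKS, "rnib.org.uk")
print(counts)  # {2001: 2, 2005: 3}
```

Counting distinct linking websites rather than individual pages is a design choice: it stops one site that repeats the same link on thousands of pages, like the talking-watch advertisement, from swamping the result.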

This mass of data we have, far from rendering the archive unintelligible, may give us richer and more fruitful answers. We just need to work out the right questions to ask.