Attempts to digitize the world’s 129 million books have focused, almost exclusively, on their text. But a fascinating collaboration by Georgetown University fellow Kalev Leetaru, in partnership with Flickr and the Internet Archive, will attempt to modernize 500 years of literary heritage in a new, and entirely of-the-Web, way: by pulling the images from more than 2 million books, tagging their themes, and making them available on Flickr.
There are charming vintage ads for horse-drawn carriages and coffee. Early illustrations of D.C.’s city plans. Stunning, sepia-toned photos of national parks and landmarks.
And all of them are in the public domain, which means you can save, repurpose or Pinterest them more or less as you see fit.
But the project is more than a series of pretty pictures — it also clears one of the remaining hurdles between archived knowledge and the public. Many libraries and museums have already digitized their materials. The problem is that, frequently, they then dump that information online as massive PDFs, which can’t be easily searched, skimmed or shared. They might be online, in other words, but they still behave like Gothic tomes: weighty, impenetrable and largely inaccessible to the modern reader.
“This project inverts how we think of all of those books,” explains Leetaru, who wrote the software behind the project. “[It treats] books as galleries of images instead of reams of text.”
It’s an interesting project for Leetaru, a D.C.-based computer scientist and academic who tends to think of big data in very big ways. He’s previously developed programs that analyze news archives to predict future events, and he was on the first research team to map Twitter sentiment in real time. To Leetaru, the problem of digitized books was essentially a question of big data: how to translate abstract visuals into concrete, manipulable strings of code, how to classify and group them, and how to navigate the system that results.
So, working on nights and weekends, Leetaru wrote a program that would automatically extract images from books previously digitized by the Internet Archive. The program then tagged each image with information like the book’s year, title and the text accompanying the image, and uploaded the whole packet to Flickr.
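To give a rough sense of the tagging step described above, here is a minimal sketch — not Leetaru’s actual code, and the field names are illustrative rather than Flickr’s real API fields — of how an extracted image might be bundled with its book’s year, title and surrounding text into an upload-ready record:

```python
def build_image_record(book_title, year, page_number, surrounding_text):
    """Bundle an extracted image's context into an upload-ready record.

    A hypothetical helper: all field names here are illustrative,
    not the fields of any real photo-sharing API.
    """
    return {
        "title": f"Image from '{book_title}' ({year}), page {page_number}",
        "description": surrounding_text.strip(),
        # Simple keyword tags drawn from the text around the image,
        # lowercased and stripped of trailing punctuation.
        "tags": sorted({
            word.lower().strip(".,;")
            for word in surrounding_text.split()
            if len(word) > 4
        }),
    }

record = build_image_record(
    "Birds of North America", 1895, 212,
    "Plate XIV shows the migratory warbler in spring plumage.",
)
print(record["title"])
# → Image from 'Birds of North America' (1895), page 212
```

Pairing each image with the text around it is what makes the collection searchable: the words on the page become the tags by which the picture can later be found.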
“The power of ‘big data,’ ” Leetaru said, “means that a single person working purely on his personal time was able to extract the images of 600 million pages of books spanning 500 years.”
He’s uploaded 2.6 million images so far, with an estimated 10 million more to follow. And the possibilities for those images, when considered as data, are huge: Leetaru imagines an app that could infinitely cycle through a tag — say, “birds” or “beaches” — and display a never-ending art collection like a screensaver on your wall. He thinks art historians could develop algorithms that identify similar images and trace the evolution of subject matter over time.
Perhaps most importantly, Leetaru thinks any library, anywhere, could add to the data trove: In a matter of weeks, in fact, he’s releasing his software and instructions for using it, with the hope that, eventually, archivists will collect “all of the world’s out-of-copyright book images in this single massive gallery of our history.”
He’s already well on his way: The images uploaded thus far represent 500 years of out-of-copyright images, spanning every conceivable topic. It is, Leetaru says, “perhaps the greatest collection of public domain imagery ever created.”
You can see the full collection here.