The Washington Post

How Wikipedia reading habits can successfully predict the spread of disease

The ability to forecast the spread of an infectious disease weeks in advance can make a world of difference when it comes to public health responses. For decades, scientists have been trying to create models to predict how something like the flu will spread.

People's Internet usage has opened a new door for predictive data. There are already some tools out there, such as Google Trends, which tries to "nowcast," or show what's happening right now with the spread of certain diseases in the world. There have been studies, too, on whether Twitter can accurately predict how a disease is spreading.

But getting access to Google Trends or Twitter data is not always easy -- or cheap. So a team of mathematicians, biologists and computer scientists got together to see if they could use something that's completely open and free: Wikipedia.

As it turns out, they could accurately forecast how influenza and dengue spread based purely on people's reading habits of Wikipedia articles. Last week, they showed how their algorithm could predict flu season in the United States. The full results of their research are published in this week's PLOS Computational Biology.

"Nowcasting is cool, but ideally you want to provide information to public health departments and policymakers so they can plan ahead of time," said Sara Del Valle, a project leader at Los Alamos National Laboratory whose team worked on the study. "Because if you really want to make a difference in how people are treated when they come to clinics and hospitals, it's better for them to be prepared. If they know in advance, we will see people in a couple of weeks, four weeks, they can better prepare."

Researchers looked at seven diseases and 11 countries over a period of three years, starting in 2010, and compared page views on Wikipedia articles about those diseases to official data from health ministries. By looking at readers' habits, they successfully predicted the spread of influenza in the United States, Poland, Thailand and Japan, and of dengue in Brazil and Thailand, at least 28 days in advance.

Official government data -- usually released with a one- or two-week lag time -- lagged four weeks behind Wikipedia reading habits, according to Del Valle; people, she said, are probably reading about the illnesses they have before heading to the doctor.
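To make the lead-time idea concrete, here is a minimal Python sketch. It is not the researchers' published code. It slides a made-up weekly page-view series against a made-up official case series and reports the shift that gives the highest Pearson correlation. The function names (`pearson`, `best_lead_weeks`), the `max_lead` parameter and all of the numbers are illustrative assumptions.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat series carries no correlation signal
    return cov / sqrt(var_x * var_y)


def best_lead_weeks(page_views, official_cases, max_lead=6):
    """Return (lead, r) for the lead, in weeks, at which page views
    correlate best with later official case counts."""
    best = None
    for lead in range(max_lead + 1):
        # Pair views in week t with cases reported in week t + lead.
        views = page_views[:len(page_views) - lead]
        cases = official_cases[lead:]
        r = pearson(views, cases)
        if best is None or r > best[1]:
            best = (lead, r)
    return best


# Synthetic weekly data (assumption): official counts are a scaled copy
# of the page views, delayed by exactly four weeks.
page_views = [10, 12, 30, 80, 150, 90, 40, 15, 10, 11, 12, 10, 9, 10, 11, 10]
official_cases = [1.0] * 4 + [v / 10 for v in page_views[:-4]]

lead, r = best_lead_weeks(page_views, official_cases)
print(f"page views lead official data by {lead} weeks (r = {r:.2f})")
```

Because the synthetic case counts were built as a four-week-delayed copy of the page views, the sketch recovers a four-week lead by construction. Real surveillance data would be far noisier.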

But not all the diseases or countries yielded such results; they couldn't predict slow-progressing diseases like HIV/AIDS, or diseases with very small numbers of victims, such as Ebola (before the current outbreak) in Uganda or the plague in the United States. Seasonal diseases were much easier to forecast using the Wikipedia model.

And the study had other limitations; for instance, researchers used language as a proxy for country (Japanese articles about influenza were used to predict the spread of the disease in Japan). That may work for some languages, but for some more widely spoken ones, like English, it can be trickier.

Even so, researchers were able to accurately predict the spread of influenza in the United States by examining the page views for English Wikipedia articles. They hope they can next get country-specific data from Wikipedia.

They found that their model was transferable, meaning they could use data in one country to forecast the spread of a disease in another. The team has made its algorithm and code free for all to use, and Del Valle said there could be many more uses for Wikipedia data.
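As a rough illustration of what "transferable" could mean in practice, the Python sketch below fits a straight line to one country's data and applies it unchanged to another country's page views. It assumes a simple least-squares model and a fixed four-week lead. The article does not describe the team's actual model, and the helper names (`fit_lagged`, `forecast`), the units and every number here are hypothetical.

```python
def fit_lagged(view_share, case_rate, lead):
    """Least-squares fit of case_rate[t + lead] = slope * view_share[t] + intercept."""
    xs = view_share[:len(view_share) - lead]
    ys = case_rate[lead:]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def forecast(view_share, slope, intercept):
    """Predicted case rate `lead` weeks after each page-view observation."""
    return [slope * v + intercept for v in view_share]


LEAD = 4  # weeks; assumed, echoing the four-week lag described in the article

# Country A (training, synthetic): flu-article views per 1,000 page views,
# and cases per 100,000 people reported four weeks later.
share_a = [0.10, 0.12, 0.30, 0.80, 1.50, 0.90, 0.40, 0.15, 0.10, 0.11]
rate_a = [1.0] * LEAD + [20 * s + 1 for s in share_a[:-LEAD]]

slope, intercept = fit_lagged(share_a, rate_a, LEAD)

# Country B (synthetic): reuse country A's fitted line without refitting.
share_b = [0.2, 0.5, 1.0]
predicted_b = forecast(share_b, slope, intercept)
print([round(p, 1) for p in predicted_b])
```

Working with shares and rates rather than raw counts is a design choice made for this sketch, so that countries with different populations and traffic volumes are on a comparable scale.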

"That was our main goal: people could go and replicate what we did," Del Valle said. "Not only replicate, but also improve."