In a new paper, “Don’t Get Duped: Fraud through Duplication in Public Opinion Surveys,” Noble Kuriakose, a researcher at SurveyMonkey, and Michael Robbins, a researcher at Princeton and the University of Michigan, gathered data from “1,008 national surveys with more than 1.2 million observations, collected over a period of 35 years covering 154 countries, territories or subregions.”
Kuriakose and Robbins continue:
“Each survey in these projects is nationally representative or nearly nationally representative. Most surveys have 1,000 or more respondents, although a select number have fewer in select countries. Each instrument is lengthy, covering 75 or more questions or more on a range of topics.”
These are serious surveys. But a lot of them seem to have problems:
“For each survey, we examined the distribution of maximum percent match for substantive variables for every unique country-year to determine if there is a significant likelihood of substantial data falsification via duplication. . . . We find that nearly one in five country-year surveys in publicly available datasets included in our analysis has a level of near (or full) duplication of 5 percent or greater. These results imply that duplicates and near-duplicates present a prevalent problem that has been largely undetected to date.”
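To make the “maximum percent match” idea concrete, here is a minimal sketch of how I read it: for each respondent, find the other respondent whose answers agree on the largest share of questions, then ask what fraction of respondents have a suspiciously high best match. This is my own illustration, not the authors’ code. The function names, the toy data and the 0.85 cutoff are all assumptions made for the example; the quoted passage only gives the survey-level figure of 5 percent.

```python
def max_percent_match(responses):
    """For each respondent, return the highest share of answers
    that are identical to those of any other respondent."""
    n = len(responses)
    best = [0.0] * n
    for i in range(n):
        for j in range(i + 1, n):
            a, b = responses[i], responses[j]
            # Share of questions on which respondents i and j gave the same answer.
            share = sum(x == y for x, y in zip(a, b)) / len(a)
            best[i] = max(best[i], share)
            best[j] = max(best[j], share)
    return best


def share_flagged(responses, threshold=0.85):
    """Fraction of respondents whose best match meets the threshold.
    The 0.85 default is a placeholder chosen for illustration."""
    scores = max_percent_match(responses)
    if not scores:
        return 0.0
    return sum(s >= threshold for s in scores) / len(scores)


# Toy survey (made up): 4 respondents, 4 questions each.
rows = [
    [1, 2, 3, 4],
    [1, 2, 3, 4],  # exact duplicate of the first respondent
    [1, 2, 3, 5],  # matches the first two on 3 of 4 questions
    [4, 3, 2, 1],  # matches nobody
]

print(max_percent_match(rows))
print(share_flagged(rows))
```

A survey would then be a candidate for closer inspection if the flagged share reached some survey-level cutoff, such as the 5 percent mentioned in the quote. The pairwise loop is quadratic in the number of respondents, which is fine for a survey of a thousand or so people.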
The problems were much more serious in non-OECD countries. The OECD is the so-called rich-countries club, so what Kuriakose and Robbins found is that data problems were more common in surveys in poorer countries: 5 percent of surveys in rich countries had duplication problems, compared to 26 percent in the other countries.
Where is the duplication coming from? Kuriakose and Robbins suggest it is people hired to do data collection who make up responses because it’s less effort than actually gathering the data. And of course this happens in rich countries too, as in the notorious survey fabricated by Michael LaCour.
One thing I don’t quite understand in this new paper is why the authors don’t list the surveys where they suspect fraud. That would be good to know, right? What they should really do is post all their raw data, but perhaps they don’t have permission from the individual surveys to do this. But they could still post all their code and give their results on a survey-by-survey basis. Especially when fraud is involved, it makes sense for us to be able to see exactly what analysis was done here.
Not everyone is happy with this new paper. Michael Dimock, president of the Pew Research Center, which organizes many cross-national surveys, expresses skepticism about the claims of fraud. He writes:
“We assigned a team of international-survey and methods experts to look into both their newly proposed fraud detection method and our own international survey data. Our assessment is that their particular statistical tool is not an accurate stand-alone indicator of data fraud. And so their claim that one-in-five international surveys contain falsified data is untrue.”
Here’s the Pew Center’s longer response, by Katie Simmons, Andrew Mercer, Steve Schwarzer and Courtney Kennedy, who report that “natural, benign survey features can explain high match rates.”
I have not looked at all of these reports in detail, and at this point I’d guess that (a) Kuriakose and Robbins’s measure has issues, since its baseline is based on an unrealistic simulation model, but (b) the number of duplicate responses is high enough that, once this all shakes out, we will indeed take it as strong evidence of some duplication of the survey responses. We’ll have to see. But for now I’ll just say that there’s reason for concern.