What population sample will bring us wisdom?
This eagle owl is pictured at a zoo in Mulhouse, eastern France. (Sebastien Bozon/AFP via Getty Images)

Last year, a widely reported paper in Science found that less than half of published research in top, peer-reviewed psychology journals replicated when the studies were repeated by other researchers. Last week, a new commentary in Science called the conclusions of the original study into question, arguing instead that the data indicate “the reproducibility of psychological science is quite high.”

With the back and forth over the psychology replication studies, it is worth highlighting new research by Kevin Mullinix, Thomas Leeper, James Druckman and Jeremy Freese, published in the Journal of Experimental Political Science (JEPS) (a journal I was co-editing when the article was reviewed), that finds high rates of replication among a diverse set of experimental studies from across the social sciences. These results are mirrored in a working paper by Columbia University PhD candidate Alex Coppock. I recently (although before the new commentary in Science appeared) posed the following questions to Mullinix, Leeper and Coppock:

Joshua Tucker: How exactly did you set up your replication studies, and how were they different from the psychology replication studies?

Thomas Leeper: For our paper, we conducted two sets of replication projects. In the first project, we started with three basic experiments. We implemented each experiment on five different samples – a nationally representative sample, a sample of undergraduate students, a convenience sample of adults, participants in an exit poll in two modest-sized Midwestern cities and an online sample recruited from Amazon Mechanical Turk, a website increasingly used by researchers to connect with and compensate survey respondents.

Kevin Mullinix: In the second project, we chose 20 experiments that had been vetted by a project called Time Sharing Experiments in the Social Sciences (TESS). These were experiments that had gone through the scientific peer review process before being implemented using a nationally representative sample of U.S. adults. We then replicated these experiments using the Mechanical Turk platform.

Alex Coppock: I followed a very similar format to Thomas and Kevin. In total, I replicated 12 studies on Mechanical Turk, seven of which were originally conducted on TESS and five of which were originally conducted on other national probability samples.

TL: So, a big distinction is important here: We replicated studies that had been implemented but not necessarily published. Only some of the TESS experiments are ever published (for reasons that Annie Franco, Neil Malhotra, and Gabor Simonovits have recently examined). By contrast, the paper in Science only tried to replicate studies that had been published in top journals. So we were in some sense looking in the “file drawer” to see whether scientific studies could replicate, not whether studies published in top journals could replicate. Indeed, many of the studies we replicated had so-called “null” effects in both the original experiment and our replication.

JT: What were your overall findings? 

TL: Our results are very clear. In our first project (three experiments implemented with five distinct samples), our results are consistent over 80 percent of the time. The Science study, by contrast, found consistent results across replications about 40 percent of the time. We find the same thing in our second project (20 experiments each replicated once): The effect sizes in the original studies and the replications correlate at about 0.75. (A correlation of 0 would mean the original and replication results were completely unrelated, and a correlation of 1 would mean the pattern of results replicated perfectly.) So that’s quite consistent.

AC: In my set of studies, the overall correlation is 0.81. This figure is quite high, especially considering that the estimated effects are somewhat noisy, so the correlations appear smaller than they actually are. For comparison, the equivalent figure from the Science study is 0.51.
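The attenuation Alex mentions, that sampling noise in both the original and replication estimates pulls the observed correlation below the true one, can be illustrated with a small simulation. Everything here (the number of studies, the spread of effect sizes, the noise level) is invented for illustration:

```python
import random
import statistics

random.seed(1)

def pearson(xs, ys):
    """Pearson correlation of two equal-length lists."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sx, sy = statistics.pstdev(xs), statistics.pstdev(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) * sx * sy)

# Hypothetical true effect sizes for 500 studies.
true_effects = [random.gauss(0, 1) for _ in range(500)]

# The original study and the replication each estimate the true effect
# with independent sampling noise.
noise = 0.5
originals = [e + random.gauss(0, noise) for e in true_effects]
replications = [e + random.gauss(0, noise) for e in true_effects]

# Even though the underlying effects are identical (true correlation 1.0),
# the observed original-replication correlation is attenuated by the noise.
r = pearson(originals, replications)
print(r)  # noticeably below 1.0
```

With these made-up numbers the observed correlation settles around 0.8, the same mechanism that makes the 0.75 and 0.81 figures in the interview understatements of the true consistency.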

JT: The psychology replications used populations similar to those in the original studies, while you used quite different types of populations. Yet you were able to replicate a dramatically larger proportion of studies. Why do you think that was the case?

KM: We suspect there are a few reasons for this. First, we were broader in our replications. Our research crossed disciplinary boundaries. TESS funds research across the social sciences, so we are getting a much wider view of replication than in the Science study. Second, we were more focused on the types of studies we replicated. The Science study selected all kinds of research – including experiments and observational studies – and included research that was conducted in all kinds of settings with all kinds of samples.

We only replicated experiments (no observational studies), and all of our experiments were conducted in a survey setting. This means that the scientific protocol used in our replications was essentially identical to the original studies. The Science replications were conducted by different research teams, sometimes with markedly different protocols than in the original research.

Since we were replicating studies that may or may not have been published, we took a uniform analytical strategy whereby we simply assessed the effect of the experiment for each study’s first experimental test. In these types of experiments, the outcome variable measures participants’ answer to a survey question. We simply looked at how much respondents’ answers to this outcome question differed between the control and treatment groups.

Had we replicated analyses used in publications – where variables may have potentially been selectively picked or different analytical strategies employed – we may have replicated results at a lower rate. We know that research showing statistically significant effects is much more likely to be published than research showing no significant effects, which has led some to question the validity of published research.

TL: By replicating experiments that had never been published, we included studies with non-significant results in the original study that were also non-significant in the replication. Andrew Gelman, one of your fellow Monkey Cage bloggers, has repeatedly shown that publication processes introduce a “significance filter,” meaning that effects in published research are likely to be overestimates of true effect sizes. I think that highlights the scientific publication process as a major reason why the Science study found such low rates of replication and why we find something so different.

AC: I’ll offer one more hypothesis for why the replication rate for the psychology studies was lower than what we found: small sample sizes. Small sample sizes mean that in order for an estimate to be deemed statistically significant, it has to be large – a lucky draw. If journals only publish statistically significant findings, the result is the systematic overreporting of large, noisy estimates. By contrast, the TESS studies start out with much larger sample sizes, so both the original and replication treatment effects are measured with much greater precision.
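Alex’s point about small samples can be seen in a quick simulation. The sketch below keeps only the simulated study estimates that clear a conventional significance bar; the true effect, noise level and sample sizes are all invented numbers:

```python
import random
import statistics

random.seed(0)

def significant_estimates(true_effect, sd, n, trials=2000):
    """Simulate `trials` studies and keep only the 'significant' ones.

    Each study estimates `true_effect` with standard error sd / sqrt(n);
    an estimate counts as significant when it exceeds 1.96 standard errors.
    """
    se = sd / n ** 0.5
    estimates = [random.gauss(true_effect, se) for _ in range(trials)]
    return [e for e in estimates if e > 1.96 * se]

true_effect = 0.2
small = significant_estimates(true_effect, sd=1.0, n=50)    # small-n studies
large = significant_estimates(true_effect, sd=1.0, n=2000)  # TESS-sized studies

mean_small = statistics.mean(small)
mean_large = statistics.mean(large)

# With n = 50 only unusually large, lucky draws clear the significance bar,
# so the surviving (publishable) estimates overstate the true effect of 0.2;
# with n = 2000 nearly every estimate is significant and the average is honest.
print(mean_small)  # well above 0.2
print(mean_large)  # close to 0.2
```

This is the mechanism Alex describes: a journal that publishes only the significant small-sample estimates ends up publishing systematic overestimates, which then look inflated next to their replications.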

JT: There’s no denying that the Mechanical Turk population differs from the general population in important ways: They are younger, whiter, more liberal and better educated. What can explain the close correspondence between the experimental results even despite these systematic differences?

TL: Our research is consistent with a growing body of evidence about the use of Mechanical Turk in the social sciences. For example, earlier research by our co-author Jeremy Freese, along with Jill Weinberg and David McElhattan, found similar results on three experiments conducted on Mechanical Turk and with a national sample. Aggregating this accumulated evidence, one plausible explanation for the correspondence is that the effects of many interventions studied to date are homogeneous, meaning they are largely the same across all kinds of individuals, so it does not necessarily matter who the subjects of the study are.

KM: It is important to keep in mind, though, that Mechanical Turk respondents have distinct characteristics, many of which might be important in some areas of research. Research by Yanna Krupnikov and Adam Seth Levine in the Journal of Experimental Political Science, for example, shows these respondents sometimes behave differently from other respondents in relatively simple experiments similar to the ones we conducted. Connor Huff and Dustin Tingley have also shown that even though the respondents on Mechanical Turk are diverse, certain demographic combinations, such as Asian American males or individuals from certain geographic locations, are rare. That reiterates that even though non-representative samples like Mechanical Turk can provide useful insights, there are a number of research contexts in which social scientists need to conduct research on nationally representative and diverse samples.

JT: What about treatments that *do* impact different types of subjects differently? Is Mechanical Turk still OK to use in those instances?

TL: Of course, that is exactly the issue. If we think some treatment (for example, a persuasive message about Social Security) will resonate more with older individuals (the treatment is moderated by age), a disproportionately young sample may be problematic. That is, a convenience sample may reveal a null effect even though, had the study been implemented with a representative sample, the treatment effect would have been evident.

But we do not always have compelling theories to explain when the effects of a given treatment are likely to be homogeneous (similar) versus heterogeneous (different) across individuals. That suggests that we need to start looking more carefully for sources of effect variation and rely on large, diverse samples to detect those patterns. This is actually what Alex’s ongoing research is showing.

AC: Yes, treatment effect homogeneity – that the treatment being studied has a similar effect across different individuals — is my working hypothesis to explain the high replication rate. The experiments I replicated all involved treatments that, one way or another, were trying to persuade people to change their minds about political issues. In related work with Andrew Guess, I show that people, regardless of background characteristics, appear to update their political attitudes in the direction of (randomly assigned) evidence by approximately the same amount, leading me to think that persuasive treatments tend to have similar effects across lots of different types of people.

JT: Many scholars are still uncomfortable with using Mechanical Turk for research. What limitations do you see for the platform?

KM: The main thing to keep in mind is that Mechanical Turk respondents are not a representative sample of a population. Our research suggests that these respondents behave similarly to others in a wide array of contexts, but they still do not represent the population as a whole. We should be cautious about Mechanical Turk research trying to make descriptive claims (for example, to predict elections or presidential approval) and studies where the results are contingent on particular characteristics of subjects and the sample provides little variation in those characteristics. For these kinds of research, we need to rely on nationally representative samples, like those provided by TESS.

TL: In fact, it is only because we have access to the nationally representative samples provided by TESS that we are able to know that the Mechanical Turk results at this point in time are credible. Others might want to repeat our study in the future to see if the replication rates remain high.

AC: I agree, and to reiterate Thomas’s point earlier, it can be really difficult to develop strong theories about how or why treatment effects could be different on Mechanical Turk. We know that subjects on Mechanical Turk differ from the national population in all sorts of measured ways, like age, partisanship and gender. If these measured characteristics were the only dimensions along which treatment effects varied, then we’d have no problem: We could just reweight the sample data to match the population. The difficulty comes from treatments whose effects might vary with unmeasured characteristics of the subjects – that’s when it might be dangerous to extrapolate from one sample to another, and when national probability samples are so crucial.
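The reweighting Alex describes, matching a sample to the population on measured characteristics, can be sketched as simple post-stratification weighting. Everything below, the age groups, population shares and outcome values, is hypothetical:

```python
# A minimal post-stratification sketch: reweight a convenience sample so its
# age distribution matches assumed population shares. All numbers are invented.

sample = [
    {"age_group": "18-29", "outcome": 0.70},
    {"age_group": "18-29", "outcome": 0.60},
    {"age_group": "18-29", "outcome": 0.65},
    {"age_group": "30+",   "outcome": 0.40},
]

# Assumed census-style population targets (the sample skews young, 3 of 4).
population_shares = {"18-29": 0.25, "30+": 0.75}

# Share of each group in the sample itself.
counts = {}
for r in sample:
    counts[r["age_group"]] = counts.get(r["age_group"], 0) + 1
sample_shares = {g: c / len(sample) for g, c in counts.items()}

# Each respondent's weight = population share / sample share for their group.
for r in sample:
    r["weight"] = population_shares[r["age_group"]] / sample_shares[r["age_group"]]

unweighted = sum(r["outcome"] for r in sample) / len(sample)
weighted = (sum(r["outcome"] * r["weight"] for r in sample)
            / sum(r["weight"] for r in sample))

# Weighting pulls the estimate toward the older group the sample underrepresents.
print(unweighted, weighted)
```

As the exchange notes, this trick only works for dimensions we can measure and weight on; if effects vary with unmeasured characteristics, no reweighting of a convenience sample can recover the population answer.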