BN: There’s a growing consensus that scientific research faces a crisis of publication bias and non-replicable research. From political science to psychology and medicine, our journals are filled with far too many narrowly statistically significant results that frequently fail to replicate. By contrast, null results are almost non-existent in our top journals even though many of our theories are wrong. (I’m not a big fan of null hypothesis significance testing but it’s the convention in scientific publishing so let’s set aside the problems with it for another day.)
Preregistration is an important step toward addressing these problems. Under this approach, researchers would formally specify theoretical expectations, research design, and analysis plan before conducting a study. The act of committing to a set of predictions and hypothesis tests allows readers to better separate confirmatory from exploratory analyses and prevents scholars from Hypothesizing After Results are Known (HARKing). Researchers can of course still present analyses that were selected after examining the data in this framework, but they must disclose which results were not predicted in advance.
DH: What’s funny to me is that, in part, what you are advocating is what we already pretend to do. That is, our articles invariably have a theoretical section which develops hypotheses and then an empirical section which tests them. But in practice, there’s no guarantee that people don’t revisit the hypotheses after knowing the results — and I’ve had some reviewers propose just that.
So it sounds like you want to bring actual practice more in line with the deductive ideal in which we test previously stated hypotheses, is that right? Can preregistration solve these kinds of problems on its own?
BN: Actually, I’m skeptical that preregistration itself — which is starting to come into wider use in development economics as well as experimental political science and psychology — is a solution. As I argued in a white paper for the American Political Science Association Task Force on Public Engagement, it is still too easy for publication bias to creep into decisions by authors to submit papers to journals as well as evaluations by reviewers and editors after results are known. We’ve seen this problem with clinical trials, where selective and inaccurate reporting persists even though preregistration is mandatory.
I think a better approach is to offer a publishing option in which journals would consider accepting some articles in principle before the results were known based on peer review of the design and analysis plan. Such an approach, which has been formalized by the Registered Reports movement (of which I am a part), would better align author and journal incentives with our goals as scientists.
Authors would be encouraged to submit articles for which the results would be informative about a question of interest to the field regardless of the outcome, while reviewers and editors would be forced to evaluate the value of a design independent of the empirical result. The result should be fewer false positives and more replicable findings. (For more details, see my white paper or the Registered Reports FAQ.)
DH: Gotcha. That raises a whole set of questions, but let me start on the practical side. Let’s call the review of articles before the outcome is known “preacceptance,” to set it apart from the related question of preregistering studies. Do you see preacceptance as valuable primarily (or even exclusively) in experimental research, or do you envision trying to instill a norm of preregistration in observational research as well?
BN: I think the preacceptance format is likely to be most influential and widely used for experiments, which are frequently used to test hypotheses specified in advance. But in principle there’s no reason that it couldn’t be used with observational data when researchers have strong theoretical expectations. Political scientists are already starting to preregister designs for observational studies before the data have been observed or analyzed—why couldn’t journals preaccept those articles on the same basis?
Also, before we go further, let me briefly clarify a few points that are often misinterpreted:
First, I’m recommending that journals offer preacceptance as an option for authors, not as a requirement. No one is proposing that this approach would be the only acceptable way to conduct research.
Second, I’m not suggesting that scholars be prevented from analyzing the data in ways that they did not initially anticipate. The format would instead distinguish more clearly between confirmatory and exploratory findings while putting the burden of proof on authors to justify deviations from analysis plans, which creates a disincentive against using specification searches to try to obtain statistically significant results.
Third, while studies submitted in this format that pass peer review would be accepted in principle before the data were collected and analyzed, the final write-up would still be peer-reviewed to ensure that it met the field’s standards for writing quality, data analysis, etc.
DH: In advocating for preacceptance, you mention a “crisis of publication bias and non-replicable research.” Tell me more — what’s the empirical evidence that there is indeed a crisis, and that an important factor is reviewers’ unwillingness to publish results that don’t meet the threshold for statistical significance?
If the problem is that research findings won’t replicate, maybe the right solution is to put more emphasis on replication. We could increasingly expect researchers to provide multiple, independent tests of a claim before publishing work. Or we could look more favorably on projects that attempt to replicate findings. Right now, the almost universal advice you seem to get in the field is that it’s hard to publish studies that are “just” replications.
BN: At this point, I think the evidence of pervasive publication bias and a lack of replicable findings is overwhelming across the social and natural sciences. As many political psychology researchers know, the field of social psychology has been enmeshed in controversy after the pervasiveness of questionable research practices became clear and many prominent findings in the field failed to replicate. Similarly, John Ioannidis and others have shown that even articles in the world’s most prestigious scientific and medical journals frequently fail to replicate or show reduced effect sizes in subsequent research.
These problems extend to our own field. For instance, Neil Malhotra at Stanford’s Graduate School of Business and his collaborators have published numerous studies documenting a disproportionate number of narrowly statistically significant findings in our own field as well as in sociology. In their most recent study, he and his coauthors find that this pattern is driven by researchers failing to write up and submit null results.
My view is that authors anticipate how difficult it will be to publish null results given reviewer and editor demands for statistical significance (and the post hoc reasoning that is often used to question those findings after the fact) and thus choose not to invest time in trying to publish them. That’s why we need to incentivize scholars to write up and submit their studies by reserving journal space for the most important research before the results are known.
As far as what else we could do, I’m of course a big supporter of replication in all of its forms (for instance, I provide replication data and code for all of my published articles and support efforts by journals like the American Journal of Political Science to mandate that all authors do the same). I’m skeptical, though, that “expect[ing] researchers to provide multiple, independent tests of a claim before publishing work” is a realistic approach given the incentives scholars face. That’s what psychology did for many years and the result was a culture of tweaking data to find significant results and other questionable research practices such as suppressing unsuccessful replications.
It’s also dangerous because studies can fail to replicate for many reasons, including random chance. A failed replication doesn’t mean an author is wrong or that the original study is invalid. I’d prefer to accumulate preregistered studies that are published regardless of their outcomes and try to build a less contaminated publication record over time, preferably in shorter article formats.
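The role of chance here can be made concrete with simple arithmetic (the 80% power figure below is an illustrative assumption, not a number from the conversation):

```python
# Illustrative arithmetic (assumed numbers): even when the original effect
# is real, exact replications "fail" at a predictable rate by chance alone.
POWER = 0.80  # assumed statistical power of each exact replication

p_one_fails = 1 - POWER            # a single replication is nonsignificant
p_any_of_three_fails = 1 - POWER ** 3  # at least one of three fails

# With 80% power, one replication in five comes up nonsignificant, and the
# chance that at least one of three independent replications fails is 48.8%.
```

Under these assumed numbers, a single failed replication is weak evidence against the original finding, which is why accumulating preregistered studies matters more than any one replication attempt.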
Finally, I’m all in favor of “look[ing] more favorably on projects that attempt to replicate findings” but don’t know how to turn that sentiment into concrete changes in how articles are published. Most of our journals use article formats that are inappropriately long for direct replications, making them a poor fit. In addition, editors are encouraged to maximize the impact factors of their journals, which creates a disincentive for them to publish replications—a type of article that tends to generate controversy (for unsuccessful replications) and relatively few citations (especially for successful replications). Still, I welcome efforts by new journals like the Journal of Experimental Political Science and Research & Politics to publish shorter articles, including more replications.
Let me stop there and turn it over to you. What do you think about this approach?
DH: I don’t mean to pick on social psychology, and I certainly don’t mean to suggest that there is no problem of p-hacking in political science. But several things make me think that preacceptance would actually have less of an impact on experimental political science than on either observational work in our field or experimental work in other fields.
For one thing, both field experiments and many of our survey experiments in political science are very costly in terms of researchers’ time and money, meaning that they can be difficult and expensive to run multiple times. That of course has statistical implications, but let me focus here on the practical consequences.
In the survey experiments I’ve done, I have typically only been able to afford a small number of questions, and so my core hypotheses are in essence inscribed in my questionnaire and my decisions about what to ask in what order. Put differently, researchers put great care into designing their experiments to afford maximum leverage on the hypothesis of interest—and those of us who are outside reviewers can identify the relevant hypotheses pretty easily from the design of the survey instrument or the experiment in many cases.
Second, as a related point, I think that some of the common arguments for preacceptance are grounded in null hypothesis significance testing, and its—perhaps overstated—fear of multiple comparisons. (In a nutshell, the concern about multiple comparisons is that the more tests you run, the more likely you are to get what seems like a significant result by chance alone—unless you adjust your threshold for significance accordingly.) We’re all taught the Bonferroni correction when doing multiple statistical tests, and as a result, we have a habit of thinking of each new statistical test as an independent chance to reach the potentially false conclusion of a non-zero effect. I’ve already mentioned that our surveys tend not to have too many dependent variables, which can mitigate this problem somewhat. But another point is that our dependent variables tend to be correlated, frequently highly so—a fact which makes Bonferroni tests too conservative.
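Both halves of this point can be checked with a small simulation (a sketch of my own, not from the conversation; the choices of K = 10 outcomes, α = 0.05, and a correlation of 0.8 are illustrative assumptions). Under the null, p-values are uniform, so testing many outcomes inflates the family-wise error rate; Bonferroni restores it for independent tests but overcorrects when the outcomes are highly correlated:

```python
# Illustrative simulation (assumed parameters): family-wise false positive
# rates under multiple comparisons, with and without Bonferroni correction.
import math
import random

random.seed(42)

ALPHA = 0.05   # per-test significance threshold
K = 10         # number of outcome variables tested (assumed)
N_SIMS = 20_000

# Case 1: K independent null tests. Under the null, p-values are Uniform(0, 1).
naive_fp = bonf_fp = 0
for _ in range(N_SIMS):
    pvals = [random.random() for _ in range(K)]
    if min(pvals) < ALPHA:
        naive_fp += 1
    if min(pvals) < ALPHA / K:  # Bonferroni-corrected threshold
        bonf_fp += 1
naive_rate = naive_fp / N_SIMS  # roughly 1 - 0.95**10, i.e. about 0.40
bonf_rate = bonf_fp / N_SIMS    # restored to roughly 0.05

# Case 2: highly correlated outcomes. The K z statistics share a common
# component, mimicking correlated dependent variables.
RHO = 0.8  # assumed correlation between test statistics
corr_naive_fp = corr_bonf_fp = 0
for _ in range(N_SIMS):
    shared = random.gauss(0, 1)
    zs = [math.sqrt(RHO) * shared + math.sqrt(1 - RHO) * random.gauss(0, 1)
          for _ in range(K)]
    # Two-sided p-value: p = 2 * (1 - Phi(|z|)) = 1 - erf(|z| / sqrt(2))
    pvals = [1 - math.erf(abs(z) / math.sqrt(2)) for z in zs]
    if min(pvals) < ALPHA:
        corr_naive_fp += 1
    if min(pvals) < ALPHA / K:
        corr_bonf_fp += 1
corr_naive_rate = corr_naive_fp / N_SIMS  # well below the independent case
corr_bonf_rate = corr_bonf_fp / N_SIMS    # well below 0.05: too conservative
```

The correlated case is the one DH describes: with highly correlated dependent variables, the effective number of independent tests is much smaller than K, so the Bonferroni threshold is stricter than it needs to be.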
OK, you might say — but what if we are worried about researchers splitting their samples to find a subgroup in which a statistically significant result obtains? Well, one advantage of an accumulated body of prior research is that it makes cases of suspected p-hacking stand out. If there is no strong theoretical reason to expect that gender moderates a treatment and no prior results showing similar moderating effects, our concern about p-hacking should go up. Here, too, testing on other samples is key.
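The subgroup worry can also be quantified with a quick simulation (again a sketch of my own; the sample size, number of covariates, and thresholds are illustrative assumptions). With a treatment that truly does nothing, searching across subgroups defined by a handful of binary covariates turns up a “significant” effect far more often than the nominal 5%:

```python
# Illustrative simulation (assumed parameters): under a true null effect,
# searching across subgroups for significance succeeds much more than 5%
# of the time.
import math
import random

random.seed(7)

N = 400           # subjects per simulated experiment (assumed)
N_COVARIATES = 6  # binary covariates used to split the sample (assumed)
Z_CRIT = 1.96     # two-sided 5% critical value
N_SIMS = 500

def diff_in_means_z(treat, ctrl):
    """Normal-approximation z statistic for a difference in means."""
    nt, nc = len(treat), len(ctrl)
    mt, mc = sum(treat) / nt, sum(ctrl) / nc
    vt = sum((y - mt) ** 2 for y in treat) / (nt - 1)
    vc = sum((y - mc) ** 2 for y in ctrl) / (nc - 1)
    return (mt - mc) / math.sqrt(vt / nt + vc / nc)

false_hits = 0
for _ in range(N_SIMS):
    treated = [random.random() < 0.5 for _ in range(N)]
    outcome = [random.gauss(0, 1) for _ in range(N)]  # treatment does nothing
    covs = [[random.random() < 0.5 for _ in range(N)]
            for _ in range(N_COVARIATES)]
    # Test the treatment effect within every subgroup of every covariate.
    if any(
        abs(diff_in_means_z(
            [outcome[i] for i in range(N) if cov[i] == level and treated[i]],
            [outcome[i] for i in range(N) if cov[i] == level and not treated[i]],
        )) > Z_CRIT
        for cov in covs for level in (True, False)
    ):
        false_hits += 1

rate = false_hits / N_SIMS  # far above the nominal 5% false positive rate
```

This is why an unexpected moderator with no theoretical rationale and no precedent in prior samples should raise suspicion, and why confirmation in other samples is key.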
That all said, I agree with the vast majority of your observations. I’d rather have a research literature grounded in the estimation of substantive effects and their uncertainty, regardless of whether the corresponding p-value happens to be 0.04 or 0.06. With observational research, I worry a great deal about the models we didn’t get to see — at times, I wish that people would include a footnote saying just how many models they ran overall.
But if I were institutionalizing changes to our journals to address these issues, I’d first want to see guaranteed space for replications (conceived of broadly to include independent tests with new data). Our journals are a reflection of our limited attention—and I do think that the results themselves are one factor in determining what should command that attention. The final thing I’ll say: we’re scientists, and I’d be totally happy to see an experiment in which a few journals implemented a preacceptance track.
BN: Great — I don’t think we really disagree. I agree that experimental design puts some constraints on p-hacking in expensive nationally representative survey experiments and field experiments, but even in those cases we often have multiple potential outcome measures and multiple subgroups of potential interest where a plausible account of moderation could be constructed. And it’s easier than ever to mine experimental data from low-cost sources like Mechanical Turk where these constraints are much less likely to be binding.
Also, the high costs of many survey and field experiments are precisely why we need preaccepted articles. Even if p-hacking is less of a concern in these cases, the social cost of such findings being buried in the file drawer is arguably higher given the greater statistical power, more representative samples, and more extensive effort and resources devoted to these types of studies.
And as I said above, I’d love to see more preregistered observational data analysis. Even if most studies are not handled in this fashion, the contrast effect can be instructive as well—when we encounter a surprising finding that was not preregistered, it can be more clearly treated as exploratory, especially if it comes from the widely used public datasets that are most likely to generate spurious findings due to repeated hypothesis testing and publication bias.
In the end, as you say, these are ultimately empirical questions. We need to try new approaches in our journals and see what works. I hope editors are willing to consider experimenting with preacceptance.