A few years ago I wrote about a paper by Andrew Healy, Neil Malhotra and Cecilia Mo claiming that a win by the local college football team could improve incumbent vote share in the presidential election by up to two percentage points.

At the time, I wrote:

I took a look at the study (I felt obliged to, as it combined two of my interests) and it seemed reasonable to me. There certainly could be some big selection bias going on that the authors (and I) didn’t think of, but I saw no obvious problems. So for now I’ll take their result at face value and will assume a 2 percentage-point effect.

But then I did some calculations and estimated that, even assuming the effect was real and as large as reported, once you take account of all the different games in different parts of a state, the total effect would be much, much smaller:

There are multiple games in multiple weeks in several states, each of which, according to the analysis, operates on the county level and would have at most a 0.2% effect in any state. So there’s no reason to believe that any single game would have a big effect, and any effects there are would be averaged over many games.

Still, I found the results “disturbing,” even if the effects on the election would be much tinier than might appear from a naive interpretation of the point estimates.
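The dilution argument above can be sketched with a toy simulation. The specific numbers here are illustrative assumptions of mine, not from the paper: I assume a team's home county casts about 10% of a state's vote and that ten teams per state win or lose independently.

```python
import random

# Toy illustration (assumed numbers, not from the paper): a 2-point swing in
# one county dilutes when that county is only a fraction of the state
# electorate, and swings from many independent games partially cancel.
random.seed(0)

county_effect = 2.0   # percentage points: the paper's headline estimate
county_share = 0.10   # assumed: one team's home county casts 10% of state votes
n_teams = 10          # assumed: ten relevant teams per state

# Single game, single county: effect on the statewide margin
statewide_per_game = county_effect * county_share
print(statewide_per_game)  # 0.2 points, the order of magnitude quoted above

# Many games: wins (+) and losses (-) roughly cancel in expectation
def simulated_state_effect():
    return sum(random.choice([+1, -1]) * statewide_per_game
               for _ in range(n_teams))

sims = [simulated_state_effect() for _ in range(100_000)]
print(round(sum(sims) / len(sims), 3))  # close to zero: effects average out
```

The point of the sketch is just the two mechanisms in the quoted passage: a county-level effect shrinks by the county's share of the state electorate, and independent game outcomes average toward zero.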

But wait, there’s more!

Now we have an update: a paper by Anthony Fowler and B. Pablo Montagnes, provocatively titled "College football, elections, and false-positive results in observational research," with this summary:

We reassess the evidence and conclude that there is likely no such effect, despite the fact that Healy et al. followed the best practices in social science and used a credible research design. Multiple independent sources of evidence suggest that the original finding was spurious—reflecting bad luck for researchers rather than a shortcoming of American voters.

How did this happen? Fowler gives the background:

We might worry that this surprising result is a false positive, arising from some combination of multiple testing (within and across research teams), specification searching, and bad luck. It’s the kind of empirical test where a positive result gets published in a respected journal and covered by the press, but a negative result is filed away and forgotten. Therefore, Pablo and I sought to reassess this finding in order to test whether it reflects a genuine phenomenon or a chance false positive.

That is a good point. When that study came out, I’d taken it at face value because I didn’t see any obvious flaws. But since then, I’ve become more sensitive to how published results can be contingent on a variety of decisions that researchers must make. Even when researchers are scrupulous and careful, these published results may not actually command much confidence — but, of course, the popular press will often hype those results nonetheless.

Fowler continues:

It’s not an experimental result, so we can’t conduct an independent replication. Therefore, we proceed by testing additional, independent hypotheses that should hold if football games and mood indeed influence elections. For example, we find that the estimated effect of college football games is actually greater for counties that care less about college football, just as great even when the incumbent does not run for reelection, and just as great in other parts of the state as in the home county of the team. We also find no effect of NFL games on elections, despite the greater popularity of the NFL and similar patterns of regional support. As a result of multiple, independent tests, we conclude that the original finding was most likely a false positive.
We think our substantive results are important for our understanding of elections and voter competence. We also think the paper holds interesting lessons for many empirical social scientists. What should researchers do when they’re worried about a false positive result arising from an observational study where pure replication is impossible? We think our approach of testing additional, independent hypotheses may be a fruitful one in many cases.

To be clear, Healy, Malhotra and Mo were quite reasonable and restrained in their 2012 paper. The results were presented more dramatically (I’d say overdramatically) in the popular press, but the authors of the scientific paper did not themselves hype or overstate their findings. They did some analyses that maybe don’t stand up so well under careful reanalysis, but that can happen. As Fowler and Montagnes emphasize, we’d like to learn from this experience when doing and reporting on future studies.

Takeaway: Statistical significance does not equal certainty

So here is the take-home message when evaluating published research claims: Statistical significance is not a good enough reason to believe.

And the take-home message when thinking about politics is: Voters aren’t as “irrational and emotional” as is sometimes claimed.

P.S. More here, including responses from the authors of the papers discussed above.