Open data is all the rage in political science these days, and big data, well, big data is all the rage everywhere. What is perhaps less appreciated is the opportunity offered by the possibility of open little data, and nowhere is this more on display at the moment than on a topic near and dear to my heart: whether there will be a second player wearing a N.Y. Mets cap in the Hall of Fame in the near future.

Ryan Thibodaux, who tweets at @NotMrTibbs, has been assiduously gathering every shred of information about how people voting for candidates for membership in baseball’s Hall of Fame have cast their ballots. For those not familiar with this election, it has three important characteristics. First, to get elected to the baseball HoF, a candidate needs to appear on 75 percent (or more) of the ballots. Second, voters can list no more than 10 people on each ballot. Finally, “only active and honorary members of the Baseball Writers’ Association of America, who have been active baseball writers for at least ten (10) years, shall be eligible to vote.” This means we’re talking about an electorate of approximately 450 people, a perfect case of “little data.”

Ballots in this election can be cast secretly, but being sports writers, many of those who cast the ballots publicize their choices.  And here’s where @NotMrTibbs comes in: Not only does he track who has said they are voting for whom, but he has made this information publicly available in one place on the web.

However, we still need a little bit of social science to make sense of this data. Line 4 of the spreadsheet reports the percentage of public ballots on which the candidate has been named. The decision to publicize one’s ballot, though, is not random; if it were, we could simply use this number as a pretty good estimate of the support the candidate would enjoy overall in the voting (with a confidence interval around that estimate). But there are all sorts of reasons why some ballots are publicly announced and others are not. It may be, for example, that people who vote for unpopular candidates do not announce their choices. Or perhaps writers who vote disproportionately for hometown candidates are more likely (or less likely) to write about their choices. In a way, this is a similar problem to trying to figure out election returns from early voting in presidential elections.
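To make the "if it were random" benchmark concrete, here is a minimal sketch of the confidence interval one could put around the public-ballot share under the (unrealistic) assumption that revealed ballots are a simple random sample of the electorate. The function name and the ballot counts in the usage line are hypothetical; only the electorate of roughly 450 and the 86 percent figure come from this post. Because the electorate is so small, the sketch applies a finite population correction, which shrinks the interval as more ballots are revealed.

```python
import math

def public_ballot_interval(named, revealed, electorate, z=1.96):
    """Normal-approximation interval for a candidate's overall share.

    Assumes revealed ballots are a simple random sample of the
    electorate, which is exactly the assumption questioned in the text.
    """
    p = named / revealed
    # Finite population correction: matters when the sample is a
    # large fraction of a small electorate.
    fpc = math.sqrt((electorate - revealed) / (electorate - 1))
    se = math.sqrt(p * (1 - p) / revealed) * fpc
    return p, max(0.0, p - z * se), min(1.0, p + z * se)

# Hypothetical counts: 129 of 150 revealed ballots (86 percent),
# out of an electorate of about 450.
share, low, high = public_ballot_interval(129, 150, 450)
print(f"share={share:.3f}, interval=({low:.3f}, {high:.3f})")
```

The interval only describes sampling noise. It says nothing about the selection problem, which is the more serious concern here.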

Fortunately, @NotMrTibbs once again comes to the rescue: rows 8, 9, and 10 provide some potentially very useful information for estimating the nature and direction of the possible bias in an estimate based on the early ballots. Here we can see how the candidate did in the previous election on both private and public ballots. Turning to Mike Piazza, we find that he did perform better on the publicly announced ballots in 2015, and that the difference was a non-trivial 13 percent. Thus, if Piazza were only coming in a few points above 75 percent in the public ballots this year, with a majority of ballots not yet revealed, he might have significant reason to be worried. At 86 percent of public ballots, though, we have much better reason to think he might be safe.
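One rough way to turn that reasoning into arithmetic is sketched below. It is not the method used in the spreadsheet or anywhere else; it simply assumes that every ballot not yet revealed behaves like last year's private ballots, and it reads the 13 percent gap as 13 percentage points. The function name and the revealed fraction in the usage line are hypothetical, since the post says only that a majority of ballots had not yet been revealed.

```python
def projected_share(public_share, last_year_gap, revealed_fraction):
    """Blend the observed public share with a discounted private share.

    public_share:      share of revealed ballots naming the candidate
    last_year_gap:     last year's public share minus private share
                       (assumed here to be in percentage points / 100)
    revealed_fraction: fraction of all ballots revealed so far
    """
    if not 0.0 <= revealed_fraction <= 1.0:
        raise ValueError("revealed_fraction must be between 0 and 1")
    # Assumption: unrevealed ballots repeat last year's public/private gap.
    private_share = public_share - last_year_gap
    return (revealed_fraction * public_share
            + (1.0 - revealed_fraction) * private_share)

# Hypothetical: one third of ballots revealed so far.
estimate = projected_share(0.86, 0.13, 1 / 3)
print(f"projected overall share: {estimate:.3f}")
print("clears 75 percent" if estimate >= 0.75 else "falls short of 75 percent")
```

The projection is sensitive to the revealed fraction and to whether last year's gap carries over, so it is best treated as a sanity check rather than a forecast.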

The method is of course not perfect. There may be different sources of bias this year for all candidates, or for Piazza specifically. There may be different patterns that come into play as candidates get closer to securing 75 percent of the vote. We of course know much less about the sources of bias for first-time candidates on the ballot (although Ken Griffey Jr. is still looking pretty good!). Finally, any true Mets fan will always be paranoid until the last minute. Nevertheless, this remains a nice example of how open little data can provide useful information too, as well as a good opportunity to thank those who make such data available to the public.


Update 4:45 PM: Nathaniel Rakich has actually been trying to estimate the types of bias I described in this post at his blog Baseballot!  For what it is worth, he also is predicting Piazza gets in….