The Presidential Advisory Commission on Election Integrity, led by the vice president, has gotten considerable attention for requesting voter registration information (including names, birthdays and Social Security numbers) from each state.
Presumably, the commission will use the names and birthdays in these lists to identify potential duplicate registration records between states. That’s the method used by the Interstate Crosscheck Program pioneered by Kansas Secretary of State Kris Kobach, a co-chair of the commission, which helps states identify voters who have moved to a new state by flagging potential duplicate registration records. In 2012, Crosscheck identified more than 1.4 million potential duplicate registrations.
Recent academic work, however, found that for every 200 registrations flagged using Crosscheck’s methodology, at least 199 were false matches in which the middle names or Social Security numbers did not line up. In other words, no more than one-half of one percent of the flagged registrations were true matches.
And the advisory commission will have considerably less information than is made available to Crosscheck. Twenty-two states have refused to comply with the commission’s request. Seventeen other states, such as Colorado, have said that they will provide only the publicly available voter information and omit confidential information like Social Security numbers or month and day of birth. Even more challenging for the commission will be states that only indicate the ages of voters in 10- or 20-year bins.
Here’s how we did our research.
We tried to assess how much higher the rate of false positives would be if the commission used this limited data to perform a Crosscheck-style match. To do this, we used a commercial data set of the voter registration records for the whole country. We used this file to construct a data set in which we know there are exactly zero duplicates because, by construction, each registration record is a unique combination of the voter’s first and last name, and birth day, month, and year.
We then compared this data set of more than 150 million registered voters to itself, each time withholding at least one column of data, to evaluate what false match rates we’d get with incomplete information, as we explain below.
As you can see, even when we know that there are no true duplicates in the data, each time we eliminate one more bit of information from our matchup, the rate of false “duplicates” that we find rises dramatically.
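The comparison described above can be sketched in a few lines of Python. This is only a minimal illustration on invented toy records, not the code or data behind the figures reported here; the field layout and the names `VOTERS`, `false_match_rate` and the four key functions are assumptions made for the example.

```python
from collections import Counter

# Toy records (hypothetical): (first, last, birth_year, birth_month, birth_day).
# Every record is unique on all five fields, so there are no true duplicates.
VOTERS = [
    ("Maria", "Rodriguez", 1970, 3, 14),
    ("Maria", "Rodriguez", 1970, 3, 2),
    ("Maria", "Rodriguez", 1970, 9, 30),
    ("Maria", "Rodriguez", 1975, 1, 5),
    ("John", "Smith", 1982, 6, 21),
    ("John", "Smith", 1988, 6, 21),
    ("Ana", "Lopez", 1990, 12, 1),
]

# Each key function keeps a coarser slice of the birthday.
def full_key(v):
    return v  # name plus full date of birth

def month_year_key(v):
    return (v[0], v[1], v[2], v[3])  # drop the day of birth

def year_key(v):
    return (v[0], v[1], v[2])  # drop the month and day

def decade_key(v):
    return (v[0], v[1], v[2] // 10 * 10)  # keep only the decade

def false_match_rate(voters, key):
    """Share of voters whose key collides with at least one other voter.

    Because the records contain no true duplicates, every collision
    counted here is a false match.
    """
    counts = Counter(key(v) for v in voters)
    matched = sum(1 for v in voters if counts[key(v)] > 1)
    return matched / len(voters)

for name, key in [("full birthday", full_key),
                  ("month and year", month_year_key),
                  ("year only", year_key),
                  ("decade only", decade_key)]:
    print(f"{name}: {false_match_rate(VOTERS, key):.1%}")
```

On these seven toy records, the false match rate is zero with full birthdays and rises at each step as the birthday is coarsened, which is the same pattern, in miniature, that the national file shows.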
There’s probably another one of you out there somewhere.
If we choose a registered voter at random, there is a 13.6 percent chance that she will be erroneously matched to at least one other voter somewhere in the country who shares her full name, birth month and birth year. In states that only provide data about birth years (but not months and days), this probability jumps to 36.2 percent. For states that only indicate a voter’s decade of birth, the probability of a false match for any given voter is 58.6 percent.
As you can also see in the table, there is a high chance of matching a voter to more than 10 other unique voters. One out of every 100 registered voters has at least 10 namesakes born in the same month and year. Similarly, there is a one in three chance that at least 10 people with your first and last name were born in the same decade as you.
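Counting voters with many namesakes works the same way as counting voters with one. The sketch below is a hedged illustration, again on made-up data: the function name `share_with_at_least` and the sample keys are hypothetical, and each key stands in for whatever fields a state actually releases (here, a full name and a birth decade).

```python
from collections import Counter

def share_with_at_least(keys, k):
    """Fraction of records whose key is shared by at least k OTHER records."""
    counts = Counter(keys)
    hits = sum(1 for key in keys if counts[key] - 1 >= k)
    return hits / len(keys)

# Hypothetical (full name, birth decade) keys, one per registered voter.
sample_keys = (
    [("Maria Rodriguez", 1970)] * 4
    + [("John Smith", 1980)] * 2
    + [("Ana Lopez", 1990)]
)

print(share_with_at_least(sample_keys, 1))  # voters with at least 1 namesake
print(share_with_at_least(sample_keys, 3))  # voters with at least 3 namesakes
```

Setting `k` to 10 and running the same count over real registration keys would give the kind of "at least 10 namesakes" share discussed above.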
These probabilities may seem high, but they reflect a real overlap in voter names and birthdays. In our data, we find 1.02 million voters who share just 100 common first and last name combinations. And 4,333 name combinations are shared by at least 1,000 people each.
How many people could possibly have the same name?
Consider the 12,553 registered voters named “Maria Rodriguez,” the 17th most common name in our data. Although each has a unique birthday, every Maria Rodriguez born between 1918 and 1999 shares a name and birth year with at least one other registered voter. Applying Crosscheck’s methodology of duplicate registration identification could cause all 12,553 people with this name to be flagged as possible duplicates, even though exactly zero are.
Had we performed these searches using each voter’s name and full birthday, we would have concluded that there were zero duplicate records in our data set. But working with registration records that lack essential details, as the commission might, could cause us to draw wildly inaccurate conclusions about the potential for voter fraud.
This could create headaches for the 143 registered voters between the ages of 20 and 79 who go by the name Mike Pence.
Stephen Pettigrew is a research and data consultant with the MIT Election Data and Science Lab. Follow him on Twitter @pettigrew_stats.
Mayya Komisarchik is a PhD candidate in government at Harvard University who works with the MIT Election Data and Science Lab. Find her on Twitter @MayyaKomis.