The Washington Post

How people who should know better abuse math to bolster the ‘election fraud’ lie

A worker stands behind the deli counter at the Eastern District convenience store in the Park Slope neighborhood of Brooklyn in New York on March 26. (Amir Hamja/Bloomberg News)
Imagine that you run a small deli. Over the years, you’ve noticed that there’s a rhythm to when you sell sandwiches during the week, with a peak generally landing about 1:30 p.m. or so and then fading over the next few hours.

Curious about whether you can precisely predict daily sales, allowing you to manage supplies and staffing, you decide you’re going to track sandwich sales for a week. The result looks like this.

You’re not a math whiz, so you pass the data to your cousin Fritz, who has a PhD. Your question to Fritz is simple: How many sandwiches should you expect to sell each hour of a weekday?

In short order, he tells you something unexpected. The data you gave him isn’t sandwich sales at all. Instead, it’s phony data, derived from an algorithm aimed at masking deli fraud. And he can prove it.

See, if you take the sales from two days — say, Monday and Tuesday — and average the values, you can then create a sixth-order polynomial that describes the hourly pattern. Fritz’s PhD allows him to do the math himself, he assures you, but he passes along a formula that he derived from Excel. There it is: the precise formula for determining how many sandwiches (the y value) you will sell each hour (the x value). That’s math, hard at work, way beyond your ken — but precise.

But wait! If you take that same formula and compare it to the sales each day of the week, something alarming happens. The formula predicts the number of sandwiches being sold very well. Suspiciously well. If you compute the R-value of the correlation between each hour’s sales and the formula’s prediction, you get numbers that are very close to 1, meaning a nearly perfect correlation. And in a human-based system like sandwich sales, that shouldn’t happen!

Below, we used the average to do the R-value calculation, but you get the point. For each day, the predicted sales (here, the average) are extremely close to perfectly correlated with the actual sales. Ergo: This could be a function only of a computer-based effort to forge sales data.

You find this surprising for quite a few reasons. The first is that you tallied the sales yourself, so you know they’re correct. The second is that, even if Fritz were right that the numbers were artificial, why does he assume there’s some deli-fraud algorithm out there that’s responsible? The third is that, even without a PhD, you see a problem with Fritz’s analysis. He’s comparing an average derived from two of the values with all five of the values. Doesn’t it seem obvious that the result would be a strong correlation?

The answer, of course, is yes. Being surprised that sandwich sales over the course of the day are correlated with an average of the number of sandwiches sold over the course of two days is like being surprised that a coin comes up heads about half the time you flip it.
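To make that concrete, here is a minimal sketch in Python of the deli example. Everything in it is an assumption made for illustration: the hourly pattern, the size of the random noise and the function names are all hypothetical, not the data behind the charts in this article. It builds five days of sales from one shared lunch-rush pattern plus a little randomness, averages Monday and Tuesday into a "key," and then measures how well that key correlates with each day.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def simulate_week(seed=0):
    """Five weekdays of hourly sales: one shared pattern plus small random noise."""
    rng = random.Random(seed)
    # Hypothetical sandwiches sold per hour, peaking in the early afternoon.
    pattern = [4, 9, 18, 30, 24, 14, 8, 5, 3]
    days = ["Mon", "Tue", "Wed", "Thu", "Fri"]
    # Each day differs from the pattern by at most two sandwiches per hour.
    return {day: [p + rng.randint(-2, 2) for p in pattern] for day in days}


def correlations_with_two_day_average(week):
    """Correlate every day's sales with the Monday-Tuesday average."""
    key = [(m + t) / 2 for m, t in zip(week["Mon"], week["Tue"])]
    return {day: pearson_r(key, sales) for day, sales in week.items()}


if __name__ == "__main__":
    for day, r in correlations_with_two_day_average(simulate_week()).items():
        print(f"{day}: r = {r:.3f}")
```

Under these assumptions, every day correlates strongly with the key, including the three days that were never part of the average. No fraud algorithm is involved. The days simply share a pattern, and the key was built from that same pattern.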

Or, more to the point, like being surprised that an estimate of voter turnout based on four counties in Michigan correlates strongly to voter turnout in nine counties in Michigan — including the four used to generate the “sixth-degree polynomial” (that complicated formula) in the first place.

This, however, is what the analysis of Douglas Frank, PhD, offers. Frank’s analysis of voter data in Michigan has led him to determine with seeming authority that the election results in that state were rigged, tailored to match the precise formula he himself derived from the state’s results. Claims like Frank’s analysis of Michigan have earned him the attention of MyPillow chief executive Mike Lindell, whose efforts to prove that voter fraud occurred in 2020 have led him to elevate all sorts of unfounded allegations about last year’s presidential election. Frank’s analysis has convinced others, too, with the conservative polling firm Rasmussen Reports elevating a write-up of his allegations over the weekend.

Rasmussen highlighted a different part of Frank’s assessment, the idea that about 66,000 Michigan voters cast ballots in last year’s election who weren’t in voter rolls in October. As The Washington Post’s Lenny Bronner quickly pointed out, Michigan has same-day voter registration, so those 66,000 voters are almost certainly just people who weren’t registered in October but who registered afterward and voted.

The firm, which consistently showed more favorable approval data for Donald Trump over the course of his presidency than other pollsters, has repeatedly elevated dubious and unfounded fraud claims over the past few months. That’s aligned with a broader shift in its public-facing presence to be more aggressive toward critics from the mainstream media. (Last year, it accused me of “republishing a defamatory falsehood [and] committing fraud” for pointing out that its 2018 general-election polling showed Republicans with a one-point lead over the Democrats in an election where Democrats won more votes in House races nationally by a nearly 10-point margin.) Responding to Bronner’s tweet, Rasmussen offered the equivalent of a “just asking questions” shrug.

It should know better than to take Frank’s analysis at face value. This is a polling firm, after all, a company whose business is statistical analysis. Yet, there it was, sharing Frank’s claims uncritically.

Frank has been working with an attorney named Matthew DePerno, who has been sharing graphs from Frank’s presentation on Twitter with a bit of colorful commentary.

So what do those graphs show? What our third sandwich chart shows: that a prediction of how many votes would be cast in a Michigan county by age derived from the number of votes cast in a Michigan county by age correlates with the number of votes cast in a Michigan county by age. Frank does a lot of hand-waving on the side, such as pointing to that discrepancy between the October voter roll and votes cast, and noting that Census Bureau population estimates, which appear to be five-year averages of the population from 2015 to 2019, are lower than the number of registered voters in some places. (Frank does point out that this could be a function of outdated voter rolls, but he doesn’t dwell on it.)

The heart of his analysis, though, is that R-value correlation between his predicted turnout and the actual turnout. How did he generate his prediction?

“What I actually did is I averaged four counties, the four largest counties, and used that key to predict all nine,” he explains. A few seconds later, he marvels that “the accuracy of my prediction is just ridiculously good. It shouldn’t be that good.”

Well, it should, because you are predicting data based on the data itself. If it weren’t a really close correlation, that’s when things would get funky.

Incidentally, that Frank is using a “sixth-order polynomial” doesn’t mean he’s doing some incredibly complicated calculation. It just means that he’s trying to fit his prediction as closely as possible to the existing data, thereby increasing the correlations.
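A short Python sketch shows why a higher-order polynomial is not a sign of sophistication. The hourly values are hypothetical, and the hand-rolled least-squares routine is an assumption standing in for whatever Excel's trendline feature does internally; it is not Frank's actual method. The sketch fits polynomials of increasing degree to the same nine data points and records how much error is left over each time.

```python
def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial fit; returns coefficients, lowest power first."""
    n = degree + 1
    # Normal equations A c = b, with A[i][j] = sum(x^(i+j)), b[i] = sum(y * x^i).
    a = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for k in range(col, n):
                a[row][k] -= factor * a[col][k]
            b[row] -= factor * b[col]
    # Back substitution.
    coeffs = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(a[i][j] * coeffs[j] for j in range(i + 1, n))
        coeffs[i] = (b[i] - tail) / a[i][i]
    return coeffs


def residual_sum_of_squares(xs, ys, coeffs):
    """Total squared gap between the data and the fitted curve."""
    def predict(x):
        return sum(c * x ** k for k, c in enumerate(coeffs))
    return sum((y - predict(x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical hourly sandwich sales; hours rescaled to [-1, 1] for stability.
sales = [4, 9, 18, 30, 24, 14, 8, 5, 3]
xs = [(hour - 4) / 4 for hour in range(len(sales))]

rss_by_degree = {
    degree: residual_sum_of_squares(xs, sales, fit_polynomial(xs, sales, degree))
    for degree in (1, 2, 4, 6)
}

if __name__ == "__main__":
    for degree, rss in rss_by_degree.items():
        print(f"degree {degree}: leftover squared error = {rss:.2f}")
```

Each added term gives the curve more freedom to bend toward the points it was fitted to, so the leftover error can only shrink or stay the same as the degree rises. A tight fit from a sixth-order polynomial says a lot about the flexibility of the curve and very little about the data.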

He does note that the fit between his prediction and the actual turnout isn’t quite perfect.

“There are a few little wiggles that don’t perfectly line up, but that’s not unusual because, after all, we’re dealing with human behavior,” Frank says at one point. “But for me to be able to predict that that well, you know there’s an algorithm function.”

There are certain words I’m not allowed to use when writing for The Post, so I will describe that as “baloney.” First of all, his claim is that this isn’t human behavior, so he can’t use that as a rationalization. The deviation from his prediction is a function of his using an average of values, nothing more. And, yes, you do know that there’s an algorithm function: the one he made!

Even if he’d uncovered some weird pattern, that of course doesn’t mean that fraud occurred. This is what’s known as begging the question: He’s assuming that fraud exists and is using this purported weirdness to support that assumption. If there had been something odd about his data, one could also assume that, say, the data had some error in it. But that’s not what he set out to prove.

All of this assumes, of course, that there are common voting patterns by age in the same way that there are common sandwich-ordering patterns in our initial example. (Which, by the way, was simply applying a small randomization to a pattern in Excel.) But we know that there are common patterns in how people vote depending on how old they are. Six years ago, I wrote about the turnout curve in California, creating a graph that looks not entirely unlike Frank’s “key.” I did not prove that elections in California were riddled with fraud.

A lot of people won’t know enough to recognize Frank’s assessments for what they are. Lindell doesn’t, it seems, nor do a lot of other people who, like Frank, are eager to assume that some fraud occurred. Rasmussen should and perhaps does, but it shared the analysis anyway.

There remains no credible evidence at all that anything untoward occurred in the 2020 election. Even if Frank’s analysis were not obvious question-begging, there’s no evidence of any effort to do the sort of rigging that he alleges. It’s the mathematical equivalent of prestidigitation, aimed at masking an empty argument with complexity.

A good show, but easy to explain as a trick.