Did the outcome of voting for president in Wisconsin accurately reflect the intentions of the electors? Concerns have been raised about errors in vote counts produced using electronic technology — were machines hacked? — and a recount may occur.

Some reports involving statistical analysis of the results have been discussed in the media recently. These analyses, though, rely on data at the county level. Technology, demographics and other important characteristics of the electorate vary within counties, making it difficult to resolve conclusively whether voting technology (did voters cast paper or electronic ballots?) affected the final tabulation of the vote for president.

For this reason, I have examined ward-level data. Wards are the smallest aggregation unit at which vote counts are reported in Wisconsin, and many wards have fewer than 100 voters. My analysis, which relies on election forensics techniques designed to identify electoral fraud, reveals some reasons to be suspicious about vote patterns in Wisconsin. To be very clear, my analysis cannot prove whether fraud occurred, but it does suggest that it would be valuable to conduct an election audit to resolve such concerns definitively.

Problems with county-level data

It is problematic to rely on county-level data to assess whether there are systematic differences between using paper ballots and voting electronically.

The figure below shows the main problem: Different voters in a county often used different voting technologies. Many counties, in fact, used multiple technologies. Almost all used electronic vote-tabulation technology, and some used both direct-record electronic (DRE) and optical scanner (Opscan) technologies. (See this list of equipment used by each municipality.) Opscan technologies mark votes on paper but tabulate the votes electronically, whereas DRE technologies use electronic voting in a way that is like using an ATM.

In the figure, each horizontal line corresponds to a type of optical scanner technology, and each vertical line corresponds to a county. “None” for the Opscan type (the top row) reflects an unknown mix of DRE technologies and hand-tabulated paper ballots. The subsequent rows, in order, are: (2) Dominion (Premier)-Accuvote-OS, (3) Dominion (Premier)/Command Central-Accuvote-OS, (4) Dominion (Sequoia)- Sequoia Insight, (5) Dominion (Sequoia)/Command Central- Sequoia Insight, (6) Dominion ImageCast Evolution, (7) ES&S DS200, (8) ES&S M100, (9) Optech- Eagle, (10) Optech/Command Central- Eagle, (11) Optech/Command Central- Eagle, Dominion (Sequoia)/Command Central- Sequoia Insight.

A green dot appears when all of the voters in a county used the same kind of technology. Purple dots appear when the technologies used in a county are diverse: The most frequently used technologies are more blue, and the least frequently used are more red. In only 26 of the 72 counties were all votes recorded using the same kind of voting technology.

Ward-level election forensics

If we could obtain useful measures of ward-level attributes, such as the demographic characteristics of each ward or the voting histories of the voters in each ward, we could attempt regression-style analysis using ward observations.

Unfortunately, we lack such data.

But we can use the Election Forensics Toolkit (a website developed as part of a USAID-funded project) to look at features of the ward data to see how likely they are to occur by chance.

If these features occur more often than they should by chance alone, then it is possible that the election results were produced in some other way than by simply recording actual votes.

The table below shows the results of a number of these types of tests. In the table, a “Small” ward has fewer than 100 votes. The main takeaway point from this table is that all of the statistics that lead us to have concerns about “Small” wards come from wards that use some kind of Opscan technology.

Let’s start with the statistic labeled “LastC,” which is the mean of the last digits of the vote counts. At least for large vote counts, this article argues that each of the 10 possible last digits of vote counts should occur equally often, in which case the mean should be about 4.5. Other patterns may suggest the counts were manipulated.

In the small Opscan wards the last digits of vote counts for Trump and for Clinton have means (LastC) that are much less than 4.5. Each “confidence interval” for a given statistic gives a range of estimates we could have observed given variations in the data that might have occurred by chance. The two LastC intervals do not include 4.5, which is why the estimates are shown in red.
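To make the LastC idea concrete, here is a minimal sketch of how the statistic and a rough confidence interval might be computed. The function names and the vote counts are invented for illustration; the Election Forensics Toolkit's actual implementation may differ.

```python
import math

def last_digit_mean(counts):
    """Mean of the last digits of a list of vote counts (the LastC statistic)."""
    digits = [c % 10 for c in counts]
    return sum(digits) / len(digits)

def lastc_ci(counts, z=1.96):
    """Rough 95% confidence interval for LastC via a normal approximation.

    Under the null hypothesis that last digits are uniform on 0-9,
    the expected value of LastC is 4.5.
    """
    digits = [c % 10 for c in counts]
    n = len(digits)
    mean = sum(digits) / n
    var = sum((d - mean) ** 2 for d in digits) / (n - 1)
    half = z * math.sqrt(var / n)
    return mean - half, mean + half

# Illustrative ward-level vote counts (made up for demonstration).
counts = [134, 87, 212, 45, 98, 160, 73, 301, 59, 188]
print(round(last_digit_mean(counts), 2))  # 4.7
```

If 4.5 falls outside the interval returned by `lastc_ci`, the last digits depart from what uniform chance would produce, which is the kind of flag shown in red in the table.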

As this article points out, last-digit diagnostics have not been claimed to work when vote counts are small. One view is that we have no reason to expect any particular result for those statistics, so there is nothing to worry about.

Even so, it is worth noting that this issue arises only in small wards that use Opscan technologies. Small wards using other voting technologies do not exhibit these anomalies.

Another statistic (C05s) is the mean of a variable indicating whether the last digit of the vote count is zero or five. Based on the same rationale about digit frequencies as for LastC, C05s should be 0.2 if there are no problems. C05s being too large may mean that someone was sloppy and simply wrote down approximate numbers. C05s being too small might mean that someone was faking the numbers. (It has been found that 2 and 7 are favorite numbers for people trying to produce random numbers out of their heads.)

In the small Opscan wards, C05s for Clinton is too small, showing that vote counts for Clinton too rarely have a last digit of zero or five. Notably, this statistic is significantly too large if ward vote counts of zero are included.
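The C05s statistic is simple to compute. Here is a hedged sketch, with an invented function name and made-up counts; under uniform last digits its expectation is 0.2, since two of the ten possible digits are 0 or 5.

```python
def c05s(counts):
    """Mean of an indicator that a vote count's last digit is 0 or 5.

    Under uniform last digits the expected value is 2/10 = 0.2.
    """
    return sum(1 for c in counts if c % 10 in (0, 5)) / len(counts)

# Illustrative counts (made up): three of the ten end in 0 or 5.
counts = [134, 87, 212, 45, 98, 160, 73, 305, 59, 188]
print(c05s(counts))  # 0.3
```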

The P05s statistic, which is the mean of a variable indicating whether the last digit of the rounded percentage of a candidate’s votes is zero or five, is motivated by the idea that people who commit fraud want their efforts to be detectable so that they can claim credit. Such “signaling” frequently occurs in Russian elections. Like C05s, P05s should be 0.2 if no signaling is occurring, but larger values of P05s are concerning.

Votes in the small Opscan wards exhibit a “signaling” pattern (P05s).
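As a sketch of the P05s computation, the following assumes each ward supplies a candidate's vote count and the ward's total votes; the function name and the numbers are invented for illustration.

```python
def p05s(candidate_votes, total_votes):
    """Mean of an indicator that the last digit of a candidate's rounded
    vote percentage is 0 or 5 (the P05s statistic).

    Expected to be about 0.2 if no "signaling" is occurring; values
    well above 0.2 are the concerning pattern.
    """
    flags = []
    for v, t in zip(candidate_votes, total_votes):
        pct = round(100 * v / t)  # rounded percentage for this ward
        flags.append(1 if pct % 10 in (0, 5) else 0)
    return sum(flags) / len(flags)

# Illustrative ward data (made up): candidate votes and total votes.
cand = [30, 47, 66, 13, 52]
tot = [100, 94, 120, 60, 110]
print(p05s(cand, tot))  # 0.6 -- percentages 30, 50, 55, 22, 47
```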

Having vote percentages concentrated around more than one distinct value, which would mean the distribution of percentages is multimodal, is also a potential problem.  For instance, there might be a set of wards where a candidate received 30 percent of the votes and another cluster where the candidate received 60 percent.

In an elaborate model for election frauds, multimodality is an important indicator that one candidate is gaining fraudulent votes. We would have to know how many voters registered in each ward to be able to estimate that model.

DipT is the p-value from a test that there is no multimodality, a test we can do without having the data needed for the fancier model.

Vote percentages in the small Opscan wards are significantly multimodal.
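Hartigan's dip test, from which the DipT p-value comes, requires a specialized implementation (for example, the R `diptest` package). Purely to illustrate what "multimodal" means here, the following crude stand-in counts local maxima in a histogram of vote percentages; it is not the dip test, the function name is invented, and the data are made up.

```python
def histogram_modes(values, bins=10):
    """Crude multimodality check: count local maxima in a histogram.

    This is NOT Hartigan's dip test; it is only a rough illustration,
    and it is sensitive to the choice of bin count.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0
    counts = [0] * bins
    for v in values:
        i = min(int((v - lo) / width), bins - 1)
        counts[i] += 1
    modes = 0
    for i, c in enumerate(counts):
        left = counts[i - 1] if i > 0 else 0
        right = counts[i + 1] if i < bins - 1 else 0
        if c > left and c > right:
            modes += 1
    return modes

# Illustrative percentages (made up) clustered near 30 and 60, as in
# the two-cluster example above.
pcts = [29, 30, 31, 30, 28, 61, 59, 60, 62, 60]
print(histogram_modes(pcts))  # 2
```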

In contrast to the array of anomalies in the small wards with Opscan technology, none of the statistics in small wards without Opscan technology have values to worry about.

None of the statistics in “big” wards have values to worry about, although additional analysis shows that the big wards set is diverse: Some Opscan machines, particularly the Dominion (Sequoia)/Command Central- Sequoia Insight (in 209 big wards) and the Dominion ImageCast Evolution (in 272 big wards), exhibit anomalies.

Why do small wards with Opscan technology (and several other kinds of wards) have anomalies, and do the anomalies mean the reported vote counts do not accurately reflect the intentions of the electors? Given all the information we have, it is hard to say.

A rigorous post-election audit, such as those being pursued in several states, is not subject to the limitations that prevent a full regression-style analysis, nor to the interpretive uncertainty involved in using statistics like those from the Toolkit.

A crucial feature of an audit is that paper ballots are inspected directly by humans and not merely tabulated again by a machine, which can happen in a recount under some state recount procedures. An audit can tell us at least whether the votes marked on paper have been correctly tabulated by the machines.

A rigorous audit or a full recount that has humans manually checking the paper ballots can provide convincing evidence about who won the election. In the current environment, the reassurance such an audit may provide would contribute to the incoming government’s legitimacy.

Walter R. Mebane Jr. is a research associate at the Center for Political Studies, professor of political science and professor of statistics at the University of Michigan.