The Washington Post has published another NSA story, this time by Bart Gellman, Julie Tate, and Ashkan Soltani. Here’s the lede:
Ordinary Internet users, American and non-American alike, far outnumber legally targeted foreigners in the communications intercepted by the National Security Agency from U.S. digital networks, according to a four-month investigation by The Washington Post.
Nine of 10 account holders found in a large cache of intercepted conversations, which former NSA contractor Edward Snowden provided in full to The Post, were not the intended surveillance targets but were caught in a net the agency had cast for somebody else.
The story is built around the implied claim that 90% of NSA intercept data is about innocent people. I think the statistic is a phony. Especially in an article that later holds up US law enforcement practice as a superior model.
What’s wrong with the statistic? Well, let’s take an example from law enforcement. Suppose I become the target of a government investigation. The government gets a warrant and seizes a year’s worth of my email. Looking at my email patterns, that’s about 35,000 messages. About twenty percent, say 7500, are one-off messages that I can handle with a short reply (or by ignoring the message). Either way, I’ll never hear from that person again. And maybe a quarter are from about 500 people I hear from at least once a week. The remainder are a mix: people I trade emails with for a while and then stop, or infrequent correspondents who can show up any time. Conservatively, let’s say that about 25 people are responsible for the portion of my annual correspondence that falls into that category. In sum, the total number of correspondents in my stored email is 7500+500+25 = 8000 or so. So the criminal investigators who seized my email have stored messages from me, their investigative target, and from over 8000 people who aren’t targets.
Or, as the Washington Post might put it, “7999 out of 8000 account holders found in a large cache of communications seized by law enforcement were not the intended surveillance target but were caught in a net the investigators had cast for somebody else.”
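For anyone who wants to check my arithmetic, here is a rough sketch in Python. It uses the made-up counts from my own email example above, and it assumes that each correspondent shows up as exactly one account and that I am the only target. The function name non_target_share is just something I invented for illustration; it is not anything the Post or the NSA uses.

```python
def non_target_share(non_target_accounts, target_accounts=1):
    """Fraction of account holders in a cache who are not targets.

    Assumption (mine): one account per person, and a single target
    unless stated otherwise.
    """
    total = non_target_accounts + target_accounts
    return non_target_accounts / total


# Hypothetical counts from the email example in the text.
one_off = 7500      # people I hear from once and never again
weekly = 500        # people I hear from at least once a week
occasional = 25     # everyone else (a deliberately low guess)

correspondents = one_off + weekly + occasional
share = non_target_share(correspondents)

print(f"{correspondents} of {correspondents + 1} account holders "
      f"are not the target ({share:.2%})")
```

The point is that the share of non-targets is driven almost entirely by how many people write to the target. It comes out well above 99.9 percent here without telling you anything about how carefully the investigators chose their target.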
Maybe the Post is performing some far more sophisticated calculation, and they didn’t bother to explain it, despite its prominence in the story. If not, though, the inherent bias in the measure is such that it demands an acknowledgement. (After all, it allows you to say “half of all account holders in the database weren’t the target” if the agency stores just a single message sent to the target.) This is something that any halfway sentient editor should have recognized.
Which raises this question: I’ve heard of newspapers chasing stories that are “too good to check.” Does the Post think that Gellman’s are too good to edit?
UPDATE: The original email volumes were an order of magnitude too low, so I modified the numbers. H/T @davidfolkenflik for pointing out the error.