My wife and I each have a friend named Jordan. My friend Jordan is a man. Hers is a woman. The likelihood that a "Jordan" in the United States would be a man is about 74 percent, according to the API from Genderize.io. But we're probably getting a bit ahead of ourselves.
When it released its second quarter fundraising numbers on Wednesday night, the Hillary Clinton campaign broke out a few bullet points. Among them was a bit of data that neatly reinforced one constituency's enthusiasm for Clinton's candidacy: "61 percent of our donors were women."
That's a pretty specific figure! Not "more than 60 percent," not "about 60 percent." Sixty-one percent. Flat out.
Let's say you give money through the website, maybe by buying something in the store. When you go to complete your purchase, you fill out all of the requisite information: Name, address, yada yada. And then whatever you ordered comes in the mail and you've contributed $20 or whatever to Hillary Clinton. At no point did you tell them your gender.
So: What if your name is Jordan? Do you fall into the 61 percent of women or the 39 percent of men? How did Clinton figure out that split?
I asked the campaign. "It was determined through an internal analysis," staffer Josh Schwerin told me. So, that's helpful.
It does apparently rule out one possibility, however. We've written in the past about Acxiom, a company that compiles massive amounts of consumer data about Americans to help companies better target customers. Facebook partners with Acxiom to allow political campaigns to upload a list of voters and target specific groups: young mothers, say. (Facebook doesn't then allow that data to be exported back to the campaign.)
But Clinton did the analysis internally. Nathan Matias of the MIT Center for Civic Media wrote an extensive delineation of how, with a list of names in hand, one might go about figuring out a person's gender.
"The simplest approach" to automating gender identification, he writes, "is to use historical birth records to estimate the likely sex of a first name." The Social Security Administration releases annual data on the names given to babies each year. If you know a name and a birth year, it becomes much easier to narrow down identity.
Matias points to a project that uses information from the Global Name Data project, which itself uses data from the SSA. A look at that data indicates that 2.5 percent of the names it includes are within the 25 to 75 percent likelihood range for gender certainty. As a percentage of all the people included in the calculations (versus the percentage of names), 2.1 percent of people fall into that category -- and 90.7 percent of people fall into the 75 percent or more certainty range.
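For the curious, here is roughly what the birth-record approach Matias describes could look like in code. This is a minimal sketch, not the campaign's method (which we don't know) and not the code of any project mentioned here. The `RECORDS` rows and the `probability_male` function are hypothetical, and the counts are invented for illustration rather than taken from the SSA.

```python
from collections import defaultdict

# Hypothetical (name, sex, birth_year, count) rows in the spirit of
# annual baby-name data. These counts are invented, not real SSA figures.
RECORDS = [
    ("Jordan", "M", 1985, 740),
    ("Jordan", "F", 1985, 260),
    ("Jordan", "M", 1950, 95),
    ("Jordan", "F", 1950, 5),
    ("Hillary", "F", 1985, 990),
    ("Hillary", "M", 1985, 10),
]


def probability_male(name, records, birth_year=None):
    """Return the share of matching births recorded as male.

    If birth_year is given, only that year's rows are counted.
    Returns None when no rows match the name (and year).
    """
    counts = defaultdict(int)
    for rec_name, sex, year, count in records:
        if rec_name.lower() != name.lower():
            continue
        if birth_year is not None and year != birth_year:
            continue
        counts[sex] += count
    total = counts["M"] + counts["F"]
    if total == 0:
        return None
    return counts["M"] / total


if __name__ == "__main__":
    print(probability_male("Jordan", RECORDS))                   # all years pooled
    print(probability_male("Jordan", RECORDS, birth_year=1950))  # one year only
    print(probability_male("Pat", RECORDS))                      # name not in the data
```

In this made-up data, knowing the birth year shifts the estimate for "Jordan" considerably, which is the point of having a year as well as a name.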
In this data, by the way, the probability that a "Jordan" is male drops to 73 percent and change.
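The difference between the share of ambiguous names and the share of people who carry them comes down to weighting: uncommon names count as much as common ones in the first figure, but not in the second. A rough sketch of that arithmetic follows, assuming "ambiguous" means a male share strictly between 25 and 75 percent. The `NAME_COUNTS` table and both functions are hypothetical, and the numbers are invented, so the output will not match the 2.5 and 2.1 percent figures above.

```python
# Hypothetical name -> (male_count, female_count). Invented numbers.
NAME_COUNTS = {
    "James": (9900, 100),
    "Mary": (50, 9950),
    "Jordan": (740, 260),
    "Casey": (60, 40),
}


def is_ambiguous(male, female, low=0.25, high=0.75):
    """True when the male share falls strictly between low and high."""
    share = male / (male + female)
    return low < share < high


def ambiguity_shares(name_counts):
    """Return (share of names that are ambiguous,
    share of people who have an ambiguous name)."""
    ambiguous_names = 0
    ambiguous_people = 0
    total_people = 0
    for male, female in name_counts.values():
        people = male + female
        total_people += people
        if is_ambiguous(male, female):
            ambiguous_names += 1
            ambiguous_people += people
    return ambiguous_names / len(name_counts), ambiguous_people / total_people


if __name__ == "__main__":
    by_name, by_person = ambiguity_shares(NAME_COUNTS)
    print(f"{by_name:.1%} of names, {by_person:.1%} of people")
```

Here half of the four names are ambiguous, but because far fewer people carry them, the share of people affected is much smaller.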
So those figures, with whatever uncertainty exists, come from looking only at name. The more information you have, the easier it can be to identify a person's gender.
After all, we're talking about a political campaign. And the value to Hillary Clinton's team in knowing the gender of donors is not so that it can fill out a neat bullet point on a press release. The value lies in knowing the gender of voters across the board, so that they can be targeted with appropriate messages and advertising. So, without question, the information handed over by donors went into the campaign's existing voter database, where it sits with all of the other information the campaign has about a person, starting with the kernel of when, where and how often they vote.
As you're probably aware, voter registration information and voting histories are public information. What information is collected varies by state. In New York, the voter registration form mandates a gender choice. In California, it doesn't. Since this information is updated after each election, and because people move around, it's hard for a candidate's campaign to rebuild a list from scratch. So there are external vendors that maintain voter files, as well as systems run by political parties. There's a lot of competition in this; earlier this year, we noted that the Republicans were losing data customers to an external group.
In Clinton's FEC filing, the campaign lists a number of payments to "NGP Van Inc.," the go-to voter file vendor for Democrats. VAN (as it's known) allows campaigns to navigate data that's been compiled on voters for years by the party and a company called TargetSmart, rolling in information provided by campaigns and updates from state registration data. Most of the people who gave money to Clinton are likely already identified by gender in the database to which Clinton's campaign is subscribing. (Update: TargetSmart tells us that 1.3 percent of the voters in its database aren't identified by gender.) It's not cheap; the campaign has already paid VAN nearly $80,000. But for what it provides, it's invaluable.
There's almost certainly still some margin of error in the 61 percent women figure that Clinton's campaign identifies: People missing from the voter file or people named Jordan or any number of other problems. We can't really know how far off it is, because we don't know how they got the figure in the first place. It's probably close to right. But who knows? Data are harder than they seem.
But it could be easier, as Nathan Matias points out at the outset of his article. "The simplest way to collect gender data," he writes, "is to ask people."
Correction: This post originally indicated that VAN did the data collection.