Last week I wrote about a Brookings survey on college students’ attitudes on free speech issues. Since then the survey methodology has been criticized as “junk science” in a Guardian article by reporter Lois Beckett because it was conducted online, using what is sometimes called a “non-probability” sample. While there could plausibly be other problems with this survey (as is true with any survey), these criticisms in and of themselves don’t render a poll “junk science.”

The critiques made in the Guardian article are either disingenuous, confused or both.

For those who haven’t already fallen down this rabbit hole, let me first explain what a so-called “probability-based” sample survey is, and how it differs from a so-called “non-probability-based” sample survey (also sometimes called an “opt-in” panel survey). Bear with me as we wade into the weeds.

A probability survey is — at least in theory — conducted by consulting the complete list of everyone in the population you’re interested in learning about, and then randomly selecting a portion of those people to poll.

If you want to know what Americans think about the president, for example, you might use a full list of phone numbers of every single American and randomly call 1,000 of them to ask. That assumes, of course, that every American has a phone (they don’t) and is willing to take your survey. Which will never actually be the case.

The basic assumption for a probability sample is that every member of the population you’re interested in has a known and non-zero probability of getting interviewed in your poll — and therefore if you repeated the process enough times, everyone would end up being in your sample. If your list is incomplete, or if a lot of the people you contact decide not to participate (a bigger problem today, with response rates around 9 percent, than in the past), you still might be concerned about how representative, and therefore meaningful, your results are.

The main threat to the validity of any survey is that there could be unobservable factors that affect both the probability that an individual completes the survey and the answers that he or she provides. People who like responding to surveys might answer differently from those who dislike responding to surveys, even after we control for race, gender, family income or other demographic information.

“Surveys underrepresent surly people; that’s what I tell my students,” says Andrew Gelman, a professor of statistics and political science at Columbia University who is arguably the statistics field’s biggest public intellectual.
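To make the "surly people" problem concrete, here is a minimal simulation sketch. Every number in it is invented for illustration: a hidden trait makes half the population both less likely to respond and less likely to say "yes," and the response probabilities are chosen so the overall response rate comes out near 9 percent. The sample itself is drawn perfectly at random; only the willingness to answer differs.

```python
import random

def simulate_nonresponse(n_population=100_000, n_contacted=10_000, seed=0):
    rng = random.Random(seed)
    # Hypothetical population: an unobserved trait ("surly") lowers both the
    # chance of responding and the chance of answering "yes".
    population = []
    for _ in range(n_population):
        surly = rng.random() < 0.5
        says_yes = rng.random() < (0.3 if surly else 0.7)
        population.append((surly, says_yes))
    true_share = sum(yes for _, yes in population) / n_population

    # A genuinely random draw from the full list of people.
    contacted = rng.sample(population, n_contacted)
    # Assumed response rates: 3 percent if surly, 15 percent otherwise.
    responses = [
        yes for surly, yes in contacted
        if rng.random() < (0.03 if surly else 0.15)
    ]
    observed_share = sum(responses) / len(responses)
    return true_share, observed_share, len(responses)

true_share, observed_share, n_responses = simulate_nonresponse()
print(f"true share saying yes:   {true_share:.3f}")
print(f"share among respondents: {observed_share:.3f} (n={n_responses})")
```

Under these assumptions the respondents overstate the true share by a wide margin, even though the contact list was a textbook random sample. Demographic weighting would not fix it, because the trait driving the gap is not something the pollster observes.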

So what about those scandalous “non-probability” surveys like the one we started with? Are they “junk science”?

Lots of major surveys are now conducted online and use “non-probability” samples. If you recently saw a survey from YouGov, Harris Poll, Morning Consult, SurveyMonkey, Google Consumer Surveys or Nielsen, for example, chances are that poll was a non-probability panel poll. Such polls are often cited by The Post, FiveThirtyEight, the New York Times and yes, even the Guardian.

Including in fact multiple times by Beckett, the Guardian reporter who just wrote that article in which critics suggested such polls are “junk science.”

There are a lot of different kinds of non-probabilistic surveys. Some are done well, and some are done poorly. But generally speaking, these kinds of polls are administered by developing a database of people who have indicated their willingness to be surveyed (often multiple times, which can be useful for tracking changes in opinions). The polling organization will reach out to people in the database to ask them to participate in a given poll when they happen to be a member of a population of interest.

That’s how the Brookings survey worked.

John Villasenor, a Brookings scholar and UCLA professor of engineering, public affairs, management and law, wrote a survey questionnaire on free speech issues. He contracted with Rand Survey Research Group, the 45-year-old polling arm of a well-known nonpartisan, nonprofit research organization, to administer the survey and advise him on the pros and cons of various statistical sampling methods, as he had not conducted a survey before. This is a common arrangement; often when you see a poll released on behalf of researchers, companies or nonprofits, they have contracted out the actual data collection work to an established pollster.

None of the existing Rand Internet panels had the sample of students Villasenor needed, so Rand helped him find another vendor, Opinion Access Corp., which used existing opt-in panels to identify the required number of college students to invite to complete the survey. (This is according to both Villasenor and Sandra Berry, senior survey adviser at Rand SRG; Opinion Access Corp. referred all questions about the survey methodology to Rand.)

Eligible respondents in this database — that is, college students (subsequently narrowed to college students at four-year schools only) — were then sent an email asking them to fill out the survey.

Upon receiving the raw data, Villasenor looked at the characteristics of his respondent pool to see whether they were broadly in line with the overall population of college students (geographically, racially, ethnically, etc.). To the extent they weren’t — the gender ratio was off — he re-weighted the data. Which is normal.
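For readers curious what re-weighting looks like mechanically, here is a minimal sketch of one common approach, post-stratification on a single variable. It is not Villasenor's actual procedure or data: the respondents, the answers and the target gender shares below are all hypothetical, chosen only so the arithmetic is easy to follow.

```python
from collections import Counter

def poststratify(respondents, population_shares):
    """Return one weight per respondent so weighted group shares match the targets."""
    counts = Counter(group for group, _ in respondents)
    n = len(respondents)
    return [population_shares[group] / (counts[group] / n) for group, _ in respondents]

def weighted_share(respondents, weights):
    """Weighted fraction of respondents who agree."""
    agree_weight = sum(w for (_, agrees), w in zip(respondents, weights) if agrees)
    return agree_weight / sum(weights)

# Hypothetical responses: (gender, agrees_with_statement)
respondents = (
    [("women", True)] * 26 + [("women", False)] * 39   # 65 women, 40% agree
    + [("men", True)] * 21 + [("men", False)] * 14     # 35 men, 60% agree
)
# Assumed population shares, for illustration only.
targets = {"women": 0.55, "men": 0.45}

weights = poststratify(respondents, targets)
raw = sum(agrees for _, agrees in respondents) / len(respondents)
adjusted = weighted_share(respondents, weights)
print(f"unweighted: {raw:.2f}, weighted: {adjusted:.2f}")
```

Because women are overrepresented in this made-up sample, each woman's answer counts for a bit less than one response and each man's for a bit more, which moves the estimate from 47 percent to 49 percent.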

Of course, there’s still the chance that the respondent pool was different from the general population of college students in some unobserved ways, as I mentioned earlier. Again, this is a problem no matter how you construct the sample.

Villasenor told me he also tried to match his questions up with other surveys from earlier years on similar topics, although he said it was difficult to determine how much the election of President Trump and other recent events, such as the neo-Nazi march in Charlottesville, may have influenced students’ views about free speech. Certainly it’s valid to wonder how the proximity to Charlottesville could have affected his results. Like any poll, this one tells you about people’s views at a specific moment in time. That doesn’t mean we should ignore these findings — or even that whatever effects Charlottesville may have had are temporary.

Villasenor’s process for surveying college students is not unusual. Consider a 2016 survey of college students released by the Panetta Institute, which was administered by Hart Research Associates. Some critics have cited this poll favorably while condemning Villasenor’s survey. But that poll describes its methodology in similar terms:

Hart Research contracted with an online survey vendor to administer the survey to a sample of people currently enrolled in some type of post-secondary institution drawn from the vendor’s multi-million-member respondent panel. Screening questions limited participation to students enrolled in a four-year higher learning institution. A total of 801 interviews were completed online.

In an ideal world, to make sure surveys of college students are representative, we would run a probability survey by consulting a phone book that listed every college student in the country. From that phone book we would select a sample of a thousand or so individuals at random, with each individual in the population equally likely to be selected into the survey. Further, every individual we selected into our sample would answer all of our survey questions.

Voilà, you’d have your probability survey!

In practice, no magic phone book of all college students currently exists, and even if it did, we could not compel everyone in it to complete our survey. Given this, doing a true probability survey of this population is not feasible.

Others have tried to build their own “magic phone book.” Another survey whose methodology the Guardian described favorably, and in contrast to the supposed “junk science” of Villasenor’s poll, is from the Knight Foundation and the Newseum Institute, and conducted by Gallup. Beckett described it as “a carefully randomized process from a nationally representative group of colleges.” The methodology section of the poll itself says Gallup surveyed a “random sample” of college students.

Well … sort of. Here’s what Gallup actually did: It selected a random sample of 240 U.S. four-year colleges, drawn from the Integrated Postsecondary Education Data System (IPEDS). Then Gallup contacted each of those 240 schools in an attempt to obtain a sample of their students; just 32 colleges agreed to participate, with Christian schools overrepresented, by my count, for some reason.

Then Gallup emailed a portion of the students at these schools to invite them to participate in an initial online screening, to be followed by a phone call; the combined response rate for the web recruit and telephone surveys was 6 percent. Gallup weighted its responses to match the demographics of U.S. colleges on enrollment, public or private affiliation, and region of the country.

As a reminder, for such a survey to be considered a probability survey, as the Guardian story suggests it is, every college student must have a known, non-zero probability of inclusion. But the probability of sampling any student who went to one of the schools that refused to participate is … zero.

In fact, the probability of surveying many of the students in the United States is also … zero.

If Harvard has a policy that prevents it from turning over student data to polling organizations, then no amount of repetitions of the survey process (calling colleges and asking for lists of students, etc.) will ever lead to any Harvard students ever being included.

This matters because, once again, the assumption underlying probability sampling is that if you repeated the process enough times, everyone would wind up in your sample. If some of the people in your population of interest have no chance of ever being in the sample — or always refuse to be panelists — it isn’t actually a probability sample. 
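The "repeat the process enough times" test is easy to sketch. In the toy setup below, the school names, the number of refusing schools and the number of rounds are all hypothetical. Schools are drawn at random in every round, but a school that refuses to hand over its student list contributes nobody, no matter how often it is drawn.

```python
import random

def schools_ever_sampled(schools, refusing, n_rounds=200, per_round=3, seed=1):
    """Repeat a random draw of schools many times; return those that ever yield students."""
    rng = random.Random(seed)
    seen = set()
    for _ in range(n_rounds):
        for school in rng.sample(sorted(schools), per_round):
            if school in refusing:
                continue  # drawn, but no student list is turned over
            seen.add(school)
    return seen

# Hypothetical universe of ten schools, three of which always refuse.
schools = {f"School {letter}" for letter in "ABCDEFGHIJ"}
refusing = {"School A", "School B", "School C"}

seen = schools_ever_sampled(schools, refusing)
print("never reachable:", sorted(schools - seen))
```

Every cooperating school shows up eventually, while the refusing schools never do. Their students have an inclusion probability of zero, which is exactly what the definition of a probability sample rules out.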

I point this out not to pick on Gallup. It does excellent work, and I cite its stuff all the time. In fact I cited this very survey of college students last year and would not hesitate to do so again.

The point is that even this supposed gold standard of polls doesn’t actually meet the impossible “probability sample” gold standard. You can call your poll a “probability-based sample,” but that’s really a theoretical concept. Whether a given poll satisfies the technical assumptions of this concept is open for debate. With any poll the usual caveats apply, and you shouldn’t get suckered into thinking one poll is the “truth” and other polls are “junk.”

Perhaps especially if you’re inconsistent about whether any given methodology is junk or truth, depending on whether you like its results.

Critics quoted in the Guardian article also lambasted Villasenor’s use of margins of error, saying those are only appropriate to calculate when using probability surveys.

Here’s what I’ll say about that. Like all margin of error calculations, these margins of error are valid only under very specific assumptions, including the assumption that there are no unobservable factors that affect both the probability of response and the answers people give in the survey. Which is almost certainly never the case in the real world, when you have 9 percent response rates to even random-digit-dial surveys of the general population. The criticism that you shouldn’t provide margins of error without conducting a true “probability survey” is so general that it could apply to any survey.
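For reference, the textbook margin of error for a proportion is a short formula, sketched below. This is the standard simple-random-sampling calculation, not necessarily the one Villasenor used, and it bakes in the assumptions just described: a random sample with no hidden link between who responds and what they say. The sample sizes are the 1,000-person example and the 801 interviews mentioned earlier.

```python
import math

def margin_of_error(p, n, z=1.96):
    """Half-width of an approximate 95% interval for a proportion p with n respondents.

    Assumes simple random sampling and no nonresponse bias.
    """
    return z * math.sqrt(p * (1 - p) / n)

# p = 0.5 gives the widest (most conservative) margin for a given n.
moe_1000 = margin_of_error(0.5, 1000)
moe_801 = margin_of_error(0.5, 801)
print(f"n=1000: +/- {moe_1000 * 100:.1f} points")
print(f"n=801:  +/- {moe_801 * 100:.1f} points")
```

Nothing in the formula knows how the respondents were recruited. It captures only the noise from drawing a finite sample, which is why it says nothing about the kind of bias that low response rates can introduce.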

Plenty of people in the polling community are still debating these issues, to be sure. (Gelman has lots of entertaining things to say about the industry’s infighting on this subject, and how much of it he believes might be driven by the desire to protect incumbent business models.)

Meanwhile, for most major news organizations, the ship on this sailed around 2014. The key to success for Nate Silver’s FiveThirtyEight was arguably not to decide that some polls were “methodologically flawed” and disregard their results, but to find the best way to aggregate results from multiple polls using multiple methodologies. This builds on another powerful idea in statistics: that all models are flawed, but an ensemble of flawed models nearly always predicts better than the “best” model within that ensemble.
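A toy version of the aggregation idea, with made-up poll numbers and a made-up "true" value, looks like this. It uses a plain unweighted average, which is far simpler than anything FiveThirtyEight actually does; the point is only to show why averaging helps.

```python
def squared_error(estimate, truth):
    return (estimate - truth) ** 2

# Hypothetical poll results and a hypothetical true value.
polls = [0.44, 0.47, 0.52, 0.55]
truth = 0.50

average = sum(polls) / len(polls)
error_of_average = squared_error(average, truth)
typical_poll_error = sum(squared_error(p, truth) for p in polls) / len(polls)
best_poll_error = min(squared_error(p, truth) for p in polls)

print(f"error of the average:      {error_of_average:.6f}")
print(f"average error of the polls: {typical_poll_error:.6f}")
print(f"error of the best poll:    {best_poll_error:.6f}")
```

The squared error of the average can never exceed the average squared error of the individual polls; that part is a mathematical guarantee. Beating the single best poll, as happens with these invented numbers, is not guaranteed, but in practice nobody knows in advance which poll will turn out to be the best one.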

This means that we shouldn’t necessarily treat any survey of public opinion as the final word on any matter. All social science depends on replication through a variety of methods. Most methodological critiques apply to nearly all methods. If we are going to be choosy about methodology, we must be consistent about our methodological choices before we look at the results. Otherwise we aren’t doing science at all.