Amazon’s Mechanical Turk service is a commonly used tool for social science experiments on the Internet.

The following is a guest post by political scientists Kathleen Searles of Louisiana State University and John Barry Ryan of Stony Brook University.

*****

For decades, social scientists have wondered what studies on college students can really tell us about human behavior in the “real world.” The concern is similar to the one in medicine about how much of what we learn from studies on lab rats applies to adult humans.

In political science, one important concern is that college students have opinions and world views that are not yet fully formed. For example, a campaign message that persuades college students in a researcher’s experiment might not convince most voters if used in a real campaign. The bottom line is that the sample of subjects in a study may be so different from the population at large that the inferences we draw may not be valid for a typical person.

In the modern world, many researchers recruit their experimental subjects through the Internet. One popular option is Amazon.com’s Mechanical Turk (MTurk). MTurk is a service where workers seeking payment for performing short tasks (known as Human Intelligence Tasks or HITs) are matched with prospective employers. (Amazon owner Jeffrey P. Bezos owns The Washington Post.)

Scholars have turned to MTurk as a convenient and cheap source of subjects for experiments, surveys and other tasks in a range of disciplines including political science. Surveys with research firms typically cost thousands of dollars, while a study on MTurk can cost as little as $100.

Despite increasing scholarly attention, academic use of MTurk remains controversial. Many people legitimately ask, “Who are these people who sit at their computers and do surveys for 50 cents?” The implication: They must be weird and, therefore, any study using them is flawed. In a Monkey Cage post titled “Don’t trust the Turk,” Andrew Gelman linked to Dan Kahan’s blog, where Kahan has written extensively on whether to trust samples derived from MTurk.

But a quite different view was presented at an expert roundtable at the recent meeting of the Midwest Political Science Association. We organized the panel (along with Scott Clifford), which featured Adam Berinsky, Cindy Kam, Yanna Krupnikov, Richard Lau and Thomas Leeper discussing the merits and pitfalls of using MTurk in political science studies. In short, the panelists argued that the right question is not whether to use MTurk but when to use MTurk.

Here are some key points from the discussion:

  1. MTurk users are not necessarily “weird.” The population of users on MTurk is constantly shifting and thus, it is hard to say with any certainty what characteristics a “typical” MTurk sample will possess. For this reason, it is more appropriate to ask whether the particular sample is valid for the research question at hand. Leeper discussed his work with Kevin Mullinix suggesting that results from a series of experiments are the same regardless of whether an MTurk sample, student sample or nationally representative sample is used. This should not be all that surprising because MTurk users are very similar to young people who participate in a commonly used national Internet survey.
  1. Sometimes it doesn’t matter what kind of sample you use. The standards for representativeness are not one size fits all. Researchers need to think through how the elements of their study may interact with the characteristics of participants. For example, in contrast to Leeper’s work, Krupnikov and Adam Seth Levine’s research finds that studies using MTurk can lead to different conclusions because MTurk users try to figure out what the study is really about. Other research has shown that studies involving altruism can be affected because MTurk users participate in many studies and share information with one another about those studies. Despite these potential differences, classic studies in economics and cognitive psychology have been successfully replicated using MTurk. Thus, researchers need to consider that the sample could affect their results, but that is generally good research practice and should be done for all studies, not just MTurk studies.
  1. Don’t oversell your results: Because studies using MTurk do not always lead to the same conclusion as those using national samples, researchers should be careful about the generalizations they make when using MTurk samples. This is no different than the care researchers should take when generalizing from student samples to adults or from American samples to outside the United States. In addition, Berinsky and his colleagues’ work on MTurk shows that the magnitude of an effect found with an MTurk sample may vary from what we would observe in the whole population even if the direction of the effect does not. For example, one could use MTurk to determine whether a particular campaign ad makes people more or less supportive of a candidate, but such a study would not necessarily tell you how much support changes.
  1. Think about your participants: Using MTurk requires careful attention to best practices not only in experimental design and implementation but also in the treatment of workers. Researchers should think through the ways in which their institutional review boards (which regulate research at universities) constrain how much workers need to be paid. Importantly, researchers should not abuse MTurk users. It is generally accepted that researchers should pay workers no less than minimum wage for their time. Researchers must also consider how they communicate with workers, and quick payment for a completed task is an important part of engendering good relations between academics and the MTurk community.
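
To make the minimum-wage guideline in the last point concrete, here is a rough sketch of the arithmetic. It assumes the U.S. federal minimum wage of $7.25 an hour as the benchmark, and the task lengths are hypothetical; a researcher would substitute whatever wage floor and time estimate apply to the study at hand. The function names are ours, for illustration only.

```python
import math


def min_fair_payment(minutes: float, hourly_wage: float = 7.25) -> float:
    """Smallest payment in dollars, rounded up to the cent, that pays at
    least `hourly_wage` for a task expected to take `minutes`.

    The 7.25 default is the U.S. federal minimum wage, used here only as
    an assumed benchmark.
    """
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    # Work in cents; the inner round() guards against floating-point noise
    # before rounding up to the next whole cent.
    cents = math.ceil(round(hourly_wage * 100 * minutes / 60, 6))
    return cents / 100


def implied_hourly_wage(payment: float, minutes: float) -> float:
    """Hourly rate implied by paying `payment` dollars for `minutes` of work."""
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    return payment * 60 / minutes


# A hypothetical 10-minute survey
print(min_fair_payment(10))           # 1.21
print(implied_hourly_wage(0.50, 10))  # 3.0
```

Under these assumptions, a 50-cent payment for a 10-minute survey works out to $3 an hour, well short of the benchmark, while roughly $1.21 would meet it.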

In the end, is MTurk an appropriate platform for political science studies? Our conclusion based on the research: It depends. MTurk studies are not especially good or especially bad. Researchers need to explain why MTurk is appropriate for their particular research question and keep the limitations of the sample in mind. Journal editors must separate inappropriate critiques of MTurk from fundamental concerns about experiments and from the limits on generalizability that apply to all academic studies.

Platitudes such as “Don’t trust the Turk” are nice, but, as is often the case in life, they are too simple to be followed.