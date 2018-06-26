

I have been a professor for more than two decades now. The fruits of technology have made almost all aspects of my day job easier, but there are a few areas where my Luddite instincts have won out. One of those areas is technology in the classroom. Another is the effort to quantify what I do for a living.

The meat and potatoes of evaluating college faculty is research and teaching. In evaluations for hiring, tenure, promotion and so forth, departments want to know a professor’s research impact and teaching abilities. Sure, one could go through the trouble of reading the work and observing the classes, but that takes time, effort and a willingness to travel beyond one’s area of expertise. Specialization makes it difficult for, say, an expert on Congress to assess a political theorist’s published paper on Rousseau.

Instead, academic units have come to rely on more quantifiable metrics to assess research and teaching. Google Scholar, for example, provides easily accessible data on citation counts (how often someone’s article is cited by others). Similarly, universities have tried to quantify teaching through the use of student evaluations. These evaluations usually ask students to rate their professor on a 1-5 scale on a variety of teaching criteria (course difficulty, teacher feedback on grading, etc.). These evaluations are more sophisticated than RateMyProfessor, but are in the same general ballpark.

So far, so understandable. Surely it makes sense to rely on citation metrics and student evals over impressionistic takes on scholarship, right?

Two articles this year in PS: Political Science and Politics offer a sobering reminder that just because something is quantified does not mean it is objective. The first one, by Kristina Mitchell and Jonathan Martin, analyzes gender bias in student evaluations of teachers [SETs] by comparing how a male and a female professor are rated when teaching the exact same online course. Their results are pretty plain:

First, women are evaluated based on different criteria than men, including personality, appearance, and perceptions of intelligence and competency. To test this, we used a novel method: a content analysis of student comments in official open-ended course evaluations and in online anonymous commentary. The evidence from the content analysis suggests that women are evaluated more on personality and appearance, and they are more likely to be labeled a “teacher” than a “professor.” Second, and perhaps more important, we argue that women are rated more poorly than men even in identical courses and when all personality, appearance, and other factors are held constant. We compared the SETs of two instructors, one man and one woman, in identical online courses using the same assignments and course format. In this analysis, we found strong evidence to suggest gender bias in SETs.

Their conclusion is not a new finding; it confirms what the prior literature has said. But it is yet another data point showing that relying solely on student evaluations to evaluate a professor’s teaching abilities is not just wrong, it’s potentially discriminatory.

The second article, by six political scientists, is “The Benefits and Pitfalls of Google Scholar.” They take care to note the strengths of the site: it nudges scholars toward more open access to their work, for example. But the flaws of Google Scholar (GS) are very real. To give just one example from the paper:

GS counts are biased toward incremental work and away from boldness and innovation. Highly original work that does not fit neatly into an existing literature might establish a new research agenda and expand interest in the topic, but its impact will not be visible in citation counts for many years. According to GS, John Nash’s foundational paper defining Nash equilibrium received only 16 citations in the first five years after publication…. In general, the number of citations that an article or book receives in the five or so years after publication reveals little about its long-term impact.

Much as “teaching to the test” has its pedagogical problems, “researching for the citation count” is a recipe for risk-averse scholarship. This is particularly problematic for junior scholars, whose work is usually evaluated for tenure just a few years after publication.

So, to sum up: the primary means by which universities measure research impact and teaching excellence rely heavily on flawed metrics. Oh, and they are particularly flawed if you are a woman.

Does this mean these measurements should be discarded completely? No, of course not. All metrics are flawed. A scholar’s idiosyncratic take on a colleague’s teaching style is not guaranteed to be free of bias either.

If universities are going to continue to use these metrics, however, then they need to do so with their eyes open. The biases in the metrics cannot just be acknowledged at the beginning of a meeting and then ignored. They have to stay in mind throughout any tenure or promotion deliberation. Otherwise, all quantifiable metrics do is provide senior scholars a false sense of security that they are objectively evaluating a subordinate.