“If you can’t prove what you want to prove, demonstrate something else and pretend that they are the same thing. In the daze that follows the collision of statistics with the human mind, hardly anybody will notice the difference.”

-D. Huff (1954)

To a great extent, this is what the academy does with student evaluations of teaching effectiveness. We don’t measure teaching effectiveness. We measure what students say, and pretend it’s the same thing. We dress up the responses by taking averages to one or two decimal places, and call it a day.

But what is effective teaching? Presumably, it has something to do with learning. An effective teacher is skillful at creating conditions that are conducive to learning. What is to be learned varies by discipline and by course: It might be a combination of facts, skills, understanding, ways of thinking, habits of mind, a maturing of perspective, or something else. Regardless, some learning will happen no matter what the instructor does. Some students will not learn much no matter what the instructor does. How can we tell how much the instructor helped or hindered learning in a particular class?

**What can we measure?**

Measuring learning is not simple: Course grades and exam scores are poor proxies, because courses and exams can be easy or hard.^{[1]} If exams were set by someone other than the instructor—as they are in some universities—we might be able to use exam scores to measure learning.^{[2]} But that’s not how our university works, and there would still be a risk of “teaching to the test.”

Performance in follow-on courses and career success may be better measures of learning, but time must pass to make such measurements, and it is difficult to track students over time. Moreover, relying on long-term performance measures can complicate causal inference. How much of someone’s career success can be attributed to a single course?

There is a large literature on student teaching evaluations. Most of the research addresses *reliability*: Do different students give the same instructor similar marks?^{[3]} Would the same student give the same instructor a similar mark at a different time, e.g., a year after the course ends?^{[4]}

These questions have little to do with whether the evaluations measure effectiveness. A hundred bathroom scales might all report your weight to be the same. That doesn’t mean the readings are accurate measures of your *height* (or even your weight, for that matter).
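The gap between agreement and accuracy can be illustrated with a small simulation. The sketch below is only an illustration: the people, the 100 scales, and every number in it are invented, and it assumes heights and weights are drawn independently, which is not true of real people. The scales agree closely with one another and track weight almost perfectly, yet their readings say essentially nothing about height.

```python
import random
from statistics import mean, pstdev

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

rng = random.Random(0)  # fixed seed so the sketch is reproducible
n_people, n_scales = 200, 100

# Hypothetical population; height and weight are independent BY ASSUMPTION.
heights = [rng.gauss(170, 10) for _ in range(n_people)]  # cm
weights = [rng.gauss(70, 12) for _ in range(n_people)]   # kg

# Each scale reports the person's weight plus a small random error.
readings = [[w + rng.gauss(0, 0.5) for _ in range(n_scales)] for w in weights]

# Reliability: how much the 100 scales disagree about the same person.
spread_within = mean(pstdev(r) for r in readings)

# Validity: how well the average reading tracks the quantity of interest.
avg_reading = [mean(r) for r in readings]
validity = pearson(avg_reading, heights)

print("typical disagreement among scales (kg):", round(spread_within, 2))
print("correlation of readings with weight:", round(pearson(avg_reading, weights), 3))
print("correlation of readings with height:", round(validity, 3))
```

High agreement among raters is a statement about the first number; whether ratings measure teaching effectiveness is a question about the last.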

Moreover, inter-rater reliability strikes us as an odd thing to worry about, in part because it’s easy to report the full distribution of student ratings—as we advocated in part I of this blog. Scatter matters, and it can be measured *in situ* in every course.

**Observational Studies v. Randomized Experiments**

Most of the research on student teaching evaluations is based on *observational studies*. Students take whatever courses they choose from whomever they choose. The researchers watch and report. In the entire history of science, there are few observational studies that justify inferences about causes.^{[5]}

In general, to infer causal relationships (e.g., to determine whether effective teaching generally leads to positive student teaching evaluations) requires a *controlled, randomized experiment* rather than an observational study. In a controlled, randomized experiment, individuals are assigned to groups at random; the groups get different *treatments*; the outcomes are compared across groups to test whether the treatments have different effects and to estimate the sizes of those differences.

“Random” is not the same as “haphazard.” In a randomized experiment, the experimenter deliberately uses a blind, non-discretionary chance mechanism to assign individuals to treatment groups. Randomization tends to mix individuals across groups in a balanced way. Differences in outcomes among the groups then can be attributed to a combination of chance and differences in the treatments. The contribution of chance to those differences can be taken into account rigorously, allowing scientific inferences about the effects of the treatments. Absent randomization, differences among the groups other than the treatment can be *confounded* with the effect of the treatment, and there is generally no way to tell how much the confounding contributes to the observed differences.^{[6]}
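One standard way to account for chance in a randomized two-group comparison is a permutation test. If the treatment made no difference, the group labels would be arbitrary, so we can reshuffle them many times and see how often chance alone produces a difference at least as large as the one observed. The sketch below is a minimal illustration under that assumption. The function name `permutation_p_value` and the scores are hypothetical, and this is not necessarily the analysis used in the studies cited in this post.

```python
import random

def permutation_p_value(group_a, group_b, n_perm=10_000, seed=0):
    """Two-sided permutation p-value for a difference in group means."""
    rng = random.Random(seed)
    n_a = len(group_a)

    def abs_mean_diff(values):
        a, b = values[:n_a], values[n_a:]
        return abs(sum(a) / len(a) - sum(b) / len(b))

    pooled = list(group_a) + list(group_b)
    observed = abs_mean_diff(pooled)

    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # re-assign the group labels at random
        # Small tolerance guards against floating-point rounding in ties.
        if abs_mean_diff(pooled) >= observed - 1e-12:
            hits += 1

    # Count the observed assignment itself so the p-value is never zero.
    return (hits + 1) / (n_perm + 1)

# Hypothetical follow-on exam scores for two randomly assigned sections.
section_1 = [72, 85, 78, 90, 66, 81, 77, 88]
section_2 = [70, 79, 74, 83, 65, 76, 71, 80]
print("p-value:", permutation_p_value(section_1, section_2))
```

The calculation is justified by the random assignment itself. Applied to self-selected groups, the same arithmetic would still produce a number, but it would not separate the effect of the treatment from the effect of confounding.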

For instance, suppose that some students choose which section of a course to take by finding the professor reputed to be the most lenient grader. Those students might then rate that professor highly for meeting their expectations of an “easy A.” If those students perform similar research to decide which section of a sequel course to take, they are likely to get good (but easy) grades in that course as well. This would tend to “prove” that the high ratings the first professor received were justified, because students who take the class from him or her tend to do well in the sequel.

The best way to reduce confounding is to assign students at random to sections of the first and second courses. This will tend to mix students with different abilities and from easy and hard sections of the prequel across sections of the sequel. Such randomization isn’t possible at Berkeley: We let students choose their own sections (within the constraints of enrollment limits).
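A short simulation shows how self-selection and random assignment differ. It is a sketch under invented assumptions: each hypothetical student has a trait `seeks_easy` between 0 and 1, and under self-selection a student chooses the lenient section "A" with probability equal to that trait. Nothing here is drawn from real enrollment data.

```python
import random
from statistics import mean

rng = random.Random(1)  # fixed seed so the sketch is reproducible
n = 1000

# Hypothetical trait: how strongly each student prefers an easy grade.
seeks_easy = [rng.random() for _ in range(n)]

# Self-selection: grade-seekers are more likely to pick lenient section "A".
self_selected = ["A" if rng.random() < s else "B" for s in seeks_easy]

# Randomization: a shuffle decides who goes where, ignoring the trait.
shuffled = list(range(n))
rng.shuffle(shuffled)
in_a = set(shuffled[: n // 2])
randomized = ["A" if i in in_a else "B" for i in range(n)]

def gap(assignment):
    """Difference in mean trait between section A and section B."""
    a = [s for s, g in zip(seeks_easy, assignment) if g == "A"]
    b = [s for s, g in zip(seeks_easy, assignment) if g == "B"]
    return mean(a) - mean(b)

print("trait gap, self-selected sections:", round(gap(self_selected), 3))
print("trait gap, randomized sections:   ", round(gap(randomized), 3))
```

Under self-selection the two sections differ systematically before any teaching happens, so later differences in grades or ratings cannot be credited to the instructor alone. Under randomization the sections differ only by chance.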

However, this experiment has been done elsewhere: the U.S. Air Force Academy^{[7]} and Bocconi University in Milan, Italy.^{[8]}

These studies confirm the common belief that good teachers can get bad evaluations: Teaching effectiveness, as measured by subsequent performance and career success, is negatively associated with student teaching evaluations. While one should be cautious in generalizing the conclusions because the two student populations might not be representative of students at large (or at least of Berkeley students), these are by far the best studies we know of. They are the only controlled, randomized experiments; they are from different continents and cultures; and their findings are concordant.

**What do student teaching evaluations measure?**

There is evidence that student teaching evaluations are reliable, in the sense that students generally agree.^{[9]} Homogeneity of ratings is an odd thing to focus on. We think it would be a truly rare instructor who was equally effective (or equally ineffective) at facilitating learning across a spectrum of students with different backgrounds, preparation, skill, disposition, maturity, and ‘learning style.’ That in itself suggests that if ratings are indeed extremely consistent, as various studies assert, then perhaps ratings measure something other than teaching effectiveness. If a laboratory instrument always gives the same reading when its inputs vary substantially, it’s probably broken.

If evaluations don’t measure teaching effectiveness, what do they measure? While we do not vouch for the methodology in any of the studies cited below, their conclusions indicate that there is conflicting evidence and little consensus:

● student teaching evaluation scores are highly correlated with students’ grade expectations^{[10]}

● effectiveness scores and enjoyment scores are related^{[11]}

● students’ ratings of instructors can be predicted from the students’ reaction to 30 seconds of silent video of the instructor: first impressions may dictate end-of-course evaluation scores, and physical attractiveness matters^{[12]}

● the genders and ethnicities of the instructor and student matter, as does the age of the instructor^{[13]}

Worthington (2002, p. 13) also makes the troubling claim that “the questions in student evaluations of teaching concerning curriculum design, subject aims and objectives, and overall teaching performance appear most influenced by variables that are unrelated to effective teaching.” We as a campus hang our hats on just such a question about overall teaching performance.

**What are student evaluations of teaching good for?**

Students are arguably in the best position to judge certain aspects of teaching that *contribute* to effectiveness, such as clarity, pace, legibility, audibility. We can use surveys to get a picture of these things; of course, the statistical issues raised in part I of this blog still matter (esp. response rates, inappropriate use of averages, false numerical precision, and scatter).

Trouble ensues when we ask students to rate teaching effectiveness *per se*. On the whole, students then answer a rather different set of questions from those they are asked, regardless of their intentions. Calling the result a measure of teaching effectiveness does not make it so, any more than you can make a bathroom scale measure height by relabeling its dial “height.” Calculating precise averages of “height” measurements made with 100 different scales would not help. And comparing two individuals’ average “height” measurements would not reveal who was in fact taller.

**Summary**

● Teaching effectiveness ratings might be consistent across students; this can be assessed in every class in every semester. But consistency is a red herring. The real question is whether ratings measure instructors’ ability to facilitate learning, not whether all students rate an instructor similarly. Does better teaching earn better ratings?

● Controlled, randomized experiments are the gold standard for reliable inference about cause and effect. The only controlled randomized experiments on student teaching evaluations have found that student evaluations of teaching effectiveness are negatively associated with direct measures of effectiveness: Evaluations do not seem to measure teaching effectiveness. There are only two such experiments, so caution is in order, but they do suggest that better teaching causes students to give worse ratings, at least in some circumstances.

● Student teaching evaluations may be influenced by factors that have nothing to do with effectiveness, such as the gender, ethnicity, and attractiveness of the instructor. Students seem to make snap judgments about instructors that have nothing to do with teaching effectiveness, and to rely on those judgments when asked about teaching effectiveness.

● The survey questions apparently most influenced by extraneous factors are exactly of the form we ask on campus: overall teaching effectiveness.

● Treating student ratings of overall teaching effectiveness as if they measured teaching effectiveness is misleading: Relabeling a package does not change its contents.

We think student teaching evaluations—especially student comments—contain information useful for assessing and improving teaching. But they need to be used cautiously and appropriately as part of a comprehensive review.

It’s time for Berkeley to revisit the wisdom of asking students to rate the overall teaching effectiveness of instructors, of considering those ratings to be a measure of actual teaching effectiveness, of reporting the ratings numerically and computing and comparing averages, and of relying on those averages for high-stakes decisions such as merit cases and promotions.

In the third installment of this blog, we discuss a pilot conducted in the Department of Statistics in 2012–2013 to augment student teaching evaluations with other sources of information. The additional sources still do not measure effectiveness directly, but they complement student teaching evaluations and provide formative feedback and touchstones. We believe that the combination paints a more complete picture of teaching and will promote better teaching in the long run.

^{[1]} According to Beleche, Fairris & Marks (2012), “It is not clear that higher course grades necessarily reflect more learning. The positive association between grades and course evaluations may also reflect initial student ability and preferences, instructor grading leniency, or even a favorable meeting time, all of which may translate into higher grades and greater student satisfaction with the course, but not necessarily to greater learning” (p. 1).

^{[2]} See, e.g., http://xkcd.com/135/

^{[3]} See, e.g., Abrami et al., 2001; Braskamp and Ory, 1994; Centra, 2003; Ory, 2001; Wachtel, 1998; Marsh and Roche, 1997.

^{[4]} See, e.g., Braskamp and Ory, 1994; Centra, 1993; Marsh, 2007; Marsh and Dunkin, 1992; Overall and Marsh, 1980.

^{[5]} A notable exception is John Snow’s research on the cause of cholera; his study amounts to a “natural experiment.” See http://www.stat.berkeley.edu/~stark/SticiGui/Text/experiments.htm#cholera for a discussion.

^{[6]} See, e.g., http://xkcd.com/552/

^{[7]} Carrell and West, 2008.

^{[8]} Braga, Paccagnella, and Pellizzari, 2011.

^{[9]} Braskamp and Ory, 1994; Centra, 1993; Marsh, 2007; Marsh and Dunkin, 1992; Overall and Marsh, 1980.

^{[10]} Marsh and Cooper, 1980; Short et al., 2012; Worthington, 2002.

^{[11]} In a pilot of online course evaluations in the Department of Statistics in fall 2012, among the 1486 students who rated the instructor’s overall effectiveness and their enjoyment of the course on a 7-point scale, the correlation between instructor effectiveness and course enjoyment was 0.75, and the correlation between course effectiveness and course enjoyment was 0.8.

^{[12]} Ambady and Rosenthal, 1993.

^{[13]} Anderson and Miller, 1997; Basow, 1995; Cramer and Alexitch, 2000; Marsh and Dunkin, 1992; Wachtel, 1998; Weinberg et al., 2007; Worthington, 2002.

*Co-authored with Richard Freishtat, Ph.D. and cross-posted from UC Berkeley’s Center for Teaching and Learning blog*