
What exactly do student evaluations measure?

Philip Stark, professor of statistics | October 21, 2013

“If you can’t prove what you want to prove, demonstrate something else and pretend that they are the same thing. In the daze that follows the collision of statistics with the human mind, hardly anybody will notice the difference.”

-D. Huff, How to Lie with Statistics (1954), Chapter 7, “The Semiattached Figure”

To a great extent, this is what the academy does with student evaluations of teaching effectiveness. We don’t measure teaching effectiveness.  We measure what students say, and pretend it’s the same thing. We dress up the responses by taking averages to one or two decimal places, and call it a day.

But what is effective teaching? Presumably, it has something to do with learning.  An effective teacher is skillful at creating conditions that are conducive to learning. What is to be learned varies by discipline and by course: It might be a combination of facts, skills, understanding, ways of thinking, habits of mind, a maturing of perspective, or something else.  Regardless, some learning will happen no matter what the instructor does. Some students will not learn much no matter what the instructor does. How can we tell how much the instructor helped or hindered learning in a particular class?

What can we measure?

Measuring learning is not simple: Course grades and exam scores are poor proxies, because courses and exams can be easy or hard.[1] If exams were set by someone other than the instructor—as they are in some universities—we might be able to use exam scores to measure learning.[2] But that’s not how our university works, and there would still be a risk of “teaching to the test.”

Performance in follow-on courses and career success may be better measures of learning, but time must pass to make such measurements, and it is difficult to track students over time. Moreover, relying on long-term performance measures can complicate causal inference.  How much of someone’s career success can be attributed to a single course?

There is a large literature on student teaching evaluations. Most of the research addresses reliability: Do different students give the same instructor similar marks?[3] Would the same student give the same instructor a similar mark at a different time, e.g., a year after the course ends?[4]

These questions have little to do with whether the evaluations measure effectiveness.  A hundred bathroom scales might all report your weight to be the same. That doesn’t mean the readings are accurate measures of your height (or even your weight, for that matter).

Moreover, inter-rater reliability strikes us as an odd thing to worry about, in part because it’s easy to report the full distribution of student ratings—as we advocated in part I of this blog. Scatter matters, and it can be measured in situ in every course.

Observational Studies vs. Randomized Experiments

Most of the research on student teaching evaluations is based on observational studies. Students take whatever courses they choose from whomever they choose. The researchers watch and report. In the entire history of science, there are few observational studies that justify inferences about causes.[5]

In general, to infer causal relationships (e.g., to determine whether effective teaching generally leads to positive student teaching evaluations) requires a controlled, randomized experiment rather than an observational study.  In a controlled, randomized experiment, individuals are assigned to groups at random; the groups get different treatments; the outcomes are compared across groups to test whether the treatments have different effects and to estimate the sizes of those differences.

“Random” is not the same as “haphazard.” In a randomized experiment, the experimenter deliberately uses a blind, non-discretionary chance mechanism to assign individuals to treatment groups.  Randomization tends to mix individuals across groups in a balanced way. Differences in outcomes among the groups then can be attributed to a combination of chance and differences in the treatments.  The contribution of chance to those differences can be taken into account rigorously, allowing scientific inferences about the effects of the treatments. Absent randomization, differences among the groups other than the treatment can be confounded with the effect of the treatment, and there is generally no way to tell how much the confounding contributes to the observed differences.[6]
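The effect of confounding, and how randomization removes it, can be illustrated with a small simulation. This is a hypothetical sketch with made-up numbers, not data from any study: students with stronger preparation self-select into one section, so that section's average outcome looks better even though the two “treatments” are identical.

```python
import random

random.seed(0)

# Hypothetical students: each has an underlying "preparation" score.
students = [random.gauss(0, 1) for _ in range(10000)]

# Outcome model: neither section adds anything; a student's score
# reflects preparation plus noise. Any gap between sections is
# therefore confounding, not a treatment effect.
def outcome(prep):
    return prep + random.gauss(0, 1)

# Observational comparison: better-prepared students self-select
# into section A, weaker-prepared students end up in section B.
section_a = [s for s in students if s > 0]
section_b = [s for s in students if s <= 0]
obs_gap = (sum(map(outcome, section_a)) / len(section_a)
           - sum(map(outcome, section_b)) / len(section_b))

# Randomized comparison: a blind shuffle assigns each student
# to a section, balancing preparation across groups.
random.shuffle(students)
half = len(students) // 2
rand_gap = (sum(map(outcome, students[:half])) / half
            - sum(map(outcome, students[half:])) / (len(students) - half))

print(f"observational gap: {obs_gap:.2f}")  # large, driven entirely by self-selection
print(f"randomized gap:    {rand_gap:.2f}")  # near zero, as it should be
```

The observational comparison reports a large “effect” even though the sections are identical by construction; the randomized comparison correctly reports a gap near zero.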

For instance, suppose that some students choose which section of a course to take by finding the professor reputed to be the most lenient grader. Those students might then rate that professor highly for meeting their expectations of an “easy A.”  If those students perform similar research to decide which section of a sequel course to take, they are likely to get good (but easy) grades in that course as well.  This would tend to “prove” that the high ratings the first professor received were justified, because students who take the class from him or her tend to do well in the sequel.

The best way to reduce confounding is to assign students at random to sections of the first and second courses.  This will tend to mix students with different abilities and from easy and hard sections of the prequel across sections of the sequel. Such randomization isn’t possible at Berkeley: We let students choose their own sections (within the constraints of enrollment limits).

However, this experiment has been done elsewhere: at the U.S. Air Force Academy[7] and at Bocconi University in Milan, Italy.[8]

These studies confirm the common belief that good teachers can get bad evaluations: Teaching effectiveness, as measured by subsequent performance and career success, is negatively associated with student teaching evaluations. While one should be cautious in generalizing the conclusions because the two student populations might not be representative of students at large (or at least of Berkeley students), these are by far the best studies we know of. They are the only controlled, randomized experiments; they are from different continents and cultures; and their findings are concordant.

What do student teaching evaluations measure?

There is evidence that student teaching evaluations are reliable, in the sense that students generally agree.[9] Homogeneity of ratings is an odd thing to focus on. We think it would be a truly rare instructor who was equally effective (or equally ineffective) at facilitating learning across a spectrum of students with different backgrounds, preparation, skills, dispositions, maturity, and ‘learning styles.’ That in itself suggests that if ratings are indeed extremely consistent, as various studies assert, then perhaps ratings measure something other than teaching effectiveness. If a laboratory instrument always gives the same reading when its inputs vary substantially, it’s probably broken.

If evaluations don’t measure teaching effectiveness, what do they measure? While we do not vouch for the methodology in any of the studies cited below, their conclusions indicate that there is conflicting evidence and little consensus:

●  student teaching evaluation scores are highly correlated with students’ grade expectations[10]

●  effectiveness scores and enjoyment scores are related[11]

●  students’ ratings of instructors can be predicted from the students’ reaction to 30 seconds of silent video of the instructor: first impressions may dictate end-of-course evaluation scores, and physical attractiveness matters[12]

●  the genders and ethnicities of the instructor and student matter, as does the age of the instructor[13]

Worthington (2002, p.13) also makes the troubling claim, “the questions in student evaluations of teaching concerning curriculum design, subject aims and objectives, and overall teaching performance appear most influenced by variables that are unrelated to effective teaching.” We as a campus hang our hats on just such a question about overall teaching performance.

What are student evaluations of teaching good for?

Students are arguably in the best position to judge certain aspects of teaching that contribute to effectiveness, such as clarity, pace, legibility, audibility.  We can use surveys to get a picture of these things; of course, the statistical issues raised in part I of this blog still matter (esp. response rates, inappropriate use of averages, false numerical precision, and scatter).
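A toy example, with invented numbers, shows why the averages criticized in part I obscure the scatter that matters: two instructors can share exactly the same mean rating while the underlying distributions tell very different stories.

```python
from collections import Counter
from statistics import mean

# Hypothetical 7-point ratings for two instructors with identical averages.
instructor_1 = [4] * 20              # consensus: everyone rates "middling"
instructor_2 = [1] * 10 + [7] * 10   # polarized: half loved it, half loathed it

assert mean(instructor_1) == mean(instructor_2) == 4

# Reporting the full distribution reveals the difference the mean hides.
for name, ratings in [("instructor 1", instructor_1),
                      ("instructor 2", instructor_2)]:
    dist = Counter(ratings)
    print(name, [dist.get(r, 0) for r in range(1, 8)])
```

Both instructors “average 4.0,” yet one is consistently adequate and the other is deeply divisive; only the full distribution distinguishes them.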

Trouble ensues when we ask students to rate teaching effectiveness per se. On the whole, students then answer a rather different set of questions from those they are asked, regardless of their intentions.  Calling the result a measure of teaching effectiveness does not make it so, any more than you can make a bathroom scale measure height by relabeling its dial “height.” Calculating precise averages of “height” measurements made with 100 different scales would not help.  And comparing two individuals’ average “height” measurements would not reveal who was in fact taller.


●  Teaching effectiveness ratings might be consistent across students; this can be assessed in every class in every semester. But consistency is a red herring. The real question is whether ratings measure instructors’ ability to facilitate learning, not whether all students rate an instructor similarly. Does better teaching earn better ratings?

●  Controlled, randomized experiments are the gold standard for reliable inference about cause and effect. The only controlled randomized experiments on student teaching evaluations have found that student evaluations of teaching effectiveness are negatively associated with direct measures of effectiveness: Evaluations do not seem to measure teaching effectiveness. There are only two such experiments, so caution is in order, but they do suggest that better teaching causes students to give worse ratings, at least in some circumstances.

●  Student teaching evaluations may be influenced by factors that have nothing to do with effectiveness, such as the gender, ethnicity, and attractiveness of the instructor.  Students seem to make snap judgments about instructors that have nothing to do with teaching effectiveness, and to rely on those judgments when asked about teaching effectiveness.

●  The survey questions apparently most influenced by extraneous factors are exactly of the form we ask on campus: overall teaching effectiveness.

●  Treating student ratings of overall teaching effectiveness as if they measured teaching effectiveness is misleading:  Relabeling a package does not change its contents.

We think student teaching evaluations—especially student comments—contain information useful for assessing and improving teaching.  But they need to be used cautiously and appropriately as part of a comprehensive review.

It’s time for Berkeley to revisit the wisdom of asking students to rate the overall teaching effectiveness of instructors, of considering those ratings to be a measure of actual teaching effectiveness, of reporting the ratings numerically and computing and comparing averages, and of relying on those averages for high-stakes decisions such as merit cases and promotions.

In the third installment of this blog, we discuss a pilot conducted in the Department of Statistics in 2012–2013 to augment student teaching evaluations with other sources of information. The additional sources still do not measure effectiveness directly, but they complement student teaching evaluations and provide formative feedback and touchstones.  We believe that the combination paints a more complete picture of teaching and will promote better teaching in the long run.


[1] According to Beleche, Fairris & Marks (2012), “It is not clear that higher course grades necessarily reflect more learning. The positive association between grades and course evaluations may also reflect initial student ability and preferences, instructor grading leniency, or even a favorable meeting time, all of which may translate into higher grades and greater student satisfaction with the course, but not necessarily to greater learning” (p. 1).

[2] See, e.g.,

[3] See, e.g., Abrami et al., 2001; Braskamp and Ory, 1994; Centra, 2003; Ory, 2001; Wachtel, 1998; Marsh and Roche, 1997.

[4]  See, e.g., Braskamp and Ory, 1994; Centra, 1993; Marsh, 2007; Marsh and Dunkin, 1992; Overall and Marsh, 1980.

[5] A notable exception is John Snow’s research on the cause of cholera; his study amounts to a “natural experiment.”

[6] See, e.g.,

[7] Carrell and West, 2008.

[8] Braga, Paccagnella, and Pellizzari, 2011.

[9] Braskamp and Ory, 1994; Centra, 1993; Marsh, 2007; Marsh and Dunkin, 1992; Overall and Marsh, 1980.

[10] Marsh and Cooper, 1980; Short et al., 2012; Worthington, 2002.

[11] In a pilot of online course evaluations in the Department of Statistics in fall 2012, among the 1486 students who rated the instructor’s overall effectiveness and their enjoyment of the course on a 7-point scale, the correlation between instructor effectiveness and course enjoyment was 0.75, and the correlation between course effectiveness and course enjoyment was 0.8.

[12] Ambady and Rosenthal, 1993.

[13] Anderson and Miller, 1997; Basow, 1995; Cramer and Alexitch, 2000; Marsh and Dunkin, 1992; Wachtel, 1998; Weinberg et al., 2007; Worthington, 2002.

Co-authored with Richard Freishtat, Ph.D. and cross-posted from UC Berkeley’s Center for Teaching and Learning blog

Comments to “What exactly do student evaluations measure?”

  1. I teach chemistry at a community college and have found that student evaluations are directly linked to the grades students get in the course. For the same course, student A will say I am tough, I do not explain, and I am not patient, etc., and then student B will say I was the greatest and fairest teacher they had. If the material is hard, of course the students will have to work harder, which the majority of students at community college do not want to do these days.

  2. My question is how do we even begin to make students interested in participating in course/professor evaluations? This is another important topic that should be considered.

  3. A student’s GPA can be as much influenced by picking the easiest professors (which can be discovered easily these days) as by any kind of effort. Aren’t the grades that professors give students largely arbitrary too? Why yes, they are.

    But the outcomes of GPAs are just as real: employers ask for GPAs, and scholarships disappear if you don’t maintain them at a high level.

    If you want to get rid of student evals, get rid of grades too.

  4. But note that your reference [1] does find a positive correlation between student learning and positive evaluations by students:

    “While small in magnitude, we find a robust positive, and statistically significant, association between our measure of student learning and course evaluations.”

    Although N is small, the paper does give some evidence (via something like a natural experiment) that class sections that learn more also give better evaluations.

  5. While reading this article, several questions came to mind. First – define ‘effectiveness’. If the goal is to measure effectiveness, then this needs to be defined. I have known some that would argue that effectiveness can be shown by testing results over time – so ‘teaching to the test’ may illustrate effectiveness for this person. Many more that I know would argue that effectiveness is more about teaching students to think critically, rather than a particular subject matter. So many variables, so many perceptions.

    In relation to inconsistent results: As a student and an educator, knowing what I know through coursework and experience, I would surmise that one particular student may evaluate an instructor’s effectiveness differently on any given day, never mind a year later. There are so many factors that enter into the personal experience as a student that you cannot absolutely ensure continuity.

    I know that I myself may rate things slightly differently because of how I feel that day. If I am feeling ‘under the weather’, either mentally or physically, I may be less generous. If I am feeling ‘on top of the world’, I may see things in a more positive light. An overall feeling about the environment or other courses will affect the evaluation.

    I would venture to say that overall, the outliers will be evident, even considering such fluctuations; though the data in the middle may not be exact, it would provide an overall picture, but not an actual measurement. As long as the need exists to measure these results, then any system is better than nothing.

  6. While I certainly agree that student evaluations do not necessarily measure what we pretend they measure, this article is not the argument I would ever put forward to make my case.

    Insisting that “Controlled, randomized experiments are the gold standard” is a positivist and problematic argument. Qualitative research, and all that it can offer, is essentially discounted by this thinking.

    No thanks.

  7. A professor whose class I took in the eighties was so bad that on the day of evaluations only 1/3 of the students were in attendance. I doubt that fact showed up in the evaluation summary.

  8. What is the magnitude of the effect? Am I right that, from the USAF study, in thinking that it is small? Is it significant in terms of statistics, but insignificant in normal parlance, that is, less than 1%?

    “This result indicates that a one-standard-deviation change in professor quality results in a 0.05-standard-deviation change in student achievement. In terms of scores, this effect translates into about 0.6 percent of the final percentage of points earned in the course.”

  9. At last, an honest appraisal of the ‘teaching evaluation’ system! But it misses the point: ‘Effectiveness’ scores are used (despite the fact that their users do not know what they mean) by those who decide the career trajectories of professors, for the purpose of making their own jobs easy. Imagine having to actually delve into the details of an instructor’s course and conduct a real assessment!

  10. When I was in Berkeley in the early 1960s, a student group (SLATE) conducted teacher evaluations for use by students. One of the tougher graders who taught undergraduates, Jacobus tenBroek, got top evaluations.

    I remember that the SLATE summary of the evaluations one semester said that “Every student should have the tenBroek experience.” (I took three of his courses, so I know how hard it was to get an A from him; the best I managed was an A-).

    Unlike evaluations today, these were written for the benefit of other students, and may have been more honest in consequence.

