
Do student evaluations measure teaching effectiveness?

Philip Stark, professor of statistics | October 14, 2013

Since 1975, course evaluations at Berkeley have included the following question: Considering both the limitations and possibilities of the subject matter and course, how would you rate the overall teaching effectiveness of this instructor?

1 (not at all effective), 2, 3, 4 (moderately effective), 5, 6, 7 (extremely effective)

Among faculty, student evaluations of teaching are a source of pride and satisfaction—and frustration and anxiety. High-stakes decisions including merit reviews, tenure, and promotions are based in part on these evaluations.  Yet, it is widely believed that evaluations reflect little more than a popularity contest; that it’s easy to “game” the ratings; that good teachers get bad ratings; that bad teachers get good ratings; and that fear of bad ratings stifles pedagogical innovation and encourages faculty to water down course content.

What do we really know about student evaluations of teaching effectiveness?

Quantitative student ratings of teaching are the most common method to evaluate teaching.[1] De facto, they define “effective teaching” for many purposes, including faculty promotions. They are popular partly because the measurement is easy: Students fill out forms. It takes about 10 minutes of class time and even less faculty time. The major labor for the institution is to transcribe the data; online evaluations automate that step. Averages of student ratings have an air of objectivity by virtue of being numerical.  And comparing the average rating of any instructor to the average for her department as a whole is simple.

While we are not privy to the deliberations of the Academic Senate Budget Committee (BIR), the idea of comparing an instructor’s average score to averages for other instructors or other courses pervades our institution’s culture. For instance, a sample letter offered by the College of Letters and Science for department chairs to request a “targeted decoupling” of faculty salary includes:

Smith has a strong record of classroom teaching and mentorship.  Recent student evaluations are good, and Smith’s average scores for teaching effectiveness and course worth are (around) ____________ on a seven-point scale, which compares well with the relevant departmental averages.  Narrative responses by students, such as “________________,” are also consistent with Smith’s being a strong classroom instructor.

This places heavy weight on student teaching evaluation scores and encourages comparing an instructor’s average score to the average for her department.

What does such a comparison show?

In this three-part series, we report statistical considerations and experimental evidence that lead us to conclude that comparing average scores on “omnibus” questions, such as the mandatory question quoted above, should be avoided entirely. Moreover, we argue that student evaluations of teaching should be only a piece of a much richer assessment of teaching, rather than the focal point. We will ask:

●      How good are the statistics? Teaching evaluation data are typically spotty and the techniques used to summarize evaluations and compare instructors or courses are generally statistically inappropriate.

●      What do the data measure? While students are in a good position to evaluate some aspects of teaching, there is compelling empirical evidence that student evaluations are only tenuously connected to overall teaching effectiveness.[2] Responses to general questions, such as overall effectiveness, are particularly influenced by factors unrelated to learning outcomes, such as the gender, ethnicity, and attractiveness of the instructor.

●      What’s better? Other ways of evaluating teaching can be combined with student teaching evaluations to produce a more reliable, meaningful, and useful composite; such methods were used in a pilot in the Department of Statistics in spring 2013 and are now department policy.

At the risk of losing our audience right away, we start with a quick nontechnical look at statistical issues that arise in collecting, summarizing, and comparing student evaluations. Please read on!

Administering student teaching evaluations

Until recently, paper teaching evaluations were distributed to Berkeley students in class. The instructor left the room while students filled out the forms. A designated student collected the completed forms and delivered them to the department office. Department staff calculated average effectiveness scores, among other things. Ad hoc committees and department chairs also might excerpt written comments from the forms.

Online teaching evaluations may become (at departments’ option) the primary survey method at Berkeley this year. This raises additional concerns. For instance, the availability of data in electronic form invites comparisons across courses, instructors, and departments; such comparisons are often inappropriate, as we discuss below. There also might be systematic differences between paper-based and online evaluations, which could make it difficult to compare ratings across the “discontinuity.”[3]

Who responds?

Some students are absent when in-class evaluations are administered.  Students who are present may not fill out the survey; similarly, some students will not fill out online evaluations.[4] The response rate will be less than 100%. The further the response rate is from 100%, the less we can infer about the class as a whole.

For example, suppose that only half the class responds, and that all those “responders” rate the instructor’s effectiveness as 7.  The mean rating for the entire class might be 7, if the nonresponders would also have rated it 7. Or it might be as low as 4, if the nonresponders would have rated the effectiveness 1. While this example is unrealistically extreme, in general there is no reason to think that the nonresponders are like the responders. Indeed, there is good reason to think they are not like the responders: They were not present or they did not fill out the survey. These might be precisely the students who find the instructor unhelpful.
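The arithmetic behind that example can be sketched in a few lines of Python. This is a minimal illustration rather than any campus procedure: the function name `class_mean_bounds` and its parameters are hypothetical, and the only assumption is that every nonresponder would have given some rating between 1 and 7.

```python
def class_mean_bounds(responder_mean, response_rate, lowest=1, highest=7):
    """Range of possible whole-class mean ratings when only some students respond.

    Assumes each nonresponder would have given a rating between `lowest` and
    `highest`; nothing else is assumed about them.
    """
    if not 0 < response_rate <= 1:
        raise ValueError("response_rate must be in (0, 1]")
    missing = 1 - response_rate
    low = response_rate * responder_mean + missing * lowest
    high = response_rate * responder_mean + missing * highest
    return low, high


# The example from the text: half the class responds, and every responder gives a 7.
print(class_mean_bounds(responder_mean=7, response_rate=0.5))  # (4.0, 7.0)

# Hypothetical: responders average 6 and two-thirds of the class responds.
low, high = class_mean_bounds(responder_mean=6, response_rate=2 / 3)
print(f"{low:.2f} to {high:.2f}")
```

On a seven-point scale the width of this range is 6 times the nonresponse rate, so it shrinks to zero only as the response rate approaches 100%.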

There may be biases in the other direction, too.  It is human nature to complain more loudly than one praises: People tend to be motivated to action more by anger than by satisfaction. Have you ever seen a public demonstration where people screamed “we’re content!”?[5]

The lower the response rate, the less representative of the overall class the responders might be.  Treating the responders as if they are representative of the entire class is a statistical blunder.

The 1987 Policy for the Evaluation of Teaching (for advancement and promotion) requires faculty to provide an explanation if the response rate is below ⅔. This seems to presume that it is the instructor’s fault if the response rate is low, and that a low response rate is in itself a sign of bad teaching.[6]  The truth is that if the response rate is low, the data should not be considered representative of the class as a whole.  An explanation of the low response rate—which generally is not in the instructor’s control—solves nothing.

Averages of small samples are more susceptible to “the luck of the draw” than averages of larger samples.  This can make teaching evaluations in small classes more extreme than evaluations in larger classes, even if the response rate is 100%.  Moreover, in small classes students might imagine their anonymity to be more tenuous, which might reduce their willingness to respond truthfully or to respond at all.
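A small simulation can illustrate the “luck of the draw” effect. It is only a sketch: the rating distribution below is invented for illustration (it is not Berkeley data), every student is assumed to respond, and the function name `spread_of_class_averages` is hypothetical.

```python
import random
import statistics

# Hypothetical share of students who would give each rating 1..7
# (invented for illustration; not taken from any real course).
RATINGS = [1, 2, 3, 4, 5, 6, 7]
WEIGHTS = [0.05, 0.05, 0.10, 0.20, 0.25, 0.20, 0.15]


def spread_of_class_averages(class_size, trials=2000, seed=0):
    """Standard deviation of the class-average rating across simulated classes."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    averages = [
        statistics.mean(rng.choices(RATINGS, weights=WEIGHTS, k=class_size))
        for _ in range(trials)
    ]
    return statistics.pstdev(averages)


small = spread_of_class_averages(class_size=10)
large = spread_of_class_averages(class_size=200)
print(f"spread of class averages, 10 students:  {small:.2f}")
print(f"spread of class averages, 200 students: {large:.2f}")
```

Every simulated class draws from the same underlying distribution, so any difference in spread between the two class sizes comes from sample size alone.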


As noted above, Berkeley’s merit review process invites reporting and comparing averages of scores, for instance, comparing an instructor’s average scores to the departmental average.  Averaging student evaluation scores makes little sense, as a matter of statistics.  It presumes that the difference between 3 and 4 means the same thing as the difference between 6 and 7.  It presumes that the difference between 3 and 4 means the same thing to different students. It presumes that 5 means the same things to different students in different courses. It presumes that a 4 “balances” a 6 to make two 5s. For teaching evaluations, there’s no reason any of those things should be true.[7]

Effectiveness ratings are what statisticians call an “ordinal categorical” variable: The ratings fall in categories with a natural order (7 is better than 6 is better than … is better than 1), but the numbers 1, 2, …, 7 are really labels of categories, not quantities of anything.  We could replace the numbers with descriptive words and no information would be lost: The ratings might as well be “not at all effective”, “slightly effective,” “somewhat effective,” “moderately effective,” “rather effective,” “very effective,” and “extremely effective.”
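One way to see that the numbers act as labels rather than quantities is to relabel the categories in a way that keeps their order, then check which summaries survive. The sketch below is illustrative only. The two sets of ratings and the alternative labels are invented, and the median appears merely as an example of a summary that depends only on order, not as a recommendation made in this article.

```python
from statistics import mean, median

# Hypothetical ratings for two instructors on the 1-7 scale (invented for illustration).
instructor_a = [4, 4, 4]
instructor_b = [2, 4, 7]

# An alternative set of numeric labels that keeps the categories in the same order.
# Nothing about the survey says 1..7 is more "correct" than this choice.
relabel = {1: 1, 2: 2, 3: 3, 4: 10, 5: 11, 6: 12, 7: 13}


def summarize(ratings, labels=None):
    """Return (mean, median) of the ratings, optionally after relabeling."""
    values = [labels[r] for r in ratings] if labels else list(ratings)
    return mean(values), median(values)


mean_a, median_a = summarize(instructor_a)
mean_b, median_b = summarize(instructor_b)
mean_a2, median_a2 = summarize(instructor_a, relabel)
mean_b2, median_b2 = summarize(instructor_b, relabel)

print("original labels: A has the higher mean?", mean_a > mean_b)    # False
print("relabeled:       A has the higher mean?", mean_a2 > mean_b2)  # True
print("medians equal under both labelings?",
      median_a == median_b and median_a2 == median_b2)               # True
```

The comparison of means flips under an order-preserving change of labels, while the comparison of medians does not. That is the practical sense in which the ratings are categories, not quantities.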

Does it make sense to take the average of “slightly effective” and “very effective” ratings given by two students? If so, is the result the same as two “moderately effective” scores?  Relying on average evaluation scores does just that: It equates the effectiveness of an instructor who receives two ratings of 4 and the effectiveness of an instructor who receives a 2 and a 6, since both instructors have an average rating of 4. Are they really equivalent?

They are not, as this joke shows: Three statisticians go hunting. They spot a deer. The first statistician shoots; the shot passes a yard to the left of the deer.  The second shoots; the shot passes a yard to the right of the deer.  The third one yells, “we got it!”

Even though the average location of the two misses is a hit, the deer is quite unscathed: Two things can be equal on average, yet otherwise utterly dissimilar. Averages alone are not adequate summaries of evaluation scores.
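The two-instructor example above can be checked directly. This is a minimal sketch using only the ratings mentioned in the text; the variable names are hypothetical.

```python
from collections import Counter
from statistics import mean

# The two instructors from the text: two ratings of 4 versus a 2 and a 6.
ratings_a = [4, 4]
ratings_b = [2, 6]

# Both averages equal 4.
print(mean(ratings_a), mean(ratings_b))

# The full tallies differ even though the averages match.
print(sorted(Counter(ratings_a).items()))  # [(4, 2)]
print(sorted(Counter(ratings_b).items()))  # [(2, 1), (6, 1)]
```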

Scatter matters

Comparing an individual instructor’s (average) performance with an overall average for a course or a department is less informative than campus guidelines appear to assume. For instance, suppose that the departmental average for a particular course is 4.5, and the average for a particular instructor in a particular semester is 4.2.  The instructor is “below average.” How bad is that? Is the difference meaningful?

There is no way to tell from the averages alone, even if response rates were perfect. Comparing averages in this way ignores instructor-to-instructor and semester-to-semester variability.  If all other instructors get an average of exactly 4.5 when they teach the course, 4.2 would be atypically low.  On the other hand, if other instructors get 6s half the time and 3s the other half of the time, 4.2 is almost exactly in the middle of the distribution. The variability of scores across instructors and semesters matters, just as the variability of scores within a class matters. Even if evaluation scores could be taken at face value, the mere fact that one instructor’s average rating is above or below the mean for the department says very little. Averages paint a very incomplete picture.  It would be far better to report the distribution of scores for instructors and for courses: the percentage of ratings that fall in each category (1–7) and a bar chart of those percentages.
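The two scenarios above, and the kind of distribution report suggested at the end of the paragraph, can be sketched as follows. The histories of instructor averages follow the text; the individual class ratings are invented for illustration, and `rating_distribution` is a hypothetical helper, not an existing campus tool.

```python
from collections import Counter
from statistics import mean, pstdev

# Two hypothetical histories of instructor averages for the same course,
# following the two scenarios in the text. Both have mean 4.5.
always_4_5 = [4.5, 4.5, 4.5, 4.5, 4.5, 4.5]
sixes_and_threes = [6, 3, 6, 3, 6, 3]

instructor_average = 4.2

for name, history in [("always 4.5", always_4_5), ("6s and 3s", sixes_and_threes)]:
    share_lower = sum(h < instructor_average for h in history) / len(history)
    print(f"{name}: mean {mean(history)}, spread {pstdev(history)}, "
          f"share of averages below 4.2: {share_lower:.0%}")


def rating_distribution(ratings, scale=range(1, 8)):
    """Percentage of ratings in each category of the scale."""
    counts = Counter(ratings)
    return {category: 100 * counts[category] / len(ratings) for category in scale}


# Hypothetical individual ratings for one class (invented for illustration).
class_ratings = [7, 6, 6, 5, 5, 5, 4, 4, 2, 1]
distribution = rating_distribution(class_ratings)

# A plain-text bar chart: one '#' per 5 percentage points.
for category, percent in distribution.items():
    print(f"{category}: {'#' * round(percent / 5):<20} {percent:.0f}%")
```

In the first history no instructor average falls below 4.2, while in the second half of them do, even though both histories have the same mean of 4.5.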

All the children are above average

At least half the faculty in any department will have teaching evaluation averages at or below the median for that department. Someone in the department will be worst. Of course, it is possible for an entire department to be “above average” compared to all Berkeley faculty, by some measure. Rumor has it that department chairs sometimes argue in merit cases that a faculty member with below-average teaching evaluations is an excellent teacher, just perhaps not as good as the other teachers in the department, all of whom are superlative. This could be true in some departments, but it cannot be true in every department. With apologies to Garrison Keillor, while we have no doubt that all Berkeley faculty are above average compared to faculty elsewhere, as a matter of arithmetic they cannot all be above average among Berkeley faculty.

Comparing incommensurables

Different courses fill different roles in students’ education and degree paths, and the nature of the interaction between students and faculty differs across types of courses. These variations are large and may be confounded with teaching evaluation scores.[8] Similarly, lower-division students and new transfer students have less experience with Berkeley courses than seniors have. Students’ motivations for taking courses vary, in some cases systematically by the type of course. It is not clear how to make fair comparisons of student teaching evaluations across seminars, studios, labs, large lower-division courses, gateway courses, required upper-division courses, etc., although such comparisons seem to be common[9] and are invited by the administration, as evidenced by the excerpt above.

Student Comments

What about qualitative responses, rather than numerical ratings?  Students are well situated to comment about their experience of the course factors that influence teaching effectiveness, such as the instructor’s audibility, legibility, and availability outside class.[10]

However, the depth and quality of students’ comments vary widely by discipline. Students in science, technology, engineering, and mathematics tend to write much less, and much less enthusiastically, than students in arts and humanities. That makes it hard to use student comments to compare teaching effectiveness across disciplines—a comparison the Senate Budget Committee and the Academic Personnel Office make. Below are comments on two courses, one in Physical Sciences and one in Humanities. By the standards of the disciplines, all four comments are “glowing.”

Physical Sciences Course:

“Lectures are well organized and clear”

“Very clear, organized and easy to work with”

Humanities Course:

“There is great evaluation of understanding in this course and allows for critical analysis of the works and comparisons. The professor prepares the students well in an upbeat manner and engages the course content on a personal level, thereby captivating the class as if attending the theater. I’ve never had such pleasure taking a class. It has been truly incredible!”

“Before this course I had only read 2 plays because they were required in High School. My only expectation was to become more familiar with the works. I did not expect to enjoy the selected texts as much as I did, once they were explained and analyzed in class. It was fascinating to see texts that the author’s were influenced by; I had no idea that such a web of influence in Literature existed. I wish I could be more ‘helpful’ in this evaluation, but I cannot. I would not change a single thing about this course. I looked forward to coming to class everyday. I looked forward to doing the reading for this class. I only wish that it was a year long course so that I could be around the material, GSI’s and professor for another semester.”

While some student comments are extremely informative—and we strongly advocate that faculty read all student comments—it is not obvious how to compare comments across disciplines to gauge teaching effectiveness accurately and fairly.[11]

In summary:

●      Response rates matter, but not in the way campus policy suggests. Low response rates need not signal bad teaching, but they do make it impossible to generalize results reliably to the whole class. Class size matters, too: All else equal, expect more semester-to-semester variability in smaller classes.

●      Taking averages of student ratings does not make much sense statistically.  Rating scales are ordinal categorical, not quantitative, and they may well be incommensurable across students. Moreover, distributions matter more than averages.

●      Comparing instructor averages to department averages is, by itself, uninformative. Again, the distribution of scores—for individual instructors and for departments—is crucial to making meaningful comparisons, even if the data are taken at face value.

●      Comparisons across course types (seminar/lecture/lab/studio), levels (lower division / upper division / MA / PhD), functions (gateway/major/elective), sizes (e.g., 7/35/150/300/800), or disciplines are problematic. Online teaching evaluations invite potentially inappropriate comparisons.

●      Student comments provide valuable data about the students’ experiences. Whether they are a good measure of teaching effectiveness is another matter.

In the next installment, we consider what student teaching evaluations can measure reliably. While students can observe and report accurately some aspects of teaching, randomized, controlled studies consistently show that end-of-term student evaluations of teaching effectiveness can be misleading.


[1] See Cashin (1999), Clayson (2009), Davis (2009), Seldin (1999).

[2] Defining and measuring teaching effectiveness are knotty problems in themselves; we discuss this in the second installment of this blog.

[3] There were plans to conduct randomized, controlled experiments to estimate systematic differences during the pilot of online teaching evaluations in 2012-2013; the plans didn’t work out. One of us (PBS) was involved in designing the experiments.

[4] There are many proposals to provide social and administrative incentives to students to encourage them to fill out online evaluations, for instance, allowing them to view their grades sooner if they have filled out the evaluations. The proposals, some of which have been tried elsewhere, have pros and cons.

[5] See, e.g.,

[6] Consider these scenarios:
(1) The instructor has invested an enormous amount of effort in providing the material in several forms, including online materials, online self-test exercises, and webcast lectures; the course is at 8am. We might expect attendance and response rates to in-class evaluations to be low.
(2) The instructor is not following any text and has not provided any notes or supplementary materials. Attending lecture is the only way to know what is covered. We might expect attendance and response rates to in-class evaluations to be high.
(3) The instructor is exceptionally entertaining, gives “hints” in lecture about what to expect on exams; the course is at 11am. We might expect attendance and response rates to in-class evaluations to be high.
The point: Response rates in themselves say little about teaching effectiveness.

[7] See, e.g., McCullough & Radson (2011).

[8] See Cranton & Smith (1986), Feldman (1978, 1984).

[9] See, e.g., McKeachie (1997).

[10] They might also be able to judge clarity of exposition, but clarity may be confounded with the intrinsic difficulty of the material.

[11] See Cashin (1990), Cashin & Clegg (1987), Cranton & Smith (1986), Feldman (1978).

Co-authored with senior consultant Richard Freishtat, Ph.D., and cross-posted from UC Berkeley’s Center for Teaching and Learning blog

Comments to “Do student evaluations measure teaching effectiveness?”

  1. Most of the time for the past six years I have been getting decent, or average, evaluations in my Introductory Sociology course. The exception has been the Fall 2014 semester. Since I am adjunct faculty, these negative evaluations resulted in the cancellation of my teaching contract for Spring 2015.

    Even though I have yet to see the evaluations, I suspect the negative evaluations are related to a midterm multiple-choice exam on which the students performed below expectations. Rather than the memorization of facts, the correct answers required a level of analytic skill that many lower-division students lack. In previous semesters the midterm exam consisted exclusively of essay questions, which allowed me latitude in grading by reading ‘between the lines’ of their ideas and inferring the intention of their argument when their comments hinted at it, but the multiple-choice questions turned out to be a very inflexible evaluation instrument that left me little latitude in grading.

    Although I told the students that their final grade in the course was cumulative, so the low scores they got on the midterm could be improved by their performance on the quizzes and class assignments stipulated by the syllabus, their low performance on the midterm exam sowed a deep distrust in the minds of a significant number of my former students. When I say significant, I mean it in the statistical sense. The assessment of a limited number of students was sufficient to push the aggregate scores for the evaluation questions below the standard scores required by the Office of the Provost.

    Had some of these students been absent on the day of the evaluation, the aggregate scores would probably have been different, but I required attendance at my lectures. Had I given them an easier midterm exam, my evaluations would most likely have been much more positive. On the other hand, the evaluation did not reflect the fact that at the end of the semester several committed and dedicated students thanked me for the class and shook hands with me. I have reasons to believe that they were not engaging in a public relations exercise.

    Since I curved the scores of all the students, at the end of the semester the grade distribution in my three sections was similar to previous semesters. I kept my promise to them, but what I gather from this unpleasant experience is that there are students, and probably a large population of them, who think that they learn only when they get good grades. By the same reasoning, getting lower grades means to them that they are not learning, and they will, quite sincerely, shape their responses in an evaluation according to their frustration. These are students who come to college to get good grades … and to learn.

  2. Is there any research on the timing of “end-of-semester” evaluations? In our school, administration begins collecting end-of-semester evaluations five weeks prior to the end of a 14-week semester.

    In other words, in the fall semester, in some classes, students complete “end of semester” evaluations in early November, and students in other classes complete “end of semester” evaluations at the actual end of the semester in early December. Administrators do this primarily for their own convenience.

    Any feedback on research findings regarding this issue (pro or con) is appreciated!

  3. What if student evaluations were like modern grading rubrics? As a prof, I surely won’t do a good job of evaluating my students’ work if every evaluation got some subjective score of 1-7. By today’s teaching standards, I have to make my rubric pretty concrete and transparent. So, if I give a 7, there’s a fairly objective way to justify that number (e.g., for a physics calculation question, the calculation showed the work and resulted in the correct answer within a precision of 0.1 units). Most rubrics require explaining each of the ordinal values, and they are provided PRIOR to the evaluation, so that students (and instructors) would know the rules of what matters in an evaluation.

    Let’s take instructor availability as a dimension to be evaluated, and let’s assume all students *have* to meet with the instructor during the semester (which is usually not the case). A more accurate evaluation of the instructor’s availability would be concrete: 1=instructor didn’t show up for one or more meetings, 7=instructor showed up for all meetings. An instructor not showing up for a meeting is a pretty flagrant error (but could happen with bad instructors), and it requires some more thought about the scale of 1-7 for certain dimensions. But this kind of evaluation also has the advantage of showing instructors where there is room for improving their evaluations.

    Other dimensions that would be easy to apply this to: Quality of instructor answers to my questions; Quality of instructor-provided examples; Pace of instruction (perhaps a scale of + and – to indicate too fast or too slow); etc.

    Today’s information-technology environment almost allows students to evaluate these things in real time. E.g., after a professor replies to an email (more than half of the questions I get are this way), the students could be asked if they’re satisfied with the response. Similar polls could be done after examples are presented in class, after they get feedback from homework, exams, etc.

    The other reality is that at most universities students are treated as customers. Their happiness is what is mostly being measured by traditional evaluations. I think we assume that part of their happiness is how well they learned in the course (and it’s implied that if they’re happy, then the instructor did a good job teaching). There is some truth that students should be happy in a course, so that the environment is conducive to learning. However, that should be only one dimension of an evaluation.

  4. I found this a useful evaluation of student evaluations. But your argument about how averages are statistically inappropriate in these evaluations is kind of off-target, isn’t it? The example you used about the hunters is especially tangential: I think it should be obvious to most already that averaging isn’t *always* applicable, and your example highlights possibly the worst example of use of averages, with no analogy to the averaging of evaluation scores.

    At least student evaluation scores aggregate in a convex sort of way, for the most part. For instance, if someone gets scores of 4 and 6 from different students, you could easily argue that their aggregate score should probably be somewhere between 4 and 6. (You could plausibly come up with some exceptions centered around unfair treatment of students, but I think these detect problems that are maybe out of the scope of numeric student evaluation scores.)

    Also, students are aware that the scores they give are going to be averaged, so that helps determine their interpretation of the scale. For lack of a better aggregate measure, it seems to me that taking the average is a good-enough approximation (though adding in the variability wouldn’t hurt)… assuming that the scores mean anything at all, it could be reasonable to assume that their relation to the value of the outcomes is close-enough-to-linear.

  5. Teaching effectiveness can best be measured through careful observation and intense concentration on the expert/teacher.

  6. Thank you for this well thought out article. I taught research design, methods, and basic statistics to graduate students who didn’t think of themselves as “numbers persons”; however they were required to take this course for their degree.

    I think that student evaluations should include a question about whether the course is required or not. In other words, take into account whether the students are participating willingly and with interest or because they have to. This has an effect on their ratings and their qualitative comments.

  7. Interesting article. Very well researched.

    While I agree with some of the arguments against online evaluations, I think they would be more likely to capture the bigger picture. Students who attend a class will be different from those who choose not to. The students who attend presumably enjoy the class and are more likely to rate the teacher higher, whereas the students who have decided the class is not worth their time would not attend and may rate the teacher lower. A paper survey would miss this important data.

    Teacher evaluations should certainly not be the “be all end all” as a performance metric. With an online evaluation, the data could be easily integrated within a database to examine other metrics.

    I’m hosting a webinar on online education evaluations if you or your readers are interested in learning more about them:

  8. There seems to be some confusion in the comments about “teacher effectiveness”. The author’s main point is that not all questions measure this well, since some of them are known to covary with factors that presumably do not relate to teacher effectiveness, such as race and gender. Some commenters appeared to interpret this as the author saying we should not have evaluations. I do not think this is what he is saying.

    As far as I can tell, the point of confusion is what exactly “teacher effectiveness” means. The author explicitly states that evaluations are quite good at revealing what the student experience is like. For instance, my own teaching evaluations indicate that my students like that I am energetic and cheerful, but they dislike that I do not prepare handouts, or when I do they are too dense. As for the student experience — what it feels like to be them, what they do like and what they would prefer to what they got, this is utterly clear. However, I agree with the author in the fundamental point that this type of data has a non-transparent relationship with my effectiveness as a teacher.

    Intuitively, teacher effectiveness should be measured based on what the students have actually learned from the course. Effective teachers are ones that, all other things being equal, lead a greater number of students to learn more. The question we should be asking is, how do we measure that??

    In my own case, do you think that my students will remember what I taught them better because I was bright and cheerful during class, which may have uplifted their mood? Do you think that they will remember less because they had to take notes during class (since I didn’t give them a handout) or they had to make their own notes after class (because I did give them a handout but it was like a textbook)? I am quite sure that I got better teacher ratings because of my energy, and worse ratings because of the (lack of) handouts. But as for what they actually retained, I would guess — and this is wild speculation — that my mood had no effect whatsoever, whereas making them take their own notes during/after class has a modest enhancing effect on recall.

    The measurement question is even more complicated when we ask, “What *should* they be learning?” In my discipline, I would hazard that fewer than 5% of undergrads go on to employment in a field where they actually use any of the technical material we teach. Thus, as far as I am concerned, the value-add that my teaching supplies is not so much in the “what” as in the “know-how”. In my field, students analyze logical patterns, formulate hypotheses, and adjudicate between hypotheses based on their inherent reasonableness as well as their match to the data. This type of analytic practice is one of the real things that we teach them that is of value.

    Another real thing that I emphasize is collaboration. In my class, students go out into the field and collect their own data. Then in groups, they have to analyze it and write it up. The intention is to foster their collaborative skills, including learning how to work with difficult people, learning how to contribute positively to a project with their own skillset and allocate tasks to others based on theirs, learning how to write, etc. Do you think that a student evaluation will reflect any of that? I am guessing, but I would guess “partly”. From my own experience being mentored as a teenager, it takes many years before you appreciate the full value of the teaching you have received. I knew which teachers I liked then, and I feel like I know which teachers I learned a lot from now, and there is substantial overlap between them. But can I honestly say that at that time I was able to evaluate how much I would end up learning? No…

    I agree with the author 100%. Course evaluations are a wonderful tool for reflecting the student experience. They are not a good tool at all for evaluating teacher effectiveness.

  9. I am a staff member on campus but teach at other schools. I really did not like this blog statement so I am exercising my right of free speech to respond.

    Whether we like it or not, we must give students a chance to provide feedback. They will feel slighted and left out if we don’t. A good professor is going to receive strong reviews no matter what. Sure, it varies on the kind of course, but a professor’s ratings also depend on how well that person treats people, how quickly she or he responds, whether or not they actually attend their office hours, and whether or not they provide a meaningful classroom experience where students not only learn but also get to apply what they have learned.

    I’ve received thousands of surveys over the years based on my teaching performance. They are always consistently strong. There are sometimes outliers that make me uncomfortable, but they also give me pause and reflection as to why a certain evaluative comment was made. There is always room for improvement and education itself is rooted in tenets of continuous improvement. At least that is what our accreditors think!

    We need evaluations. They measure how well we are getting the material to the students, how well we are providing them with feedback, and ultimately, how well we are serving them. I’m proud I graduated from a prestigious university whose faculty were devoted to teaching. I’d never read something like this there, thank goodness.

    • Dear Staff Member–

      I apologize for being less than clear. I think student feedback is incredibly valuable. I read every student evaluation I receive and look for ways to improve my teaching and to improve students’ experience in the class. Students are uniquely qualified and situated to assess some aspects of teaching and to report their own experiences in a course.

      However, I don’t think students are in a good position to judge teaching effectiveness. Nor do I think trying to quantify teaching effectiveness on a 7-point scale makes any sense. And even if it did, it does not make sense to report and compare averages: The overall distribution of the ratings is much more informative. I don’t think comparisons of student ratings across types of courses or across disciplines make sense. And I think response rates are crucial to the possibility of generalizing reliably from the sample to a class as a whole.

      Fiat lux,

  10. Breath of fresh air, Professor Stark. There is such a great need for statistical literacy in our public school system accountability. The mere and obvious fact among many brought up by you and lost to numerical accountability lobbyists is that half the members of any set of data will always fall below the median line. Provisions should be placed in our accountability models to prevent the lower half association with failure if defined performance expectations and standards have successfully been met by evaluatees regardless of where they are positioned relative to others. Thanks for the enlightening and intelligent read!

  11. I hear you professor Stark! It is lousy to have outcomes that really, really matter tied to a faulty metric. That definitely should change. What are the proposed alterations?

    When I was a student at Berkeley, my major would make the results of the student evaluations of professors available to incoming cohorts. I agree with your arguments above, but those ratings proved to be pretty dead-on accurate for the sample I was dealing with. I learned the hard way first and then used those numbers to design an incredible learning experience for my last two years.

  12. As alluded to by Dragan, most students are not in a position to evaluate a teacher’s effectiveness. Their feedback is important for a variety of reasons, but (as I suspect will be discussed in the next installment) students do not know how well they will remember the material in the future.

    When I was in undergrad, a particular professor of mine did not stand out as an exceptionally good professor. He told amusing stories in his classes, but his lectures seemed a little scattered and his tests had bizarre questions on them. However, during grad school, it was information from his classes that I recalled most easily and was most useful in helping me pass my qualifying exam.

    To assume that my review of the professor during my undergraduate years at all represented his effectiveness as a teacher would be a fallacy. Students are going to rate a professor based off of what they think is a measure of teaching effectiveness, but their opinion on what constitutes teaching effectiveness may not correlate at all to actual effectiveness.

    I view the usefulness of teaching evaluations as something akin to evaluations of the bedside manner of a doctor. We want doctors with good bedside manner, but it’s silly to assume that all doctors with good bedside manner are going to make an accurate diagnosis and give the patient the appropriate treatment. And it’s downright foolish to use bedside manner evaluations as the only assessment of a doctor’s effectiveness (which is what we currently do for teaching evaluations).

  13. I think you bring up excellent points, but just some thoughts on the matter:

    Regarding participation rate, the current system indeed creates a nonresponse bias. It makes intuitive sense that the students who still attend lecture by the end of the semester naturally find it worth their time and thus are likely to rate the instructor higher. Our friends from that other institution across the Bay have a solution which, I think, is better: instead of giving course evaluations at the end of the last lecture, they run evaluations online. The incentive for completing them? You get to see your final grade the moment they are released. Unfortunately, Berkeley is one of those places where students care quite a bit about grades, so this is a low-cost way to get more students to participate. The flaw, of course, is that students may just randomly click through buttons to complete the survey as quickly as possible, so that idea needs to be adjusted.

    Although they are not completely accurate measures, student evaluations still have merit. Although I agree that a strict comparison isn’t very meaningful (the difference between a professor with a score of, say, 6.3 and a professor with a score of 6.2 probably is not very meaningful), there is probably a difference between an average of 6 and an average of, say, 4. I believe they also encourage faculty to invest time and passion in their teaching (not that they wouldn’t without these ratings). For the last two years I’ve been at Berkeley, I have had excellent teachers. Given that Berkeley often only offers one lecture per class per semester, it’s not like I was able to use the ratings (in EECS, the ratings are publicized for students to access) to pick a teacher, but I still had teachers who believed in the importance of quality teaching. This semester is the first time I have had a professor who gives off the impression that he isn’t interested in teaching… and this semester, I am studying abroad.

    And at the end of the day, the people who understand teaching effectiveness the most are the people who use it to learn — the students. Although the form in which the data is being collected could be improved, I believe that student evaluations of some sort should continue to be used in some fashion.

  14. Since evaluations come at a cost, it is not likely we’ll get a better substitute for the raw data of the student evaluations soon. The key is how better to use the data the evaluations provide. I’ve always thought a better indicator of outstanding teaching is the percentage of students reporting that teaching is a 6 or 7 on a 7-point scale. A good indicator that teaching needs improvement is the pct. of students reporting that teaching is 1, 2 or 3 on the 7-point scale. Both are far better than using the mean score. The median score is better than the mean score.

    If evaluations are on-line now for all courses, it also becomes feasible to compare the professor’s performance with colleagues. But all the other points remain true – you need a high enough pct of a class to be meaningful; you need to compare performance in teaching similar kinds of classes; you need to assess the class’ design for rigor; you need to look at the qualitative comments on whether the instructor is enthusiastic about the subject and incites enthusiasm in the student about the subject, whether classes are seen as clearly organized and the material clearly presented, whether the class induces creative effort and mastery of skills on the part of the student, and so on.
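The summaries suggested in this comment (the share of students rating 6 or 7, the share rating 1, 2 or 3, and the median alongside the mean) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `summarize_ratings`, the `enrolled` parameter, and the sample ratings are all hypothetical and do not come from any actual Berkeley evaluation data or system.

```python
from statistics import mean, median


def summarize_ratings(ratings, enrolled):
    """Summarize 1-7 ratings for one class (hypothetical helper).

    ratings:  list of integer scores that were actually submitted
    enrolled: number of students enrolled, assumed to be known
    """
    if not ratings:
        raise ValueError("no ratings submitted")
    n = len(ratings)
    return {
        # Fraction of the class that responded at all
        "response_rate": n / enrolled,
        # Share of respondents choosing 6 or 7
        "share_6_or_7": sum(r >= 6 for r in ratings) / n,
        # Share of respondents choosing 1, 2 or 3
        "share_1_to_3": sum(r <= 3 for r in ratings) / n,
        "median": median(ratings),
        "mean": mean(ratings),
    }


# Made-up ratings for a class of 20 in which 8 students responded
ratings = [7, 7, 6, 6, 6, 5, 2, 1]
summary = summarize_ratings(ratings, enrolled=20)
for name, value in summary.items():
    print(f"{name}: {value}")
```

In this made-up class the mean is 5 while the median is 6, because two very low ratings pull the average down; the two shares and the response rate show how the ratings are spread and how few students they represent.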

  15. Hi,

    Not all of the points made about student evaluations are correct. Although I do agree that the comparisons to department averages should not be done, well-constructed, behaviorally-anchored questions can be used to develop factors that can distinguish between less effective and more effective instructors. Moreover, the best predictors of the global effectiveness rating are teaching characteristics such as clarity, enthusiasm, organization, and they contribute far more than non-teaching factors such as attractiveness.

  16. I found a lot I enjoyed about this post. My favorite would have to be the juxtaposition of science vs humanities students’ responses–almost comic in adhering to the stereotypes.

    When I was an undergraduate (in mathematics), all my teacher evaluations were left blank except for the following comment: “I have no qualifications by which to judge this teacher’s effectiveness. For what it’s worth, I [enjoyed/did not enjoy] the course.” Only once, when I had a truly awful teacher, did I complete the entire form…

    However, something troubles me here. If most students say the teacher’s bad, surely that says something? Which is why I’m uncomfortable with this: “Student comments provide valuable data about the students’ experiences. Whether they are a good measure of teaching effectiveness is another matter.” This seems dangerously close to “most students don’t know much/most students don’t know what’s good for them/most students are just plain dumb” category. I don’t see how you can get away from students’ opinion without saying something like, “Well, what the students like and dislike is all good and fine, but they don’t really know what’s good for them.”

    All that said, I look forward to future installments!

  17. This well-thought-out post points out clear statistical issues which make evaluations imperfect. Indeed, similar issues — imprecision, sample selection — affect almost any case of measurement.

    *However*, the post provides no argument that teacher evaluations are likely to be so biased so as to be useless, or worse, that they may point in the wrong direction; i.e., a teacher with an evaluation of 5 is better than a teacher with an evaluation of 6.

    As a professor myself (of economics), I think that it would be a mistake to eliminate teacher evaluations. They are a useful second-best. Yes, it would be great to have perfect evaluations, but short of those, what we have is not a bad system.

    Finally, let me point out the relevant result of a senior thesis which I supervised, by Eileen Tipoe. Eileen compared the teacher evaluations collected in Berkeley to evaluations online for the same professor from websites such as ratemyprofessor. While neither measure is perfect, she found a strong correlation between the two very different measures. To me, that suggests that the measures Berkeley uses capture something useful.

    In any case, thank you for a useful contribution.

  18. As a past member of the Budget Committee, I can testify to the three basic points: One, student evaluation scores are problematic measures of teaching effectiveness. Just about every concern raised here came up sometime or another in BIR discussions.

    Two, we use the scores anyway — not blindly, I think, but with some discrimination — because we definitely take teaching seriously in rewarding faculty and, importantly, there is little else available to measure teaching.

    Three, we hope for better ways of assessing teaching, incorporating both students’ reports and other evidence. I look forward to the following posts.

  19. Thanks for a thoughtful post on teacher evaluations.

    I wish you would do another series on the use of value-added models (VAM) for evaluating teachers in public schools. VAM is being adopted across the nation, and results show it is unreliable. There are many critiques of VAM by educators. I would like to see an independent evaluation of VAM by a statistician. My main issue from a statistical perspective is that there are systematic differences among teachers in crucial variables such as student socio-economic status that are supposedly controlled statistically. How is it possible to statistically correct for a confounded design?

    Given the significant effect of VAM on our teacher corps and our educational system, please consider a series on VAM.

  20. Regarding the comment above, student evaluations of faculty are not equivalent to faculty evaluations of students. A key difference is anonymity. Students can evaluate teachers however they want with impunity. Faculty have to be able to justify their evaluations based on pedagogical criteria identified in the syllabus. Students have no training in evaluating teaching effectiveness. Faculty have a curriculum on which to base their evaluations of students — did he/she demonstrate learning of course content.

  21. If student evaluations of the instructor should not carry so much weight, why should instructor evaluations of students be so different? We as students have our entire academic lives depend on our instructors’ interpretations of our exams, and very little on anything else. Granted that a professor is extremely knowledgeable in their field, and so they are the most qualified to assign a grade to the student, but one cannot have one’s cake and eat it too. Surely it is the student who is most qualified to determine if the instructor taught them well.

    It is conceded that there are some students who may simply give a professor a bad review because the class was hard, but I would like to think that would be mitigated at such a university as Berkeley, where students can presumably tell whether the class is hard because of the material or hard because the professor needs to improve their teaching abilities. For example, in my quantum mechanics class, I sometimes feel overwhelmed by the scope and difficulty of the lectures, but even if I were faced with an F for this class, I would never give this professor a bad evaluation. This is because I can tell that the difficulty comes from the material, and that I have seen firsthand that the professor is masterful at explaining these difficult concepts. Never has there been a question I’ve asked that he did not have an immediate, satisfactory answer to. I would like to think that the large majority of Berkeley students have this ability to see where the difficulties come from, and in fact I have seen that they do firsthand, as we students often discuss the quality of our professors amongst ourselves.

    In summary, while I agree that student evaluations should not be the only factor in determining a professor’s teaching ability, I find it hard to sympathize as I do not see it being significantly worse than most other standard forms of performance evaluation.

  22. You captured well the complexities and politics of teaching evaluations. Thank you for taking on its mythology.
