This post might as well be subtitled “A rant on the misuse of student evaluation of teachers”. I’ll just get that out of the way right up front.
One of the defining attributes of being a scientist is that we’re really good at quantifying things in a repeatable, meaningful way. Take journal impact factors as an example. We’re able to talk about them and pretty quickly agree that journal impact factor is a flawed and noisy but useful one-dimensional representation of a high-dimensional quantity, journal quality, but a rubbish measure of the quality of a single paper. Or say you’re on the committee of a student who tells you they want to measure competitive effects. You’re pretty likely to lead them through several conversations. Do they mean per capita effects on population growth rates? Per unit biomass impacts on biomass? Coexistence effects on other species? Conversely, is their measurement of competitive effect likely to be strongly impacted by overall species richness or by productivity of the system? And what are the error bars on their measurement methods likely to be? If they are looking at biomass impacts, are they measuring wet or dry biomass? How will this be standardized? How much variability can be removed by standardizing techniques?
Phew! We scientists sure make measuring something into a sophisticated exercise (and I hasten to note that is because experience has taught us it is important to do this). So how come both faculty and administrators are content to simply take student evaluations of teachers on a 1-5 Likert scale, and to take them so seriously?
This being the first work day after Labor Day, the vast majority of faculty members across North America are now teaching (barring a few faculty at schools with trimesters). And as sure as death and taxes, that means that their students will be given a Scantron bubble form to solemnly fill out to evaluate their teachers at the end of the semester. And then those forms will be run through a scanner and bam! The whole semester of teaching will have been boiled down to a 3.8 or 4.6 out of 5. And, since on nearly every campus teaching matters to at least some degree towards tenure and promotion, P&T committees, department chairs and deans will look at those scores and use them as the primary assessment of teaching quality. Now I hasten to note that nearly every campus has language in its tenure guidelines that teaching is to be assessed in a multifaceted way and appropriately contextualized. And no recommendation will be written as baldly as “the candidate’s average evaluation score is 4.1, which is above/below the faculty average, and we stopped looking there.” But I will bet you that those scores, and how they compare to averages, are the single most important consideration in building a view of that faculty member as a teacher on the vast majority of campuses.
Yet we all know (or would know if we spent a few minutes thinking about it) that this is way below our standards of measurement for our research. This is not a matter of opinion; it is a matter of piles of rigorous research. Here are three big reasons why scores on student evaluations of teachers (SET) are deeply flawed:
- SET scores are strongly correlated with attributes of the class that have nothing to do with what we want to measure. Want to build a good prior on what the SET score will be? Ask how many students are in the class, what fraction of the students are required to take that class to meet graduation requirements, and how quantitative that class is. I’ve never seen a study that quantifies exactly how predictive those three factors are with a number like an R2. But the short answer is pretty darn predictive. Study after study has shown that large, required classes with more quantitative material score lower on SET scores (reviewed in Wachtel 1998). Formal studies on whether course workload and expected grade influence SET scores are a little more ambiguous, but faculty themselves have little doubt and act accordingly, resulting in a very perverse incentive system (Stroebe 2016). So if you want to get good SET scores, just teach small, upper-level classes without math, and just to be safe make it an easy course. What’s that? You don’t have control over the first three factors and you don’t believe the last one is in the students’ best interest? Too bad.
- SET scores may capture student biases on gender, race, etc. – This is a complex topic. But the notion that asking students for qualitative assessments like “overall effectiveness” could open the door for implicit bias is certainly credible. Wachtel’s 1998 review (cited above) identifies many studies purporting to demonstrate a gender bias (usually against women), but on balance finds a mix of studies supporting and not supporting gender biases. One of the challenges in evaluating this issue is that such studies are rarely randomized or controlled. A recent re-analysis of two randomized controlled studies (Boring et al 2016) found pretty clear evidence of gender-biased evaluations both overall and on specific attributes of teaching (although some of the standardized effect sizes were not huge, e.g. r=0.09 between gender and SET score – the difference in means between male and female instructors can still be sizable, roughly 0.2-0.8 points on a 1-5 scale, but the variance in student scores of a given teacher is also large). Studies of biases against other minority groups (e.g. by race or against the differently-abled) have been noticeably lacking. There are studies showing biases against non-native speakers (something I regularly saw in the student evaluations when I was a TA for one of the more innovative teaching professors I have ever worked with, whose accent was not at all hard to understand).
- SET scores don’t correlate with actual learning – yes, you read that right. The best studies of this question use a multi-section design (different sections of the same course with the same test taught by different professors, so that actual test scores and SET scores can be correlated). Several meta-analyses have claimed to find weak but significant correlations between SET scores and actual student learning. However, a recent meta-analysis (Uttl et al 2017) showed, quite decisively in my opinion, that these results are a consequence of publication bias (the existence of publication bias is pretty decisively demonstrated using funnel plots). Using state-of-the-art methods to adjust for publication bias, they find no significant correlation between SET scores and student learning!
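To make the effect size in the gender-bias point above concrete, here is a back-of-the-envelope sketch of how a small point-biserial correlation still translates into a visible gap in mean scores. The r=0.09 is the overall figure from Boring et al 2016; the SET-score standard deviation of ~1.1 points and the 50/50 group split are my own illustrative assumptions, not numbers from that study.

```python
import math

def mean_diff_from_r(r, sd, p=0.5):
    """Difference in group means implied by a point-biserial correlation r.

    For two groups in proportions p and 1-p, the standardized mean
    difference is d = r / sqrt(p * (1 - p) * (1 - r**2)); multiplying
    by the score standard deviation converts it back to raw points.
    """
    d = r / math.sqrt(p * (1 - p) * (1 - r**2))
    return d * sd

# r=0.09 (Boring et al 2016); sd=1.1 points is an assumed value
print(round(mean_diff_from_r(0.09, 1.1), 2))  # roughly 0.2 points on a 1-5 scale
```

The point is just that a correlation that looks tiny on the r scale can still correspond to a gap in means on the order of a couple of tenths of a point – which matters when careers hinge on decimal-level comparisons against a faculty average.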
So there you have it. SET scores have no correlation with actual student learning, are correlated with factors beyond the individual instructor’s control (class size, electivity of the class, subject matter of the class), and might be biased by gender, race, etc. In hindsight this shouldn’t really shock us. By design, SET scores measure some mixture of how much a student likes a teacher and how much a student thinks they are learning. Neither of those factors is actually how much a student really learned. The first factor (liking the teacher) could be independent of, or negatively correlated with, actual learning. And as for the second factor, students are notoriously unperceptive about how much they learn (amazingly, all students think they have learned a lot and deserve an A). As I say, I’ve never seen a good quantification of this, but the effect sizes of #1 (course properties) can be as much as 1 point out of 1-5 and #2 could be as big as 0.3-0.5 points. Together those are larger than the variation between teachers by a good amount. In SET scores, variability related to non-teaching factors tends to swamp any signal of actual teaching ability.
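That swamping claim can be sketched as simple variance arithmetic. The ~1 point for course properties and ~0.4 points for student misperception are the ballpark magnitudes quoted above; the 0.5-point spread between teachers is purely my own illustrative assumption, since (as I say) I’ve never seen it well quantified.

```python
# Illustrative variance decomposition of SET scores (all SDs in points
# on a 1-5 scale; the between-teacher SD is an assumed value)
sd_teacher = 0.5        # assumed true between-teacher variation
sd_course = 1.0         # class size / requiredness / quantitativeness
sd_perception = 0.4     # students' misjudgment of their own learning

total_var = sd_teacher**2 + sd_course**2 + sd_perception**2
teacher_share = sd_teacher**2 / total_var
print(f"Share of SET variance attributable to the teacher: {teacher_share:.0%}")
```

Under these assumed numbers, less than a fifth of the variance in SET scores reflects the teacher; the rest is driven by things the teacher cannot control.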
But wait, some defenders will say. We know all of this and we control for it. We don’t just take raw SET scores. We only compare SET scores against appropriate comparison groups. For example, on many campuses classes are binned by college (broad subject area like science vs. humanities) and by class size. And then SET scores are only compared within the relevant bin. Well, that is certainly an improvement. But there is no way to control for all the factors (e.g. I’ve never been on a campus where there are enough classes to group by whether a class is required or not, and within a field like biology there is a great deal of variability in how quantitative classes are, which also goes uncontrolled). And I’ve never seen a campus that binned by gender of the teacher just in case SETs are biased there. More importantly, this is really just a giant distraction. It addresses my points #1 and #2 but misses the elephant in the room (point #3). Hello! SET scores don’t measure student learning!
Now this is getting to something useful! What should a dean actually be trying to assess about their faculty’s teaching abilities? And it goes without saying for me that this is a legitimate exercise. Teaching is an important part of the job and faculty are accountable for how well they do their job. But what is it we want to assess? (Small Pond also discusses this.) I can easily think of a half dozen or more things:
- How much does a student learn about the subject matter?
- How well is a student prepared for their next classes?
- How well is a student prepared for their career?
- How much did the course achieve broad educational objectives like teaching a student to think rigorously and critically or to create lifelong learners?
- How much is the student inspired and motivated about the subject?
- Were the students treated respectfully and with dignity?
- Did the teacher use known best pedagogical practices?
Arguably a student assessment of how they were treated is a valid metric for #6 and maybe #5 (and these are common questions on SETs). But SETs are not really credible metrics of anything else. Again, SETs, more or less by design, measure how much students like the teacher and think they’ve learned. They don’t in any credible way measure #1, #2, #3, #4, or #7. To demonstrate the point, did you know that just professing in class that you care about students’ experience in your class increases SETs? (sorry, can’t find the link at the moment). That’s a pretty darn shallow measure of teaching effectiveness. I cannot emphasize enough that except for a few subjective interactional experiences related to treating students well, SETs don’t measure any of the things we really care about related to learning and preparation. Students aren’t really prepared to self-assess how well they’re prepared for their career or whether they have become deep thinkers. And we might think students could self-assess how much they’ve learned, but studies have pretty rigorously shown they cannot. And rightly or wrongly, teachers widely perceive that they can improve their SET scores by softening up a course. SO HOW COME SET SCORES ARE THE CENTRAL TOOL FOR ASSESSING FACULTY TEACHING?
I cannot answer the rhetorical question I just posed. And I hope it changes. But now I want to turn in two more constructive directions. What are SETs good for? And how should we assess faculty teaching?
Well, I’ve pretty thoroughly trashed SETs for evaluating overall faculty teaching effectiveness. But they’re not useless. I for one would be upset if SETs disappeared from my classes. There are three things I think SETs are good for (and I use them for myself). First, SETs do have some validity for assessing the nature of the personal interaction with students. Questions like: Were students treated with respect? Did the instructor increase your enthusiasm about the subject (adjusting for how excited you were before you came in)? Did the teacher explain things clearly (recognizing that this is highly student dependent)? These are interpersonal interactions and students obviously have an important and valid perspective on them. But you cannot combine those into an overall assessment of teaching effectiveness (some of my best teachers have been pretty uncaring of my feelings). Second, SETs are valid for within-teacher, within-course comparisons. That means comparing how you taught BIO 3XX in 2017 vs 2015 is largely valid (unless the class is so small that student sample sizes are too small). One has to be a little cautious, as the average quality of students varies from year to year, so you’re really measuring a mixture of average student quality and your teaching. But I have found that SET scores largely match my own self-perceptions of interannual variability. It is useful to be told in a hard number when I’m getting complacent and need to shake things up, and when the new things I’ve tried worked or didn’t work. Finally, the 1-5 scores are not the only thing on student evaluations. I find the qualitative (i.e. free text response) portions of SETs extremely useful. You can’t take any one comment too seriously (in either direction), and you have to throw out the gratuitous comments made on fashion etc., but a blurry-focused reading across many comments is very helpful in understanding what is working, what isn’t, and what the students are missing.
And many of my most successful improvements to my teaching were suggested by students via these responses.
To continue the theme of being constructive, what should replace SETs as the lead method for evaluating faculty teaching performance? Well, here we have to start acting like the scientists we are and carefully think out what we want to measure (which of course depends on why we want to measure it) and then be rigorous about it. But here are some thoughts (also see Small Pond’s list of commonly used tools):
- Discuss what we want to measure – We cannot possibly be precise about measuring something unless we first have a conversation about what it is we really want to measure. “Teaching performance” is too vague to be measurable. College campuses and teaching would be much better off if departments had regular discussions about what their educational goals were (both in terms of curriculum and student growth). Or even if teachers got more reflective about their own goals (and no, the trend of requiring explicit goals in a syllabus hasn’t, as far as I can see, achieved this).
- Peer evaluations – This is probably the second most common tool after SETs, but it is not found on all campuses. Have experienced teachers in your department come in and make a qualitative, nuanced evaluation of your teaching. Have them write a page on what does and doesn’t work (no numerical scores). This can then be shared with both the teacher and anybody who needs to evaluate the teacher. It is not perfect, but it can quickly assess many dimensions of teaching and provide very constructive feedback to a teacher. And if you have to write a P&T letter, in my experience quoting a statement from a peer evaluation is much more impactful and illustrative than citing a numerical score from an SET.
- Expert evaluations – Another model is to bring in experts rather than peers to do teaching evaluations – say, somebody from your campus teaching center. This is also found on many campuses, although it is rarely used for performance evaluation. An obvious limitation is that the expert assessor will likely have limited knowledge of your subject area. And there is a risk they are a little over-focused on trendy pedagogical techniques rather than basic aspects like connecting with your class. But aside from their pedagogy expertise, such evaluations can often bring additional benefits, like recording a class so you can see for yourself how your teaching is and isn’t working. It’s a can of worms, but maybe these should play a bigger role in teaching evaluation (I’d vote for this over student evaluations).
- Go longitudinal – Learning is by definition a transformational process, a rate, a change over time. If you really want to know how you’re impacting a student, measure them over time. Assess students at the beginning of the class and at the end of the class. Assess students when they join your department and when they leave. And, hard to do but very valuable, assess them when they’re a couple of years into their career. Note that this call to go longitudinal can apply to any of the goals (e.g. subject matter learning or broad educational goals).
So now you know that I feel the use of SETs as a primary tool for evaluating faculty teaching is deeply misguided. I wouldn’t get rid of SETs, but I sure would add some stronger tools for assessing faculty teaching. What do you think? Do you like SETs? Do you think the results of the studies I reported are wrong? What goals do you think we should be assessing in evaluating faculty teaching? How should we assess them? What will it take to get rid of SETs as the primary tool?
PS Thanks to Jeremy and Meghan for encouraging me to write this post, and to Meghan for providing many helpful links, although of course the opinions expressed here are my own.
PPS I should probably note that SETs have generally been kind to me. I almost always have had solidly above average (albeit not superstar) SETs once course size is controlled for. So this is not a sour grapes post.