This post might as well be subtitled “A rant on the misuse of student evaluation of teachers”. I’ll just get that out of the way right up front.
One of the defining attributes of being a scientist is that we’re really good at quantifying things in a repeatable, meaningful way. Take journal impact factors as an example. We’re able to talk about them and pretty quickly agree that journal impact factor is a flawed and noisy but useful one-dimensional representation of a high-dimensional quantity, journal quality, but a rubbish measure of the quality of a single paper. Or say you’re on the committee of a student who tells you they want to measure competitive effects. You’re pretty likely to lead them through several conversations. Do they mean per capita effects on population growth rates? Per unit biomass impacts on biomass? Coexistence effects on other species? Conversely, is their measurement of competitive effect likely to be strongly impacted by overall species richness or by productivity of the system? And what are the error bars on their measurement methods likely to be? If they are looking at biomass impacts, are they measuring wet or dry biomass? How will this be standardized? How much variability can be removed by standardizing techniques?
Phew! We scientists sure make measuring something into a sophisticated exercise (and I hasten to note that is because experience has taught us it is important to do this). So how come both faculty and administrators are content to simply take student evaluations of teachers on a 1-5 Likert scale, and to take them so seriously?
This being the first work day after Labor Day, the vast majority of faculty members across North America are now teaching (barring a few faculty at schools with trimesters). And as sure as death and taxes, that means that their students will be given a Scantron bubble form to solemnly fill out to evaluate their teachers at the end of the semester. And then those forms will be run through a scanner and bam! The whole semester of teaching will have been boiled down to a 3.8 or 4.6 out of 5. And, since on nearly every campus teaching matters to at least some degree towards tenure and promotion, P&T committees, department chairs and deans will look at those scores and use them as the primary assessment of teaching quality. Now I hasten to note that nearly every campus has language in its tenure guidelines that teaching is to be assessed in a multifaceted way and appropriately contextualized. And no recommendation will be written as baldly as “the candidate’s average evaluation score is 4.1, which is above/below the faculty average, and we stopped looking there.” But I will bet you that those scores, and how they compare to averages, are the single most important consideration in building a view of that faculty member as a teacher on the vast majority of campuses.
Yet we all know (or would know if we spent a few minutes thinking about it) that this is way below our standards of measurement for our research. This is not a matter of opinion; it is a matter of piles of rigorous research. Here are three big reasons why scores on student evaluations of teachers (SET) are deeply flawed:
- SET scores are strongly correlated with attributes of the class that have nothing to do with what we want to measure. Want to build a good prior on what the SET score will be? Ask how many students are in the class, what fraction of the students are required to take that class to meet graduation requirements, and how quantitative that class is. I’ve never seen a study that quantifies exactly how predictive those three factors are with a number like an R2. But the short answer is pretty darn predictive. Study after study has shown that large, required classes with more quantitative material score lower on SET scores (reviewed in Wachtel 1998). Formal studies on whether course workload and expected grade influence SET scores are a little more ambiguous, but faculty themselves have little doubt and act accordingly, resulting in a very perverse incentive system (Stroebe 2016). So if you want to get good SET scores, just teach small, upper-level classes without math, and just to be safe make it an easy course. What’s that? You don’t have control over the first three factors and you don’t believe the last one is in the students’ best interest? Too bad.
- SET scores may capture student biases on gender, race, etc. – This is a complex topic. But the notion that asking students for qualitative assessments like “overall effectiveness” could open the door for implicit bias is certainly credible. Wachtel’s 1998 review (cited above) identifies many studies purporting to demonstrate a gender bias (usually against women), but on balance finds a mix of studies supporting and not supporting gender biases. One of the challenges in evaluating this issue is that such studies are rarely randomized or controlled. A recent re-analysis of two randomized controlled studies (Boring et al 2016) found pretty clear evidence of gender-biased evaluations both overall and on specific attributes of teaching (although some of the standardized effect sizes were not huge, e.g. r=0.09 between gender and SET score – the difference in means between male and female instructors can still be sizable, roughly 0.2-0.8 points on a 1-5 scale, but the variance in student scores of a given teacher is also large). Studies of biases against other minority groups (e.g. by race or against the differently-abled) have been noticeably lacking. There are studies showing biases against non-native speakers (something I regularly saw in the student evaluations when I was a TA for one of the more innovative teaching professors I have ever worked with, whose accent was not at all hard to understand).
- SET scores don’t correlate with actual learning – yes, you read that right. The best studies of this question use a multi-section design (different sections of the same course with the same test taught by different professors, so that actual test scores and SET scores can be correlated). Several meta-analyses have claimed to find weak but significant correlations between SET scores and actual student learning. However, a recent meta-analysis (Uttl et al 2017) showed, quite decisively in my opinion, that these results are a consequence of publication bias (the existence of publication bias is pretty decisively demonstrated using funnel plots). Using state-of-the-art methods to adjust for publication bias, they find no significant correlation between SET scores and student learning!
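To make the effect size in the gender-bias point above concrete, here is a back-of-the-envelope sketch of how a small point-biserial correlation still translates into a visible gap in mean scores. The r=0.09 is the overall figure from Boring et al 2016; the SET-score standard deviation of ~1.1 points and the 50/50 group split are my own illustrative assumptions, not numbers from that study.

```python
import math

def mean_diff_from_r(r, sd, p=0.5):
    """Difference in group means implied by a point-biserial correlation r.

    For two groups in proportions p and 1-p, the standardized mean
    difference is d = r / sqrt(p * (1 - p) * (1 - r**2)); multiplying
    by the score standard deviation converts it back to raw points.
    """
    d = r / math.sqrt(p * (1 - p) * (1 - r**2))
    return d * sd

# r=0.09 (Boring et al 2016); sd=1.1 points is an assumed value
print(round(mean_diff_from_r(0.09, 1.1), 2))  # roughly 0.2 points on a 1-5 scale
```

The point is just that a correlation that looks tiny on the r scale can still correspond to a gap in means on the order of a couple of tenths of a point – which matters when careers hinge on decimal-level comparisons against a faculty average.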
So there you have it. SET scores have no correlation with actual student learning, are correlated with factors beyond the individual instructor’s control (class size, electivity of the class, subject matter of the class), and might be biased by gender, race, etc. In hindsight this shouldn’t really shock us. By design, SET scores measure some mixture of how much a student likes a teacher and how much a student thinks they are learning. Neither of those factors is actually how much a student really learned. The first factor (liking the teacher) could be independent of, or negatively correlated with, actual learning. And as for the second factor, students are notoriously unperceptive about how much they learn (amazingly, all students think they have learned a lot and deserve an A). As I say, I’ve never seen a good quantification of this, but the effect sizes of #1 (course properties) can be as much as 1 point out of 1-5 and #2 could be as big as 0.3-0.5 points. Together those are larger than the variation between teachers by a good amount. In SET scores, variability related to non-teaching factors tends to swamp any signal of actual teaching ability.
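That swamping claim can be sketched as simple variance arithmetic. The ~1 point for course properties and ~0.4 points for student misperception are the ballpark magnitudes quoted above; the 0.5-point spread between teachers is purely my own illustrative assumption, since (as I say) I’ve never seen it well quantified.

```python
# Illustrative variance decomposition of SET scores (all SDs in points
# on a 1-5 scale; the between-teacher SD is an assumed value)
sd_teacher = 0.5        # assumed true between-teacher variation
sd_course = 1.0         # class size / requiredness / quantitativeness
sd_perception = 0.4     # students' misjudgment of their own learning

total_var = sd_teacher**2 + sd_course**2 + sd_perception**2
teacher_share = sd_teacher**2 / total_var
print(f"Share of SET variance attributable to the teacher: {teacher_share:.0%}")
```

Under these assumed numbers, less than a fifth of the variance in SET scores reflects the teacher; the rest is driven by things the teacher cannot control.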
But wait, some defenders will say. We know all of this and we control for it. We don’t just take raw SET scores. We only compare SET scores against appropriate comparison groups. For example, on many campuses classes are binned by college (broad subject area like science vs. humanities) and by class size. And then SET scores are only compared within the relevant bin. Well, that is certainly an improvement. But there is no way to control for all the factors (e.g. I’ve never been on a campus where there are enough classes to group by whether a class is required or not, and within a field like biology there is a great deal of variability in how quantitative classes are, which also goes uncontrolled). And I’ve never seen a campus that binned by gender of the teacher just in case SETs are biased there. More importantly, this is really just a giant distraction. It addresses my points #1 and #2 but misses the elephant in the room (point #3). Hello! SET scores don’t measure student learning!
Now this is getting to something useful! What should a dean actually be trying to assess about their faculty’s teaching abilities? And it goes without saying for me that this is a legitimate exercise. Teaching is an important part of the job and faculty are accountable for how well they do their job. But what is it we want to assess? (Small Pond also discusses this.) I can easily think of a half dozen or more things:
- How much does a student learn about the subject matter?
- How well is a student prepared for their next classes?
- How well is a student prepared for their career?
- How much did the course achieve broad educational objectives like teaching a student to think rigorously and critically or to create lifelong learners?
- How much is the student inspired and motivated about the subject?
- Were the students treated respectfully and with dignity?
- Did the teacher use known best pedagogical practices?
Arguably a student assessment of how they were treated is a valid metric for #6 and maybe #5 (and these are common questions on SETs). But SETs are not really credible metrics of anything else. Again, SETs, more or less by design, measure how much students like the teacher and think they’ve learned. They don’t in any credible way measure #1, #2, #3, #4, or #7. To demonstrate the point, did you know that just professing in class that you care about students’ experience in your class increases SETs? (sorry, can’t find the link at the moment). That’s a pretty darn shallow measure of teaching effectiveness. I cannot emphasize enough that except for a few subjective interactional experiences related to treating students well, SETs don’t measure any of the things we really care about related to learning and preparation. Students aren’t really prepared to self-assess how well they’re prepared for their career or whether they have become deep thinkers. And we might think students could self-assess how much they’ve learned, but studies have pretty rigorously shown they cannot. And rightly or wrongly, teachers widely perceive that they can improve their SET scores by softening up a course. SO HOW COME SET SCORES ARE THE CENTRAL TOOL FOR ASSESSING FACULTY TEACHING?
I cannot answer the rhetorical question I just posed. And I hope it changes. But now I want to turn in two more constructive directions. What are SETs good for? And how should we assess faculty teaching?
Well, I’ve pretty thoroughly trashed SETs for evaluating overall faculty teaching effectiveness. But they’re not useless. I for one would be upset if SETs disappeared from my classes. There are three things I think SETs are good for (and I use them for myself). First, SETs do have some validity for assessing the nature of the personal interaction with students. Questions like: Were students treated with respect? Did the instructor increase your enthusiasm about the subject (adjusting for how excited you were before you came in)? Did the teacher explain things clearly (recognizing that this is highly student dependent)? These are interpersonal interactions and students obviously have an important and valid perspective on them. But you cannot combine those into an overall assessment of teaching effectiveness (some of my best teachers have been pretty uncaring of my feelings). Second, SETs are valid for within-teacher, within-course comparisons. That means comparing how you taught BIO 3XX in 2017 vs 2015 is largely valid (unless the class is so small that student sample sizes are too small). One has to be a little cautious, as the average quality of students varies from year to year, so you’re really measuring a mixture of average student quality and your teaching. But I have found that SET scores largely match my own self-perceptions of interannual variability. It is useful to be told in a hard number when I’m getting complacent and need to shake things up, and when the new things I’ve tried worked or didn’t work. Finally, the 1-5 scores are not the only thing on student evaluations. I find the qualitative (i.e. free text response) portions of SETs extremely useful. You can’t take any one comment too seriously (in either direction), and you have to throw out the gratuitous comments made on fashion etc., but a blurry-focused reading across many comments is very helpful in understanding what is working, what isn’t, and what the students are missing.
And many of my most successful improvements to my teaching were suggested by students via these responses.
To continue the theme of being constructive, what should replace SETs as the lead method for evaluating faculty teaching performance? Well, here we have to start acting like the scientists we are and carefully think out what we want to measure (which of course depends on why we want to measure it) and then be rigorous about it. But here are some thoughts (also see Small Pond’s list of commonly used tools):
- Discuss what we want to measure – We cannot possibly be precise about measuring something unless we first have a conversation about what it is we really want to measure. “Teaching performance” is too vague to be measurable. College campuses and teaching would be much better off if departments had regular discussions about what their educational goals were (both in terms of curriculum and student growth). Or even if teachers got more reflective about their own goals (and no, the trend of requiring explicit goals in a syllabus hasn’t, as far as I can see, achieved this).
- Peer evaluations – This is probably the second most common tool after SETs, but it is not found on all campuses. Have experienced teachers in your department come in and make a qualitative, nuanced evaluation of your teaching. Have them write a page on what does and doesn’t work (no numerical scores). This can then be shared with both the teacher and anybody who needs to evaluate the teacher. It is not perfect, but it can quickly assess many dimensions of teaching and provide very constructive feedback to a teacher. And if you have to write a P&T letter, in my experience quoting a statement from a peer evaluation is much more impactful and illustrative than citing a numerical score from an SET.
- Expert evaluations – Another model is to bring in experts rather than peers to do teaching evaluations – say, somebody from your campus teaching center. This is also found on many campuses, although it is rarely used for performance evaluation. An obvious limitation is that the expert assessor will likely have limited knowledge of your subject area. And there is a risk they are a little over-focused on trendy pedagogical techniques rather than basic aspects like connecting with your class. But aside from their pedagogy expertise, such evaluations can often bring additional benefits, like recording a class so you can see for yourself how your teaching is and isn’t working. It’s a can of worms, but maybe these should play a bigger role in teaching evaluation (I’d vote for this over student evaluations).
- Go longitudinal – Learning is by definition a transformational process, a rate, a change over time. If you really want to know how you’re impacting a student, measure them over time. Assess students at the beginning of the class and at the end of the class. Assess students when they join your department and when they leave. And, hard to do but very valuable, assess them when they’re a couple of years into their career. Note that this call to go longitudinal can apply to any of the goals (e.g. subject matter learning or broad educational goals).
So now you know that I feel the use of SETs as a primary tool for evaluating faculty teaching is deeply misguided. I wouldn’t get rid of SETs, but I sure would add some stronger tools for assessing faculty teaching. What do you think? Do you like SETs? Do you think the results of the studies I reported are wrong? What goals do you think we should be assessing in evaluating faculty teaching? How should we assess them? What will it take to get rid of SETs as the primary tool?
PS Thanks to Jeremy and Meghan for encouraging me to write this post, and to Meghan for providing many helpful links, although of course the opinions expressed here are my own.
PPS I should probably note that SETs have generally been kind to me. I almost always have had solidly above average (albeit not superstar) SETs once course size is controlled for. So this is not a sour grapes post.