Evaluating Teaching

I spend a lot of time – arguably too much time – thinking about how teaching is evaluated, how I’d like to be evaluated, and what causes variation in teaching evaluations. This post covers several of those topics, and relates to my post from last week on pros and cons of flipping the classroom.

In comments on my post last week (including some on Twitter), some people pointed out (correctly) that student frustration leading to lower teaching evaluation scores is primarily a problem if student teaching evaluations are weighted reasonably heavily. Heavily weighting student evaluations is a problem, given that such evaluations are known to be biased. Given that evaluations are affected by class size, grading reputation of the instructor, response rate, and instructor gender (among other things), ideally faculty teaching would be evaluated in ways beyond student evaluations. Fortunately for me, both Georgia Tech and Michigan had peer evaluations of teaching. These evaluations came at multiple points pre-tenure and aimed to give constructive feedback on teaching, but also to evaluate teaching in a way where good instructors wouldn’t be penalized for being challenging, female, etc.

How much weight do student evaluations of teaching hold? My impression is that it’s variable. I don’t know of anywhere where they are completely ignored, but it seems like some places don’t put a lot of weight in them; on the other hand, at other schools they carry a lot of weight. Part of why I wrote the post last week was because of email conversations (spurred by this earlier blog post) with some pre-tenure faculty at other universities who are receiving very strong pressure to increase their teaching evaluations prior to coming up for tenure. And Terry McGlynn has said that “student evaluations are the main method used to evaluate [his] teaching.” Clearly there are institutions where teaching evaluations carry a lot of weight, despite all their flaws.

So, a quick poll:

Regardless of how heavily they are weighted, there are two questions related to student evaluations of teaching that I’ve been wondering about. First, what is the most important question on those evaluations at different institutions? And, second, what would be my ideal (free-response) feedback from a student?

At Georgia Tech, the most important question was “The instructor was an effective teacher.” At Michigan, it is “Overall, the instructor was an excellent teacher.” Based on conversations with someone at Michigan’s excellent Center for Research on Learning and Teaching, even a seemingly small change like that in the wording of the question can influence how students rate an instructor. Given the typical 1-5 Likert scale (strongly disagree to strongly agree), some students who think an instructor was very good will put “strongly disagree” when asked if an instructor is an “excellent” teacher. That was surprising to me, since I assumed most people mentally transform the scale to something like: The instructor was a [1=very bad, 2=bad, 3=okay, 4=good, 5=excellent] teacher. This got me wondering how universities settle on the wording of these questions. (I’m sure there’s a committee!) I’d love to know more about how the wording of the question influences responses, but haven’t had time to research it. What is the main question at your college or university?

To move to the second question, related to what my ideal evaluation would be: I think my ideal student evaluation would be something like “Dr. Duffy’s class was challenging but fair. It taught me how to think critically and made me realize how interesting ecology is. I have a much better understanding of how science works now and feel prepared to move on to upper level courses.” I don’t expect to actually get that evaluation (for starters, I’d be kind of shocked if a student used the phrase “challenging but fair” in an evaluation), but thinking about it helps me think about what I think is important when teaching: teaching students to think critically, improving their skills related to the process of science (for example, reading figures, evaluating experimental design), and making them enthusiastic about ecology as a topic.

Coming back to the topic of how to evaluate teaching in ways that do not center on student teaching evaluations: I would love to see how students who have taken a flipped version of Intro Bio do compared to students who took a traditional format class when they get to Ecology, Evolution, and Genetics. Do they do better in upper-level courses than students who took a traditional lecture format version of Intro Bio? Worse? It would also be great to look at retention rates – are students who take courses in one format or the other more likely to move on to those upper level courses? Those measures are what I really want, but I don’t have them yet.

In the meantime, I wonder how much I should pay attention to student evaluations. I mostly try to focus on what pedagogical literature says is effective, but I would be lying if I said that I don’t think about student evaluations. I think it would be good if I thought about them less, but I haven’t figured out how to do that yet. But I have a post in the queue for next week that is mostly in jest, but that might have some truth in terms of how to do that. (Preview: it would involve bombarding myself with negative teaching evaluations daily.)

Finally, in preparing this post, I noticed that the Slate piece I linked to earlier said that, in that author’s experience, most people don’t read their teaching evaluations. I’ve heard people say that, but I think most people I know do read them. So, I’m curious, do you read yours?


18 thoughts on “Evaluating Teaching”

  1. Some notes from my own experiences at Calgary; your mileage may vary of course:

    -the collective agreement with the faculty union specifies that student teaching evaluations can’t be the only means by which faculty teaching is evaluated. That’s adhered to in practice.
    -At Calgary, tenure and promotion evaluations have always been very sensible, in every respect not just in terms of how teaching is evaluated. For instance, my head of department knows that student evaluation scores in the massive intro biostats course I teach tend to run low, so he knows that’s not a sign I’m teaching badly. And there’s not any one question on the teaching evaluations that’s seen to be of overriding importance. I think our evaluation process is sensible in part because we have sensible procedures (as an aside, I doubt our formal procedures are unusual in this respect). But I think it’s mostly because we’ve always had sensible people doing the evaluating.
    -I’ve had students say “challenging but fair”, or “the course was tough but I learned a lot” on evaluations, and I’m not a great teacher. So I wouldn’t be surprised if you had some students say that on your evals. Although the best evaluation I ever got was “Your mom must be proud of you.” 🙂
    -I used to look at my student teaching evaluations very closely (both the numerical scores, and the written comments). Nowadays I just skim them, looking for evidence of widespread problems with a course. For instance, if a lot of students say the course was disorganized, that’s a problem. I skim for a few reasons. I found that whenever just a few students didn’t like the whole course, or some aspect of it, they were always contradicted by at least as many who liked the course or that aspect of it. The vast majority of the written comments are too brief to give me any guidance on how to change my teaching. And students aren’t always good at diagnosing the problems in a course, suggesting improvements, recognizing effective teaching when they receive it, etc. (nor are they always terrible at it, of course).
    -Having said that, I tend to look closely at all of my evaluations for small upper level courses. The feedback tends to be more useful. I skim the feedback from big entry-level courses.
    -We’re in the process of radically revising the questions we ask students to provide written feedback on. It used to be a series of questions, now it’s just going to be 2-3 quite open-ended questions. I confess I’m skeptical of whether asking really open-ended questions will be an improvement. But the folks leading the revision are smart colleagues whom I really respect. So I plan to take a closer look at my first evaluations using the new forms once they’re finalized.

    • I once had a student say, as the entire evaluation, “God bless Dr. Duffy!” You never can be sure if they mean that sort of thing tongue-in-cheek, but I decided to take it at face value.

      • Yeah, I choose to take such comments at face value too. And honestly, I do think they’re mostly meant to be taken at face value. In my experience, few students have both the inclination and ability to pull off undetectably-subtle sarcasm on comment forms. 🙂

  2. One thing to get our heads around, I suggest, is the fact that the way teaching is evaluated has scant connection to the effectiveness of teaching. The evaluations size up the perception of effective teaching. Which if done by students or faculty in the department may or may not be associated with actual effectiveness. To up your teaching eval game, focusing on teaching better might not help, unless it changes your teaching in a way that allows others to see and think that it’s effective.

    • I agree Terry. I taught half a subject last semester. I marked all the exams. I got the better student evaluation, but I think the students provided better essay responses to the topics the other lecturer took, so I think she was more effective than me, even though evaluation didn’t reflect this. Possible explanations are those cited in the article – response rate, gender bias, etc.

  3. At my college, the two questions people pay the most attention to are the following:

    “Overall, I rate this instructor an excellent teacher.”
    “Overall, I rate this course as excellent.”

    These are the questions from the IDEA instrument which is what our college uses.

    Strangely, we don’t pay particular attention to “Progress on relevant learning objectives,” which are designated by the faculty member. In addition, the “progress on relevant learning objectives” is the students’ self-assessment of their progress, which may or may not correlate well with their actual progress.

    I use pre-test/post-tests to gauge learning on “facty” types of information gain.

    I have also moved to standards-based/specifications grading this semester. This seems like a more sensible way of evaluating students. I am grading my first literature summaries in an upper level animal behavior course and I can tell you that I am enjoying the grading task more than I ever have. It has also forced me to rethink what I am teaching and how I am teaching it (an unforeseen benefit of making that move). And I am evaluating whether students know and can do the things specified as learning goals for the semester.

    We also have a departmental assessment that I have found really useful in getting feedback that informs my teaching. I have pasted it below (full disclosure: I adapted this from an instrument designed by the faculty in our religion department):

    Your assessment will address both self-assessment and a course assessment. Begin by reviewing the stated goals of the course, which you will find in the syllabus. Review work that you have submitted this semester and respond to the following prompt for each of the stated course goals: “I have or have not made substantial progress in achieving this stated course goal and here are the reasons why.” This should be limited to two, typewritten double spaced pages with 1” margins and 12-point font with your name, the semester and the course in the header.
    For each of the stated course goals address the stated prompt by addressing each of the following sub-prompts:
    • Here is where I started. Offer a brief statement about what you knew about or “where you were” with respect to this stated goal when the semester began.
    • Here is where I am now. Here is where you will make a case for having made substantial progress. Perhaps for some, depending on where you were, you may not have made much progress. Others will have made substantial progress. You are to make your case by appeal to specific assignments, exams, projects, etc. You are to be candid where you have fallen short and, insofar as you are responsible for such, offer an assessment as to why.
    • How have I developed in the process of becoming a scientist during this course? To what degree has this course helped (or not) me develop the habits of practice and thinking that are indicative of a trained scientist? How have I contributed to or thwarted this process of becoming a scientist?
    • Here are some ways that course design and assignments contributed to my achieving this stated course goal. What worked well about the course, the way it was constructed, learning tools, homework assignments, exams, projects, class discussion and activities, etc., that contributed effectively to your learning?
    • Here are some ways that course design and assignments might assist future students to achieve the stated course goal more effectively. I think this is self-explanatory.

    This is filled out and submitted to an assessment email address and is only released to the faculty member after grades have been submitted as the responses are identifiable to a particular student. But these prompts generate lots of good information about how to teach better.

    I am not a fan of instruments like IDEA (they are not particularly informative) and am even less of a fan of how P&T committees interpret the “quantitative results” of these surveys, but I have written about that elsewhere.

    I do pay attention to the comments as well as the distributions of the responses, but often these are not very helpful. As dept. chair, I also read everyone else’s evaluations in my department. I also prepare a report to the Provost of our college on those results for the year. This report actually allows me to provide context to the results, as the responses of the students change from the first year to the second year and then again from the third year to the senior year. I think this is due to the natural cognitive development going on in our students across the curriculum.

    I also think we are sacrificing some quality of data by assessing every course every semester. Students are bombarded with course evaluations and I wonder sometimes how much time students spend on them. One semester, I had a student sit outside the room and time how long it took the students in my class to answer a 57-question course evaluation form (this included filling out the name of instructor, course number, etc.). It took less than 5 minutes on average and the distribution was skewed right.

    And finally, I think I think about course evaluations too much in a meta sense.

    • So many great thoughts here! A few questions/replies:
      1. By “standards-based/specifications grading”, do you mean ability to carry out certain tasks (such as a literature review or interpreting a figure they’ve never seen), as compared to assessing learning of facts? Some of my exam questions test content/factual knowledge, but some assess skills. Others do both. For example, I will show them a figure they’ve never seen before, and ask them to interpret it in light of the ecological interactions they’ve learned about. I think some students find it very frustrating (which presumably lowers evaluations), because they can’t memorize their way to doing well on this sort of assessment, but I find it so much more useful. But, fortunately, assessments here include evaluations of exams as part of evaluating teaching effectiveness.

      2. “I also think we are sacrificing some quality of data by assessing every course every semester. Students are bombarded with course evaluations and I wonder sometimes how much time students spend on them. One semester, I had a student sit outside the room and time how long it took the students in my class to answer a 57-question course evaluation form (this included filling out the name of instructor, course number, etc.). It took less than 5 minutes on average and the distribution was skewed right.”
      This is an excellent point that I hadn’t thought of. I wonder if anyone has looked at varying how often evaluations are requested? And, if so, is there an effect on the thoughtfulness of the responses? I think you’re right that there probably would be.

      Somewhat related: I’ve also wondered how discipline affects how students rate faculty. Do they expect different things out of a calculus vs. a biology professor? Out of a biology vs. a literature professor? I don’t know how that would be assessed, but I think it’s interesting to think about. Which means that…

      3. …clearly we both think too much about course evaluations! 🙂

      • Another long reply here but…
        1. I mean both. So to get an A in the course you have to average at least 90% across all objective exams. But the exams test the two lowest levels of Bloom’s taxonomy, so they ask students to recognize knowledge and understand knowledge. So the exams are multiple choice and short answer. I used to ask long, conceptually complex questions on exams and realized, in the process of moving to a specs grading point of view, that I was testing higher levels of Bloom’s taxonomy but was simultaneously testing how fast they could get to that level, because they also had to answer a bunch of multiple-multiple choice questions and short answer questions. So, the knowledge tested on exams is cognitively simpler, so the spec for that is higher. And this is only one of the specs. The test spec for a C is a 70% average across all exams. That represents the lowest I am willing to go in terms of facility with recognition and retrieval, so there is no D in the course.

        Other specs are:

        1. a literature review (achieving Anderson et al.’s (2001) second highest level of cognitive process for a C or the highest level for an A or B).
        2. literature summaries of papers we read in class. Each has a summary portion and a critical thinking portion. For different course grades, you have to reach the highest levels of cognitive process more consistently for higher grades.
        3. an independent research project if you are shooting for an A.

        1-3 above have both specifications for scientific accuracy, depth of knowledge, etc. as well as writing specifications as I am trying to impress upon the students the importance of writing and thus we are working on writing better in the course. I can send you the syllabus if you like.

        Regarding point 2 above, I have not seen any data on the effect of how intensely they are surveyed. We do get comments from our students about what we have dubbed “survey fatigue”. I see the evaluations vary by level and there are some data out there about disciplinary differences and we see disciplinary differences at our institution.

        And yes to 3.

  4. Evaluations from students (and yes, primarily those two questions) are important in awards and the like for me, but less so in tenure or promotion.

    What I do is give the students a quiz (online, multiple choice) at the beginning of the semester and at the end of the semester (same quiz) to evaluate if they have learned. Do they as a group score better after taking my class? They get a flat 10 points per “survey” as I call it, so they are neither rewarded nor penalized for their performance.

    I would love to be more sophisticated about this – to have a better, more comprehensive test – but I think that even the simple way I do it is a very useful, low-input way to quantify learning rather than popularity.

    It’s funny about being liked by the students and what that means. I am popular as a teacher in general, and while in some ways that’s great, I think that some of my colleagues might think I’m just too much of a softy, and that I don’t teach a rigorous course. The before-after quiz gives me data to show them that I do, in fact, teach the students pretty darned well.

    • I almost had a section of this post on the question of being liked by students and, more specifically, whether I should care if students view me as “nice”. I ended up cutting it in the interest of not having the post be too rambly, but this is also something I’ve thought a lot about. Should my ideal evaluation include something about me being nice? I don’t like it if I hear that a student thinks I’m inaccessible or not nice, but I wonder what drives that sort of comment. I don’t know how I could be more accessible (I happily talk with students before and after class, I advertise office hours and am happy when students attend them, and I truly care about whether they learn). I think I’m a reasonably nice person. So, I generally view those comments as indicating something else (which may or may not have to do with me in particular), and I suspect that women receive criticisms about niceness more than men (though I don’t have any data on that). If these comments were more common, perhaps it would indicate something I should be worried about, but they’re not very common. I have brought this up with a colleague who thought it was comical that some students complained about this, and pointed out that they might be in for a rude awakening in future semesters (assuming that, at some point, they might have the misfortune of encountering a professor who really is mean).

      I suppose now I’ve just moved my rambling on this topic to the comments. Sorry! 😉

      • I think you have basically the right attitude. If the students don’t like you because they think you’re unwilling to answer questions or whatever, yeah, that’s absolutely something you should worry about as an instructor. But if they don’t like you because, say, they think you’re a hard*ss, that could well not be a problem (say, if it just reflects the fact that it’s a challenging course, and some students don’t like that because it’s the first C they’ve ever gotten and so they take it out on you). In other words, don’t worry so much about whether students find you “nice” in the abstract, worry about the sources of their feelings.

        Also worth keeping in mind that you shouldn’t read much into what any one student says. If one or a few students in a big class thinks you’re a jerk, in all seriousness I would not sweat that at all. Especially since, if you do sweat it, that probably means you’re ignoring the positive comments you got from other students.

        So I guess I’d say that, if you’re a reasonably nice person and you’re doing the things you should do as a teacher, I think that’ll come through to the students. They’ll see that you care about them, care that they learn the material, and so they’ll like you. That’s been my experience at Calgary at any rate. But I guess I’d also say that at the end of the day the goal is not to have the students like you. Student learning is the goal. Teach the class and relate to the students in the best way you know how so as to maximize their learning, and in my admittedly-anecdotal experience student evaluations that are at least reasonably positive will tend to follow.

    • It would seem a pre/post test doesn’t really capture the *added value* of a particular teacher above another or above, say, a textbook or YouTube. So for example, what if I just gave a pre-test on day 1, handed out an outline of each lecture, told the students to study from the textbook, and come back in 14 weeks for a post-test? Meanwhile, I’ll be in my research lab. I’m sure there would be learning, but clearly I’ve not added value since I was absent. So how do we measure our added value beyond what the student could learn on their own?

      As a follow-up, how do we determine which of our activities actually adds value? Some of the work on flipped classes at least suggests that it is not so much the flipped class that adds value but that a flipped class forces the students to spend more time per week thinking about the class material. And *any* strategy that forces students to think about the class more hours/week might work just as well.

      I love the theme of this and the previous post. Much food for thought.

      • I completely agree that the pre/post test doesn’t measure the value added of being in the class. But I think it’s a start – taking data on learning. Plus, doing so makes me think about how to take better data on learning and what I’d need to really evaluate different approaches. It makes me approach my teaching more scientifically, and I think that’s good overall.
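
      For anyone who wants to try the before/after quiz described in this thread, here is a minimal sketch of how the scores might be summarized. It assumes each student’s two scores can be matched up, and it reports the average per-student gain along with a paired t statistic. The function name `paired_gain` and the scores are made up for illustration; nothing here comes from the commenter’s actual course. As noted above, a gain like this shows that learning happened over the semester, not how much of it is value added by the instructor.

```python
from math import sqrt
from statistics import mean, stdev


def paired_gain(pre, post):
    """Summarize per-student change between the same quiz given twice.

    `pre` and `post` are lists of scores in the same student order.
    This is a hypothetical helper for illustration only.
    """
    if len(pre) != len(post) or len(pre) < 2:
        raise ValueError("need matched pre/post scores for at least two students")

    # Per-student change: end-of-semester score minus start-of-semester score.
    diffs = [after - before for before, after in zip(pre, post)]
    mean_diff = mean(diffs)

    # Standard error of the mean difference (sample standard deviation / sqrt(n)).
    se = stdev(diffs) / sqrt(len(diffs))

    # Paired t statistic; undefined (None) if every student changed by the same amount.
    t_stat = mean_diff / se if se > 0 else None

    return {
        "n": len(diffs),
        "mean_pre": mean(pre),
        "mean_post": mean(post),
        "mean_gain": mean_diff,
        "t": t_stat,
    }


# Made-up scores (percent correct) for five students, for illustration only.
pre_scores = [40, 55, 50, 60, 45]
post_scores = [70, 75, 65, 80, 60]

summary = paired_gain(pre_scores, post_scores)
print(summary)
```

      Working from per-student differences, rather than comparing the two class averages, is a design choice: it uses the fact that the same students took both quizzes, so differences in where students started do not swamp the change.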

  5. It helps to think about two separate reasons for why we would ask students to evaluate a course/instructor:
    1) To improve the course while hopefully having students reflect on their own learning in a useful way
    2) To “grade” the instructor for purposes of promotion/tenure or merit review.

    I tend to solicit evaluations separately with these goals in mind.

    For item (1), check out the Student Assessment of Learning Gains: http://salgsite.org/

    Also, plug into the CIRTL network to learn more about Teaching-As-Research! There are lots of evidence-based approaches to figuring out whether your students are really learning what you want them to learn. No need to reinvent the wheel here. There is even a MOOC that covers TAR along with other evidence-based practices: http://www.cirtl.net/

    For item (2), I think it’s up to tenured faculty to fight for change on this. Here at UW College of Engineering, we have a question that says “Your rating of this instructor compared to all instructors you have had is” and responses can range from 1 to 5, with 5 representing a ranking among the highest 20% of instructors. A lot like a 5-point grading scale. This is pretty useless, as others have noted above about their respective evaluation tools.

    I also favor a peer-review approach. Here again, there are existing best-practices for peer review but not many people know about them or have adopted them. When I had peers conduct review for my tenure dossier, they just showed up to lecture one day and answered a bunch of cookbook questions about whether I was prepared and whether the students were engaged, and whether my syllabus was up to date. Some members of our Teaching Academy (https://teachingacademy.wisc.edu/) have developed a much richer approach where the reviewer meets in advance with the reviewee and asks what s/he wants to get out of the review. We then go through course materials and structure, assessment tools, etc. It’s sooo much more useful. I don’t have a website to point to, but I know one is under construction so check back soon.

    • Thank you for these links! That Teaching Academy sounds great. If you remember to update once there’s a website to link to, I’d love to see it!

  6. I couldn’t answer the poll, because if the question is whether I pay much attention to the quantitative scores on student evaluations, the answer would have to be no. They mostly confirm what I already know. I am enthusiastic, know and communicate my material well, and respect the students, but I am not the best entertainer, nor do I have the fastest turnaround on grades or the easiest workload. This has been my “niche” for years and variations in quantitative scores are more stochastic than anything. And I guess the key is I’m OK in this niche and not actively trying to change it and my department isn’t expecting me to. At this point if I know the class size I can almost predict my quantitative scores before I teach the course.

    On the other hand, do I pay much attention to the qualitative comments? Absolutely. The likes and dislikes there are at least 50% of what drives changes in how I teach year to year (my own assessment being the other 50%).

    I tell students up front that I highly value their qualitative comments and that they are more useful to me than the quantitative scores, so hopefully students spend more time on them than they otherwise would.

  7. Pingback: Student evaluations – how bias shows up when you’re just trying to get some honest feedback. | Diversity Journal Club
