Like many biology profs, Meghan often starts class by talking for a few minutes about the “organism of the day”, as a way to engage student interest and illustrate key concepts. I do something similar in my intro biostats course. I start some lectures with “statistical vignettes”: real-world examples that illustrate key statistical concepts and demonstrate their practical importance, hopefully in a fun way.
I’ll say right up front that I have no idea how successful these vignettes are.* I don’t know if they make much difference to how well students like the course, and honestly I doubt they move the needle in terms of student mastery of the material. But I like doing them, I can’t see how they’d do any harm, and I’m sure at least a few students like and appreciate them. So here are some of my statistical vignettes, which I’m sharing in case any of you might want to try them out in your own classes. In the comments, please share your own favorite statistical vignettes!
A few preliminary remarks:
- For each vignette I use a single slide, or at most two, showing a key graph/table or two. It takes me 3-5 minutes to set up the question or problem, explain why it’s interesting or important, walk the students through the graph/table, and explain the statistical issue.
- Sometimes I explain the statistical issue myself, sometimes I ask the students to talk to their neighbors and see if they can figure out the statistical issue. Depends on my mood and how much other material I need to cover that day (it takes longer if I let the students try to figure it out).
- You’ll notice that most of the vignettes aren’t about biology. That’s deliberate. My students see plenty of biological examples in lab and lecture. And many of them aren’t going to go on to become biologists or other scientists who do statistics for a living. I want them to see how familiarity with basic statistical concepts can help them be informed citizens.
- Many of these vignettes could be incorporated into regular lectures as illustrative examples, or incorporated into other class activities, rather than used as standalone asides at the start of class. Indeed, that might be better than my approach; my vignette of the day often isn’t about that day’s lecture topic.
- Most of my vignettes involve statistical mistakes. Anecdotally, I think that gives some students an overly negative impression of just how bad typical statistical practice is. I need to come up with some exemplary vignettes. Please comment to suggest some!
- On the other hand, talking about basic statistical mistakes made by professionals with graduate degrees can be a way to build up student confidence. I tell the students, look, this stuff is hard, even people with advanced degrees get it wrong sometimes. But that’s precisely why you’re taking the class, and when you finish it you’re going to have a better grasp of statistics than many people with PhDs. At least, I hope that message inspires some confidence in the students!
- I also find these sorts of vignettes to be a good source of challenging exam questions. On the exam, show the students a vignette they haven’t seen and ask them to diagnose the statistical concept or mistake the vignette illustrates.
- For each vignette, I’ve given a link that gives you the key graph or table, and all the background information you’d need to use the vignette yourself.
- How a common misinterpretation of confidence intervals led to an unjustified RDA for vitamin D. The 95% confidence interval for a sample mean is not expected to contain 95% of the observations!
- Herding in political polling as election day approaches. Good example of what happens if observations aren’t sampled from the population independently of one another.
- Brian Wansink’s p-hacking. Good illustration of the importance of keeping exploratory analyses (=hypothesis generation) and hypothesis-testing analyses separate, with the hypothesis test being completely pre-specified so that it wouldn’t be done any differently no matter how the data turned out. I use this xkcd cartoon as a visual.
- Clinical trials with more discrepancies (e.g., the text says the sample size was X, the cited table says the sample size was X+1) report larger effect sizes. I use this to illustrate why the labs in my course emphasize detailed and accurate reporting. Small evident mistakes often are a sign of bigger non-evident mistakes. This paper, finding that discrepancies are more frequent in subsequently retracted clinical trial reports than in unretracted reports, could be used to make the same point.
- Estimating the reproducibility of psychology experiments. Measures of sampling precision, such as the standard error and 95% confidence interval of a sample statistic, are estimates of what might happen if, hypothetically, you went back to the same statistical population(s) and did the same study again. So what happened when a team of researchers actually repeated a bunch of published psychology experiments? In particular, is it consistent with what you’d expect if sampling error and differences in sample size were the only differences between the original studies and the repetitions?
- “Wet bias” in commercial weather forecasting. Good illustration of what “bias” means in a statistical context.
- The distribution of published p-values as evidence for p-hacking and/or publication bias. Hey, how come the distribution of published z-statistics in economics journals is bimodal with one of the peaks falling exactly at z=1.96 (i.e., where p=0.05)? Seems suspicious! UPDATE: here’s another, even clearer example. Apparently lots of economists are p-hacking their analyses so as to find that minimum wage laws lead to statistically significant job losses. UPDATE: And here’s a nice graph of the frequency distributions of published p-values across many different scientific fields. They’re bimodal with one of the modes at 0.05 in almost every field, including, um, “alternative medicine”.
- Men in many countries report having much higher numbers of heterosexual partners in their lifetimes (or any other time period) than do women. Which is weird because by definition a heterosexual partnership consists of a man and a woman. So either there’s a problem with the data, or (more likely) some people are lying in surveys. I use this to illustrate one way to check for sampling bias and measurement error: estimate or measure the same quantity using two different methods and see if they give the same answer.
- Why the 1936 Literary Digest poll of the US Presidential election was so far off. A famous, dramatic illustration of how increasing your sample size doesn’t protect you from sampling bias; precision and bias are two different things.
- How a basic statistical mistake led the Gates Foundation to waste millions of dollars pushing for smaller schools. Small schools are disproportionately likely to be among the best performing. Too bad they’re also disproportionately likely to be among the worst performing. A small school has a small sample of students, and so average student performance is more variable among small schools than it is among large schools.
- Do people prefer pop songs released early in the year? An illustration of how the quantitative algorithms that underpin the operations of many popular apps and social media sites aren’t always sensible. Here, Spotify’s algorithm thinks your favorite songs of the year are “whichever ones you listened to the most”–neglecting to adjust for the fact that you have more opportunity to listen to songs released earlier in the year.
- Do major league hitters really get no worse (or better) at hitting as they age? This one I constructed myself. I went to Baseball Reference (linked), downloaded data on the OPS+ (a summary measure of batting success) for all major league batters in 2014, and binned the data by player age. What you find is that the curve relating average OPS+ to player age is basically flat, except at the endpoints (players age 20, or age >38) where there are very few players. I ask the students if this shows that players get no worse or better at batting as they age. The answer is that it doesn’t, because of survivor bias: the sample omits all those players who weren’t good enough at batting to be major leaguers. The only old players who are still good enough hitters to be major leaguers are those who were even better when they were younger, or who have maintained their skills unusually well as they age. Similarly, the only really young players who are good enough to play in the majors are those who are exceptionally good compared to other young players. In fact, if you track the performance of individual players over time, rather than comparing cohorts of players of different ages, you find that major league players typically peak in their late 20s.
- Does firing the manager improve the performance of professional soccer teams? Great example of regression to the mean, and of the importance of having a control and (if possible) randomly assigning study units to treatment vs. control conditions. You cannot tell if firing the manager improves team performance just by comparing how teams that fired their managers performed before vs. after doing so. I’m sure there are non-sports examples of the same phenomenon out there, for instance regarding whether corporate financial performance improves after firing the CEO. In ecology, this is the problem with before-after study designs, and the motivation for BACI designs.
- “Responder despondency”: myths of personalized medicine. People vary in how they respond to medication. So what evidence would you need to be able to do “personalized medicine”–specifically, to tell whether particular patient X will respond to a medication? Turns out the US Food and Drug Administration’s own evidence (well, one line of evidence at any rate) involves a pretty basic statistical error. In personalized medicine, each individual patient is a “population” of interest, and you need to treat each patient as such in order to estimate their unknown true responsiveness to a medication. This example is a nice illustration of when you should be using a paired (“within subjects”) design rather than an unpaired (“between subjects”) experimental design.
- UPDATE: Benford’s Law. In any set of numbers spanning a wide range, the first digit will most often be small. More specifically, it’ll be “1” about 30% of the time, and “9” about 5% of the time. There’s an elegant explanation for this that’s hard to come up with on your own (why wouldn’t the distribution be uniform?), but intuitive once it’s pointed out to you. Good illustration of how tricky it can be to define what’s “random” or what’s an appropriate “null” expectation. Has applications in fraud detection, because people who make up fake numbers mostly don’t know about Benford’s Law and so make up numbers that conflict with it.
- UPDATE: p-hacking and false discoveries in A/B testing. Marketing studies comparing, e.g., two different website designs (versions “A” and “B”) to see which one draws more clicks often repeat the experiment daily and stop it if the p-value ever drops below 0.05. Such “optional stopping” is a well-known way to p-hack. Well, well-known except to marketers, apparently.
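If you want to show students why the optional stopping in the A/B testing vignette inflates false positives, a short simulation can help. The sketch below is only an illustration, not anything taken from the linked piece. It assumes versions A and B are truly identical, that the test is a pooled two-proportion z-test run once per day on the accumulated data, and that the number of days, daily visitors, and click rate are made-up values. All function and parameter names are hypothetical.

```python
import math
import random


def two_proportion_p_value(clicks_a, n_a, clicks_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (clicks_a + clicks_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # identical all-or-nothing click counts: no evidence of a difference
    z = (clicks_a / n_a - clicks_b / n_b) / se
    return math.erfc(abs(z) / math.sqrt(2))


def simulate_ab_test(rng, days=20, visitors_per_day=100, click_rate=0.1, alpha=0.05):
    """Simulate one A/B test in which A and B have the same true click rate.

    Returns (peeking_significant, final_significant):
    - peeking_significant: the p-value dipped below alpha on at least one day,
      so an experimenter who stops at the first p < alpha would declare a winner.
    - final_significant: the p-value is below alpha on the last day only,
      i.e. the result of a single pre-specified test.
    """
    clicks_a = clicks_b = n = 0
    peeking_significant = False
    p = 1.0
    for _ in range(days):
        clicks_a += sum(rng.random() < click_rate for _ in range(visitors_per_day))
        clicks_b += sum(rng.random() < click_rate for _ in range(visitors_per_day))
        n += visitors_per_day
        p = two_proportion_p_value(clicks_a, n, clicks_b, n)
        if p < alpha:
            peeking_significant = True
    return peeking_significant, p < alpha


def false_positive_rates(n_sims=500, seed=0, **kwargs):
    """Fraction of null experiments declared significant under each strategy."""
    rng = random.Random(seed)
    results = [simulate_ab_test(rng, **kwargs) for _ in range(n_sims)]
    peeking_rate = sum(peek for peek, _ in results) / n_sims
    final_rate = sum(final for _, final in results) / n_sims
    return peeking_rate, final_rate


if __name__ == "__main__":
    peeking_rate, final_rate = false_positive_rates()
    print(f"False positive rate, stopping at first p < 0.05: {peeking_rate:.2f}")
    print(f"False positive rate, one test at the end:        {final_rate:.2f}")
```

With settings like these, the rate for testing once at the end should sit near the nominal 5%, while the rate for checking every day should come out well above it. Neither version of the website is better in this simulation, so every "significant" result is a false discovery.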
*Similarly, Meghan reports that having an organism of the day has only sort of worked for her.
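The small-schools vignette is also easy to simulate if you want to generate your own graph or let students play with the numbers. The following is a minimal sketch under made-up assumptions: every student at every school is drawn from the same normal distribution of test scores, so no school is truly better than any other, and schools come in just two sizes. The school sizes, score distribution, and all function names are hypothetical, chosen only for illustration.

```python
import random


def simulate_school_means(rng, n_schools_per_size=200, small_size=50,
                          large_size=500, true_mean=70.0, student_sd=15.0):
    """Return a list of (size_label, mean_score) pairs, one per school.

    All students are drawn from the same score distribution, so any
    difference between school averages is pure sampling error.
    """
    schools = []
    for label, size in (("small", small_size), ("large", large_size)):
        for _ in range(n_schools_per_size):
            scores = [rng.gauss(true_mean, student_sd) for _ in range(size)]
            schools.append((label, sum(scores) / size))
    return schools


def share_small(schools):
    """Fraction of the given schools that are small."""
    return sum(label == "small" for label, _ in schools) / len(schools)


def small_share_in_tails(schools, tail_fraction=0.1):
    """Share of small schools among the top and bottom performers.

    Returns (share_in_top, share_in_bottom), ranking schools by mean score.
    """
    ranked = sorted(schools, key=lambda school: school[1])
    k = max(1, int(len(ranked) * tail_fraction))
    return share_small(ranked[-k:]), share_small(ranked[:k])


if __name__ == "__main__":
    rng = random.Random(0)
    schools = simulate_school_means(rng)
    top, bottom = small_share_in_tails(schools)
    print(f"Small schools overall:       {share_small(schools):.0%}")
    print(f"Small schools in top 10%:    {top:.0%}")
    print(f"Small schools in bottom 10%: {bottom:.0%}")
```

Half of the simulated schools are small, but small schools should dominate both the best-performing and worst-performing 10%. That is exactly the pattern you'd expect from sample size alone, with no real effect of school size on student performance.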