Like many biology profs, Meghan often starts class by talking for a few minutes about the “organism of the day”, as a way to engage student interest and illustrate key concepts. I do something similar in my intro biostats course. I start some lectures with “statistical vignettes”: real-world examples that illustrate key statistical concepts and demonstrate their practical importance, hopefully in a fun way.
I’ll say right up front that I have no idea how successful these vignettes are.* I don’t know if they make much difference to how well students like the course, and honestly I doubt they move the needle in terms of student mastery of the material. But I like doing them, I can’t see how they’d do any harm, and I’m sure at least a few students like and appreciate them. So here are some of my statistical vignettes, which I’m sharing in case any of you might want to try them out in your own classes. In the comments, please share your own favorite statistical vignettes!
A few preliminary remarks:
- For each vignette I use a single slide, or at most two, showing a key graph/table or two. It takes me 3-5 minutes to set up the question or problem, explain why it’s interesting or important, walk the students through the graph/table, and explain the statistical issue.
- Sometimes I explain the statistical issue myself, sometimes I ask the students to talk to their neighbors and see if they can figure out the statistical issue. Depends on my mood and how much other material I need to cover that day (it takes longer if I let the students try to figure it out).
- You’ll notice that most of the vignettes aren’t about biology. That’s deliberate. My students see plenty of biological examples in lab and lecture. And many of them aren’t going to go on to become biologists or other scientists who do statistics for a living. I want them to see how familiarity with basic statistical concepts can help them be informed citizens.
- Many of these vignettes could be incorporated into regular lectures as illustrative examples, or incorporated into other class activities, rather than used as standalone asides at the start of class. Indeed, that might be better than my approach; my vignette of the day often isn’t about that day’s lecture topic.
- Most of my vignettes involve statistical mistakes. Anecdotally, I think that gives some students an overly negative impression of just how bad typical statistical practice is. I need to come up with some exemplary vignettes. Please comment to suggest some!
- On the other hand, talking about basic statistical mistakes made by professionals with graduate degrees can be a way to build up student confidence. I tell the students, look, this stuff is hard, even people with advanced degrees get it wrong sometimes. But that’s precisely why you’re taking the class, and when you finish it you’re going to have a better grasp of statistics than many people with PhDs. At least, I hope that message inspires some confidence in the students!
- I also find these sorts of vignettes to be a good source of challenging exam questions. On the exam, show the students a vignette they haven’t seen and ask them to diagnose the statistical concept or mistake the vignette illustrates.
- For each vignette, I’ve given a link that gives you the key graph or table, and all the background information you’d need to use the vignette yourself.
- Many students just barely pass the NY Regents exams. (scroll down for the histogram, which shows a big spike at the minimum passing exam score and hardly anyone just barely failing to pass) Good example of how just looking at a plot of the distribution of your data can reveal a signal in need of an explanation. Note that the point here for statistical purposes is not to criticize (or praise) how Regents exams are scored, but just to illustrate how looking at the distribution of your data can suggest hypotheses about the data-generating process.
- Smoking gun evidence for systemic racial bias in the application of US drug possession laws. Another example of how just looking at a histogram can be extremely revealing. Here, following a 2010 reform of US federal minimum sentencing laws to set a much higher minimum sentence for possession of at least 280 g of crack cocaine than for possession of smaller amounts, there was an enormous spike in the share of drug possession convictions of blacks and Hispanics that were for possession of 280-290 g. The equivalent spike for whites was much smaller.
- On the other hand, sometimes just eyeballing a histogram can be misleading. Like when it leads you to think you can identify six distinct modes in a distribution that’s being estimated from only 29 observations.
- Are supercentenarians mostly superfrauds? Usually we think of removing untrustworthy observations as something that you do before statistical analysis. But here’s an interesting example of using statistics to detect untrustworthy observations.
- How a common misinterpretation of confidence intervals led to an unjustified RDA for vitamin D. The 95% confidence interval for a sample mean is not expected to contain 95% of the observations!
- Herding in political polling as election day approaches. Good example of what happens if observations aren’t sampled from the population independently of one another.
- Brian Wansink’s p-hacking. Good illustration of the importance of keeping exploratory analyses (=hypothesis generation) and hypothesis-testing analyses separate, with the hypothesis test being completely pre-specified so that it wouldn’t be done any differently no matter how the data turned out. I use this xkcd cartoon as a visual.
- Effect sizes in pre-registered vs. non-pre-registered studies. Another illustration of the importance of separating exploratory and hypothesis-testing analyses.
- Is at least one Auckland resident stabbed every day? No, surely not, contrary to the newspaper headline at the link. That’s a misleading way to summarize the fact that an average of about 1 knife injury per day is admitted to Auckland Hospital. Good example to use for introducing the Poisson distribution, and for thinking about count data more broadly. As the link points out, even if knife injuries aren’t Poisson, it seems pretty implausible that they’re evenly distributed, so that there’s at least one every day.
- Clinical trials with more discrepancies (e.g., the text says the sample size was N, the cited table says the sample size was N+1) report larger effect sizes. I use this to illustrate why the labs in my course emphasize detailed and accurate reporting. Small evident mistakes often are a sign of bigger non-evident mistakes. This paper, finding that discrepancies are more frequent in subsequently-retracted clinical trial reports than in unretracted reports, could be used to make the same point.
- Estimating the reproducibility of psychology experiments. Measures of sampling precision, such as the standard error and 95% confidence interval of a sample statistic, are estimates of what might happen if, hypothetically, you went back to the same statistical population(s) and did the same study again. So what happened when a team of researchers actually repeated a bunch of published psychology experiments? In particular, is it consistent with what you’d expect if sampling error and differences in sample size were the only differences between the original studies and the repetitions?
- “Wet bias” in commercial weather forecasting. Good illustration of what “bias” means in a statistical context.
- “Immortal time bias”. Good accessible biomedical example of the subtle biases that arise easily in observational studies.
- How one man’s basic error of data interpretation set English soccer astray for decades. I have yet to use this one, but may do so in future. Could be used to illustrate Bayes’ Theorem and the importance of considering base rates. The fact that most goals in soccer are preceded by three or fewer passes doesn’t mean teams should seek to avoid passing too much (the English “longball” style). That’s because most turnovers in soccer also are preceded by three or fewer passes. What you want to know is not “given that a goal was scored, how likely was it to have been preceded by three or fewer passes rather than some larger number of passes?”. What you want to know is “given a sequence of three or fewer passes, how likely is the end result to be a goal, compared to some larger number of passes?” You could easily use the data from here to construct a visual showing that the frequency with which a team makes long passes is negatively correlated with the number of goals it scores. Which isn’t a perfect visual–it doesn’t quite speak directly to the original error–but would probably work.
- The distribution of published p-values as evidence for p-hacking and/or publication bias. Hey, how come the distribution of published z-statistics in economics journals is bimodal with one of the peaks falling exactly at z=1.96 (i.e. where p=0.05)? Seems suspicious! UPDATE: here’s another, even clearer example. Apparently lots of economists are p-hacking their analyses so as to find that minimum wage laws lead to statistically-significant job losses. UPDATE: And here’s a nice graph of the frequency distributions of published p-values across many different scientific fields. They’re bimodal with one of the modes at 0.05 in almost every field, including, um, “alternative medicine”.
- Men in many countries report having much higher numbers of heterosexual partners in their lifetimes (or any other time period) than do women. Which is weird because by definition a heterosexual partnership consists of a man and a woman. So either the data are dramatically undersampling some distinctive subpopulation of either women or men, or (more likely) some people are lying in surveys. I use this to illustrate one way to check for sampling bias and measurement error: estimate or measure the same quantity using two different methods and see if they give the same answer.
- Why the 1936 Literary Digest poll of the US Presidential election was so far off. A famous, dramatic illustration of how increasing your sample size doesn’t protect you from sampling bias; precision and bias are two different things.
- How a basic statistical mistake led the Gates Foundation to waste millions of dollars pushing for smaller schools. Small schools are disproportionately likely to be among the best performing. Too bad they’re also disproportionately likely to be among the worst performing. A small school has a small sample of students, and so average student performance is more variable among small schools than it is among large schools.
- Do people prefer pop songs released early in the year? An illustration of how the quantitative algorithms that underpin the operations of many popular apps and social media sites aren’t always sensible. Here, Spotify’s algorithm thinks your favorite songs of the year are “whichever ones you listened to the most”–neglecting to adjust for the fact that you have more opportunity to listen to songs released earlier in the year.
- Do major league hitters really get no worse (or better) at hitting as they age? This one I constructed myself. I went to Baseball Reference (linked), downloaded data on the OPS+ (a summary measure of batting success) for all major league batters in 2014, and binned the data by player age. What you find is that the curve relating average OPS+ to player age is basically flat, except at the endpoints (players age 20, or age >38) where there are very few players. I ask the students if this shows that players get no worse or better at batting as they age. The answer is that it doesn’t, because of survivor bias: the sample omits all those players who weren’t good enough at batting to be major leaguers. The only old players who are still good enough hitters to be major leaguers are those who were even better when they were younger, or who have maintained their skills unusually well as they age. Similarly, the only really young players who are good enough to play in the majors are those who are exceptionally good compared to other young players. In fact, if you track the performance of individual players over time, rather than comparing cohorts of players of different ages, you find that major league players typically peak in their late 20s.
- Does firing the manager improve the performance of professional soccer teams? Great example of regression to the mean, and of the importance of having a control and randomly assigning study units to treatment vs. control conditions. You cannot tell if firing the manager improves team performance just by comparing how teams that fired their managers performed before vs. after doing so. I’m sure there are non-sports examples of the same phenomenon out there, for instance regarding whether corporation financial performance improves after firing the CEO. In ecology, this is the problem with before-after study designs, and the motivation for BACI designs.
- The illusion of population declines. Following on from the previous bullet, if you want an ecological example of regression to the mean, this is a good one. If a species varies in abundance in space and time (without any long-term trend in time), but you only start monitoring it at the times/places where it’s currently abundant, it’s likely to appear to decline in future. I haven’t used this one yet but plan to start doing so.
- Is your gut microbiome making you fat? If you want an example of regression to the mean in a biomedical, experimental context (rather than the ecological, observational context of the previous bullet), Jeff Walker points us to Turnbaugh et al. 2006, a famous, influential study of the effects of fecal transplants on obesity in mice.
- The famous Dunning-Kruger effect is not a thing. Another example of regression to the mean. (Aside: It occurs to me that you could assign students to go out and find their own candidate examples of regression to the mean. It shouldn’t be that hard to find candidates; any study claiming that individuals/entities with low initial values of X showed the biggest increases in X, or that individuals/entities with high initial values of X showed the biggest decreases in X, is a candidate. For instance.)
- Does money buy happiness? A recent PNAS paper says that more and more income makes people happier and happier, with no upper bound. But that’s not what the paper’s data show. Good example to illustrate the importance of paying attention to effect size and R^2 value, not just statistical significance. Also a good example for illustrating the challenge of interpreting data plotted on transformed scales (here, log- and z-transformed). I’m increasingly thinking that students should be taught to back-transform their data and model fits for plotting purposes. And here’s another reanalysis of that PNAS paper that reaches the same conclusion.
- “Responder despondency”: myths of personalized medicine. People vary in how they respond to medication. So what evidence would you need to be able to do “personalized medicine”–specifically, to tell whether particular patient X will respond to a medication? Turns out the US Food and Drug Administration’s own evidence (well, one line of evidence at any rate) involves a pretty basic statistical error. In personalized medicine, each individual patient is a “population” of interest, and you need to treat each patient as such in order to estimate their unknown true responsiveness to a medication. This example is a nice illustration of when you should be using a paired (“within subjects”) design rather than an unpaired (“between subjects”) experimental design.
- Benford’s Law. In any set of numbers spanning a wide range, the first digit will most often be small. More specifically, it’ll be “1” about 30% of the time, and “9” about 5% of the time. There’s an elegant explanation for this, that’s hard to come up with on your own (why wouldn’t the distribution be uniform?), but intuitive once it’s pointed out to you. Good illustration of how tricky it can be to define what’s “random” or what’s an appropriate “null” expectation. Has applications in fraud detection, because people who make up fake numbers mostly don’t know about Benford’s Law and so make up numbers that conflict with it. If you want a more topical vignette on Benford’s law, here’s an application to detect unreliable COVID-19 mortality figures reported by the Czech Republic.
- p-hacking and false discoveries in A/B testing. Marketing studies comparing, e.g., two different website designs (versions “A” and “B”) to see which one draws more clicks often repeat the experiment daily and stop it if the p-value ever drops below 0.05. Such “optional stopping” is a well-known way to p-hack. Well, well-known except to marketers, apparently.
- If smartphones are destroying the mental health of today’s teens, then so are potatoes and eyeglasses. Good topical illustration of “researcher degrees of freedom” in action. Hard to pick out one good visual for this one, though.
- Everything we eat both causes and prevents cancer (scroll down for the graph). I haven’t used this one yet in class but it looks like a really good one. You could ask the class a good clicker question about this one: does the graph look as you would expect if nothing we eat causes or prevents cancer, but estimates of cancer risk are subject to sampling error?
- Phenology is not becoming less temperature-sensitive as the earth warms. Good example of the importance of transforming your data to meet the assumption of linearity. Also a good example of how the decision as to whether to transform your data can be informed by scientific background information, not just the assumptions of your statistical null hypothesis test. Here’s a blog post about the paper, with a nice cartoon graph explaining the intuition behind the linearizing transformation.
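For the A/B testing vignette in the list above, a quick simulation can make the cost of “optional stopping” concrete. Here’s a minimal Python sketch, not taken from the linked post. It assumes two website versions with exactly the same true click rate, so any “significant” difference is a false positive. The click rate, visitors per day, number of days, and all function names are made-up illustrative choices, and the test used is an ordinary pooled two-proportion z-test.

```python
import random
from statistics import NormalDist


def two_proportion_p_value(clicks_a, n_a, clicks_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (clicks_a + clicks_b) / (n_a + n_b)
    se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    if se == 0:
        # No clicks at all (or all clicks): no evidence of a difference.
        return 1.0
    z = (clicks_a / n_a - clicks_b / n_b) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def run_experiment(rng, days, visitors_per_day, click_rate, peek_daily):
    """Return True if the experiment ends up declaring p < 0.05.

    Both versions share the same true click rate (an assumption of this
    sketch), so every True returned here is a false positive.
    """
    clicks_a = clicks_b = n = 0
    for _ in range(days):
        clicks_a += sum(rng.random() < click_rate for _ in range(visitors_per_day))
        clicks_b += sum(rng.random() < click_rate for _ in range(visitors_per_day))
        n += visitors_per_day
        # Optional stopping: test the cumulative data every day, stop at the first p < 0.05.
        if peek_daily and two_proportion_p_value(clicks_a, n, clicks_b, n) < 0.05:
            return True
    # Fixed-horizon analysis: a single test once all the data are in.
    return two_proportion_p_value(clicks_a, n, clicks_b, n) < 0.05


def false_positive_rate(peek_daily, n_experiments=1000, days=20,
                        visitors_per_day=50, click_rate=0.1, seed=1):
    """Fraction of simulated null experiments that reach p < 0.05."""
    rng = random.Random(seed)
    hits = sum(
        run_experiment(rng, days, visitors_per_day, click_rate, peek_daily)
        for _ in range(n_experiments)
    )
    return hits / n_experiments


if __name__ == "__main__":
    print("single test at the end:", false_positive_rate(peek_daily=False))
    print("peeking every day:     ", false_positive_rate(peek_daily=True))
```

Running it prints the two false positive rates side by side. The single pre-specified test should land near the nominal 5%, and daily peeking should come out well above that. Students could vary `days` to see how the problem grows with the number of looks at the data.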
*Similarly, Meghan reports that having an organism of the day has only sort of worked for her.