Like many biology profs, Meghan often starts class by talking for a few minutes about the “organism of the day”, as a way to engage student interest and illustrate key concepts. I do something similar in my intro biostats course. I start some lectures with “statistical vignettes”: real-world examples that illustrate key statistical concepts and demonstrate their practical importance, hopefully in a fun way.

I’ll say right up front that I have no idea how successful these vignettes are.* I don’t know if they make much difference to how well students like the course, and honestly I doubt they move the needle in terms of student mastery of the material. But I like doing them, I can’t see how they’d do any harm, and I’m sure at least a few students like and appreciate them. So here are some of my statistical vignettes, which I’m sharing in case any of you might want to try them out in your own classes. In the comments, please share your own favorite statistical vignettes!

A few preliminary remarks:

- For each vignette I use a single slide, or at most two, showing a key graph/table or two. It takes me 3-5 minutes to set up the question or problem, explain why it’s interesting or important, walk the students through the graph/table, and explain the statistical issue.
- Sometimes I explain the statistical issue myself, sometimes I ask the students to talk to their neighbors and see if they can figure out the statistical issue. Depends on my mood and how much other material I need to cover that day (it takes longer if I let the students try to figure it out).
- You’ll notice that most of the vignettes aren’t about biology. That’s deliberate. My students see plenty of biological examples in lab and lecture. And many of them aren’t going to go on to become biologists or other scientists who do statistics for a living. I want them to see how familiarity with basic statistical concepts can help them be informed citizens.
- Many of these vignettes could be incorporated into regular lectures as illustrative examples, or incorporated into other class activities, rather than used as standalone asides at the start of class. Indeed, that might be better than my approach; my vignette of the day often isn’t about that day’s lecture topic.
- Most of my vignettes involve statistical mistakes. Anecdotally, I think that gives some students an overly-negative impression of just how bad typical statistical practice is. I need to come up with some exemplary vignettes. Please comment to suggest some!
- On the other hand, talking about basic statistical mistakes made by professionals with graduate degrees can be a way to build up student confidence. I tell the students, look, this stuff is hard, even people with advanced degrees get it wrong sometimes. But that’s precisely why you’re taking the class, and when you finish it you’re going to have a better grasp of statistics than many people with PhDs. At least, I hope that message inspires some confidence in the students!
- I also find these sorts of vignettes to be a good source of challenging exam questions. On the exam, show the students a vignette they haven’t seen and ask them to diagnose the statistical concept or mistake the vignette illustrates.
- For each vignette, I’ve given a link that gives you the key graph or table, and all the background information you’d need to use the vignette yourself.

**Statistical vignettes**

- Many students just barely pass the NY Regents exams. (scroll down for the histogram, which shows a big spike at the minimum passing exam score and hardly anyone just barely failing to pass) Good example of how just looking at a plot of the distribution of your data can reveal a signal in need of an explanation. Note that the point here for statistical purposes is not to criticize (or praise) how Regents exams are scored, but just to illustrate how looking at the distribution of your data can suggest hypotheses about the data-generating process.
- Smoking gun evidence for systemic racial bias in the application of US drug possession laws. Another example of how just looking at a histogram can be extremely revealing. Here, following a 2010 reform of US federal minimum sentencing laws to set a much higher minimum sentence for possession of at least 280 g of crack cocaine than for possession of smaller amounts, there was an enormous spike in the share of drug possession convictions of blacks and Hispanics that were for possession of 280-290 g. The equivalent spike for whites was much smaller.
- On the other hand, sometimes just eyeballing a histogram can be misleading. Like when it leads you to think you can identify six distinct modes in a distribution that’s being estimated from only 29 observations.
- Are supercentenarians mostly superfrauds? Usually we think of removing untrustworthy observations as something that you do before statistical analysis. But here’s an interesting example of using statistics to *detect* untrustworthy observations.
- How a common misinterpretation of confidence intervals led to an unjustified RDA for vitamin D. The 95% confidence interval for a sample mean is *not* expected to contain 95% of the observations!
- Herding in political polling as election day approaches. Good example of what happens if observations aren’t sampled from the population independently of one another.
- Brian Wansink’s p-hacking. Good illustration of the importance of keeping exploratory analyses (=hypothesis generation) and hypothesis-testing analyses separate, with the hypothesis test being completely pre-specified so that it wouldn’t be done any differently no matter how the data turned out. I use this xkcd cartoon as a visual.
- Effect sizes in pre-registered vs. non-pre-registered studies. Another illustration of the importance of separating exploratory and hypothesis-testing analyses.
- Is at least one Auckland resident stabbed every day? No, surely not, contrary to the newspaper headline at the link. That’s a misleading way to summarize the fact that an *average* of about 1 knife injury per day is admitted to Auckland Hospital. Good example to use for introducing the Poisson distribution, and for thinking about count data more broadly. As the link points out, even if knife injuries aren’t Poisson, it seems pretty implausible that they’re evenly distributed, so that there’s at least one every day.
- Clinical trials with more discrepancies (e.g., the text says the sample size was N, the cited table says the sample size was N+1) report larger effect sizes. I use this to illustrate why the labs in my course emphasize detailed and accurate reporting. Small evident mistakes often are a sign of bigger non-evident mistakes. This paper, finding that discrepancies are more frequent in subsequently-retracted clinical trial reports than in unretracted reports, could be used to make the same point.
- Estimating the reproducibility of psychology experiments. Measures of sampling precision, such as the standard error and 95% confidence interval of a sample statistic, are estimates of what might happen if, hypothetically, you went back to the same statistical population(s) and did the same study again. So what happened when a team of researchers actually repeated a bunch of published psychology experiments? In particular, is it consistent with what you’d expect if sampling error and differences in sample size were the only differences between the original studies and the repetitions?
- “Wet bias” in commercial weather forecasting. Good illustration of what “bias” means in a statistical context.
- “Immortal time bias”. Good accessible biomedical example of the subtle biases that arise easily in observational studies.
- How one man’s basic error of data interpretation set English soccer astray for decades. I have yet to use this one, but may do so in future. Could be used to illustrate Bayes’ Theorem and the importance of considering base rates. The fact that most goals in soccer are preceded by three or fewer passes doesn’t mean teams should seek to avoid passing too much (the English “longball” style). That’s because most turnovers in soccer also are preceded by three passes or fewer. What you want to know is not “given that a goal was scored, how likely was it to have been preceded by three or fewer passes rather than some larger number of passes?”. What you want to know is “given a sequence of three or fewer passes, how likely is the end result to be a goal, compared to some larger number of passes?” You could easily use the data from here to construct a visual showing that the frequency with which a team makes long passes is negatively correlated with the number of goals it scores. Which isn’t a perfect visual–it doesn’t quite speak directly to the original error–but would probably work.
- The distribution of published p-values as evidence for p-hacking and/or publication bias. Hey, how come the distribution of published z-statistics in economics journals is bimodal with one of the peaks falling *exactly* at z=1.96 (i.e. where p=0.05)? Seems suspicious! UPDATE: here’s another, even clearer example. Apparently lots of economists are p-hacking their analyses so as to find that minimum wage laws lead to statistically-significant job losses. UPDATE: And here’s a nice graph of the frequency distributions of published p-values across many different scientific fields. They’re bimodal with one of the modes at 0.05 in almost every field, including, um, “alternative medicine”.
- Men in many countries report having much higher numbers of heterosexual partners in their lifetimes (or any other time period) than do women. Which is weird because by definition a heterosexual partnership is comprised of a man and a woman. So either the data are dramatically undersampling some distinctive subpopulation of either women or men, or (more likely) some people are lying in surveys. I use this to illustrate one way to check for sampling bias and measurement error: estimate or measure the same quantity using two different methods and see if they give the same answer.
- Why the 1936 Literary Digest poll of the US Presidential election was so far off. A famous, dramatic illustration of how increasing your sample size doesn’t protect you from sampling bias; precision and bias are two different things.
- How a basic statistical mistake led the Gates Foundation to waste millions of dollars pushing for smaller schools. Small schools are disproportionately likely to be among the best performing. Too bad they’re also disproportionately likely to be among the worst performing. A small school has a small sample of students, and so average student performance is more variable among small schools than it is among large schools.
- Do people prefer pop songs released early in the year? An illustration of how the quantitative algorithms that underpin the operations of many popular apps and social media sites aren’t always sensible. Here, Spotify’s algorithm thinks your favorite songs of the year are “whichever ones you listened to the most”–neglecting to adjust for the fact that you have more opportunity to listen to songs released earlier in the year.
- Do major league hitters really get no worse (or better) at hitting as they age? This one I constructed myself. I went to Baseball Reference (linked), downloaded data on the OPS+ (a summary measure of batting success) for all major league batters in 2014, and binned the data by player age. What you find is that the curve relating average OPS+ to player age is basically flat, except at the endpoints (players age 20, or age >38) where there are very few players. I ask the students if this shows that players get no worse or better at batting as they age. The answer is that it doesn’t, because of survivor bias: the sample omits all those players who weren’t good enough at batting to be major leaguers. The only old players who are still good enough hitters to be major leaguers are those who were even better when they were younger, or who have maintained their skills unusually well as they age. Similarly, the only really young players who are good enough to play in the majors are those who are exceptionally good compared to other young players. In fact, if you track the performance of individual players over time, rather than comparing cohorts of players of different ages, you find that major league players typically peak in their late 20s.
- Does firing the manager improve the performance of professional soccer teams? Great example of regression to the mean, and of the importance of having a control and randomly assigning study units to treatment vs. control conditions. You *cannot* tell if firing the manager improves team performance just by comparing how teams that fired their managers performed before vs. after doing so. I’m sure there are non-sports examples of the same phenomenon out there, for instance regarding whether corporation financial performance improves after firing the CEO. In ecology, this is the problem with before-after study designs, and the motivation for BACI designs.
- The illusion of population declines. Following on from the previous bullet, if you want an ecological example of regression to the mean, this is a good one. If a species varies in abundance in space and time (without any long-term trend in time), but you only start monitoring it at the times/places where it’s currently abundant, it’s likely to appear to decline in future. I haven’t used this one yet but plan to start doing so.
- Is your gut microbiome making you fat? If you want an example of regression to the mean in a biomedical, experimental context (rather than the ecological, observational context of the previous bullet), Jeff Walker points us to Turnbaugh et al. 2006. Turnbaugh et al. 2006 is a famous, influential study of the effects of fecal transplants on obesity in mice.
- The famous Dunning-Kruger effect is not a thing. Another example of regression to the mean.
- Does money buy happiness? A recent PNAS paper says that more and more income makes people happier and happier, with no upper bound. But that’s not what the paper’s data show. Good example to illustrate the importance of paying attention to effect size and R^2 value, not just statistical significance. Also a good example for illustrating the challenge of interpreting data plotted on transformed scales (here, log- and z-transformed). I’m increasingly thinking that students should be taught to back-transform their data and model fits for plotting purposes. And here’s another reanalysis of that PNAS paper that reaches the same conclusion.
- “Responder despondency”: myths of personalized medicine. People vary in how they respond to medication. So what evidence would you need to be able to do “personalized medicine”–specifically, to tell whether particular patient X will respond to a medication? Turns out the US Food and Drug Administration’s own evidence (well, one line of evidence at any rate) involves a pretty basic statistical error. In personalized medicine, each individual patient is a “population” of interest, and you need to treat each patient as such in order to estimate their unknown true responsiveness to a medication. This example is a nice illustration of when you should be using a paired (“within subjects”) design rather than an unpaired (“between subjects”) experimental design.
- Benford’s Law. In any set of numbers spanning a wide range, the first digit will most often be small. More specifically, it’ll be “1” about 30% of the time, and “9” about 5% of the time. There’s an elegant explanation for this, that’s hard to come up with on your own (why wouldn’t the distribution be uniform?), but intuitive once it’s pointed out to you. Good illustration of how tricky it can be to define what’s “random” or what’s an appropriate “null” expectation. Has applications in fraud detection, because people who make up fake numbers mostly don’t know about Benford’s Law and so make up numbers that conflict with it. If you want a more topical vignette on Benford’s law, here’s an application to detect unreliable COVID-19 mortality figures reported by the Czech Republic.
- p-hacking and false discoveries in A/B testing. Marketing studies comparing, e.g., two different website designs (versions “A” and “B”) to see which one draws more clicks often repeat the experiment daily and stop it if the p-value ever drops below 0.05. Such “optional stopping” is a well-known way to p-hack. Well, well-known except to marketers, apparently.
- If smartphones are destroying the mental health of today’s teens, then so are potatoes and eyeglasses. Good topical illustration of “researcher degrees of freedom” in action. Hard to pick out one good visual for this one, though.
- Everything we eat both causes and prevents cancer (scroll down for the graph). I haven’t used this one yet in class but it looks like a really good one. You could ask the class a good clicker question about this one: does the graph look as you would expect if nothing we eat causes or prevents cancer, but estimates of cancer risk are subject to sampling error?
- Phenology is not becoming less temperature-sensitive as the earth warms. Good example of the importance of transforming your data to meet the assumption of linearity. Also a good example of how the decision as to whether to transform your data can be informed by scientific background information, not just the assumptions of your statistical null hypothesis test. Here’s a blog post about the paper, with a nice cartoon graph explaining the intuition behind the linearizing transformation.
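
The confidence interval vignette above (the vitamin D one) is easy to demonstrate by simulation. Here is a minimal Python sketch, assuming hypothetical normally distributed data (mean 50, SD 10, n = 100; none of these numbers come from the vitamin D study) and a normal-approximation interval. The function name `ci_vs_observations` is made up for illustration.

```python
import random
import statistics


def ci_vs_observations(n=100, seed=1):
    """Draw one sample and compare the 95% CI for the mean to the raw data."""
    rng = random.Random(seed)
    # Hypothetical data: n draws from a normal distribution (mean 50, SD 10).
    sample = [rng.gauss(50, 10) for _ in range(n)]
    mean = statistics.mean(sample)
    sd = statistics.stdev(sample)

    # Normal-approximation 95% CI for the mean: mean +/- 1.96 * standard error.
    half_width = 1.96 * sd / n ** 0.5
    lower, upper = mean - half_width, mean + half_width

    # Fraction of individual observations that fall inside the CI for the mean.
    inside_ci = sum(lower <= x <= upper for x in sample) / n
    # For contrast: fraction inside mean +/- 1.96 * SD, which describes the
    # spread of the observations rather than the precision of the mean.
    inside_sd = sum(abs(x - mean) <= 1.96 * sd for x in sample) / n
    return lower, upper, inside_ci, inside_sd


lower, upper, inside_ci, inside_sd = ci_vs_observations()
print(f"95% CI for the mean: ({lower:.1f}, {upper:.1f})")
print(f"Share of observations inside the CI: {inside_ci:.0%}")
print(f"Share of observations inside mean +/- 1.96 SD: {inside_sd:.0%}")
```

With a sample of this size, the interval for the mean should cover only a small minority of the individual observations, whereas mean ± 1.96 SD should cover roughly 95% of them. The gap widens as n grows, because the interval for the mean shrinks while the spread of the data does not.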

*Similarly, Meghan reports that having an organism of the day has only sort of worked for her.
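
For the Auckland knife-injury vignette above, the Poisson arithmetic can be sketched in a few lines of Python. This assumes, purely for illustration, that daily admissions are Poisson with a mean of exactly 1 per day; `poisson_pmf` is a helper written here, not a library function.

```python
import math


def poisson_pmf(k, rate):
    """Probability of exactly k events when events are Poisson with mean `rate`."""
    return math.exp(-rate) * rate ** k / math.factorial(k)


rate = 1.0  # assumed average of 1 knife-injury admission per day

# Probability that a given day has no admissions at all: exp(-1), about 0.37.
p_zero = poisson_pmf(0, rate)

# Expected number of admission-free days in a 365-day year: about 134.
expected_zero_days = 365 * p_zero

print(f"P(no admissions on a given day) = {p_zero:.3f}")
print(f"Expected admission-free days per year = {expected_zero_days:.0f}")
```

So under this assumption, an average of one per day is compatible with more than a third of days having none at all.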

On the reported discrepancy in heterosexual partners, one of my favorite all-time essays is Richard Lewontin’s NYRB essay “Sex, Lies, and Social Science”, which unfortunately is behind a paywall. It’s worth the price of admission. The basic theme is that our autobiography, our internal (brain’s) story of our life, is unreliable because we, in short, lie to ourselves. While I love the essay, Lewontin uses it to take a below-the-belt shot at E.O. Wilson.

Anthony Lane (New Yorker film critic and essayist) has an interesting old piece on a major survey of sexual behavior in the US. He picks up on the issue of people lying in surveys. But more broadly, his piece is an attack on the whole idea of using statistics and data to understand human behavior, at least sexual behavior. The claim is that sexual behavior is irreducibly personal and inscrutable and so can’t be “understood” in any meaningful sense via data. I don’t agree with that claim. I think individual lived experience and data/statistics are complements, not competitors. But it’s a provocative claim, and he articulates it well.

ps to any readers who object to me referring to Anthony Lane: I am aware that other pieces of his have come in for criticism for sexism. My hope is that my endorsement of this one particular essay of his won’t be read as endorsement of everything he’s ever written.

wasn’t aware of that essay – I’ll have to check it out.

I have critter of the week, and usually try to attach it to the message I am trying to get across. For example, a skeletal drawing of Ardipithecus was used to illustrate how organisms are seldom ‘perfectly’ adapted. I also used it to probe the depth of prior learning: “So kids, having looked at Ardipithecus, how long do you think it is since humans evolved from Chimpanzees?” The correct answer being, of course that they didn’t.

Stats wise, I like to inject some humour – So, the grail scene from “Indiana Jones and the Last Crusade” as an illustration of biased sampling, and a YouTube vid of a woman spraying vinegar at chem trails: correlation doesn’t equal causation.

I use xkcd’s correlation vs causation cartoon in my intro biostats course: https://xkcd.com/552/ 🙂

I use it on my grad course as well. Usually nobody laughs but me, but I keep on trying. 😀

To offer testimony from n=3, our advanced organic chemistry prof (mid-1980’s) started each class with a Molecule of the Day. I am sure we were not the only impressed students; the attendance in her class of 150 was consistently the highest I had yet seen.

It often involved student participation. (e.g. on Valentine’s Day everyone was requested to hug their neighboring student and in one word express how that felt. It was a segue into molecular pathways of neuroendocrines.) My two study partners agreed that her vignettes captured our attention and interest in what could be a very boring subject and class. She successfully conveyed not only the relationships between chemistry and other fields, but also its role in daily life.

One of our assignments was to write a short science-fiction story incorporating organic chemistry. After all these decades, I recall many of those classes and the material quite vividly.

My favorite was the class covering the chemistry and history of psychedelics and how to make LSD.

My own organic chem prof (mid-90s) was quite good, though without those sorts of creative vignettes or exercises as far as I can recall. He did have a clever way of relieving student anxiety about exams, without actually making the exams easy. The exams were all syntheses–given a specified starting compound, and the reactions you’ve been taught, synthesize the specified endpoint compound. This was difficult because you might have an idea of how to get from A to B, but discover along the way that you get some intermediate that you don’t know how to further modify in the necessary way. So you have to start over. To alleviate our anxiety about this possibility, we were allowed one “miracle” per exam without penalty (inspired by the famous Sidney Harris “then a miracle occurs” cartoon; see this gallery: http://www.sciencecartoonsplus.com/gallery/math/index.php#). A miracle was a single synthesis step (e.g., removing a particular hydroxide group) that you didn’t know how to do. You just wrote “then a miracle occurs”, followed by the product of your miraculous reaction. As a student, it was very reassuring to have a “miracle” in your back pocket. Even though of course you only had one, which wasn’t going to make much difference to your mark in the context of an exam comprised of many syntheses.

I confess that I couldn’t see asking students to hug their neighbors, even on Valentine’s day and even as a well-meaning joke. Maybe have them shake hands or high five instead?

“I confess that I couldn’t see asking students to hug their neighbors, even on Valentine’s day and even as a well-meaning joke. Maybe have them shake hands or high five instead?”

Many students did not hug their neighbors. That was part of the lesson, too: the myriad of emotions in response to that request and the role of chemistry involved.

Recall from those days that over many weeks, most students formed habitual chosen seats and got to know their seating partners. Many did not object, many did. Some thought about it a bit and then hugged their neighbor, some laughed and shook hands.

How we react and the choices we make are complex. Inside each of us is a cascading chain of chemical reactions going all over the place. That was the point of the lesson (with a lot of bioorganic chemical dances ;).

I also learned from that class the importance (and fun) of engaging the audience in educational presentations. It usually works.

Whether or not these types of activities are very effective or just somewhat effective from a learner’s point of view, I think you can’t overestimate how much they add to the teacher’s experience. If this is a fun thing to research and talk about, it will make the instructor more lively and engaged, and that rubs off on the students. A vignette I use when I teach nonvascular plants is to give the example of sphagnum and show a short video clip on bog bodies. Students definitely remember how mosses can acidify the water around them when they see some pickled people from the Iron Age.

There’s probably an interesting debate to be had over the extent to which it’s ok for instructors to just teach in whatever way they enjoy most, on the grounds that (i) they enjoy it, and (ii) teaching in a way they enjoy aids student mastery of the material, at least in some small way. In particular, there’s a case to be made that lecturing all the time isn’t the most effective way to teach students mastery of the material, even if you enjoy lecturing as a prof. On the other hand, there’s also a case to be made that different profs teach best in different ways, because of their different preferences and training, and because of the different topics they teach and the different constraints under which they teach. Old post on this: https://dynamicecology.wordpress.com/2016/04/20/how-much-do-scientists-lecture-and-why-poll-results-and-commentary/.

I agree with emilyatlas but also agree it would be an interesting debate because evidence either way is highly problematic and effect sizes are probably highly conditional and small.

Judging by link clicks, most popular vignette so far (by some distance) is the vitamin D RDA one. Even though that one was just in last week’s Friday links so many readers presumably saw it already. Wonder to what extent that’s because it’s the first one listed?

One of my favorites, which I have used with students, represents a very much more basic level of statistical incompetence. Every election year, somebody at the New York Post will “handicap” the race by publishing odds of each candidate winning, horse-race style (e.g., Smith, 1:3; Jones, 2:7; Putzwilliger, 2:15 . . .). Invariably, the percentage odds add up to far above 100%. *Every* *damn* *year*.

Good one. I seem to recall reading that some prominent sports pundits’ predictions of (say) NFL team records for the upcoming season invariably fail to add up. In total, all NFL teams combined are predicted by the pundits to win more than 50% of the games, which is impossible. Our intro biostats course used to have a unit on the rules of probability, and an example like this would’ve been a good one to use in that unit.
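
The betting-odds example above is easy to check in code. Here is a rough Python sketch, assuming the published odds are “odds against” (so 3:1 means one win for every three losses) and using made-up odds for three hypothetical candidates, not the Post’s actual numbers. The helper name `implied_probability` is also just for illustration.

```python
from fractions import Fraction


def implied_probability(against, in_favor):
    """Convert 'against:in_favor' odds (e.g. 3:1 against) to a win probability."""
    return Fraction(in_favor, against + in_favor)


# Hypothetical odds against each candidate winning (assumed for illustration).
odds_against = {
    "Candidate A": (1, 1),  # even money -> 1/2
    "Candidate B": (2, 1),  # 2:1 against -> 1/3
    "Candidate C": (3, 1),  # 3:1 against -> 1/4
}

probs = {name: implied_probability(a, b) for name, (a, b) in odds_against.items()}
total = sum(probs.values())  # 1/2 + 1/3 + 1/4 = 13/12, i.e. more than 100%

# Rescaling by the total gives a coherent set of probabilities that sums to 1.
coherent = {name: p / total for name, p in probs.items()}

print(f"Implied probabilities sum to {float(total):.1%}")
for name, p in coherent.items():
    print(f"{name}: {float(p):.1%} after rescaling")
```

If exactly one candidate must win, the implied probabilities have to sum to 1, so any total above that shows the odds are incoherent as probabilities. (Bookmakers’ odds deliberately sum to more than 100% to build in a margin, but a newspaper handicapping a race has no such excuse.)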

Only vaguely related: I have an old series of posts on mathematical constraints in ecology and evolution, such as that the probabilities of all possibilities have to sum to 1. Starts here: https://dynamicecology.wordpress.com/2016/10/05/mathematical-constraints-in-ecology-part-1-species-cant-all-covary-really-negatively/

‘What does an error bar mean?’ might make for a good statistical vignette. Especially when it’s shown that senior researchers mix them up more often than they keep them straight.

Belia, S., F. Fidler, J. Williams, and G. Cumming. 2005. Researchers misunderstand confidence intervals and standard error bars. Psychological Methods. 10: 389-396. http://dx.doi.org/10.1037/1082-989X.10.4.389

(btw, 2nd author Fiona Fidler’s name shows up again in the March 30 “Friday Links”, on the preprint “Questionable Research Practices in Ecology and Evolution.”)

Good suggestion, thanks!

I just used the preprint on QRP in ecology and evolution as the statistical vignette in today’s class…
