Ask us anything: the “perfect” intro biostats course

Every year we invite readers to ask us anything! Here’s today’s question, from Pavel Dodonov (paraphrased; click that last link for the original):

How would you structure the “perfect” statistics course for biology undergraduate students?

Jeremy’s answer:

I teach intro biostats, so I wish I could answer by just linking to the course syllabus. 🙂 And I actually do think the course I teach (in alternate terms with my amazing colleague Kyla Flanagan) is a good version of a traditional intro biostats course. But there are some tweaks I’d like to make. More broadly, one can make the case that the traditional intro biostats course has had its day and should be replaced by something very different. So my short answer is that I don’t think there’s any one “perfect” intro biostats course.

Longer answer: here are some things I think our course gets right:

  • We teach R in the labs. R coding basics are a highly transferable skill students can build on in future, no matter what direction life takes them. R is widely used in the private sector, for instance. Every year we get several students taking the course primarily because they want to learn R, often because R wasn’t taught in the stats courses in their home departments.
  • We teach permutation tests and bootstrapping instead of classical nonparametric tests. Permutation tests and bootstrapping are clever, elegant ideas that students can grasp and that many get a kick out of. I feel pretty strongly that there’s no longer any reason for anybody to use or teach Mann-Whitney U tests, Kruskal-Wallis tests, etc. But perhaps I’m overgeneralizing from my own experience! Feel free to disagree in the comments!
  • We do a good job teaching the students to think, not just blindly follow set procedures. I really want them to understand why they’re doing the stats they’re doing, and appreciate that all the various statistical tests we teach are just variations on the same underlying logic.
  • I think we do a good job engaging the students in statistical material and showing them how being able to think statistically is helpful for any informed citizen. Statistical vignettes are one way among others that I do that. But it’s hard because relatively few students come into the course with a pre-existing interest in the material. Many students are only there because they’re required to be there.
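
To make the permutation-test idea from the list above concrete, here is a minimal sketch. It is in Python rather than R only so that it runs with the standard library alone; the function name, the two small data sets, and the number of shuffles are all invented for illustration and are not taken from the course.

```python
import random

def permutation_test(group_a, group_b, n_shuffles=10_000, seed=1):
    """Two-sided permutation test for a difference in group means.

    Returns the observed difference (mean of a minus mean of b) and an
    approximate p-value based on random relabelling of the observations.
    """
    rng = random.Random(seed)
    n_a = len(group_a)
    pooled = list(group_a) + list(group_b)

    def mean_diff(values):
        first, second = values[:n_a], values[n_a:]
        return sum(first) / len(first) - sum(second) / len(second)

    observed = mean_diff(pooled)
    extreme = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)  # break any real link between label and value
        # small tolerance so that ties are not lost to floating-point rounding
        if abs(mean_diff(pooled)) >= abs(observed) - 1e-12:
            extreme += 1
    # add-one correction keeps the estimated p-value above zero
    p_value = (extreme + 1) / (n_shuffles + 1)
    return observed, p_value

# Hypothetical measurements, made up for this example
control = [4.1, 5.0, 4.7, 5.3, 4.4, 4.9]
treated = [5.6, 6.1, 5.2, 6.4, 5.9, 5.5]

observed_diff, p = permutation_test(treated, control)
print(f"observed difference in means: {observed_diff:.2f}")
print(f"approximate two-sided p-value: {p:.4f}")
```

The whole logic of a null hypothesis test is visible in the loop: if the group labels did not matter, shuffling them should often produce differences as large as the one observed.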

As to what I’d change, the intro biostats course here at Calgary emphasizes study design and null hypothesis testing. That’s because you’ve got to start somewhere. Plus, many of our students will go on to upper-level courses or graduate studies in which they’ll have to design and interpret their own experiments, and interpret others’ experiments. Anyway, the null hypothesis tests we teach are the usual ones: t-tests, contingency tables, single-factor fixed-effect ANOVA, Tukey’s HSD test, linear regression, correlation. (Note that there’s much more in the course besides the null hypothesis tests I just listed…) In other words, general linear models (GLMs) with at most one predictor variable. We teach all those separate tests because they’re often referred to in the scientific literature, and because that’s the way our textbook (Whitlock & Schluter) teaches them.

Students who go on to take advanced biostats learn that everything they learned in intro biostats is just special cases of GLMs. But I increasingly feel like we should just teach GLMs from the get-go in intro biostats, and mention in passing that, by the way, this special case is called “ANOVA” for historical reasons, this other special case is called “regression” for historical reasons, etc. It’s just hard for students to memorize a bunch of separate tests and when to use them. The main things holding us back from switching to a GLM-first approach are (i) the course in its current form seems to work pretty well and we don’t want to fix what isn’t broken (students find it challenging to memorize a bunch of tests, but they do mostly manage it), (ii) it would require a lot of new prep and being a prof is always a matter of triaging demands on one’s time, and (iii) I haven’t found a textbook that I think would work well. Whitlock & Schluter only get to the general case of GLMs in one chapter late in the book. But the GLM-based undergrad biostats textbooks I’ve seen (e.g., Alan Grafen’s book) are too advanced and abstract for our students. Whitlock & Schluter is perfect for our needs in every other way, so the textbook I want is “Whitlock & Schluter, but structured around GLMs from the get-go”. If you know of such a textbook, let me know!
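
To illustrate the claim above that the classic tests are special cases of one linear model, here is a minimal sketch showing that the equal-variance two-sample t-test and a regression on a 0/1 group indicator give the same t statistic. It is plain Python (standard library only) rather than R; the function names and data are hypothetical, invented for this example.

```python
from math import sqrt
from statistics import mean

def pooled_t_statistic(group_a, group_b):
    """Classic two-sample t statistic (equal variances assumed) for mean(b) - mean(a)."""
    n_a, n_b = len(group_a), len(group_b)
    mean_a, mean_b = mean(group_a), mean(group_b)
    ss_a = sum((v - mean_a) ** 2 for v in group_a)
    ss_b = sum((v - mean_b) ** 2 for v in group_b)
    pooled_var = (ss_a + ss_b) / (n_a + n_b - 2)
    return (mean_b - mean_a) / sqrt(pooled_var * (1 / n_a + 1 / n_b))

def simple_regression(x, y):
    """Least-squares fit of y = b0 + b1*x; returns b0, b1 and the t statistic for b1."""
    n = len(x)
    mean_x, mean_y = mean(x), mean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    rss = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    se_b1 = sqrt(rss / (n - 2) / sxx)
    return b0, b1, b1 / se_b1

# Hypothetical measurements, made up for this example
control = [4.1, 5.0, 4.7, 5.3, 4.4, 4.9]
treated = [5.6, 6.1, 5.2, 6.4, 5.9, 5.5]

t_classic = pooled_t_statistic(control, treated)

# Same data as a linear model: x is 0 for control, 1 for treated
y = control + treated
x = [0.0] * len(control) + [1.0] * len(treated)
b0, b1, t_lm = simple_regression(x, y)

print(f"t from the two-sample t-test:   {t_classic:.6f}")
print(f"t for the slope in the model:   {t_lm:.6f}")
print(f"intercept = control mean: {b0:.3f}, slope = difference in means: {b1:.3f}")
```

In R the same comparison would be between `t.test(..., var.equal = TRUE)` and `lm(y ~ group)`; the intercept is the mean of the reference group and the slope is the difference between group means.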

If you wanted to completely revamp the traditional intro biostats course topic list, I think the way to go would probably be towards teaching “intro data science for biology” instead. You’d put a lot more emphasis on data cleaning and processing than we do. More emphasis on exploratory data analysis (though we do teach a bit of that). You’d de-emphasize null hypothesis testing in favor of description and estimation. And you’d familiarize students with some common sorts of observational data biologists work with. Terry McGlynn recently suggested that a course along these lines would be more useful to many biology undergraduates than, say, the traditional calculus or organic chemistry requirements. Curious to hear what readers think of this suggestion.

I think it’d be very hard to get all the way into Bayesian stats in intro biostats. Currently, we don’t even get there in advanced biostats. But I’m sure somebody somewhere does it, and I would be very interested to hear comments on this.

Brian’s answer:

I teach a graduate statistics course (with the only prerequisite being that they have taken a semester stats course at the undergraduate level). So I can tell you a lot about what knowledge students actually have coming out of such a course “on average”. But that is not what you asked.

If I were teaching undergraduate stats, I would cover the following:

  • Data & experimental design: types of variables (categorical, binary, ordered, continuous). Responsible data collection. I’ve said more than once that basic experimental design is losing out to advanced stats, which is a bad thing. You cannot of course spend the whole course on experimental design. But a week on the difference between observation and experiment, on rigorous control and on breaking confoundment gets pretty far. (1.5 weeks)
  • I would definitely teach it using R. That is one big change in the 15 years I’ve been teaching grad stats – when I started literally none of my students knew R coming in. Now 80-90% have already had the basic exposure. And that is thanks to undergrad stats (and other classes) so I wouldn’t break that streak. Have a lab this week on data wrangling, maybe dplyr. Or maybe more time on data visualization and exploration (1 week to get started).
  • The general linear model – i.e. the regression with effect size as coefficient view of linear statistics. While I agree with Jeremy that there is no obvious textbook, I just cannot conceive of why I would teach a t-test and anova and a regression separately any more. I would teach up to interaction between categorical variables but probably not ANCOVA (in my experience ANCOVA is something some undergrads have been exposed to and some have not but mostly they don’t understand it regardless so it is probably a bridge too far). And cover a bit of collinearity in multiple regression. This is the heart of the course (5-6 weeks) – ramp them up to be really good at this – diagnostics, interpretation of all R output, contrasts, etc.
  • Modes of inference – primarily focusing on NHST/p-values and AIC (both of which I’m not a big fan of but they’re what is used) as well as Bayesian, Monte Carlo/randomization/reshuffling, exploratory (including stepwise) and estimation/confidence intervals. Now you’re not going to teach somebody how to be a Bayesian modeller, but you can get across the conceptual framework. And you can show them how you can view a simple linear regression from each of these frameworks (i.e. show, not have them learn to do). And you might get them to do a simple randomization test (2 weeks)
  • Four extensions to the GLM – logistic regression, regression trees, principal component analysis, and a simple site blocking factor as a fixed effect (no way I’m going to mixed models). Purely from a here’s-how-to-do-it, not a theory, point of view. Just to give them a flavor of what’s coming, not to become experts. Plus these four techniques get them far into being able to read the literature (2 weeks)

That’s 12.5 weeks, which is about the length of an average semester at an American university. If I had more time, it would go into more data wrangling, data visualization and the last unit (four extensions). (Jeremy adds: boy howdy, Brian wants to cover material a lot faster than we do in intro biostats at Calgary! He’s proposing to cover a bit more in one course than we cover in 1 course plus 2/3 of a follow-up course! Not criticizing, necessarily–a lot of variables affect how fast you can move through material. Just noting the contrast.)
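
As a small illustration of the estimation/confidence-interval and resampling views of a simple linear regression mentioned in the list above, here is a minimal sketch of a percentile bootstrap interval for a regression slope. It is plain Python (standard library only) rather than R; the function names, the data, and the choice of a pairs bootstrap with 5000 resamples are assumptions made for this example, not part of either proposed course.

```python
import random
from statistics import mean

def slope(x, y):
    """Least-squares slope of y on x."""
    mean_x, mean_y = mean(x), mean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx

def bootstrap_slope_ci(x, y, n_boot=5000, level=0.95, seed=1):
    """Percentile bootstrap interval for the slope, resampling (x, y) pairs."""
    rng = random.Random(seed)
    n = len(x)
    slopes = []
    while len(slopes) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        xs = [x[i] for i in idx]
        if len(set(xs)) < 2:
            continue  # slope is undefined if every resampled x is identical
        ys = [y[i] for i in idx]
        slopes.append(slope(xs, ys))
    slopes.sort()
    lower = slopes[int((1 - level) / 2 * n_boot)]
    upper = slopes[int((1 + level) / 2 * n_boot) - 1]
    return lower, upper

# Hypothetical data, made up for this example
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
y = [2.3, 3.9, 6.2, 8.1, 9.7, 12.4, 13.8, 16.1, 18.3, 19.6]

slope_hat = slope(x, y)
ci_low, ci_high = bootstrap_slope_ci(x, y)
print(f"estimated slope: {slope_hat:.3f}")
print(f"95% percentile bootstrap interval: ({ci_low:.3f}, {ci_high:.3f})")
```

The point estimate is the same coefficient a classical regression would report; only the way uncertainty is attached to it changes.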


31 thoughts on “Ask us anything: the “perfect” intro biostats course”

    • Sorry, way too advanced for intro biostats in my view. I do mention phylogenetic relatedness as an example of non-independence of observations. But that’s literally a one-minute mention. There’s no way I could actually get to phylogenetic mixed models in intro biostats at Calgary. Even students who take our follow-up advanced biostats course don’t get that far.

    • I feel the phylogenetic framework is too specific. Biology students not doing comparative analyses will probably not need it, but a general view of confounding factors and why they matter – including phylogenetic autocorrelation – seems worth covering!

  1. The reason Brian proposes to cover much more material than we cover in intro biostats here at Calgary is that his GLM module takes much less time than it takes us to cover the various special cases of GLMs that we cover. Never having taught a GLM-based intro biostats course (“intro” meaning “first stats course the students have ever taken”), I have no sense of how feasible it is to cover GLMs that quickly. Thoughts from others who have experience with this?

    • I once tried this as a two-week grad intro stats course, for ecology. Unfortunately I didn’t get to teach that same course again (due to getting a permanent job elsewhere 🙂 ), but in general I found it quite challenging to teach the general formula (like Y = b0 + b1*X1 + … + error), especially considering categorical explanatory variables. In general it took me the same time as it used to take me to teach the classic tests, but I think I managed to provide a more general view of statistics than in the previous years.
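
One way to make the categorical case of that general formula concrete is to write out the 0/1 indicator ("dummy") columns explicitly. The sketch below is a minimal, hypothetical example in plain Python (standard library only): the habitat labels, richness values, and function name are invented. It relies on a known property of a model with a single categorical predictor under treatment coding, namely that the least-squares intercept is the reference group mean and each other coefficient is that group's mean minus the reference mean, and then checks this against the normal equations.

```python
from statistics import mean

def dummy_code(labels, reference):
    """Treatment coding: an intercept column plus one 0/1 column per non-reference level."""
    levels = [reference] + sorted(set(labels) - {reference})
    rows = [[1.0] + [1.0 if lab == lev else 0.0 for lev in levels[1:]]
            for lab in labels]
    return levels, rows

# Hypothetical data, made up for this example
habitat = ["forest", "forest", "forest",
           "meadow", "meadow", "meadow",
           "wetland", "wetland", "wetland"]
richness = [12.0, 14.0, 13.0, 9.0, 8.0, 10.0, 17.0, 15.0, 16.0]

levels, X = dummy_code(habitat, reference="forest")

group_means = {lev: mean(y for lab, y in zip(habitat, richness) if lab == lev)
               for lev in levels}

# b0 = reference mean; b_k = mean of level k minus reference mean
b = [group_means[levels[0]]] + [group_means[lev] - group_means[levels[0]]
                                for lev in levels[1:]]

# Fitted values from Y = b0 + b1*X1 + b2*X2, and the residuals
fitted = [sum(bj * xj for bj, xj in zip(b, row)) for row in X]
residuals = [y - f for y, f in zip(richness, fitted)]

# Least squares requires every design column to be orthogonal to the residuals
normal_eq = [sum(row[j] * r for row, r in zip(X, residuals))
             for j in range(len(levels))]

for lev, coef in zip(["(intercept: " + levels[0] + ")"] + levels[1:], b):
    print(f"{lev:>22}: {coef:+.1f}")
print("normal equations (all should be ~0):", normal_eq)
```

Seen this way, a one-factor "ANOVA" is the same formula as a regression, with indicator columns standing in for the numeric predictor.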

  2. I’ve taught an introductory R biostats course for MSc students (some of whom had little or no stats background) using an “lm” approach for everything to begin with, then expanding it into “glm” once they were more confident. They seemed to grasp the concept of ‘response ~ explanatory variable(s)’ much more quickly than having to learn a whole set of cryptically named stats tests. There’s a good website showing everything in R, but I haven’t found a textbook yet that uses this approach.

  3. Excellent. Two things: 1) Many practicing, applied biologists think non-parametric is the bee’s knees, and students will move into labs/companies that don’t realize that non-parametric is largely a pre-computer solution, so how do we deal with this on-the-ground reality at the level of intro teaching (remember Ewan Birney’s list of things he wished he’d learned in grad school, which included “learning non-parametric”)? And 2) many, many biologists (almost everyone doing bench science, so molecular, cell, micro, neuro) use Prism for analysis, and this includes biotech companies. Prism has a very different way of thinking about analysis and is not set up for mixed/generalized models at all. Many of your students will enter this world. This is a very, very NHST world, and the idea of “effect size” is effectively non-existent and what is a “meaningful” effect size is a meaningless question because there simply isn’t enough known about the systems to come up with a number. This raises the question about teaching NHST v modeling estimation/uncertainty because the science that most of these biologists do is built around an NHST model of science (by this I mean we don’t have quant, mechanistic models of expected effects, only qual models of “presence” of effect) – so if we moved teaching to linear models and extensions with coefficients and CIs (I think where we all want to go) then we have to nudge this field into rethinking how science is done (that said, many ecology experiments that I see are firmly built around an NHST model of science). Regardless, I have a very small sample, but the students/postdocs in bench biology I’m talking to are yearning to learn R, probably because they see this leaking down from the bioinformatics stuff.

    • I add this because we are all thinking about a glm modeling approach to teaching but I, at least, live/work in a bit of a bubble and think that the big problems are estimation problems. But if Prism cannot do GLMs then should I be teaching quasi-Poisson or negative binomial? Should I care if negative binomial with a permutation test has more power than a t-test on log-transformed data if most students that go on in labs/companies are trying to get it done with Prism? I am *not* conceding to NHST but it does check my assumptions about what/how to teach.

    • …last bit on bubble. I love bootstrap/permutation because it helped me understand what “frequentist” and null distribution really means. But…I don’t think Prism implements either bootstrap SE/CI or permutation tests. This raises definition of “perfect” class. Perfect in ideal world or perfect given the reality of this world. As teachers, what is our role in teaching bootstrap/permutation when a major major software solution doesn’t implement these?

      • Three thoughts (all of which I’m guessing you’ll agree with, Jeff):

        First, the answer to the question “how many of our undergrads are likely to go on to jobs where they have to use Prism?” probably varies a *lot* among institutions. For instance, none of the cell/molecular labs here at Calgary use Prism AFAIK. And even if a few of them will go on to use Prism, well, a lot of them won’t. Nobody’s intro biostats course can function as job training for every possible career path.

        Second, training in the fundamentals of statistical thinking should be pretty software-independent and transferable. So for instance, I would hope that anybody who’s been through our intro biostats course would be able to pick up classical nonparametric tests pretty quickly. Just as somebody who’s been through our intro biostats course ought to be able to build on that and go on to pick up generalized linear models or PCA or anything else we don’t teach in intro biostats.

        Finally, if you’ve trained students to be thoughtful about things that the Prism software implicitly discourages its users from thinking about (e.g., effect sizes), well, I don’t know that that makes your training a *bad* thing. I mean, isn’t it generally better for any user to be smarter than the software? Everybody’s always complaining about people who use statistical software in a mindless, rote way without really understanding what the software is doing. Surely it’s not *also* a problem if well-informed people use the software thoughtfully (and perhaps reluctantly!), because they understand very well what it’s doing!

      • Hi Jeremy – I do agree. That said, teaching “how to think” is not software independent. It takes much more work to “think ANOVA-ly” than “think statistical model-ly” in R – that is, R effectively forces you to think about statistics a certain way: using statistical models. Graphpad Prism forces one to think about statistics as choosing the right test. I focus on Graphpad Prism because it’s a marker for how working biologists actually think about and do statistical analysis. The point isn’t to “teach to Graphpad Prism” but to use this knowledge to inform how we address bad practices and misconceptions that we know will arise in their future, because many of these students will be in a lab/company that uses this software designed around NHST thinking. How many? Here is a test of your point #1, using Google Scholar, for recent (2018-) publications. Play around with other keywords (“microbiome”) and sources (source:neuroscience).

        search “knockout” AND (limit to 2018 to present)
        “Graphpad” 17400 results
        “JMP” 1270 results
        “SAS” 4890 results
        (“” OR “R statistical”) 1500 results

        search “knockout” AND Calgary AND (limit to 2018 to present)
        “Graphpad” 190 results
        “JMP” 5 results
        “SAS” 25 results
        (“” OR “R statistical”) 12 results

        search “source:plos” AND (limit to 2018 to present)
        “graphpad” 3360 results
        “JMP” 540 results
        “SAS” 2880 results
        “cran r project org” OR “R statistical” 1520 results

        search “Calgary” AND “source:plos” AND (limit to 2018 to present)
        “graphpad” 19 results
        JMP 2 results
        SAS 20 results
        “cran r project org” OR “R statistical” 14 results

        search “source:nature” AND (limit to 2018 to present)
        “graphpad” 9740 results
        “JMP” 789 results
        “SAS” 3240 results
        “cran r project org” OR “R statistical” 2160 results

        search “Calgary” AND “source:nature” AND (limit to 2018 to present)
        “graphpad” 63 results
        JMP 2 results
        SAS 30 results
        “cran r project org” OR “R statistical” 16 results

  4. Thank you for this excellent post!

    I was glad to see that I implement most of Jeremy’s ideas in the course that I teach, except for the statistical vignettes (but I show some TED talks and other videos about statistics. Students seem to be interested in them!). I agree with Brian’s suggestions, though I wonder whether they can actually be implemented…

    I have two follow-up questions:

    – I totally agree about teaching R. But do you think R should be the only software used, or should we provide other options as well? Many students are scared of R when they first see it, and sometimes labs with R may become a search for the lost commas that keep the code from working. 🙂 So I start my course by teaching the good use of spreadsheets for data organization and basic spreadsheet operations (pivot tables etc); then I teach software that does not require coding (I use Past, because it’s free and performs most of the things we need, except for GLMs); and for the remaining half of the course I use R, with a focus on the response~explanatory syntax. Do you think using only R would be a better approach?

    – How to include an overview of PCA and regression trees into such a course? I feel that it is a bit too advanced. But maybe not?

    • I’d recommend an R-only approach myself. I mean, here at Calgary that’s what we do! Yes, students do find it challenging to learn R coding as well as statistical principles at the same time. But they manage it. OTOH, just because something works for our students doesn’t necessarily mean it will work for your students.

      I don’t think you can get all the way to PCA and regression trees in intro biostats, even if you teach GLMs from the get-go as Brian suggests. But I dunno, maybe at some institutions you can move that fast without losing most of the students?

  5. Interesting post – thanks, both!
    Brian – if you’re covering collinearity, I hope you mention Morrissey & Ruxton’s (2018) excellent article “Multiple Regression Is Not Multiple Regressions: The Meaning of Multiple Regression and the Non-Problem of Collinearity”! doi:10.3998/ptpbio.16039257.0010.003

    (Disclaimer – one of the authors was my PhD supervisor, so I have to plug their work!)

    • Mike – that paper is my go-to paper for collinearity when teaching it and I largely agree with the message. But they will see the issue come up a lot.

  6. Dynamic Ecology has several excellent posts on setting up intro biostats courses. I have learned a lot. I am wondering what should be covered in a biostats course for graduate students in an ecology program, then? It may be more difficult to decide than for an intro-level course because there are so many worthy topics to consider.

  7. Interesting answers and comments. I don’t teach stats, but I have two thoughts. One is that I’m surprised that no one mentioned information-theoretic model selection and multi-model inference tools. These are far more relevant than null hypothesis testing in ecological research, and are becoming sufficiently widely used that it would seem to be a good thing to expose students to.

    My second thought is that if I could convey one principle to students about statistics it would be: whatever you learn in this class, or in any class, is only the beginning. You should expect to keep having to teach yourself more, and newer, statistical theory and methods throughout your career. That has certainly been true for me. When I was studying statistics, Sokal and Rohlf appeared as a brand new text. I learned how to calculate expected mean squares for fractional replicated designs. No one could do permutation tests. The bootstrap was still a thing you used to put on your shoes. And, we had to walk to school through the snow, uphill both ways.

    New stuff keeps coming along. In an academic career, you find yourself wanting to learn things that will help you answer new questions, and you have to teach yourself. I don’t have firsthand experience with a career in industry, but I assume that such a career requires reading and understanding papers that use new methods, even if your work doesn’t directly apply them. So, the comments about teaching how to think about statistics, as well as how to do (some subset of) statistics, seem right on to me.

  8. I teach intro biostats, but my approach is completely different. I am wondering if what I do could be too unusual, but I usually receive good feedback. I do not teach any kind of analysis, no t-test or ANOVA. I want to focus on understanding the principle behind null-hypothesis testing based on the frequentist approach. I start by teaching how to present data, how to summarize information, and ultimately how to interpret histograms, mean, median, and dispersion estimators. After that, I introduce the concept of probability and derive common probability mass functions for trivial binomial and Poisson “experiments”. I teach how to derive probabilities (p-values) for specific observations “given” (“assuming” would be more appropriate since there is no conditional probability) specific success probabilities and Poisson rates, and slowly implant in the students’ minds the notion of p-values as P(observing something | “assumption”). After that, I introduce PDFs (e.g., normal distribution) and introduce the idea that we can create a frequency distribution for some metrics (that measure something of interest) assuming some special condition, and based on the value of these metrics for some observed data we can calculate the probability of finding that value or a greater value assuming that special condition. I go back a little and use the specified “special condition” as a criterion to explain the meaning of permutation tests, showing the connection between permutation tests and the probability of finding specific values from the frequency distribution of some metric. I forgot to mention that I also teach Bayes’ formula when I introduce probability, with a famous example of the probability of testing positive for a rare disease. In the end, my course is about probability and the meaning of p-values. I would only feel comfortable teaching any specific analysis when teaching experimental design in conjunction, but I do not think that this would be “introductory”.
    I may have to rethink what I am doing; this is maybe too old-fashioned.
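
Two of the calculations described in this comment can be sketched in a few lines. The example below is a minimal illustration in plain Python (standard library only); the function names and all the numbers (20 trials, 16 successes, the prevalence, sensitivity and specificity) are invented for this sketch and are not taken from the commenter's course.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials with success probability p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def upper_tail_p_value(k_observed, n, p_assumed):
    """P(observing k_observed or more successes | assumed success probability)."""
    return sum(binomial_pmf(k, n, p_assumed) for k in range(k_observed, n + 1))

# Hypothetical experiment: 16 successes in 20 trials, assuming p = 0.5
p_value = upper_tail_p_value(16, 20, 0.5)
print(f"P(16 or more successes in 20 | p = 0.5) = {p_value:.4f}")

def posterior_given_positive(prevalence, sensitivity, specificity):
    """Bayes' formula: probability of having the disease given a positive test."""
    true_positive = prevalence * sensitivity
    false_positive = (1 - prevalence) * (1 - specificity)
    return true_positive / (true_positive + false_positive)

# Hypothetical rare disease: 1 in 1000 prevalence, 99% sensitivity, 95% specificity
ppv = posterior_given_positive(prevalence=0.001, sensitivity=0.99, specificity=0.95)
print(f"P(disease | positive test) = {ppv:.4f}")
```

The first calculation is a p-value in exactly the sense described above (the probability of the observation or something more extreme under an assumption); the second shows why a positive test for a rare disease can still leave the probability of disease low.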

  9. If I had infinite time I would like to revamp the calculus for life sciences curriculum in favor of some sort of “quantitative reasoning” sequence – it would cover the big ideas of calculus, linear algebra, and programming but through the lens of problem-solving. There would be a big emphasis on probabilistic thinking and why calculus is conceptually the big tool behind much of statistics and dynamic modeling. But it would skip memorizing tricks for computing tough derivatives, integrals, and series (since most folks use a computer for all that stuff these days anyway). Instead, programming would be introduced. It would have to be a year-long sequence and it would replace the math requirement for bio majors. The course would serve as the foundation for courses in biostats, theoretical ecology, and modeling in general. So many folks come to me at the start of the Ph.D. knowing very little quantitative reasoning, partly, I think, because their math courses scared them away. This I think is to some degree due to how we teach the courses in high school and first-year undergrad, where the emphasis is on formulas and procedures instead of critical thinking.

    End random tangent to the ideal biostats curriculum question. But I do think part of the issue is that the foundational classes prior to biostats are even more out of date.

  10. During my first two years of my undergrad I remember not being all that fond of stats and data analysis. Mainly because at the time I could never see how it would fit with the other stuff I was being taught at the same time as it was very much taught separately and not directly linked to all the other course material.
    Therefore, would it be better to put more statistics, data analysis and study design teaching into non-stats specific modules?

    • That’s how I learned some biostats as an undergrad. Various statistical tests were taught here and there as part of my various biology courses. It worked out fine for me in the end–but only because I learned in grad school all the stuff that undergrads learn in their dedicated biostats courses. I think it’s very hard to teach students fundamental statistical concepts when each biology course just teaches whatever statistical tests are needed in the course.

  11. First of all, I’d like to congratulate Jeremy and Brian for the blog and for the post.
    Here in Brazil, just as elsewhere in the world, this is a hot topic among ecologists, since stats is basic knowledge needed to perform ecological research, yet undergrad students especially are sometimes afraid of getting their hands on stats and programming.

    If you were going to give a one-week stats course, which of the topics raised in the post do you think is the most important for undergrad students who will probably go on to other stats courses?
    What about a two-hour class?

    Thinking about this challenge, I thought about spending a little time on the normal curve and concepts that would help them to understand hypothesis testing, such as mean, deviation and error, then going to hypothesis testing.
    What do you think of this?

    • Good questions!

      For a 1-week course, how much total class time (including any labs) are we talking about? If the students would be studying nothing but statistics all day every day for a week, you could actually cover quite a bit of material!

      For a single 2-hour class, I guess you’d just want to cover some core foundational concepts that students will need to know about no matter what other courses they go on to take. So maybe things like populations, samples, sampling error (and the idea that we can quantify it), random sampling, and sampling bias as opposed to sampling error.
