Why, and how, to do statistics (it’s probably not why and how you think)

I’ve actually been planning this post for a while, so while it might seem like a reply to Brian’s “insidious evils of ANOVA” post, it’s not, or if it is it’s only by accident. Think of it rather as part of a running theme on this blog: how to do ecology, and how to use statistics to help us do ecology.

This post is mostly philosophical, though there’s practical advice too. I hope that in saying that I didn’t immediately drive away most readers (and not just because I’m jealous of the massive audiences Brian’s wonderful posts have been drawing!) I do think good statistical practice starts from a firm conceptual/philosophical foundation, and once in a while it’s worth thinking about that philosophy explicitly. Doing statistics well requires thought and decision-making; it isn’t like following a recipe. And so you need to have a really clear and firm grasp of why you’re doing what you’re doing in order to do it well. Plus, the philosophical stuff I want to talk about in this post actually is stuff I’ve only briefly mentioned before on this blog, and have never discussed at length. So while I freely admit that that preamble is not nearly as attention-grabbing as declaring ANOVA “an insidious evil” (though I did do my best to give the post a provocative title), I hope you’ll still want to read on! 😉

I’ll start with a deliberately-basic question. One to which you may think you know the answer, and indeed one which you may think has a boringly obvious or familiar answer. I’ll then go on to suggest that the answer to this question is at least somewhat different from what you probably think it is. Here’s the question:

Why do statistics?*

Here’s a broad-brush answer: to obtain reliable scientific knowledge about the world, despite uncertainty and the threat of error. We never have perfect, exhaustive information. So if as scientists we’re going to separate truth from falsehood, figure out how the world is and how it isn’t, we’re going to need to make inferences. And we need help to make those inferences, we can’t just make them unaided. Often, we can’t even keep all the relevant information in our heads at once, much less make inferences from it. Plus, we humans are awfully good at fooling ourselves–at seeing “patterns” that are really just noise, at interpreting the data in an unconsciously-biased way so as to conform to our own preconceptions, etc. Statistics is a set of techniques for taking imperfect information (i.e. data that may not perfectly mirror how the world actually is), and using it to make reliable inferences about how the world is or isn’t.

Of course, that broad-brush answer needs an awful lot of fleshing out, and different people have fleshed it out in different ways. This post is a plug for one particular way of fleshing out that broad-brush answer, which is probably somewhat (though not entirely) different from the way you were taught, or that you teach to your own students: philosopher Deborah Mayo’s way. She lays out her view, which she calls “error statistics”, at length in her book, Error and the Growth of Experimental Knowledge (follow the link to read it for free online!) I read that book in grad school, on Bob Holt’s recommendation, and it had a big influence on me, so I highly recommend it to everyone. But if for whatever reason you don’t want to take on that whole book, Deborah Mayo has a much shorter and quite accessible 2011 article, co-authored with statistician Ari Spanos, summarizing many important aspects of her views, particularly in the context of the ordinary, everyday applications of statistics. Mayo is no head-in-the-clouds, totally-impractical, never-met-an-actual-scientist philosopher–she’s very much out to understand, and improve, how actual scientists do actual science (and get her fellow philosophers to come down from the clouds and start doing the same). So she has a lot to say that practicing scientists will find well worth thinking about. What follows are some insights from the Mayo and Spanos article, to give you some of the flavor, and encourage you to read the whole thing (there are many bits of the article that I’m omitting).

Drawing a substantive scientific conclusion often requires lots of statistical tests, for learning about, and from, different errors. This is a point Mayo and Spanos actually don’t talk about at great length, but which Mayo talks about more in her book. It’s rarely (not never, but rarely) the case in science that a single experiment, analyzed with a single statistical test, can justify drawing a substantive scientific conclusion. More commonly, we have to build up to a substantive scientific inference in piecemeal fashion. Making different observations or doing different experiments to check different assumptions or test different predictions of some hypothesis. Running different controls so as to rule out different artifacts. Even activities as humble as calibrating our instruments. All of that and much more goes into identifying, quantifying, ruling out, and learning from different sorts of errors, so that we can ultimately draw some well-justified, substantive scientific conclusion. What Mayo calls “error statistics” is both an overarching philosophy of science, and the various statistical methods, principles, and approaches that help us pursue science according to that overarching philosophy. Statistics, on Mayo’s philosophy, is a set of tools for “making errors talk”.

In ecology, I wonder about how good we are (individually and collectively) at this sort of piecemeal process of building up to a substantive and well-justified conclusion. I worry that, because ecology is hard (ecosystems are hugely complicated, they change over time, no two are exactly alike, etc. etc.), we’re too often willing to settle for getting partway there. We too often do something that’s “useful” or “suggestive” or “gives part of the story”. Which is fine as far as it goes. But then instead of building on that by doing the (often more difficult) things one would need to do to really nail down the conclusion, we move on to doing “useful” or “suggestive” work that partly addresses some other question. Perhaps consoling ourselves with the thought that, well, ecology is hard, and getting partway to a really well-justified answer is better than not getting anywhere, and maybe somebody else will be able to really nail things down in the future. But then again, it’s not as if I can actually quantify how good we as a field are at really nailing down our conclusions (how would you quantify it, anyway?), or how good we could be if we somehow changed how we operate. So my worry here probably doesn’t amount to much more than one person’s grouchiness. Take it (or leave it!) for what it’s worth.

The severity principle. Just saying that “we do statistics to obtain reliable knowledge despite uncertainty and the threat of error” is a useful starting point, but far from a specific, concrete guide to practice. So precisely how does statistics help us obtain reliable knowledge, help us build our way up to substantive scientific conclusions? Mayo’s answer to that question is broadly Popperian in spirit, but it differs from Popper in many important ways (so Brian, before you start channeling your inner Lakatos, keep reading!) In particular, a strict Popperian would insist that we can only infer what’s false, not what’s true. We can learn how the world isn’t, but not how it is. In contrast, Mayo has a much richer and more positive view of learning from error, insisting that we can infer the truth. There are lots of different kinds of errors, which we test for, quantify, control, and learn from in different ways, but underpinning all of that methodological hodgepodge is a big general principle of scientific and statistical inference, which she calls the “severity principle”:

Data x produced by process G provides good evidence for hypothesis H just to the extent that test T “severely passes” H with data x. Test T severely passes H with data x if x accords with H, and if, with very high probability, test T would have produced a result that accords less well with H than x does if H were false. 

In other words, if you’ve looked for something (here, an error in hypothesis H, though it could be something like an error in a parameter estimate) and not found it in your data, and if you quite probably would have found it if it were there to find, you’re entitled to infer that it’s not there. And if through a series of appropriate tests, you systematically rule out all the possible errors, then eventually you’re entitled to infer that there are no more errors to find—i.e. that H is true, not merely “not rejected”.
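
To make the severity calculation concrete, here is a minimal numerical sketch (in Python, with invented numbers) of the sort of textbook case Mayo and Spanos use: a one-sided test of a normal mean with known standard deviation. Everything here (the values of mu0, sigma, n, and the observed sample mean) is my own illustrative assumption, not an example taken from their paper.

    # A minimal sketch (invented numbers) of the severity calculation for a
    # one-sided test of a normal mean with known sigma.
    # Test T: H0: mu <= mu0  vs.  H1: mu > mu0, based on the sample mean xbar.
    import numpy as np
    from scipy.stats import norm

    mu0, sigma, n = 0.0, 2.0, 25   # hypothetical null value, sd, and sample size
    se = sigma / np.sqrt(n)        # standard error of the sample mean
    xbar_obs = 0.9                 # hypothetical observed sample mean

    z_obs = (xbar_obs - mu0) / se
    print(f"z = {z_obs:.2f}, one-sided p = {1 - norm.cdf(z_obs):.3f}")  # z = 2.25, p ~ 0.012, so H0 is rejected

    # How severely does the further claim "mu > mu1" pass? It is the probability
    # that the test would have produced a result according LESS well with that
    # claim (i.e. a smaller xbar) if mu were only mu1.
    def severity_mu_greater_than(mu1, xbar=xbar_obs):
        return norm.cdf((xbar - mu1) / se)

    for mu1 in [0.0, 0.2, 0.5, 0.8, 1.0]:
        print(f"SEV(mu > {mu1:3.1f}) = {severity_mu_greater_than(mu1):.3f}")

The same statistically significant result warrants “mu > 0” with severity around 0.99, but “mu > 0.8” with severity only around 0.60: which inferences pass severely depends on the data, the test, and the particular claim being inferred, not just on whether p < 0.05.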

Perhaps this principle sounds obvious or even trivial to you. In the remainder of the post, I hope to convince you that the severity principle isn’t trivial or obvious, at least in terms of its consequences for our statistical practice.

The severity principle is a broad principle of scientific inference, but it also applies in the narrower context of specific statistical tests (again, statistical inferences are just a means to the end of scientific inferences). In particular, Mayo and Spanos do a really nice job of showing how doing frequentist statistical inference the way it ought to be done (which is not always the way it’s usually taught!) is a matter of obeying the severity principle, on which more below.

Mayo and Spanos also do a nice job of explaining how the severity of a statistical test is related to, but distinct from, more familiar properties of tests. For instance, the severity of a statistical test is not the same thing as its power. Here’s an analogy from Mayo and Spanos: the smaller the mesh of a fish net (i.e. the more powerful the test), the more capable it is of catching even very small fish. So given the information that (i) a fish was caught, and (ii) the net is highly capable of catching even 1 inch fish, we should not infer that, say, a fish at least 9 inches long was caught. If the mesh were larger (i.e. the test less powerful), then the inference that a fish at least 9 inches long was caught actually would be more reliable (i.e. the test of the hypothesis would be more severe). Nor is the severity of a test the same thing as either the standard Type I or Type II error rate.
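
To put the fish net analogy in numbers, here is a toy continuation of the earlier sketch (again with invented numbers, not an example from Mayo and Spanos), contrasting the pre-data power of a test with the post-data severity of an inference drawn from a particular result of that test.

    # Power vs. severity for the same one-sided z-test of H0: mu <= 0 (toy numbers).
    # A very powerful test (a fine-meshed net) that only just barely rejects does
    # NOT severely warrant inferring a big discrepancy (a big fish).
    import numpy as np
    from scipy.stats import norm

    sigma, alpha = 2.0, 0.05
    z_crit = norm.ppf(1 - alpha)   # rejection cutoff for the z statistic

    def power(mu1, n):
        """Pre-data property of the test: P(reject H0) if the true mean were mu1."""
        se = sigma / np.sqrt(n)
        return 1 - norm.cdf(z_crit - mu1 / se)

    def severity(mu1, xbar, n):
        """Post-data: P(a result according less well with "mu > mu1" than xbar) if mu = mu1."""
        se = sigma / np.sqrt(n)
        return norm.cdf((xbar - mu1) / se)

    n_big = 2500                                          # a very powerful test...
    xbar_barely = z_crit * sigma / np.sqrt(n_big) + 1e-4  # ...that only just rejects
    print(power(0.5, n_big))                  # ~1.0: the net can catch even tiny fish
    print(severity(0.5, xbar_barely, n_big))  # ~0.0: this catch does not warrant "mu > 0.5"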

As I was drafting this post, it occurred to me that the severity principle actually is implicit in a lot of things I’ve written on this blog. Many of my posts question how severely certain ecological ideas have been tested, and whether certain popular research approaches can provide severe tests of our ideas. For instance, this post argues that a popular approach in phylogenetic community ecology actually does not provide a severe test of the hypotheses it purportedly tests, this post points out that testing for curvilinear local-regional richness relationships is not a severe test of the strength of local species interactions, and this post suggests that we too often compromise severity by neglecting to test the assumptions underpinning our predictions, as well as the predictions themselves. And I’m far from the only one who’s raised issues of severity or “well-testedness” in ecology, albeit without actually using Mayo’s terminology. In various posts and exchanges of comments, my fellow blogger Brian and I have had good-natured debates about how to severely test claims at ‘macroecological’ scales. In his own papers, Brian has argued strongly that fitting neutral and non-neutral models to the species abundance distribution does not provide a severe test of neutral models, because both neutral and non-neutral models predict similarly-shaped species abundance distributions. Siepielski and McPeek (2010) recently pointed out that, for all that community ecologists talk about coexistence, no one has ever gone out to any system and systematically and severely tested all the different classes of mechanism that might allow a set of species to coexist. Ranging outside of ecology, biologist and blogger Rosie Redfield famously made the case that the claim of arsenic-based microbial life had not been severely tested—conventional life forms as well as arsenic-based ones could pass the tests, meaning that the tests weren’t severe (as subsequent research confirmed). I could keep going, but you get the idea. At some level I suspect most every practicing scientist agrees with the severity principle or something very close to it. But as the examples I’ve just listed illustrate, in practice it’s not at all uncommon for our science to violate the severity principle.

But I don’t want to be purely negative. So just off the top of my head, here are some excellent ecology papers that I think exemplify an “error statistical” approach to science: building up to a severely-tested, substantive ecological conclusion in piecemeal fashion by systematically identifying, quantifying, learning from, and ruling out different sorts of errors. Harrison (1995) is a wonderful reanalysis of a series of classic predator-prey microcosm experiments by Leo Luckinbill, showing which features of the data can and cannot be explained by each of a series of predator-prey models of increasing biological complexity. It’s a nice positive example of learning from error, that doesn’t just use mismatches between data and model to reject the model. Rather, mismatches between the data and the initial model are used to guide the development of a modified model, which is then tested by comparing its predictions to other data, and so on. Holyoak and Lawler (1996) is a wonderful set of microcosm experiments identifying the mechanisms by which spatial patchiness enhances the persistence of an extinction-prone predator-prey interaction. Together, the experiments comprise a severe test that rules out certain mechanisms, and rules in others. Adler et al. (2006) is a good example of testing assumptions as well as predictions in order to make a severely-tested ecological inference, in their case about how species coexistence depends on environmental fluctuations. That paper also provides an illustration of severe testing without experimental data, and in the context of parameter estimation rather than hypothesis testing. My fellow blogger Megan Duffy and her collaborator Spencer Hall have published a series of papers identifying and testing different mechanisms governing the course of disease outbreaks in Daphnia populations using a combination of observations, experiments, and mathematical modeling to home in on the truth (Duffy and Hall 2008 is one entry point into this work). I don’t know if they yet consider themselves to have a fully worked-out and severely-tested story, but they’re clearly working their way towards that. The work of Bill Murdoch, Cheri Briggs, and colleagues, systematically testing and ruling out the various mechanisms that might explain how a parasitoid controls an important crop pest (California red scale) at a low yet stable density, is a terrific, classic example in population ecology. This work is reviewed and brought to a conclusion in Murdoch et al. (2005). While hardly in the same class as any of those papers (and many others I haven’t mentioned), Fox and Barreto (2006) is an attempt I made to systematically test various hypotheses for how competing protists coexist—which ended up ruling out all the possibilities I could think of and leaving me stumped!

A great summary of the logic of frequentist statistical inference. Most all of us have done frequentist statistical tests. But why, exactly? Precisely what’s the logic of those tests? Don’t be so sure you know! Mayo and Spanos consider the simple example of weighing a man named George at two different times to test whether his weight gain over that time is no more than some positive value X. At each time, we weigh George on each of several well-calibrated and very sensitive scales, all of which register no weight gain. From that, we’d infer that George’s weight gain was less than X. But why is that a justified inference? Is it because our weighing procedure would very rarely make mistakes in a long series of trials? That is, are we arguing that, because our procedure rarely errs in the long run, we can safely assume it hasn’t erred in this particular case? That’s the standard answer, I think, but Mayo and Spanos argue that that answer is wrong, or at least incomplete. A low long-run error rate may be a necessary condition for reliability of our inference in this particular case, but it’s not sufficient. Rather, the reason this particular inference (that George hasn’t gained X pounds) is reliable is what philosophers call the “argument from coincidence”. What we’re arguing is that, had George gained at least X pounds, it would be an amazing coincidence if none of our scales—all of which are sensitive and well-calibrated—had registered it. It would be an amazing coincidence for all of our scales to systematically mislead us about George’s weight in this particular case, even though they don’t systematically mislead us at other times (e.g., when weighing objects of known weight). What properly-used frequentist statistical tests do is quantify how strong our arguments from coincidence are, and precisely what inferences our arguments from coincidence do or do not (severely) warrant.
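
To get a feel for how strong that argument from coincidence can be, here is a back-of-the-envelope sketch. All of the numbers (the measurement error of each scale, the threshold for “registers no gain”, a hypothetical true gain of 4 lb, the number of scales) are invented purely for illustration; the point is just how tiny the “coincidence” probability becomes.

    # Back-of-the-envelope sketch of the argument from coincidence in the George
    # example (all numbers invented). Suppose each scale's reading has independent
    # error with sd 0.5 lb, and "registers no gain" means the before/after
    # difference it reports is under 1 lb. If George had really gained 4 lb,
    # how often would ALL of several such scales register no gain?
    from scipy.stats import norm

    sd_diff = 0.5 * 2 ** 0.5          # sd of a before-minus-after difference on one scale
    true_gain, threshold, k = 4.0, 1.0, 3

    p_one_scale_misses = norm.cdf(threshold, loc=true_gain, scale=sd_diff)
    p_all_miss = p_one_scale_misses ** k   # assumes the scales err independently
    print(p_one_scale_misses)   # ~1e-05: one sensitive scale missing a 4 lb gain
    print(p_all_miss)           # ~1e-15: every scale missing it would be an astonishing coincidence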

How the severity principle helps us avoid common fallacies and mistakes. Certain misunderstandings and mistaken applications of frequentist statistics are very common, which leads to calls for scientists to use other forms of statistical inference, and calls to reform frequentist statistical practice. Brian raised one common complaint (a mindless focus on rejecting null hypotheses at the P<0.05 level) in his recent post. Mayo and Spanos agree that these misunderstandings and mistakes are common, but argue powerfully that a focus on severity is the proper way to address them. So if like most scientists you’ve complained, or heard others complain, that we shouldn’t just mindlessly focus on rejecting null hypotheses at the P<0.05 level, or that frequentist statistical tests treat all statistically-significant results the same, or that P-values don’t tell you about effect sizes, or that confidence intervals are more informative than P-values, or that “overly powerful” tests can detect trivial discrepancies from the null, or that the null is always false, or that statistically-insignificant results are used to infer that the null hypothesis is true, or etc., you really ought to read this paper. Mayo and Spanos show how the severity principle addresses all of these oft-repeated complaints and more besides.
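
To give one concrete illustration of the flavor of those answers (my own toy example, not one taken from the paper), consider the complaint that statistically non-significant results get read as “the null is true”. On the severity view, the question is how severely the claim “any effect is no bigger than such-and-such” has passed, and that depends on the data and the test, not just on p > 0.05.

    # Toy sketch: two non-significant results from the same one-sided z-test of
    # H0: mu <= 0 that differ enormously in how severely they warrant "mu <= 0.2".
    # All numbers are invented for illustration.
    import numpy as np
    from scipy.stats import norm

    sigma = 2.0

    def severity_mu_less_than(mu1, xbar, n):
        """Given a non-rejection, how severely does "mu <= mu1" pass?
        = P(the test would have produced a LARGER xbar) if mu were mu1."""
        se = sigma / np.sqrt(n)
        return 1 - norm.cdf((xbar - mu1) / se)

    # Small, noisy study: xbar = 0.5 with n = 9 is not significant (z = 0.75), but
    # "mu <= 0.2" passes with severity only ~0.33, so there is no license to call
    # the effect negligible.
    print(severity_mu_less_than(0.2, xbar=0.5, n=9))
    # Large study: xbar = 0.02 with n = 10000 is also not significant, and now
    # "mu <= 0.2" passes with severity ~1.0: a modest effect was probed for and not found.
    print(severity_mu_less_than(0.2, xbar=0.02, n=10000))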

p.s. This post is about explicating Mayo’s idea of “error statistics”, not about Bayesian vs. frequentist statistics. You can use Bayesian methods and do “error statistics” well (e.g., as Andrew Gelman does), and as noted in the post you can use frequentist methods and still do “error statistics” badly. Bayesian vs. frequentist statistics has been debated on this blog before and I do not want to go over that ground again. So in the comments please don’t veer off into debating Bayesian vs. frequentist stats.

*Protip for grad students preparing for their candidacy exams: You should be able to confidently and succinctly answer this sort of question, and be able to elaborate if asked. Handling this sort of seemingly-easy, fundamental question with aplomb is a good way to impress your committee.

22 thoughts on “Why, and how, to do statistics (it’s probably not why and how you think)”

  1. Jeremy, great post. I’ve been intrigued as I’ve heard you talk about Mayo’s work. I find myself in very substantial agreement. I like the emphasis on severity of tests, and that severity often spans across multiple studies. As you note, I have written multiple papers arguing that the early tests of neutral theory were very non-severe, as are many tests of macroecological theory. It is really important.

    A couple of questions:
    1) Severe tests seem a lot like my favorite philosopher Lakatos’s idea of making successful predictions that are novel, bold, and risky. Also the idea that you can actually, eventually, say something is true, not merely that it has failed to be falsified. And the idea that this process extends over an entire body of research. Does anybody make these links?
    2) While I agree that frequentist statistics clearly have a link to making severe tests, it seems that, when applied simplistically, they will let people off the hook by letting them think a good p-value is severe, whereas severity has much more to do with how many other theories make the same prediction, etc.
    3) How does testing multiple models simultaneously factor into this? This is, in my mind, an important feature of strong scientific inference. A null model is a step in the right direction, but simultaneously contrasting genuine models is even stronger. Jarrett also mentioned this in his comments on my post yesterday.
    And a comment – I totally agree that we get too caught up with the methods and numbers of statistics and lose the philosophical context. This is a theme in a lot of both of our posts. And you are right that it is those really basic, seemingly obvious questions that often trip people up during comps!

    • Glad you liked the post Brian. In response to your questions:

      1) In her book, Mayo does talk at least briefly about Lakatos, but it’s been long enough since I read the book that I don’t remember what she has to say.
      2) Yes, one can certainly have a low p-value and yet have it not constitute a severe test of some hypothesis. And yes, severe testing of substantive scientific theories often is a matter of comparing alternative theories, which may make some of the same predictions. Mayo would also emphasize that testing substantive scientific theories, whether a single one or several competing ones, typically is a matter of doing lots of different tests directed at all sorts of different errors. In her book, she has a couple of very detailed historical discussions to illustrate how this is done. One of them is Perrin’s studies of Brownian motion: how he built up from very simple observations and experiments, testing narrowly-defined and frankly rather boring empirical hypotheses, to a body of evidence that could be considered to severely warrant substantive scientific theories about the nature of matter and heat.
      3) Mayo is quick to emphasize that severely testing a null hypothesis is not the same thing as severely testing some other hypothesis. A severe test of the hypothesis that some parameter equals zero may well not be a severe test of the hypothesis that it equals, say, 0.4. More broadly, a severe test of, say, whether the relationship between two variables is linear may well not be a severe test of whether the relationship is, say, quadratic. A body of evidence that severely tests the substantive ecological claim that diversity in some lake is maintained by source-sink dynamics may not severely test the substantive ecological claim that diversity in that lake is maintained by environmental fluctuations that permit a storage effect. Etc. Does that answer your question?

    • p.s. On further reflection re: severity and its connection to Lakatos’ notion of novel, bold, risky predictions, I don’t know that they’re really the same. Mayo’s notion of severity is all about well-testedness. A bold, novel, risky prediction may or may not be one that is, or can be, well-tested. I can sort of imagine how one might attempt to draw such a connection. The idea is that we should be particularly impressed if a theory really “sticks its neck out” and yet survives the “guillotine” of testing, and so be very inclined to infer the truth of a theory that passes such a test. That is, the idea is that the “severity” of a test is not just a function of our testing procedure, the data, and the match between the data and the hypothesis, severity also depends on the “riskiness” or “boldness” of the hypothesis itself. I’m not sure that I buy that idea, and I don’t know if Mayo would either. I can certainly imagine reasons why we might want to have “bold” hypotheses, independent of how well-tested they are or could be. But to say that the “well-testedness” of a hypothesis depends in part on the “boldness” of the hypothesis? Sounds a little fishy to me…

      For instance, say Einsteinian relativity makes incredibly bold and risky prediction P, and we go and severely test that claim and find it to be true. I think Mayo would say, and I’d agree, that that in itself doesn’t constitute a severe test of Einsteinian relativity. You haven’t fully probed the theory, you haven’t severely tested all the ways it could be wrong. Making one true, “bold” prediction isn’t a sort of “doctor’s excuse” that lets you get out of being tested in other ways. Put differently, making one true prediction does not warrant the *assumption* that your other predictions are true, not even if the prediction is a really bold one. Making a true, bold prediction isn’t “indirect evidence” of the truth of your other predictions. Or would you (or Lakatos) say it is? (As I’ve noted in other posts, there are cases in science where we do draw on indirect evidence…)

      • I think Lakatos (and I) would say that a single successful bold prediction is not the endpoint. Lakatos definitely emphasizes total weight of evidence.

        Can we maybe agree that a good model:
        1) Makes multiple predictions
        2) Makes bold/risky/novel predictions
        3) Makes predictions that are well testable
        If such a model then survives the multiple tests, we can be really sure we’re on to something?

  2. Thanks for this Jeremy. I only gave it the briefest of glances and am looking forward to reading it closely asap.

    Just one comment. I have seen this idea:

    “…we humans are awfully good at…seeing “patterns” that are really just noise, at interpreting the data in an unconsciously-biased way so as to conform to our own preconceptions, etc”

    repeated so many times now (not by you, just generally) that I am becoming increasingly suspicious of its veracity – or at least of whether it’s as common/ubiquitous/important as people generally seem to portray it. I don’t know where the origin of this idea rests (probably somewhere old and cryptic, I guess). One could make a very strong argument, I think, that such a trait would be highly counter-productive for any species, causing it to frequently draw incorrect conclusions about the nature of its environment, thereby wasting time and energy in many possible small and large ways.

    • Hi Jim,

      The broad topic of cognitive biases and heuristic reasoning (heuristics being “shortcuts” that may work well in general but which mislead in specific circumstances) is the subject of an enormous body of psychology research. While I’m hardly an expert on that literature, it does exist. This point isn’t oft-repeated simply because everybody’s just repeating what everyone else has said, with no actual data to back it up. Duncan Watts’ book, Everything is Obvious (Once You Know the Answer) is one entry point into this literature which I’ve read (there are many other possible entry points).

      I think my remark about heuristics addresses your question about why cognitive biases haven’t been selected against. I freely admit I’m just waving my arms here, but I’d say that there’s great selective value in coming to quick decisions, and that means using heuristics. That those heuristics work in many contexts (which is why they’re favored by selection) doesn’t mean they can’t break down in other contexts. Also, the fitness costs of failing to detect certain sorts of patterns or signals (e.g., the hint of a tiger hiding in the grass, about to pounce on you) may in many contexts be much higher than the fitness costs of seeing patterns and signals that aren’t really there, which would favor a bias towards over-detecting patterns. Again, I’m just speculating here, but I don’t think these are totally implausible speculations.

      • Thanks Jeremy; I apologize for firing off a question without having read the whole piece. Yes, I assumed the standard explanation for the phenomenon would involve differential fitness costs of the two possible, opposite types of errors that an organism could make. Another important issue is being clear on the difference between the generating process and the result. Just because the process is stochastic does not mean that there is not in fact a very real pattern generated in certain instances. But anyway, I need to read your piece first before raising more issues.


  3. I’ve been meaning to post on this since the day it appeared, but such is the end-of-semester madness that I’m only getting to it today. You said that Spencer and I “have published a series of papers identifying and testing different mechanisms governing the course of disease outbreaks in Daphnia populations using a combination of observations, experiments, and mathematical modeling to home in on the truth (Duffy and Hall 2008 is one entry point into this work). I don’t know if they yet consider themselves to have a fully worked-out and severely-tested story, but they’re clearly working their way towards that.”

    Do we consider that we have the story fully worked out? Definitely not. We have been accused by some of having too many hypotheses, but I also think we’ve done our best to test and rule some out. Something that I think will be interesting will be to see how our hypotheses hold up to a new set of field sites here in SE Michigan.

    Another thing that I find interesting is that Spencer and I frequently have different preferred explanations for a given pattern — for example, I will think evolution is the key driver of the pattern, whereas Spencer will think it is temperature. I think the reason we work well together is that, before we ever get to the point of trying to convince a reviewer or reader, we have to convince the other (and neither of us is easily convinced).

    • Thanks for the background Meg, that’s really interesting. Sounds like you and Spencer hit what I think of as the “collaborator sweet spot”: similar enough in terms of how you think to collaborate productively, but different enough so that you complement one another and even push each other.

      I’d like to think that Dave Vasseur and I work together well for similar reasons, but the truth is that he’s just super-smart and I often wonder if I really bring all that much to the collaboration!

      Re: having too many hypotheses, I think a few years ago at the ESA I asked Spencer a question about that after his talk. It was a good-natured question, though. I agree it will be very interesting to see whether what’s going on in SE Michigan is the same as at your existing field sites.

