I’ve actually been planning this post for a while, so while it might seem like a reply to Brian’s “insidious evils of ANOVA” post, it’s not, or if it is it’s only by accident. Think of it rather as part of a running theme on this blog: how to do ecology, and how to use statistics to help us do ecology.
This post is mostly philosophical, though there’s practical advice too. I hope that in saying that I didn’t immediately drive away most readers (and not just because I’m jealous of the massive audiences Brian’s wonderful posts have been drawing!). I do think good statistical practice starts from a firm conceptual/philosophical foundation, and once in a while it’s worth thinking about that philosophy explicitly. Doing statistics well requires thought and decision-making; it isn’t like following a recipe. And so you need to have a really clear and firm grasp of why you’re doing what you’re doing in order to do it well. Plus, the philosophical stuff I want to talk about in this post actually is stuff I’ve only briefly mentioned before on this blog, and have never discussed at length. So while I freely admit that this preamble is not nearly as attention-grabbing as declaring ANOVA “an insidious evil” (though I did do my best to give the post a provocative title), I hope you’ll still want to read on! 😉
I’ll start with a deliberately-basic question. One to which you may think you know the answer, and indeed one which you may think has a boringly obvious or familiar answer. I’ll then go on to suggest that the answer to this question is at least somewhat different from what you probably think it is. Here’s the question:
Why do statistics?*
Here’s a broad-brush answer: to obtain reliable scientific knowledge about the world, despite uncertainty and the threat of error. We never have perfect, exhaustive information. So if as scientists we’re going to separate truth from falsehood, figure out how the world is and how it isn’t, we’re going to need to make inferences. And we need help to make those inferences; we can’t just make them unaided. Often, we can’t even keep all the relevant information in our heads at once, much less make inferences from it. Plus, we humans are awfully good at fooling ourselves: at seeing “patterns” that are really just noise, at interpreting the data in an unconsciously-biased way so as to conform to our own preconceptions, etc. Statistics is a set of techniques for taking imperfect information (i.e. data that may not perfectly mirror how the world actually is), and using it to make reliable inferences about how the world is or isn’t.
Of course, that broad-brush answer needs an awful lot of fleshing out, and different people have fleshed it out in different ways. This post is a plug for one particular way of fleshing out that broad-brush answer, which is probably somewhat (though not entirely) different from the way you were taught, or that you teach to your own students: philosopher Deborah Mayo’s way. She lays out her view, which she calls “error statistics”, at length in her book, Error and the Growth of Experimental Knowledge (follow the link to read it for free online!). I read that book in grad school, on Bob Holt’s recommendation, and it had a big influence on me, so I highly recommend it to everyone. But if for whatever reason you don’t want to take on that whole book, Deborah Mayo has a much shorter and quite accessible 2011 article, co-authored with statistician Aris Spanos, summarizing many important aspects of her views, particularly in the context of the ordinary, everyday applications of statistics. Mayo is no head-in-the-clouds, totally-impractical, never-met-an-actual-scientist philosopher; she’s very much out to understand, and improve, how actual scientists do actual science (and get her fellow philosophers to come down from the clouds and start doing the same). So she has a lot to say that practicing scientists will find well worth thinking about. What follows are some insights from the Mayo and Spanos article, to give you some of the flavor, and encourage you to read the whole thing (there are many bits of the article that I’m omitting).
Drawing a substantive scientific conclusion often requires lots of statistical tests, for learning about, and from, different errors. This is a point Mayo and Spanos actually don’t talk about at great length, but which Mayo talks about more in her book. It’s rarely (not never, but rarely) the case in science that a single experiment, analyzed with a single statistical test, can justify drawing a substantive scientific conclusion. More commonly, we have to build up to a substantive scientific inference in piecemeal fashion: making different observations or running different experiments to check different assumptions or test different predictions of some hypothesis, running different controls so as to rule out different artifacts, even doing things as humble as calibrating our instruments. All of that and much more goes into identifying, quantifying, ruling out, and learning from different sorts of errors, so that we can ultimately draw some well-justified, substantive scientific conclusion. What Mayo calls “error statistics” is both an overarching philosophy of science, and the various statistical methods, principles, and approaches that help us pursue science according to that overarching philosophy. Statistics, on Mayo’s philosophy, is a set of tools for “making errors talk”.
In ecology, I wonder about how good we are (individually and collectively) at this sort of piecemeal process of building up to a substantive and well-justified conclusion. I worry that, because ecology is hard (ecosystems are hugely complicated, they change over time, no two are exactly alike, etc. etc.), we’re too often willing to settle for getting partway there. We too often do something that’s “useful” or “suggestive” or “gives part of the story”. Which is fine as far as it goes. But then instead of building on that by doing the (often more difficult) things one would need to do to really nail down the conclusion, we move on to doing “useful” or “suggestive” work that partly addresses some other question. Perhaps consoling ourselves with the thought that, well, ecology is hard, and getting partway to a really well-justified answer is better than not getting anywhere, and maybe somebody else will be able to really nail things down in the future. But then again, it’s not as if I can actually quantify how good we as a field are at really nailing down our conclusions (how would you quantify it, anyway?), or how good we could be if we somehow changed how we operate. So my worry here probably doesn’t amount to much more than one person’s grouchiness. Take it (or leave it!) for what it’s worth.
The severity principle. Just saying that “we do statistics to obtain reliable knowledge despite uncertainty and the threat of error” is a useful starting point, but far from a specific, concrete guide to practice. So precisely how does statistics help us obtain reliable knowledge, and help us build our way up to substantive scientific conclusions? Mayo’s answer to that question is broadly Popperian in spirit, but it differs from Popper in many important ways (so Brian, before you start channeling your inner Lakatos, keep reading!). In particular, a strict Popperian would insist that we can only infer what’s false, not what’s true. We can learn how the world isn’t, but not how it is. In contrast, Mayo has a much richer and more positive view of learning from error, insisting that we can infer the truth. There are lots of different kinds of errors, which we test for, quantify, control, and learn from in different ways, but underpinning all of that methodological hodgepodge is a big general principle of scientific and statistical inference, which she calls the “severity principle”:
Data x produced by process G provides good evidence for hypothesis H just to the extent that test T “severely passes” H with data x. Test T severely passes H with data x if x accords with H, and if, with very high probability, test T would have produced a result that accords less well with H than x does if H were false.
In other words, if you’ve looked for something (here, an error in hypothesis H, though it could be something like an error in a parameter estimate) and not found it in your data, and if you quite probably would have found it if it were there to find, you’re entitled to infer that it’s not there. And if through a series of appropriate tests, you systematically rule out all the possible errors, then eventually you’re entitled to infer that there are no more errors to find—i.e. that H is true, not merely “not rejected”.
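To make that a bit more concrete, here is a minimal sketch in Python of how severity might be computed in one very simple case: a one-sided test about a normal mean with known standard deviation. The setup, the function name `severity_greater`, and all of the numbers are my own illustrative assumptions, not something taken from Mayo's text. For the claim "mu > mu1", severity is taken to be the probability of a sample mean no larger than the one observed if mu were only mu1.

```python
from math import sqrt
from statistics import NormalDist


def severity_greater(xbar, mu1, sigma, n):
    """Severity with which the claim 'mu > mu1' passes, given observed mean xbar.

    Assumes n independent draws from a normal distribution with known
    standard deviation sigma (an illustrative setup, not a general recipe).
    """
    se = sigma / sqrt(n)  # standard error of the sample mean
    # Probability of a sample mean no larger than xbar if mu were exactly mu1,
    # i.e. of a result that accords less well with 'mu > mu1' than xbar does.
    return NormalDist().cdf((xbar - mu1) / se)


# Hypothetical data: observed mean 0.4, sigma = 2, n = 100 (standard error 0.2).
for mu1 in (0.0, 0.2, 0.4, 0.6):
    sev = severity_greater(0.4, mu1, 2.0, 100)
    print(f"claim mu > {mu1:.1f}: severity = {sev:.3f}")
```

With these made-up numbers the same data pass "mu > 0" with high severity but pass "mu > 0.6" with low severity. What the data warrant depends on the specific claim being inferred, not just on whether some result was obtained.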
Perhaps this principle sounds obvious or even trivial to you. In the remainder of the post, I hope to convince you that the severity principle isn’t trivial or obvious, at least in terms of its consequences for our statistical practice.
The severity principle is a broad principle of scientific inference, but it also applies in the narrower context of specific statistical tests (again, statistical inferences are just a means to the end of scientific inferences). In particular, Mayo and Spanos do a really nice job of showing how doing frequentist statistical inference the way it ought to be done (which is not always the way it’s usually taught!) is a matter of obeying the severity principle, on which more below.
Mayo and Spanos also do a nice job of explaining how the severity of a statistical test is related to, but distinct from, more familiar properties of tests. For instance, the severity of a statistical test is not the same thing as its power. Here’s an analogy from Mayo and Spanos: the smaller the mesh of a fish net (i.e. the more powerful the test), the more capable it is of catching even very small fish. So given the information that (i) a fish was caught, and (ii) the net is highly capable of catching even 1-inch fish, we should not infer that, say, a fish at least 9 inches long was caught. If the mesh were larger (i.e. the test less powerful), then the inference that a fish at least 9 inches long was caught actually would be more reliable (i.e. the test of the hypothesis would be more severe). Nor is the severity of a test the same thing as either the standard Type I or Type II error rate.
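The fish net analogy can be sketched numerically. The code below is a toy illustration under my own assumptions (a one-sided test of "mu <= 0" with known sigma, a rejection cutoff of 1.96 standard errors, and hypothetical values for sigma and for the discrepancy `delta`); the function name `power_and_severity` is made up for this example. It compares, for increasing sample sizes, the power of the test against mu = delta with the severity of inferring "mu > delta" from a result that is only just significant.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()   # standard normal distribution
CUTOFF = 1.96      # assumed rejection threshold, in standard-error units


def power_and_severity(delta, sigma, n):
    """For a test of 'mu <= 0' vs 'mu > 0' with known sigma, return
    (power against mu = delta,
     severity of inferring 'mu > delta' from a just-significant result)."""
    se = sigma / sqrt(n)
    # Power: chance of rejecting when the true mean is delta.
    power = 1.0 - Z.cdf(CUTOFF - delta / se)
    # A just-significant sample mean sits exactly CUTOFF standard errors above 0.
    xbar = CUTOFF * se
    # Severity: chance of a sample mean no larger than xbar if mu were delta.
    severity = Z.cdf((xbar - delta) / se)
    return power, severity


# Hypothetical values: delta = 0.2, sigma = 2, and a growing sample size
# (a "finer mesh").
results = {n: power_and_severity(0.2, 2.0, n) for n in (25, 100, 400)}
for n, (power, severity) in results.items():
    print(f"n = {n:3d}: power = {power:.3f}, severity = {severity:.3f}")
```

In this setup, as the sample size grows the test becomes more powerful against delta, but a barely significant result becomes weaker evidence that the discrepancy is at least delta. That is the finer mesh catching smaller fish.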
As I was drafting this post, it occurred to me that the severity principle actually is implicit in a lot of things I’ve written on this blog. Many of my posts question how severely certain ecological ideas have been tested, and whether certain popular research approaches can provide severe tests of our ideas. For instance, this post argues that a popular approach in phylogenetic community ecology actually does not provide a severe test of the hypotheses it purportedly tests, this post points out that testing for curvilinear local-regional richness relationships is not a severe test of the strength of local species interactions, and this post suggests that we too often compromise severity by neglecting to test the assumptions underpinning our predictions, as well as the predictions themselves. And I’m far from the only one who’s raised issues of severity or “well-testedness” in ecology, albeit without actually using Mayo’s terminology. In various posts and exchanges of comments, my fellow blogger Brian and I have had good-natured debates about how to severely test claims at ‘macroecological’ scales. In his own papers, Brian has argued strongly that fitting neutral and non-neutral models to the species abundance distribution does not provide a severe test of neutral models, because both neutral and non-neutral models predict similarly-shaped species abundance distributions. Siepielski and McPeek (2010) recently pointed out that, for all that community ecologists talk about coexistence, no one has ever gone out to any system and systematically and severely tested all the different classes of mechanism that might allow a set of species to coexist. Ranging outside of ecology, biologist and blogger Rosie Redfield famously made the case that the claim of arsenic-based microbial life had not been severely tested—conventional life forms as well as arsenic-based ones could pass the tests, meaning that the tests weren’t severe (as subsequent research confirmed). 
I could keep going, but you get the idea. At some level I suspect most every practicing scientist agrees with the severity principle or something very close to it. But as the examples I’ve just listed illustrate, in practice it’s not at all uncommon for our science to violate the severity principle.
But I don’t want to be purely negative. So just off the top of my head, here are some excellent ecology papers that I think exemplify an “error statistical” approach to science: building up to a severely-tested, substantive ecological conclusion in piecemeal fashion by systematically identifying, quantifying, learning from, and ruling out different sorts of errors. Harrison (1995) is a wonderful reanalysis of a series of classic predator-prey microcosm experiments by Leo Luckinbill, showing which features of the data can and cannot be explained by each of a series of predator-prey models of increasing biological complexity. It’s a nice positive example of learning from error, that doesn’t just use mismatches between data and model to reject the model. Rather, mismatches between the data and the initial model are used to guide the development of a modified model, which is then tested by comparing its predictions to other data, and so on. Holyoak and Lawler (1996) is a wonderful set of microcosm experiments identifying the mechanisms by which spatial patchiness enhances the persistence of an extinction-prone predator-prey interaction. Together, the experiments comprise a severe test that rules out certain mechanisms, and rules in others. Adler et al. (2006) is a good example of testing assumptions as well as predictions in order to make a severely-tested ecological inference, in their case about how species coexistence depends on environmental fluctuations. That paper also provides an illustration of severe testing without experimental data, and in the context of parameter estimation rather than hypothesis testing. 
My fellow blogger Meghan Duffy and her collaborator Spencer Hall have published a series of papers identifying and testing different mechanisms governing the course of disease outbreaks in Daphnia populations, using a combination of observations, experiments, and mathematical modeling to home in on the truth (Duffy and Hall 2008 is one entry point into this work). I don’t know if they yet consider themselves to have a fully worked-out and severely-tested story, but they’re clearly working their way towards that. The work of Bill Murdoch, Cheri Briggs, and colleagues, systematically testing and ruling out the various mechanisms that might explain how a parasitoid controls an important crop pest (California red scale) at a low yet stable density, is a terrific, classic example in population ecology. This work is reviewed and brought to a conclusion in Murdoch et al. (2005). While hardly in the same class as any of those papers (and many others I haven’t mentioned), Fox and Barreto (2006) is an attempt I made to systematically test various hypotheses for how competing protists coexist, which ended up ruling out all the possibilities I could think of and leaving me stumped!
A great summary of the logic of frequentist statistical inference. Most all of us have done frequentist statistical tests. But why, exactly? Precisely what’s the logic of those tests? Don’t be so sure you know! Mayo and Spanos consider the simple example of weighing a man named George at two different times to test whether his weight gain over that time is no more than some positive value X. At each time, we weigh George on each of several well-calibrated and very sensitive scales, all of which register no weight gain. From that, we’d infer that George’s weight gain was less than X. But why is that a justified inference? Is it because our weighing procedure would very rarely make mistakes in a long series of trials? That is, are we arguing that, because our procedure rarely errs in the long run, we can safely assume it hasn’t erred in this particular case? That’s the standard answer, I think, but Mayo and Spanos argue that that answer is wrong, or at least incomplete. A low long-run error rate may be a necessary condition for reliability of our inference in this particular case, but it’s not sufficient. Rather, the reason this particular inference (that George hasn’t gained X pounds) is reliable is what philosophers call the “argument from coincidence”. What we’re arguing is that, had George gained at least X pounds, it would be an amazing coincidence if none of our scales—all of which are sensitive and well-calibrated—had registered it. It would be an amazing coincidence for all of our scales to systematically mislead us about George’s weight in this particular case, even though they don’t systematically mislead us at other times (e.g., when weighing objects of known weight). What properly-used frequentist statistical tests do is quantify how strong our arguments from coincidence are, and precisely what inferences our arguments from coincidence do or do not (severely) warrant.
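Here is a toy calculation, in Python, of how strong the argument from coincidence is in the George example. It is only a sketch under assumptions of my own: each scale's readings have independent normal errors with a known standard deviation, different scales err independently of one another, and the numbers are hypothetical. The function name `coincidence_probability` is likewise made up. It asks how probable it would be for every scale to register a gain no larger than it did if George had in fact gained a given amount.

```python
from math import sqrt
from statistics import NormalDist


def coincidence_probability(true_gain, observed_gains, scale_sd):
    """Probability that every scale shows a gain no larger than it actually
    showed, if George had really gained true_gain.

    Assumes each scale's two readings have independent normal errors with
    standard deviation scale_sd, and that scales err independently.
    """
    diff_sd = scale_sd * sqrt(2)  # error sd of (second reading - first reading)
    prob = 1.0
    for gain in observed_gains:
        prob *= NormalDist(mu=true_gain, sigma=diff_sd).cdf(gain)
    return prob


# Hypothetical numbers: scales accurate to about 0.5 lb, each registering
# zero gain; the claim under test is 'George gained less than 2 lb'.
for k in (1, 2, 4):
    p = coincidence_probability(2.0, [0.0] * k, 0.5)
    print(f"{k} scale(s): chance of this coincidence = {p:.2e}")
```

Under these assumptions the chance that all the scales would miss a real 2 lb gain shrinks rapidly as scales are added, which is one way of putting a number on how amazing the coincidence would have to be. If the scales shared a common source of error, the independence assumption would fail and the argument would be correspondingly weaker.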
How the severity principle helps us avoid common fallacies and mistakes. Certain misunderstandings and mistaken applications of frequentist statistics are very common, which leads to calls for scientists to use other forms of statistical inference, and calls to reform frequentist statistical practice. Brian raised one common complaint (a mindless focus on rejecting null hypotheses at the P<0.05 level) in his recent post. Mayo and Spanos agree that these misunderstandings and mistakes are common, but argue powerfully that a focus on severity is the proper way to address them. So if like most scientists you’ve complained, or heard others complain, that we shouldn’t just mindlessly focus on rejecting null hypotheses at the P<0.05 level, or that frequentist statistical tests treat all statistically-significant results the same, or that P-values don’t tell you about effect sizes, or that confidence intervals are more informative than P-values, or that “overly powerful” tests can detect trivial discrepancies from the null, or that the null is always false, or that statistically-insignificant results are used to infer that the null hypothesis is true, and so on, you really ought to read this paper. Mayo and Spanos show how the severity principle addresses all of these oft-repeated complaints and more besides.
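As one small illustration, take the complaint about statistically-insignificant results being used to infer that the null is true. Here is a sketch of the kind of severity calculation that bears on it. As before, the setup (a normal mean with known sigma), the function name `severity_at_most`, and the numbers are my own hypothetical choices for illustration. For the claim "mu <= mu1", severity is taken to be the probability of a sample mean larger than the one observed if mu were as big as mu1.

```python
from math import sqrt
from statistics import NormalDist


def severity_at_most(xbar, mu1, sigma, n):
    """Severity with which the claim 'mu <= mu1' passes, given observed mean xbar.

    Assumes n independent draws from a normal distribution with known
    standard deviation sigma (an illustrative setup only).
    """
    se = sigma / sqrt(n)  # standard error of the sample mean
    # Probability of a sample mean larger than xbar if mu were exactly mu1.
    return 1.0 - NormalDist().cdf((xbar - mu1) / se)


# Hypothetical non-significant result: observed mean 0.1, sigma = 2, n = 100,
# i.e. only half a standard error above a null value of 0.
for mu1 in (0.0, 0.2, 0.5):
    sev = severity_at_most(0.1, mu1, 2.0, 100)
    print(f"claim mu <= {mu1:.1f}: severity = {sev:.3f}")
```

With these numbers the non-significant result gives poor grounds for "mu <= 0" (the null being exactly true) but good grounds for ruling out a discrepancy as large as 0.5. So the same result licenses some inferences and not others, rather than simply "accepting the null".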
p.s. This post is about explicating Mayo’s idea of “error statistics”, not about Bayesian vs. frequentist statistics. You can use Bayesian methods and do “error statistics” well (e.g., as Andrew Gelman does), and as noted in the post you can use frequentist methods and still do “error statistics” badly. Bayesian vs. frequentist statistics has been debated on this blog before and I do not want to go over that ground again. So in the comments please don’t veer off into debating Bayesian vs. frequentist stats.
*Protip for grad students preparing for their candidacy exams: You should be able to confidently and succinctly answer this sort of question, and be able to elaborate if asked. Handling this sort of seemingly-easy, fundamental question with aplomb is a good way to impress your committee.