In praise of courtesy p-values: perfectly correct p-values vs. pragmatically approximate p-values

According to a text mining analysis of the papers ecologists publish, the number of p-values per paper increased about 10-fold from 1970 to 2010. Where 0 p-values were sufficient to get a paper published in 1930, about 1 p-value per paper was expected in 1970, and now about 10 p-values per paper are needed in the 2010s (Low-Decarie et al 2014 Figure 2). Our science must now be at least 10 times as rigorous! The only thing in the way of the p-value juggernaut is AIC, which has been gaining ground and slowing down p-value growth. I’ve already shared my opinions that AIC is appealing to ecologists for some not so good reasons. Here I want to argue that we have gotten into some pretty sloppy thinking about p-values as well.

I would characterize the modern approach to p-values as “I used massively correct statistics and p<0.05 so you can’t possibly question my conclusions”. That view bothers me already, but the flip side of this view is “if there is the slightest violation of assumptions in your data, then you cannot use p-values”. This bothers me even more. The “you cannot use p-values” claim is a ritualistic inner-circle protection of the p-value that lets ecologists slip into ascribing unreal powers to p-values. P-values have become talismans (“an object … that is thought to have magic powers and bring good luck”). If I treat my p-value well by rubbing it frequently and carrying it around with me all the time (err … I mean by ostentatiously being more rigorous and perfect in my statistics than everybody else), then I will have good luck (err … reviewers cannot argue that my experiment was poorly designed and inconclusive). And nowhere in sight is there a hint of an effort to assess the strength of the experimental design or the scientific inference.

The first thing I want to take on is the notion that there is a black-and-white correctness to a p-value, i.e. it’s either valid or it’s not, with no shades of gray. This ties into the theme of violations of assumptions. Modern thinking is that if any assumption is violated then the p-value is useless and should not be reported. Many would trace this view back to Hurlbert’s 1984 paper on pseudoreplication.

But the notion that the statistical test leading to a p-value can ever be perfect and free of challenges to assumptions is not really tenable. No statistical test is perfect. Every statistical test has some assumptions that are violated (especially in messy ecology data!) or at least subject to opinion and judgment. Statistics is ultimately philosophy and there are not absolute rights and wrongs in philosophy. Here are a few reasons why the statistical tests are never perfect:

  1. Nearly every assumption is violated to some degree. Assumptions are matters of degree, not absolutes. Take normality. It’s a core assumption of many, many tests. Is your data normal? I’m willing to bet that if you do a test like Kolmogorov-Smirnov or Shapiro-Wilk it is statistically different from normal. There was a brief phase in ecology when everybody did these tests and almost every dataset was found to be not normal (the multicausal nature of ecology is good at creating outliers that are supposed to be very rare in the normal distribution). People then begrudgingly switched to non-parametric tests when their data failed. But then saner heads prevailed. Most tests that assume normality are actually fairly robust to deviations from normality according to computer simulations. And you lose a lot of power by downgrading your data into ranks or, worse, binary over/under. So we went back to using the assumption of normality, fortunately. Or take independence of error terms. Are two leaves on the same tree independent experimental units? Probably not, but maybe; it depends on the problem. It might actually be a good degree of control to use genetically identical leaves. Are two leaves on different plants independent? A bit more so, but maybe, maybe not. Are two leaves on different plants on different shelves in the same growth chamber independent? On the same plot? On the same continent? On this single replication of life on planet Earth where all living organisms share a common ancestor? Pseudoreplication is a real issue, but it is a shades-of-gray issue and one that requires careful, nuanced, contextual thinking. What about linearity? Is your data linear? Really? Did you test for linearity? Is the response truly linear from minus infinity to plus infinity? I bet not. Linearity is usually a good approximation though. All assumptions are violated. They exist on a slippery slope of shades of gray and are not black and white.
     And don’t get me started on when you replace simple, well-understood assumptions like normality and linearity with even more complex assumptions about negative-binomial distributions, hierarchical models and closure assumptions of detection probability models. That is just kicking the assumption can down the road in the hope that nobody has thought about it yet (or knows enough to call you out on it) (and the subject of a future post).
  2. Which assumption violations we care about is based more on historical contingency than on careful assessment of their importance. Take non-independence of errors, which can be addressed by using GLS regression. In ecology we have non-independence across all three dimensions of taxon, space and time. Yet we don’t worry about the three non-independences to the same degree. In general you cannot get a comparative paper published without correcting for phylogenetic non-independence, spatial regression is a mixed bag, sometimes demanded and sometimes not, and temporal autocorrelation (e.g. in a regression of abundance vs year) is almost never flagged in ecology. Is this based on a detailed assessment of the relative likelihood of these different aspects changing the results? Definitely not. Phylogenetic regression is demanded because an influential group of people (the Oxford school) pushed it really hard and they won. Is temporal correlation a non-issue? In ecology it is treated as one, but in economics it is a huge no-no; you cannot get published if you ignore it. It’s not because our data is different, it’s because we have different traditions. Spatial regression is somewhere in between because people started pushing for spatial regression after the phylogenetic regression wars had already happened, and people knew where the spatial regression wars were headed and started pushing back very hard very quickly (e.g. Diniz-Filho et al 2003). As a result spatial regression is used at intermediate levels and, in my opinion, reasonably sensibly. But the point is that these three issues are statistically identical, with the same statistical solution, yet their usage (or not) is entirely based on social history and is discipline specific, not based on a rational quantitative assessment.
  3. There is no perfect fix to multiple comparisons and post-hoc tests. Most ecologists know that if you assess a lot of p-values, you need to do a correction for multiple comparisons (i.e. you need p<x for some threshold x<0.05). So many ecologists will just throw in a Bonferroni correction, even though we know that is overly conservative and greatly inflates Type II error. And proper control of Type I vs Type II error is a philosophical question (are we trying to control Type I error for all of science, a career, a paper, an experiment?) (see the Garcia 2004 paper for a great discussion). And if the average paper is reporting 10 p-values these days, then this is certainly an issue worth discussing. p<0.05 does not mean the same thing it did in 1970 when only one p-value was reported in a paper! Yet most of those papers with 10 p-values don’t do any type of error rate correction (and personally I’m not advocating that they should).
  4. The p<0.05 cutoff is binary and arbitrary. The notion that we will let our decision of whether a paper is strong evidence or not depend on whether p=0.049 or p=0.051 is ridiculous!
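
To put rough numbers on point 3, here is a minimal sketch of the multiple-comparisons arithmetic. It assumes the tests are independent and that every null hypothesis is true, which is a simplification (the tests within one paper are usually correlated), and the function names are mine, chosen for illustration.

```python
def familywise_error_rate(n_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive among n independent tests of true nulls."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_threshold(n_tests: int, alpha: float = 0.05) -> float:
    """Per-test cutoff that keeps the family-wise rate at or below alpha."""
    return alpha / n_tests


for n in (1, 10):
    fwer = familywise_error_rate(n)
    cutoff = bonferroni_threshold(n)
    corrected = familywise_error_rate(n, cutoff)
    print(f"{n:2d} tests: uncorrected FWER = {fwer:.3f}, "
          f"Bonferroni cutoff = {cutoff:.4f}, corrected FWER = {corrected:.3f}")
```

Under these assumptions a paper with 10 p-values has roughly a 40% chance of at least one spurious p<0.05, against 5% for a paper with a single p-value. Bonferroni pulls that back under 5% by demanding p<0.005 of every test, which is where the loss of power comes from.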

The only rational conclusion from the above four points is that p-values cannot be considered perfect and accurate to the six decimal places to which they are sometimes reported. Given the shades-of-gray correctness of statistical model assumptions that are policed according to human whims, all while swimming in a complex philosophical soup of Type I vs Type II error trade-offs, the only rational approach to a p-value is to think of it as an approximation, as a guidepost, as a hint of what is going on.

And if you do that, it undermines the whole logic of refusing to report p-values when they are known not to be perfect. Although Hurlbert 1984 certainly advocated dropping p-values (“be liberal in accepting good papers that refrain from using inferential statistics when these cannot be validly applied”), he was not rigid about it. His previous two points were more about disclosure of limitations (“insist that statistical analysis applied be specified in detail”, “when [p-values] are marginally allowable, insist on disclaimers and explicit mentions of the weakness of the experimental design”). What bothered Hurlbert the most was people puffing themselves up over p-values inappropriately (“Disallow ‘implicit’ pseudoreplication which, as it often appears in the guise of very ‘convincing’ graphs is especially misleading”). My own approach hews much more closely to that of Lauri Oksanen (2001 “Logic of experiments in ecology: is pseudoreplication a pseudoissue” – an absolute must read!). Oksanen acknowledges pretty much everything Hurlbert says, but disagrees with the suggestion that we should throw out p-values. Instead, he says “computing inferential statistics is just courtesy towards the reader.”

So while the Hurlbert and Oksanen papers are often pitched as being in deep disagreement, they actually agree on much. The problem with p-values comes when we treat the test of statistical significance, “p<0.05”, as more important than the inferential power of the experimental design, and become befuddled by once again ritually rubbing on the talisman of “p<0.05”. Hurlbert is a bit more inclined to allow that sometimes p-values are correct and to omit them when they’re not. But Hurlbert and Oksanen both agree that perfectly good papers can exist with pseudoreplication. In a nutshell, the problem is that we have gotten lazy and allowed a very quick check of “p<0.05”, which is a matter of statistical inference, to replace the much harder work of scientific inference. I.e., is the logic of the experimental design convincing (or the best achieved to date, or weak but combined with an important enough result to make it worth publishing)? And once you accept that scientific inference (as opposed to statistical inference) is the criterion by which a paper is judged, then I think the Oksanen view makes a lot of sense. Report the p-value as a courtesy to the reader (we all know the reviewers are going to ask for them anyway). It is a courtesy because a p-value is a meaningful approximate estimate of something we are interested in. It is useful to have some estimate (even if imperfect) of how likely data at least as extreme as those observed would be if the null hypothesis were true. Then acknowledge that the p-values are approximate and highlight the limitations – i.e. clearly label and disclose. Of course then you cannot hang your whole paper on whether p<0.05. Instead you have to put emphasis on a candid discussion of the strength of your scientific inference. Then the problem that the p-value is imperfect and approximate goes away. And we can stop having ritualistic defenses of the perfectness of p-values because we used the “right” statistics that “fixed” “all” the violations of assumptions.
And we can turn our attention to where it should be – how strong is the scientific inference in the paper?

What do you think? Have you heard of the notion of p-values as a courtesy? Do you agree with it? Do you agree that p-values are necessarily approximations? Should you never report a p-value if your statistics are imperfect? I’m aware I’m swimming against the stream here – the replication crisis in social psychology is all about more rigorous p-values, whereas I would argue it should be all about how treating p-values like talismans lets us avoid critical thinking about the scientific inference of the papers. What do you think?

31 thoughts on “In praise of courtesy p-values: perfectly correct p-values vs. pragmatically approximate p-values”

  1. “All models are wrong. Some models are useful.” -George Box

    “All assumptions are violated. [Some assumptions are good enough for scientific inference.]” -Brian McGill

    I definitely agree with you that critical appraisal of work is more desirable than any methodological approach to rigour. The trick is to somehow encourage that among the hugely heterogeneous research community.

    • Nice play on the quotes (not just funny but actually captures some pretty profound parallels).

      You are right that I think people despair that we’ll ever do deep thinking about papers and so latch onto shallow gatekeepers. I tend to give people less of the benefit of the doubt that it is only because we are heterogeneous (which in principle could be good for evaluating science) and think maybe people are just lazy or we’re cranking too many papers through the system. But perhaps I’m cynical.

  2. “And you lose a lot of power by downgrading your data into ranks or worse binary over/under”

    Not at all, just p-hack the threshold you use to define over/under until you get a significant result. Problem solved!

    (Never believe anything with a dichotomised variable unless there is an astoundingly good justification for the chosen cutoff)

    • A little off topic from my post, but of course I 100% agree. Ultimately most of ecology came around on this point too with respect to nonparametric tests. Although of course many ecologists still do this in the range modelling arena, where they pick a threshold to convert probability of presence (or some other continuous variable) into a binary present/absent range.

  3. Hi Brian; I am interested in your take on how statistics are used in applied ecology, particularly fisheries and wildlife. While they do experiments, they are much more into sampling and estimation of parameters [often for models of pop dynamics], and very big in Bayesian estimation. BUT since the sampling and estimation are usually for some applied purpose, the methods seem much more flexible/practical; I say this without a lot of recent personal experience.
    I would also characterize them as into scientific inference, as opposed to statistics, per se. It may surprise you but this dichotomy was clearly present among the early ecologists who also taught statistics.
    My undergrad advisor in Fisheries was Frederick Smith [yes, of HSS fame] and he taught the 2 course stat sequence at UM’s School of Natural Resources; he let me take the advanced class [1969?], and it was scientific inference all the way.
    Likewise my grad advisors Gerald Paulik and Douglas Chapman were deep into fisheries/wildlife decision making and it showed in their stat classes [Chapman was a student of Jerzy Neyman, and a hard core math stat type; decades of working on real field problems had converted him to a very practical chap].
    Schools like the College of Fisheries and Aquatic science at UWash have whole programs devoted to learning applied statistics for resource management problems: see their Quantitative Ecology and Resource Management program (MS/PhD/undergrad minor).

    • Personally I wish more researchers in the non-applied parts of ecology recognized parameter estimation (and characterization of the errors of those estimates) as a worthy pursuit.

      I’m not opposed to a straight up p-value hypothesis test (more of a basic research framing) IF there genuinely is an a priori hypothesis either. But whether p-values or AIC, I perceive that a great deal of what is published is in this weird middle ground of neither truly exploratory nor truly a priori hypothesis testing.

      It’s very hard to get people to slow down and teach scientific inference in graduate stats classes (and more broadly in graduate courses) when there is a fire hose of new and complex stats to master. But I try to do it in my stats class and I wish more classes did it. Fancy stats never saves bad experimental design and scientific inference, even though we often act like it does.

  4. Hi Brian,

    Great post as always.

    Regarding your questions, I view p-values as a courtesy statistic only because I sat in on your stats class!

    “Do you agree that p-values are necessarily approximations?”

    Yes, they must be by definition, right? I came to that perspective by reading this paper: If p-values are just a monotonic transformation of a test statistic, which is a random variable, then p-values themselves must be random variables (and thus are approximations). Therefore, it seems nonsensical to me to use a black-and-white cut-off of p=0.05 (which itself is arbitrary) if your p-value has implicit variation to it. Throw in additional variation from sample size, experimental design, and the like, then your p-value becomes a really fuzzy number.

    “Should you never report a p-value if your statistics are imperfect?”

    This one I’m struggling with a bit. If you calculate a p-value on a time series that is highly autocorrelated, we know the p-value is biased downward. But by how much? I assume one needs to know the structure of temporal autocorrelation to deduce this. But if one presents a p-value of 0.001 on a highly autocorrelated time series, what is the “true” p-value? By how much should this p-value be adjusted? If I don’t know the amount of adjustment, then I don’t know how to interpret the calculated p-value (even disregarding all the other issues that come with p-values). I’d like to hear your thoughts on this when you have time to respond!
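
One hedged way to get a feel for “by how much” is simulation. The sketch below is an illustration under stated assumptions rather than a recipe: it assumes the noise is AR(1) with a known autocorrelation `rho`, that there is no true trend, and that the naive analysis is an ordinary least-squares regression on time. All function names are hypothetical, and the critical value is hard-coded for series of length 50.

```python
import math
import random


def ar1_series(n: int, rho: float, rng: random.Random) -> list[float]:
    """Simulate AR(1) noise with no trend: y[t] = rho * y[t-1] + white noise."""
    y = [rng.gauss(0.0, 1.0) / math.sqrt(1.0 - rho * rho)]  # stationary start
    for _ in range(n - 1):
        y.append(rho * y[-1] + rng.gauss(0.0, 1.0))
    return y


def naive_trend_t(y: list[float]) -> float:
    """t statistic for the OLS slope of y on time, assuming independent errors."""
    n = len(y)
    t_mean = (n - 1) / 2.0
    y_mean = sum(y) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (v - y_mean) for t, v in enumerate(y))
    slope = sxy / sxx
    rss = sum((v - y_mean - slope * (t - t_mean)) ** 2 for t, v in enumerate(y))
    se = math.sqrt(rss / (n - 2) / sxx)
    return slope / se


def false_positive_rate(rho: float, reps: int = 2000, seed: int = 1) -> float:
    """Share of trend-free series of length 50 that the naive test flags at p < 0.05."""
    n = 50
    t_crit = 2.011  # two-sided 5% critical value of Student's t with n - 2 = 48 df
    rng = random.Random(seed)
    hits = sum(abs(naive_trend_t(ar1_series(n, rho, rng))) > t_crit for _ in range(reps))
    return hits / reps


for rho in (0.0, 0.3, 0.7):
    print(f"rho = {rho:.1f}: naive false positive rate = {false_positive_rate(rho):.3f}")
```

With `rho = 0` the rate should sit near the nominal 0.05, and it should climb well above that as `rho` grows. That gap is one rough gauge of how far a naive p-value on an autocorrelated series can be off, though it only answers the question for the autocorrelation structure you were willing to assume.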

  5. I argued some of these points a while ago (1999) in a commentary in “Animal Behaviour”, but the points are general. I discuss a pretty telling example (from the journal) about slavish adherence to the cut-off:

    “One final example from Animal Behaviour is particularly telling. In a study of isopod microhabitat use, the authors investigated the relationship between substrate choice and food choice in both sexes of a marine isopod (Merilaita & Jormalainen 1997). They report in their results section that ‘substrate choice correlated significantly with food choice in males (rS = 0.32, N = 44, P<0.05; Figure 4b) but not in females (rS = 0.32, N = 24, NS).’ Later in the paper, the authors discuss the biological reasons why there is a relationship between these variables in males, but not in females. The correlation coefficients, however, are identical; substrate choice and food choice had exactly the same relationship in males and females. With only 24 females, the authors could not possibly have concluded that there was a correlation between substrate choice and food choice (assuming they require a P value of less than 0.05 to conclude a correlation exists) until the rS value was over 0.40 (Siegel & Castellan 1988). A much more reasonable conclusion is that the relationship between substrate choice and food choice is very similar between the sexes, but that a larger sample of females is needed to increase our confidence in this conclusion.”

    • Yep. Andrew Gelman talks about how the difference between statistically significant and not statistically significant is not itself statistically significant. That is, it’s not appropriate to infer whether two means (say) truly are different by comparing each to the same null hypothesis value; you should compare them directly to one another.

      I confess that, if you went through my old papers, you’d probably find at least one in which I made this mistake.

    • The error you describe is called a “difference in nominal significance” or DINS error. It is one of the most common statistical errors… frustrating to see, for sure. In your example, I believe they could have gotten around it with some sort of multiple regression, where substrate choice is a function of food choice (or vice versa) with a covariate for sex. The sex covariate would certainly have been non-significant, saving them a lot of words in the discussion!
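
The isopod example above can be sketched numerically. This is a rough illustration only: it uses Fisher’s z approximation for a correlation coefficient, which is a normal-theory shortcut and not the exact Spearman test the original authors would have used, and the function names are hypothetical.

```python
import math


def fisher_z_p(r: float, n: int) -> float:
    """Approximate two-sided p-value for H0: correlation = 0 via Fisher's z."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2.0))


def difference_p(r1: float, n1: int, r2: float, n2: int) -> float:
    """Approximate two-sided p-value for H0: the two correlations are equal."""
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (math.atanh(r1) - math.atanh(r2)) / se
    return math.erfc(abs(z) / math.sqrt(2.0))


males = fisher_z_p(0.32, 44)      # same coefficient, larger sample
females = fisher_z_p(0.32, 24)    # same coefficient, smaller sample
direct = difference_p(0.32, 44, 0.32, 24)
print(f"males p = {males:.3f}, females p = {females:.3f}, difference p = {direct:.3f}")
```

Under this approximation the male correlation falls below 0.05 and the female one does not, purely because of sample size, while the direct comparison of the two coefficients gives p = 1 since they are identical. Comparing the groups to each other, rather than each to zero, is the test that matches the question being asked.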

  6. Via Twitter:

    In reply, I’ll link to this old post explicating Deborah Mayo’s idea of “severe testing”:

  7. Brian, I love this post. Your use of p-values seems to be more consistent with Deborah Mayo’s (although I won’t pretend to have read a great deal of her work) – the idea that p-values combined with a critical examination of experimental design provide you with the severity of the test. Jeremy, you probably have a better idea of how this approach matches up with Deborah’s position on p-values.
    As an aside, I think any post-hoc test is inexcusable without knowing the power of your test and explicitly considering the relative costs of Type I and II errors. Lowering the probability of a Type I error without knowing what the consequences for Type II errors will be and without explicitly addressing what you think the balance between Type I and II error should be is the worst kind of rote statistical practice.


    • Re: Deborah Mayo’s work, I’d note that she distinguishes two senses of “severity”: severity as a broad, overarching principle, and severity as a technical property of a statistical test, like Type I or Type II error rate.

      I read Brian’s post as arguing for severity as a broad overarching principle, conformity to which is not enhanced by paying overly-close attention to the assumptions underpinning your p-value calculations. I generally agree with Brian on that, though I might quibble with a few of his examples (e.g., I think correcting for multiple comparisons, or alternatively controlling for false discovery rate, often is a sensible thing to do).

      Severity as a technical property of a statistical test hasn’t ever really taken off and I don’t know that it will. I’ve only ever seen it calculated in really simple toy examples by Deborah Mayo herself, and I’m not sure if it could be calculated in a broader range of circumstances.

      As an aside, severity as a technical property of a statistical test isn’t some compromise between type I and type II error rates. The notion of a severe test (even of a null hypothesis) isn’t based on any consideration of the costs of type I vs. type II errors.

      • re “severity as a broad overarching principle, conformity to which is not enhanced by paying overly-close attention to the assumptions underpinning your p-value calculations”
        Agreed, although I would go even further and suggest that people often use careful attention to the assumptions underpinning a p-value as a substitute for looking for severity as a broad overarching principle.

        Although I think one would be better off not being in a position to do multiple comparisons, if you do have multiple comparisons, I agree you should try to correct for them. I just don’t think it’s possible to perfectly correct for multiple comparisons (or that that is a well defined question).

    • I guess I’ll add my name to the already large senior white male demographic that comments on blog posts 🙂

      While courtesy p-values spark a worthwhile and healthy debate, and the concept of power is useful, Type I and II error, which require setting some arbitrary number as an arbiter of significance, don’t make much sense to me as a useful tool for science for the many reasons discussed ad nauseam in the literature. FDR is tantalizing; it seems to be so much what we want but I think in the end, it supports an illusion of objectivity.

      • Jeff what do you think of p-values reported but not assessed against p<0.05? That is where I personally lean. Of course p-values are basically a blend of power and effect size and you could argue those should be reported separately.

      • Brian – I tried to avoid answering that with my “healthy debate” comment! I have no strong feeling either way. A p-value is a useful tool to help guide inference. A confidence interval can be used similarly but adds additional information (do both ends of the interval support a similar story?).

  8. Brian, what are your thoughts on preregistration? As currently conceived and practiced in psychology, it seems to function mostly as a way to prevent (witting or unwitting) p-hacking. From your perspective, it sounds like you would regard this as at best minimally helpful? Maybe even harmful on balance, to the extent that it encourages a narrow focus on rigorously calculating p-values and “crowds out” thinking about how to achieve severe tests (in a broad sense) of substantive scientific hypotheses?

    • I’m ambivalent. On the one hand stronger scientific inference comes with a priori hypotheses, and pre-registration helps that. But on the other hand, stronger scientific inference comes with strong hypotheses (risky, novel, derived from theory) and pre-registration doesn’t force that. Also I see many practical problems in ecology with pre-registration. Just for example, what to do with the fact that it is the norm in ecology to collect preliminary data to get the grant? And what is going to be pre-registered? The sites of study? The years of collection? Those often change on the fly for valid reasons given the vagaries of nature (weather, cows eating experiments).

  9. Thanks for the great post. It is striking how some model assumptions get fixated on in the research community (normality, independence of errors) while others are almost never explored (homogeneity of variance for lm, linearity). In that regard the ordered list of the important assumptions of linear models in the Gelman and Hill multilevel regression book is something I keep going back to.

    About the independence of errors, I am also seeing more and more colleagues routinely applying phylogenetic correction to their models and also building up complex mixed-effect models with crossed and nested random terms, with both a low number of grouping levels and rather small sample sizes. All of this is done with little understanding of the underlying complexity of fitting such models. And this got me wondering, how did we manage in the past when GLMMs were not available or when MCMCglmm was not there for you to account for phylogenetic signal? Is this all just statistical machismo (I hope I am using the term in the right setting 🙂 )?

    • It sounds like you and I are perceiving the world (and what is wrong with it) very similarly! And yes, this is the kind of thing I would throw in the basket of statistical machismo.

      I think your last point was telling (how did we get along before GLMM came along)? One interpretation is that we now can fix these problems that everybody knew were horribly bothersome before but could do nothing about. But that is not my memory of the pre-GLMM world. We were all pretty happy with the models we had then (and certainly I am aware of no pre-GLMM models that have now been found to be very wrong once we can use GLMM). An alternative interpretation is that we have embraced it so thoroughly because it is new and shiny and becomes a badge by which the new generation (or the statistical insiders) can separate themselves. Which is not such a charitable interpretation. While there are times and places where GLMM can be very helpful, I personally fear that far more often it has enabled people to fit things that really shouldn’t be fit that way.

      • That would make for an interesting crowd-sourced post: name the biggest *scientific* (not statistical) conclusion that has been overturned or substantially revised by mixed models/hierarchical models/correcting for phylogenetic non-independence/etc.

        I mean, I have a published example that I use in my advanced biostats course in which using a generalized linear model, rather than a general linear model on transformed data, alters the scientific conclusion. But it’s one study in which the conclusion is altered, and as I tell my students it’s probably an exception. In practice, in my admittedly-finite experience, it doesn’t *often* make all *that* much difference to your scientific conclusions if you use a generalized linear model rather than a general linear model on transformed data. I suspect that the same is true for phylogenetic correction, etc.

      • If we base big scientific conclusions on multiple lines of evidence, including rigorous probing and replication, then can there even exist a big scientific conclusion that could be overturned or revised by using a glmm instead of a lm or glm?

      • @Jeff Walker:
        That’s what we’d be trying to find out by writing that post! 🙂 (I doubt it, but I’m open to being convinced otherwise.)

        Also, I’m thinking back to Andrew Gelman’s old post in which he doubts that *the entire world* would be all that different, or all that much worse, if statistics had never been invented.

      • I should add that I disagree with Gelman on that, while also thinking it a very hard counterfactual to think sensibly about. But that someone as sharp as Andrew Gelman could even think that is one striking (flawed, but striking) measure of just how hard it is to show that statistics makes all *that* much difference to anything.

      • “If we base big scientific conclusions on multiple lines of evidence, including rigorous probing and replication, then can there even exist a big scientific conclusion that could be overturned or revised by using a glmm instead of a lm or glm?”
        No. Because building a good statistical model is about analyzing the data at hand, for a specific (well-posed) purpose. It is about inference in the “small world” (sensu Lindley). Scientific learning as a whole is larger than any particular small world inference. But I would argue that this is NOT a reason to NOT account for grouping structure, etc. (the usual goals of “GLMM” compared to vanilla “LM”), as a sensible default 🙂

  10. Pingback: How do we move beyond an arbitrary statistical threshold? | Small Pond Science
