According to a text-mining analysis of the papers ecologists publish, the number of p-values per paper has increased about 10-fold from 1970 to 2010. Where zero p-values were sufficient to get a paper published in 1930 and about one p-value per paper was typical in 1970, about 10 p-values per paper are now expected in the 2010s (Low-Decarie et al 2014, Figure 2). Our science must now be at least 10 times as rigorous! The only thing standing in the way of the p-value juggernaut is AIC, which has been gaining ground at the expense of p-value growth. I’ve already shared my opinion that AIC appeals to ecologists for some not-so-good reasons. Here I want to argue that we have gotten into some pretty sloppy thinking about p-values as well.
I would characterize the modern approach to p-values as “I used massively correct statistics and p<0.05, so you can’t possibly question my conclusions.” That view bothers me already, but its flip side bothers me even more: “if there is the slightest violation of assumptions in your data, then you cannot use p-values.” The “you cannot use p-values” claim is a ritualistic, inner-circle protection of the p-value that lets ecologists slip into ascribing unreal powers to it. P-values have become talismans (“an object … that is thought to have magic powers and bring good luck”). If I treat my p-value well by rubbing it frequently and carrying it around with me all the time (err … I mean by ostentatiously being more rigorous and perfect in my statistics than everybody else), then I will have good luck (err … reviewers cannot argue that my experiment was poorly designed and inconclusive). And nowhere in sight is there a hint of an effort to assess the strength of experimental design or scientific inference.
The first thing I want to take on is the notion that there is a black-and-white correctness to a p-value, i.e. it is either valid or it is not, with no shades of gray. This ties into the theme of violations of assumptions. Modern thinking is that if any assumption is violated then the p-value is useless and should not be reported. Many would trace this view back to Hurlbert’s 1984 paper on pseudoreplication.
But the notion that the statistical test leading to a p-value can ever be perfect and free of challenges to its assumptions is not really tenable. No statistical test is perfect. Every statistical test has some assumptions that are violated (especially with messy ecological data!) or at least subject to opinion and judgment. Statistics is ultimately philosophy, and there are no absolute rights and wrongs in philosophy. Here are a few reasons why statistical tests are never perfect:
- Nearly every assumption is violated to some degree. Assumptions are matters of degree, not absolutes. Take normality. It’s a core assumption of many, many tests. Is your data normal? I’m willing to bet that if you run a test like Kolmogorov-Smirnov or Shapiro-Wilk, it is statistically different from normal. There was a brief phase in ecology when everybody ran these tests, almost every dataset was found to be non-normal (the multicausal nature of ecology is good at creating outliers that are supposed to be very rare in the normal distribution), and everybody begrudgingly switched to non-parametric tests when their data failed. But then saner heads prevailed. Most tests that assume normality are actually fairly robust to deviations from normality according to computer simulations (see the first code sketch after this list), and you lose a lot of power by downgrading your data into ranks or, worse, a binary over/under. So, fortunately, we went back to using the assumption of normality. Or take independence of error terms. Are two leaves on the same tree independent experimental units? Probably not, but maybe; it depends on the problem. It might actually be a good degree of control to use genetically identical leaves. Are two leaves on different plants independent? A bit more so, but maybe, maybe not. Are two leaves on different plants on different shelves in the same growth chamber independent? On the same plot? On the same continent? On this single replication of life on planet Earth where all living organisms share a common ancestor? Pseudoreplication is a real issue, but it is a shades-of-gray issue and one that requires careful, nuanced, contextual thinking. What about linearity? Is your data linear? Really? Did you test for linearity? Is the response truly linear from minus infinity to plus infinity? I bet not. Linearity is usually a good approximation, though. All assumptions are violated. They exist on a slippery slope of shades of gray and are not black and white. And don’t get me started on when you replace simple, well-understood assumptions like normality and linearity with even more complex assumptions about negative-binomial distributions, hierarchical models and the closure assumptions of detection-probability models. That is just kicking the assumption can down the road in the hope nobody has thought about it yet (or knows enough to call you out on it) (and the subject of a future post).
- Which assumption violations we care about is based more on historical contingency than on careful assessment of their importance. Take non-independence of errors, which can be addressed by using GLS regression (see the second sketch after this list). In ecology we have non-independence across all three dimensions of taxon, space and time, yet we don’t worry about the three non-independences to the same degree. In general you cannot get a comparative paper published without correcting for phylogenetic non-independence; spatial regression is a mixed bag, sometimes demanded, sometimes not; and temporal autocorrelation (e.g. in a regression of abundance vs year) is almost never flagged in ecology. Is this based on a detailed assessment of the relative likelihood of these different issues changing the results? Definitely not. Phylogenetic regression is demanded because an influential group of people (the Oxford school) pushed it really hard and they won. Is temporal correlation a non-issue? In ecology it is treated as one, but in economics it is a huge no-no; you cannot get published if you ignore it. It’s not because our data are different; it’s because we have different traditions. Spatial regression is somewhere in between because people started pushing for it after the phylogenetic regression wars had already happened; people knew where the spatial regression wars were headed and started pushing back very hard very quickly (e.g. Diniz-Filho et al 2003). As a result spatial regression is used at intermediate levels and, in my opinion, reasonably sensibly. But the point is that these three issues are statistically identical, with the same statistical solution, yet their usage (or not) is entirely based on social history and is discipline-specific, not based on a rational quantitative assessment.
- There is no perfect fix to multiple comparisons and post-hoc tests. Most ecologists know that if you assess a lot of p-values, you need to do a correction for multiple comparisons (i.e. you need to require p to be below some threshold smaller than 0.05). So many ecologists will just throw in a Bonferroni correction, even though we know it is overly conservative and greatly inflates Type II error, and even though proper control of Type I vs Type II error is a philosophical question (are we trying to control Type I error for all of science, a career, a paper, an experiment?) (see Garcia 2004 for a great discussion). And if the average paper is reporting 10 p-values these days, then this is certainly an issue worth discussing: with 10 independent tests of true null hypotheses at α=0.05, the chance of at least one false positive is 1 − 0.95^10 ≈ 40% (see the third sketch after this list). p<0.05 does not mean the same thing it did in 1970 when only one p-value was reported in a paper! Yet most of those papers with 10 p-values don’t do any kind of error-rate correction (and personally I’m not advocating that they should).
- The use of p<0.05 as a binary cutoff is arbitrary. The notion that we will let our decision about whether a paper is strong evidence or not hinge on whether p=0.049 or p=0.051 is ridiculous!
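To make the normality point concrete, here is a minimal simulation sketch in Python (the gamma distribution, sample sizes and number of simulations are my own illustrative assumptions, not from any paper cited here): with a couple of hundred observations per group, a Shapiro-Wilk test flags a mildly skewed distribution as non-normal almost every time, yet the two-sample t-test’s Type I error rate stays close to the nominal 5%.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sim, n = 2000, 200          # 2000 simulated "experiments", n = 200 per group

shapiro_rejects = 0
ttest_false_positives = 0
for _ in range(n_sim):
    # Both groups come from the SAME mildly skewed gamma distribution,
    # so the null hypothesis of equal means is true but normality is violated.
    a = rng.gamma(shape=4.0, scale=1.0, size=n)
    b = rng.gamma(shape=4.0, scale=1.0, size=n)
    shapiro_rejects += stats.shapiro(a).pvalue < 0.05
    ttest_false_positives += stats.ttest_ind(a, b).pvalue < 0.05

print(f"Shapiro-Wilk declares the data non-normal: {shapiro_rejects / n_sim:.0%} of the time")
print(f"t-test Type I error rate despite that:     {ttest_false_positives / n_sim:.1%} (nominal 5%)")
```

The violation is real and detectable, but the t-test p-value is still a perfectly serviceable approximation.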
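On the non-independence point, here is a minimal sketch of the generic statistical fix mentioned above – plugging a correlation matrix into GLS (using statsmodels; the abundance-vs-year data and the AR(1) error structure with a known rho are invented purely for illustration). The same machinery handles temporal, spatial or phylogenetic non-independence; only the correlation matrix changes.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical abundance-vs-year data with AR(1) temporally autocorrelated
# errors (rho is treated as known here purely to keep the sketch short).
rng = np.random.default_rng(1)
n, rho = 50, 0.6
years = np.arange(n)

errors = np.zeros(n)
errors[0] = rng.normal()
for t in range(1, n):
    errors[t] = rho * errors[t - 1] + rng.normal()
abundance = 10 + errors          # true trend is zero

X = sm.add_constant(years)

# OLS pretends the 50 points are independent; GLS plugs the correlation
# structure in as a covariance matrix. A spatial or phylogenetic analysis
# would use exactly the same call with a distance- or tree-based matrix.
Sigma = rho ** np.abs(np.subtract.outer(years, years))   # AR(1) correlation matrix
ols_p = sm.OLS(abundance, X).fit().pvalues[1]
gls_p = sm.GLS(abundance, X, sigma=Sigma).fit().pvalues[1]
print(f"p-value for a year trend -- OLS (independence assumed): {ols_p:.3f}, GLS: {gls_p:.3f}")
```

Nothing about the statistics distinguishes the temporal, spatial and phylogenetic cases; only the matrix you pass in does.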
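And here is a minimal sketch of the multiple-comparisons arithmetic (the all-nulls-true, 10-tests-per-paper setup is just the illustrative figure quoted above; the multipletests helper comes from statsmodels): with no correction, roughly 40% of the simulated “papers” contain at least one p<0.05 by luck alone, while Bonferroni brings that back to about 5%.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(2)
n_papers, n_tests, n = 2000, 10, 20   # 2000 simulated papers, 10 t-tests each

any_raw, any_bonferroni = 0, 0
for _ in range(n_papers):
    # Ten independent two-sample t-tests per "paper", every null hypothesis true.
    pvals = np.array([
        stats.ttest_ind(rng.normal(size=n), rng.normal(size=n)).pvalue
        for _ in range(n_tests)
    ])
    any_raw += (pvals < 0.05).any()
    any_bonferroni += multipletests(pvals, alpha=0.05, method="bonferroni")[0].any()

print(f"Papers with at least one p < 0.05, no correction:  {any_raw / n_papers:.0%}")
print(f"Papers with at least one p < 0.05 after Bonferroni: {any_bonferroni / n_papers:.0%}")
```

The correction controls the family-wise Type I error rate, but which “family” to control and how much Type II error to accept remain the philosophical judgment calls described above.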
The only rational conclusion from the above four points is that p-values cannot be considered perfect and accurate to the six decimal places to which they are sometimes reported. Given the shades-of-gray correctness of statistical model assumptions, policed according to human whims, all while swimming in a complex philosophical soup of Type I vs Type II error trade-offs, the only rational approach to a p-value is to think of it as an approximation, as a guidepost, as a hint of what is going on.
And if you do that, it undermines the whole logic of refusing to report p-values when they are known not to be perfect. Although Hurlbert 1984 certainly advocated dropping p-values in some cases (“be liberal in accepting good papers that refrain from using inferential statistics when these cannot be validly applied”), he was not rigid about it. His other two points were more about disclosure of limitations (“insist that statistical analysis applied be specified in detail”, “when [p-values] are marginally allowable, insist on disclaimers and explicit mentions of the weakness of the experimental design”). What bothered Hurlbert the most was people puffing themselves up over p-values inappropriately (“Disallow ‘implicit’ pseudoreplication which, as it often appears in the guise of very ‘convincing’ graphs is especially misleading”). My own approach follows much more closely that of Lauri Oksanen (2001, “Logic of experiments in ecology: is pseudoreplication a pseudoissue?” – an absolute must read!). Oksanen acknowledges pretty much everything Hurlbert says, but disagrees that we should throw out p-values. Instead, he says “computing inferential statistics is just courtesy towards the reader.”
So while the Hurlbert and Oksanen papers are often pitched as being in deep disagreement, they actually agree on much. The problem with p-values arises when we treat the test of statistical significance, “p<0.05”, as more important than the inferential power of the experimental design, and become befuddled by once again ritually rubbing the talisman of “p<0.05”. Hurlbert is a bit more inclined to treat p-values as either valid or not, and to omit them when they’re not. But Hurlbert and Oksanen both agree that perfectly good papers can exist with pseudoreplication. In a nutshell, the problem is that we have gotten lazy and allowed a very quick check of “p<0.05”, which is a matter of statistical inference, to replace the much harder work of scientific inference. I.e., is the logic of the experimental design convincing (or the best achieved to date, or weak but combined with an important enough result to make it worth publishing)?

And once you accept that scientific inference (as opposed to statistical inference) is the judgment criterion of a paper, then I think the Oksanen view makes a lot of sense. Report the p-value as a courtesy to the reader (we all know the reviewers are going to ask for them anyway). It is a courtesy because a p-value is a meaningful approximate estimate of something we are interested in: it is useful to have some estimate (even if imperfect) of the odds that the observed data could have come from the null hypothesis. Then acknowledge that it is approximate and highlight the limitations – i.e. clearly label and disclose. Of course, then you cannot hang your whole paper on whether p<0.05. Instead you have to put the emphasis on a candid discussion of the strength of your scientific inference. Then the problem that the p-value is imperfect and approximate goes away. And we can stop mounting ritualistic defenses of the perfection of p-values because we used the “right” statistics that “fixed” “all” the violations of assumptions, and turn our attention to where it should be – how strong is the scientific inference in the paper?
What do you think? Have you heard of the notion of p-values as a courtesy? Do you agree with it? Do you agree that p-values are necessarily approximations? Should you never report a p-value if your statistics are imperfect? I’m aware I’m swimming against the stream here – the replication crisis in social psychology is all about more rigorous p-values, whereas I would argue it is really about treating p-values as talismans and avoiding critical thinking about the scientific inference of the papers. What do you think?