Technical statistical mistakes are overrated; ecologists (especially students) worry too much about them. Individually and collectively, technical statistical mistakes hardly ever appreciably slow the progress of entire subfields or sub-subfields. And fixing them rarely meaningfully accelerates progress. The rate of scientific progress mostly is not statistical mistake-limited.
Don’t agree? Try this exercise: name the most important purely technical statistical mistake in ecological history. And make the case that it seriously held back scientific progress.
Go ahead, I’ll wait. 🙂
Off the top of my head, here are some candidates for most important purely technical statistical mistakes in ecological history. None of them seems all that important in the grand scheme of things, honestly:
- Regressions of local species richness on regional species richness that failed to account for the boundedness of the operational space, biasing the results.
- Using the wrong null models as baselines against which to test for density dependence in population dynamics. Dennis & Taper 1994 review some of this history.
- Overfitting niche models by failing to allow for spatial autocorrelation in abiotic environmental conditions and species’ occurrences/abundances.
- If you want to argue that failure to account for detection probability is a purely technical statistical mistake that has seriously held ecology back, Brian will fight you. 🙂
- I don’t think that widespread adoption of generalized linear models over general linear models applied to transformed data fixes a technical mistake, or that it makes a big enough difference to the scientific conclusions in enough cases to constitute an important advance to ecology.
- I suppose you could follow Nate Silver and argue that null hypothesis significance testing is a technical mistake that has massively held back all scientific progress, including in ecology. I don’t buy that argument. But if you want to make the case for a statistical mistake appreciably limiting the progress of an entire scientific field, that’s probably the magnitude of mistake you’re looking for.
Look, I’m very glad that technical mistakes on my little list got fixed. I think those fixes were real advances. I’m just not convinced they were big advances. For instance, I don’t get the sense that fixing how we regress local richness on regional richness had any appreciable effect on the direction of that research program. My sense is that that research program always had more fundamental conceptual problems, and was already petering out or changing direction anyway by the time the statistical problem was fixed. As another example, see Brian’s old posts arguing that estimating detection probabilities hasn’t really much advanced or altered our understanding of wildlife ecology.
Of course, what constitutes a “purely technical statistical mistake” is open to debate. One common way that people who prefer statistical approach X argue for X is to argue that not using approach X is a purely technical mistake, rather than a defensible choice with upsides and downsides. That argument has been deployed for instance by defenders of estimating detection probabilities. As another example of the fuzzy line between “purely technical statistical mistake” and “other stuff”, consider the use of randomized null models of species x site matrices to infer interspecific competition or facilitation. Is that a purely technical statistical mistake? Or an ecological mistake–a mistaken ecological interpretation of technically-sound statistics? I dunno, honestly. I’d say ecological mistake, but I could imagine someone arguing otherwise.
I do think that technical statistical advances aid scientific progress. But I think they do so by combining in complicated ways with all sorts of other changes in the constantly-evolving ecosystem of scientific practice. And I think we’d be better off if we all recognized that and emphasized it more to our students. Students who come to me for statistical advice invariably worry far more about technical statistical issues than they need to, to the exclusion of more important issues like being clear about what scientific question they’re trying to address in the first place. Yes, you should get your technical statistics right enough (that last word is important). But it’s more important for you to think hard about what scientific question is worth asking, and all the things besides statistics that go into answering it.
I’m deliberately taking the strongest defensible position I can take on this, in the hope of sparking an interesting thread in which people come up with counterexamples and explain why I’m wrong. Plus, I only spent about 10 minutes trying to think of examples. I figured that this isn’t the sort of thing one can exhaustively research, so I might as well toss out a few examples and then open the floor. So have at it! 🙂
p.s. This post does not say or imply that pre-publication peer reviewers should stop caring about technical soundness, or that we should stop teaching our students statistics! Part of the reason technical mistakes don’t much inhibit scientific progress is because we teach our students not to make them, and because peer reviewers catch them. We, and they, should keep doing those things, because otherwise many more technical statistical mistakes would get made. We should just try to do those things in a way that doesn’t cause the possibility of technical statistical mistakes to take on an outsize importance in our minds. And I don’t think we need to do much more than we already do to prevent or correct statistical mistakes.
If you’re going to accept that significance testing is a technical mistake, then hasn’t it had a huge effect? For example, think of how much time has been wasted on null models.
Yes, if you regard it as a technical mistake. I don’t. Though I do think NHST gets abused.
This assumes there is no unidentified TSM that founded a whole path of research without the initial finding being replicated, something analogous to this example from the human gut microbiome.
Ooh, good example of regression to the mean! That one’s going in my list of statistical vignettes for intro biostats: https://dynamicecology.wordpress.com/2018/03/19/statistical-vignette-of-the-day-as-a-teaching-tool/
And yes, I’d agree that’s an important purely technical statistical mistake that, had it not been made, might’ve prevented that line of research from ever getting off the ground. But the example isn’t from ecology–should ecologists take comfort from that?
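For readers who haven't met regression to the mean before, here's a minimal simulation of the mechanism (Python; the sample sizes and noise levels are arbitrary, chosen purely for illustration): subjects selected for extreme baseline values of a noisy measurement drift back toward the mean on re-measurement, with no intervention at all.

```python
import numpy as np

rng = np.random.default_rng(42)

# Each subject has a stable "true" value plus independent measurement noise
n = 10_000
true_vals = rng.normal(0, 1, n)
baseline = true_vals + rng.normal(0, 1, n)
followup = true_vals + rng.normal(0, 1, n)  # no treatment applied

# Select the most "extreme" subjects at baseline, as a flawed study might
extreme = baseline > np.quantile(baseline, 0.9)

mean_baseline = baseline[extreme].mean()
mean_followup = followup[extreme].mean()

# The follow-up mean falls back toward zero purely by chance --
# an apparent "effect" with no intervention at all
print(f"baseline mean of extreme group: {mean_baseline:.2f}")
print(f"follow-up mean of same group:   {mean_followup:.2f}")
```

If a paper interpreted that drop as a treatment effect, you'd have a spurious finding that no amount of re-running the same design would catch.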
Not my field, but another example of an important TSM would be not controlling for multiple comparisons in genetic association studies, leading to a lot of spurious and unreplicated results.
Yes, the most serious technical statistical mistakes I can think of, in terms of their consequences for the overall direction of a scientific field, are all from outside ecology. The one that occurs to me is failing to properly correct for multiple comparisons (or properly control FDR) in brain imaging studies, apparently leading to the entire field being based mostly on false positives. We have old linkfest entries on this; can’t find them just now.
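The multiple-comparisons problem is easy to demonstrate with a toy simulation (Python; the number of tests and group sizes are arbitrary choices of mine): run thousands of tests on data with no true group differences and, uncorrected, roughly 5% come out "significant" by chance, while a Bonferroni correction brings that back to essentially zero.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(0)

def z_test_p(a, b):
    """Two-sided two-sample z-test p-value (fine for moderate n)."""
    se = sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    z = (a.mean() - b.mean()) / se
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

# 5,000 "voxels"/tests, all with NO true group difference
n_tests, n_per_group = 5_000, 50
pvals = np.array([
    z_test_p(rng.normal(0, 1, n_per_group), rng.normal(0, 1, n_per_group))
    for _ in range(n_tests)
])

uncorrected = (pvals < 0.05).sum()            # expect ~5% false positives
bonferroni = (pvals < 0.05 / n_tests).sum()   # expect ~0

print(f"uncorrected 'discoveries': {uncorrected}")
print(f"Bonferroni 'discoveries':  {bonferroni}")
```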
The ongoing replication crisis in social psychology is more of a borderline case–an unfortunate interaction between technical statistical mistakes like not pre-specifying hypotheses, and other sorts of problems like lack of useful theory.
Would you please define “purely technical statistical mistakes”?
The post and comments give examples, including of borderline cases. Are there particular examples you disagree with or find unclear?
Would you consider expert bias in selecting (supposedly random) sampling points as a “technical” statistical mistake? If yes, then much of our conception of species’ habitats and whereabouts may be overly biased. And this may actually have a huge effect (and cost) on the way we design national parks or proceed with impact assessment.
Speaking generally (since I know nothing of the specific cases you seem to have in mind), yes, if somebody says they sampled randomly when in fact they didn’t, that’s a technical mistake. The seriousness of the mistake depends on various considerations.
Brian has some old posts on the broad topic of autocorrelation and non-independence of observations in ecology:
I’m just amazed that Diamond/Simberloff hasn’t been mentioned yet.
Quoting from the post:
“As another example of the fuzzy line between “purely technical statistical mistake” and “other stuff”, consider the use of randomized null models of species x site matrices to infer interspecific competition or facilitation. Is that a purely technical statistical mistake? Or an ecological mistake–a mistaken ecological interpretation of technically-sound statistics? I dunno, honestly. I’d say ecological mistake, but I could imagine someone arguing otherwise.”
Pingback: Yes, statistical errors are slowing down scientific progress! | theoretical ecology
Jeremy, because of length, I have placed my thoughts on the impact of issues such as p-hacking, inappropriate analysis combinations or wrong methods here https://theoreticalecology.wordpress.com/2018/05/03/yes-statistical-errors-are-slowing-down-scientific-progress/
Of course, you can always say: look, we are making great progress, aren’t we … so why does it matter? As I point out, admittedly somewhat polemically, you could say the same about data fabrication:
[irony on] Young people worry far too much about data collection, instead of just inventing data. I challenge you to name the most important data fabrication in ecological history. And make the case that it seriously held back scientific progress [irony off].
I agree with you that there are some problems that aren’t that big of a problem. Although I have a different view on many things that were described as “Statistical Machismo” on this blog, I’d agree that the failure to use phylogenetic corrections, spatial models, or detection errors alone probably hasn’t massively distorted the scientific record. But, as I argue in my post, I have no doubt that the sum of statistical mistakes that are rampant in ecology collectively has some impact on the reliability of ecological results, and the progress of the field.
One ecology blog responding to something another ecology blog wrote! That is some old school blogging. Thanks! 🙂
I think that your post is the best counter-argument there is against my post. I’m still inclined to disagree, but not all that strongly. And I freely admit I don’t have any data that would settle the issue.
Re: data fabrication, I actually don’t think it is a big deal. Nobody should do it. Students should be told not to do it. Anybody who does it should be punished appropriately. But thanks in large part to norms against it, it’s so rare that when it does happen it doesn’t make much difference to the progress of science, save perhaps in a few extremely unusual cases. So if (say) you told me that we need to totally reform how scientists and their research are evaluated because data fabrication is a huge problem and we need to remove all incentives to do it, I’d strongly disagree. I think we tend to overrate its importance because the rare high profile cases, like Michael Lacour, attract massive attention and many people *way* overgeneralize from them. Even though they attract massive attention precisely *because* they’re extremely unusual.
yeah, Jeremy, we’re going old school 😉
About the data fabrication: I agree, it probably hasn’t had a huge impact. What I wanted to say is that that’s not a good argument for why it’s OK to fabricate data, or not worry about it.
Thanks for this, Florian. I held off replying to this post because I wasn’t coming up with a good way to say exactly this. From my experience as a statistical consultant with students, “I have to squeeze something significant out of these results, how do I do it?” is really common, and often done at the advisor’s request. While any given paper might not have stalled ecology in any given subfield, I think it has acted like sand in the wheels throughout the whole discipline; there are just too many empirical papers out there that can’t actually be trusted but that we still cite.
Also: +100 for “fitting power laws with linear models”! The whole Levy flight debate took far longer to resolve than it needed to, because people were using completely inappropriate methods to fit Levy distributions.
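For anyone curious what that mistake looks like in practice, here's an illustrative sketch (Python; the sample size, exponent, and binning choices are my own arbitrary picks) contrasting the flawed approach (linear regression on a log-log histogram) with the maximum-likelihood estimator recommended by Clauset et al. (2009), which for a continuous power law is alpha = 1 + n / sum(ln(x_i / x_min)).

```python
import numpy as np

rng = np.random.default_rng(1)

# Draw samples from a power law p(x) ~ x^(-alpha) for x >= x_min
alpha_true, x_min, n = 2.5, 1.0, 100_000
u = rng.random(n)
x = x_min * (1 - u) ** (-1 / (alpha_true - 1))  # inverse-CDF sampling

# Flawed approach: linear regression on a log-log histogram.
# Sparse, noisy tail bins get equal weight and can bias the slope.
counts, edges = np.histogram(x, bins=np.logspace(0, np.log10(x.max()), 50))
widths = np.diff(edges)
centers = np.sqrt(edges[:-1] * edges[1:])   # geometric bin centers
density = counts / (widths * n)
keep = counts > 0
slope, _ = np.polyfit(np.log(centers[keep]), np.log(density[keep]), 1)
alpha_ols = -slope

# Maximum-likelihood estimate: unbiased and far less noisy
alpha_mle = 1 + n / np.log(x / x_min).sum()

print(f"true: {alpha_true}, OLS-on-histogram: {alpha_ols:.2f}, MLE: {alpha_mle:.3f}")
```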
Your experience is different from mine, Eric. Students never come to me asking how to squeeze something significant out of their data.
Maybe interesting in this context – experience with freelance statistical consulting https://twitter.com/nickchk/status/991397611306303488
It may very well be that the mistakes that held us back the most are the ones that are not (yet) corrected. Even without regarding all of NHST as one colossal mistake, elements of its abuse (e.g. burying of non-significant findings, p-hacking) are widespread. How can we know how much these mistakes have held us back? What concepts have been tested over and over because the non-significant findings were not published, rarely published, or published in obscure places? How much of our collective resources is used pursuing spurious leads because of p-hacked “significant” findings?
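One flavor of p-hacking, optional stopping, is easy to simulate (Python; the sample increments and number of "peeks" below are arbitrary illustration choices): every simulated study has no true effect, yet repeatedly testing as data accumulate and stopping at the first p < .05 inflates the false positive rate well above the nominal 5%.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(7)

def p_value(sample):
    """Two-sided one-sample z-test of mean == 0 (fine for moderate n)."""
    z = sample.mean() / (sample.std(ddof=1) / sqrt(len(sample)))
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

# Optional stopping: collect 20 points at a time, test, stop at p < .05.
# The null is TRUE in every simulation, so honest type I error should be 5%.
n_sims, false_pos = 1_000, 0
for _ in range(n_sims):
    data = np.empty(0)
    for _ in range(10):  # up to 10 "peeks" at the accumulating data
        data = np.append(data, rng.normal(0, 1, 20))
        if p_value(data) < 0.05:
            false_pos += 1
            break

rate = false_pos / n_sims
print(f"false positive rate with optional stopping: {rate:.3f}")
```

And unlike a one-off fluke, this inflation applies to every study run this way, which is what makes the cumulative cost hard to bound.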
Nice discussion! In my opinion, one relatively recent TSM is related to model selection. It may have some negative impact on the development of some fields in the future. Model selection and all associated tools were designed to explore big data sets related to poorly studied or understood phenomena or systems. In addition, model selection can be wonderfully used to operationalize the concept of strong inference, as proposed by Chamberlin (1890) and Platt (1964). Nevertheless, many people have been abusing model selection as a fancy statistical fishing rod. In other words, they don’t put much energy into thinking about the direct and indirect relationships between the studied variables. They don’t work with analytical mind maps, such as path diagrams and similar. Instead, they just let model selection search for minimum models and then make all interpretations a posteriori. That’s a modern way of predicting the past, which philosophically is very dangerous and may create some zombie ideas.
Marco, see the post linked in my comment, I provide an R example for this exact problem.
Thank you, I just saw it. Amazing! 😉
It’s a pity there is no “like” button in your blog! I loved your post.
I think it’s probably important to draw a distinction between:
a) wasted time and energy of authors by publishing papers that don’t accomplish much
b) wasting time and energy of non-authors by publishing papers that misdirect the field and cause many authors not on the original paper to pursue a question they otherwise wouldn’t have.
My read of Florian’s arguments is more in the vein of (a). My read of Jeremy’s question is more in the vein of when does (b) happen. Of course in the social sciences p-hacking has led to (b). It’s an interesting question why it has caused less of (b) in ecology (because I agree p-hacking happens in ecology).
Which raises a larger (and I suppose elitist) question. What fraction of papers in ecology actually materially affect the field and the direction of other workers in the field? I don’t think it’s a large fraction. Which makes (b) hard to do and necessarily rare.
Following that line of argument would suggest that in social psychology a larger fraction of papers is influential. I don’t know if that is true.
And then I think you have to add (c) not just wasted time but caused the overall consensus of the field to be wrong (not inefficient but wrong).
That is a much higher bar. And I don’t think (c) has happened due to statistical issues in ecology very often.
Re: your (c) Brian, I’d only add that I do think there are cases where the overall consensus in ecology is or has been wrong (and I know you weren’t denying that). The IDH, for instance. I agree that those cases mostly aren’t due to technical statistical issues though.
And yes, my argument is concerned with your (b) and (c).
And yes, if my argument is right, one reason is because most papers aren’t influential and don’t matter much. To which I think the strongest counter-argument is Florian’s: if a technical mistake is sufficiently widespread, the cumulative influence of many papers making that mistake can be large. So I don’t think that social psychology is in the mess it’s in because in that field a larger fraction of papers are highly influential. I think it’s in the mess it’s in through a combination of very widespread technical statistical mistakes interacting in a bad way with non-statistical problems like vague theory (which is arguably worse than having *no* theory).
Oh I agree – plenty of cases of (c) happening in ecology. Which makes it important to analyze why. I just don’t think flawed statistics is the answer to why very often (hardly at all). Might be in social psychology. But not ecology.
If I were to stick my neck out on the prime cause of (c) in ecology, I think ecology tends to jump on theories that are intuitively appealing and whose initial papers have data supporting them. Intuitively appealing is a complex melange in ecology. But I think the IDH was popular because of the complexity of “intermediate”. It was richer than “disturbance always increases or always decreases richness”, but not as complex and contingent as saying hurricanes increase, fires decrease birds but increase trees, etc., which is probably what the real truth looks like (in tone, not in specifics). Showing humans are bad makes things intuitively appealing too, and I think there are overtones of that in the IDH as well (natural disturbance regimes good, excessive human disturbance regimes bad).
Does quadratic regression inflate the false positive rate when testing for non-linear selection? More generally, have we been testing for [insert hump-shaped relationship] incorrectly?
A recent psych paper suggests a two-fold problem with quadratic regression: it often finds hump-shaped relationships when they aren’t really there (inflated type I error), and can have low power to detect the relationships when they are present. Original paper: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3021690 Blog summary: http://datacolada.org/62
Related to this, Stinchcombe et al. (2008) raised a TSM in the biological interpretation of quadratic regression coefficients themselves wherein the strength of non-linear selection was underestimated by 50%: https://onlinelibrary.wiley.com/doi/full/10.1111/j.1558-5646.2008.00449.x
Selection analysis is outside of my area, so I’m curious whether others think that these issues have held back (or are holding back) the field.
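To make the quadratic-regression pitfall concrete, here's a small simulation in the spirit of the paper linked above (Python; the logarithmic "true" relationship, sample size, and noise level are my own arbitrary choices): a strictly increasing relationship yields a significantly negative quadratic term, i.e. a spurious "hump", while a crude version of the two-lines idea (separate slopes below and above the median) correctly finds both slopes positive.

```python
import numpy as np

rng = np.random.default_rng(3)

# A strictly increasing (concave, never hump-shaped) relationship
n = 500
x = rng.uniform(1, 10, n)
y = np.log(x) + rng.normal(0, 0.5, n)

# Quadratic regression "detects" a hump: the x^2 term is significantly negative
X = np.column_stack([np.ones(n), x, x**2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
sigma2 = resid @ resid / (n - 3)
cov = sigma2 * np.linalg.inv(X.T @ X)
t_quad = beta[2] / np.sqrt(cov[2, 2])
print(f"quadratic coefficient: {beta[2]:.4f}, t = {t_quad:.1f}")

# Two-lines idea: fit separate lines below/above the median of x.
# A genuine hump needs a positive slope on one side AND a negative one on the other.
med = np.median(x)
lo_slope = np.polyfit(x[x < med], y[x < med], 1)[0]
hi_slope = np.polyfit(x[x >= med], y[x >= med], 1)[0]
print(f"slope below median: {lo_slope:.3f}, above median: {hi_slope:.3f}")
```

The published two-lines test uses a data-driven breakpoint rather than the median; the median split here is just the simplest version of the same logic.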
Yes, I’m aware of that paper, have linked to it in an old linkfest I believe. Great paper, I agree with it 100%. And I have a student doing a side project applying the proposed fix (the “two-lines test”) to ecological data. We’re starting with the IDH, but plan to move on to humped diversity-productivity relationships, and to stabilizing and disruptive selection.
I confess I’m still not 100% sure if this project will lead to much progress in the field even if it confirms that (say) the prevalence of humped bivariate relationships has been vastly overestimated by past quadratic regressions. I have an old post in which I argued that there’s no longer any point in merely testing whether (say) diversity-productivity relationships are humped or not, because no matter how the test comes out we don’t learn anything from the test:
So for me, undertaking this project is in part an attempt to refute my own past blog post. 🙂
As a mathematician who started out proving what I would consider “boring” theorems, I quite like this post. I now publish almost exclusively in physics/mathematical biology, and the things that are proved rigorously are more of an aside than the key part of my work. That said, much of my present work is exactly questioning when and how certain assumptions matter (e.g. random organism movement, spatial homogeneity, etc). I think in any field, keeping one’s eyes on the main goal is really important – rigour has its place, and I think the line we choose to draw there is unfortunately very subjective. But being overly careful can lead to papers which are difficult to read and unsatisfying, so that they aren’t really considered by others. Understanding the kind of thing which is important to the field is a difficult skill to acquire.
There’s a quote attributed to Einstein along the lines of “I’ve published many papers. Some of them are correct.” I can’t find this exact quote but I can source the following which is actually attributable to him (I’ll call this rigorous enough for this case 😉 ): “You don’t need to be so careful about this. There are incorrect papers under my name too.”
I just remembered that I wrote this same post years ago. Sort of. 🙂
Thanks to all contributing these comments. For those of us who don’t attend 10 meetings per year, or have a building full of colleagues engaged in these discussions, or are in small departments that don’t routinely invite guest speakers, blog comments are a pretty good alternative for staying intellectually engaged. Which raises an issue that seems to be repeated across many blogs — commenting is numerically dominated by males and we’d all benefit from more diversity. That said, here are my tl;dr comments on all the above:
1) Florian’s many little errors argument. I am a frequent (~25 datasets) Dryad repository stalker, because I use it to find contemporary data to teach statistics (who wants another example from the mtcars dataset?). I always first try to reproduce the results of the author. I fail to reproduce some component of most of these publications. Usually the differences are small and would not change the interpretation. Some are not so small. Some are catastrophic. So, supporting Florian, I do think there are many, many small errors out there, and supporting my first comment, I do think there are unknown catastrophic errors out there that matter (an aside: for teaching introductory statistics, Dryad data is only okay because most of the data come from very complicated sampling schemes or experimental designs).
2) Several comments have debated NHST, yet at least three comments have advocated NHST thinking, which permeates all of our thinking. These include:

A) Multiple testing in GWAS. I don’t think the issue is failure to account for multiple tests. The issue is the belief that anything was discovered because p < some number. Discovery requires much harder work than that! Bonferroni or FDR isn’t an answer (that said, FDR has very interesting properties and I do think it was a good application in fMRI studies). A way forward is to just rank the effects and start the hard bench work of testing them. In strong Cell studies, the initial association stuff is relegated to a bit part in a supplement. It’s only in fields depauperate of good science that the association study is the headliner.

B) Type I and II error in quadratic components of selection coefficients. Estimating effects using multiple regression of observational data has much bigger problems than type I and II error for quadratic effects. I am unaware of a published (or unpublished) strong argument against my conclusions but welcome them (a case where being wrong would be good).

C) Dissing model selection because it is fishing and inflates p-values or false discovery. Again, I don’t think the issue is fishing and p-values. The issue is the belief that one has discovered knowledge using simple and naive models of complex phenomena with observational data (I apply this criticism to causal graph models as well).
Not sure if this is also considered a “technical” statistical mistake, but circularity bias seems to be extremely common in everyday practice and ecological consultancy. Say, for example, an ecologist wants to find out more about the diagnostic species in “good quality” oak forest in the Balkans. One usual practice is then to choose several strata in oak forests that appear of “good quality” in his/her eyes or by some arbitrary standards, and then start sampling for the diagnostic species there. In the long term this set of diagnostic species is used to determine other “good quality” oak forests elsewhere.
Am I wrong to sense that there is a hidden issue of circularity bias in this approach? Following Kriegeskorte et al. (2010), “….an analysis is circular (or nonindependent) if it is based on data that were selected for showing the effect of interest or a related effect….”. The same paper discusses a number of statistical flaws that arise from this.
If I am correct, then I have a feeling that this bias is very common in how we survey things in ecology. I would very much like to work together with any of you (or any other person you might recommend) in order to try to quantify how pervasive circular analysis actually is in practicing ecology. This is an open question for partnership (with no funding at all though)!!
Kriegeskorte et al. (2010): Journal of Cerebral Blood Flow & Metabolism: 1–7