Statistical machismo?

Posted on September 11, 2012 by Brian McGill

Are ecologists excessively macho when it comes to statistical methods? I use the word macho in a purely gender-neutral way – I use it to mean “posturing to show how tough one is and place oneself at the top of the hierarchy”.

In my experience ecologists have a long list of “must use” approaches to statistics that are more complicated than simpler methods but don’t necessarily change the outcome. To me this is a machismo attitude to statistics – “my paper is better because I used tougher statistics”. It has a Red-Queen dynamic – eventually what starts as a signal of being superior turns into something reviewers expect in every paper. But often, with a little thinking, there is really no reason this analysis is needed in a particular case (the reviewer who is requiring it is so far removed from the development of the approach that they have forgotten why it is really used). And even if the more complex approach might be relevant, it can be very costly to implement yet often has very little impact on the final results. Thus what started out as statistical machismo turns into wasted time required by reviewers. Here are some of my hobby horses:

Bonferroni corrections – One should be careful of multiple comparisons and the possibility of increased Type I error. However, this is carried way overboard. First of all, one is usually told to use the Bonferroni method, which is known to err way on the other side and be excessively conservative. Secondly, it is used without thought about why it might be required. I recall a colleague who had measured about 35 floral traits in two populations. About 30 of them came back as significantly different. Reviewers told them to do a Bonferroni correction. To anybody who understands the biological question and the statistics, a Bonferroni correction will make no difference in the final answer (OK only 26 out of 35 traits will be significant after correction but are we now going to conclude the populations haven’t differentiated?). Now if only 2 or 4 out of 35 were significant at p<0.05, then some proper correction would absolutely be needed (but then the whole conclusion probably ought to change regardless of Bonferroni outcomes in this case). But when 30 out of 35 were significant we’re still going to waste time on Bonferroni corrections?
Phylogenetic corrections – Anytime your data points represent different species (i.e. comparative analysis), you are now expected to do some variation of PIC (phylogenetically independent contrasts) or GLS regression. I know there are a handful of classic stories that reverse when PICs are used. But now people are expected to go out and generate a phylogeny before they can publish any comparative analysis. Even when we don’t have good phylogenies for many groups. And when the methods assume the phylogenies are error-free, which they are not. And when the p-values are <0.0000001 and unlikely to change under realistic evolutionary patterns. I once was told I had to do a phylogenetic regression when my dependent variable was abundance of a species. Now of all traits that are not phylogenetically conserved, abundance is at the top of the list (there is published data supporting this), so there was essentially no chance of a phylogenetic signal in this regression. When I argued this, my opponent eventually fell back on “well that’s how you do good science” to justify why I should still do it – there was no link back to the real issues.
Spatial regression – Increasingly reviewers are demanding some form of spatial regression if your data (specifically residuals) are spatially structured. It is true that treating points as independent when they are spatially autocorrelated can lead you to Type I error. But it usually doesn’t change your p-value by orders of magnitude and in real world cases, many spatial regressions have hundreds of points and p-values with 5 or 6 leading zeros. They’re still going to be significant after doing spatial GLS. And, here is the important point – ignoring spatial autocorrelation does not bias your estimates of slope under normal circumstances (at worst it makes them less efficient) – so ignoring autocorrelation will not introduce error into studying the parameters of the regression. You can also use simple methods to adjust the degrees of freedom and hence the p-value without performing spatial regression. Incidentally, I think the most interesting thing to do with spatial autocorrelation is to highlight it and study it as being informative, not to use statistical techniques that “correct it out” and let you ignore it – the same thing I would say about phylogenetic correlation. Note that all of these arguments apply to timeseries as well.
Detection error – I am running into this increasingly with use of the Breeding Bird Survey. Anytime you estimate abundance of a moving organism, you will sometimes miss a few. This is a source of measurement error for abundance estimates known as detection error. There are techniques to estimate detection error, but – and here’s the kicker – they effectively require repeated measures of essentially the same data point (i.e. same time, location and observer) or distance-based sampling where distance to each organism is recorded, or many covariates. This clearly cuts down the number of sites, species, times and other factors of interest you can sample and is thus very costly. And even if you’re willing to pay the cost, it’s not something you can do retroactively on a historical dataset like the Breeding Bird Survey. Estimating detection error also requires unrealistic assumptions, such as the assumption that the population is closed (about like assuming a phylogeny has no errors). Now, if one wants to make strong claims about how an abundance has gone from low to zero, detection error is a real issue (see the debate on whether the Ivory-billed woodpecker is extinct). Detection error also can be critical if one wants to claim cryptic species X is rarer than loud, brilliantly colored species Y, since the differential detectability biases the result. And detection error alone indubitably biases estimates of site occupancy downward (you can only fail to count individuals in detection error) but this assumes detection error is the only or dominant source of measurement error (e.g. might mistaken double counting of individuals accidentally cancel out detection error?). But if one is looking at sweeping macroecological questions, primarily comparing within a species across space and/or time, it is hard to spin a scenario where detection error is more than a lot of noise.
Bayesian methods – this one is a mixed bag (Jeremy has discussed his view on Bayesian approaches previously here and here). There have been real innovations in computational methods that are enabled by Bayesian approaches (e.g. hierarchical process models sensu Clark et al). Although even here, it is in most cases the Markov chain Monte Carlo (MCMC) method as a computational tool to numerically solve complex likelihoods that is the real innovation – not Bayesian methods. (As an aside, to truly make something Bayesian sensu stricto in my mind you need to have informative priors, which ecologists rarely do, but I know others enjoy the philosophical differences between credible intervals vs. confidence intervals, etc). Notwithstanding these benefits, I have reviewed papers where a Bayesian approach was used to do what was basically a two-sample t-test or a multivariate regression or even a linear hierarchical mixed model (OK the last is complicated but still not as complicated to most people as the Bayesian equivalent). Apparently I was supposed to be impressed at how much better the paper was because it was Bayesian. Nope. The best statistic is one that is as widely understood as possible and good enough for the question at hand.
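
For readers who want to see the multiple-comparisons arithmetic from the Bonferroni item above, here is a minimal sketch in Python. The 35 p-values are hypothetical, invented to resemble the "30 of 35 significant" situation, and are not the floral-trait data. Holm's step-down procedure is included only as one example of a less conservative correction; the post does not name a specific alternative, so that choice is an assumption.

```python
def bonferroni(pvals, alpha=0.05):
    """Reject each test whose p-value is at or below alpha / m."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]


def holm(pvals, alpha=0.05):
    """Holm step-down: compare the k-th smallest p-value (k = 0, 1, ...)
    to alpha / (m - k) and stop at the first one that fails."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    reject = [False] * m
    for k, i in enumerate(order):
        if pvals[i] <= alpha / (m - k):
            reject[i] = True
        else:
            break
    return reject


# Hypothetical p-values for 35 traits (illustration only):
# 26 very strong differences, 4 weaker ones, 5 clearly non-significant.
pvals = [1e-6] * 26 + [0.004, 0.02, 0.03, 0.04] + [0.2, 0.35, 0.5, 0.7, 0.9]

n_uncorrected = sum(p <= 0.05 for p in pvals)
n_bonferroni = sum(bonferroni(pvals))
n_holm = sum(holm(pvals))

print("significant, uncorrected:", n_uncorrected)
print("significant, Bonferroni: ", n_bonferroni)
print("significant, Holm:       ", n_holm)
```

With these made-up inputs the count of significant traits drops only by a handful under either correction, which is the pattern the example describes. Different inputs could of course behave differently.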

These techniques all have the following features in common:

a) They are vastly more complex to apply than a well-known simple alternative

b) They are understood by a much narrower circle of readers – in my book intentionally narrowing your audience is a cardinal sin of scientific communication when done unnecessarily (but I secretly suspect this is the main reason many people do it – the fewer people who understand you, the more you can get away with …)

c) They may require additional data that is impossible or expensive to obtain (phylogenies, repeated observations for detection). Sometimes the data (e.g. phylogenies) or assumptions (closed populations of detection analysis) are error-riddled themselves but it is apparently okay to ignore this. They might also require new software and heavy computational power (i.e. Bayesian).

d) They reduce power in a statistical sense, downgrading the p-value. This means that on average we will need to collect a bit more data, and it simultaneously pays false homage to p-values instead of important things like variance explained and effect size, while erroneously prioritizing Type I over Type II error.

e) They have not in the grand sweep over many papers fundamentally changed our understanding of ecology in any field I can name (nor even changed the interpretation of most specific results in individual papers).

In short, our collective statistical machismo has caused us to require statistical approaches that are a drag on the field of ecology, and has allowed them to become (or to be quickly becoming) firmly established as “must do” to publish, let alone to be considered high-quality science. I don’t object to having these tools around for when we really need them or valid questions arise. But can we please stop reflexively and unthinkingly insisting that every paper that possibly could use these techniques use them? Especially, but not only, when we can tell in advance they will have no effect. They have real (sometimes insurmountable) costs to implement.

To make this constructive, here is what I would suggest:

For the issues that are obsessed with Type I error (Bonferroni, spatial, temporal and phylogenetic regressions), I would say: a) stop wasting our time when p=0.00001 – it ain’t going to become non-significant (or at a minimum the burden of proof is now on the reviewer to argue for some highly unusual pathology in the data that makes the correction matter way more than usual or else that estimation bias is being introduced). If p is closer* to 0.05 then, well, have a rational conversation about whether hypothesis testing at p<0.05 is really the main point of the paper and how hard it is to get the data to do the additional test vs the importance of the science, and be open to arguments of why the test isn’t needed (e.g. knowing that there is no phylogenetic signal in the variable being studied).
For detection error, apply some common sense about whether detection error is likely to change the result at hand or not. For some questions it is, for some it isn’t. Don’t stop all science on datasets that don’t support estimation of detection error.
For Bayesian approaches – out of simple respect for your audience, don’t use Bayesian methods when a simpler approach works. And if you are using a complex approach that requires Bayesian calculations, be clear whether you are just using it as a calculation method on likelihoods or really invoking informative priors and the full Bayesian philosophy. And the burden is still on you to justify that you have answered an ecologically interesting question – including a Bayesian method doesn’t give you a free pass on this question.
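
As a rough check on the idea that a very small p-value survives an adjustment for autocorrelation, the sketch below shrinks the sample size to an "effective" one and recomputes the p-value of a correlation. It assumes a first-order approximation usually stated for time series, n_eff = n(1 - r1*r2)/(1 + r1*r2), where r1 and r2 are the lag-1 autocorrelations of the two variables, plus a Fisher z normal approximation for the p-value. All of the numbers are hypothetical, and this is only one of several possible adjustments, not necessarily the "simple methods" mentioned in the spatial regression item.

```python
import math


def effective_n(n, r1, r2):
    """Approximate effective sample size for a correlation between two
    series with lag-1 autocorrelations r1 and r2 (assumed approximation)."""
    return n * (1 - r1 * r2) / (1 + r1 * r2)


def corr_p_value(r, n):
    """Two-sided p-value for correlation r with sample size n,
    using the Fisher z transform and a normal approximation."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical inputs, chosen only for illustration.
n = 300          # number of sites or time points
r = 0.4          # observed correlation between the two variables
r1 = r2 = 0.6    # lag-1 autocorrelation of each variable

n_eff = effective_n(n, r1, r2)
p_naive = corr_p_value(r, n)
p_adjusted = corr_p_value(r, n_eff)

print(f"effective sample size: {n_eff:.1f} of {n}")
print(f"p-value ignoring autocorrelation: {p_naive:.2e}")
print(f"p-value with reduced sample size: {p_adjusted:.2e}")
```

With these made-up inputs the effective sample size falls to roughly 141, the p-value rises by several orders of magnitude, and it still sits far below 0.05. Stronger autocorrelation or a weaker correlation would narrow that margin, which is where the conversation about p-values closer to 0.05 comes in.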

To those readers who object to this as a way of returning common sense to statistics in ecology, I challenge you to make a case that these kinds of techniques have fundamentally improved our ecological understanding. I know this is a provocative claim, so don’t hold back. But please: 1) don’t tell me you have to do the test “just because” or because “statisticians agree” (they don’t – most statisticians understand the strengths and weaknesses of these approaches way better than ecologists) or because skipping it violates the assumptions (most statistics reported violate some assumption – it’s just a question of whether it violates the assumptions in an important way); 2) don’t assume that I am a statistical idiot and don’t understand the implications for Type I error, etc; and 3) do address my core issues of the real cost of implementing them and describe how they improve the state of ecological knowledge (not statistical assumption satisfying) enough to justify this cost. Otherwise, I claim you’re guilty of statistical machismo!

UPDATE – if you read the comments you’re in for a long read (109 and counting). If you want the quick version, I posted a summary of the comments. At the moment it is the last post (until somebody comments on the original post again). If it’s not the last post or close to it, you can find it by using your browser to search for “100+”, which should take you straight to the summary.

*(I would propose an order of magnitude cut-off of only worrying about Type I error correction if p>0.005, and I think this is conservative based on how much I’ve seen these correction factors change p values)

202 thoughts on “Statistical machismo?”

guy incognito on July 13, 2015 at 9:43 pm said:

um, bonferroni corrections are -not- difficult to do, and correction for multiple comparisons is absolutely the responsibility of the researcher. there’s nothing ‘macho’ about doing the minimum responsible statistical analysis!

in particular in the example given (30 out of 35 traits significant, uncorrected) the argument that you shouldn’t do the correction because even if only 26 traits are different we will still conclude the populations are different is ridiculous. first, because it is still valuable to know -which- actual traits indeed are different in the two populations, and second because you don’t know prior to doing the correction whether the number of significant difference will go from 30 to 26 or 30 to 0!

- Brian McGill on July 13, 2015 at 9:52 pm said:
  
  You do indeed in the practical real world, not the world of pathologically cooked up counter examples, know it won’t go from 30 to 0, or anything close to 0. That’s my point. And you appear to have more faith than I do that p<0.05 after Bonferroni corrections is an absolute and perfectly correct arbiter of "knowing which traits are different".
  
  Anytime I hear a phrase like "minimum responsible statistical analysis" on a topic about which there is active, informed disagreement I know people are setting themselves up to win and not engaging in the actual details of the issues. Usually starting a sentence with "um, " is a good indication too.
  
guy incognito on July 13, 2015 at 10:17 pm said:

well, i agree i shouldn’t have started with ‘um’, it was a bit snarky and i do apologize. however, i do feel strongly about being responsible with statistics and i don’t think at all that calling something a minimal responsibility is ‘setting myself up for a win’. i deal with a lot of different types of data and i think this sentiment is more and more important in this era of ‘big data’ where we have access to vast amounts of data and it’s easier than ever to quickly run a huge number of statistical tests. i really -do- think that correcting your statistics to account for the number of tests you’ve run is 1) not that difficult and 2) the very least you can do to be responsible. i honestly don’t understand the notion that it would be ok to not perform a statistical correction on the basis that it’s too time consuming and that you’re sure it won’t change the result. and i definitely don’t agree with the notion that there’s some kind of machismo attached to it. if you want to make inferences about the world, you should do the corrections to make sure your statistics reflect -real- probabilities … nothing macho about it!

- Jeff Houlahan on July 13, 2015 at 10:30 pm said:
  
  For most people who have serious issues with multiple comparisons corrections, the concern isn’t that it’s too time consuming. There is no question that doing lots of comparisons is a problem, but suggesting that a Bonferroni correction is the correct way to deal with the problem is where I think you potentially go astray. The key problem, in my opinion, is that any post-hoc correction that reduces the probability of Type I errors inevitably increases the probability of making a Type II error. Whenever I see somebody apply a post-hoc multiple comparisons correction without (1) identifying their acceptable Type II error rate and (2) explicitly stating their assumptions about the relative costs of Type I and II errors, I assume they’re doing it in an uninformed way. In my experience, in most of the examples I see where post-hoc corrections get applied, it is in a rote way. Perfectly reasonable people disagree with your position on corrections – it’s not as clear cut as you’ve presented it. Best, Jeff Houlahan
  
guy incognito on July 13, 2015 at 11:17 pm said:

While it’s true that perfectly reasonable people disagree about corrections (e.g. Bonferroni vs Holm-Šidák vs .. etc), I don’t think I’ve ever heard a reasonable person say that there’s -no- need to correct for multiple comparisons. And certainly you’re correct that many people don’t explicitly say the particular Type I and II error rate they find acceptable with their test of choice, but I don’t see how that’s worse than performing no correction at all! Better a naive choice of correction performed by rote than no correction at all because it’s time-consuming or macho to perform one. (I’ve also never seen someone do multiple comparisons, not correct for them, and then explicitly mention the error rates they’re susceptible to because they didn’t)

- Brian McGill on July 14, 2015 at 12:04 am said:
  
  It depends a lot on your inferential framework. Of which hypothesis testing at the altar of p<0.05 is one perfectly valid way but by no means the only valid way.
  
  Specifically, if you are hinging every conclusion on p<0.05 and each test represents a completely independent hypothesis that you are equally interested in and you care more about Type I than Type II error, then perhaps it is irresponsible not to do the calculations. But there are a lot of other valid approaches. And in some contexts valuing Type I over Type II error is downright irresponsible – e.g. the precautionary principle and conservation, where it is better to err on the side of protecting things until we know better. And in my original example, while I agree that a MANOVA would have been better, they weren't really interested in which traits differed – only in whether the two populations differed overall or not, for which more than enough calculation had been done to be abundantly sure the answer was yes. And that is not even touching exploratory statistics where p-values might be reported in only a very loose, courtesy fashion without claims of hypothesis testing.
  
  And you raise a good point that whatever the solution, the problem gets bigger as we move to big data.
  
- David Chalcraft on November 8, 2017 at 1:56 pm said:
  
  Late reply but thought these references might be helpful if someone stumbles across the above comment as I did. I consider Stuart Hurlbert a rather reasonable individual and I imagine others who found his work on pseudoreplication well reasoned would consider him reasonable as well. Stuart has advised against such corrections for multiple comparisons. His arguments and a coverage of the history of the debate among statisticians on this subject can be found here:
  
  Hurlbert, S.H. and C.M. Lombardi. 2012. Lopsided reasoning on lopsided tests and multiple comparisons. Australian and New Zealand Journal of Statistics 54:23-42
  
  Another more recent paper on the history and controversy on this subject is:
  
  Streiner, D. L. 2015. Best (but oft-forgotten) practices: the multiple problems of multiplicity – whether and how to correct for many statistical tests. Am J Clin Nutr 102:721–8.
  
  Though one might disagree with the arguments made in these papers it does not mean that all reasonable individuals who have considered this topic have concluded that corrections for multiple comparisons are always needed.
  
hypergeometric on December 28, 2015 at 3:56 pm said:

Hmmm. “More complicated” depends upon where one is coming from. A problem with some of the analyses people are taught is that they are “just tests” and people use them as if they were black boxes, on faith. Also, it is a little unfair to impugn Bayesian methods as “difficult to understand” when the people who are using frequentist alternatives seldom know the assumptions under which their asymptotic behaviors are derived, and know even less about the corresponding derivations. That reinforces the tendency to use these as black boxes or, worse, try M alternative tests and use the one that gives the answer wanted. (Seriously, I’ve seen that done.) And, if the Bayesian prior haunts a student of a problem (it oughtn’t for reasons not pertinent here), there are excellent methods based upon information theoretic measures now available, too.

Also, with Bayesian methods or information theoretic approaches we don’t usually worry about multiple comparisons since, if posed properly, these techniques correct for these.

I do admit, model selection with Bayesian methods can be tricky.

I don’t know your field, as much as I deeply respect it and learn from it, but it’s hard to complain about the techniques King, Morgan, Gimenez, and Brooks offer in their Bayesian Analysis for Population Ecology on the one hand, and Burnham and Anderson offer in their Model Selection and Multimodel Inference on the other.
