Aside from the question of what statistical methods are appropriate to use in ecology, there is a mostly independent question of how many statistical methods it is optimal to have in use across the field of ecology. That optimum might be driven by how many techniques we could reasonably expect people to be taught in grad school and to rigorously evaluate during peer review. Beyond that limit, the marginal benefits of a more perfect statistical technique could easily be outweighed by the fact that only a very small fraction of the audience could read or critique the method. To the extent we exceed that optimum and are using too many different methods, I think it is fair to talk about statistical Balkanization. Balkanization is of course a reference to the Balkans (the region in the former Yugoslavia) and how the increasing splintering into smaller geographic, linguistic and cultural groups became unsustainable and led to multiple wars. I think there is a pretty clear case that having too many statistical methods in use is bad for ecology, and thus that labeling that state Balkanization is fair (I’ll make that case below). I am less sure whether we are there yet or not.
After attending a recent conference I got to wondering about how many different statistical methods I had seen used to attack largely similar problems, but I wanted to be a little more rigorous about it. So I skimmed the methods of every research article in the June and July issues of Ecology Letters, for a total of 23 articles. Almost by definition, what appears in Ecology Letters is representative of “the mainstream of ecology”. Below are my nutshell summaries (each bullet point is one article; some are probably oversimplified or not completely accurate due to my skimming, but they’re close):
- GLM, ANOVA
- Network statistics, Joint species distribution models (JSDM)
- GLMM, Tukey HSD
- RDA, ANOVA w/ permutation restrictions to control for temporal autocorrelation, 4th corner analysis with bootstrapping, GLS
- ANOVA with varPower() variance structure (essentially GLS), LMM, stepAIC
- LMM, Moran’s I, likelihood ratio tests, PGLS (Phylogenetic regression), variance partitioning
- PERMANOVA, path analysis
- ABC, uniform priors, model selection
- Consensus phylogenetic trees, PCA, MCMC GLMM, brms (Bayesian), phylogenetic parameter estimation (Pagel’s lambda)
- weighted LMM, temporal autocorrelation regression, Moran’s I
- repeated measures ANOVA, corARMA covariance structure, network statistics
- quantile regression, GAM, permutation
- MCMC, Bayesian Inference, Orthogonal polynomials, uniform priors
- GLM, Global Sensitivity Analysis
- Kolmogorov-Smirnov, t-tests, bootstrapping
- 3 level nested LMM, Nakagawa & Schielzeth R2
- PERMANOVA, NMDS/RDA, LMM, SEM
- nested 2-way ANOVA, Tukey HSD, bootstrapped confidence intervals
- phylodiversity metrics, beta diversity metrics, LM, SAR (spatial autoregression), minRSA model selection
- AICc, model selection from full suite of models, LMM
- PCA, RDA, permutation tests, GAMM
- LM, LMM on simulation output
Note that LM=Linear Model (regression), LMM=Linear Mixed Model, GLM=Generalized Linear Model (logistic and Poisson regression), and GLMM=Generalized Linear Mixed Model. And beyond that, how many of the acronyms you understood is perhaps part of the point of this post.
Does this represent statistical Balkanization?
I don’t know. You tell me! On the surface I was somewhat reassured by the results. There seems to be a strong convergence on moving beyond linear models (regression) to add either random effects (linear mixed models, or LMM), generalized (non-normal) errors (GLM), or both (GLMM). Beyond that you could say there are a handful of multivariate papers using the basic PCA/RDA toolkit, plus a few specialty topics that will always be specialty topics, like phylogenetics, spatial autocorrelation and networks. So pretty good. A unified core around LMM/GLM/GLMM plus some specialty methods sounds promising. The core shares a basic model structure y=a+bx_1+cx_2+…+ε (possibly with a link function), and the idea that everybody is interpreting linear coefficients is unifying.
But it’s when you dig into the details that the wide array of approaches appears. That basic model is being fitted using (a minimal code sketch follows the list):
- sum of squares with F-tests (ANOVA)
- OLS/MLE by direct deterministic computation (i.e. calculating a formula)
- Maximum likelihood fit by direct integral quadrature (deterministic approximate computation)
- Maximum likelihood fit by Monte Carlo Integration
- Maximum Likelihood fit by Markov-Chain Monte Carlo integration
- weighted LMM
- a wide variety of error covariance structures
- multiple methods to detect and deal with temporal autocorrelation
- multiple methods to detect and deal with spatial autocorrelation
- bootstrapped confidence intervals
- permutation based confidence intervals (with several variations including constrained and unconstrained permutation)
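To make that concrete, here is a minimal sketch (in R, on simulated data) of how one and the same basic model structure can be pushed through several of these fitting engines. The package choices (nlme, lme4, brms) and the toy data are mine for illustration; none of this is taken from the papers above.

```r
# Toy data: one predictor, one grouping factor, simulated purely for illustration
set.seed(1)
dat <- data.frame(x = rnorm(100), group = factor(rep(1:10, each = 10)))
dat$y <- 2 + 0.5 * dat$x + rnorm(10)[as.integer(dat$group)] + rnorm(100)

fit_ols <- lm(y ~ x, data = dat)                          # OLS: closed-form least squares
fit_gls <- nlme::gls(y ~ x, data = dat,
                     correlation = nlme::corAR1(form = ~ 1 | group))  # GLS with an error covariance structure
fit_lmm <- lme4::lmer(y ~ x + (1 | group), data = dat)    # mixed model: ML/REML by iterative optimization
fit_glm <- glm(I(y > 2) ~ x, data = dat, family = binomial)  # GLM: iteratively reweighted least squares
# A Bayesian/MCMC version of the same mixed model (same formula interface):
# fit_brm <- brms::brm(y ~ x + (1 | group), data = dat)
```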
And the inferential frame includes (again, a small sketch follows this list):
- traditional p-value from F-tests
- traditional p-value from likelihood ratio tests (which can collapse to F-tests in some cases)
- AIC model selection from a subset of models chosen by the author
- somewhat related AIC selection on all possible models
- minRSA model selection
- Bayesian (the two Bayesian papers in this sample both used uniform/uninformative priors, but we are increasingly seeing vague priors)
- deviance partitioning (an extension of variance partitioning) done in several ways
- different methods of calculating pseudo-R2 on GLMs
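Continuing the sketch above, the same fitted model can then be run through several of these inferential frames. Again, the function choices (lme4, MuMIn) are just my illustrations of where such things live, not necessarily what the papers used.

```r
# Fit with ML (not REML) so likelihoods/AIC are comparable across fixed effects;
# na.fail is required by MuMIn::dredge()
fit_full <- lme4::lmer(y ~ x + (1 | group), data = dat, REML = FALSE,
                       na.action = na.fail)
fit_null <- lme4::lmer(y ~ 1 + (1 | group), data = dat, REML = FALSE)

anova(fit_full, fit_null)       # likelihood ratio test for the fixed effect of x
AIC(fit_full, fit_null)         # AIC comparison within an author-chosen model set
MuMIn::dredge(fit_full)         # AIC selection over all possible fixed-effect subsets
MuMIn::r.squaredGLMM(fit_full)  # Nakagawa & Schielzeth marginal/conditional R2
```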
And in just 23 articles in a very central journal we see:
- multiple applications of GAM
- one application of quantile regression
- a handful of phylogenetic methods
- a handful of network methods
- a handful of multivariate methods
- multiple uses of SEM or path analysis
- 4th corner analysis
I want to emphasize that I don’t think any of the methods were wrong. If I had been a reviewer I would have passed all of these methods (at least based on the skim I did). So I am NOT calling out individual authors. Rather I am musing on the state of the field as a whole. It’s a group fitness argument, not an individual fitness argument.
Part of what’s happened is that the move to LMM and GLM (and GLMM) has meant we left the neat world of normal-statistics linear models, where the R2 truly was simultaneously the % of variance explained and the square of the Pearson correlation, and where there was a direct (matrix algebra) formula for the solution. Now we must use iterative methods to solve the fit (and plenty of people can attest to the challenges of getting complex models to converge – I heard it mentioned multiple times at the conference that started all of this). And all kinds of extra assumptions are brought in. For example, what is the minimum number of levels on which one should estimate a random effect (debatable, but almost certainly higher than many – most? – published analyses use)? And although the mathematical machinery of GLM is elegant, it requires iterative fitting that can go wrong (e.g. notoriously on noisy logistic regression), and deviance is much less direct to work with than normal residual errors. There are probably a dozen different efforts to define an analogue of R2 in the GLM/deviance world. And despite these methods being in use for a couple of decades, many fewer people are expert at diagnosing violations of the additional assumptions that creep into GLM & LMM than were ever expert at diagnosing OLS.
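To see the contrast, here is a tiny sketch (still using the simulated data above). The OLS fit really is just a matrix formula; the GLMM is an iterative optimization, which is exactly where convergence warnings come from.

```r
# OLS: the estimate is a one-line matrix-algebra formula, beta = (X'X)^(-1) X'y
X <- model.matrix(~ x, data = dat)
beta_hat <- solve(crossprod(X), crossprod(X, dat$y))
all.equal(as.vector(beta_hat), unname(coef(lm(y ~ x, data = dat))))  # TRUE

# GLMM: no closed form; the likelihood is approximated and maximized iteratively,
# and the fit can (and sometimes does) fail to converge
fit_glmm <- lme4::glmer(I(y > 2) ~ x + (1 | group), data = dat, family = binomial)
```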
At that level of detail it seems like the high diversity of methods could pose a real challenge.
Why would Balkanization be a problem?
I can think of several angles from which the use of too many statistical methods in our field could be a problem. The reviewer and the graduate student are the two angles that strike me most.
Does anybody reading this blog feel competent to actually review every method listed above (even if, say, we leave out the phylogenetic and network methods)? There is of course more than one level to this. One level is: can you read the stats well enough to judge whether the authors drew the right biological take-home message? But another level is: do you know that particular method well enough to know “where the bodies are buried”, i.e. where things can go badly wrong – where wrong data structures or method parameters (in R packages) or assumption violations or poorly behaved distributions of the errors can completely break things? How many readers have poked into the control structure of lme (package nlme) that determines the optimization process? Everybody good with BFGS as the default optimization method (lmer has a control parameter too, with different defaults)? Do you have a good handle on whether the authors did proper diagnostics on MCMC burn-in? Whether an appropriate quadrature was used for integration (do you even know what quadrature for integration is, or why it might matter)? On whether spatial autoregression was needed and properly used? On the pros and cons of the half dozen major methods that have been developed to address spatial autocorrelation (many work well, but a few that work very poorly are still in the literature)? On whether simpler fitting methods for GLMM showed good or bad convergence? On whether assumptions of mixed models were violated? On how much deviation from a Poisson distribution is acceptable to still run a Poisson regression (like normality, very few ecological datasets are truly Poisson)? On what chi-squared value signals a good fit for SEM, or whether a different measure of model quality should be used for SEM? I would hazard a guess that very few (if any?) people could evaluate all the methods at this level. And so what happens during review? Are editors always able to get an expert in the method used among the 2-3 reviewers they land? I doubt it. Do people speak up when they are unsure whether the complex stats are rigorously and appropriately applied? I doubt it. I think they mostly stay quiet and hope some other reviewer (or the author) knows what they are doing. This is a far cry from the days when everybody knew how to diagnose whether a (possibly transformed) OLS regression was well done or not. Maybe the default parameters of these more complex methods are always fine and we don’t need to worry? But given how many students come to me with convergence errors, who don’t know what a convergence error means and who fix it by trial-and-error rejiggering of the data until the error goes away (not by tweaking optimization control parameters), I doubt it. So are the statistics being published now more or less reliable than in the days of OLS and ANOVA? Do we even know the answer to that question?
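For a flavor of what “poking into the control structure” means in practice, here is a small illustrative sketch (continuing the toy data above). The specific settings are examples of where the knobs live, not recommendations.

```r
# nlme::lme keeps its optimizer and iteration settings in lmeControl()
ctrl_nlme <- nlme::lmeControl(opt = "optim", optimMethod = "BFGS", maxIter = 100)
fit_nlme  <- nlme::lme(y ~ x, random = ~ 1 | group, data = dat, control = ctrl_nlme)

# lme4::lmer has its own control object with different defaults
ctrl_lme4 <- lme4::lmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 1e5))
fit_lme4  <- lme4::lmer(y ~ x + (1 | group), data = dat, control = ctrl_lme4)
```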
To be sure, we could train and build expertise in any one of these things; but can we expect reviewer expertise in ALL of these things? If not, that is statistical Balkanization.
And what about the graduate student? On the one hand I was psyched. The list of methods in the 23 papers above almost exactly matches the content of a 2nd semester graduate stats course I teach (except for ABC and 4th corner analysis, which I don’t cover yet). But I can tell you from teaching that course that it is way more exposure than most graduate students get. And the exposure I can give in a semester to most of these topics is VERY superficial. I spend ~15 minutes on MCMC. One lecture on Bayesian methods. A week on GLM, a week on LMM. Very little additional time on GLMM. One lecture on non-independent errors (GLS). A week in which non-linear regression, GAM, quantile regression and machine learning are all smashed together. And I guarantee you that at that level of coverage the students are not qualified to review, let alone use, these techniques themselves (and the students would be the first to agree with me). So what the heck is our strategy for graduate statistics education in this day and age? What is reasonable to expect students to learn? Where do those course slots come from? And how do we expect students to learn the rest?
And if freshly minted PhD students cannot read those papers critically, who can?
To repeat myself, I don’t object to any specific method. They all exist for good reasons. And I’ve probably been a co-author on papers using almost all of these techniques, and a first/senior author on papers using more than half of them. So I’m not singling out individuals. Nor do I have easy answers for putting the genie back in the bottle. But I definitely think that as a field we have moved past the optimum degree of complexity. The volume of the convex hull spanning all the statistics commonly used in journals seems to have grown exponentially, and I don’t think that is good for the field. It worries me about the quality of peer review of statistics. And it makes me feel bad for graduate students. We just might have achieved statistical Balkanization.
What do you think?