Statistical machismo, as I’ve been defining it in this blog (here and here), is a push to ever newer, more complex methods, often in pursuit of a one-dimensional form of improvement while ignoring the fact that choosing the proper statistic involves complex trade-offs. For example, hyperfocusing on one source of error while ignoring (or even worsening) other sources of error.
Here I want to talk about how the new and the complex sides of statistical machismo interact badly in the most common application of machine learning to ecology to date: niche models (aka species distribution models or habitat models). The goal is to basically substitute extensive amounts of environmental data for our inadequate (i.e. sparse) biological sampling and biological understanding to build a predictive model of where a target species will be found.
I said I wanted to talk about how new and complex interact badly. Let’s start with complex. All niche modelling (and indeed most machine learning) is basically a giant regression. The different techniques simply substitute a different functional form for the regression. Logistic regression has the well-known sigmoidal logit function. Regression trees have a branching decision-tree format. Bagged trees (=random forests) use an ensemble of regression trees fit to different samples of the data. Boosted regression trees use a sequence of regression trees. MARS explicitly uses basis functions, meaning that an arbitrarily large number of them can produce any n-dimensional surface imaginable. Neural nets have dozens of interconnections between nodes. The whole point of these functions is that they can fit extremely variable, rugose surfaces with high orders of interaction between the variables. This allows an extremely close fit to the data. In an extremely highly cited paper, Elith et al showed that advanced machine learning techniques (e.g. MARS, random forests and boosted regression trees) were superior to the older climate envelope approaches (like GARP, which explicitly searches for heuristic rules for climate-driven range boundaries) as well as to well-known approaches like logistic regression. So better R2, better prediction – sounds good, right?
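To make the “it’s all just regression” point concrete, here is a minimal sketch (Python/scikit-learn, simulated data – the predictor names, coefficients and sample size are made up for illustration, and this is emphatically not the Elith et al analysis; MARS is omitted only because scikit-learn has no implementation of it):

```python
# A minimal sketch: each method below is just a regression of (simulated)
# presence/absence on environmental predictors, with a different functional form.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 500
X = np.column_stack([rng.normal(15, 5, n),    # "temperature"
                     rng.gamma(2, 400, n)])   # "precipitation"
# Simulated truth: presence more likely at warm, wet sites
logit = 0.3 * (X[:, 0] - 15) + 0.002 * (X[:, 1] - 800)
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

models = {
    "logistic regression":  LogisticRegression(),
    "single decision tree": DecisionTreeClassifier(max_depth=5),
    "random forest":        RandomForestClassifier(n_estimators=200, random_state=0),
    "boosted trees":        GradientBoostingClassifier(random_state=0),
    "neural net":           make_pipeline(StandardScaler(),
                                          MLPClassifier(hidden_layer_sizes=(20,),
                                                        max_iter=2000, random_state=0)),
}
for name, m in models.items():
    m.fit(X, y)
    # Calibration (in-sample) fit only -- the flexible methods look best here,
    # which is exactly where the trouble discussed next begins.
    print(f"{name:22s} calibration accuracy = {m.score(X, y):.2f}")
```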
Well, this is where the complex interacts with the new. There is a major pitfall with highly flexible functions that fit the data very closely. It is called overfitting. If you fit the noise as well as the signal, then as you increase the complexity of the model, your predictive power on the dataset you’re calibrating on (fitting the model to) goes up without end, which looks really good. But eventually (actually at the point where you start fitting the noise instead of the signal) your predictive power for new data sampled from the same process/context (not used in calibrating) actually starts going down, because by definition the noise doesn’t carry over into the new data. (See this earlier post for a refresher on some of these concepts.)
Now the inventors of advanced machine learning techniques are very clever and are not unaware of this problem. So they developed a technique to avoid overfitting. The basic idea is to fit the model out to a high degree of complexity (notice the ever-improving fit of the calibration curve in the figure). But then they test the fitted model on a separate set of data (the “holdout” or validation data), holding the parameters of the model constant but increasing the complexity from low to high (e.g. adding more nodes to the neural network, more branches to a regression tree, etc). Up to a point, increasing complexity improves fit on the holdout data, but beyond a certain level of complexity the fit starts to get worse again. This is because you are now fitting noise in the original calibration data, which doesn’t transfer to the validation data. This allows one to pick the optimal degree of complexity to fit the signal but not the noise, and therefore make the most accurate predictions. In the figure it would be a complexity of 10. Now it is probably obvious that the validation data needs to be independent in the statistical sense – uncorrelated errors – of the calibration data for this method to work. The simplest approach is to hold out, say, 1/3 of the data for validation. Most commonly now a fancier technique known as 10-fold cross-validation is used. Here 90% of the data is used for calibration and the remaining 10% for validation; then a different 10% is held out for validation, and so on, repeated 10 times so that every data point is used for validation exactly once. But whether simple hold-out or 10-fold, the core idea – calibrate, then choose the optimal amount of complexity on a separate validation dataset – is the key innovation to avoid overfitting.
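Here is a minimal sketch of that procedure, again with simulated data and scikit-learn; tree depth stands in for whatever complexity knob a given method has (nodes, trees, basis functions, etc.):

```python
# Sweep complexity, comparing calibration fit with 10-fold cross-validated fit,
# and choose the complexity that predicts best on the held-out folds.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
n = 500
X = np.column_stack([rng.normal(15, 5, n), rng.gamma(2, 400, n)])
logit = 0.3 * (X[:, 0] - 15) + 0.002 * (X[:, 1] - 800)
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))   # simulated presence/absence

best = (None, -np.inf)
for depth in range(1, 16):                      # low -> high complexity
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    calib = model.fit(X, y).score(X, y)                 # fit on calibration data
    valid = cross_val_score(model, X, y, cv=10).mean()  # 10-fold cross-validation
    print(f"depth={depth:2d}  calibration={calib:.2f}  validation={valid:.2f}")
    if valid > best[1]:
        best = (depth, valid)

# Calibration fit keeps climbing with complexity; validation fit typically
# peaks and then declines as the model starts fitting noise.
print("chosen complexity (tree depth):", best[0])
```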
Note that this technique is not perfect: even with a validation step to choose the appropriate complexity, the model still performs better on the calibration data than on the validation data (the gap being indicated by the dotted black line in the figure). In the kinds of data-mining applications for which these techniques were invented, the amount of overfitting is usually small. This is because in those applications the data – e.g. a database of customer purchases – consist of points that are largely independent of each other.
In niche modelling, however, the data are spatially autocorrelated. Imagine an extreme scenario. Measure both the dependent variable (species presence/absence or abundance) and the independent variables (temperature, precipitation, soils, etc) at 100 points. Use these points as your calibration data to choose the best model. Now take as your validation data the 100 points that are 1 cm from your calibration points. This is of course absurd and nobody does this. But it highlights the problem. If you took points 10 km away it might be a bit better (or a lot better if you are studying phenomena at scales much smaller than 10 km). What if you do what almost everybody does? You measure a bunch of points scattered across the domain of interest and then pick some as calibration and some as hold-out validation. Probably you are in a lot of trouble. Your validation points are almost always going to be neighbors of your calibration points, and almost no matter what scale of phenomena you are studying, by the nature of a design that takes as dense a set of points as you can afford across the area of interest, neighboring points are going to be autocorrelated. Not quite as extreme as taking validation points 1 cm away from the calibration points. But not very good either.
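To see how much a random hold-out can flatter a flexible model, here is a rough, self-contained simulation (all the surfaces and numbers are made up; this illustrates the logic, not anyone’s actual workflow). The same model and data get a random 10-fold split and a spatially blocked split:

```python
# Compare ordinary random cross-validation with spatially blocked
# cross-validation on simulated, spatially autocorrelated data.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, GroupKFold, cross_val_score

rng = np.random.default_rng(1)
n = 1000
px, py = rng.uniform(0, 100, n), rng.uniform(0, 100, n)   # point coordinates

# "Environmental" predictors that vary smoothly in space (as temperature,
# precipitation, elevation etc. do), so neighbours have near-identical values.
env = np.column_stack([np.sin(px / 15) + np.cos(py / 20),
                       np.cos(px / 25) * np.sin(py / 10),
                       np.sin(px / 9) * np.cos(py / 30),
                       np.cos(px / 12) + np.sin(py / 18)])

# Response = weak true environmental signal + a strong spatially smooth
# component not driven by the predictors (unmeasured history, dispersal, etc.).
spatial_noise = np.sin(px / 7 + 3) * np.cos(py / 9 + 1)
y = 0.3 * env[:, 0] + 1.0 * spatial_noise + 0.2 * rng.normal(size=n)

model = RandomForestRegressor(n_estimators=300, random_state=0)

# Ordinary random 10-fold CV: validation points sit right next to calibration
# points, so memorised spatial structure still "predicts" well.
random_cv = cross_val_score(model, env, y,
                            cv=KFold(10, shuffle=True, random_state=0)).mean()

# Spatially blocked CV: hold out whole 25 x 25 blocks of the map at a time,
# so validation points are far from any calibration point.
blocks = (px // 25).astype(int) * 4 + (py // 25).astype(int)
blocked_cv = cross_val_score(model, env, y, groups=blocks,
                             cv=GroupKFold(n_splits=10)).mean()

print(f"random 10-fold R2:    {random_cv:.2f}")   # typically looks impressive
print(f"spatially blocked R2: {blocked_cv:.2f}")  # typically far lower
```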
Surely this problem must be known and addressed? Well, several papers acknowledge it is a potential issue. And a handful of papers have actually attempted to measure this effect by spatially separating the calibration and validation points (e.g. here and here and here). And in general the effect is rather large. Models predict very poorly when the test/validation data is spatially distinct from the model calibration data. That’s because extremely flexible functions make fitting the noise all too easy, and the spatial non-independence of the test and training data allows fitting noise to look like a good idea even under cross-validation. Yet this issue is ignored in 99% of the papers published using niche models. I don’t know why. Perhaps there is a sense that what is good enough in other areas of machine learning ought to be good enough in ecology. But those other areas aren’t fitting spatially correlated points like in ecology.
To my mind this is a great example of statistical machismo. The fancy machine learning techniques have been advocated for and adopted primarily because they show improved goodness of fit (e.g. R2 or more often AUC). Maybe they have also been adopted a little too fast because they are new and shiny and complex enough to reduce the size of the “in” group too – but I’ll leave that to the reader to decide. Thus we have the one-dimensional view (only worrying about R2) without worrying about trade-offs. The biggest trade-off that concerns me here is simply that old, boring, simple techniques have been extremely well-studied and their pitfalls and shortcomings are well-known. Hot new techniques rarely have had substantial study, nor is there substantial understanding amongst the readership (or usually the authors) of where the problems lie. Specifically, the source domains of machine learning don’t suffer from spatial autocorrelation, so its developers never studied it, and the methods are too new in ecology for people to have worked out and assessed the problems. A secondary trade-off that concerns me is that the fancy new machine learning models are almost completely black-box to interpretation in comparison to simpler methods like linear or logistic regression (yes, you can get back variable importance from these fancy methods, but that is a good step short of the sense of sign, magnitude, specific interactions, non-linearity etc that one can get from a good old simple regression – the sketch below shows the contrast). And as if to supply the final sign that the push for fancy machine learning methods in niche models is a case of statistical machismo, I have had papers where reviewers insisted that the advanced methods be used and would not accept rational counter-arguments.
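For what it’s worth, here is what that interpretability gap looks like in a toy example (scikit-learn again, with made-up predictors and effect sizes):

```python
# A logistic regression reports sign and magnitude on a known (logit) scale;
# a random forest reports only a relative, unsigned "importance" per variable.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(7)
n = 1000
temp, precip = rng.normal(size=n), rng.normal(size=n)
X = np.column_stack([temp, precip])
# Made-up truth: presence more likely when warm, less likely when wet
y = rng.binomial(1, 1 / (1 + np.exp(-(1.5 * temp - 0.8 * precip))))

glm = LogisticRegression().fit(X, y)
rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)

print("logistic coefficients (temp, precip):   ", np.round(glm.coef_[0], 2))
print("random forest importances (temp, precip):", np.round(rf.feature_importances_, 2))
```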
Some readers may wonder why I am making such a big deal here about spatial autocorrelation when in the past I have expressed indifference to it. The reason is simple. In the past I was addressing methods concerned with improving the accuracy of p-values in very strongly constrained regression models (i.e. the fitted functions are not too bendable). p-values are treated in way too binary a fashion and given too much importance, and I just can’t get excited about methods designed to get more accurate p-values, especially when the changes are fairly small. Here I am worried about extremely flexible functions that are made flexible precisely to increase goodness of fit (e.g. R2). The possibility of greatly overfitting the data is enormous. Indeed, several studies have shown that the consequence is quite large. Small changes to unimportant p-values – meh – fix it if you want but don’t bother me (and don’t bother to flame the comments on this topic here – save that for the original post if you need to). Vastly overstating important R2 – I’m worried. And for what? Those fancy machine learning methods typically only improve R2 by about 0.05 (i.e. 5% of variance explained) or less. This point is not often highlighted, but take a look at Figure 3 of Elith et al – the point-biserial correlation changes from about 0.20 for things like MARS to about 0.18 for things like logistic regression, which is a difference in r2 of 0.04 vs 0.032, i.e. a gain of roughly 0.008 (the arithmetic is spelled out below). Not a good trade-off in my book! This is of course a matter of judgement, and I’m open to people who feel differently, as long as it is a balanced, multi-dimensional weighing of all aspects of the trade-off (and there is a reciprocal openness to different opinions).
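For the record, the back-of-the-envelope arithmetic (the 0.20 and 0.18 are read off Figure 3 of Elith et al; the conversion from correlation to variance explained is just squaring):

```python
# Point-biserial correlations read off the figure; r -> r^2 gives variance explained.
r_fancy, r_simple = 0.20, 0.18        # MARS-like vs logistic-regression-like
print(r_fancy ** 2)                   # ~0.040 variance explained (fancy)
print(r_simple ** 2)                  # ~0.032 variance explained (simple)
print(r_fancy ** 2 - r_simple ** 2)   # ~0.008 extra variance explained
```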
I’m not going to make the call here. I’ll leave it to the readers and the comment section. Is the push to fancy machine learning methods in niche models – despite the fact that they’re so new that their limitations aren’t really worked out (specifically, as we’re starting to understand, that they can badly overfit spatially autocorrelated data) – a case of statistical machismo? What do you say?