Every year we invite you to ask us anything. Here’s our next question, from Mike Mahoney, and our answers:
What (if any) role do you see for less-interpretable machine learning (& similar techniques) in ecology? In particular I’m thinking of Brian’s posts on statistical machismo & also the need for more prediction in ecological studies, and Peter Adler’s post on ecological forecasting — I’m wondering if you see any areas in ecology where the increase in predictive accuracy offsets not being able to understand how the model got there.
First, I’m not 100% sure what “machine learning” is. That term seems to get applied to various different approaches, some of which are more black-boxy than others. Multiple logistic regression with some sort of variable selection procedure, which doesn’t seem all that black-boxy to me. Neural networks, which seem more black-boxy. Regression trees. Deep learning. AlphaZero. Other approaches. I’m not sure whether that uncertainty makes me a good or bad person to answer this question. Does it make me a broadminded outsider who can evaluate machine learning objectively, without getting caught up in hype or technical details? Or does it just make me uninformed? Could be some of both! 🙂
Re: valuing an improvement in predictive accuracy even at the cost of not being able to understand the basis of the prediction, I’d first want to be convinced that machine learning models (however defined) really do make substantially more accurate predictions than more interpretable models do! In my admittedly limited reading, the predictive performance of machine learning techniques in practical applications in ecology and social science (as opposed to in, say, facial recognition) doesn’t seem to be much better than garden-variety OLS regression. For instance, see Salganik et al. 2020 PNAS, and Brian’s discussion from a few years ago of how machine learning methods badly overfit ecological niche models in the presence of spatial autocorrelation. More broadly, popular studies of what makes for successful prediction in various fields don’t identify many cases in which uninterpretable black-box machine learning methods outperform other methods (see here and here). Ok, I’m aware of recent theoretical arguments that the classic bias-variance tradeoff (aka underfitting vs. overfitting) doesn’t apply to some forms of machine learning. But those theoretical arguments don’t really apply in typical ecological contexts, where it’s not possible to perfectly fit the training data, or even come close to doing so.
Sociologist Kieran Healy seems to have the same view as me, but he has a funny meme for it:
I know there’s active research on making black box machine learning methods interpretable, but I don’t know anything beyond the fact that research is ongoing.
One further thought: speaking as someone whose own research focuses on improving understanding rather than prediction, I don’t feel like I have a good handle on the ecological contexts in which we really would want totally black-box predictions, even if they were good predictions. I mean, I can think of a few. For instance, I’m sure ecologists who work with camera trap data would love to have a really good black-box image recognition system for identifying the species in millions of pictures. Because not every camera trap project can enlist an army of citizen scientists to do image classification. And I haven’t read this yet, but just based on the abstract it sounds like an interesting application of deep learning to detect potential mimicry in snakes. In contrast, think about management contexts. Don’t decision-makers often want some understanding of the basis of the forecast that’s informing their decision-making? (And if they don’t want some understanding, perhaps because they want to cherry-pick their preferred forecast, or because they don’t want to take responsibility for their decisions, well, maybe they should want some understanding!) And aren’t there ethical issues around just blindly trusting black-box classification and prediction algorithms? I’m thinking for instance of cases in which purportedly “neutral” classification and prediction algorithms end up creating and reinforcing racist outcomes. These aren’t ethical issues I’ve thought much about, so perhaps they don’t apply in many ecological contexts? Anyway, consider this last paragraph an invitation for comments from others who know more about this stuff than I do.
Unlike Jeremy, I am willing to stipulate the premise of the question – that complex machine learning can make more accurate predictions. This is not voodoo – they are just regressions but they use functions that are more malleable (able to bend to fit the data) than a linear or quadratic, and much more able to pick up interactions between terms (which is awkward and crude at best in an OLS regression). This allows for better prediction. But for both those reasons there is a very real risk of fitting noise rather than signal in the data. To truly fit the signal you need to have:
- Lots of data as in thousands or at least many hundreds of points (not so common in ecology – very common in database marketing where much of this originated)
- The ability to select subsets of the data that are truly independent of each other for appropriate calibration and validation. This also is rare in ecology – e.g. problems with spatially structured data.
- A predictable problem. If you want to predict next year’s deer population – can variables measurable now actually predict that, or is it going to be primarily driven by next year’s weather and the arrival of a new disease? If the latter, machine learning is not going to improve prediction.
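The second requirement (calibration and validation subsets that are truly independent) is the one most often violated with spatially structured data. One common workaround is to hold out whole spatial blocks rather than random points. Here is a minimal sketch of that idea, using only the Python standard library. The function names, the block size and the grid of points are all illustrative assumptions, not taken from any particular package or dataset.

```python
import random

def block_of(point, block_size):
    """Return the (column, row) index of the square block containing a point."""
    x, y = point
    return (int(x // block_size), int(y // block_size))

def spatial_block_folds(points, block_size, n_folds, seed=0):
    """Assign each point to a cross-validation fold by spatial block.

    All points in the same block get the same fold, so a held-out fold
    is never interleaved point-by-point with the training data.
    """
    rng = random.Random(seed)
    blocks = sorted({block_of(p, block_size) for p in points})
    rng.shuffle(blocks)  # randomize which blocks end up in which fold
    fold_of_block = {b: i % n_folds for i, b in enumerate(blocks)}
    return [fold_of_block[block_of(p, block_size)] for p in points]

# Hypothetical survey: a regular grid of sites on a 100 x 100 landscape.
points = [(x + 0.5, y + 0.5) for x in range(0, 100, 5) for y in range(0, 100, 5)]
folds = spatial_block_folds(points, block_size=25, n_folds=4)

for k in range(4):
    n_test = sum(1 for f in folds if f == k)
    print(f"fold {k}: {n_test} test sites, {len(points) - n_test} training sites")
```

Blocking only helps if the blocks are larger than the range of spatial autocorrelation, which is itself something you have to estimate or assume. Sites near block edges will still have close neighbours in other folds.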
So – if you actually meet all of those technical requirements, when, if ever, would you want to use it? As the question assumes, really complex models (like neural nets, MARS, boosted regression, etc) can get that extra few percent of accuracy out of the data at the cost of telling people “trust my model because I cannot explain it to you”. And it is worth noting it is usually only an extra few percent of accuracy. To me it is obvious black-boxing for a few percent is only going to be acceptable in a management context where people are desperate enough for the last few percent of accuracy to give up the ability to “kick the tires”, “sanity check”, etc. Predicting the stock market is one of those places (obviously not ecological). Within ecology I’m not sure how common that really is? I would think mostly super applied situations with strong economic impacts. Predicting dam water levels (and that is really hydrology not ecology)? Predicting populations of pests or game animals? Mostly though, to answer your question, I don’t think there are a lot of applications where people are willing to make that tradeoff.
There is one glaring exception though. Species distribution models. There is no denying there has been massive embracing of these models. And there is no denying that people working in this domain are more than happy to jump on board with more complex machine learning models and toss out Generalized Linear Models (e.g. logistic), even GAMs. And there is evidence it does buy a few percent of increased accuracy. But as I’ve argued before (see my first link on spatial non-independence), I personally don’t think this is reliable, given the problems of getting independent data to cross-validate. I trust a GLiM or GAM much more than a boosted regression tree in its ability to extrapolate into the future. So I can’t explain why this one exception exists in people’s minds. My cynical self suspects it is because we are predicting species ranges 50-100 years in the future, which is largely untestable, so there is not a high degree of accountability and it has sort of veered off into make-believe land even in managers’ minds (I think it is also an open question how much managers rely on these in any detail for management decisions)? Or maybe less cynically it has to do with producing maps which people intuitively trust much more than answers like “the number is 5.7328”.
I do think Jeremy’s point about automated image classification is likely to be a highly legitimate current and future use for some specific types of machine learning. They fit the criteria of lots of data easily obtainable (thanks to digital cameras and remote sensing). And the benefit is scaling (100s or 1000s of predictions), not more accuracy. And scaling at a level that is costly or impossible with the alternative (human analysis). And it is producing raw data for some future model, not the model itself. But it is also worth noting that in areas like land cover classification of aerial images simpler models including regression trees have been used for decades fairly successfully.
Which leads to … I would argue that you can have your cake and eat it too. Simple regression trees have most of the benefits (nonlinear responses, interactions between variables) and yet are even more easily interpreted by managers than OLS coefficients. One has to be careful with regression trees to address my 3 point list above to avoid overfitting, and regression trees do suffer from some proclivity to instability with highly collinear data (to which the fix is to eliminate collinear variables). But I am in the minority in favoring the “old-fashioned” regression trees. Most prefer random forests, pointing out that you can get metrics on the “importance” of a variable, but I find these to be unsatisfying as they don’t really measure effect sizes (or even directions) – still too black boxy for me. So I stick to OLS/GLiM/GAM/CART in my own work.
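To illustrate why a simple regression tree is easy to read, here is a minimal sketch of the smallest possible one: a single split on one predictor, chosen to minimize squared error. This is a toy written for illustration, not a full CART implementation (no recursion, no pruning), and the canopy cover and abundance numbers are invented.

```python
def best_split(xs, ys):
    """Find the single threshold on x that minimizes total squared error.

    Returns (threshold, mean of y below threshold, mean of y above threshold).
    """
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical x values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:
        raise ValueError("need at least two distinct x values to split")
    return best[1:]

# Invented data: percent canopy cover and abundance at eight sites.
canopy = [10, 20, 30, 40, 50, 60, 70, 80]
abundance = [2, 3, 2, 3, 9, 10, 9, 10]

threshold, low_mean, high_mean = best_split(canopy, abundance)
print(f"if canopy < {threshold}: predict {low_mean}, else predict {high_mean}")
```

The fitted model is the printed sentence itself, which is the sense in which a tree can be handed directly to a manager. A deeper tree repeats the same search within each branch, and that is where the overfitting and instability caveats above start to bite.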
I think a promising use for machine learning (ML) in ecology is the approximation of otherwise intractable likelihood functions. For instance, ML methods can be used to “learn” a likelihood function from a simulator, then this learned function can be used to infer parameters from data using standard frequentist or Bayesian approaches. In such cases, the “black box” issue surrounding ML is less important because you’ve coded the biology of the system directly into the simulator and the only pertinent question for the ecologist is whether the learned function reflects what occurs in the simulations.
I’d argue that many of the complex systems in ecology and evolutionary biology lead to intractable likelihood functions (especially if one considers space), and ML could be an important tool for dealing with this.
It’s interesting that you mention this point – we used RF for dimension reduction in ABC in this paper https://esajournals.onlinelibrary.wiley.com/doi/abs/10.1002/eap.1873 but I have seen few other people using ML approaches for this. Or are you more thinking in the direction of ML for synthetic likelihood, as in https://projecteuclid.org/euclid.ejs/1527300140?
Yes, based on the abstracts, I’d say more like the latter. Thanks for the links.
Kyle Cranmer has a recent review of simulation-based inference that discusses using normalizing flows to estimate densities in high dimensions. https://doi.org/10.1073/pnas.1912789117
I just want to say that Brian’s answer is so good. So crisp and logical. (And not streets away from my own rambling answer, I don’t think…)
OK, I will make a few comments here:
a) ML as a field is by now probably larger than all of ecology, and the collective research of this field has demonstrated very convincingly that certain ML techniques will, on average, lead to lower predictive error than crude statistical strategies such as “guessing the right model” or AIC model selection. If in doubt about this, feel free to sign up to a predictive modelling competition at https://www.kaggle.com/competitions and try your luck with OLS or AIC model selection.
b) Of course, certain things come at a cost. There are sometimes trade-offs between predictive error and interpretability. See, e.g., our recent preprint “Explainable Artificial Intelligence enhances the ecological interpretability of black-box species distribution models” https://ecoevorxiv.org/w96pk/ or also https://arxiv.org/abs/2007.04131 or https://esajournals.onlinelibrary.wiley.com/doi/abs/10.1002/ecm.1422?af=R . On the other hand, it’s a fallacy to assume that there is never a “free lunch” – for particular purposes (e.g. predictive SDM modelling), I believe that a combination of a flexible model + xAI may be much better suited to the needs of the average modeller than all other alternatives. I also agree with Brian btw. that CARTs or BRT may often be in a sweet spot for many ecologists. But it really depends on what you want to do.
c) about understanding: yes, standard ML methods are not tuned to infer causality. But neither is OLS, because OLS will only uncover causal relationships if the modeller is guessing the right model structure (think confounding, collider bias etc.). Even if you’re trained in this, it’s often impossible because there is not enough prior information available to do so. Moreover, in my experience, few ecologists are even aware of these issues, and consequently, I suspect that most OLS regression coefficients do not contribute significantly more “understanding” than say random forest variable importance, which is to say: they indicate that there is a correlation between the response and the predictor, but this correlation is not necessarily due to a direct causal link. In data science, people are researching automatic methods that infer causality from complex data, see e.g. https://www.frontiersin.org/articles/10.3389/fgene.2019.00524/full , so here I would say again: specialised ML methods may actually help to infer causality better than OLS.
I guess my overall point is: there is a lot of useful stuff going on out there in the data science field, and I think (especially young) ecologists would be well advised to take note of these developments and explore how they could be used for ecology. Is ML a hype? Sure. Do people have inflated expectations about ML? Certainly. But most hypes have a certain level of substance behind them, and in this case, I think there is really substantial substance, to the point that I think a stats class in 20 yrs will teach 50% classical stats and 50% ML methods, not because one is better than the other, but just because they do different, but useful things.
Thanks very much for this Florian. Very useful. Our hope with many AUAs is that we’ll get comments from readers who know more than us.
Re: your (a), how much improvement in prediction do you think machine learning methods get you in typical ecological applications? Brian suggested “a few percent”–do you think that’s in the right ballpark? Honest question–I’m trying to learn from people who’ve read more machine learning papers in ecology than I have.
Re: your (c), I should clarify that by “understanding” I don’t mean “infer causality”. I share your skepticism of OLS regression on observational data as a reliable tool for causal inference. By “understanding” I meant “I can explain to someone why the model is making the predictions it makes”. For instance, imagine I was doing OLS regression with some collinear predictor variables, and did some sort of variable selection procedure to choose which predictors to include in my final model. I could look at plots of the data, variance inflation factors, etc., and understand why the final model includes the variables it does and spits back the partial regression coefficient estimates it does. In contrast, there are applications of machine learning where it’s very difficult (at best) for humans to understand why the ML model does whatever it does. Why AlphaZero moved its rook to G1 in that chess position, or why the ML stock-trading algorithm bought those stocks, etc. It sounds like what I had in mind re: “understanding” is what you mean by the trade-off between prediction and “interpretability”?
about improvement of predictions – it depends completely on your problem of course. For example, I wouldn’t be surprised to see that, for an SDM, I will get an AUC of 0.9 with a simple logistic regression, and 0.95 with a RF. Now, I could either argue that the RF improved prediction by only 5%, or that the error was reduced by 50%. Either way, the fact is that distributions are relatively easy to predict (even with non-causal predictors, see https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1466-8238.2007.00331.x), so the baseline set by a GLM is already quite high, and there is not so much to improve in absolute numbers. Many people that use this stuff in practice (e.g. in remote sensing) still find an improvement of 0.9 to 0.95 AUC meaningful.
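The two readings of the same AUC change can be made explicit with a little arithmetic. A minimal sketch follows. The function name is illustrative, and "error" is taken here simply as 1 minus AUC, which is one convention among several.

```python
def auc_gain(baseline, improved):
    """Summarize an AUC change three ways.

    Returns (absolute gain in AUC, gain relative to the baseline AUC,
    fractional reduction in 1 - AUC).
    """
    absolute = improved - baseline
    relative = absolute / baseline
    error_reduction = ((1 - baseline) - (1 - improved)) / (1 - baseline)
    return absolute, relative, error_reduction

# The example from the text: logistic regression at 0.9, random forest at 0.95.
absolute, relative, error_reduction = auc_gain(0.90, 0.95)
print(f"absolute gain:   {absolute:.3f}")
print(f"relative gain:   {relative:.1%}")
print(f"error reduction: {error_reduction:.1%}")
```

The same 0.05 step looks small or large depending on the denominator. The closer the baseline is to 1, the bigger any fixed gain looks as a fraction of the remaining error.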
You will see larger improvements where the predictive problem is more complicated. See, for example, our comparison of GLMs and RF for trait matching in Fig. 4 in https://besjournals.onlinelibrary.wiley.com/doi/full/10.1111/2041-210X.13329 . GLMs are doing fine for a simple structure, but they drop substantially for more complex trait matching structures, while RF does not. My expectation is that such complex responses, or situations with a large number of predictors, are the field where ML can really excel over simple regressions.
About c) – OK, fair enough. Although I have to say that I’m not totally sure if OLS is so much easier to understand than simple xAI methods. I mean, for example, I would argue that the majority of ecological papers interpret OLS coefficients causally (at least implicitly), suggesting that people are indeed not aware of what these coefficients mean exactly. Moreover, one of the principles of post-hoc xAI methods is to simply fit simpler models (such as OLS or RTs) to the fitted ML model, so to the extent to which the structure in the data can be captured by an OLS, it can also be understood by an OLS-type approach, even if a much more complicated ML model is being used. What that simple xAI approximation loses is the stuff that cannot be captured by the OLS, but still, in comparison to just using the OLS, you’re not really losing anything.
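The surrogate idea (fit a simple model to the predictions of the complicated one) can be sketched in a few lines. In this illustration `black_box` is a made-up function standing in for a fitted ML model, and the surrogate is a one-predictor least-squares line. Real xAI tools are more elaborate, so treat this only as a picture of the principle.

```python
import math

def black_box(x):
    """Hypothetical stand-in for a fitted ML model's prediction function."""
    return 2.0 * x + 0.5 * math.sin(3.0 * x)

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def r_squared(ys, fitted):
    """Share of the variance in ys reproduced by the fitted values."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Query the black box over the range of interest, then fit the surrogate
# to its predictions (not to the original observations).
xs = [i / 10 for i in range(101)]
preds = [black_box(x) for x in xs]
slope, intercept = fit_line(xs, preds)
fidelity = r_squared(preds, [intercept + slope * x for x in xs])

print(f"surrogate: prediction = {intercept:.2f} + {slope:.2f} * x")
print(f"fidelity (R^2 against the black box): {fidelity:.3f}")
```

The fidelity number says how much of the black box's behaviour the simple description captures. Whatever is left over (here the small wiggle) is the part the linear surrogate cannot explain.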
In the Elith et al 2006 paper that I linked to (which was the paper that really supercharged ML methods in SDMs) they find that AUC goes from about 0.68 to 0.7, or correlation goes from about 0.18 to about 0.2, when moving from traditional methods (like GAM, GLM) to machine learning (BRT, MAXENT, MARS).
I’ve always been surprised that was perceived as a compelling increase (although the title clearly claims it is).
Not sure I know of another study besides what Florian mentions about traits. Might well be time for a meta-analysis comparing approaches across problems.
Can someone please explain to me why Maxent counts as “machine learning”? Genuinely surprised to hear it characterized that way. Which just goes to show what I said in the post: I don’t know what machine learning even is! 🙂
@Brian – AUC values depend very much on the dataset / application – here is an example that has values similar to what I mentioned https://onlinelibrary.wiley.com/doi/full/10.1002/ece3.2332
@Jeremy – I am also not fully happy with this classification, but MaxEnt is in most papers classified as ML, and this is because of the MaxEnt principle that is used to fit the model, see Elith et al., A statistical explanation of MaxEnt for ecologists https://onlinelibrary.wiley.com/doi/full/10.1111/j.1472-4642.2010.00725.x, but it is possible to interpret MaxEnt differently, e.g. Renner & Warton, Equivalence of MAXENT and Poisson Point Process Models for Species Distribution Modeling in Ecology https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1541-0420.2012.01824.x
MaxEnt is similar to a constrained optimization. The goal is typically allocation of objects to a set of categories (molecular states in physics, species most often in ecology). Then set your constraints. Then maximize entropy (disorder) in the system subject to constraints (the equation looks the same as the real equation for entropy in chemistry/physics but of course it is not entropy of atoms so it is only an analogy to entropy).
This can be applied in two fairly different ways:
a) on equations – this is what John Harte has done. Constraints are on species richness, total abundance, and total energy. Then he derives an equation predicting the relative abundance of species
b) on data – this is almost like a regression. Maximize the entropy of a set of numeric values (relative abundance in Shipley 2006) subject to constraints (producing observed community weighted mean (CWM) values of traits by combining the relative abundances and observed trait values).
c) MaxEnt SDMs are similar to (b) – the numeric values whose entropy is maximized are suitability – relative probability across space of occurring at a location – and the constraints are observed relationships between climate and occurrence (or abundance) with some simple nonlinear shapes allowed for connecting climate and probability of occurrence.
d) Roderick Dewar and students have applied maxent to energetics and productivity and satellite data as well (also in the numeric category of b/c)
In either the theory (a) or numeric (b/c/d) case the method of Lagrange multipliers is used to maximize entropy subject to constraints – it’s just that in one case it is on equations where one takes derivatives, sets them equal to zero and solves. And in the other it is on nearly linear equations in the variable for which entropy is being maximized, such as relative abundances (bar a few log/exp functions).
In both cases the real action is the constraints. Without constraints, MaxEnt will typically produce a uniform distribution (all species equally likely), whether in equation form as in (a) or numeric form as in (b/c). And adding one more constraint usually does not slightly tweak the resulting distribution; it fundamentally changes it. For example, Harte’s theory goes from an exponential distribution of relative abundances to a logseries distribution when going from 1 to 2 constraints.
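A small numerical sketch shows how much work a single constraint does. This is the generic textbook case, not Harte’s theory or any SDM software: a distribution over the values 1 to 10, first with no constraint beyond summing to one, then with the mean fixed at a value chosen arbitrarily for illustration. Solving the Lagrange-multiplier conditions for a mean constraint gives probabilities proportional to exp(-lam * x), and the multiplier lam is found here by bisection.

```python
import math

def gibbs(values, lam):
    """Distribution with p_i proportional to exp(-lam * x_i)."""
    exponents = [-lam * v for v in values]
    shift = max(exponents)  # subtract the max to avoid overflow
    weights = [math.exp(e - shift) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]

def maxent_with_mean(values, target_mean, lo=-50.0, hi=50.0, iters=200):
    """Find lam by bisection so the Gibbs distribution has the target mean.

    The mean decreases as lam increases, so the bracket can be halved
    until it collapses onto the solution.
    """
    for _ in range(iters):
        mid = (lo + hi) / 2
        p = gibbs(values, mid)
        mean = sum(pi * v for pi, v in zip(p, values))
        if mean > target_mean:
            lo = mid
        else:
            hi = mid
    lam = (lo + hi) / 2
    return lam, gibbs(values, lam)

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

values = list(range(1, 11))

# No constraint other than normalization: lam = 0, every value equally likely.
unconstrained = gibbs(values, 0.0)

# One constraint (mean fixed at 3.0, an arbitrary illustrative value).
lam, constrained = maxent_with_mean(values, target_mean=3.0)

print("unconstrained:", [round(p, 3) for p in unconstrained])
print("mean = 3.0:   ", [round(p, 3) for p in constrained])
print("entropies:", round(entropy(unconstrained), 3), round(entropy(constrained), 3))
```

With no constraint the answer is flat. With one constraint it becomes a geometric (discrete exponential) decline. The shape comes entirely from the constraint, and maximizing entropy only picks the least committal distribution consistent with it.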
Note that in physics maximizing entropy is the 2nd law of thermodynamics. But outside of physics where entropy is only an analogy it is really just a useful assumption. Indeed, in an interesting paper Haegeman and Loreau (Oikos 2008) show that other criteria such as the center of the feasible set (i.e. meeting the constraints) work equally well as MaxEnt on Shipley’s data. People always focus on the maximum entropy feature but for sure the real action is in the constraints. Ken Locey (with Ethan White 2013 Ecology Letters) has shown the constraints used by Harte are also reasonably strong/limiting by looking at how far the feasible set after constraints gets you (answer in this case is reasonably far but not as good as MaxEnt). Both Shipley and Dewar have put some effort into justifying MaxEnt as an expected outcome specific to ecology.
@Florian – thanks for the paper link. That’s a really nice paper. Can’t believe I missed it. Of course the AUC depends heavily on the data so attributing one range of AUC to a method would be a mistake. I was just reporting the values for the dataset analyzed in Elith et al. And I wouldn’t assume the differences would be identical across data. Although interestingly in the Shebani et al. paper it seems like the AUCs (on the validation data) mostly go from 0.93 to 0.95 (the same 0.02 as in Elith et al., despite the shift in position), unless I’m interpreting that wrong? Probably largely coincidence but it does lead me to a general sense that improvements in fit (so long as appropriately calibrated) are real but not huge. Unless I’m reading it wrong you have to go to the 3rd decimal place to find a difference between GLM and BRT/RF/MAXENT. TSS is more complex with BRT/MAXENT being about 5% better than GLM for TSS but random forest crashes and burns on TSS (overfitting?).
I love these posts. Is it too late to ask another question? I am personally interested in current thoughts on using climax and seral species concepts in ecology. For instance, I work in a disturbance prone system (a historically frequent fire forest) and I question the consistent use of those concepts there, although it is still commonly done. I wasn’t able to find a thorough treatment of the modern thinking on these concepts (although one likely exists).