Every year we invite you to ask us anything. Here’s our next question, from Mike Mahoney, and our answers:
What (if any) role do you see for less-interpretable machine learning (& similar techniques) in ecology? In particular I’m thinking of Brian’s posts on statistical machismo & also the need for more prediction in ecological studies, and Peter Adler’s post on ecological forecasting — I’m wondering if you see any areas in ecology where the increase in predictive accuracy offsets not being able to understand how the model got there.
First, I’m not 100% sure what “machine learning” is. That term seems to get applied to various different approaches, some of which are more black-boxy than others. Multiple logistic regression with some sort of variable selection procedure doesn’t seem all that black-boxy to me. Neural networks seem more black-boxy. Then there are regression trees, deep learning, AlphaZero, and other approaches. I’m not sure whether that uncertainty makes me a good or bad person to answer this question. Does it make me a broadminded outsider who can evaluate machine learning objectively, without getting caught up in hype or technical details? Or does it just make me uninformed? Could be some of both! 🙂
Re: valuing an improvement in predictive accuracy even at the cost of not being able to understand the basis of the prediction, I’d first want to be convinced that machine learning models (however defined) really do make substantially more accurate predictions than more interpretable models do! In my admittedly limited reading, the predictive performance of machine learning techniques in practical applications in ecology and social science (as opposed to in, say, facial recognition) doesn’t seem to be much better than garden-variety OLS regression. For instance, see Salganik et al. 2020 PNAS, and Brian’s discussion from a few years ago of how machine learning methods badly overfit ecological niche models in the presence of spatial autocorrelation. More broadly, popular studies of what makes for successful prediction in various fields don’t identify many cases in which uninterpretable black-box machine learning methods outperform other methods (see here and here). Ok, I’m aware of recent theoretical arguments that the classic bias-variance tradeoff (aka underfitting vs. overfitting) doesn’t apply to some forms of machine learning. But those theoretical arguments don’t really apply in typical ecological contexts, where it’s not possible to perfectly fit the training data, or even come close to doing so.
Sociologist Kieran Healy seems to have the same view as me, but he has a funny meme for it:
I know there’s active research on making black box machine learning methods interpretable, but I don’t know anything beyond the fact that research is ongoing.
One further thought: speaking as someone whose own research focuses on improving understanding rather than prediction, I don’t feel like I have a good handle on the ecological contexts in which we really would want totally black-box predictions, even if they were good predictions. I mean, I can think of a few. For instance, I’m sure ecologists who work with camera trap data would love to have a really good black-box image recognition system for identifying the species in millions of pictures, because not every camera trap project can enlist an army of citizen scientists to do image classification. And I haven’t read this yet, but just based on the abstract it sounds like an interesting application of deep learning to detect potential mimicry in snakes. In contrast, think about management contexts. Don’t decision-makers often want some understanding of the basis of the forecast that’s informing their decision-making? (And if they don’t want some understanding, perhaps because they want to cherry-pick their preferred forecast, or because they don’t want to take responsibility for their decisions, well, maybe they should want some understanding!) And aren’t there ethical issues around just blindly trusting black-box classification and prediction algorithms? I’m thinking for instance of cases in which purportedly “neutral” classification and prediction algorithms end up creating and reinforcing racist outcomes. These aren’t ethical issues I’ve thought much about, so perhaps they don’t apply in many ecological contexts? Anyway, consider this last paragraph an invitation for comments from others who know more about this stuff than I do.
Unlike Jeremy, I am willing to stipulate the premise of the question – that complex machine learning can make more accurate predictions. This is not voodoo – these methods are just regressions, but they use functions that are more malleable (able to bend to fit the data) than a linear or quadratic function, and they are much more able to pick up interactions between terms (which is awkward and crude at best in an OLS regression). This allows for better prediction. But for both those reasons there is a very real risk of fitting noise rather than signal in the data. To truly fit the signal you need to have:
- Lots of data as in thousands or at least many hundreds of points (not so common in ecology – very common in database marketing where much of this originated)
- The ability to select subsets of the data that are truly independent of each other for appropriate calibration and validation. This also is rare in ecology – e.g. problems with spatially structured data.
- A predictable problem. If you want to predict next year’s deer population – can variables measurable now actually predict that, or is it going to be primarily driven by next year’s weather and the arrival of a new disease? If the latter, machine learning is not going to improve prediction.
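One common way to approximate the second requirement with spatially structured data is to hold out whole spatial blocks rather than individual points, so that near neighbours never end up on both sides of the train/test divide. The following is only a minimal sketch of that idea, not a method from any of the linked posts: the function name `spatial_block_folds`, the block size, and the simulated coordinates are all hypothetical, and in practice the block size would need to be chosen relative to the scale of spatial autocorrelation in the data. Blocking reduces the dependence between folds but does not guarantee independence.

```python
import math
import random
from collections import defaultdict


def spatial_block_folds(coords, block_size, n_folds, seed=0):
    """Assign each point to a cross-validation fold by spatial block.

    All points falling in the same square block share a fold, so a block
    is never split between training and test data. (Hypothetical helper.)
    """
    # Group point indices by the grid cell (block) they fall into.
    blocks = defaultdict(list)
    for i, (x, y) in enumerate(coords):
        key = (math.floor(x / block_size), math.floor(y / block_size))
        blocks[key].append(i)

    # Shuffle the blocks reproducibly, then deal them out to folds in turn.
    block_keys = sorted(blocks)
    random.Random(seed).shuffle(block_keys)
    fold_of = [None] * len(coords)
    for j, key in enumerate(block_keys):
        for i in blocks[key]:
            fold_of[i] = j % n_folds
    return fold_of


# Simulated sampling locations on a 100 x 100 landscape (made-up data).
rng = random.Random(42)
coords = [(rng.uniform(0, 100), rng.uniform(0, 100)) for _ in range(500)]

folds = spatial_block_folds(coords, block_size=25.0, n_folds=4)
for k in range(4):
    test_idx = [i for i, f in enumerate(folds) if f == k]
    train_idx = [i for i, f in enumerate(folds) if f != k]
    print(f"fold {k}: {len(train_idx)} train points, {len(test_idx)} test points")
```

Any model, simple or complex, can then be fit on `train_idx` and scored on `test_idx` for each fold. A model that looks much worse under blocked folds than under randomly assigned folds is probably leaning on spatial autocorrelation rather than on signal.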
So – if you actually meet all of those technical requirements, when, if ever, would you want to use it? As the question assumes, really complex models (like neural nets, MARS, boosted regression, etc.) can get that extra few percent of accuracy out of the data at the cost of telling people “trust my model because I cannot explain it to you”. And it is worth noting it is usually only an extra few percent of accuracy. To me it is obvious that black-boxing for a few percent is only going to be acceptable in a management context where people are desperate enough for the last few percent of accuracy to give up the ability to “kick the tires”, “sanity check”, etc. Predicting the stock market is one of those places (obviously not ecological). Within ecology I’m not sure how common that really is? I would think mostly super applied situations with strong economic impacts. Predicting dam water levels (and that is really hydrology not ecology)? Predicting populations of pests or game animals? So, to answer your question: mostly I don’t think there are a lot of applications where people are willing to make that tradeoff.
There is one glaring exception though. Species distribution models. There is no denying there has been massive embracing of these models. And there is no denying that people working in this domain are more than happy to jump on board with more complex machine learning models and toss out Generalized Linear Models (e.g. logistic), even GAMs. And there is evidence it does buy a few percent of increased accuracy. But as I’ve argued before (see my first link on spatial non-independence), I personally don’t think this is reliable, given the problems of getting independent data to cross-validate. I trust a GLiM or GAM much more than a boosted regression tree in its ability to extrapolate into the future. So I can’t explain why this one exception exists in people’s minds. My cynical self suspects it is because we are predicting species ranges 50-100 years in the future, which is largely untestable, so there is not a high degree of accountability and the whole exercise has sort of veered off into make-believe land even in managers’ minds (I think it is also an open question how much managers rely on these in any detail for management decisions). Or maybe less cynically it has to do with producing maps, which people intuitively trust much more than answers like “the number is 5.7328”.
I do think Jeremy’s point about automated image classification is likely to be a highly legitimate current and future use for some specific types of machine learning. It fits the criterion of lots of data being easily obtainable (thanks to digital cameras and remote sensing). And the benefit is scaling (100s or 1000s of predictions), not more accuracy. And scaling at a level that is costly or impossible with the alternative (human analysis). And it is producing raw data for some future model, not the model itself. But it is also worth noting that in areas like land cover classification of aerial images, simpler models including regression trees have been used fairly successfully for decades.
Which leads to … I would argue that you can have your cake and eat it too. Simple regression trees have most of the benefits (nonlinear responses, interactions between variables) and yet are even more easily interpreted by managers than OLS coefficients. One has to be careful with regression trees to address my three-point list above to avoid overfitting, and regression trees do suffer from some proclivity to instability with highly collinear data (to which the fix is to eliminate collinear variables). But I am in the minority in favoring the “old-fashioned” regression trees. Most prefer random forests, pointing out that you can get metrics on the “importance” of a variable, but I find these to be unsatisfying as they don’t really measure effect sizes (or even directions) – still too black boxy for me. So I stick to OLS/GLiM/GAM/CART in my own work.
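To make concrete why a small regression tree reads so easily, here is a toy sketch of one grown by hand and printed as if-then rules. Everything in it is an assumption for illustration: the data are made up and noise-free (a hypothetical abundance that is high only when both temperature and rainfall are high), the helper names `grow` and `describe` are invented, and the split search is a bare-bones version of the usual least-squares criterion, not a substitute for an established CART implementation.

```python
def sse(ys):
    """Sum of squared errors around the mean (0 for an empty list)."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)


def best_split(rows, ys, min_leaf):
    """Return (sse, feature_index, threshold) for the best split, or None."""
    best = None
    for f in range(len(rows[0])):
        values = sorted({r[f] for r in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2  # candidate threshold halfway between neighbours
            left = [y for r, y in zip(rows, ys) if r[f] <= t]
            right = [y for r, y in zip(rows, ys) if r[f] > t]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            score = sse(left) + sse(right)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best


def grow(rows, ys, depth=0, max_depth=2, min_leaf=2):
    """Grow a small tree; stop at max_depth or when no split reduces error."""
    split = best_split(rows, ys, min_leaf) if depth < max_depth else None
    if split is None or split[0] >= sse(ys):
        return {"mean": sum(ys) / len(ys), "n": len(ys)}
    _, f, t = split
    left = [(r, y) for r, y in zip(rows, ys) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, ys) if r[f] > t]
    return {
        "feature": f,
        "threshold": t,
        "left": grow([r for r, _ in left], [y for _, y in left],
                     depth + 1, max_depth, min_leaf),
        "right": grow([r for r, _ in right], [y for _, y in right],
                      depth + 1, max_depth, min_leaf),
    }


def describe(node, names, indent=""):
    """Render the tree as nested if/else rules, one string per line."""
    if "mean" in node:
        return [f"{indent}predict {node['mean']:.1f} (n={node['n']})"]
    name, t = names[node["feature"]], node["threshold"]
    return ([f"{indent}if {name} <= {t:g}:"]
            + describe(node["left"], names, indent + "    ")
            + [f"{indent}else:"]
            + describe(node["right"], names, indent + "    "))


# Made-up data: abundance is high only where it is both warm AND wet.
rows = [(temp, rain) for temp in (6, 10, 14, 18, 22)
        for rain in (500, 700, 900, 1100)]
ys = [20.0 if temp > 15 and rain > 800 else 5.0 for temp, rain in rows]

tree = grow(rows, ys)
print("\n".join(describe(tree, ["temperature", "rainfall"])))
```

The printed rules state the interaction directly (rainfall only matters once temperature is above its threshold), along with the direction and size of each effect. An additive OLS fit would need an explicit interaction term to say the same thing, and a variable importance score would say only that both variables matter.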