Why advanced machine learning methods badly overfit niche models – is this statistical machismo?

Statistical machismo, as I’ve been defining it in this blog (here and here) is when there is a push to ever newer, more complex methods, often in pursuit of a one-dimensional form of improvement while ignoring the fact that choosing the proper statistic involves complex trade-offs. For example, hyperfocusing on one source of error while ignoring (or even worsening) other sources of error.

Here I want to talk about how the new and the complex sides of statistical machismo interact badly in the most common application of machine learning to ecology to date: niche models (aka species distribution models or habitat models). The goal is to basically substitute extensive amounts of environmental data for our inadequate (i.e. sparse) biological sampling and biological understanding to build a predictive model of where a target species will be found.

I said I wanted to talk about how new and complex interact badly. Let’s start with complex. All niche modelling (and machine learning) is basically a giant regression. The different techniques simply substitute a different functional form for the regression. Logistic regression has the well-known sigmoidal logit function. Regression trees have a branching decision-tree format. Bagged trees and random forests use an ensemble of regression trees fit to different samples of the data. Boosted regression trees use a sequence of regression trees. MARS explicitly uses basis functions, meaning that a sufficiently large number of them can produce any n-dimensional surface imaginable. Neural nets have dozens of interconnections between nodes. The whole point of these functions is that they can fit extremely variable, rugose surfaces with high orders of interaction between the variables. This allows an extremely close fit to the data. In an extremely highly cited paper, Elith et al. showed that advanced machine learning techniques (e.g. MARS, random forests and boosted regression trees) were superior to the older climate-envelope approaches (like GARP, which explicitly searches for heuristic rules for climate-driven range boundaries) as well as familiar approaches like logistic regression. So better R2, better prediction – sounds good, right?

Well, this is where the complex interacts with the new. There is a major pitfall with highly flexible functions that fit the data very closely: overfitting. If you fit the noise as well as the signal, your predictive power goes up without end as you increase the complexity of the model on the dataset you’re calibrating with (fitting the model on), which looks really good. But eventually (actually, at the point where you start fitting the noise instead of the signal) your predictive power for new data sampled from the same process/context (not used in calibrating) actually starts going down, because by definition the noise doesn’t carry over into the new data. (See this earlier post for a refresher on some of these concepts.)


The red line shows the calibration data. Goodness of fit only ever improves with complexity (eventually approaching perfect goodness of fit – here measured as RMSE=0). However when this calibrated model is applied to “out of the bag” or “holdout” or “test” data (i.e. data not used in the calibration), the goodness of fit peaks at some intermediate complexity and then worsens again (green line). The worsening fit is due to fitting noise instead of signal in the calibration data. Despite choosing an optimal complexity level based on the validation curve (here complexity=10), there is still some unavoidable overfitting going on (shown here as the dotted black line).
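The two curves in the figure are easy to reproduce. Here is a minimal sketch with my own simulated data (not from any of the papers discussed), using polynomial degree as a stand-in for model complexity: calibration error falls forever, while validation error eventually turns back up as the fit starts chasing noise.

```python
# A minimal sketch of the figure (simulated data): polynomial degree stands
# in for model complexity; calibration RMSE falls monotonically while
# validation RMSE worsens again once the fit starts chasing noise.
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    """One draw of calibration or validation data from the same process."""
    x = rng.uniform(-1, 1, n)
    y = np.sin(3 * x) + rng.normal(0, 0.3, n)   # signal + noise
    return x, y

def rmse(y, yhat):
    return float(np.sqrt(np.mean((y - yhat) ** 2)))

x_cal, y_cal = sample(40)
x_val, y_val = sample(40)

cal_err, val_err = [], []
for degree in range(1, 13):                     # increasing "complexity"
    coeffs = np.polyfit(x_cal, y_cal, degree)
    cal_err.append(rmse(y_cal, np.polyval(coeffs, x_cal)))
    val_err.append(rmse(y_val, np.polyval(coeffs, x_val)))

# The complexity the validation (green) curve would pick
best = 1 + int(np.argmin(val_err))
```

Plotting `cal_err` and `val_err` against degree reproduces the red and green curves, and the gap between them at the chosen `best` degree is the dotted-black-line overfitting discussed below.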

Now the inventors of advanced machine learning techniques are very clever and not unaware of this problem, so they developed a technique to avoid overfitting. The basic idea is to fit the model out to a high degree of complexity (notice the ever-improving fit of the calibration curve in the figure). But then they test the fitted model on a separate set of data (the “holdout” or validation data), holding the parameters of the model constant while increasing the complexity from low to high (e.g. adding more nodes to the neural network, more branches to a regression tree, etc.). Up to a point, increasing complexity improves fit on the holdout data, but beyond a certain level of complexity the fit starts to get worse again. This is because you are now fitting noise in the original calibration data, which doesn’t transfer to the validation data. This lets you pick the optimal degree of complexity to fit the signal but not the noise and therefore make the most accurate predictions. In the figure it would be a complexity of 10. Now, it is probably obvious that the validation data needs to be independent in the statistical sense – uncorrelated errors – of the calibration data for this method to work. The simplest approach is to hold out, say, 1/3 of the data for validation. Most commonly now a fancier technique known as 10-fold cross-validation is used. Here 90% of the data is used for calibration and the remaining 10% for validation; then a different 10% is held out for validation, and this is repeated 10 times so all the data is used once. But whether simple hold-out or 10-fold, the core idea of cross-validation – calibrate, then choose the optimal amount of complexity on a separate validation dataset – is the central innovation for avoiding overfitting.
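The mechanics of 10-fold cross-validation are simple enough to sketch in a few lines. This is a toy illustration with made-up data; `np.polyfit` stands in for whatever flexible learner you actually use (tree depth, number of boosting steps, etc. would play the role of polynomial degree):

```python
# Toy 10-fold cross-validation for choosing model complexity. Made-up data;
# np.polyfit is a stand-in for a tree, MARS model, neural net, etc.
import numpy as np

def kfold_indices(n, k=10, seed=0):
    """Shuffle 0..n-1 and split into k roughly equal folds."""
    idx = np.random.default_rng(seed).permutation(n)
    return np.array_split(idx, k)

def cv_error(x, y, complexities, k=10):
    """Mean held-out squared error for each candidate complexity."""
    folds = kfold_indices(len(x), k)
    errors = {}
    for c in complexities:
        fold_err = []
        for i in range(k):
            val = folds[i]                                 # the held-out 10%
            cal = np.concatenate([folds[j] for j in range(k) if j != i])
            coeffs = np.polyfit(x[cal], y[cal], c)         # calibrate on 90%
            fold_err.append(np.mean((y[val] - np.polyval(coeffs, x[val])) ** 2))
        errors[c] = float(np.mean(fold_err))
    return errors

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 100)
y = np.sin(3 * x) + rng.normal(0, 0.3, 100)

err = cv_error(x, y, complexities=range(1, 13))
best = min(err, key=err.get)     # the complexity the validation step picks
```

The chosen `best` lands at an intermediate complexity: simpler models underfit the signal, more complex ones fit calibration noise that doesn’t transfer to the held-out folds.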

Note that this technique is not perfect: even with a validation step to choose the appropriate complexity, the model still performs better on the calibration data than on the validation data (the gap is indicated by the dotted black line in the figure). In the kinds of data-mining applications for which these techniques were invented, the amount of overfitting is usually small. This is because in those applications the data points – e.g. records in a database of customer purchases – are largely independent of each other.

In niche modelling, however, the data are spatially autocorrelated. Imagine an extreme scenario. Measure both the dependent variable (species presence/absence or abundance) and the independent variables (temperature, precipitation, soils, etc.) at 100 points. Use these points as your calibration data to choose the best model. Now take as your validation data the 100 points that are 1 cm from your calibration points. This is of course absurd and nobody does this. But it highlights the problem. If you took points 10 km away it might be a bit better (or a lot better, if you are studying phenomena at scales much smaller than 10 km). What if you do what almost everybody does? You measure a bunch of points scattered across the domain of interest and then you pick some as calibration and some as hold-out validation. Probably you are in a lot of trouble. Your points used for validation are almost always going to be neighbors of points used for calibration, and almost no matter what scale of phenomenon you are studying, by the nature of a design that takes as dense a set of points as you can afford across the area of interest, neighboring points are going to be autocorrelated. Not quite as extreme as taking validation points 1 cm away from the calibration points. But not very good either.
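This thought experiment is easy to simulate. In the sketch below (entirely made-up data), the response is nothing but spatially autocorrelated noise – there is no signal at all to find – yet a maximally flexible “model” (predict each point from its nearest calibration neighbour) looks good under a random hold-out and falls apart when the validation points are a spatially separate block:

```python
# Hypothetical simulation: purely autocorrelated noise along a transect,
# scored with a random hold-out vs a spatially blocked hold-out.
import numpy as np

rng = np.random.default_rng(2)
n = 200
s = np.sort(rng.uniform(0, 100, n))                 # positions along a transect
raw = rng.normal(0, 1, n + 20)
y = np.convolve(raw, np.ones(21) / 21, "valid")     # smoothing => autocorrelation

def nn_predict(s_cal, y_cal, s_val):
    """Ultra-flexible 'model': predict from the nearest calibration point."""
    nearest = np.abs(s_val[:, None] - s_cal[None, :]).argmin(axis=1)
    return y_cal[nearest]

def mse(a, b):
    return float(np.mean((a - b) ** 2))

# Random hold-out: validation points sit right between calibration points.
idx = rng.permutation(n)
cal, val = idx[: 2 * n // 3], idx[2 * n // 3:]
random_mse = mse(y[val], nn_predict(s[cal], y[cal], s[val]))

# Spatially blocked hold-out: validate on a far-away stretch of the transect.
block = s > 66
blocked_mse = mse(y[block], nn_predict(s[~block], y[~block], s[block]))
```

With this setup the blocked error comes out several times larger than the random-hold-out error (the exact ratio depends on the seed): random hold-out rewards fitting the autocorrelated noise, exactly the pitfall described above.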

Surely this problem must be known and addressed? Well, several papers acknowledge it is a potential issue. And a handful of papers have actually attempted to measure this effect by spatially separating the calibration and validation points (e.g. here and here and here). In general the effect is rather large: models predict very poorly when the test/validation data are spatially distinct from the model calibration data. That’s because extremely flexible functions make fitting the noise all too easy, while spatial non-independence of the test and training data lets fitting noise look like a good idea even with cross-validation. Yet this issue is ignored in 99% of the papers published using niche models. I don’t know why. Perhaps there is a sense that what is good enough in other areas of machine learning ought to be good enough in ecology. But those other areas aren’t fitting spatially correlated points like in ecology.

To my mind this is a great example of statistical machismo. The fancy machine learning techniques have been advocated for and adopted primarily because they show improved goodness of fit (e.g. R2 or more often AUC). Maybe they have also been adopted a little too fast because they are new and shiny and complex enough to reduce the size of the “in” group too – but I’ll leave that to the reader to decide. Thus we have the one-dimensional view (only worrying about R2) without worrying about trade-offs. The biggest trade-off that concerns me here is simply that old, boring, simple techniques have been extremely well-studied and their pitfalls and shortcomings are well-known. Hot new techniques rarely have substantial study nor substantial understanding amongst the readership (or usually the authors) of where the problems lie. Specifically, the source domains of machine learning don’t suffer from spatial autocorrelation, so they never studied it, and the techniques are too new in ecology for people to have worked out and assessed the problems. A secondary trade-off that concerns me is that the fancy new machine learning models are almost completely black-box to interpretation in comparison to simpler methods like linear or logistic regression (yes you can get back variable importance from these fancy methods but that is a good step short of the sense of sign, magnitude, specific interactions, non-linearity etc one can get from a good old simple regression). And as a final sign that the push for fancy machine learning methods in niche models is a case of statistical machismo, I have had papers where reviewers insisted that the advanced methods be used and would not accept rational counter-arguments.

Some readers may wonder why I am making such a big deal here about spatial autocorrelation when in the past I have expressed indifference to it. The reason is simple. In the past I was addressing methods concerned with improving the accuracy of p-values in very strongly constrained regression models (i.e. the fitted functions are not too bendable). p-values are treated in way too binary a fashion and given too much importance, and I just can’t get excited about methods designed to get more accurate p-values, especially when the changes are fairly small. Here I am worried about extremely flexible functions that are made flexible to increase goodness of fit (e.g. R2). The possibility of greatly overfitting the data is enormous, and indeed several studies have shown that the consequence is quite large. Small changes to an unimportant p-value – meh – fix it if you want but don’t bother me (and don’t bother to flame the comments on this topic here – save that for the original post if you need to). Vastly overstating an important R2 – I’m worried. And for what? Well, those fancy machine learning methods typically only improve R2 by about 0.05 (i.e. 5% of variance explained) or less. This point is not often highlighted, but take a look at Figure 3 of Elith et al – the point-biserial correlation changes from 0.20 for things like MARS to 0.18 for things like logistic regression, which is a difference in r2 of 0.04 vs. 0.0324 – a difference of less than 0.008. Not a good trade-off in my book! This is of course a matter of judgement, and I’m open to people who feel differently as long as it is a balanced, multi-dimensional weighing of all aspects of the trade-off (and a reciprocal openness to different opinions).
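For the record, the back-of-envelope arithmetic on that Figure 3 comparison (squaring the point-biserial correlations converts them to variance explained):

```python
# Squaring the point-biserial correlations from Elith et al.'s Figure 3
# converts them to variance explained; the gap is tiny.
r_ml, r_glm = 0.20, 0.18
gain = r_ml ** 2 - r_glm ** 2      # 0.04 - 0.0324 = 0.0076
```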

I’m not going to make the call here. I’ll leave it to the readers and the comment section. Is the push to fancy machine learning methods in niche models despite the fact they’re so new their limitations aren’t really worked out (specifically as we’re starting to understand they can badly overfit spatially autocorrelated data) a case of statistical machismo? What do you say?

37 thoughts on “Why advanced machine learning methods badly overfit niche models – is this statistical machismo?”

  1. Thanks for this post Brian. I like the general point that model adequacy is a multidimensional issue. Mark Taper and others have some interesting ideas in this direction:


    Here’s their abstract:
    “Models carry the meaning of science. This puts a tremendous burden on the process of model selection. In general practice, models are selected on the basis of their relative goodness of fit to data penalized by model complexity. However, this may not be the most effective approach for selecting models to answer a specific scientific question because model fit is sensitive to all aspects of a model, not just those relevant to the question. Model Structural Adequacy analysis is proposed as a means to select models based on their ability to answer specific scientific questions given the current understanding of the relevant aspects of the real world.”

    • Ooh, good link Steve! Hadn’t seen that paper, it looks right up my alley. And very relevant to stuff we’ve talked about on this blog before, like the many different ways models can be false and how falsehood in models can actually help us learn the truth.

  2. What are your thoughts about combining species distribution models with better a priori model sets or post-hoc model selection methods (some sort of AIC)? I’m running into this problem in my work. We’re trying to predict species occurrences over a really large area so we can determine where we should conserve. We have a fair amount of presence-only data and little to no presence-absence data. I’m trying to avoid the overfitting issue by starting with ‘well’-justified models based on expert opinion. I kind of feel like the modellers will look down on using experts and the experts/end users will look down on the modelling.

    • Using well-thought-out models based on expert opinion should NEVER be looked down upon in my opinion! It is the preferred option whenever available.

    • @ATM,

      Your suggestion about model selection was exactly where I was going with my thoughts on this. However, kind of like Dr McGill, I worry about other aspects of ML methods. In particular, while overfitting is possible with really complicated models having lots of parameters, that can be compensated for using something like the AIC, BIC, WAIC, or WBIC. The worry I have is that, now, people are talking about ensemble methods, where whole flocks of models are being applied, a little logistic regression here, a little bagging there, a few random forests over there, and people are talking about averaging over the set of models to get a composite explanation.

      So, now, it seems to me it’s no longer sufficient to zing each model in an ensemble by its own complexity penalty, it seems something needs to be subtracted from the log-likelihood for the fact that there are now M models being used in combination. I’m new to ensemble methods in ML, but I have not seen this discussed, so, like Dr McGill, “Surely this problem must be known and addressed?”

      If anyone knows a paper that does, I’d greatly appreciate the pointer. Seems a key issue, at least to me.

  3. Question Brian: is the unavoidable overfitting even with optimal model complexity actually just failure to fit the noise in the out-of-sample data? Or is there a subtlety here I’m missing? That bit of your post was new to me, so I want to make sure I understand it.

    • Most of it is fitting part of the noise in the calibration data that isn’t matched in the validation data. Even the optimization method can’t pull this apart perfectly. At the simplest – complexity usually is integral, but the optimal complexity might be between integers. Or there is a certain amount of instability in methods with highly collinear data – the noise in the calibration data might tip things to the slightly suboptimal model so it can never get the full “optimal” prediction on the validation data.

  4. I started to use boosted regression trees recently (not exactly for niche modelling) and I definitely think that the limitations you discuss are important and people should try to understand these methods better. However, I have two comments:

    1. One thing I really like about these methods is that they allow all kinds of non-linear relationships, including thresholds and unimodal responses, which are common in ecological data but difficult to analyze using simpler methods.

    2. Species distribution modelling seems to be one of the biggest bandwagons in ecology and I think that there are much more serious problems with it than overfitting. Most studies assume that species distribution is driven only by a few coarse-scale climate variables and ignore small-scale heterogeneity, species interactions, the possibility of adaptation (when predicting the effects of climate change), etc. Importantly, the quality of data is often questionable: unreliable database sources, presence-only data, sampling biased towards certain geographical areas… The species distribution modelling community should do much more about these issues. The quality of data and the reasonableness of biological assumptions are crucial. The decision whether to use GLM, BRT, MARS, or something else is probably not so important after all…

    • Re #1 – I completely agree – by far the best reason to choose these methods (although I personally prefer simple regression trees, where I can actually look at and parse the nature of the interactions and non-linearities easily). If your data are spatial (or temporal or phylogenetic), though, we either have to solve the problem I’m talking about or you’re overestimating how good your model is.

      Re #2 – I’m not going to disagree with you, except that I think the reason most people give niche models a pass on the issues you raise is that the quantitative metrics of prediction (R2, AUC) look so good that people just assume the issues must be minor details. They loom larger though if you realize that the goodness of fit is only apparent.

      • Regression trees are terrible things, unless you have very large data sets, because of the instability issue; collect another sample of data (or remove a few observations from the actual sample) and there is a good chance that the first split (= “best” variable) will be different to the one selected on the original (full) data set. This is partly why we now have ensemble approaches such as bagged or boosted trees, random forests, etc. Unfortunately, you do lose the nice interpretation of the single tree.

        From what I know of your research Brian, you have access to big datasets, but all the advocacy for these machine learning tools seems to resonate with many practitioners in applied ecology whilst the warnings about sample size, stability etc pass them by. (I’ve been guilty of this too some years ago.) What use is a nicely interpretable tree diagram when what it represents is a fluke of random chance during data collection?

    • Re: your #2 Jan, a general remark. I think it’s interesting how different people rank the “fundamentalness” of different issues with any given idea or approach. Here, you and Brian clearly agree on much–but disagree at least mildly on whether overfitting or data quality is the more “important” or “fundamental” issue with niche modeling.

      I’ve run into something similar in phylogenetic community ecology. Like Mayfield and Levine (2010), I’d say the most fundamental problem with phylogenetic community ecology (mapping coexisting species onto a phylogeny to infer contemporary coexistence mechanisms) is conceptual: even with perfect data, the results don’t actually have the implications for species coexistence they’re widely believed to have. But other folks would say that the most fundamental issues are other issues–bad phylogenies, say, or inability to account for intraspecific variation, or dependence of the results on spatial scale, or etc. And all of us could say–quite correctly!–“This issue is fundamental because unless we sort it out then the approach won’t work, no matter what happens with the other issues.”

      I’m tempted to conclude that when multiple necessary conditions must be satisfied in order for an approach to work, none of those conditions is more “fundamental” or “important” than the others. Which perhaps is a useful conclusion to draw, because it encourages users to think about all of the issues with an approach–all of the conditions that must hold for an approach to work. Rather than users worrying exclusively about certain issues (perhaps the ones that are most obvious, or seem most soluble) and ignoring other issues. On the other hand, if users are ignoring some issues with an approach, one way to get them to pay attention is to argue that the issues they’re ignoring are actually the “most fundamental” ones…

  5. Wenger and Olden (doi:10.1111/j.2041-210X.2011.00170.x) is another cautionary paper about the transferability of complex models. They have a nice figure showing a complicated, wiggly curve (best according to standard CV) with a much more plausible smooth curve (best according to block spatial CV). (Could you show the authors/year of references rather than just a “here” link with a DOI? I’d rather not have to click through if I can see that I already know about the paper …)

    • I didn’t know that paper, although I clearly should have. As you note, it clearly shows the role of a very flexible model (random forest) vs. a more rigid one (GLMM with linear and quadratic terms) in this problem. Even with spatial autocorrelation the GLMM didn’t come out looking too bad, but the random forest looks terrible.

    • @Ben I commented on the MEE blog post about this (and can’t access the paper; stupid paywalls, hence just commenting on the figures here), but something doesn’t look right with those curves. I agree the RF fit is exceedingly complex but, as I commented on the MEE blog, something odd is happening at the low end of the gradient. I find it hard to believe the RF got the fit so wrong at low values of mean air temp. The counter question then is: is the quadratic GLMM too constrained in terms of the model being fitted? Is there a better model between the simple but restrictive quadratic GLMM fit and the complex but overfitted(?) RF model? I suggested a GAM(M) fit would’ve been a nice comparison, allowing the response to be derived from the data rather than specified a priori by the researcher (as it is in the GLMM), but being smoother than the RF fit. Such a GAM(M) fit might help investigate differences between the two (RF and GLMM) model fits at the low-temperature end of the data set.

      Now, whereis #icanhazpdf …

      • I agree that if you use regression trees, you have to address instability (because regression trees are sequentially calculated, they suffer from some of the same issues as stepwise selection on collinear data – a different sample of the data can often give a different tree). But not all regression trees show instability. And one can identify and report instabilities when they occur. And finally – most often the instabilities are due to collinearity, so I’m not sure it’s as big a problem (from a management/applied perspective) as people make it out to be when the top split flips to a different variable that is highly collinear with the first variable.

        But you have to sit in a room with a manager and show them a regression tree and see their eyes light up because they understand it (it looks like a decision tree used in management) and then put a table of importance values in front of them and watch their eyes glaze over to appreciate that there is a trade-off between black-boxedness and stability issues.

      • @Brian (I’ll reply here as that is where your reply came in the thread) Your 2nd paragraph is quite pertinent; I have sat in a room with clients working for a responsible agency in the UK and shown them a single tree, and they do find it easy to understand what the model means. It gets messy, though, when you then need to show different trees based on accepting alternative splits instead of the best splits. Quite often the problem boils down to a poor choice of predictors in the first place; there is a tendency in some quarters to throw everything and the kitchen sink at statistical methods and see what comes out. Correlated but ecologically trivial or unimportant variables can mask the effects of the things that truly are important, just not according to the reduction of RSS.

  6. Somewhat relatedly, statistician and machine learning guru Larry Wasserman has a new post up today, explaining why it’s impossible to create a 95% confidence interval around a nonparametric function. He calls lack of ways to do this “one of the biggest embarrassments in statistics”. He describes a way to get a biased c.i. which can be computed via bootstrapping, and suggests that we’ll just have to “stop worrying and love the bias”.

  7. I would also point out that overfitting not only affects the spatial transferability of niche models but also their temporal transferability (i.e. their projections through time under climate change). Using a retrospective analysis, we show that machine learning models show stronger fit in the time period the models were fitted in, but simpler approaches like GLM show better skill when tested against temporally independent data (Dobrowski et al. 2011, Ecological Monographs).

      • Thank you for an interesting post… This is a topic I’ve been thinking about for some time. Over the past decade, a “cottage industry” of research has sprung up around fitting increasingly complex models for niche modeling applications. I always ask a basic question when I use these models: do I have any biological reason to believe that the climate response function should be anything but smooth in nature? Also, your point about the marginal gains of using these machine learning models is a good one. We examined multiple models for >100 plant species and show that the variance in model skill driven by species traits dwarfs that driven by model algorithm. It’s basically much ado about nothing and reinforces the bottom line… Biology matters.

      • Thanks for the thoughtful ideas (and empirical results). You make an interesting point about smooth functions. Paleoecologists have worked with strictly smooth (usually just quadratic) “transfer” functions (but basically the same idea as niche modelling) for decades. There is starting to be a move now to switch to the fancier methods and I wonder if they will regret it in the long run.

      • Interesting that you bring up palaeoecologists (I am one). We are simple folk and went even simpler than smooth parametric approaches (GLMs); we use(d) a simple heuristic approximation based on weighted averages, which under certain conditions approximates the fit based on GLMs fitted to each taxon and inverted to yield a prediction of environment in the past.

        In a recent chapter on machine learning methods in palaeolimnology, I fitted a boosted tree model to predict lake water pH from diatoms and compared the fit with the simple weighted average heuristic. Using an independent test set there was nothing to choose between them, yet the BRT took an age to fit and had several parameters to tune. Both models were equally difficult to understand, however.

        The central points of your post certainly resonate with me as a palaeoecologist. You may be interested in a couple of recent blog posts by Richard Telford, a fellow palaeoecologist, who has examined the issue of spatial autocorrelation in a number of settings. The issue in his recent blog posts is catastrophic underestimation of model errors in dinoflagellate-temperature models based on spatially smooth datasets from the Arctic, for example. His recent posts are here

  8. Pingback: INTECOL 2013 – Tuesday Recap #INT13 | Dynamic Ecology

  9. @Brian Thanks for the interesting post. Your comment

    A secondary trade-off that concerns me is that the fancy new machine learning models are almost completely black-box to interpretation in comparison to simpler methods like linear or logistic regression (yes you can get back variable importance from these fancy methods but that is a good step short of the sense of sign, magnitude, specific interactions, non-linearity etc one can get from a good old simple regression).

    is a somewhat inaccurate picture of the situation with, say, random forests or boosted trees. Partial response plots allow the sort of interpretation you desire, in much the same way you would need to proceed with a more familiar GLM or GAM fit once you have more than a few covariates.

    Likewise, there have been a good number of advances in understanding the statistical properties of many of these newer methods for example as summarised in the excellent Elements of Statistical Learning (2011, Hastie et al, PDF freely available from the book website). The situation isn’t as dire as your comment suggests, as long as you are prepared to work at it.
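    [A sketch of the partial-dependence idea this comment refers to, on a hypothetical simulated dataset (nothing here is from the papers under discussion): clamp one predictor to a grid of values, average the model’s predictions over the data, and read off the sign and shape of the fitted response.]

    ```python
    # Partial dependence by hand on simulated data: a boosted-tree model
    # recovers the known unimodal and increasing responses.
    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor

    rng = np.random.default_rng(3)
    X = rng.uniform(-2, 2, (500, 3))
    # Known truth: unimodal in x0, increasing in x1, x2 irrelevant.
    y = -X[:, 0] ** 2 + X[:, 1] + rng.normal(0, 0.2, 500)

    model = GradientBoostingRegressor(random_state=0).fit(X, y)

    def partial_dependence(model, X, feature, grid):
        """Average prediction with one feature clamped to each grid value."""
        curve = []
        for v in grid:
            Xv = X.copy()
            Xv[:, feature] = v
            curve.append(model.predict(Xv).mean())
        return np.array(curve)

    grid = np.linspace(-2, 2, 25)
    curve0 = partial_dependence(model, X, 0, grid)   # recovers the hump shape
    curve1 = partial_dependence(model, X, 1, grid)   # recovers the rising trend
    ```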

  10. I guess I’m a bit late to the comments on this, but I wanted to add one thought. I suppose that insisting on some advanced technique because it’s advanced, despite violating the method’s assumptions, is machismo, but would you say using machine learning in and of itself is statistical machismo? I can estimate a simple linear model (y = a + bx + error) with OLS, or with ML, or with a Gibbs sampler, or even with an ANN; should I only use OLS because all other methods are “statistical machismo” and OLS is the simplest? The flip side of this analogy is that even simple normal models with OLS have assumptions, and if one violates those assumptions (normality, i.i.d., etc.) it seems that such a transgression is no worse than the example in your blog post.

    It just boils down to this: all model-fitting methods have assumptions, and we need to meet those assumptions for our inferences to be valid. In a discipline that is supposed to be scientific, it’s amazing how much emotional baggage people bring to choices of methods. If I can fit my aforementioned linear model with a Gibbs sampler and I don’t violate any of the assumptions, who cares? My parameter estimates will be quite close (in most circumstances) to an OLS estimate. Personally I’ve had reviews come back where I’ve fit an ANOVA-like model with a Gibbs sampler and reviewers have told me that “this is too complicated, just use an ANOVA”. I think the pushback against statistical machismo can be just as draconian as those advocating for machine learning without any understanding of its limitations.

    Finally you write:
    “…despite the fact they’re so new their limitations aren’t really worked out (specifically as we’re starting to understand they can badly overfit spatially autocorrelated data)…”

    It seems that good practice would be that if you’re applying a method to a problem you should have a good understanding of the limitations of said method within the context you’re working on. But that said if I’m using machine learning for something else non-spatially autocorrelated, should I really care about that particular limitation? Maximum likelihood has limitations too, but as long as my problem doesn’t hit those boundaries, it’s hard to see what difference it makes.

    • Hi Ted – never too late to comment!

      I definitely would NOT say any use of advanced machine learning is statistical machismo. As I said I do it myself. In fact a subtheme of my anti-statistical-machismo rants is that anybody who says “you should always do X” is committing statistical machismo. The opposite of statistical machismo is recognizing the existence of complex trade-offs and working these trade-offs for each unique situation.

      I would strongly agree that “all model fitting methods have assumptions”. But even though we’re probably not disagreeing that much, I would not say “we need to meet those assumptions for our inferences to be valid”. Indeed, I would argue there has never been an application of model fitting in the history of the human race that 100% met every assumption of the model. This is why I get so upset when people say I have to do procedure X to fix a violation of assumption #1 while completely ignoring the fact that fixing #1 makes assumption #2 much more badly violated.

      The only last point I'll make is that I think you should use the simplest reasonably good method, not for statistical reasons but for reasons of communication: it is the author's job to make sure the greatest number of readers understand what you're doing, and that probably involves picking simpler models. As a reviewer I would never reject a paper because it used a Gibbs sampler to do an ANOVA, but I would point out that you are needlessly losing readers and recommend doing the simple ANOVA analysis in parallel. But from your comments I expect you and others would disagree with me on that.

      I think we’re mostly agreeing though. The world is complex with many competing criteria to optimize on and it is somewhat subjective and personal which trade-offs to make.

  11. Pingback: Autocorrelation: friend or foe? | Dynamic Ecology

  12. Pingback: Link Round-up – New Sparse City

  13. Hi everyone – thanks for this thorough article. Also a bit late to the game, just reading this now.
    I've always been a big user of machine learning techniques, have applied them to a variety of different datasets, and am rarely disappointed.

    My comment is on this "black box" issue that seems to come up frequently with respect to these algorithms. I don't believe these algorithms are "black box" at all. If you take the time to read through the original CART papers by Leo Breiman and Jerome Friedman, you'll find all you need to know about how regression trees are created. Likewise, the boosting algorithms in use are well described in the stochastic gradient boosting papers and in J. Elith et al.'s papers on boosted regression trees.
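    As a small illustration of that transparency, a fitted classification tree can be printed as plain if/else rules. This sketch uses made-up presence/absence data (the variable names and thresholds are my own assumptions, purely for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(1)

# Hypothetical presence/absence data: species present where it is warm and wet
n = 500
temperature = rng.uniform(0, 30, n)
rainfall = rng.uniform(0, 2000, n)
present = ((temperature > 15) & (rainfall > 800)).astype(int)

X = np.column_stack([temperature, rainfall])
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, present)

# The fitted tree is just a short set of readable decision rules
rules = export_text(tree, feature_names=["temperature", "rainfall"])
print(rules)
```

    The printed rules recover the two environmental thresholds directly, which is why a single tree is arguably one of the most interpretable models available; it is the large ensembles built from many such trees that start to feel opaque.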

    These ML techniques have been around for decades and are well understood in the business community, which has been using them for successful marketing campaigns since the 1990s. In fact, machine learning and data mining date back to the 1950s with the invention of neural nets. So as I see it, the algorithms are not really black boxes; they are well studied and understood. That's not to say it's easy to pick up a single paper and suddenly make sense of them – the answers are spread across many, many papers.

    How we use them in spatial models, on the other hand, is probably less studied than the algorithms themselves. You talk about how spatial autocorrelation in these models is ignored and how ML algorithms can lead to misinterpretation because of it. But something like this can be overcome by careful examination of your data and creation of pseudo-absences based on expert study. In fact, the gbm.step algorithm gives users the option to define independent test data for cross-validation between trees, so you could recommend that people pre-define the independently collected data and use that for cross-validation in tree building. This issue is not limited to ML techniques, though – ALL statistical algorithms used for prediction and model building have limitations. It's our job to know what they are and to be aware of them when we create models. It's sort of like hammering a nail in with a ball peen hammer versus a claw hammer.
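    One generic way to implement the "independent test data" idea is spatially blocked cross-validation: hold out whole spatial blocks rather than random points, so test sites are not sitting next to training sites. This is only a sketch with synthetic data and an arbitrary blocking scheme of my own, not the gbm.step workflow itself:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.default_rng(2)

# Hypothetical survey sites; presence driven by one environmental variable
n = 400
xcoord = rng.uniform(0, 10, n)
ycoord = rng.uniform(0, 10, n)
env = np.sin(xcoord) + rng.normal(scale=0.3, size=n)
present = (env > 0.5).astype(int)
X = np.column_stack([env, xcoord, ycoord])

# Assign each site to a spatial block; GroupKFold holds out whole blocks,
# so test folds are (more nearly) spatially independent of training folds.
blocks = (xcoord // 2.5).astype(int)  # four strips across the study area
cv = GroupKFold(n_splits=4)
model = GradientBoostingClassifier(random_state=0)
scores = cross_val_score(model, X, present, groups=blocks, cv=cv)
print(scores.mean())
```

    With random cross-validation, near-duplicate neighboring sites leak between folds and flatter the model; blocked folds give a more honest estimate of how it transfers to new areas.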

    All in all, ML techniques provide better predictions on independent data – and if that's the end goal of your study, then it shouldn't matter what happens in between, as long as you are careful and read the background material first. After all, isn't one of the major goals of science to make predictions? If you want to understand the mechanisms, you can limit the number of predictor variables you use and try to make sense of things with other statistical models. But a caution to that: if you're a believer in the Gaia hypothesis and that everything is connected, wouldn't you want to build a model that takes into account extreme complexity? (Guess it depends on the question you want to ask.) Our brains are pretty limited in what we can compute and understand, and if we require simple answers to complex problems, then maybe the problem lies within human nature? (That's up for some philosophical debate, I suppose, haha.)


  14. Very interesting thread. I got here because I have been thinking about applying data mining techniques to social network data. The ecologists and the social network people need to talk more, because the issues are completely parallel. I am older, and I have been amused that when I learned statistics the entire emphasis was on correctly modeling the structure of the errors, yet when I took a course on data mining at Stanford last year there was nary a mention of autocorrelation. In SNA there is a host of work that grossly overstates network effects such as social influence because it fails to control for autocorrelation. It concerns me that entities such as the NSA are quite possibly applying data mining techniques to these data and finding many connections that simply do not exist.
