I have been working on a series of posts on why ecologists need to take prediction more seriously as part of their mandate as scientists. In Part 1, I argued that an ANOVA/p-value mentality is killing us. In Part 2, I argued that the rigorous discipline of putting out quantitative predictions and then checking whether they are right is good for a field (with weather prediction as an example). In Part 3, I argued that a purely reductionist view of where to look for mechanistic models to produce predictions is flawed. Throughout, I argued (and many commenters agreed) that prediction is not just for applied questions. Prediction, done right, brings a level of rigorous honesty that helps basic science advance more quickly too.

This is my last post on the topic (you can breathe a sigh of relief), and it wasn’t originally planned. But many commenters, especially on the first post, wanted more about the statistics of measuring predictions. In Part 1, I made it pretty clear that I thought the p-value was overrated and that effect size and r^{2} needed more attention. But of course it is more complicated than that. Questions were raised about AIC and a bunch of other metrics (and no questions were raised about some of the metrics that I personally think are most important). So the following is a tutorial on measuring the quality of a prediction in a quantitative fashion.

## Goodness of fit – continuous variables

Let’s take the simplest model, y=f(x,θ)+ε. Let y (but not necessarily x) be a continuous (aka metric) variable like temperature, mass, or distance, or, to an approximation, abundance if the abundances are large. If we define the model f and the parameters θ, then given a new value of x (e.g. environmental conditions) we can predict the corresponding new value $\hat{y}$ (pronounced y-hat) and compare it to the observed value y. Of course, in a perfect prediction $\hat{y}$ would always equal y (i.e. error ε=0), but ecology is a long way from that Nirvana! So we want to measure the error/deviance/inaccuracy/failure of prediction. How do we do it?

One of the world’s all-time geniuses, Carl Friedrich Gauss, came up with the first insight. If we define $e_i = y_i - \hat{y}_i$ to be the prediction error for one data point x_{i}, then we can look at the sum-squared-error, or $SSE = \sum_i e_i^2$. Minimizing this SSE was Gauss’ criterion for picking the best-fit line, leading directly to modern Ordinary Least Squares (OLS) regression. However, there is a problem with SSE as a measure of goodness of fit: on its own it is essentially uninterpretable. Is SSE=40502 a good fit or a bad fit? The answer depends entirely on the number of data points and the innate variance in the data (after all, if two widely different y values are observed for the same x, no model can fit that). To solve this problem the idea of error partitioning was developed. If we define total-sum-of-squares (i.e. a proxy for total variance) as $SST = \sum_i (y_i - \bar{y})^2$, where $\bar{y}$ is the average value of y, then we have $R^2 = 1 - SSE/SST$.
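The error partitioning described above can be sketched in a few lines. This is only an illustration, assuming NumPy is available; the function names (`sse`, `sst`, `r_squared`) and the four data points are made up for the example and are not from the original post.

```python
import numpy as np

def sse(y, y_hat):
    """Sum of squared prediction errors, SSE = sum((y - y_hat)^2)."""
    e = np.asarray(y, dtype=float) - np.asarray(y_hat, dtype=float)
    return float(np.sum(e ** 2))

def sst(y):
    """Total sum of squares around the mean of y."""
    y = np.asarray(y, dtype=float)
    return float(np.sum((y - y.mean()) ** 2))

def r_squared(y, y_hat):
    """Error-partitioning R^2 = 1 - SSE/SST."""
    return 1.0 - sse(y, y_hat) / sst(y)

# Hypothetical observed and predicted values
y = np.array([2.0, 4.0, 6.0, 8.0])
y_hat = np.array([2.5, 3.5, 6.5, 7.5])

print("SSE:", sse(y, y_hat))        # each error is 0.5, so SSE = 4 * 0.25 = 1.0
print("SST:", sst(y))               # 9 + 1 + 1 + 9 = 20.0
print("R^2:", r_squared(y, y_hat))  # 1 - 1/20 = 0.95
```

The same SSE of 1.0 would be a poor fit if SST were 1.5, which is the sense in which SSE alone is hard to interpret.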

R^{2} is extremely useful. In particular, for any model that does at least as well as the null model it ranges between 0 and 1, making it easy to compare the success of models across different amounts of data, different data, and even models of different things. The last point bears expanding: R^{2} even lets us say things like models predicting metabolic rate from body size (R^{2}=0.9 or so) are much better than models of abundance vs temperature (R^{2}=0.2 or so). There are limits to this, and I certainly wouldn’t make a big deal of small differences, but there is no other metric that is even credible for this kind of cross-disciplinary comparison. R^{2} is also independent of units: I get the same R^{2} value whether my y variable is measured in grams or kg. If all I’ve accomplished is getting more ecologists to pay attention to and report R^{2}, I would be very happy!

But there are of course complications! The first is that in the simple linear model (y=ax+b+ε), R^{2}=r^{2}, where r is the Pearson correlation coefficient (which itself ranges from -1 to 1). That is kind of cool! But it means we now have two definitions and two ways of calculating R^{2}. And outside of the simple linear model (e.g. in nonlinear models), they do NOT give the same answer. So at a minimum you have to specify how you calculated it, which is often done by using R^{2} to report the error partitioning approach and r^{2} to report the correlation approach. If you want to look good, definitely use the r^{2} version. It will always be ≥0. By contrast, if your model is worse than the null model (a horizontal line, y=c), then R^{2} can be less than zero (and yes, I have produced models with this outcome!). Even worse, the r^{2} approach can sweep systematic bias under the rug. Imagine a plot of predicted vs actual values (x-axis is the predicted value $\hat{y}$, y-axis is the observed value y). In a good model the data should all be close to the 1-1 line (passing through the point 0,0 with a slope of 45 degrees). R^{2} directly measures this. But now imagine you are constantly overpredicting (see figure below). You correctly track that when x goes up a certain amount, y goes up a certain amount, but every prediction is greater than the observed value. This would cause a nice tight line to form in the plot, offset from the 1-1 line. This could have an r^{2} of 1 even if no point is on the 1-1 line! **For these reasons, a serious test of prediction should use R^{2} rather than r^{2}** (with the added benefit that R^{2} generalizes to n-dimensional x-variables).
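The constant over-prediction scenario is easy to reproduce numerically. The sketch below assumes NumPy and uses invented numbers (every prediction 3 units too high); it only illustrates how the two definitions diverge.

```python
import numpy as np

# Hypothetical observations, and predictions that track them perfectly
# but are always 3 units too high (pure systematic bias).
observed = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
predicted = observed + 3.0

# Correlation-based version: square of the Pearson correlation
r = np.corrcoef(predicted, observed)[0, 1]
r2_corr = r ** 2

# Error-partitioning version: 1 - SSE/SST
sse_val = np.sum((observed - predicted) ** 2)        # 5 * 3^2 = 45
sst_val = np.sum((observed - observed.mean()) ** 2)  # 4 + 1 + 0 + 1 + 4 = 10
R2_partition = 1.0 - sse_val / sst_val               # 1 - 4.5 = -3.5

print("r^2 (correlation):       ", r2_corr)
print("R^2 (error partitioning):", R2_partition)
```

Here r^{2} is 1 (to floating-point precision) while R^{2} is -3.5, i.e. the biased predictions are worse than simply predicting the mean everywhere.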

So R^{2} does a lot of nice things, including the fact that it relativizes the error to the innate variance in the data. But sometimes we don’t want to relativize our error. Let’s take an example from my (and others’) research where R^{2} doesn’t tell the whole story. We’re trying to interpolate temperatures between weather stations. In many regions weather stations are 100s of km apart (with small weather-influencing details like mountains in between). So we are evaluating different methods. If I find that a particular method has an R^{2} of 70%, I am probably moderately happy, since my method is explaining a majority of the variability. But if you are a user looking to use these layers, do you know what you want to know? Probably not. In fact the R^{2} statistic is pretty meaningless if you don’t have an innate sense of the inherent variability in the data. What you probably want to know is something like “the average error is 1 °C” or “the 95% confidence interval is ±2 °C”. Confidence intervals (and the bootstrapping methods that produce them) are well known to ecologists, so I won’t dwell on them further here (except to say that they do require a sensible mode of bootstrapping, which doesn’t always exist).

I want to look at the other concept, “average error”. Literally this would be Mean Squared Error, or MSE=SSE/N, where N is the number of observations. Except it still has the square in it (recall that if we didn’t square before summing, the errors would cancel each other out and sum to zero in a linear regression). This problem of squaring is fixed, not surprisingly, by introducing a square root: RMSE, or Root Mean Squared Error, is $RMSE = \sqrt{SSE/N}$. This is widely used by engineers and meteorologists. In fact, you already know this in a different inferential context as the standard error. But in a prediction context it is called RMSE. Its main virtue is that it is in the same units as y. So I can tell you that the RMSE of prediction at a point between weather stations is 1 °C. Now this is useful to you as an ecologist trying to use these layers. It is also an excellent measure of prediction accuracy. It has lost its relativism (an RMSE of 1 °C is excellent, indeed unachievable, for my application but really poor for the accuracy of repeat measures from a thousand-dollar temperature probe). And it is now units dependent: predicting body mass in kg and getting an RMSE of 1 kg will turn into an RMSE of 1000 g if you measure body mass in g. But RMSE is really concrete. It also has the nice feature that it is more like R^{2} than r^{2}, in that it penalizes systematic bias. Indeed there is a formula directly relating RMSE to bias: $RMSE^2 = \mathrm{Var}(y - \hat{y}) + \mathrm{Bias}^2$, where the bias is the mean prediction error. So RMSE is one measure that attempts to assess trade-offs between bias and variance (a model with low variance but very high bias may well be worse than a model with no bias and moderate variance). This came up recently deep in the comments on my post on detection probabilities. One wrinkle is that RMSE, because it depends on squared errors, is very sensitive to outliers and fat-tailed error distributions. If this is a problem in your data you can use Mean Absolute Error ($MAE = \frac{1}{N}\sum_i |y_i - \hat{y}_i|$).

Note that OLS linear regression effectively minimizes RMSE (it technically minimizes SSE, but since RMSE is a monotonic transformation of SSE, the optimization chooses the same answer), while least absolute error (LAE) regression, also known as median regression and a form of robust regression, chooses the line that minimizes MAE. Thus both methods select “best prediction” lines. There are dozens of other measures you can use, but in my opinion if you look at R^{2} and RMSE (or MAE) you have really got 90% of the value of prediction metrics.
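As a small worked example of RMSE, MAE, and the bias-variance relationship, here is a sketch assuming NumPy. The temperatures are invented, and the variance is the population variance (dividing by N), which is the convention under which the decomposition holds exactly.

```python
import numpy as np

def rmse(y, y_hat):
    """Root mean squared error, in the same units as y."""
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

def mae(y, y_hat):
    """Mean absolute error, less sensitive to outliers than RMSE."""
    return float(np.mean(np.abs(y - y_hat)))

# Hypothetical observed and predicted temperatures (deg C)
observed = np.array([10.0, 12.0, 14.0, 16.0])
predicted = np.array([11.0, 12.0, 16.0, 17.0])

errors = observed - predicted   # -1, 0, -2, -1
bias = errors.mean()            # -1.0: predictions run 1 deg C too warm on average
variance = errors.var()         # population variance (ddof=0) = 0.5

print("RMSE:", rmse(observed, predicted))    # sqrt(1.5)
print("MAE: ", mae(observed, predicted))     # 1.0
print("bias^2 + variance:", bias ** 2 + variance)  # 1.5, equal to RMSE^2
```

Two models with the same RMSE can therefore differ in how much of the error is systematic (bias) and how much is scatter (variance), which is worth reporting separately when it matters to users.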

## Goodness of fit – binary and discrete variables

Now, what if y is a categorical variable rather than continuous? There are literally whole books written about categorical variables, and I’m not going to go too far into the topic here. Indeed I am going to limit myself to a subset of categorical variables: binary variables (i.e. 0/1 or true/false or present/absent). These are extremely common dependent variables in ecology (any model that predicts survivorship or presence/absence is likely using a binary dependent variable). Your first instinct might be to keep using R^{2}. If you do, you will be told you are wrong to do so. And there is a problem. Since the observed y values can only be 0 or 1, and the predicted values are usually some intermediate value such as 0.732, the distances between observed and predicted are artificially large. Or, put in other terms, the assumption of normal errors, which underlies much of the justification for R^{2} (and even RMSE), is violated with binary variables. But in fact, as long as you accept (and know) that R^{2} used on binary y variables is never going to be as high as R^{2} on continuous variables, it actually works rather well. And it even has a name: the point-biserial correlation (as the name implies, this is usually calculated as r^{2} rather than R^{2}). This is my favorite metric of prediction on binary variables, but I am rather out of the mainstream.
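To make the point-biserial correlation concrete, the sketch below (assuming NumPy, with six invented presence/absence observations and predicted probabilities) computes it two ways: as an ordinary Pearson correlation between the 0/1 observations and the predicted probabilities, and from the textbook point-biserial formula based on the two group means. The two agree, which is all the example is meant to show.

```python
import numpy as np

# Hypothetical presence/absence observations and predicted probabilities
observed = np.array([1, 1, 1, 0, 0, 0])
p_hat = np.array([0.9, 0.8, 0.4, 0.6, 0.3, 0.1])

# Pearson correlation between the binary variable and the predictions
r_pearson = np.corrcoef(observed, p_hat)[0, 1]

# Point-biserial formula: difference in group means, scaled by the
# population standard deviation of the predictions and the class balance
m1 = p_hat[observed == 1].mean()   # mean prediction where the species is present
m0 = p_hat[observed == 0].mean()   # mean prediction where it is absent
frac1 = observed.mean()            # proportion of presences
r_pb = (m1 - m0) / p_hat.std() * np.sqrt(frac1 * (1 - frac1))

print("Pearson r:       ", r_pearson)
print("point-biserial r:", r_pb)
print("r^2:             ", r_pb ** 2)
```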

Not surprisingly, given the demonstrable proclivities of ecologists for statistical machismo, ecologists have embraced a much harder to calculate and harder to interpret statistic known as Area Under the Curve (AUC), which is based on an idea from signal detection theory known as the receiver operating characteristic (ROC) curve. It is not my goal to provide a full introduction to this metric here. But basically it assumes there is a threshold or cut-off that can be varied, and then it measures how much of a trade-off there is between the true positive rate and the false positive rate. Imagine a model predicting species presence or absence as a function of climate. This will produce predictions at each point in space that look like a probability of presence of, say, p=0.732. To get back to binary present/absent, we have to choose a threshold. An obvious one is p_{threshold}=0.5: if predicted p>0.5 then we predict present, otherwise we predict absent. Obviously if p_{threshold} were set at p=0 we would predict present everywhere, giving us many true positives and no false negatives but lots of false positives (and the opposite for p=1.0). In a perfect (but impossible) world any threshold with p>0 and p<1 would work equally well. This gives an AUC (area under the curve) of 1.0 (the curve goes from 0,0 to 1,1 and stays inside the 1×1 unit box, so the maximum area under the curve is 1). Alternatively, as a null model, if we just take the original proportion of presences and for each point randomly flip a coin with that probability of coming up present, then our false positive and false negative rates would depend only on the threshold and the area under the curve would be 0.5. **Even if you skipped most of the last few sentences, this is the important point: the null model value of AUC is 0.5.** An AUC of 0.5 is the same as an R^{2} of 0. And an AUC <0.5 is the same as an R^{2}<0! So the next time you see an AUC of 0.6, don’t be too impressed.

It is possible to rescale AUC to run from 0 to 1 so that it seems more like R^{2} (i.e. (AUC-0.5)/0.5), but the analogy is still misleading. There are dozens of other statistics commonly in use. Some even let you specify your preferences for errors of omission (false negatives) vs commission (false positives). But at that point, I’d rather just publish a 2×2 table of true and false positives vs true and false negatives.
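For readers who want to see the 0.5 baseline for themselves, here is a minimal sketch assuming NumPy. It uses the standard equivalence between AUC and the probability that a randomly chosen presence gets a higher predicted value than a randomly chosen absence (ties counting one half), rather than tracing out the ROC curve threshold by threshold. The function names and the six data points are invented, and the uninformative prediction used here is a constant equal to prevalence, a simpler stand-in for the coin-flip null described above.

```python
import numpy as np

def auc(y_true, p_hat):
    """AUC as P(score of a random presence > score of a random absence)."""
    y_true = np.asarray(y_true)
    p_hat = np.asarray(p_hat, dtype=float)
    pos = p_hat[y_true == 1]
    neg = p_hat[y_true == 0]
    diff = pos[:, None] - neg[None, :]   # all presence/absence pairs
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

def confusion_table(y_true, p_hat, threshold=0.5):
    """2x2 table of true/false positives and negatives at one threshold."""
    y_true = np.asarray(y_true)
    pred = (np.asarray(p_hat) > threshold).astype(int)
    return {
        "TP": int(np.sum((pred == 1) & (y_true == 1))),
        "FP": int(np.sum((pred == 1) & (y_true == 0))),
        "FN": int(np.sum((pred == 0) & (y_true == 1))),
        "TN": int(np.sum((pred == 0) & (y_true == 0))),
    }

observed = np.array([1, 1, 1, 0, 0, 0])
p_model = np.array([0.9, 0.8, 0.4, 0.6, 0.3, 0.1])
p_null = np.full(observed.size, observed.mean())  # uninformative: prevalence everywhere

auc_model = auc(observed, p_model)   # 8 of 9 pairs ranked correctly
auc_null = auc(observed, p_null)     # every pair is a tie, so 0.5
print("AUC (model):", auc_model)
print("AUC (null): ", auc_null)
print("rescaled (AUC - 0.5) / 0.5:", (auc_model - 0.5) / 0.5)
print("2x2 table at threshold 0.5:", confusion_table(observed, p_model))
```

The 2×2 table changes with the threshold while the AUC does not, which is one reason the table is easier to relate to a specific management decision.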

## A note on comparing models, AIC and other information criteria

R^{2} is in general a wondrously relativized number that can be used to compare prediction quality across datasets, types of models etc. But it is an inescapable fact (indeed a mathematically provable fact) that as the number of explanatory variables going into a regression goes up, the R^{2} on the fitted data cannot go down, and in practice it goes up. This means R^{2} is not always the best tool for comparing two models with different numbers of parameters (although the problems are often overstated; big differences in R^{2} are always meaningful in practice). To correct for this, ideas like adjusted R^{2} and Mallows’ Cp were invented. More recently, Akaike’s Information Criterion (or AIC) has emerged (where AIC=-2LL+2k, LL is the log-likelihood and k is the number of parameters). AIC can be used to compare two models with different numbers of parameters. It should be noted that AIC has a relationship to our starting point, SSE. Namely, for normal errors, AIC=n ln(SSE/n)+2k, up to an additive constant that is the same for every model fit to the same data. So AIC is in a certain fashion a measure of goodness of fit. But just like SSE it is not that useful by itself (is an AIC of 273 a good fit or bad?). And AIC is only good for comparing along one dimension: between models fit to the same data. It is a mistake to use AIC to compare one model on two datasets, for example, and it certainly cannot be used as a general comparison of two totally different models (unlike R^{2}). But if you really want to compare two models on the same dataset, AIC is the way to go. Or maybe AICc or BIC or QIC or … That is the main problem with AIC: it is one choice out of an infinite list of possible weightings of goodness of fit vs. number of parameters. It has some logical justification if you think information theory explains the world. Otherwise, it is a bit arbitrary. Personally, I prefer reporting the different R^{2} values and the different numbers of parameters and letting the reader choose the best model.

The use of model comparison as an inferential approach also suffers from certain logical pitfalls (finding the best model out of a list of bad models doesn’t necessarily advance science). Thus, although I use AIC in some contexts (namely comparing different linear regression models), it doesn’t add much to assessing goodness of prediction in the sense I’ve talked about.
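The link between SSE and AIC for normal errors can be sketched as follows, assuming NumPy. The data are simulated from a straight line, the two candidate models are polynomials of degree 1 and 4, and k is counted simply as the number of polynomial coefficients; all of these are choices made for the illustration, not part of the original post.

```python
import numpy as np

def aic_from_sse(sse, n, k):
    """AIC for normal errors, up to a constant shared by models on the same data."""
    return n * np.log(sse / n) + 2 * k

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
y = 2.0 + 3.0 * x + rng.normal(0.0, 0.5, x.size)  # the truth is a straight line

def fit_sse(degree):
    """Least-squares polynomial fit; returns the SSE on the fitted data."""
    coefs = np.polyfit(x, y, degree)
    resid = y - np.polyval(coefs, x)
    return float(np.sum(resid ** 2))

sse_1 = fit_sse(1)
sse_4 = fit_sse(4)
aic_1 = aic_from_sse(sse_1, x.size, k=2)
aic_4 = aic_from_sse(sse_4, x.size, k=5)

# The more flexible model never has a larger SSE on the data it was fit to;
# AIC adds 2 per parameter to offset that automatic improvement.
print(f"degree 1: SSE={sse_1:.3f}  AIC={aic_1:.2f}")
print(f"degree 4: SSE={sse_4:.3f}  AIC={aic_4:.2f}")
```

Only the difference between the two AIC values means anything, and only because both models were fit to the same 50 points.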

## The experimental design aspect – predicting on independent data

So far I’ve talked about a few small cheats in the world of prediction (like using r^{2} instead of R^{2}), but there is one whopper: failing to distinguish calibration from validation. Calibration is the process of picking the best possible parameters to fit one set of data. A simple example is using OLS (minimizing SSE) to pick the slope and intercept of a linear regression. Validation is taking the parameters picked using one data set and testing them on a separate, independent data set. The reason this is important is the issue of overfitting. Overfitting is when the model fits not just the signal but also the noise/error in the data. This is easy to do when you have a really flexible model with lots of parameters. As Fermi quoted von Neumann, “with four parameters I can fit an elephant, and with five I can make him wiggle his trunk” (which, believe it or not, led to a scholarly paper showing he was right!). If you have enough parameters to draw an elephant, you will fit every bump and wiggle of the noise, and if you do, you’ll get a really awesome R^{2}. But when you move to your next dataset (this is the validation step), your R^{2} will be terrible (by definition, noise does not transfer repeatably).
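A toy version of the calibration vs. validation distinction, assuming NumPy: simulated data from a straight line with noise, a 2-parameter model and a 7-parameter polynomial both calibrated on 12 points, then scored on 12 fresh points drawn the same way. The sample sizes, noise level, and polynomial degrees are arbitrary choices for the sketch.

```python
import numpy as np

def r2(y, y_hat):
    """Error-partitioning R^2 = 1 - SSE/SST."""
    return 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(7)

def simulate(n):
    """Signal is a straight line; everything else is noise."""
    x = rng.uniform(0.0, 1.0, n)
    y = 2.0 * x + rng.normal(0.0, 0.5, n)
    return x, y

x_cal, y_cal = simulate(12)   # calibration data
x_val, y_val = simulate(12)   # independent validation data

results = {}
for degree in (1, 6):
    coefs = np.polyfit(x_cal, y_cal, degree)          # calibrate
    r2_cal = r2(y_cal, np.polyval(coefs, x_cal))      # fit on the same data
    r2_val = r2(y_val, np.polyval(coefs, x_val))      # fit on new data
    results[degree] = (r2_cal, r2_val)
    print(f"degree {degree}: calibration R^2={r2_cal:.2f}, validation R^2={r2_val:.2f}")
```

The flexible model is guaranteed to look at least as good as the straight line on the calibration data. What happens on the validation data is the part worth looking at when you run it, and it can easily be negative.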

So to avoid this problem, the proper method is to choose parameters that give the best fit and measure the goodness of fit on independent datasets. The exact method of doing this varies. In regression trees, one builds a very complex tree that one knows is overfit. Then one applies the tree to new data and walks it back (one by one removes the lowest branches) until the goodness of fit (often R^{2}) is maximized on the new data. In classical regression, a separate validation is not necessary IF you make certain assumptions about the data and errors (normal, independent, constant variance) because we can actually model the transferability. The methods for validation are really as diverse as the methods for modelling.

This is why I like prediction so much: it clearly separates the calibration and validation steps. Calibrate on today. Predict, then validate on tomorrow. Of course, as commenters have pointed out, one can do hindcasting (calibrate on today, predict in the past or vice versa) or lateral casting (predict in North America, validate in Europe). These all work equally well. **But the real key is the validation data must be independent of the calibration data!** Any pretense of validation that fails to do this is at least as bad as (indeed is pretty much doing the same thing as) pseudoreplication.

One common approach to getting independent data that comes from the machine-learning world is to use a hold-out or cross-validation technique. Hold-out is when one randomly selects, say, 70% of the data, calibrates on that, and then validates on the remaining 30%. Cross-validation is a slightly fancier version where one calibrates on 90%, validates on the remaining 10%, then repeats this 10 times, with each separate chunk of 10% held out once, and averages the results. These techniques work great when all the data points are truly independent. They work badly when there is spatial or temporal autocorrelation linking the points (the 30% that were held out are not truly separate from the 70% used to calibrate). Nearly every paper I have seen on niche models fails to realize this (or at least plows ahead anyway). In the study I just linked to, this simple effect artificially inflated R^{2} from close to zero to a very respectable 60 or 70%, which is pretty misleading.
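Here is a bare-bones sketch of 10-fold cross-validation, assuming NumPy and a simple straight-line model. The helper names (`kfold_indices`, `cross_validated_r2`) are hypothetical. Two liberties are taken and should be read as assumptions: the held-out predictions are pooled into a single R^{2} rather than averaging ten per-fold values, and the `shuffle=False` option, which holds out contiguous blocks instead of random points, is only one crude way to reduce leakage between neighbouring points in ordered (e.g. time series) data, not a cure for autocorrelation.

```python
import numpy as np

def kfold_indices(n, k, shuffle=True, rng=None):
    """Split indices 0..n-1 into k folds; contiguous blocks if shuffle=False."""
    idx = np.arange(n)
    if shuffle:
        rng = np.random.default_rng() if rng is None else rng
        idx = rng.permutation(n)
    return np.array_split(idx, k)

def cross_validated_r2(x, y, k=10, shuffle=True, rng=None):
    """Calibrate a line on k-1 folds, predict the held-out fold, pool, score."""
    y_hat = np.empty(len(y), dtype=float)
    for test_idx in kfold_indices(len(y), k, shuffle, rng):
        train = np.ones(len(y), dtype=bool)
        train[test_idx] = False                       # hold this fold out
        slope, intercept = np.polyfit(x[train], y[train], 1)
        y_hat[test_idx] = slope * x[test_idx] + intercept
    sse = np.sum((y - y_hat) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    return 1.0 - sse / sst

rng = np.random.default_rng(42)
x = np.linspace(0.0, 10.0, 100)
y = 1.0 + 2.0 * x + rng.normal(0.0, 1.0, x.size)   # independent errors by construction

r2_random = cross_validated_r2(x, y, k=10, shuffle=True, rng=rng)
r2_blocked = cross_validated_r2(x, y, k=10, shuffle=False)
print("random folds: ", round(r2_random, 3))
print("blocked folds:", round(r2_blocked, 3))
```

With independent errors, as simulated here, the two fold schemes should tell a similar story. The interesting comparison is on real data with spatial or temporal structure, where a large gap between them is a warning sign.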

## Summary

So a long post! To summarize my recommendations:

- For continuous variables, report R^{2} (not r^{2}) and RMSE (or MAE)

- For binary variables, leave the prediction as probabilities and report the point-biserial correlation. If you HAVE to turn probabilities of presence (or whatever) into 0/1 binary variables, choose an appropriate threshold (non-trivial), then just report the 2×2 table. I am not a fan of AUC: it is hard to interpret and inherently misleading with a base of 0.5.
- Probably even more important than the statistical metrics is the question of experimental design and inferential context. You must think through how to separate calibration from validation. If your validation is not statistically independent of the calibration data, you are committing pseudoreplication. And worse, without validation, you’re not predicting, you’re just curve fitting!

An excellent end to the series. I’d just add that, like r^{2} and other correlation-based measures, AUC also ignores systematic bias in the predictions.

It’s a constant source of frustration that AUC and correlations seem to be the standard for assessing species distribution models.

Important points. Thanks. If you were czar of the SDM world, do you have a measure you would prefer?

I’d plump for deviance of the predictions from an independent presence-absence test set. One of the myriad versions of ‘deviance R2’ could allow comparison between datasets too, though exactly which one to use is another matter.

Presence-only SDM appears to be one of, if not the, most common form of predictive ecology. Yet what most applications (performed using MaxEnt or any other approach) report as a predicted ‘probability of presence’ is really only a *relative* probability of presence, ie. within some constant of the actual probability.

It’s a tricky problem since in most cases we don’t have any decent data with which to test the models. Hence the default being that people report AUCs calculated on the presence/background data used to fit the model – a metric which tells us nothing about how well we are predicting the species’ distribution.

I pretty much agree with you about the state of SDM modelling. There is an old old saying in modelling – GIGO – garbage in garbage out. And as long as we claim to be predicting presence/absence using only presence data with no absences, we are going to be really limited. This is a case where I think there are legitimate reasons to plow ahead in applied contexts despite these limitations, but for basic research about climate-niche relations etc using SDM, we really need to think about whether we don’t need to put more effort into getting better quality data (i.e. presence AND absence data if not abundance data).

And to expand on Nick’s comments about deviance R2 (also called pseudo-R2), these are extensions of the idea of sum-of-squares analysis to generalized linear models (logistic regression, Poisson regression, etc). As already mentioned in my OP under the AIC section, there are direct ties between sum-of-squares analysis (SSE, SST) and likelihood in the normal errors case. If one turns the R2 calculation into a likelihood calculation and then generalizes the likelihood form to all generalized linear models, you get deviance (a difference in log-likelihoods) as a generalization of sum-of-squares and deviance R2 as a generalization of R2. As Nick intimated, there is a forest of deviance R2 measures – many are reviewed here.

Thanks for all the great additions Nick!

For readers interested in more on model selection, here is an excellent primer:

In particular, this primer points out two important but little known facts about AIC which ought to be more widely known (everything Brian says about AIC is correct, but he deliberately skipped over some key details):

- The way that AIC weights model fit vs. complexity is not arbitrary; AIC is an estimate of the relative Kullback-Leibler information distance. That is (and I’m speaking very loosely here), when you compare the AIC values of different models, you’re asking which model is closest to the unknown truth, where distance from the truth is measured as Kullback-Leibler information distance. Different information criteria weight model fit vs. complexity differently not because the people who developed them preferred different weightings, but because they’re estimating different things. I suggest that, if you’re going to use AIC, BIC, or whatever, they really shouldn’t be black boxes to you–you should know something about what they’re estimating, and how, and why one might want to estimate those things. The primer linked to above is a good starting point. And Shane Richards has a good 2005 Ecology paper explaining AIC and using simulations to test its performance under different conditions.

- AIC is derived under the assumption that the models you’re comparing are nested. This will probably come as a surprise to many of you, since AIC is used routinely to compare non-nested models! Indeed, I’ve done this myself. I can justify it only by saying that others have done it before, and it seems to “work” in that it returns answers that seem sensible and that line up with other lines of evidence. But there’s no theoretical justification for it as far as I know.

The linked primer is also very good on the deep connections between different methods of model selection. Indeed, while they’re often seen as alternatives, methods of model selection and hypothesis testing in many ways all come down to the same thing: dealing with underfitting vs. overfitting. Our models can either be simpler than the unknown truth (e.g., by omitting relevant predictor variables), or more complicated than the unknown truth (e.g., by including irrelevant predictor variables). Too-simple models tend to fit poorly, aka underfit, aka inaccurate, aka low likelihood of generating the observed data. Too-complex models tend to fit the observed data *too* well, aka overfitting, aka “fitting the noise”. They’re tailored to fit the *particular* data that you observed, and so are literally “too good to be true”! If you confront them with new data drawn from the same population, they’ll either fit it badly (if you hold the parameters fixed at their previously-estimated values), or else they’ll fit it (too) well, but only via drastically changing the parameter estimates so as to fit the noise in the new data (which is why those estimates are very imprecise and have massive confidence intervals). Whether you’re doing likelihood ratio tests of some simple null model vs. a more complex alternative, or cross-validation, or AIC, or whatever, it all comes down to more or less the same thing. You’re trying to avoid either under- or over-fitting.

In general, while I think model selection has its place (like most everyone, I do likelihood ratio tests all the time!), I very much agree with Brian that you ought to be looking at absolute measures of model fit, not just focusing on selecting which model is best relative to the others. There are times when it’s useful to select “the best of a bad bunch”–but you always want to know how bad your bunch is in absolute terms!

One more comment, on a related issue Brian didn’t mention. Stepwise model selection procedures (like backwards elimination, forward selection, and other stepwise procedures in multiple regression) are a bad idea. Effectively what you’re doing is trying out a bunch of alternative models and then just keeping whichever one happens to fit best. Which is a recipe for overfitting. In particular, every stats package I’m familiar with incorrectly returns the p-values of the selected model vs. the null, *as if the selected model were the only model you tried*. That’s flat-out wrong. It’s circular reasoning–effectively what you’re doing is letting the data tell you what model to test against the null, and then testing that model against the null using the same data. If you don’t believe me, try this exercise: generate a bunch of vectors of random numbers and then regress one of those vectors on the others using stepwise multiple regression (or backwards elimination, or forward selection). If you try this many times, you’ll discover that your selected model will come out “significant” much more than 5% of the time. Steve Walker has a good post on this, along with what may be a clever new way to correct for such “model selection bias”:

http://stevencarlislewalker.wordpress.com/2012/11/08/new-magical-correction-for-model-selection-bias/
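The random-numbers exercise suggested above can be tried in a stripped-down form. The sketch below, assuming NumPy, performs only the first step of forward selection: out of 10 pure-noise predictors it keeps the one most correlated with a pure-noise response, then tests that single predictor as if it had been the only one considered. The sample size, the number of predictors, and the hard-coded critical value (2.0106, the two-sided 5% point of Student’s t with 48 degrees of freedom) are all choices made for this illustration.

```python
import numpy as np

N = 50            # observations per simulated dataset
M = 10            # candidate predictors, all pure noise
T_CRIT = 2.0106   # two-sided 5% critical value of Student's t with N - 2 = 48 df

def selected_predictor_is_significant(rng):
    """Pick the best of M noise predictors, then naively test it at the 5% level."""
    y = rng.normal(size=N)
    X = rng.normal(size=(N, M))
    r = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(M)])
    r_best = r[np.argmax(np.abs(r))]                  # the "selected" model
    t = r_best * np.sqrt((N - 2) / (1.0 - r_best ** 2))
    return abs(t) > T_CRIT

rng = np.random.default_rng(1)
trials = 2000
rate = np.mean([selected_predictor_is_significant(rng) for _ in range(trials)])
print(f"'significant' in {100 * rate:.1f}% of trials (nominal rate: 5%)")
```

If the ten tests were independent, the chance that at least one clears the 5% bar would be 1 − 0.95^10 ≈ 0.40, so a false-positive rate far above 5% is what one should expect to see here. Full stepwise procedures search over more models than this first step does.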

Nice expansion on information criteria Jeremy.

It is interesting: it seems like model selection paradigms should have so much in common with prediction-focused paradigms (strong emphasis on modelling, and goodness of fit is a central part of AIC). But in fact model selection has many models and one dataset. Prediction-focused paradigms have one model and many datasets (at a minimum via hold-out or cross-validation, but ideally by predicting tomorrow as well as today, etc.). Testing one model on multiple datasets is a much harsher test and a stronger test of generality, and thus in the end, in my opinion, it advances scientific understanding faster.*

*Full disclosure – as I’ve already said I do use AIC in some cases, but I try to avoid it as my central organizing inferential paradigm.

Brian Ripley’s opinion on model selection is only one side of the coin:

In particular, Burnham & Anderson refute the notion that AIC can only compare nested models or that “truth” has to be in the model set. Now, I am not posting this as an advocate of B&A model selection. Just thought it was a potentially important counter-point to that primer.

Personally, I try to avoid formal model selection at all costs. I don’t like how it facilitates vague pseudo-inferences on relationships between variables. I think I’ve read somewhere that using AIC to identify important parameters is akin to setting your alpha to ~0.10 or something like that.

With observational data, we are extremely limited in the inferences that can be made. I don’t think this message is relayed with the necessary emphasis to most ecology students (I know it wasn’t for me). So I agree with Brian (McGill of course) that validation is the ultimate achievement. But in ecology, validation with independent data is typically very difficult. Without validation, the only thing we can do with a model of observational data is suggest relationships and generate hypotheses to be tested. In that regard, the arm waving about model selection seems like overkill.

Re: use of AIC with non-nested models, interesting. I freely admit that I’m not an expert here, and had taken Ripley’s word for it.

Just to make sure I’m clear, when you say B&A “refute” Ripley, what exactly do you mean? Do they say Ripley is just flat-out wrong about what’s assumed by published, formal derivations of AIC? Do they make the case that Ripley is somehow subtly misinterpreting some very difficult-to-understand technical assumption made in the derivation of AIC, maybe to do with precisely what “nested” models are? (I’m just guessing there, based on my vague impression that defining “nestedness” is actually kind of tricky) Is the disagreement here actually about how to interpret the results of an AIC analysis rather than about the assumptions required to derive AIC? Or do Burnham & Anderson just present some argument (a heuristic argument, an argument based on testing AIC on simulated data, whatever) that in practice AIC works even when models aren’t nested, even though the derivation of AIC assumes nestedness?

Now that I’m aware of a disagreement on this (for which, thanks very much for the heads up!), I’d prefer to adjudicate it by going to primary sources rather than secondary ones like either Ripley, or Burnham and Anderson (unless Burnham and Anderson actually present a formal derivation of AIC for the non-nested case?) I mean, Burnham and Anderson certainly know their stuff, and their book is a popular reference–but it’s not like Ripley’s ignorant, he’s a full professor of applied statistics at Oxford! So rather than just flipping a coin and deciding to trust one or the other, it seems like it’s better to appeal to a higher authority, if there’s one available (and that people like me can understand!) From a quick scan of Wikipedia, I’m guessing that Burnham and Anderson are relying on Takeuchi 1976? Which is in Japanese, sadly…Anyone else know of a reliable primary source one can go to to adjudicate this? Or short of that, some third party who reviews the issue and explain why people as smart and well-informed as Ripley and Burnham & Anderson could disagree on what you’d think would be a completely clear and straightforward technical question (“Does the derivation of AIC assume nested models or not?”)

Re: the “true” model having to be in the model set, I’ve always found that remark of Ripley’s a bit odd, which is why I didn’t note it specifically in my earlier comment. I guess I just wrote off that remark of Ripley’s as a poor way to phrase the fact that deltaAIC values are a measure of relative K-L information distance from the models to the unknown truth.

I hope my earlier comment didn’t give the impression that I think Ripley’s primer is gospel; I just thought (and still do think) that it’s a useful overview of a lot of stuff.

Well if you’d follow the link I provided, you’d find some of your answers! Seriously though, B&A specifically reference Ripley in their little write up. And the write up is by no means rigorous for every argument they refute regarding myths about AIC.

Regarding your other comments: are you honestly asking how it is possible that well-educated experts could disagree on fundamentals? 🙂

Thank you Dan. And sorry for not immediately following your link, senior moment. 😉

Re: wondering how well-informed experts could disagree on fundamentals, yes, I can see where it might be surprising for someone who’s written what I’ve written to ask that question. 🙂 But in my own defense, it’s not so much that I’m surprised that they disagree (though I am indeed a little surprised, this being the highly-precise realm of mathematical derivation, not the more loosey-goosey verbal realm in which much of ecology resides). It’s more that I’m curious about precisely why they disagree. As I tried to indicate in my previous comment, I can imagine various reasons why they might disagree, each with quite different implications for the everyday statistical practice of ecologists.

I’ll go follow your link now…

Hey Brian, I liked this series, lots of food for thought. Two things I wanted to mention in addition to your distilled summary:

1) Always report sample size (n) and degrees of freedom (n-p) or something along those lines. For example, the expected value of R^2 when the predictors are independent random variables (no relationship with the response) is p/(n-1). This means that with n=100 and 32 variables (p=32), you’ll get R^2=0.323 on average. An alternative measure is the adjusted R^2, which has an expected value of 0 when the variables are truly independent.

2) In the SDM world one wants to separate classes (0/1, or some form of relative abundance) where AUC is a totally reasonable metric to quantify this separation. Separation is important to validate spatial predictions. However, if for example we designed a study to quantify the effect of land conversion on species abundance, we are not necessarily interested in the best separation. If the study was nicely randomized (we did not control for other factors by design) and there is considerable variation in the data, we would expect a pretty bad AUC that has nothing to do with the goal of the study: estimate the effect of land conversion on abundance.
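Point 1 above can be checked with a small simulation. The following is a minimal sketch, assuming standard normal noise for both the response and the predictors; the function names and the number of replicates are illustrative choices, not anything from the original comment. It regresses pure noise on pure noise and compares the average R^2 with p/(n-1), alongside the usual adjusted R^2.

```python
import random

def r_squared_null(n, p, rng):
    """R^2 from regressing pure-noise y on p pure-noise predictors (with intercept)."""
    def centered_noise():
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        m = sum(v) / n
        return [a - m for a in v]

    y = centered_noise()
    basis = []  # orthonormal basis for the span of the centered predictors
    for _ in range(p):
        x = centered_noise()
        for q in basis:  # modified Gram-Schmidt: strip components along earlier columns
            dot = sum(a * b for a, b in zip(x, q))
            x = [a - dot * b for a, b in zip(x, q)]
        norm = sum(a * a for a in x) ** 0.5
        basis.append([a / norm for a in x])

    ss_total = sum(a * a for a in y)
    # Explained sum of squares = squared length of y projected onto the predictor space.
    ss_explained = sum(sum(a * b for a, b in zip(y, q)) ** 2 for q in basis)
    return ss_explained / ss_total

def adjusted_r_squared(r2, n, p):
    """Standard adjusted R^2 for p predictors plus an intercept."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

rng = random.Random(1)
n, p, reps = 100, 32, 40
r2_values = [r_squared_null(n, p, rng) for _ in range(reps)]
mean_r2 = sum(r2_values) / reps
mean_adj = sum(adjusted_r_squared(r, n, p) for r in r2_values) / reps

print(f"theory p/(n-1)        = {p / (n - 1):.3f}")
print(f"simulated mean R^2    = {mean_r2:.3f}")
print(f"simulated mean adj R^2 = {mean_adj:.3f}")
```

With only 40 replicates the simulated means are noisy, but they should sit close to 0.323 and 0 respectively.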

Hi Peter – always nice to have a real statistician drop in! Good points. Definitely r2 does depend on sample size (in the extreme, with n=2, r2 will be 1.0 even in random data), so this is good to keep in mind when interpreting r2 (and you provide the formula). Also a good distinction between prediction and effect sizes of variables in AUC.

Maybe I’m just simplistic. Or maybe I should be blaming the authors who misinterpret and the readers who fail to know better, but I just can’t get over the fact that the base (null model) for AUC is 0.5. It seems too confusing to be allowed in a fundamental statistic. Also, its framing and interpretation are rather complex and hard (at least for me) to get intuition around, and AUC’s framing is dependent on thresholding (which I don’t like either). But in the end, this is a matter of opinion, and opinions/styles vary. Good to have somebody stick up for AUC.

I would certainly never reject a paper because it uses AUC (but I would and have rejected papers because they get really excited about AUC=0.55 and act like it is as good as an R2 of 0.55).

If you don’t like the thresholding interpretation of AUC, there are at least two alternatives:

1) It’s a simple transformation of several other rank statistics, including the Wilcoxon rank-sum statistic.

2) I never see this reported in the ecological literature, but it’s pretty well-known in psychology I think: AUC is the probability that, given a randomly-selected presence site and a randomly-selected absence site, the model will assign a higher suitability score to the correct site.
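Both interpretations above can be made concrete in a few lines. This is a minimal sketch with made-up suitability scores (the `presence` and `absence` lists are hypothetical): one function computes AUC directly as the fraction of presence/absence pairs ranked correctly (ties counted as half), the other obtains the same number from the Wilcoxon rank-sum statistic.

```python
def auc_pairwise(presence_scores, absence_scores):
    """Probability that a random presence site outscores a random absence site."""
    wins = 0.0
    for p in presence_scores:
        for a in absence_scores:
            if p > a:
                wins += 1.0
            elif p == a:
                wins += 0.5  # ties count as half a win
    return wins / (len(presence_scores) * len(absence_scores))

def auc_rank_sum(presence_scores, absence_scores):
    """Same quantity via the Wilcoxon rank-sum of the presence scores."""
    pooled = sorted(presence_scores + absence_scores)
    midrank = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        midrank[pooled[i]] = (i + 1 + j) / 2.0  # average of ranks i+1 .. j
        i = j
    n1, n0 = len(presence_scores), len(absence_scores)
    w = sum(midrank[s] for s in presence_scores)  # rank-sum statistic
    u = w - n1 * (n1 + 1) / 2.0                   # Mann-Whitney U
    return u / (n1 * n0)

# Hypothetical model scores at sites where the species was present / absent.
presence = [0.9, 0.8, 0.6, 0.55, 0.4]
absence = [0.7, 0.5, 0.4, 0.3, 0.2, 0.1]

print(auc_pairwise(presence, absence))
print(auc_rank_sum(presence, absence))
```

For these scores, 25.5 of the 30 pairs are ranked correctly, so both functions give 0.85. No threshold appears anywhere in either calculation.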

One thing I would like to see would be something akin to confidence intervals for R^2 (I have no idea how hard it would be to calculate analytically, but you could get this for free from bootstrapping or MCMC). R^2 can be viewed as a random variable dependent on the data, and can vary incredibly wildly for the same data set. I suspect high R^2 values inflated from adding extra noise predictors would look much less impressive if reported together with such an interval.
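A bootstrap interval of the kind suggested above is cheap to compute. The sketch below assumes the statistic of interest is the R^2 of a simple linear regression, uses a plain percentile bootstrap, and runs on simulated data; the function names, seed, sample size, and noise level are all illustrative assumptions.

```python
import random

def r_squared(xs, ys):
    """R^2 of a simple linear regression of y on x (squared Pearson correlation)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0.0 or syy == 0.0:
        return 0.0  # degenerate resample with no variation
    return sxy * sxy / (sxx * syy)

def bootstrap_r2_interval(xs, ys, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for R^2, resampling (x, y) pairs with replacement."""
    rng = random.Random(seed)
    n = len(xs)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(r_squared([xs[i] for i in idx], [ys[i] for i in idx]))
    stats.sort()
    lo = stats[int((1 - level) / 2 * n_boot)]
    hi = stats[int((1 + level) / 2 * n_boot) - 1]
    return lo, hi

# Simulated example: a real but noisy linear relationship with n = 30.
data_rng = random.Random(7)
xs = [data_rng.uniform(0, 10) for _ in range(30)]
ys = [0.5 * x + data_rng.gauss(0, 2.0) for x in xs]

r2 = r_squared(xs, ys)
lo, hi = bootstrap_r2_interval(xs, ys)
print(f"R^2 = {r2:.2f}, 95% bootstrap interval = ({lo:.2f}, {hi:.2f})")
```

The percentile method is the simplest choice here; it is known to behave poorly when R^2 is near 0 or 1, so other bootstrap variants may be preferable in practice.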

Also: prediction intervals are a wonderful thing; one of the downsides with R^2 is that it’s derived for the best fit, not taking into account estimation error in the parameters. Prediction intervals combine parameter uncertainty and residual variation to give a better estimate of how much variability you would expect, given the fitted model, and it becomes easier to spot when a model is poorly predicting. The downside is they have to be calculated for specific combinations of input parameters.
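For the simplest case, the textbook prediction interval shows both ingredients explicitly. This is the standard result for a simple linear regression with normal errors, given here only as an illustration; the symbols x_0 (the new predictor value), s (residual standard error), and t (Student-t quantile) are notation introduced for this example.

```latex
\hat{y}_0 \;\pm\; t_{1-\alpha/2,\;n-2}\; s\,
\sqrt{\,1 + \frac{1}{n} + \frac{(x_0-\bar{x})^2}{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,},
\qquad
s^2 = \frac{1}{n-2}\sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^2
```

The leading 1 under the square root is the residual variation; the other two terms are the uncertainty in the estimated intercept and slope. Because the last term depends on x_0, the interval has to be recomputed for each input value, which is the downside mentioned above.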

Also, as a follow-up to Jeremy’s comment about stepwise regression etc. and why to avoid: why does no one in ecology seem to use LASSO based methods? As far as I can tell, they solve a lot of the problems of stepwise regression, and combine model fitting and model selection.

I’ve just been thinking about whether I ought to learn something about the lasso myself. Andrew Gelman just did a post on how he initially didn’t see the point of it, but now does. Any suggestions on a good accessible starting point? Something in the spirit of the model selection primer I linked to? I’ve already had a glance at Wikipedia…

I wish I had a good review paper on it; I mostly learned it from Cosma Shalizi’s course notes, and a diversity of other papers discussing it. I read Larry Wasserman’s All of Nonparametric Statistics, which discussed it, and Shalizi’s course notes point towards Hastie et al.’s Elements of Statistical Learning which I’ve only skimmed, but is pretty interesting (and might be free as a pdf; at least it is from my work computer).

Great resources Eric, thanks! Cosma’s a great explainer. And yes, Hastie et al. seems to work for me too, though it comes up with some odd typos.

And if you know something about the lasso and think ecologists ought to use it, you should totally write a paper on it for Methods in Ecology and Evolution. That’s precisely what MEE is for, and it’s widely read.

LASSO, ridge regression, and their hybrid using the elastic net penalty are widely referred to as regularization methods, or sometimes as penalized likelihood analysis. Penalized likelihood methods show up in ecology from time to time as a way of reducing bias. AIC/BIC can also be seen as penalized likelihood (-2*logLik + penalty).
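Written out, the family resemblance is easy to see. These are the standard textbook forms for a linear model; the symbols (lambda for the penalty weight, alpha for the elastic net mixing parameter, K for the number of free parameters) are notation chosen for this illustration.

```latex
\begin{aligned}
\text{ridge:}       \quad & \hat\beta = \arg\min_\beta \; \|y - X\beta\|_2^2 + \lambda \sum_j \beta_j^2 \\
\text{lasso:}       \quad & \hat\beta = \arg\min_\beta \; \|y - X\beta\|_2^2 + \lambda \sum_j |\beta_j| \\
\text{elastic net:} \quad & \hat\beta = \arg\min_\beta \; \|y - X\beta\|_2^2
      + \lambda \Big[ \alpha \sum_j |\beta_j| + (1-\alpha) \sum_j \beta_j^2 \Big] \\
\text{AIC:}         \quad & -2\log L(\hat\theta) + 2K \\
\text{BIC:}         \quad & -2\log L(\hat\theta) + K \log n
\end{aligned}
```

In each case a lack-of-fit term is traded off against a complexity penalty. The difference is that ridge and lasso penalize the size of the coefficients continuously, while AIC and BIC penalize the count of parameters.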

More importantly: MAXENT (which has been trashed a lot lately) is a prime example of regularization for presence-only data, and it is used a LOT in the ecology and SDM literature.

I’ve thought about it, definitely. I’ve been keeping an eye out for a good data set that could be used to demonstrate the value of LASSO, as well as other penalization methods. OSCAR in particular looks interesting, as it can find clusters of colinear predictors, and essentially treat them as a group, rather than arbitrarily selecting one of them to give a non-zero coefficient to and set the rest to zero (as LASSO can).

Presumably, one could also deal with highly-colinear predictors separately, before feeding data into the lasso. For instance, by doing a PCA on the colinear predictors and using the scores on the first PC axis as a single predictor variable that then gets fed into the lasso. This procedure is somewhat ad hoc, but seems attractive to me, especially in cases where the colinear predictors are all interpreted as different indices of the same underlying, unmeasurable latent variable.

Or you could take the view that, if your predictors are really strongly colinear, you’re not losing much information if you just arbitrarily chuck all but one of them (or end up with the coefficients for all but one of them equal to zero). Then you just keep in mind that the one variable you kept is basically a “stand in” for any/all of them.

Bottom line, if two variables are highly colinear, then there’s no statistical force on earth that can pull them asunder, and you have to keep that in mind when interpreting the output of any statistical procedure with highly colinear predictors. If strong colinearity is really a problem for you (as it will be if your question requires you to precisely and accurately estimate the independent effects of each of your colinear predictors), ultimately the only thing to do is to go out and run a manipulative experiment that breaks the colinearity!

Yup!

As you say, thinking through which of a set of collinear variables is most likely to be biologically causal and using PCA are two other approaches to collinearity. Indeed, I would argue they are probably preferred techniques. But if you can’t figure out/don’t know the biology and don’t like the indirectness and difficulty of interpretation of a slope on a PCA component, then LASSO and related techniques, which basically don’t claim to pull collinear variables asunder (unlike say stepwise which presumes to pick out the most important one), will at least focus attention on avoiding the damage from collinearity (namely numerically unstable solutions with ridiculously large betas that change drastically depending on which subset of the data you use).

I’ll preface this with: I agree with you entirely on the fact that there’s no way to pull apart two strongly colinear predictors in terms of estimating their causal effects; the question is, if you’re trying to predict outcomes from data, what’s the best way to estimate a model for the predictors we have?

The thing I find interesting in the OSCAR approach is that it can do three things at once: variable selection (which predictors to include), parameter estimation, and clustering of predictors. It does this with two adjustable (and estimatable) parameters: one that penalizes large absolute values of coefficients and one that penalizes unequal coefficients. I would agree both methods you suggested would work as well for this sort of situation; I’d prefer the penalization approach more, though, as it’s continuous in its penalties (no large changes in model fit from dropping or adding a predictor) and does a lot of the deciding for you; you don’t have to decide which predictors or PCs to drop; if there’s insufficient data to distinguish their relationship with the data, the penalty terms will automatically pool them towards each other.

I’m about in the same position as Andrew – I had pretty much ignored LASSO but am increasingly seeing the point of it – it definitely deserves more attention in ecology.

The quick summary for readers is that LASSO picks “suboptimal” (i.e. not SSE-minimizing) parameter estimates by trading off this criterion against a penalty on really big betas (estimates of slope). It is primarily a tool for use when there is lots of collinearity in the data and is an alternative to step-wise regression.
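To make that trade-off tangible, here is a minimal sketch of the lasso fitted by cyclic coordinate descent on two nearly collinear simulated predictors. Everything in it is an illustrative assumption: the data-generating process, the penalty value `lam=0.1`, the objective scaling of (1/2n)·SSE, and the function names. It is not the algorithm of any particular package. Setting `lam=0.0` recovers the ordinary least squares fit for comparison.

```python
import random

def soft_threshold(z, t):
    """Shrink z toward zero by t; anything within [-t, t] becomes exactly zero."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def center(values):
    m = sum(values) / len(values)
    return [v - m for v in values]

def lasso_coordinate_descent(columns, y, lam, sweeps=2000):
    """Minimise (1/2n) * SSE + lam * sum(|beta_j|) by cyclic coordinate descent.

    `columns` holds centered predictor columns and `y` is centered,
    so no intercept term is needed.
    """
    n = len(y)
    scales = [sum(c * c for c in col) / n for col in columns]
    beta = [0.0] * len(columns)
    residual = list(y)
    for _ in range(sweeps):
        for j, col in enumerate(columns):
            # Covariance of column j with the residual, adding back its own contribution.
            rho = sum(c * r for c, r in zip(col, residual)) / n + scales[j] * beta[j]
            new_beta = soft_threshold(rho, lam) / scales[j]
            delta = new_beta - beta[j]
            if delta != 0.0:
                residual = [r - c * delta for c, r in zip(col, residual)]
                beta[j] = new_beta
    return beta

# Simulated data: x2 is almost a copy of x1, and only x1 drives y.
rng = random.Random(42)
n = 50
x1 = [rng.gauss(0, 1) for _ in range(n)]
x2 = [a + 0.1 * rng.gauss(0, 1) for a in x1]
y = [a + 0.5 * rng.gauss(0, 1) for a in x1]
columns = [center(x1), center(x2)]
y_c = center(y)

beta_ols = lasso_coordinate_descent(columns, y_c, lam=0.0)    # no penalty
beta_lasso = lasso_coordinate_descent(columns, y_c, lam=0.1)  # L1 penalty
print("least squares betas:", [round(b, 3) for b in beta_ols])
print("lasso betas:        ", [round(b, 3) for b in beta_lasso])
```

The penalized coefficients always have a total absolute size no larger than the unpenalized ones; with collinear predictors the lasso will often set one of the pair to exactly zero rather than splitting the effect between them with large offsetting values.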

Carsten Dormann et al just had a good review paper on various techniques (including LASSO and the related ridge regression) for dealing with collinearity.

Ah, so Dormann et al. have already written the paper I suggested that Eric write! 🙂

Your quick summary nicely makes a point Andrew Gelman just emphasized in his recent post linked to above: bias-variance trade-offs are inescapable, everyone’s just picking where they want their answer to end up on the trade-off curve. You pick something like ridge regression or lasso (or, say, using PCA to collapse your colinear predictors to a single PC axis) if you’re willing to accept a bit of bias in your parameter estimates in exchange for a big reduction in the variance of your estimates. Indeed, Gelman is always banging away on the value of “regularization” and “shrinkage” for dealing with this trade-off.

Jeremy:

A quick point…if I remember correctly, Friedman, Hastie, and Tibshirani have argued (for reasons that are beyond me/escape me at the moment) in “Elements of Statistical Learning” that regression on PCA inputs is likely to perform worse in validation samples than ridge/lasso/related approaches.

Thanks Steve, useful pointer. Though of course knowing the reasons behind their suggestion would help one put their suggestion to best use. Unfortunately, the gap between “Authoritative person recommends procedure A over procedure B under conditions X and Y, for reason Z” and “Everybody knows procedure A is the RIGHT way to do things and procedure B is rubbish” seems to be one that many ecologists find small and easily crossable. 😉

You’re absolutely right Jeremy. Apologies for my appeal to authority.

But what’s worse is that I exaggerated their claims!

Here is a quotation from their book (Elements of Statistical Learning, 2nd Ed. p. 82):

“To summarize, [partial least squares], [principal component regression] and ridge regression tend to behave similarly. Ridge regression may be preferred because it shrinks smoothly, rather than in discrete steps. Lasso falls somewhere between ridge regression and best subset regression, and enjoys some of the properties of each.”

Apologies.

Hi Steve,

Sorry, I didn’t mean to accuse you of an appeal to authority at all! I just meant to use your comment as an excuse to talk about how other people (not you!) too often oversimplify nuanced, sophisticated advice from experts into over-broad, over-simplified, blanket rules. Should’ve been clearer, sorry.

As for the fact that you slightly mis-summarized their advice: YOU IDIOT! NEVER COMMENT HERE AGAIN! /end obvious joke

Brian,

Completely unrelated to this cool discussion, but I grew up in Hampden, ME, just down the road from Orono. How’s things in Maine and at the Uni these days? I don’t get back much except in the summer.

I wonder if I may not be trolling, but on the other hand trolling in statistics could not possibly be that bad, would it?

I remember I had the reverse position in a discussion with a colleague, noting that predictability is not always the aim. I had a data set in which a correlation was way too noisy to predict anything (“low” r², “average” R²) but was significantly affecting the outcome, and I felt (and still feel) completely fine: all I wanted was to know whether something affected my dependent. There sure are cases where you could be just satisfied answering the basic question and not bothering too much as to whether you could predict anything beyond “the more x(i), the more y” (and an associated P-value).

On the other hand I’m frequently wondering how correlation strengths can be evaluated in ecology (e.g. when is a correlation strong or weak? Is there a threshold value for collinearity issues in ecologically noisy datasets, or are red flags raised only for more regular models where correlations are near “1.0”?).

Sounds like a fair question to me. You’ll find most of my thoughts (and many other people’s) in my first post on prediction. I’d be happy to respond to comments there if you have further questions or comments.

The oft-cited threshold (but still really a rule of thumb) is that collinearity becomes an issue when correlation r>0.7.

seedsaside:

” I had a data set in which a correlation was way too noisy to predict anything (“low” r², “average” R²) but was significantly affecting the outcome, and I felt (and still feel) completely fine: all I wanted was to know whether something affected my dependent. There sure are cases where you could be just satisfied answering the basic question and not bothering too much as to whether you could predict anything beyond “the more x(i), the more y” (and an associated P-value).”

I hear you, but I feel I need to quibble a bit. Sure, there are cases where you can be satisfied by just rejecting a null. But there are at least two reasons why I think these cases should be rare. I’ll start with the more boring one. As Gelman says, and I’m paraphrasing, “I’ve never committed a type I error in my life…because I’ve never studied anything with zero effect. I’ve also never committed a type II error because I’ve never claimed that any two things that are different are actually the same.” Rejecting null hypotheses is fine, but the hard part is converging on good hypotheses, and this is how you make good predictions. Second, if you add another predictor to your model, it is possible that the conclusion changes from “the more x(i), the more y” to “the more x(i), the less y”. So which is it? By putting our money where our mouth is, and applying our significant findings to try and make predictions, we can get a much better handle on how sure we are that the x and y covary positively, negatively, quadratically, or in some other way. When we get the right combination of variables in the model, interacting in the right way, then we start to make better predictions. Until we do that, all we have with significant findings is, ‘well y increased with x in this particular context.’ Which is a good start — and I recognize that we only have so many degrees of freedom and that validation data are hard, if not impossible to collect — but it is just a start.

Sorry for griping. From your tone it doesn’t sound like we will ultimately disagree?

“So which is it?”

Well, actually, these were under ANCOVA analysis, so we had a reasonable number of predictors (actually, the number of covariables that can reasonably be handled for the study sample size, and in this specific case covariables were fortunately carefully considered, i.e., among potential processes, these were the most plausibly linked to characteristics that would impact the dependent, and they’re known to, so they were “chosen” long before the experiment even started). Of course, in the very end, there’s _always_ the possibility that we miss the actual explanatory variable, but I’m sure you were not meaning that specifically, because this would be true any time statistics are used.

I feel safe also because the hypothesis was somewhat obvious (it is just much better to back it up as a result of the analysis). Where we can bring discussion back to the opportunity of simply documenting the case or trying to elaborate predictive models. You’re absolutely right that the take-home message may not grow above a tiny “y increased with x in this particular context”, though enrooted within contextual discussion.

This approach may seem too humble, but to start with the “get the right combination of variables in the model, interacting in the right way, then we start to make better predictions” approach, in some cases this would be too specific to a particular study model and at the cost of generality (the ultimate goal of modelling in ecology, science and the likes). In this case the take-home predictive message somewhat stays at the same level: “in this particular context, because of the right variables interacting in the right way”. (I’m not saying this is always the case, predictive ability is quite a noble goal, simply it may not always be amenable within the range of specific settings or questions).

“Sorry for griping.”

That’s fine! Discussion is good, it’s just the modern way of having it online that feels so… modern. ^^ (just to stick to new discursive norms ^^).

“From your tone it doesn’t sound like we will ultimately disagree?”

No. Not yet?

It’s good to see where discussion lines get driven. It looks like there’s a debate to have as to where the tradeoff between predictive ability and prediction usefulness stands, because of sometimes acute specialisation of ecological data. Interestingly positions on a debate scale might broadly differ between “empiricist” and “theoretician” extremes. (and that’s amazing of a discussion about statistics, since it’s one of the linking tools).

Ok, re: exactly what’s assumed in the most general derivations of AIC, I’ve had a glance at Burnham and Anderson’s (2002) book, specifically sections 7.5-7.6, which focus on the derivation in the special case of exponential-family distributions. It’s not fully explicit about what’s being assumed in every derivation (well, it’s probably explicit enough for someone with a degree in mathematical statistics to fill in the blanks). For the benefit of anyone following this comment thread, here’s what I took to be the gist:

-One can derive a “general” K-L based model selection method without assuming that the true model is in the set of candidate models.

-The “general” result for K-L model selection (i.e. the result derived without assuming that the true model is in the set of candidate models), is not actually AIC, but another criterion Burnham and Anderson call TIC, after Takeuchi (1976), on whose derivation they apparently rely. In this “general” case (and in the limit of a large sample size; small sample size correction is a whole ‘nother ballgame), a different bias correction term than that used by AIC needs to be subtracted from the expected maximized log likelihood in order to obtain an unbiased estimate of K-L information distance. A sufficient but not necessary condition for the bias correction term to exactly equal that used in AIC (i.e. K, the number of free parameters in the model) is that two specified quantities (the definitions of which I couldn’t be arsed to look up) are equal. In general, equality of these two quantities is guaranteed if and only if the true model is identical to, or is nested within, the model one is considering. Burnham and Anderson note that it’s unrealistic to expect that the true model will be identical to or nested within the model under consideration, raising the issue of how good of an approximation AIC is to TIC when the true model is more general than the model being considered but the model being considered is a “reasonable approximation” to the truth. They explicitly state the need for “extensive simulation studies and experience to give us full confidence in when we can expect reliable results from AIC, versus when we might have to use TIC”. They then go on to derive what they describe as some limited but useful results on this issue, for certain special cases, and do some simulations. From what I can tell, in the simple special cases they consider they’re able to prove that the bias correction term used by AIC often is quite different from that used by TIC, and can err by being either too small or too large. 
I didn’t look at their simulation results because my interest was in what can be proven analytically.
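For readers who want to see the “two specified quantities” mentioned above, the usual statement of Takeuchi’s criterion is sketched below. The notation is an assumption made for this illustration (f is the unknown truth, g(x | θ) the fitted model, θ_0 the parameter value minimizing K-L distance, K the number of free parameters), and the two matrices may be labeled differently in Burnham and Anderson’s book.

```latex
\begin{aligned}
\mathrm{TIC} &= -2\log L(\hat\theta) + 2\,\mathrm{tr}\!\left(\hat{J}\,\hat{I}^{-1}\right), \\
I(\theta_0) &= -\,E_f\!\left[\frac{\partial^2 \log g(x\mid\theta)}{\partial\theta\,\partial\theta^{\top}}\right]_{\theta=\theta_0}, \qquad
J(\theta_0) = E_f\!\left[\frac{\partial \log g(x\mid\theta)}{\partial\theta}\,
                         \frac{\partial \log g(x\mid\theta)}{\partial\theta^{\top}}\right]_{\theta=\theta_0}, \\
J = I \;&\Longrightarrow\; \mathrm{tr}\!\left(J I^{-1}\right) = K
\;\Longrightarrow\; \mathrm{TIC} = -2\log L(\hat\theta) + 2K = \mathrm{AIC}.
\end{aligned}
```

When the model is correctly specified, the two matrices coincide (the familiar information matrix equality) and the trace reduces to the parameter count. When it is misspecified, the trace can be larger or smaller than K, which is the gap between AIC and TIC discussed in this comment.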

I have to say that, after reading this (and emphasizing that I’m NOT a mathematical statistician), I can see where Ripley is coming from. If I’m understanding correctly, if one wants to do K-L-based model selection without assuming that the true model is one of the candidate models, or is nested within one of the candidate models, then one has to use TIC (which I’m guessing isn’t easily calculable in general) rather than AIC. AIC has been proven to exactly equal TIC only in the special (and highly unrealistic) case where the truth is one of the candidate models or is nested within one of the candidate models. So when the truth is neither one of the candidate models nor nested within one of them, AIC is only an approximation, and can’t in general be *proven* to be a great approximation.

Burnham and Anderson go on to summarize various simulation results by saying that, in practice, there’s generally no meaningful difference between AIC and TIC so long as the model considered is reasonably close to the unknown (and possibly more complex) true model. And they go on to paraphrase another author who has studied the issue in some special cases and say that, in cases in which the truth is more complex than any model considered, “the most reasonable (essentially, compelling) explicit model selection criterion is to use AIC. This gives a justification for using the AIC, even if the model is not true.” They appear to say this at least in part because in practice TIC can’t be calculated exactly and so has to be estimated, and the available estimators are really variable. So rather than trying to estimate TIC it’s better to just take a pragmatic leap of faith and use AIC. All of which is fair enough–but none of which sounds like the words one would use if AIC had, *in general*, been *proven* to equal or approximate TIC in cases where the truth is more complex than any of the models considered. Bottom line: as far as I can tell, Burnham and Anderson’s insistence that one can and should use AIC when models aren’t nested, and when the true model may not be among those considered, is backed up by simulations, not rigorous proofs. As best I can tell, Burnham and Anderson are being a little weaselly when, in their “AIC myths” document attacking Ripley, they say AIC is an estimator of expected relative K-L information distance, whether or not models are nested, period. Yes, it’s an estimator–but it hasn’t been *proven* to be a *good* estimator except in the case of nested models, as far as I can tell. And similarly, when Burnham and Anderson say in their “AIC myths” document that “our clear position is that nothing needs to be assumed about a true model when justifying or using AIC”, I think they’re being weaselly. They’re trying to change the question from “What’s been analytically proven to work?” (which is the issue Ripley focuses on) to “What’s a reasonable practical approach?” without admitting that they’ve done so.

Now, in fairness, Ripley’s primer certainly leaves one with the impression that people who use AIC in contexts where it hasn’t been analytically proven to work are at best taking a totally unjustified leap of faith, and maybe just flat-out wrong. Burnham and Anderson are I think quite right to appeal to their simulation results to justify AIC as a pragmatic choice in a wide range of settings (though not an infinitely wide range–what if none of your candidate models are even close to the truth?), and to push back against the impression Ripley’s primer leaves. But let’s be clear: AIC is a pragmatic choice, which seems to only be rigorously justified in more or less the cases where Ripley says it’s rigorously justified.

Hope I’ve gotten this right!

Impressive summary, Jeremy. I want to say that being an applied ecologist, I’m most interested in practical approaches. But I find it’s never enough for me to follow an approach just because some expert says it works fine. That is one reason I’m a big fan of Andrew Gelman – he always lays his cards on the table and instead of pretending that his methods are inherently superior he makes it clear that they are most useful for his typical objectives.

I feel the need for some obstinance this morning.

Even though I think variance partitioning is a great idea overall, I don’t think we should go overboard on extolling R^2 as an informative diagnostic. It’s true, as per part one of this series, that a large enough sample size will give you a significant p value eventually. But it’s equally true that small sample sizes will frequently give you a high R^2 that is spurious, relative to your critical alpha value of choice, and your p value is then the only thing keeping you from going overboard on those type II errors.

The other point with R^2 is that its importance really depends heavily, very heavily, on what scale of variance one is interested in explaining with one’s analysis. If you’re interested in “explaining” a long term trend in a system with a high ratio of short- to long-term variance, then a low R^2 is not necessarily a “bad” result at all, as long as it explains something about that trend and not primarily the high frequency variance.

correction, type I error that should be. I *always* get those two mixed up.

Hi Jim – all true points about R2. The small sample issue is real (I wouldn’t trust an R2 on 6 or 10 points too far), but I don’t see it coming up that often in the domains of ecology I read the literature in. Do you?

And quite right that R2 is not a perfect scale. I have situations where I am dancing around the room when I get an R2 of 0.30 and situations when I am in tears with an R2 of 0.30. As I emphasized, it is a scaling relative to the underlying variance. But I still find it far more informative than the p-value.

No I agree, most people are working with larger datasets. I can see it as important in whole ecosystem manipulative experiments and similar expensive and intensive projects.

I was referring to a different kind of scaling than you were. I meant low vs high frequency variance components. My point was that you can get equivalent R^2 values (and r^2) from very different underlying relationships between your variables: it’s not capable of discriminating between covariance caused by high- vs low-frequency correspondences between two variables. If I’m interested in explaining the trend in variable A, from variable B, the fact that the high frequency variations between the two are highly correlated (thereby producing a high R^2) does not by itself tell me what I want to know. Conversely if such variations are very poorly correlated, but there is in fact a relationship at longer scales, leading to an *overall* low R^2 as a result, THAT result is meaningful.

Got you. Interesting point about frequencies. As more of a spatial ecologist, I think of it as grain size. And there are an increasing number of papers being produced showing how the most explanatory variables (highest r2) change with grain size (and extent), which is rather intuitive (habitat explains a lot at small grains – 100 m x 100m – but climate explains more for large grains – 100 km x 100 km). Interesting to hear it put into time series terminology. Is it your feeling that this point has been well appreciated for a long time in the timeseries world or also a relatively new (starting 10 years ago, peaking now) phenomenon like in space?

Re: time series and associations between variables at different scales more generally, that’s what frequency domain approaches are for. Often very useful because variables can, as Jim notes, have different associations at different scales. For instance, two variables might covary positively on one timescale, and negatively on another. Their overall covariance in that case could be positive, negative, or zero depending on what their variances are on those different scales.
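A two-component toy case makes the last point explicit. Suppose, purely as an illustrative assumption, that each variable is the sum of a slow component s shared with the same sign and a fast component f shared with opposite signs, with s and f uncorrelated:

```latex
\begin{aligned}
x &= s + f, \qquad y = s - f, \qquad \operatorname{Cov}(s, f) = 0, \\
\operatorname{Cov}(x, y) &= \operatorname{Var}(s) - \operatorname{Cov}(s,f) + \operatorname{Cov}(f,s) - \operatorname{Var}(f)
 = \operatorname{Var}(s) - \operatorname{Var}(f).
\end{aligned}
```

The two variables covary positively on the slow scale and negatively on the fast scale, yet the overall covariance is positive, negative, or exactly zero depending only on which component has more variance.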

One use of such approaches is to subtract out variation at one scale (often variation that has some obvious or well-known cause) in order to reveal variation at other scales. For instance, a project I’m involved in is using frequency domain approaches to quantify patterns of variation and covariation among zooplankton in long-term lake datasets. All zooplankton covary positively on an annual timescale, because they all are most abundant in summer and least abundant in winter. You need to subtract that out in order to see if there are other timescales on which they negatively covary (short answer: no, there aren’t…)

Brian, I confused things a little by using the term “frequency” sloppily, which was dumb because I object to that very issue in some of the tree ring literature, where people often use “low/high frequency” when what they really mean is long/short time scales, respectively. Those two pairs of terms are not synonymous, as the former imply +/- periodicity/regularity while the latter do not. So I can see why you and Jeremy thought I was implicitly referring to time series data, even though the point I was making is equally applicable to time series and to spatially ordered data, and in fact to any two variables which are ordered with respect to each other. But probably the most common use is in fact in spectral decomposition of time series data, where you break things into trend, periodic (“seasonal”) and random noise components. And we don’t have to perform a Fourier transformation to do this either, because again, there may be no seasonal/periodic component, just non-periodic variance of different magnitudes at different scales, the relative magnitudes of which are nevertheless potentially very informative as to what’s driving what in the system.

I don’t know enough to answer your question, but it does seem to me that wavelets are being more often used in the last two decades, in time series analysis, than they formerly were, although I think physicists, engineers and others who work with spectrally definable data have been using them for quite a while.
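As a rough illustration of separating scales without a Fourier transform, a centered moving average splits a series into a smooth (long time scale) part and a rough (short time scale) remainder. The variance of each part can then be compared. This is only one of many possible smoothers, chosen here for simplicity. The window length and the example series are arbitrary assumptions.

```python
def moving_average_split(series, window=5):
    """Split a series into a smooth part (centered moving average) and a
    rough part (what is left over). Only interior points are returned,
    so both outputs are shorter than the input by window - 1 values."""
    if window % 2 == 0 or window < 1:
        raise ValueError("window must be a positive odd number")
    half = window // 2
    smooth, rough = [], []
    for i in range(half, len(series) - half):
        local_mean = sum(series[i - half:i + half + 1]) / window
        smooth.append(local_mean)
        rough.append(series[i] - local_mean)
    return smooth, rough


def variance(values):
    """Population variance."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)


# Hypothetical series: a slow ramp plus a point-to-point wiggle.
series = [0.1 * t + (1.0 if t % 2 == 0 else -1.0) for t in range(50)]
smooth, rough = moving_average_split(series, window=5)
print("variance at long scale:", variance(smooth))
print("variance at short scale:", variance(rough))
```

Neither component needs to be periodic for the comparison of variances to be meaningful, which is the point about "scale" versus "frequency".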

Thanks for the great post, Brian. In addition to the practical info about goodness of fit, I think this really brought home the importance of prediction. GOF statistics are useful for understanding how well the model fits the data but aren’t that helpful regarding how well the model reflects reality (whatever that is) or whether the model reflects any patterns or processes useful in understanding anything other than your data. People often criticize generalizations and say you can interpolate but not extrapolate from your data. Strictly speaking that is true, but we are really interested in what the data (a sample of the world) tell us about other places in time and space. Testing how well a model predicts new/additional data is actually telling us how useful the model is, and testing the limits of a model’s predictions is useful as well. We generally use inductive reasoning in ecology rather than the hypothetico-deductive method of physics, but publishers and funding agencies often denigrate repeated studies. They want novel research, but that isn’t helpful in an inductive science when we don’t have good tests of our current models. Maybe by framing things in terms of testing model predictions we can get away from the dogma that every study must be completely novel. I think macroecologists do this better than many other ecologists because of the nature of the field. Lab ecologists like Jeremy Fox are probably better at it as well, because they work in an experimental system. Hopefully mesoecologists like myself can utilize prediction testing to a greater extent and move the science forward.

I’m writing on an iPad on the stationary bike at the gym so hopefully this all makes sense. That’s my main blog time. Thanks again for the excellent series of posts.

Dear Brian,

I was somewhat confused that you write “(x-axis is predicted value \hat{y}, y-axis is observed value for y)” but in the figure underneath this sentence you do the opposite (the only figure in that long and interesting post). I once saw a paper on the consequences of getting this wrong (as you do in the figure, but not in the text) for R2 (or was it r2), but for the life of me I cannot find the reference.

Anyway, since in a regression we assume x to be error-free, it is the prediction (the point prediction) that should go on the x-axis (which is, of course, the default in R and the like).

Sorry for bringing this up rather than joining the r2R2AIC-debate.

Cheers,

Carsten

Hi Carsten – you did catch an error where I changed the graph (to show bias which is normally in the y-axis) but didn’t change the text. Bit of a contrived case to have bias in a linear univariate regression model, but it is simple to illustrate.

However for the r2 it doesn’t matter. In my contrived linear univariate case, R2=r2=cor(x,y)^2 and cor(x,y) is symmetric. Definitely does matter for slope estimates though, hence the interest in Type II regression in macroecology, as you know. And in more complex cases, better statistics like RMSE are still symmetric in the observed and predicted.

If you’re trying to make a general point about which way observed vs predicted plots should go, I see your logic, but can’t say I can get too excited about the issue as long as the statistics (which mostly depend on distinguishing dependent vs independent variables) are calculated correctly.
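The symmetry claim in this reply is easy to check numerically. The sketch below uses a handful of made-up observed and predicted values (an assumption for illustration only) and shows that r2 and RMSE do not care which variable is called x, while the OLS slope does. In fact the two slopes multiply to r2.

```python
import math


def mean(values):
    return sum(values) / len(values)


def pearson_r(u, v):
    """Pearson correlation coefficient."""
    mu, mv = mean(u), mean(v)
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)


def ols_slope(x, y):
    """Slope of the ordinary least squares regression of y on x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def rmse(u, v):
    """Root mean squared difference between two sequences."""
    return math.sqrt(mean([(a - b) ** 2 for a, b in zip(u, v)]))


# Hypothetical values, chosen only for illustration.
observed = [2.1, 3.9, 6.2, 7.8, 10.1]
predicted = [2.0, 4.0, 6.0, 8.0, 10.0]

print("r2 either way:", pearson_r(predicted, observed) ** 2,
      pearson_r(observed, predicted) ** 2)
print("slope obs~pred:", ols_slope(predicted, observed))
print("slope pred~obs:", ols_slope(observed, predicted))
print("RMSE either way:", rmse(observed, predicted), rmse(predicted, observed))
```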

You are probably already aware of this, but I just want to point out that there is an asymptotic equivalence between model selection by AIC and leave-one-out cross-validation (see Stone, 1977 at http://www.jstor.org/stable/2984877). In your post, model selection criteria such as AIC and cross-validation are presented as rather separate things, but as Stone’s paper shows, there are fundamental connections between them. It also puts your statement: “That is the main problem with AIC – it is one choice out of an infinite list of possible weightings of goodness of fit vs. number of parameters.” in a different perspective – besides the information theory interpretation there is also an analog to cross-validation. However, from a more philosophical perspective I agree with you that cross-validation techniques and model selection by AIC-style criteria are very different, and they are often applied in rather different circumstances.
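A small numerical illustration (not a proof of Stone's result) may help here. It assumes Gaussian errors, so that AIC can be written up to a constant as n·log(RSS/n) + 2k, and it puts leave-one-out cross-validation on a comparable scale as n·log(PRESS/n), where PRESS is the sum of squared leave-one-out prediction errors. The data, the two candidate models, and all function names are invented for the example. Here k counts only regression coefficients, since the error variance adds the same constant to both models.

```python
import math


def fit_mean(xs, ys):
    """Intercept-only model: predict the mean of y everywhere."""
    m = sum(ys) / len(ys)
    return lambda x: m


def fit_line(xs, ys):
    """Simple linear regression of y on x by least squares."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    return lambda x: intercept + slope * x


def aic_score(fit, n_params, xs, ys):
    """Gaussian AIC up to an additive constant: n*log(RSS/n) + 2k."""
    n = len(xs)
    predict = fit(xs, ys)
    rss = sum((y - predict(x)) ** 2 for x, y in zip(xs, ys))
    return n * math.log(rss / n) + 2 * n_params


def loo_score(fit, xs, ys):
    """Leave-one-out score on the same scale: n*log(PRESS/n)."""
    n = len(xs)
    press = 0.0
    for i in range(n):
        predict = fit(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        press += (ys[i] - predict(xs[i])) ** 2
    return n * math.log(press / n)


# Synthetic data: a linear trend plus a deterministic wiggle.
xs = [float(i) for i in range(20)]
ys = [2.0 + 0.5 * x + 0.3 * math.sin(1.7 * x) for x in xs]

MODELS = {"mean": (fit_mean, 1), "line": (fit_line, 2)}
aic = {name: aic_score(f, k, xs, ys) for name, (f, k) in MODELS.items()}
loo = {name: loo_score(f, xs, ys) for name, (f, k) in MODELS.items()}
best_by_aic = min(aic, key=aic.get)
best_by_loo = min(loo, key=loo.get)
print(aic, loo, best_by_aic, best_by_loo)
```

On this toy problem the two criteria rank the models the same way. The equivalence in Stone's paper is asymptotic, so with small samples or closely matched models they can still disagree.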

I’m way behind the curve with this great comment thread, but cannot resist adding this reference to the mix. Interesting to see similar performance among many different variable selection methods (of course there are many ways these approaches could be compared). Also, although I do not use stepwise and am generally opposed to it (see Jeremy’s March 19, 2013 comment for a good explanation of why), I liked Murtaugh’s point on the flexibility of stepwise regarding choosing P-to-enter and P-to-stay values that reflect the study objectives/context (end of p. 1066 to p. 1067).

Murtaugh, P. A. (2009). Performance of several variable-selection methods applied to real ecological data. Ecology Letters.
