I have been working on a series of posts on why ecologists need to take prediction more seriously as part of their mandate as scientists. In Part 1– I argued that an ANOVA/p-value mentality is killing us. In Part 2 – I argued that the rigorous discipline of putting out quantitative predictions and then checking to see if they are right is good for a discipline (with weather prediction as an example). In Part 3 – I argued that a pure reductionist view of where to look for mechanistic models to produce predictions was flawed. Throughout, I argued (and many commenters agreed) prediction is not just for the applied questions. Prediction, done right, brings a level of rigorous honesty that helps basic science advance more quickly too.
This is my last post on the topic (you can breathe a sigh of relief), and wasn’t originally planned. But so many commenters, especially on the first post, wanted more about the statistics of measuring predictions. In Part 1 – I made it pretty clear I thought the p-value was overrated and effect size and r2 needed more attention. But of course it is more complicated than that. Questions were raised about AIC and a bunch of other metrics (and no questions were raised about some of the metrics that I personally think are most important). So the following is a tutorial on measuring the quality of a prediction in a quantitative fashion.
Goodness of fit – continuous variables
Let’s take the simplest model y=f(x,θ)+ε. Let y (but not necessarily x) be a continuous (aka metric) variable like temperature or mass or distance, or to an approximation abundance if the abundances are large. If we define the model f and the parameters θ, then if we have a new value x (e.g. environmental conditions), then we can predict the corresponding new value (pronounced y-hat) and compare it to the observed value y. Of course in a perfect prediction would always equal y (i.e. error ε=0), but ecology is a long ways from that Nirvana! So we want measure the error/deviance/inaccuracy/failure of prediction. How do we do it?
One of the world’s alltime geniuses, Carl Friedrich Gauss, came up with the first insight. If we define to be the prediction error for one data point xi then we can look at sum-squared-error or . Minimizing this SSE was Gauss’ criteria for picking the best fit line, leading directly to modern Ordinary Least Square (OLS) regression. However, there is a problem with SSE as a measure of goodness of fit – it is completely uninterpretable. Is SSE=40502 a good fit or a bad fit? The answer totally depends on the number of data points and the innate variance in the data (after all if two widely different y values are observed for the same x, no model can fit that that). To solve this problem the idea of error partitioning was developed. If we define total-sum-of-squares (i.e. a proxy for total variance) as where is the average value of y, then we have .
R2 is extremely useful. In particular it ranges between 0 and 1, making it easy to compare success of models across different amounts of data, different data, and even models of different things. The last point bears expanding – R2 even lets us say things like models of predicting metabolic rate by body size (R2=0.9 or so) are much better than models of abundance vs temperature (R2=0.2 or so). There are limits to this – and I certainly wouldn’t make a big deal of small differences, but there is no other metric that is even credible for this kind of cross-disciplinary comparison. R2 also is independent of units – I get the same R2 value whether my y variable is measured in grams or kg. If all I’ve accomplished is gotten more ecologists to pay attention to and report R2, I would be very happy!
But there are of course complications! The first is it was noticed that in the simple linear model (y=ax+b+ε), R2=r2 where r is the Pearson correlation coefficient (which itself ranges from -1 to 1). That is kind of cool! But it now means we have two definitions and ways of calculating R2. And outside of the simple linear model (e.g. nonlinear models), they do NOT give the same answer. So at a minimum, you have to specify how you calculate it, which is often done by using R2 to report the error partitioning approach and r2 to report the correlation approach. If you want to look good, definitely use the r2 version. It will always be ≥0. If your model is worse than the null model (horizontal line of y=c), then R2 can be less than zero (and yes I have produced models with this outcome!). Even worse, the r2 approach can sweep systematic bias under the rug. Imagine a plot of predicted vs actual values (x-axis is predicted value , y-axis is observed value for y). In a good model the data should all be close to the 1-1 line (passing through the point 0,0 and slope of 45 degrees). R2 directly measures this. But now imagine you are constantly over predicting (see figure below). You correctly track that when x goes up a certain amount, y goes up a certain amount, but every prediction is greater than the observed values. This would cause a nice-tight line to form in the plot that is above the 1-1 line. This could have an r2 of 1 even if no point is on the 1-1 line! For these reasons, serious test of prediction should use R2 rather than r2 (with the added benefit that R2 generalizes to n-dimensional x-variables).
So R2 does a lot of nice things, including the fact that it relativizes the error to the innate variance in the data. But sometimes we don’t want to relativize our error. Lets take an example from my (and others) research where R2 doesn’t tell the whole story. We’re trying to interpolate temperatures between weather stations. In many regions weather stations are 100s km apart (with small weather influencing details like mountains in between). So we are evaluating different methods. If I find that a particular method has an R2 of 70%, I am probably moderately happy – my method is explaining a majority of the variability. But if you are a user looking to use these layers, do you know what you want to know? Probably not. In fact the R2 statistic is pretty meaningless if you don’t have an innate sense of variability of the inherent error. What you probably want to know is something like the average error is 1 °C or the 95% confidence interval is ±2 °C. Confidence intervals (and boot strapping methods that produce them) are well-known to ecologists, so I won’t dwell on them further here (except to say that they do require a sensible mode of bootstrapping to produce which doesn’t always exist).
I want to look at the other concept “average error”. Literally this would be Mean Square Error or MSE=SSE/N where N is the # of observations. Except it still has the square in it (recall if we didn’t square before summing, then the errors would cancel each other out and sum to zero in a linear regression). This problem of squaring is fixed not surprisingly by introducing a square root – RMSE=Root Mean Squared Error or . This is widely used by engineers and meteorologists. In fact, you also already know this in a different inferential context as the standard error. But in a prediction context it is called RMSE. Its main virtue is that it is in the same units as y. So I can tell you that the RMSE prediction at a point between weather stations is 1 °C. Now this is useful to you as an ecologist trying to use these layers. It is also an excellent measure of prediction accuracy. It has lost its relativism (RMSE of 1 °C is excellent – indeed unacheivable – for my application but really poor for the accuracy on repeat measures of a $1,000 temperature probe). And it is now units dependent – predicting body mass in kg and getting an RMSE of 1 kg will turn into an RMSE of 1000 g if you measure body mass in g. But RMSE is really concrete. It also has the nice feature that it is more like R2 than r2. Indeed there is a formula directly relating RMSE to bias: . Indeed RMSE is one measure that attempts to assess trade-offs between bias and variance (a model with low variance but very high bias may well be worse than a model with no bias and moderate variance). This came up recently deep into the comments on my post on detection probabilities. One wrinkle is that RMSE, depending on squaring errors, is very sensitive to outliers and fat tailed error distributions. If this is a problem in your data you can use Mean Absolute Error (). Note that OLS linear regression effectively minimizes RMSE (it technically minimizes SSE, but since RMSE is a monotonic transformation of SSE, the optimization chooses the same answer), while LAE or median regression, a form of robust regression, chooses the line that minimizes MAE, thus both methods select “best prediction” lines. There are dozens of other measures you can use, but in my opinion if you look at R2 and RMSE (or MAE) you have really got 90% of the value of prediction metrics.
Goodness of fit – binary and discrete variables
Now, what if y is a categorical variable rather than continuous? There are literally whole books written about categorical variables, and I’m not going to go too far into the topic here. Indeed I am going to limit myself only to a subset of categorical variables – binary variables (i.e. 0/1 or true/false or present/absent). These are extremely common dependent variables in ecology (any model that predicts survivorship or presence/absence is likely using a binary dependent variable). Your first instinct might be to keep using R2. If you do you will be told you are wrong to do so. And there is a problem. Since the observed y values can only be 0 or 1, and the predicted values are usually some intermediate value such as 0.732 the distances between observed and predicted are artificially large. Or put in other terms, the assumption of normal errors which underlies much of the justification for R2 (and even RMSE) is violated with binary variables. But in fact, as long as you accept (and know) the fact that R2 used on binary y variables is never going to be as high as R2 on continuous variables it actually works rather well. And it even has a name – the point-biserial correlation (as the name implies this is usually calculated as r2 rather than R2). This is my favorite metric of prediction on binary variables, but I am rather out of the mainstream.
Not surprisingly giving the demonstrable proclivities of ecologists for statistical machismo, ecologists have embraced a much harder to calculate and harder to interpret statistic known as Area Under the Curve (AUC), which is based on an idea in signal communication theory known as the receiver operator curve (ROC). It is not my goal to provide a full introduction to this metric here. But basically it assumes there is a threshold or cut-off that can be varied and then it measures how much of a trade-off there is between true positive rates vs false positive rates. Imagine a model predicting species presence or absence as a function of climate. This will produce predictions at each point in space that looks like the probability of presence of say p=0.732. To get back to the binary present/absent, we have to chose a threshold. An obvious one is pthreshold=0.5. if predicted p>0.5 then we predict present, otherwise we predict absent. Obviously if pthreshold were set at p=0 we would predict present everywhere, giving us many true positives and no false negatives but lots of false-positives (and the opposite for p=1.0). In a perfect (but impossible world) any value of p>0 and p<1 would work equally well. This gives an AUC (area under curve) of 1.0 (the curve goes from 0,0 to 1,1 and stays inside the 1×1 unit box so the maximum area under the curve is 1). Alternatively, as a null model, if we just take the original proportions of presences () and for each point randomly flip a coin with probability of coming up present, then our false positive and negative rates would depend only on the thresholds and the area under the curve would be 0.5. Even if you skipped most of the last few sentences, this is the important point – the null model value of AUC is 0.5. An AUC of 0.5 is the same as an R2 of 0. And an AUC <0.5 is the same as an R2<0! So if next time you see an AUC of 0.6 don’t be too impressed. It is possible to rescale AUC to run from 0-1 to seem more like R2 (i.e. (AUC-0.5)/0.5) ) but the analogy is still misleading. There are dozens of other statistics commonly in use. Some even let you specify your preferences for errors of omission (false negatives) vs commission (false positives). But at that point, I’d rather just publish a 2×2 table of true and false positives vs true and false negatives.
A note on comparing models, AIC and other information criteria
R2 is in general a wondrously relativized number that can be used to compare prediction quality across datasets, types of models etc. But it is an inescapable fact (indeed a mathematically provable fact) that as the number of explanatory variables going into a regression goes up, R2 must also go up. This means R2 is not always the best tool for comparing two models with different numbers of parameters (although the problems are often overstated – big differences in R2 are always meaningful in practice). To correct for this, ideas like adjusted R2 and Mallow’s Cp were invented. More recently, Akiake’s Information Criteria (or AIC) has emerged (where AIC=-2LL+2k and LL is log-likelihood and k is # of parameters). AIC can be used to compare two models with different numbers of parameters. It should be noted that AIC has a relationship to our starting point, SSE. Namely for normal errors, AIC=n ln(SSE/n)+2k. So AIC is in a certain fashion a measure of goodness of fit. But just like SSE it is not that useful by itself (is an AIC of 273 a good fit or bad?). But AIC is only good for comparing along one dimension – between models across the same data. It is a mistake to use AIC to compare one model on two datasets for example, and certainly it cannot be used as a general comparison of two totally different models.(unlike R2). But if you really want to compare two models on the same dataset AIC is the way to go. Or maybe AICc or BIC or QIC or … That is the main problem with AIC – it is one choice out of an infinite list of possible weightings of goodness of fit vs. number of parameters. It has some logical justification if you think information theory explains the world. Otherwise, it is a bit arbitrary. Personally, I prefer reporting different R2 values and different numbers of parameters and letting the reader choose what is the best model. The use of model comparison as an inferential approach also suffers from certain logical pitfalls (finding the best model out of a list of bad models doesn’t necessarily advance science). Thus, although I use AIC in some contexts (namely comparing different linear regression models), it doesn’t add much to assessing goodness of prediction in the sense I’ve talked about.
The experimental design aspect – predicting on independent data
So far I’ve talked about a few small cheats in the world of prediction (like using r2 instead of R2), but there is one whopper – failing to distinguish calibration vs. validation. Calibration is the process of picking the best possible parameters to fit one set of data. A simple example is using the OLS (minimizing SSE) to pick the slope and intercept of a linear regression. Validation is taking the parameters picked using one data set and testing them on a separate independent data set. The reason this is important is the issue of overfitting. Overfitting is when the model fits not just the signal but the noise/error in the data. This is easy to do when you have a really flexible model with lots of parameters. As Fermi quoted Von Neumann “with four parameters I can fit an elephant, and with five I can make him wiggle his trunk” (which believe it or not led to a scholarly paper showing he was right!). If you have enough parameters to draw an elephant, you will fit every bump and wiggle of the noise, and if you do, you’ll get a really awesome R2, but when you move to your next dataset (this is the validation step), your R2 will be terrible (by definition noise does not transfer repeatably).
So to avoid this problem, the proper method is to choose parameters that give the best fit and measure the goodness of fit on independent datasets. The exact method of doing this varies. In regression trees, one builds a very complex tree that one knows is overfit. Then one applies the tree to new data and walks it back (one by one removes the lowest branches) until the goodness of fit (often R2) is maximized on the new data. In classical regression, a separate validation is not necessary IF you make certain assumptions about the data and errors (normal, independent, constant variance) because we can actually model the transferability. The methods for validation are really as diverse as the methods for modelling.
This is why I like prediction so much – it clearly separates out the calibration vs. the validation steps. Calibrate on today. Predict then validate on tomorrow. Of course as commentors have pointed out, one can do hindcasting (calibrate on today, predict in the past or vice versa) or lateral casting (predict in North America, validate in Europe). These all work equally well. But the real key is the validation data must be independent of the calibration data! Any pretense of validation that fails to do this is at least as bad as (indeed is pretty much doing the same thing as) pseudoreplication.
One common approach to getting independent data that comes from the machine-learning world is to use a hold-out or cross validation technique. Holdout is when one randomly selects say 70% of the data and calibrates on that and then validates on the remaining 30%. Cross-validation is a slightly fancier version where one calibrates on 90%, validates on the remaining 10%, then repeats this 10 times which each separate chunk of 10% held out once and averages. These techniques work great when all the data points are truly independent. These techniques work lousy when there is spatial or temporal autocorrelation linking the points (the 30% that were held out are not truly separate from the 70% used to calibrate). Nearly every paper I have seen on niche models fails to realize this (or at least plows ahead anyway). In the study I just linked to this simple effect artificially inflated R2 from close to zero to a very respectable 60 or 70% – pretty misleading.
So a long post! To summarize my recommendations:
- For continuous variables, report R2 (not r2) and RMSE (or MAE)
- For binary variables leave the prediction as probabilities and report the point-biserial correlation. If you HAVE to turn probabilities of presence (or whatever) into 0/1 binary variables, choose an appropriate threshold (non-trivial),then just report the 2×2 table. I am not a fan of AUC – it is hard to interpret and inherently misleading with a base of 0.5
- Probably even more important than the statistical metrics is the question of experimental design and inferential context. You must think through how to separate calibration from validation. If your validation is not statistically independent of the calibration data, you are committing pseudoreplication. And worse, without validation, you’re not predicting, you’re just curve fitting!