Statistician Cosma Shalizi is a really sharp guy, who writes wonderfully on everything from technical statistical topics to philosophy of statistics to books about the Soviet Union. The notes for his undergraduate stats courses are online, and they basically comprise a very readable textbook. As part of my prep for teaching intro biostats next term, I was skimming Cosma’s notes, and came across the chapter on linear regression, in which he asks whether R^2 is “a distraction or a nuisance.” Tell us what you really think, Cosma! 🙂
I’ll save you the trouble of clicking through; here are his reasons for dismissing R^2:
- It doesn’t measure goodness of fit. Even if your model is completely correct, R^2 can be made arbitrarily small by making the variance of X small. Conversely, R^2 can be arbitrarily close to 1 even when your model is wrong, as when the true model is nonlinear but the best linear approximation has a non-zero slope and the variance of X is large. (See the short R sketch after this list for both effects in action.)
- It doesn’t measure prediction error. You can make R^2 take on any value just by changing the range of X. Mean squared error and out-of-sample error are much better measures of prediction error.
- R^2 doesn’t tell you how big your prediction intervals or confidence intervals are.
- R^2 can’t be compared across data sets.
- R^2 can’t be compared between models with transformed and untransformed Y, and can go down if you transform Y so as to better conform to model assumptions.
- The only situation in which you can compare R^2 values is when you’re fitting different models to the same data. But he says that in that case you might as well just compare mean squared error.
- R^2 is not “the fraction of variance explained” in any scientifically-meaningful sense, because it’s the same whether you regress Y on X or X on Y. (Elsewhere he suggests you think of regression as a smoothing method, and think of R^2 as the fraction of the variance in Y that’s “retained” by the predictions).
- R^2 is the square of the correlation coefficient. Which Shalizi, quoting Tukey, says is also always useless.*
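For anyone who wants to see a couple of these claims in action, here is a minimal R sketch (my own toy simulation, not taken from Shalizi’s chapter). Part (a) illustrates the first bullet: the same correct model yields a high or a near-zero R^2 depending only on the variance of X. Part (b) shows the converse from the same bullet: a wrong (nonlinear) model can still have a high R^2. Part (c) checks the symmetry point from the “variance explained” bullet.
set.seed(1)
#(a) correct linear model; only the variance of X differs between the two fits
x_wide=runif(200,0,10)
x_narrow=runif(200,4.9,5.1)
y_wide=3+2*x_wide+rnorm(200) #same true slope and same error variance in both
y_narrow=3+2*x_narrow+rnorm(200)
summary(lm(y_wide~x_wide))$r.squared #close to 1
summary(lm(y_narrow~x_narrow))$r.squared #close to 0, even though the model is still correct
#(b) wrong (nonlinear) true model over a wide range of X
x=runif(200,0,10)
y=x^2+rnorm(200) #true relationship is quadratic
summary(lm(y~x))$r.squared #still high, even though the linear model is wrong
#(c) R^2 is the same whichever way you regress, and equals the squared correlation
c(summary(lm(y~x))$r.squared, summary(lm(x~y))$r.squared, cor(x,y)^2) #all three identical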
In a footnote, he notes the recent work of Low-Decarie et al., compiling all reported R^2 values in ecology to ask if ecologists have gotten better over time at explaining the phenomena they study. He dismisses such exercises as “pointless”.
None of these technical points are new to me, and I doubt they’re new to anyone who’s had a basic stats course. For instance, when I teach intro biostats I teach that R^2 depends on the range or variance of X, that it depends on how you transform your data, and that it’s the same whether you regress Y on X or vice-versa. But I don’t draw the conclusion that R^2 is useless. Why not?
Mulling it over, I think it’s for a few reasons:
- I think that points 3 and 5 are strawmen. I don’t see why you would want a measure of goodness of fit (or variance retained, or however you want to describe R^2) to be comparable across data transformations, or to be related to interval widths. So I don’t see why anyone should be bothered that R^2 doesn’t do those things.
- Point 8 puzzles me because I can think of various uses of correlation coefficients off the top of my head.** I can’t imagine that Tukey or Shalizi could possibly be forgetting these, though.
- Point 7 mostly seems like semantics to me. Yes, “explained” has causal connotations that “retained” lacks. But I don’t think it’s that big a deal. Insofar as people misinterpret regressions as demonstrating causality, I don’t think it’s because of the words we use to summarize what R^2 means.
- Point 6 isn’t a criticism of R^2.
- I’m still mulling over 1-2. Is it really so bad to have to keep in mind that the ability of your regression model to explain (ok, “explain”) X-Y covariation depends on how much variation in X the model has to work with? I mean, surely the interpretation of mean squared error depends on context too.
- I’m still mulling over 4. This is Shalizi’s strongest argument, I think, especially in combination with 1-2. As I understand him, he’s saying that our usual informal, global understanding of what constitutes a “low” or “high” R^2 value is meaningless. Take, for instance, this post of Brian’s, in which he says that models predicting metabolic rate as a function of body size are much better than models predicting abundance as a function of temperature, since the former have a “high” R^2 of ~0.9 while the latter have a “low” R^2 of ~0.2. I take it Shalizi would say that’s a meaningless, apples-to-oranges comparison. Not only are they different datasets, but those datasets have different variables and presumably different underlying “true” models.
I always find it interesting when I disagree with people I usually agree with. And when two people I usually agree with (here, Brian and Cosma Shalizi) disagree with each other. It really makes me stop and think. So you tell me: is R^2 a zombie idea?***
*Shalizi says covariances are useful, but converting them to correlations is not.
**For instance, in principal components analysis and other dimension reduction techniques, when one has variables measured in different units, one ordinarily wants to do the dimension reduction on the correlation matrix rather than on the covariance matrix. Cosma Shalizi himself does factor analysis on correlation matrices, so I assume he agrees with me that this is a good use of correlation.
***I think I can guess what Brian will say: that an imperfect or limited measure of goodness of fit (or however you want to describe R^2) is still better than none at all. And if you want to get ecologists to stop caring so much about p-values in favor of caring more about goodness of fit, your only hope is to talk them into using a familiar measure of goodness of fit: R^2.
I personally prefer out-of-sample RMSE to R2 as a measure of prediction error too, but I wouldn’t throw out R2! As you guessed I would say, I prefer R2 to p-values in a whole heck of a lot of scenarios.
It may be true that R2 is the same if you flip X & Y (which is only true/relevant in univariate regression, which is pretty rare), but if you are willing to believe that you have the direction of causality set up correctly in your regression (independent variables are explaining dependent rather than vice versa), R2 does give a nice measure of variance explained. Given that the core questions in science are how? and why?, I don’t see how R2 could fail to be a measure of progress in answering those questions. Not to say that regression is the only tool one should bring to bear. Nor to say that a field with R2=0.9 is “better” than a field with R2=0.2 (the first is likely just a field with fewer causal factors controlling the phenomenon of interest). But R2 definitely means something to me.
And for point #1, I have yet to learn the trick of how I can take variables in ecological systems and make their variance arbitrarily small at will. Somebody has to teach me that trick!
And for #3, that is just silly. No metric tells you everything you want to know.
I do not understand the claim that you cannot compare R2 across datasets. I do understand that R2 is a normalized metric, so to the degree that normalizing factors differ you will get different results. But that doesn’t translate into “cannot compare across datasets” for me.
And as for the idea that covariances are more useful than correlation and R2, well, maybe that’s true if you’ve developed deep intuition for how covariances change with the variances of X & Y and have no interest in normalizing to compare across datasets. But that doesn’t describe many people I know.
PS And I don’t think the Low-Decarie paper was truly claiming that the trend in mined R2 values is a metric of scientific progress – or if they were, it was a bit tongue-in-cheek.
“No metric tells you everything you want to know.”
Yes, that’s an undercurrent in Shalizi’s comments with which I definitely disagree. Several of his complaints basically amount to saying “R^2 is misleading or unhelpful if there’s something wrong with your regression model and/or your data”. You fit a linear model to a nonlinear relationship, you only chose to sample a narrow range of X values, etc. Which seems unfair to me. What single statistic tells you everything you want to know–that you’re fitting the right model, that you’ve estimated the parameters precisely, etc.?
The other undercurrent here (I think–what follows is speculative) is that Shalizi cares first and foremost about identifying and estimating the true model (or a sufficiently good approximation to the true model). So he doesn’t really care about R^2 because he doesn’t think R^2 helps you figure out if you’ve identified the true model and gotten precise, unbiased estimates of its parameters. And he worries about R^2 because he worries about cases in which misinterpreting R^2 might mislead you in your search for the true model (e.g., thinking you’ve got the true model just b/c you have a “high” R^2, or thinking you don’t have the true model just b/c you have a “low” R^2).
I think you may be right about the emphasis on “true models”. But to my mind true models exist only in simulated datasets. I cannot convert “true model” into a meaningfully useful phrase in ecology. Dozens of factors likely have non-zero influence (admittedly getting small quickly but non-zero). I wouldn’t even want a “true model” that incorporated all of those.
I am much more interested in discussions about better and worse models and whether a model gets us 70% of the way there or 10% of the way there. These are things for which R2 has some genuine value.
This idea of the true model is common in statistics, which is easy enough to understand, but it is also common in econometrics. It’s one reason they care so much about omitted-variable bias and we don’t. But I have to confess I’ve never understood the idea of the one true model in economics either.
I personally find R2 values useful, but (like p-values) they should be treated with more caution than they have been. I think point #4 is making a very strong point though: comparing two R2 for different data sets is generally not useful in my mind, since R2 is driven so heavily by the range of predictors you have. Take the body size – metabolism comparison, with an R2 of 0.9; it only has that high an R2 when trying to predict metabolism across multiple orders of magnitude. It would be really easy to get an R2 of 0.2 if we just shrunk the range of comparison (say, if we looked at metabolism/body-size comparisons for species between 1 and 10 grams in weight). Further, we could increase the R2 of the abundance/temperature relationship by expanding the range of temperature we looked at: if we include measurements of abundance at absolute zero and the surface of the sun, the R2 value would go up a lot. This relates to Shalizi’s point 1: we can make the variance in our predictors arbitrarily small by choosing a very small range to measure that predictor over.
I definitely don’t think you can say that one field has more causal factors controlling something than another just from looking at R2, given that in the same field, you can get two very different R2 values just by measuring the same thing at different scales.
I think this does point to using fit measures that include explicit units like RMSE rather than unitless R2 metrics (although I’m making myself a hypocrite here, as I never really use them myself!).
Aha! I was hoping someone besides me would suggest that maybe Shalizi has a point with the combination of #1 and #4.
Re: the metabolic rate-body size example, when teaching intro biostats I use the example of the interspecific vs. intraspecific metabolic rate-body size allometry to illustrate the consequences of having a reduced range of variation in the predictor variable.
I’ll be over here while you and Brian duke it out. 🙂
Your point about R2 in allometries depending on the range of the X variable is very true. Thus I would never make a strong statement about an R2 of 0.7 vs 0.6. On the other hand, I think your example of absolute zero and the sun would actually make R2 go down (it introduces nonlinearity). To get R2 in allometry to drop down to, say, 0.3 you typically have to be looking congenerically or even conspecifically, which is really a different question (see the very interesting figure in Tilman et al. 2004, “Does metabolic theory apply to community ecology? It’s a matter of scale”). And I do think the significance of this effect of the range of X, while real, has been exaggerated – we’re talking about changing the range of X by many orders of magnitude. This still leaves me comfortable saying that our ability to explain variance in, say, metabolic rate is fundamentally different than our ability to explain variance in abundance, based on the different R2s.
I don’t think there is a perfect answer or panacea. Despite being a fan of RMSE, it’s not perfect either. An RMSE of 1.4 degrees C could be very good or very bad. For a project I’m on trying to predict ground station temperature by interpolating from stations 100s of km away, it is very good. For trying to predict mammalian body temperature given outside temperature, it is a very bad RMSE. The reason being that variance in temperature at a ground station is large (even day to day) while the variance in mammalian body temperature is low (outside of torpor).
For better and worse, R2 contextualizes its answer relative to the variance in the data (R2 basically compares the noise in Y to the observed variance in Y, which is usually affected by the range of X). Sometimes that’s good. Sometimes that’s bad.
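To make the temperature example concrete, here’s a toy R sketch (made-up numbers, not data from either project): both fits below have an RMSE of about 1.4, but very different R2 values, purely because the variance in Y differs.
set.seed(2)
x1=rnorm(200)
y1=10*x1+rnorm(200,sd=1.4) #high-variance Y (think day-to-day station temperatures)
x2=rnorm(200)
y2=0.5*x2+rnorm(200,sd=1.4) #low-variance Y (think mammalian body temperatures)
sqrt(mean(resid(lm(y1~x1))^2)) #RMSE is about 1.4 here...
sqrt(mean(resid(lm(y2~x2))^2)) #...and about 1.4 here too
summary(lm(y1~x1))$r.squared #but R2 is high here...
summary(lm(y2~x2))$r.squared #...and low here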
@Brian:
This kind of gets into the larger issue of normalization in general. If we want to know if some number is “big” or “small”, we pretty much have to ask “big or small compared to what?” But it’s often not obvious what we should compare it to, and different plausible normalizations can make the same number seem big or small.
In ecology, for instance, I think there’s a 1998 Petraitis book chapter (in Bernardo & Resetarits) which talks about different ways of normalizing interaction strength.
Yeah – and a huge debate in plant competition hinged on whether you normalized competition effects by biomass or not.
Personally I like having both normalized and non-normalized metrics around.
For the absolute zero and the sun issue, I was assuming fitting a non-linear (hump-shaped) model.
I think your point about the context-dependence of whether a given RMSE is good or bad is actually a point in its favour: it forces the user of a model to evaluate whether predictions from a model are good or not for their specific uses. To take an example from my current work: it would be pretty easy to get a decent R2 explaining population densities of a fish species across the whole of North America using only temperature. However, if someone used that model to predict between-lake abundances for lakes a km apart, it likely would make very poor predictions, but they could put false faith in those predictions because of the high R2 of the original model. It would hopefully be more obvious that an RMSE of 5 fish per hectare meant that you were likely making poor predictions when you were only predicting a difference of 2 fish per hectare between the study lakes.
I don’t think we disagree that much. You certainly won’t get me to argue against using RMSE (especially out-of-bag). But I will point out that you were just comparing ratios of RMSE in your argument, which is basically what R2 is doing (MSE of the linear model vs the null model). I think both R2 and RMSE have their place, and as always they work better when people understand what they’re good at and bad at.
@Eric Pedersen:
“I think your point about the context-dependence of whether a given RMSE is good or bad is actually a point in its favour: it forces the user of a model to evaluate whether predictions from a model are good or not for their specific uses.”
This is a good broader point, I think. There’s an argument to be made that good tools are ones that force the user to think a bit. No tool can be used well if used thoughtlessly, and any tool or combination of tools that purports to remove or greatly reduce the need for thought is going to run into problems at some point. Even error checking tools like checklists (which I’m all in favor of: https://dynamicecology.wordpress.com/2014/03/31/the-power-of-checking-all-the-boxes-in-scientific-research-the-example-of-character-displacement/) can lead you into problems if you thoughtlessly rely on them to catch errors they’re not designed to catch, or if they just create risk compensation.
I do think we agree on a lot more than we disagree here; I still think R2 is useful, it just shouldn’t be relied on as heavily as it is as the measure of how good a model is.
For me, I find R2 really useful within a given data set (“huh, those predictors work well for this data!”), more fraught with issues when looking at comparisons at different scales or outside the range of your predictors, and really not useful for comparing studies of very different types.
And Jeremy: I like your point that good tools should force the user to think about what they mean before using them. R2 is most problematic when people use it to mean “this has a low R2, it’s a poor model”, or worse, “this has a high R2, it must be a good model”.
Cheers for sharing this. I think I slightly disagree with some of the interpretations of Shalizi’s concerns. The original text is in quotes:
1) This seems to have more to do with the validity of the model than with its explanatory power. “By making Var [X] small, or σ2 large, we drive R2 towards 0, even when every assumption of the simple linear regression model is correct in every particular.” – I don’t see this as a problem. Just because the assumptions are correct doesn’t mean the model does a good job, so a low R2 isn’t wrong or misleading.
3) I think you could still compare R2s across datasets. Shalizi states that “exactly the same model can have radically different R2 values on different data.” – Again, that’s not a problem. Everyone is happy with the idea that models will fit different data differently well. R2 is a measure of how well the model fits this data; as the previous comment says, it doesn’t tell us which field is better or worse, but I am not sure anyone is using it like that?
I agree with your points over 5-8, so I am very inclined to not abandon R2s.
Admittedly I’m also reluctant to abandon R2s as I’ve recently learnt how to calculate them, and % variance explained for individual fixed effects, for my mixed models, and think it’s provided some really useful insights. I also think that most ecologists will not be that interested in the opinions of someone (Shalizi) who says that correlation coefficients are useless….
As an aside, you can get a range of R2s, and perhaps interpret them as some kind of confidence intervals of your model, if you’ve done it in MCMCglmm or similar, and calculate the R2 over a load of the draws. I did that once and got 95% ranges of 0.314-0.517 for an R2 of 0.406. However, that was over all draws, doing it over the final 1000 or something when it had homed in on better parameter estimates may have been more sensible.
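In case it’s useful to anyone, here’s a rough base-R sketch of that calculation. The object names are invented for illustration: it assumes you’ve already extracted the model’s fixed-effects design matrix (X), a matrix of posterior draws of the fixed effects (beta_draws, one row per draw), and a vector of posterior draws of the residual variance (sigma2_draws) from MCMCglmm or whatever you fit the model with. It also uses just one possible definition of a per-draw R2 (variance of the fitted values over variance of fitted values plus residual variance).
#hypothetical inputs (names invented for illustration):
# X            n x p fixed-effects design matrix
# beta_draws   m x p matrix of posterior draws of the fixed effects
# sigma2_draws length-m vector of posterior draws of the residual variance
r2_draws=sapply(seq_len(nrow(beta_draws)), function(i){
  fitted_i=as.vector(X %*% beta_draws[i,])
  var(fitted_i)/(var(fitted_i)+sigma2_draws[i])
})
quantile(r2_draws, c(0.025,0.5,0.975)) #e.g. a 95% range around a central R2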
“I don’t see this as a problem. Just because the assumptions are correct doesn’t mean the model does a good job”
My sense is that Shalizi would disagree with you there. I think he’d say that, if you’re fitting a regression, and you’ve got enough data to have precise parameter estimates, and the residuals all conform to the model assumptions (so you’re not misdescribing a nonlinear relationship or ignoring heteroscedasticity or etc.), then that’s what it means for the model to “do a good job”. (Though of course, he’d probably also say that some other model might do an even better job on the same data (his point #5, it’s #6 in my list). And he’d probably also say that it’s more important for a model to predict out-of-sample data.) In contrast, I think he’d say that your definition of a model “doing a good job” effectively blames the model for some failings that it shouldn’t be blamed for (such as the range of variation of X that happens to have been sampled).
“R2 is a measure of how well the model fits this data, as the previous comment says it doesn’t tell us which field is better or worse, but I am not sure anyone is using it like that?”
Don’t be so sure: http://ecoevoevoeco.blogspot.ca/2015/08/high-enthusiasm-and-low-r-squared.html
When I teach regression for the first time to my undergraduates, I use Møller (1993; Evolution 47:417-431) Figure 6 as an example of the utility of R^2 in evaluating the usefulness of a regression model and why p-values are not that useful in evaluating regressions. As for Jeremy’s comments on Shalizi’s point 7, and Brian’s second point: seldom do researchers experimentally set the independent variable and then measure the response to that manipulation in the dependent variable, so most of us should be interpreting causality with some caution, even if we think we have gotten the causal relationship correct.
“so most of us should be interpreting causality with some caution, even if we think we have gotten the causal relationship correct.”
Agreed, but that is not an indictment of R2, just a well understood limitation of regression in general.
Yup – I agree that’s usually the case in ecology as it concerns the response variable. Other, more lab-oriented sciences appear otherwise, perhaps because they have more control over the systems they test.
rookie question, but how is R^2 not a measure of goodness of fit? I’m aware of other methods (partitioning mean squared error) for measuring GOF, but is there some other sense of “goodness of fit” being discussed here?
I’m the wrong person to answer since I think it IS a measure of goodness of fit. But I think the two main arguments are that:
a) R2 is a measure of goodness of fit only assuming a linear model is appropriate
b) R2 can vary depending on properties of the data. For example, if you think of a regression scatter plot and then chop off the right half of your data, the R2 will change even though arguably the model still is the same goodness of fit. You can see this since R2=1-MSE/MST (mean squared error over mean squared total). MSE should remain constant while MST is reduced when you chop half of the x-values out. You can also see this with this R code:
x=runif(50,0,10) #50 x values between 0 & 10
y=3+2*x+rnorm(50) #true model: linear, slope 2, N(0,1) errors
fit1=lm(y~x) #fit over the full range of x
summary(fit1) #note the R^2 here...
keep=x<5 #now chop off the right half of the data
x=x[keep]
y=y[keep]
fit2=lm(y~x) #same true model and error variance, half the range of x
summary(fit2) #...and compare the (lower) R^2 here
This is all an indisputable mathematical fact. It’s how important it is that we’re debating. For me, chopping half the data off is pretty radical, but R2 only changed from 0.97 to 0.88 (in my random realization). There is definitely still a signal in R2 that is easier to interpret in many ways (about 90% of all the variation in Y is explained by X alone, assuming I got the order of causation correct) than saying the MSE in both models is about 1 (and hence RMSE, which is the square root of MSE, would also be about 1).
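If you want to check that 1-MSE/MST identity directly on the snippet above (assuming you’ve just run it, so x and y now hold the chopped data and fit2 is the reduced-range fit):
sse=sum(resid(fit2)^2) #sum of squared residuals
sst=sum((y-mean(y))^2) #total sum of squares around the mean of y
1-sse/sst #should match the Multiple R-squared reported by summary(fit2)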
Click through to Shalizi’s chapter for details. But the short version is the reason I gave in the post: R^2 can be near-zero even for a correct model, depending on how much variance in X there happens to be. So if you were to say that a model “is not a good fit” just because the fit has a low R^2, I think Shalizi would say you’re effectively blaming the model for lack of fit that might well not be the model’s fault. This gets back to my comment to Brian earlier, that what Shalizi cares about is whether you’ve identified the “true” model and estimated the parameters accurately and precisely. I think he’d say that, if you’ve done that, then your model is a “good fit” in the only useful sense of “good fit”. Even if the R^2 is low.
Also, I think Shalizi prefers to think of linear regression as a special kind of smoothing procedure. And so prefers to think of R^2 as a measure of variance “retained” or “kept”, rather than a measure of “variance explained” or “goodness of fit”.
I think you explain Cosma’s thoughts accurately, at least as I understand them.
But the low R2 scenario is a bit of a strawman. In the real world, if I have bothered to measure data and analyze it, I have presumably chosen a range of x values that is biologically relevant and of interest (or if it is observational data from a well-designed sample, then x almost by definition contains a real-world-relevant range of x). And if the R2 is then close to zero, it means that the x I have examined is poor at explaining y over the range of x that I decided was biologically relevant. In short, go look for a different explanation for y.
Of course if I design an experiment where I have nitrogen fertilizer inputs that range from 14.9 to 15.1 g/m2 and then conclude that nitrogen inputs have little effect on productivity because of a low R2, then: a) I probably have gotten what I’ve deserved, and b) R2 is right in saying that the variation in nitrogen in my experimental design did a poor job of explaining variation in productivity relative to other factors (presumably soil moisture, plant community composition, etc).
@Brian:
“But the low R2 scenario is a bit of a strawman. In the real world, if I have bothered to measure data and analyze it, I have presumably chosen a range of x values that is biologically relevant and of interest”
Do you think that’s generally true? Honest question. I’m thinking for instance of my example of inter vs. intraspecific allometries. I seem to recall reading more than one paper (and a Stephen J Gould essay) trying to explain why intraspecific body size allometries are flatter than interspecific ones, or taking for granted that they are flatter and using that “fact” to explain something else. And I don’t recall any of those papers ever addressing the fact that, if you reduce the range of X, you’re going to flatten your estimated slope (and reduce your R^2). But I’m totally not an allometry guy, so maybe my memory is foggy.
Interesting example. It is true that if you compare the slope of intraspecific allometries to interspecific ones they can be different. And sometimes flatter, but not necessarily flatter (the lifespan allometry within dogs is negative, in contrast to the overall positive interspecific allometry). But to me that just says that when you change scales, your processes and hence your patterns can change. It is true that if a slope becomes flat, R2 drops. But that to me is correct behavior. If you believe (it is still controversial, I think) that the allometry of brain size on body size is fairly flat intraspecifically, it should have an R2 of zero, as body size does not predict brain size well intraspecifically. But that is about a change in process, and thus a change in slope, with a change in scale.
The scenarios I have in mind are more like my R example or the nitrogen addition example, where you are looking at the same question at the same scale and hence the same process. That is what I think Cosma finds objectionable – that the R2 changes even when the underlying line/slope/error model is constant and only the range of X measured changes. I can see that objection, but I was arguing that we usually measure the range of X we are interested in.
I don’t think that the low R2 scenario Shalizi posits is too much of a strawman. Something I’ve been pondering lately in biogeochemistry work is that it often seems easier to predict soil properties (e.g. soil carbon) over large scales than over small, at least in an R2 sense. For instance, David Schimel and co. in the early ’90s had great success adapting CENTURY to predict soil C across sites in the Great Plains, but that encompassed huge gradients in precipitation and ET, which of course correlate with both plant productivity and soil weathering. I’ve often wondered how much the SAME model would fall off in R2 predicting only at one site, even given parameters fit from a larger dataset. It would be interesting to contrast RMSE versus R2 across scales in this kind of setting…
Chris
But Chris, is that a flaw of R2 or an accurate report by R2 of the different levels of difficulty in identifying the role of a factor over large gradients vs small gradients? I agree it is rare for a model across large scales to apply across small scales, but that is not the fault of R2 – that’s just ecology.
Hi Brian, I think I basically agree with you. Scrolling up I see Eric Pedersen is making much the same point as I had in mind. I think the point is just that in the example I cited, I expect R2 will fall off even if absolute predictive accuracy is unchanged (so the same factors could be operating with the same explanatory power). So yeah, I guess this is just back to using R2 and RMSE appropriately and not confusing them. I like what Eric was saying about using R2 to compare predictors/models for the same data (at the same scale), so I do think it’s a useful goodness-of-fit measure in that comparative sense.
Cheers,
Chris
I’m not thinking this through too deeply, and so may be laughably wrong (when has that ever stopped me before?), but it seems like all the criticisms could also be applied to heritability. Is that useless too? Cue angry quantitative geneticists… 🙂
In my non-quantitative-geneticist understanding, heritability is a trickier concept than is usually taught, especially in contexts where there’s selection too. It’s not always conceptually straightforward to partition total evolutionary change into “selection” and “heritability”. Haygood 2005 Evolution is good on this.
“R^2 doesn’t tell you how big your prediction intervals or confidence intervals are.”
A potential approach around this issue is to establish confidence intervals, and alpha limits, about the means of categorical values of the X-axis prior to any other analyses. In this manner, one can assess the “significance” of those values and decide to include or exclude any of them from subsequent analyses, including simple regression.
Happy Holidays everyone! Perhaps this serves as a holiday gift: a brief but clear blog piece from Paul Allison on the distinction between “predictive” modeling and causal modeling. One aspect of that difference is the contrasting focus on R-square (versus parameter estimates). See http://statisticalhorizons.com/prediction-vs-causation-in-regression-analysis.