Regression through the origin is when you force the intercept of a regression model to equal zero. It’s also known as fitting a model without an intercept (e.g., the intercept-free linear model y=bx is equivalent to the model y=a+bx with a=0).

Every time I’ve seen a regression through the origin, the authors have justified it by saying that they know the true intercept has to be zero, or that allowing a non-zero intercept leads to a nonsensical estimated intercept. For instance, Vellend et al. (2017) say that when regressing change in local species richness vs. the time over which the change occurred, the regression should be forced through the origin because it’s impossible for species richness to change if no time passes. As another example, Caley & Schluter (1997) did linear and nonlinear regressions of local species richness on the richness of the regions in which the localities were embedded. They forced the regressions through the origin because by definition regions have at least as many species as any locality within them, so a species-free region can only contain species-free localities.

Which is wrong, in my view. Ok, choosing to fit a no-intercept model isn’t always a big deal (and in particular I don’t think it’s a big deal in either of the papers mentioned in the previous paragraph). But sometimes it is, and it’s wrong. Merely knowing that the true regression has to pass through the origin is not a good reason to force your estimated regression to do so.

Knowing that the true relationship between your predictors and the expected value of your dependent variable has to pass through the origin *would* be a good reason for forcing the estimated relationship through the origin *if you knew for certain what the true relationship was*. That is, you not only know that Y=F(X) passes through the origin, you know the functional form of F(X) and merely have to estimate its true parameter values.

Of course, we rarely know that in science, and in ecology I’m going to go out on a very short and sturdy limb and say we *never* know that.* Rather, the functional form of our regressions ordinarily is chosen for convenience, or because it fits the observed data reasonably well, or etc. In which case you should not be forcing your intercept through the origin. Forcing your intercept through the origin in such cases is a way of papering over misspecification of your regression. For instance, if your linear regression provides a reasonable-looking fit over the range of X values you observed, but with a nonsensical estimated intercept, that’s your linear regression’s way of telling you that the true shape of F(X) is nonlinear near the origin, outside the range of X values you observed. Forcing it through the origin treats the (most obvious) symptom (the nonsensical estimated intercept) but not the disease (misspecification).

Now, you might respond to this by saying that you don’t care about the form of F(X) outside the range of X values you observed, or near the origin. You’re only forcing the regression through the origin in order to improve the accuracy of your parameter estimates of F(X) over the range of X values you observed, or that you care about. Which sounds plausible–you’re taking advantage of prior information–but is wrong. If F(X) is misspecified, you do *not* necessarily improve your parameter estimates by forcing your misspecified F(X) through the origin. A misspecified F(X) forced through a point through which you know the true F(X) passes will *not* necessarily pass any closer to the true F(X) over any particular range of X values than will a misspecified F(X) not forced through the origin. Especially if the range of X values you care about includes values far from the origin. Indeed, insofar as there’s a possibility that you misspecified F(X), you’re probably better off *not* forcing it through the origin. In most cases, the added flexibility you gain from estimating the intercept should improve the ability of your chosen F(X) to mimic the unknown true F(X) over the range of X values you care about.
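To make this concrete, here’s a minimal simulation sketch (purely hypothetical numbers, assuming Python with numpy): suppose the true F(X)=√X, which passes through the origin but is nonlinear near it, and suppose we only observe X between 4 and 9. Fitting straight lines, the line forced through the origin fits far worse over the observed range than the line with a freely estimated intercept:

```python
import numpy as np

# Hypothetical example: true F(X) = sqrt(X) passes through the origin
# but is nonlinear near it. We only observe X in [4, 9].
x = np.linspace(4, 9, 50)
y = np.sqrt(x)  # true curve, noise-free, to isolate the misspecification effect

# Linear fit WITH a free intercept: y = a + b*x
A = np.column_stack([np.ones_like(x), x])
a, b = np.linalg.lstsq(A, y, rcond=None)[0]
mse_free = np.mean((a + b * x - y) ** 2)

# Linear fit FORCED through the origin: y = b0*x
b0 = np.sum(x * y) / np.sum(x * x)
mse_origin = np.mean((b0 * x - y) ** 2)

print(mse_free, mse_origin)  # the forced fit is far worse over the observed range
```

The freely fitted line tracks the curve closely where there are data; the through-origin line throws that local fit away in order to satisfy a constraint far outside the observed range of X.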

Conversely, you might force F(X) through the origin because you worry about the nonsensical estimate of the intercept that you get otherwise. To which: don’t worry. If all you really care about is estimating the form of F(X) over some range of X values away from the origin, why do you also care if extrapolating your fit to the origin gives you a nonsensical estimate of the intercept? That you know that you’ve estimated the intercept badly does *not* mean that you’ve also estimated F(X) badly over the range of X values you care about, unless you’re sure that F(X) is correctly specified.

To put in Bayesian terms (which you don’t have to, but which I will just for completeness): in forcing your chosen F(X) through the origin, you’re not *just* saying that you have prior information that the true intercept is zero. Implicitly, you’re *also* saying that you have very strong prior information about the values F(X) takes on away from the origin (because your estimates of the other parameters of F(X) will change if you set the intercept of F(X) equal to zero). If the *only* thing you’re sure you know about F(X) is that it passes through the origin, then you either need to figure out a way to incorporate *only* that information into your fitting procedure**, or else you need to *ignore* that information in your fitting procedure.

Even if you’re pretty confident you know the true form of F(X), there might be some problem with your data that prevents you from obtaining a precise, unbiased estimate of the intercept (which you know must be zero). Measurement error or sampling bias or even just a small sample size. In which case I still don’t think you should try to paper over the inadequacies of your data by forcing the regression through the origin. In particular, if you are forcing a linear regression with one predictor variable through the origin in order to save one degree of freedom for your slope estimate, c’mon. It’s *one* degree of freedom. If it makes any appreciable difference, you don’t really have enough data to be drawing robust scientific conclusions about the relationship between your dependent and independent variables, and you shouldn’t be hiding that unhappy fact from yourself by forcing your regression through the origin so as to get P<0.05.

There may be some specific contexts in which there are context-specific reasons for regression through the origin. I can’t think of any off the top of my head but would be happy to hear of examples in the comments.

In writing this post, I’m pretty sure I’m disagreeing with many of my friends. To whom I now give permission, if permission were needed (which it wasn’t) to tell me I’m full of s**t. 😉 The rest of you have permission too, of course. 😉

p.s. Don’t read anything into my choice of examples in this post, they’re just the first two ecology papers I happened to remember that report regressions through the origin. And definitely do not extrapolate from my criticism of that statistical choice to criticism of anything else about those papers. In particular, I agree with every other substantive and technical point in Vellend et al. (2017). Plus, as I said above I think that theirs is a case in which the choice between models with and without an intercept makes no substantive difference to the scientific conclusions.

p.p.s. I’m traveling today and tomorrow, comment moderation may be a bit slow.

*Plus, the rare contexts in which we do know the true form of F(X) for certain also tend to be contexts in which if we estimated the intercept we’d get an estimate very close to zero and not significantly different from zero. So that there’s no need to force the regression through zero in order to get a sensible estimate of the intercept. I’m thinking for instance of various areas of physics in which quantitative physical theory tells us the form of F(X), and we have heaps of data confirming that theory to the umpteenth decimal place.

**Which you can do pretty easily in some cases. For instance, if you’re bothered that a conventional linear regression gives you a negative intercept even though that’s nonsensical or physically impossible, switching to an appropriate generalized linear model with an appropriate link function might well solve the problem. In other cases, maybe there’s some fancypants Bayesian approach that would help? That is, maybe in some cases there’s some clever way to parameterize your model and choose your priors such that putting a strong prior on the intercept doesn’t implicitly put even a weak prior on the form of F(X) away from the intercept? I have no idea.
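As a sketch of that GLM route (hypothetical data; a hand-rolled Poisson fit via iteratively reweighted least squares rather than a packaged GLM routine): with a log link, the fitted mean is exp(linear predictor), so it can never go negative regardless of what value the intercept takes on the link scale:

```python
import numpy as np

# Hypothetical count data with a positive mean everywhere
rng = np.random.default_rng(0)
x = rng.uniform(0, 5, 200)
mu_true = np.exp(0.1 + 0.5 * x)
y = rng.poisson(mu_true)

# Poisson GLM with a log link, fitted by IRLS
X = np.column_stack([np.ones_like(x), x])
beta = np.zeros(2)
for _ in range(25):
    eta = X @ beta
    mu = np.exp(eta)
    z = eta + (y - mu) / mu          # working response
    W = mu                           # working weights
    beta = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * z))

mu_hat = np.exp(X @ beta)
print(mu_hat.min())  # strictly positive by construction: no nonsensical negative fit
```

The intercept on the link scale can be any real number; back-transformed through exp(), the model’s predictions stay in the physically possible range without anything being forced through zero.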

Jeremy, I’d love to take you up on the invitation to tell you you’re full of it; but actually, this is pretty persuasive!

Now I want to just start saying less and less persuasive things until someone finally tells me I’m full of it.

So: insect herbivores clearly don’t regulate plant population dynamics. In fact, insect herbivores are really boring, you should never study them for any reason. In fact, there are no insect herbivores; insects are autotrophs.

Sorry to disappoint you, but the last point is already well established:

http://www.nature.com/news/photosynthesis-like-process-found-in-insects-1.11214

http://dx.doi.org/10.1038/srep00579

I’m right even when I’m trying to be wrong! I feel like Bill Murray in Groundhog Day when he says “I am a god.” 😁

Well played, sir… You know how to push my buttons. Especially with that first sentence, which I keep meaning to write a blog post about 🙂

Not to mention that the relation as you get close to the origin might become highly nonlinear or take on a different functional form for a variety or reasons.

Centering: it works! And makes one stop thinking too much about the origin problem when you don’t need to.

I don’t quite follow. Centering your data changes the location of the origin, but that has nothing to do with forcing a regression to pass through the *original* origin vs. leaving it free not to do so. Am I being dense?

Uncentered regression is

y=a+bx which puts a great deal of emphasis on the y value at x=0 (aka a aka the intercept)

Centered regression is conceptually the alternative line equation, which we probably all learned decades ago:

(y-ymean)=b(x-xmean)

Thus it puts ymean and b as the focal parameters and very much focuses the analysis around the center of your data (xmean), not x=0, which as you note could be completely outside your data range (on both the x and y axes). It is doing everything you are asking for.
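A quick sketch of the equivalence (hypothetical data, assuming numpy): the centered slope is numerically identical to the slope from an ordinary regression with an intercept; centering just shifts the focal point of the fit to (xmean, ymean):

```python
import numpy as np

# Hypothetical data well away from x = 0
rng = np.random.default_rng(1)
x = rng.uniform(10, 20, 100)
y = 3.0 + 0.7 * x + rng.normal(0, 0.5, 100)

# Ordinary regression with an intercept: y = a + b*x
b, a = np.polyfit(x, y, 1)  # polyfit returns (slope, intercept) for degree 1

# Centered regression: (y - ymean) = b_c * (x - xmean), no separate intercept
xc, yc = x - x.mean(), y - y.mean()
b_c = np.sum(xc * yc) / np.sum(xc * xc)

print(b, b_c)  # identical slopes; the centered line passes through (xmean, ymean)
```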

I don’t think your argument necessarily rules out the case where there is a strong expectation (theoretical or empirical) for something that is linear (your F(x)=a+bx) and is known to have to go through 0,0.

More generally I agree with Jarrett about centering as an underappreciated solution (although it does have costs of interpretability)

“I don’t think your argument necessarily rules out the case where there is a strong expectation (theoretical or empirical) for something that is linear (your F(x)=a+bx) and is known to have to go through 0,0.”

So you’re saying that there are cases in ecology where we know the true form of F(X) for certain, or at least we have very good reason to think we do? If you really do have a very good reason to think F(X) truly is linear (or whatever) as well as passing through the origin, then sure, force it through the origin. I can’t think of any such cases in ecology off the top of my head, but maybe there are a few.

I’d only add that “A linear model is a good fit to my data over the observed range of X” is *not* a good reason to think that F(X) is linear near the origin unless the observed range of F(X) includes sufficient data near the origin. The quality of your fit does not *on its own* ever justify extrapolating your fitted model beyond the observed range of the data.

Perhaps we should do a follow-up post on the trade-offs involved in centering your regression. Not something I’ve ever thought about, so somebody besides me should probably write it. Would need examples illustrating the interpretability issue. Any candidate examples come immediately to mind?

I see value in many cases for centering the data prior to a regression, but this only ensures that the intercept corresponds to the expected value of y at the grand mean of the x values. It does not provide a y-intercept that corresponds to the scenario when x=0 and does not produce a regression line where the x and y variables are forced to both be equal to zero. If one forces the regression line to pass through 0,0 when using centered x data one is basically stating that the expected value of y is 0 for the average value of the x variable. If one centers both the x and y variables, one wouldn’t have to specify that the regression runs through 0,0 because by definition the regression line passes through the grand means of the x and y values. Consequently, I don’t think centering fixes the issue identified by Jeremy. Am I wrong here?

@Dave:

As I just said to Jarrett above, I think you’re basically right about centering. I’m happy for the conversation to move to a discussion of centering if that’s what folks want to talk about. But I don’t see how centering one’s variables addresses the issue raised in the post. Centering your data on a new origin (whether the grand means of X and Y, or somewhere else) doesn’t have anything to do with whether or not you should force your regression to pass through the *original* origin because you know the true regression has to do so. Does it? Am I being really dense here?

So why not just do spline regression (e.g. GAM or LOWESS) if you don’t think a linear response is ever a valid assumption?

@Brian:

Hmm, I think we’re drifting off into a different issue. I’m fine with fitting regressions of prespecified functional form (e.g., linear), and I think that’s useful for various reasons even when you don’t have strong reason to think your chosen functional form is the true form.

Thinking about it further, I can see where centering your data addresses the issue raised in the post in some cases. If the investigator is worried in some vague way about getting a non-zero intercept estimate, well, one cure for that worry is to center the data so as to call the investigator’s attention away from whether the regression passes through the originally-specified origin. Distracting people from thinking about whatever’s worrying them is certainly one way to keep them from worrying. That’s fine as far as it goes. But I still don’t think it addresses what I think is the more common case of an investigator who insists that F(X) must pass through the (original) origin but who doesn’t know much else about the true shape of F(X). But perhaps I’m wrong to think that this case is the more common one; obviously just going on gut feeling on that.

One example where we ‘know the true form’ or at least something close in ecology might be predator functional response models. And the (usually nonlinear) regression should go through the origin: no prey available, no prey eaten.

Yeah, that’s one ecological case where one could argue that we know the true functional form. At least in sufficiently-controlled and well-studied contexts.

Even in the situation you describe Brian, you probably don’t want to force the fitted regression line through y = 0 at x = 0 because the least squares estimates of the other parameter(s) will no longer be unbiased. That seems a heavy price to pay just to meet some theoretical nicety.

Meh, unbiasedness is not all it’s cracked up to be. It doesn’t mean what the common parlance usage suggests. Better is better, and unfortunately bias is just one of many criteria to assess “better”. If there is strong theoretical justification or practical necessity to fit through the origin, that overrides any concern about “bias” in other coefficients to my mind…

Also, something I should have noted in the post: omitting the intercept from your linear regression model may well increase the R^2, but that does *not* indicate an improved fit! R^2 in a linear model with an intercept is measuring the explained variation as a fraction of the total variation around the grand mean. R^2 in a linear model forced through the origin is measuring the explained variation as a fraction of the total variation *around zero*. Usually, total variation around zero is much higher than total variation around the grand mean, so it’s very easy for a model to explain a lot of it and thus produce a high R^2.
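A sketch of how dramatic this can be (hypothetical data far from the origin, assuming numpy): a weak relationship gives a small conventional R^2, while the no-intercept “R^2” computed around zero looks spectacular even though the forced fit is much worse:

```python
import numpy as np

# Hypothetical data: weak relationship, far from the origin
rng = np.random.default_rng(2)
x = rng.uniform(50, 60, 100)
y = 100 + 0.1 * x + rng.normal(0, 1, 100)

# Conventional R^2 (model with intercept): variation around the grand mean
b, a = np.polyfit(x, y, 1)
ss_res = np.sum((y - (a + b * x)) ** 2)
r2_with = 1 - ss_res / np.sum((y - y.mean()) ** 2)

# "R^2" for the no-intercept fit as usually reported: variation around ZERO
b0 = np.sum(x * y) / np.sum(x * x)
ss_res0 = np.sum((y - b0 * x) ** 2)
r2_origin = 1 - ss_res0 / np.sum(y ** 2)

print(r2_with, r2_origin)  # the no-intercept "R^2" is far higher despite a worse fit
```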

You accurately describe the way R calculates R2 when a regression is forced through zero, but I don’t think it is the only or best way to calculate R2 when a regression is forced through zero. One can still use the sum of squared residuals compared to a horizontal line through ymean, in which case the R2 will be worse in a regression without an intercept than in a regression with an intercept. I think this is a much more useful R2 in intercept-less regressions.

I can’t find it now but a post of mine to R help asking why in the world R chose to calculate R2 that way in a regression without an intercept was my first introduction to just how rude the R help world was.

I agree that your alternate R2 calculation seems more sensible than the usual calculation when a regression is forced through zero.
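Here’s a quick sketch of that alternative calculation (hypothetical data, assuming numpy): comparing the forced fit’s residuals to a flat line through ymean can give a strongly negative R2, correctly flagging that the forced fit is worse than using no predictor at all:

```python
import numpy as np

# Hypothetical data: weak relationship, far from the origin
rng = np.random.default_rng(2)
x = rng.uniform(50, 60, 100)
y = 100 + 0.1 * x + rng.normal(0, 1, 100)

b0 = np.sum(x * y) / np.sum(x * x)  # no-intercept least-squares slope
ss_res0 = np.sum((y - b0 * x) ** 2)

# Alternative R^2: compare residuals to a horizontal line through ymean,
# just as one would for a model with an intercept
r2_alt = 1 - ss_res0 / np.sum((y - y.mean()) ** 2)
print(r2_alt)  # strongly negative here: the forced fit is worse than a flat line
```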

In my undergraduate thesis work I noticed that forcing a linear regression through 0 increased the R^2 for my data. At the time, I tried to figure out what was happening and searched many sources, but in the end I just accepted that result, even feeling that I was missing something.

Five years later, I finally understand. Thank you for that.

Also, I now know that I should not have forced the regression through zero.

To apply the spline approach, wouldn’t it be necessary to have x data values between 0 and the smallest possible x value that you have in order to extend the regression line to the y-intercept to evaluate if the relationship changes non-linearly as you get closer to zero?

I guess one other point to consider is whether we are really interested in scenarios where x=0 because those scenarios never occur in nature. For example, in species-area relationships one could claim that the number of species in a 0 meter square area is 0 and force the regression through the origin. However, there is no place on our planet with an area of zero for which people obtain estimates of diversity. For some taxonomic groups (e.g., birds), the minimum possible area needed to sustain a single species will likely be some value of area that is much greater than 0, and forcing the line through the origin misses this point because we assume that an x value of 0 is the only possible value of x that also produces an expected y-value of 0. People are just not sampling ridiculously small patches of land to estimate bird diversity because people know that they won’t find birds in incredibly small patches. We are likely better off not making assumptions beyond our inference space, as suggested by Jeremy, in a situation like this.

Similarly, in relationships that describe how some trait (e.g., metabolic rate) varies with body mass, one could argue that an individual with a body mass=0 has a metabolic rate of 0 but in actuality there is no such thing as an animal with no body mass. The metabolic rate of an individual with zero mass is undefined rather than 0 and forcing the regression through the origin is not helpful in a scenario like this either.

Regarding the functional response analogy. We don’t really know what the true form of the functional response is. There are three types of functional response. Furthermore, I would argue that extending a predator’s functional response on a prey species to the scenario in which no prey are present is not useful. If there are no prey present the predators never have an opportunity to kill prey, so a prey’s risk of mortality to predators in this scenario is undefined. As long as we have a single prey present we can assess risk of mortality to predators, but I don’t think that we should make inferences about risk of mortality to predators when it is never possible to assess that risk of mortality.

“I guess one other point to consider is whether we are really interested in scenarios where x=0 because those scenarios never occur in nature”

In general, I’m sympathetic to this form of argument, but it’s not terribly difficult to come up with scenarios where one might be interested in y when x = 0.

1) Some scales go negative! When regressing anything against temperature, zero is a perfectly reasonable value (at least in C or F for biologists).

2) When regressing densities of two different taxa (such as predators and prey) against each other, it’s possible for one to go to zero even at reasonable sampling scales (e.g. a small island).

3) When regressing against light intensity over short time periods (such as when measuring photosynthetic efficiency), zero implies darkness and we are frequently interested in what happens at this value.

Sorry, I didn’t address your predator-prey point well. But for my point #2 you could imagine regressing two competing or mutualistic taxa against each other instead.

Sorry, I wasn’t clear when I referred to “where x=0”. I was intending to mean when x=0 while we are attempting to force the regression line through the axis intercept of 0,0. This would not preclude the ability to perform or interpret regressions in your case 1 or case 3.

In your case 2, I don’t think you would want to force the regression through the origin either. Just because 1 species is absent from a particular place in nature does not mean that a second species will be absent so we should not force the regression here either. The only exceptions to this that I can think of off the top of my head is if one of the species can only ever exist in a place if individuals of the second species occurs there (i.e., cannot be supported by any other species). For two species competing with each other it would not be wise to force the regression through the origin when regressing the density of one species against the other – the population size of one species should move further away from the origin as the population size of the other species gets closer to the origin.

I can see exceptions where forcing the regression through the origin (as described by imyerssmith below) is warranted, but I also think there are lots of scenarios where one might think it is okay to run the regression through the origin but it may not be the best approach.

Thanks for the clarification, David. I completely agree and wouldn’t want to force a regression through the origin in any of the examples I gave either.

While not wanting to “throw a wrench in the works” or “derail the train of thought”- it occurs to me that other issue(s) could impact estimation of the intercept. As is the case for almost all environmental/ ecological data- spatial and/ or temporal autocorrelation can introduce challenges when it comes to estimating the intercept. Autocorrelation does not bias regression coefficient estimates, but the standard errors are usually under-estimated- meaning that the observed intercept is very often shifted from its “true” value. I mention this because investigators might be better off devoting attention to the very real potential of autocorrelation masking the true value of the intercept- whether or not it is expected to pass through the origin.

Brian has a couple of old posts on this:

https://dynamicecology.wordpress.com/2013/10/02/autocorrelation-friend-or-foe/

https://dynamicecology.wordpress.com/2012/09/24/why-ols-estimator-is-an-unbiased-estimator-for-gls/

Thanks for the heads up, Jeremy! Brian makes some really insightful and groundbreaking points- especially in the first post you mention. “That is the mean, the variance and the covariance (or correlation) are all constant across space and time.” My group has delved into this to some degree- which was why I had mentioned potential effects on the intercept. It turns out we have just revealed that the covariance does not always remain constant, as was once thought. Anyway I am drifting away from your point- which is a very good one.

My take on modelling assumptions in ecology is you can get it “wrong” both by making assumptions and by ignoring the assumptions that you have implicitly made.

In ecology (and in all sciences), I think we need to design models that take into account the appropriate structure of the data that we are modelling. When you are comparing a plot or sample to itself in, say, a duration analysis of time series data (as in Vellend et al. 2017, where I am a co-author), a zero intercept does logically describe the relationship between diversity change and duration of the study. By not setting the intercept to zero you are also making an assumption about the structure of the data/relationship, and that assumption is being carried through to your final estimate in your statistical test. I agree that setting an intercept to zero will influence the fit of a linear model near zero and that this is a modelling assumption that should be undertaken carefully. What we need to be very careful to avoid is testing out different model assumptions until we find the result we are “looking for”; every assumption is a researcher degree of freedom (http://andrewgelman.com/2012/11/01/researcher-degrees-of-freedom/) that needs to be taken into account to avoid p-hacking (http://www.sciencedirect.com/science/article/pii/S0169534716300957).

From my co-author Lander Baeten: “In Vellend et al. 2017, we did not assume one particular form of F(X). Instead, we applied two alternative forms (one with and one without intercept) and presented the results for both models. Both forms make strong assumptions about the data, which we explained in the text. Given that the interpretation changes quite a bit when making different assumptions about the form of f(X), it is tricky to make strong conclusions about the parameters in one particular model specification (i.e. the negative slope in a model with intercept).”

I worry that we don’t spend enough time in ecology thinking about what assumptions our statistical analyses do and do not make and which are and are not appropriate. It goes both ways, sometimes we make assumptions that are not appropriate (e.g., assuming that a relationship is linear when it isn’t, assuming that variance is homogeneous when it isn’t, not taking into account spatial or temporal hierarchy in data, etc.) and sometimes we omit to make assumptions that do describe the relationships that we are trying to test (e.g., not setting an intercept to zero).

“Now, you might respond to this by saying that you don’t care about the form of F(X) outside the range of X values you observed” – Ideally, it is never a good idea to assume anything about the shape of relationships outside of the range of data being tested.

“you’re taking advantage of prior information” – Yes, we do need to take advantage of prior information and be really critical about what are the properties of the data that we are statistically analyzing and what are the assumptions in the models that we are testing.

I believe that the field of ecology is slowly moving away from the parametric statistical framework that we have historically been using towards Bayesian approaches that allow us to be much more specific about the distributions of the data we are analyzing, the specific structures of the spatial and temporal hierarchy in our data and the prior information that we have available to inform our analyses. As we move towards Bayesian frameworks, we need to become more and more explicit about the assumptions we are making and also those that we omit to make. Along with this greater critical assessment will come a better understanding of the ecological relationships that we are statistically testing.

P.S. I am a fan of centering in many situations, but as with all model assumptions it needs to be undertaken with some thought to how the resulting statistical test should be interpreted.

I absolutely agree that one should not be trying models both with and without an intercept and then reporting whichever one gives you the answer you were “looking for”. As you say, that’s p-hacking.

I also agree that it’s important to understand the assumptions one is making and their consequences. I hope that was the take-home message of the post.

I hear the term “p-hacking” thrown around quite a lot- and so I am curious if there is a settled definition for the term. I understand that when someone goes on a fishing expedition, runs a series of tests, and then ignores all of the negative outcomes while reporting only a positive result- that this is p-hacking. Often- depending upon the question- I might opt to run several tests concerning a particular question as a means to either make a better-informed decision, or to validate a result. For instance- there are several tests for normality- but as you might suspect, they do not always agree with one another. Thus, I typically run three normality tests on any given distribution, and consider it normal if at least 2 of 3 tests indicate as such. I feel like I have more confidence in deciding if the distribution is normal or not- but I was curious if that would qualify as p-hacking. When I assess correlations, I almost always assess Pearson’s r, Spearman’s rho and Kendall’s tau for any given set of data (or omit Pearson’s r if the data are not normal). Rarely is there disagreement, but I do this in part out of curiosity and also to discern potential weaknesses in the data. Whenever I run multiple tests, I always report the outcomes for all of them (good, bad or otherwise)- so I do not feel like I am p-hacking… but, would like to know if I am wrong about that. Thanks.