Regression through the origin is regression in which you force the intercept of the model to equal zero. It’s also known as fitting a model without an intercept (e.g., the intercept-free linear model y=bx is equivalent to the model y=a+bx with a=0).
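In least-squares terms, the only difference is whether the design matrix includes a column of ones. Here's a minimal sketch in Python with numpy (simulated data, nothing to do with the papers discussed below) showing that dropping the intercept doesn't just set a=0; it changes the estimated slope too:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(1, 10, size=50)
y = 2.0 + 3.0 * x + rng.normal(0, 1, size=50)  # true intercept is 2, not 0

# With intercept: design matrix has a column of ones plus the x column
X_full = np.column_stack([np.ones_like(x), x])
a_hat, b_hat = np.linalg.lstsq(X_full, y, rcond=None)[0]

# Through the origin: design matrix is just the x column
b0_hat = np.linalg.lstsq(x[:, None], y, rcond=None)[0][0]

print(f"with intercept:  a = {a_hat:.2f}, b = {b_hat:.2f}")
print(f"through origin:  b = {b0_hat:.2f}")  # the slope absorbs the omitted intercept
```

Because the true intercept here is positive and the x values are positive, the no-intercept fit inflates the slope to compensate.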
Every time I’ve seen a regression through the origin, the authors have justified it by saying that they know the true intercept has to be zero, or that allowing a non-zero intercept leads to a nonsensical estimated intercept. For instance, Vellend et al. (2017) say that when regressing change in local species richness vs. the time over which the change occurred, the regression should be forced through the origin because it’s impossible for species richness to change if no time passes. As another example, Caley & Schluter (1997) did linear and nonlinear regressions of local species richness on the richness of the regions in which the localities were embedded. They forced the regressions through the origin because by definition regions have at least as many species as any locality within them, so a species-free region can only contain species-free localities.
Which is wrong, in my view. Ok, choosing to fit a no-intercept model isn’t always a big deal (and in particular I don’t think it’s a big deal in either of the papers mentioned in the previous paragraph). But sometimes it is, and it’s wrong. Merely knowing that the true regression has to pass through the origin is not a good reason to force your estimated regression to do so.
Knowing that the true relationship between your predictors and the expected value of your dependent variable has to pass through the origin would be a good reason for forcing the estimated relationship through the origin if you knew for certain what the true relationship was. That is, you not only know that Y=F(X) passes through the origin, you know the functional form of F(X) and merely have to estimate its true parameter values.
Of course, we rarely know that in science, and in ecology I’m going to go out on a very short and sturdy limb and say we never know that.* Rather, the functional form of our regressions ordinarily is chosen for convenience, or because it fits the observed data reasonably well, and so on. In which case you should not be forcing your intercept through the origin. Forcing your intercept through the origin in such cases is a way of papering over misspecification of your regression. For instance, if your linear regression provides a reasonable-looking fit over the range of X values you observed, but with a nonsensical estimated intercept, that’s your linear regression’s way of telling you that the true shape of F(X) is nonlinear near the origin, outside the range of X values you observed. Forcing it through the origin treats the (most obvious) symptom (the nonsensical estimated intercept) but not the disease (misspecification).
Now, you might respond to this by saying that you don’t care about the form of F(X) outside the range of X values you observed, or near the origin. You’re only forcing the regression through the origin in order to improve the accuracy of your parameter estimates of F(X) over the range of X values you observed, or that you care about. Which sounds plausible–you’re taking advantage of prior information–but is wrong. If F(X) is misspecified, you do not necessarily improve your parameter estimates by forcing your misspecified F(X) through the origin. A misspecified F(X) forced through a point through which you know the true F(X) passes will not necessarily pass any closer to the true F(X) over any particular range of X values than will a misspecified F(X) not forced through the origin. Especially if the range of X values you care about includes values far from the origin. Indeed, insofar as there’s a possibility that you misspecified F(X), you’re probably better off not forcing it through the origin. In most cases, the added flexibility you gain from estimating the intercept should improve the ability of your chosen F(X) to mimic the unknown true F(X) over the range of X values you care about.
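To make that concrete, here's a toy simulation in Python (the saturating curve is just an arbitrary example I made up, not meant to represent any real dataset): the true F(X) passes through the origin, the observed X values sit far from it, and the misspecified linear fit forced through the origin ends up farther from the true curve over the observed range than the fit with a freely estimated intercept.

```python
import numpy as np

rng = np.random.default_rng(1)

def f_true(x):
    # Saturating curve: passes through the origin, but is nonlinear near it
    return 10 * x / (2 + x)

x = rng.uniform(5, 15, size=200)  # observed X values, all far from the origin
y = f_true(x) + rng.normal(0, 0.5, size=200)

# Misspecified linear model, fit with and without an intercept
X_full = np.column_stack([np.ones_like(x), x])
a_hat, b_hat = np.linalg.lstsq(X_full, y, rcond=None)[0]
b0_hat = np.linalg.lstsq(x[:, None], y, rcond=None)[0][0]

# Compare each fitted line to the true curve over the observed range only
grid = np.linspace(5, 15, 100)
rmse_free = np.sqrt(np.mean((a_hat + b_hat * grid - f_true(grid)) ** 2))
rmse_forced = np.sqrt(np.mean((b0_hat * grid - f_true(grid)) ** 2))

print(f"RMSE vs. true curve, intercept estimated:     {rmse_free:.3f}")
print(f"RMSE vs. true curve, forced through origin:   {rmse_forced:.3f}")
```

Even though the forced-origin line agrees with the true F(X) at X=0, it's a worse approximation everywhere the data actually are, because passing through the origin costs it the flexibility to track the curve over the observed range.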
Conversely, you might force F(X) through the origin because you worry about the nonsensical estimate of the intercept that you get otherwise. To which: don’t worry. If all you really care about is estimating the form of F(X) over some range of X values away from the origin, why do you also care if extrapolating your fit to the origin gives you a nonsensical estimated intercept? That you know that you’ve estimated the intercept badly does not mean that you’ve also estimated F(X) badly over the range of X values you care about, unless you’re sure that F(X) is correctly specified.
To put in Bayesian terms (which you don’t have to, but which I will just for completeness): in forcing your chosen F(X) through the origin, you’re not just saying that you have prior information that the true intercept is zero. Implicitly, you’re also saying that you have very strong prior information about the values F(X) takes on away from the origin (because your estimates of the other parameters of F(X) will change if you set the intercept of F(X) equal to zero). If the only thing you’re sure you know about F(X) is that it passes through the origin, then you either need to figure out a way to incorporate only that information into your fitting procedure**, or else you need to ignore that information in your fitting procedure.
Even if you’re pretty confident you know the true form of F(X), there might be some problem with your data that prevents you from obtaining a precise, unbiased estimate of the intercept (which you know must be zero). Measurement error or sampling bias or even just a small sample size. In which case I still don’t think you should try to paper over the inadequacies of your data by forcing the regression through the origin. In particular, if you are forcing a linear regression with one predictor variable through the origin in order to save one degree of freedom for your slope estimate, c’mon. It’s one degree of freedom. If it makes any appreciable difference, you don’t really have enough data to be drawing robust scientific conclusions about the relationship between your dependent and independent variables, and you shouldn’t be hiding that unhappy fact from yourself by forcing your regression through the origin so as to get P<0.05.
There may be some specific contexts in which there are context-specific reasons for regression through the origin. I can’t think of any off the top of my head but would be happy to hear of examples in the comments.
In writing this post, I’m pretty sure I’m disagreeing with many of my friends. To whom I now give permission, if permission were needed (which it wasn’t) to tell me I’m full of s**t. 😉 The rest of you have permission too, of course. 😉
p.s. Don’t read anything into my choice of examples in this post; they’re just the first two ecology papers I happened to remember that report regressions through the origin. And definitely do not extrapolate from my criticism of that statistical choice to criticism of anything else about those papers. In particular, I agree with every other substantive and technical point in Vellend et al. (2017). Plus, as I said above I think that theirs is a case in which the choice between models with and without an intercept makes no substantive difference to the scientific conclusions.
p.p.s. I’m traveling today and tomorrow, so comment moderation may be a bit slow.
*Plus, the rare contexts in which we do know the true form of F(X) for certain also tend to be contexts in which if we estimated the intercept we’d get an estimate very close to zero and not significantly different from zero. So that there’s no need to force the regression through zero in order to get a sensible estimate of the intercept. I’m thinking for instance of various areas of physics in which quantitative physical theory tells us the form of F(X), and we have heaps of data confirming that theory to the umpteenth decimal place.
**Which you can do pretty easily in some cases. For instance, if you’re bothered that a conventional linear regression gives you a negative intercept even though that’s nonsensical or physically impossible, switching to an appropriate generalized linear model with an appropriate link function might well solve the problem. In other cases, maybe there’s some fancypants Bayesian approach that would help? That is, maybe in some cases there’s some clever way to parameterize your model and choose your priors such that putting a strong prior on the intercept doesn’t implicitly put even a weak prior on the form of F(X) away from the intercept? I have no idea.
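For instance (a hypothetical sketch, not drawn from either paper discussed above): count data, like species richness, whose true mean is always positive. Ordinary least squares can happily estimate a negative intercept, while a Poisson GLM with a log link cannot produce a negative fitted mean by construction, because the mean is modeled as exp(a + bX). The iteratively reweighted least squares fit below is hand-rolled in numpy just to keep the example self-contained; in practice you'd use an existing GLM routine (e.g., in statsmodels or R).

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=200)
mu_true = np.exp(-1.0 + 0.5 * x)  # true mean: always positive, near zero at x=0
y = rng.poisson(mu_true)

X = np.column_stack([np.ones_like(x), x])

# Ordinary least squares: can (and here does) estimate a negative intercept,
# which is nonsensical for count data
a_ols, b_ols = np.linalg.lstsq(X, y, rcond=None)[0]

# Poisson GLM with log link, fit by iteratively reweighted least squares (IRLS).
# The log link guarantees every fitted mean exp(a + b*x) is strictly positive.
beta = np.linalg.lstsq(X, np.log(y + 0.5), rcond=None)[0]  # rough starting values
for _ in range(25):
    eta = np.clip(X @ beta, -20, 20)  # linear predictor (clipped to avoid overflow)
    mu = np.exp(eta)
    z = eta + (y - mu) / mu           # working response for the log link
    W = mu                            # IRLS weights: 1 / (var * link-deriv^2) = mu
    beta = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * z))

fitted = np.exp(np.clip(X @ beta, -20, 20))
print(f"OLS intercept: {a_ols:.2f}")
print("all GLM fitted means positive:", bool(np.all(fitted > 0)))  # True
```

Note that the GLM doesn't force anything through the origin; it just parameterizes the mean so that impossible (negative) values can't occur, which is the kind of targeted use of prior knowledge I have in mind.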