Why OLS estimator is an unbiased estimator for GLS

In a recent post, I discussed whether it was important to use special methods to correct for autocorrelation be it correlation in spatial, temporal or phylogenetic distance.

There seemed so much confusion on this point, I thought I would do a little explaining.

First point is that the way one corrects for autocorrelation is to move from the ordinary least square (OLS) regression model of

Y=XΒ+ε

to the generalized least squares (GLS) model of

Y=XΒ+η

Looks pretty similar doesn’t it? It is except that the assumptions about the error term (aka residuals) denoted here as ε and η are different.

First in both OLS and GLS, we assume E(ε)=0 and E(η)=0 (For those unfamilar with it, E stands for expectation and indicates the average or expected value). Still the same here.

In OLS we assume Var(ε)=σ²I where I is the identity matrix (1’s on diagonal, 0’s off the diagonal). Since we have 0’s off the diagonal the covariance/correlation between any two errors is zero. Since the diagonal is all the same the variance of each error is the same, or in fancy greek words homoscedastic.

But in GLS we assume Var(η)=σ²Σ where Σ is the covariance matrix (a matrix that is symmetric but which can have non-zero values off the diagonal). So the 2nd row, 3rd column of Σ has the covariance between the error term for the 2nd data point (of X) and the 3rd point (which is of course the same as the error covariance between the error term of the 3rd point and the 2nd point and hence the same as the 3rd column, 2nd row of Σ, making Σ symmetric).

Or to summarize:

OLS: Y=XΒ+ε where ε~N(0,σ²I)
GLS: Y=XΒ+η where η~N(0,σ²Σ)

and ~N(0,σ) means distributed normally (Guassian distribution) with mean 0 and variance σ

Its not really my goal here to explain how to do spatial regression or phylogenetic regression, but briefly these techniques build a model for the structure of Σ where the covariance is a function of the distance between two points (denoted by d) and one parameter reflecting the strength of autocorrelation (denoted by k), so the 2nd row, 3rd column of Σ is a function of the distance between the 2nd and 3rd point (be it distance in physical space, time, or phylogenetic distance≈time). Which function depends on how we think covariance changes with distance. So for example, for spatial autoregression we could assume covariance decays exponentially with distance, i.e. cov=exp(-kd) (see section 5.3.2 of Pinheiro & Bates book on mixed models for a GLS oriented verison of this)* or for phylogenies (see section 2 of this) assuming Brownian motion evolution cov=kd. But this should give a quick idea of how GLS does with autocorrelation in cases of ecological interest.

Now what I said that so many people didn’t understand is that the OLS estimator of β denoted by a hat (\hat{\beta}) is also the unbiased estimator of the GLS problem. That’s right solve the OLS problem ignoring the autocorrelation and it will give an unbiased estimate for the slope and intercept even under the GLS formulation with the autocorrrelation/covariances .

This is widely known in statistics. For example, p. 224 of Batalgi’s 2011 book Econometrics  or p. 224 of the classic book by Draper 1998 on Applied Regression Analysis (NB: not a typo – it is on page 224 of both books – a cosmic convergence!)

However, many readers were unhappy with citations, so I also provide a proof (skip this paragraph if you want!). To start, we know that the solution of the OLS problem (or estimator of β called \hat{\beta} or “beta hat”) is \hat{\beta}=(X’X)-1X’Y (yeah the matrix notation is getting heavy). Now the bias of this estimator has a formal definition:
Bias(\hat{\beta})=E(\hat{\beta})-β=E(\hat{\beta}-β)
So now we just do algebra:
Bias(\hat{\beta})=E(\hat{\beta}-β)=E[(X’X)-1X’Y-β]=E[(X’X)-1X'(Xβ+η)-β]
All we’ve done so far is substitute in our estimator for (X’X)-1XY and then substitute in our formula for Y from the GLS formulation, so continuing with simple algebraic manipulation:
Bias(\hat{\beta})=E[(X’X)-1X'(Xβ+η)-β]=E[(X’X)-1X’Xβ+(X’X)-1Xη-β]=E[(X’X)-1(X’X)β-β+(X’X)-1Xη)]=E[β-β-(X’X)-1Xη]=E[(X’X)-1Xη]=(X’X)-1X E(η)=(X’X)-1X 0 = 0
therefore Bias(\hat{\beta})=0 aka unbiased!

Now lets pull back. What we did was take the OLS estimator of β and calculate its bias assuming we’re solving the GLS model. And we showed the bias was zero. This shows the point estimator of the OLS is unbiased even if the underlying data follows the GLS model with covariance between points. Which was my claim.

This followed entirely because E(η)=0. This is just a fancy of way of saying the average error term is zero or the GLS line is centered between the error terms, or in other words, the sum of the residuals is zero. This property is enough to give us the OLS estimator being unbiased for ANY linear regression model. We can assume anything we want about the covariance of the error terms and still get the same answer. In hindsight this is intuitive.

Now, I didn’t begin to touch the variance of estimator \hat{\beta}. That is also just algebra, but more messy, and here the OLS and GLS solutions diverge because they depend on Var(ε) which is difference than Var(η). In principal the variance of the OLS estimate used when there is covariance of errors has to be larger than the variance of the GLS estimate (the GLS estimator is in an exact sense the best estimator in this case). In practice, this assumes that covariance matrix Σ is known which it never is which allows either to be bigger (a bad estimate of the covariance can make the GLS estimate have larger variance around \hat{\beta}). In my experience in practice the OLS estimate has a little more variance than the GLS but not a lot. To summarize again, the confidence interval (and p-value) for the OLS estimator are wrong in a GLS world. Personally I find this easiest to think about as the number of degrees of freedom being overestimated**.

So what if the estimate is unbiased but may have a bigger variance? Well that was the point of my previous post. The variance of the estimator can show up in several ways:

  1. in the confidence interval which is calculated/reported to be a bit smaller than the truth
  2. in the likelihood (and hence AIC value)
  3. in the p-value which is really a bit bigger than reported.

Now if one is focused on the confidence interval this is a real problem and one ought to use GLS if possible. If one is using AIC to, for example, compare models with and without GLS it is a real problem (but might not be a problem if one is just using AIC to select predictor variables and all models are run with OLS estimators and hence doing an apples-to-apples comparison). And, most importantly, in what I believe to be by far the most common case, where one is interested in the p-value vs. the threshold of 0.05, well if your p using OLS is just under 0.05 you should sweat because it might well go over with a correct GLS estimate. But if your p-value is well-below 0.05, you can stop sweating because real-world data is not going to push it over and the slightly off estimate of the variance is inconsequential in a null-hypothesis testing world. In this scenario one needs to weigh the extra work involved in GLS (creating a phylogeny, determining the correct spatial covariance structure) against the goals and under some goals (e.g. p-values) the very small improvements of using GLS.***

To summarize, if one uses OLS methods (i.e. simple regression) in a GLS world (i.e when there is correlation between error terms such as space, time or phylogeny), then the estimate of slope and intercept are as good as any. The understanding about the error of this estimate is a little bit off which may or may not matter depending on goals. Specifically, if one is doing a p<0.05 test and the calculated p value is safely far below 0.05, the reject the null decision WILL NOT CHANGE.

Thus I have now given a complete mathematical proof, multiple classic citations, and an intuitive explanation of my claim in the original post. If you don’t like mathematical proofs and don’t want to look up the citations given above , I recommend looking at Table 2 and Figure 1 of the paper by Dormann et al which clearly shows that the simulated confidence intervals (i.e. the boxes in the box plots) for OLS have the same center as the various spatial methods which is the same as saying they have the same expected value which given their common center is the known true value is the same as saying they’re unbiased

I have been surprised to see just how many people were unaware of these facts (indeed many people I respect that I talked to outside of the blog also seemed misinformed on this topic). But it is also true that every statistician I talked to said something like “of course”. There appears to have been a great deal of misinformation spread around this topic which is unfortunate. The real discussion should not be around statistical facts (which are unarguable) but about what the implications are for how one does statistics (which was the point of my original post).


* Incidentally, spatial GLS is the same as regression kriging. Most discussions of the technique take a kriging oriented approach instead of framing it as GLS. But it is under the hood the same thing. In Kriging you estimate the covariogram to get a model of how covariance changes with distance. Then assuming you are modelling a trend vs covariates (i.e. the regression piece) in addition to the covariance, you are doing regression kriging which is just GLS with the covariance matrix of the errors Σ determined by the kriging model you fit.

** This idea that the core effect of autocorrelation is to make the sample size smaller is rather intuitive. Imagine I have 10 points. Now if they are placed far enough apart from each other, then I truly have 10 points because the errors at each point are uncorrelated with the errors at any other point. Now, I take two points and move them towards each other. As I start to get into the range where those two points are close, they start to be come correlated and worth a bit less than two points, so maybe I have 9.5 points now. When those two points are on top of each other, they are perfectly correlated or redundant, and I really now only have 9 points. More realistically, as I start to move all 10 points closer to each other so they start to become correlated with each other, I start to have less than 10 points worth of data, maybe really 7.3 points. Thus my degrees of freedom is overestimated. Thus my p-value is too big (and my CI is a bit too small). This is not mysterious. And as I think maybe you can see intuitively, it doesn’t do anything that changes the slope of the regression. That is all that is going on. In a phylogenetic world imagine we have 5 species and 3 points in each species giving us 15 points. Now if there is no interspecific variation and all variation is intraspecific (aka no phylogenetic correlation), then we do effectively have 15 independent points. Conversely, if all variation is interspecific and no intraspecific variation then we effectively have only 5 independent points. In the real world there is some intraspecific and some interspecific variation so there is some phylogenetic correlation so the real degrees of freedom is more than 5 but less than 15.

*** Two points of confusion that cropped up that I don’t want to mess up the main post with clarifying:

  • The OLS and GLS estimate for β WILL be different in any particular set of data. But one is not more right than the other. Both will give the true answer on average (or with large enough data)
  • One could say the variance estimate of β IS biased if one using the OLS estimate in the GLS world. A statistician would understand what this means but probably not use this language. Bias of an estimator usually refers to how the average behavior of the estimator matches the true value (not its variance around the true value which is described using words like efficiency). In any case this is semantics – my core point – the estimated slope & intercept are good, the p-value (or other measure of error) is a little off – is unchanged.

14 thoughts on “Why OLS estimator is an unbiased estimator for GLS

  1. Hi Brian,

    cheers for writing this down in a more mathematical way, this makes it much easier to discuss. I think the variance bias has been covered sufficiently in and below your previous post, so I will concentrate on the point estimates.

    Your proof for unbiased point estimates assumes that η~N(0,σ²Σ) is multivariate normal and independent of X. In “ecological words”, this assumes that autocorrelation is not only multivariate normal, but also that it is more or less homogeneous. There are a number of situations where I would expect that this is not the case. Imagine, for example, that, for some reason, your spatial model has a systematic error that leads to a situation where it generally overpredicts for high altitudes, with consequent residual autocorrelation in areas of high altitudes, while all other areas show slight underprediction, but no RSA. I don’t think your proof still holds for such a situation. Of course, you might already see this in your non-spatial residuals, and you could also say that in this situation, I should rather correct my model, but I would say that model error (which is always there to some extent) can generally lead to RSA in spatial models, and it’s not clear to me why it should necessarily do so in a homogeneous fashion.

    So, I would say that, while it is good to know that point estimates for data with residuals that fit to a GLS with ~N(0,σ²Σ) can also be estimated unbiased with OLS, it doesn’t really solve the problem that you have to look at the spatial/phylogenetic residuals to see whether the residuals are really ~N(0,σ²Σ), i.e. the data fits to a GLS. And if you have to do this anyway, I wonder how much more work it is to use a GLS right away, which also gives you correct p-values/CIs.

  2. Hi Florian- thanks for the thoughtful response.

    You picked up on a 3rd complexity, I was trying to avoid getting into (clearly you followed the math so I’m glad I spent the time doing hybrid Latex/WordPress equations!). However it does not change the OLS/GLS story.

    You are correct, my proof assumes either that:
    1) X is constant (non-stochastic). This is the often stated but poorly understood (and rarely met) assumption of regression that the X variables are measured with no error.
    2) X and ε or η are uncorrelated with each other if there is error in the X measurements
    BUT – both OLS & GLS require this assumption. GLS does not in anyway change the story. It is necessary to go to Type II (e.g. RMA) regression to fix this problem.

    So while you are 100% right about the proof, and it highlights an assumption that I agree is often violated in ecology, it does not in any way affect the OLS vs GLS estimator debate. It is, however, a good argument for ecologists to pay more attention to Type II regression.

    • Hi Brian,

      if your assumption is that you are only interested in point estimates, and that the alternative to OLS would be a simple GLS, regardless of whether the GLS is able to remove the RSA completely, I see your point – the GLS only removes a certain “homogeneous” part of the autocorrelated error which likely doesn’t bias point estimates, and for the non-homogenous part of the residuals, it’s not clear that GLS is doing better than OLS.

      However, the way I see it, this doesn’t mean that OLS is unbiased and one can savely use OLS, but rather that OLS and simple GLS are potentially equally biased if there is non-homogenous autocorrelation, which actually isn’t that unlikely in the real world. Here, looking at the residuals allows me to see whether OLS/GLS assumptions are met, and to potentially correct them if not, e.g. by making ε or η functions of X or space, at least this would be my view on what should be done in an ideal world. Btw., I just had a look at the Dormann 2007 results that you cited and noted that they remark on the fact that their simulation results were created with a homogenous spatial correlation structure, so it’s not surprising that the point estimates didn’t shift.

      About your request for references: I just found a nice new review by Jennifer A. Miller (2012) Species distribution models. Progress in Physical Geography, 36, 681-692. She discusses the issue of whether point estimates shift in length. *BRIAN HERE – Florian I did have to remove the quote because it was too long (and I think it probably violates copyright law). However I agree this is a good review worth checking out to those interested in spatial autocorrelation

      • Non-stationarity – change in the structure of autocorrelation across space (including change in the structure with covariates) is a serious issue (although not differentiating OLS vs GLS which both have a stationarity assumption). I think it is likely a major issue in species distribution models/niche models.

        Personally, I would rather see us spend more time and energy truly understanding the structure of spatial autocorrelation (including its non-stationarity) and less time doing mindless GLS over OLS when we’re primarily interested in interactions between dependent and independent variables that we have a priori reasons to believe interact strongly independent of space.

  3. Great post Brian! Even as a stats-deficient reader it was easy to follow the plot. That said, could you elaborate on a few points?

    1) How does one arrive at the “kd” parameter that models correlation between points in space or species on phylogenies?

    2) “p-value which is really a bit bigger than reported”

    “if one is doing a p<0.05 test and the calculated p value is safely far below 0.05, the reject the null decision WILL NOT CHANGE."

    These sound reassuring, but what is the operational definition of "a bit" and "safely far below 0.05"? I'm not sure skeptical reviewers would be so easily convinced by those terms.

    3) "The OLS and GLS estimate for β WILL be different in any particular set of data. … Both will give the true answer on average (or with large enough data)"

    It seems like we are most interested in particular sets of data, which are unfortunately of finite size. Basically, can you give some examples on the flip side of your argument where phylogenetic or spatial correlation really does change things?

    Thanks! Chris

    • Hi Chris – great questions

      #1 – k and d are separate – d is the distance between any two points (in space – e.g. in km, in time e.g. in years, or in phylogeny, e.g. in millions of years). d is measured from the data. K is an estimated parameter of the model – i.e. it is curve fit. If we knew k in advance, GLS would be very elegant, but having to estimate k (and which exact curve to represent how autocorrelation changes with distance) as well as the coefficients (βs) requires a two step process and adds some limitations back into the GLS frame which contributes to my point (mainly it means the GLS estimates often don’t have much better error intervals than the OLS).

      #2 – It would require some really good simulations to answer this definitively and they haven’t been done as far as I know but in both spatial and phylogenetic GLS I rarely see an OLS p of 0.01 inflated above 0.05 (but often see p=0.03 inflated above 0.05). Thus I suggest a conservative value of if p<0.005 (order of magnitude less) you don't have to worry about autocorrelation if you are only doing p value and slope/intercept estimates. But so often in these regressions the p<0.000001 and it is just laughably safe of never going above 0.05.

      #3) You are right of course that this statistics world where we have multiple samples of the data doesn't exist (alas!). But for any given dataset, even though OLS and GLS will give slightly (and I do mean slightly different answers – often just a few percent different) there is no valid reason to prefer one over the other. Its a bit like the person who wears two watches and never knows what time it is when they disagree.

      Your last question is very pertinent and is really my central question. It is really hard to come up with an answer where it fundamentally changes things. In spatial regression I have never seen it fundamentally change the answer (yes it could take a p=0.03 and turn it into non-significant but that is artificially binary to say that GLS p=0.06 is fundamentally different than OLS p=0.03). In phylogeny it is usually the same. I do believe there are one or two classic cases where there really was not careful thinking about the fact that they were really just contrasting lots of species in two families and in some ways only had two points. I seem to recall a case about correlation between dioecy and berries for the fruit or some such. I will try to dig up anything I can find. But in the meantime other readers are welcome to contribute. This was really the main question of my first post, and I never got an affirmative answer where our ecological/evolutionary understanding fundamentally shifted thanks to these cases.

      • @Brian, @Chris – re #2 Sample size will play a large role in how far the p-value moves once the standard errors of the estimates are computed in the GLS world. If you have a lot of data, the p-value won’t change that much, but with small samples the effect can be large. For example (if you know a bit of R):

        x <- 1:15
        y <- 0 + (0.64 * x)
        set.seed(42)
        noise <- as.numeric(arima.sim(list(ar = 0.7), n = 15))
        y <- y + noise
        mod <- lm(y ~ x)
        summary(mod)
        require("nlme")
        mod2 <- gls(y ~ x, correlation = corAR1(form = ~ x))
        summary(mod2)
        

        This is a small data set (but not uncommon in studies) and reasonably strong autocorrelation. In this case the p-values changes by 4 orders of magnitude but remains significant (which sort of makes Brian’s point :-). If the original fit wan’t as significant then strong correlation could potentially flip a p = 0.001 result to insignificance.

        If you are concerned about this you will probably want to simulate data similar to those real data you are analysing (sample size, size of fixed effect, etc). But if you are going to those lengths I suspect it would be expedient to just use GLS in the first place.

  4. Pingback: Is using detection probabilities a case of statistical machismo? | Dynamic Ecology

  5. Pingback: Autocorrelation: friend or foe? | Dynamic Ecology

  6. Thanks for explaining this in an easy way! Some textbooks I’ve been reading are so hard to understand. You’ve done a really nice job of pointing out some things that I’ve simply not been able to grasp (such as the real difference between OLS and GLS is the error term).

  7. Pingback: Detection probabilities, statistical machismo, and estimator theory | Dynamic Ecology

  8. Just to add – the more my thinking has evolved on this, the more I think the fact that the GLS estimator of the variance of \beta (and thus p-values/confidence intervals/degrees of freedom) is better/more accurate ONLY if the covariance structure is correctly modeled and that a bad estimate of the covariance structure can fairly easily make GLS LESS accurate than OLS even for the variance is vastly underappreciated. I don’t know formally how often this happens but my intuition is that given we start with errors in our phylogenies, spatial locations etc, and then often have users use push-button routines to estimate \epsilon/\Sigma that don’t know how to carefully model and evaluate the goodness of fit of their models of covariance, I expect it is often.

  9. Pingback: Statistical Machismo | brouwern

  10. Pingback: In praise of courtesy p-values: perfectly correct p-values vs. pragmatically approximate p-values | Dynamic Ecology

Leave a Comment

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.