In a recent post, I discussed whether it was important to use special methods to correct for autocorrelation be it correlation in spatial, temporal or phylogenetic distance.
There seemed so much confusion on this point, I thought I would do a little explaining.
First point is that the way one corrects for autocorrelation is to move from the ordinary least square (OLS) regression model of
to the generalized least squares (GLS) model of
Looks pretty similar doesn’t it? It is except that the assumptions about the error term (aka residuals) denoted here as ε and η are different.
First in both OLS and GLS, we assume E(ε)=0 and E(η)=0 (For those unfamilar with it, E stands for expectation and indicates the average or expected value). Still the same here.
In OLS we assume Var(ε)=σ²I where I is the identity matrix (1’s on diagonal, 0’s off the diagonal). Since we have 0’s off the diagonal the covariance/correlation between any two errors is zero. Since the diagonal is all the same the variance of each error is the same, or in fancy greek words homoscedastic.
But in GLS we assume Var(η)=σ²Σ where Σ is the covariance matrix (a matrix that is symmetric but which can have non-zero values off the diagonal). So the 2nd row, 3rd column of Σ has the covariance between the error term for the 2nd data point (of X) and the 3rd point (which is of course the same as the error covariance between the error term of the 3rd point and the 2nd point and hence the same as the 3rd column, 2nd row of Σ, making Σ symmetric).
Or to summarize:
OLS: Y=XΒ+ε where ε~N(0,σ²I)
GLS: Y=XΒ+η where η~N(0,σ²Σ)
and ~N(0,σ) means distributed normally (Guassian distribution) with mean 0 and variance σ
Its not really my goal here to explain how to do spatial regression or phylogenetic regression, but briefly these techniques build a model for the structure of Σ where the covariance is a function of the distance between two points (denoted by d) and one parameter reflecting the strength of autocorrelation (denoted by k), so the 2nd row, 3rd column of Σ is a function of the distance between the 2nd and 3rd point (be it distance in physical space, time, or phylogenetic distance≈time). Which function depends on how we think covariance changes with distance. So for example, for spatial autoregression we could assume covariance decays exponentially with distance, i.e. cov=exp(-kd) (see section 5.3.2 of Pinheiro & Bates book on mixed models for a GLS oriented verison of this)* or for phylogenies (see section 2 of this) assuming Brownian motion evolution cov=kd. But this should give a quick idea of how GLS does with autocorrelation in cases of ecological interest.
Now what I said that so many people didn’t understand is that the OLS estimator of β denoted by a hat () is also the unbiased estimator of the GLS problem. That’s right solve the OLS problem ignoring the autocorrelation and it will give an unbiased estimate for the slope and intercept even under the GLS formulation with the autocorrrelation/covariances .
This is widely known in statistics. For example, p. 224 of Batalgi’s 2011 book Econometrics or p. 224 of the classic book by Draper 1998 on Applied Regression Analysis (NB: not a typo – it is on page 224 of both books – a cosmic convergence!)
However, many readers were unhappy with citations, so I also provide a proof (skip this paragraph if you want!). To start, we know that the solution of the OLS problem (or estimator of β called or “beta hat”) is =(X’X)-1X’Y (yeah the matrix notation is getting heavy). Now the bias of this estimator has a formal definition:
So now we just do algebra:
All we’ve done so far is substitute in our estimator for (X’X)-1XY and then substitute in our formula for Y from the GLS formulation, so continuing with simple algebraic manipulation:
Bias()=E[(X’X)-1X'(Xβ+η)-β]=E[(X’X)-1X’Xβ+(X’X)-1Xη-β]=E[(X’X)-1(X’X)β-β+(X’X)-1Xη)]=E[β-β-(X’X)-1Xη]=E[(X’X)-1Xη]=(X’X)-1X E(η)=(X’X)-1X 0 = 0
therefore Bias()=0 aka unbiased!
Now lets pull back. What we did was take the OLS estimator of β and calculate its bias assuming we’re solving the GLS model. And we showed the bias was zero. This shows the point estimator of the OLS is unbiased even if the underlying data follows the GLS model with covariance between points. Which was my claim.
This followed entirely because E(η)=0. This is just a fancy of way of saying the average error term is zero or the GLS line is centered between the error terms, or in other words, the sum of the residuals is zero. This property is enough to give us the OLS estimator being unbiased for ANY linear regression model. We can assume anything we want about the covariance of the error terms and still get the same answer. In hindsight this is intuitive.
Now, I didn’t begin to touch the variance of estimator . That is also just algebra, but more messy, and here the OLS and GLS solutions diverge because they depend on Var(ε) which is difference than Var(η). In principal the variance of the OLS estimate used when there is covariance of errors has to be larger than the variance of the GLS estimate (the GLS estimator is in an exact sense the best estimator in this case). In practice, this assumes that covariance matrix Σ is known which it never is which allows either to be bigger (a bad estimate of the covariance can make the GLS estimate have larger variance around ). In my experience in practice the OLS estimate has a little more variance than the GLS but not a lot. To summarize again, the confidence interval (and p-value) for the OLS estimator are wrong in a GLS world. Personally I find this easiest to think about as the number of degrees of freedom being overestimated**.
So what if the estimate is unbiased but may have a bigger variance? Well that was the point of my previous post. The variance of the estimator can show up in several ways:
- in the confidence interval which is calculated/reported to be a bit smaller than the truth
- in the likelihood (and hence AIC value)
- in the p-value which is really a bit bigger than reported.
Now if one is focused on the confidence interval this is a real problem and one ought to use GLS if possible. If one is using AIC to, for example, compare models with and without GLS it is a real problem (but might not be a problem if one is just using AIC to select predictor variables and all models are run with OLS estimators and hence doing an apples-to-apples comparison). And, most importantly, in what I believe to be by far the most common case, where one is interested in the p-value vs. the threshold of 0.05, well if your p using OLS is just under 0.05 you should sweat because it might well go over with a correct GLS estimate. But if your p-value is well-below 0.05, you can stop sweating because real-world data is not going to push it over and the slightly off estimate of the variance is inconsequential in a null-hypothesis testing world. In this scenario one needs to weigh the extra work involved in GLS (creating a phylogeny, determining the correct spatial covariance structure) against the goals and under some goals (e.g. p-values) the very small improvements of using GLS.***
To summarize, if one uses OLS methods (i.e. simple regression) in a GLS world (i.e when there is correlation between error terms such as space, time or phylogeny), then the estimate of slope and intercept are as good as any. The understanding about the error of this estimate is a little bit off which may or may not matter depending on goals. Specifically, if one is doing a p<0.05 test and the calculated p value is safely far below 0.05, the reject the null decision WILL NOT CHANGE.
Thus I have now given a complete mathematical proof, multiple classic citations, and an intuitive explanation of my claim in the original post. If you don’t like mathematical proofs and don’t want to look up the citations given above , I recommend looking at Table 2 and Figure 1 of the paper by Dormann et al which clearly shows that the simulated confidence intervals (i.e. the boxes in the box plots) for OLS have the same center as the various spatial methods which is the same as saying they have the same expected value which given their common center is the known true value is the same as saying they’re unbiased
I have been surprised to see just how many people were unaware of these facts (indeed many people I respect that I talked to outside of the blog also seemed misinformed on this topic). But it is also true that every statistician I talked to said something like “of course”. There appears to have been a great deal of misinformation spread around this topic which is unfortunate. The real discussion should not be around statistical facts (which are unarguable) but about what the implications are for how one does statistics (which was the point of my original post).
* Incidentally, spatial GLS is the same as regression kriging. Most discussions of the technique take a kriging oriented approach instead of framing it as GLS. But it is under the hood the same thing. In Kriging you estimate the covariogram to get a model of how covariance changes with distance. Then assuming you are modelling a trend vs covariates (i.e. the regression piece) in addition to the covariance, you are doing regression kriging which is just GLS with the covariance matrix of the errors Σ determined by the kriging model you fit.
** This idea that the core effect of autocorrelation is to make the sample size smaller is rather intuitive. Imagine I have 10 points. Now if they are placed far enough apart from each other, then I truly have 10 points because the errors at each point are uncorrelated with the errors at any other point. Now, I take two points and move them towards each other. As I start to get into the range where those two points are close, they start to be come correlated and worth a bit less than two points, so maybe I have 9.5 points now. When those two points are on top of each other, they are perfectly correlated or redundant, and I really now only have 9 points. More realistically, as I start to move all 10 points closer to each other so they start to become correlated with each other, I start to have less than 10 points worth of data, maybe really 7.3 points. Thus my degrees of freedom is overestimated. Thus my p-value is too big (and my CI is a bit too small). This is not mysterious. And as I think maybe you can see intuitively, it doesn’t do anything that changes the slope of the regression. That is all that is going on. In a phylogenetic world imagine we have 5 species and 3 points in each species giving us 15 points. Now if there is no interspecific variation and all variation is intraspecific (aka no phylogenetic correlation), then we do effectively have 15 independent points. Conversely, if all variation is interspecific and no intraspecific variation then we effectively have only 5 independent points. In the real world there is some intraspecific and some interspecific variation so there is some phylogenetic correlation so the real degrees of freedom is more than 5 but less than 15.
*** Two points of confusion that cropped up that I don’t want to mess up the main post with clarifying:
- The OLS and GLS estimate for β WILL be different in any particular set of data. But one is not more right than the other. Both will give the true answer on average (or with large enough data)
- One could say the variance estimate of β IS biased if one using the OLS estimate in the GLS world. A statistician would understand what this means but probably not use this language. Bias of an estimator usually refers to how the average behavior of the estimator matches the true value (not its variance around the true value which is described using words like efficiency). In any case this is semantics – my core point – the estimated slope & intercept are good, the p-value (or other measure of error) is a little off – is unchanged.