Teaching a graduate statistics class, I end up as a statistical consultant a lot. One of the questions I get most often is should I treat this as a fixed or a random effect? This topic seems to be shrouded in mystery. Indeed when I came of age statistically in the dark ages (=20 years ago), the main distinction given between a fixed and a random effect was philosophically based: are you measuring a few specific instances of interest in themselves (=fixed) or a few randomly chosen instances interesting only as representatives of a population (=random). This is not a bad approach, and seems clear to me, although I have to confess I have not had great luck teaching this distinction.
The modern distinction seems to be: am I interested in this variable (=fixed) or is it just a blocking factor/control variable* (=random). To my mind, this is not a great heuristic. Among other things, I regularly see people who have a continuous variable they are not interested in aka a control variable (e.g. age) and so they think it has to be a random variable and cannot figure out the syntax to force R to treat a continuous variable as random (hint you can’t).
Now the first question I always ask is should you really include it if you’re not interested in it?! Sometimes the answer is yes (see the footnote on blocking and control factors), but we include way too many blocking and control factors these days in my opinion. Personally I think we should only include the 1 or 2 really obvious blocking or control factors in the average situation. And AIC as a model selection tool only worsens this. But I am in a minority on this.
But if you’re really going to include a term despite my advice not to🙂, should it be fixed or random? To answer this, I find it inordinately helpful to actually step back and look at what the math is doing under the hood. Suppose you have a blocking factor on site, meaning you have n sites 1…n. And you are fitting some model Yi=a+b*xi+si+εi.where si is the site effect for observation i (not necessarily site i, presumably you have multiple observations per site or it wouldn’t be a blocking factor). Basically what is happening is the variable of interest x has its slope b as usual, and then the intercept (a) has an additional increment up or down, si, which depends on the site. So site #1 might be +2.4, site #2 might be -1.3, etc. This basically is the classic blocking factor* since it takes into account that in this case site #1 is a bit more productive (if Y is a measure of productivity) than average, site #2 is a bit less etc. So all those multiple measures i on a given site are controlled for site-specific effects (i.e. anything on site 1 needs to be .
So what’s the difference if I tell R that s is a fixed (y~x+s) or random (y~x,random=~1|s or thereabouts depending on the package)? Well if I treat s as fixed, then it is actually estimating AND reporting a value of s for site #1, #2, etc (as above – i.e. site 1 is +2.4, etc)**. If I treat s as random it is more or less estimating site values for those sites to control for them in the equation, but it is REPORTING just the variance of all those site effects, rather than the value of each one. So if it is fixed it is reporting n** site estimates (s1, s2,…,sn). If it is random it is reporting a single σ2 value with σ2=variance(s1, s2,…,sn). I am sweeping a few details under the rug. But that is what is happening conceptually. Now which should I prefer to have reported back: s1, s2,…,sn or σ2? Ah-ha! That would be a much better way to frame the question than is it fixed or random!
The main reason I might want to have s1, s2,…,sn reported is if I actually want to look at the sites and know which is an above average producer and which is below average. I don’t get that if s is random. This refers back to the original philosophical distinction described above.
Why might I prefer to just report σ2? There are two reasons that I commonly see. First is that σ2 is just one parameter and so it just “eats” one degree of freedom. Where as if I have n=20 sites, I eat n-1=19 degrees of freedom to report all the s1, s2,…,sn . If I have 1000s of data points it hardly matters. But if I am in a costly experimental design where I barely have more data points than variables I am estimating (try to avoid that but it happens), then I better be stingy with my degrees of freedom. This partly goes back to the original notion of “am I interested in the variable” – and indeed it puts an exact currency of cost in degrees of freedom on how badly I am interested in the variable. But the second reason is that σ2 just might be more relevant/interesting regardless of saving degrees of freedom. First, maybe it is actually more interesting to estimate the variance across all possible sites (of which I have only sampled a handful). This might be a more meaningful description of reality than arbitrary site #1 is +2.4 but arbitrary site #2 is -1.3. This again gets back to the philosophical distinction (are my sites of interest in themselves or an arbitrary sample of all sites of interest). But another very practical reason is I might want to do variance partitioning. Saying that 70% of all the variance in my data is due to variation between sites is a very useful thing to say (although probably a bit depressing). Such variance partition can be done by hand with a few simple calculations on outputs from mixed model packages (nlme, lme4) or by functions like varcomp in package ape.
Understanding the math under the hood also makes obvious two common mistakes. First you CANNOT treat a continuous variable as a random effect. So if you are putting area or temperature or body size is in they may be a nuisance/control variable but they are a fixed effect. Of course you are only estimating one parameter (the slope) so there is no degree of freedom cost to treating it as random. And it makes no sense to ask what is the variance across a continuous variable. Similarly, you should not treat a categorical variable with only two levels (e.g. two sites), also known as a binary variable, as a random effect. You wouldn’t take two measures and then try to estimate variance, but that is what you’re asking R to do if you treat it as random. Beyond that there is a lot of debate. But many people think you should have at least 5 levels (e.g. 5 sites) before you treat something as random. Some think even more. This is one place I see students really get in trouble a lot. They’ll have a blocking factor with 2 or 3 levels (year is a frequent culprit) and try to treat it as random. R will go ahead and report something, but it is pretty much the garbage estimate you would expect if you tried to estimate variance from 3 observations.
Another reason to have have s1, s2,…,sn reported back is that the math for that is really clear. In particular the math for calculating degrees of freedom and p-values. Some simple designs with σ2 can return a p-value (especially those simple designs that can be handled by the lme command of nlme such as one random variable in a split plot design or nested random factors). But many modern complex designs with random effects cannot realistically calculate degrees of freedom and p-values. The package that can handle more complex random designs – lmer – pretty much guarantees you don’t know what your p-values are (for good reasons) and the most common workarounds are wrong and often anti-conservative to such a degree that the authors’ of the package refuse to provide p-values (e.g. this comment and Bolker’s comments on Wald tests). So while random variables have some purely statistical advantages (saves degrees of freedom), they have some pretty severe purely statistical disadvantages (often cannot calculate p-valuees).
So my basic decision tree to decide whether to treat something as fixed or random looks like
You may have noticed this decision tree makes it kind of hard to make something random. That is not intentional. I like random effects. But that seems to be the state of the field today. What do you think? What causes you to assign something as fixed vs random?
* Blocking and control variables – these are collectively sometimes referred to as nuisance variables. You put them in because you need to, not because they are of interest. Blocking is typically used when you have repeated measures or nesting in your design. EG if you have 5 replicates (often of different treatments) at each site, there might be a site effect, so you include site as a blocking variable. This then effectively calculates site means and subtracts these out, so you can compare the replicates across sites without having the extra noise (and fear of charges of pseudoreplication) of having a “known unknown” in your data. That is to say you know there are site level effects but you don’t know what they are. Control variables are basically used to eliminate covariance between your variables of interest that occur in observational data. One alternative to putting site in as a blocking variable would be to measure key variables (e.g. soil moisture, productivity) and put these in as continuous variables (although you might still need a hierarchical design to avoid charges of psuedoreplication). These would then be control variables – instead of controlling for site you control for site moisture and productivity. This is also common in medical and educational designs where you know socioeconomic status, age, gender etc are likely to have effects so you control them out by putting them in the regression allowing you to then assume the remaining linkage between say a treatment and a test result are connected (rather than the assignment of treatments was non-randomly correlated with gender).
My own opinion is that control variables are a much stronger inferential framing than blocking (measure variable of known effect like moisture and productivity instead of arbitrary stand ins by site). In general, ecologists tend to be obsessed with blocking factors and not worry anywhere near as much as economists and medical researchers about control factors. That is probably unfortunate. Blocking is mostly driven by a fear of charges of pseudoreplication which is unfortunate because it is probably overly emphasizing p-values and as even the original author Hurlbert argued, the solution to pseudoreplication is good experimental design, not fancier statistics. And I’m not saying having one blocking factor is a bad idea, but 3 or 4 blocking factors feels a bit adrift to me.
** Technically given the way contrasts usually work it is calculating estimates of site effects for one less site than I have or n-1 site effects, with the last site rolled into the intercept.