Teaching a graduate statistics class, I end up as a statistical consultant a lot. One of the questions I get most often is should I treat this as a fixed or a random effect? This topic seems to be shrouded in mystery. Indeed when I came of age statistically in the dark ages (=20 years ago), the main distinction given between a fixed and a random effect was philosophically based: are you measuring a few specific instances of interest in themselves (=fixed) or a few randomly chosen instances interesting only as representatives of a population (=random). This is not a bad approach, and seems clear to me, although I have to confess I have not had great luck teaching this distinction.

The modern distinction seems to be: am I interested in this variable (=fixed) or is it just a blocking factor/control variable* (=random). To my mind, this is not a great heuristic. Among other things, I regularly see people who have a continuous variable they are not interested in aka a control variable (e.g. age) and so they think it has to be a random variable and cannot figure out the syntax to force R to treat a continuous variable as random (hint you can’t).

Now the first question I always ask is should you really include it if you’re not interested in it?! Sometimes the answer is yes (see the footnote on blocking and control factors), but we include way too many blocking and control factors these days in my opinion. Personally I think we should only include the 1 or 2 really obvious blocking or control factors in the average situation. And AIC as a model selection tool only worsens this. But I am in a minority on this.

But if you’re really going to include a term despite my advice not to :smile:, should it be fixed or random? To answer this, I find it inordinately helpful to actually step back and look at what the math is doing under the hood. Suppose you have a blocking factor on site, meaning you have n sites 1…n. And you are fitting some model Y_{i}=a+b*x_{i}+s_{i}+ε_{i}.where s_{i} is the site effect for observation i (not necessarily site i, presumably you have multiple observations per site or it wouldn’t be a blocking factor). Basically what is happening is the variable of interest x has its slope b as usual, and then the intercept (a) has an additional increment up or down, s_{i}, which depends on the site. So site #1 might be +2.4, site #2 might be -1.3, etc. This basically is the classic blocking factor* since it takes into account that in this case site #1 is a bit more productive (if Y is a measure of productivity) than average, site #2 is a bit less etc. So all those multiple measures i on a given site are controlled for site-specific effects (i.e. anything on site 1 needs to be .

So what’s the difference if I tell R that s is a fixed (y~x+s) or random (y~x,random=~1|s or thereabouts depending on the package)? Well if I treat s as fixed, then it is actually estimating AND reporting a value of s for site #1, #2, etc (as above – i.e. site 1 is +2.4, etc)**. If I treat s as random it is more or less estimating site values for those sites to control for them in the equation, but it is REPORTING just the variance of all those site effects, rather than the value of each one. So if it is fixed it is reporting n** site estimates (s_{1}, s_{2},…,s_{n}). If it is random it is reporting a single σ^{2} value with σ^{2}=variance(s_{1}, s_{2},…,s_{n}). I am sweeping a few details under the rug. But that is what is happening conceptually. Now which should I prefer to have reported back: s_{1}, s_{2},…,s_{n} or σ^{2}? Ah-ha! That would be a much better way to frame the question than is it fixed or random!

The main reason I might want to have s_{1}, s_{2},…,s_{n }reported is if I actually want to look at the sites and know which is an above average producer and which is below average. I don’t get that if s is random. This refers back to the original philosophical distinction described above.

Why might I prefer to just report σ^{2}? There are two reasons that I commonly see. First is that σ^{2} is just one parameter and so it just “eats” one degree of freedom. Where as if I have n=20 sites, I eat n-1=19 degrees of freedom to report all the s_{1}, s_{2},…,s_{n} . If I have 1000s of data points it hardly matters. But if I am in a costly experimental design where I barely have more data points than variables I am estimating (try to avoid that but it happens), then I better be stingy with my degrees of freedom. This partly goes back to the original notion of “am I interested in the variable” – and indeed it puts an exact currency of cost in degrees of freedom on how badly I am interested in the variable. But the second reason is that σ^{2} just might be more relevant/interesting regardless of saving degrees of freedom. First, maybe it is actually more interesting to estimate the variance across all possible sites (of which I have only sampled a handful). This might be a more meaningful description of reality than arbitrary site #1 is +2.4 but arbitrary site #2 is -1.3. This again gets back to the philosophical distinction (are my sites of interest in themselves or an arbitrary sample of all sites of interest). But another very practical reason is I might want to do variance partitioning. Saying that 70% of all the variance in my data is due to variation between sites is a very useful thing to say (although probably a bit depressing). Such variance partition can be done by hand with a few simple calculations on outputs from mixed model packages (nlme, lme4) or by functions like varcomp in package ape.

Understanding the math under the hood also makes obvious two common mistakes. First you CANNOT treat a continuous variable as a random effect. So if you are putting area or temperature or body size is in they may be a nuisance/control variable but they are a fixed effect. Of course you are only estimating one parameter (the slope) so there is no degree of freedom cost to treating it as random. And it makes no sense to ask what is the variance across a continuous variable. Similarly, you should not treat a categorical variable with only two levels (e.g. two sites), also known as a binary variable, as a random effect. You wouldn’t take two measures and then try to estimate variance, but that is what you’re asking R to do if you treat it as random. Beyond that there is a lot of debate. But many people think you should have at least 5 levels (e.g. 5 sites) before you treat something as random. Some think even more. This is one place I see students really get in trouble a lot. They’ll have a blocking factor with 2 or 3 levels (year is a frequent culprit) and try to treat it as random. R will go ahead and report something, but it is pretty much the garbage estimate you would expect if you tried to estimate variance from 3 observations.

Another reason to have have s_{1}, s_{2},…,s_{n } reported back is that the math for that is really clear. In particular the math for calculating degrees of freedom and p-values. Some simple designs with σ^{2} can return a p-value (especially those simple designs that can be handled by the lme command of nlme such as one random variable in a split plot design or nested random factors). But many modern complex designs with random effects cannot realistically calculate degrees of freedom and p-values. The package that can handle more complex random designs – lmer – pretty much guarantees you don’t know what your p-values are (for good reasons) and the most common workarounds are wrong and often anti-conservative to such a degree that the authors’ of the package refuse to provide p-values (e.g. this comment and Bolker’s comments on Wald tests). So while random variables have some purely statistical advantages (saves degrees of freedom), they have some pretty severe purely statistical disadvantages (often cannot calculate p-valuees).

So my basic decision tree to decide whether to treat something as fixed or random looks like

_{1}, s

_{2},…,s

_{n}(perhaps because you have lots of data and so lots degrees of freedom to burn and are curious how sites differ)? →Fixed

^{2}, perhaps because it saves you degrees of freedom you really need or perhaps because the variance is more interesting (or useful for variance partitioning) than a bunch of estimates of site effects nobody will ever look at? go to D

You may have noticed this decision tree makes it kind of hard to make something random. That is not intentional. I like random effects. But that seems to be the state of the field today. What do you think? What causes you to assign something as fixed vs random?

* Blocking and control variables – these are collectively sometimes referred to as nuisance variables. You put them in because you need to, not because they are of interest. Blocking is typically used when you have repeated measures or nesting in your design. EG if you have 5 replicates (often of different treatments) at each site, there might be a site effect, so you include site as a blocking variable. This then effectively calculates site means and subtracts these out, so you can compare the replicates across sites without having the extra noise (and fear of charges of pseudoreplication) of having a “known unknown” in your data. That is to say you know there are site level effects but you don’t know what they are. Control variables are basically used to eliminate covariance between your variables of interest that occur in observational data. One alternative to putting site in as a blocking variable would be to measure key variables (e.g. soil moisture, productivity) and put these in as continuous variables (although you might still need a hierarchical design to avoid charges of psuedoreplication). These would then be control variables – instead of controlling for site you control for site moisture and productivity. This is also common in medical and educational designs where you know socioeconomic status, age, gender etc are likely to have effects so you control them out by putting them in the regression allowing you to then assume the remaining linkage between say a treatment and a test result are connected (rather than the assignment of treatments was non-randomly correlated with gender).

My own opinion is that control variables are a much stronger inferential framing than blocking (measure variable of known effect like moisture and productivity instead of arbitrary stand ins by site). In general, ecologists tend to be obsessed with blocking factors and not worry anywhere near as much as economists and medical researchers about control factors. That is probably unfortunate. Blocking is mostly driven by a fear of charges of pseudoreplication which is unfortunate because it is probably overly emphasizing p-values and as even the original author Hurlbert argued, the solution to pseudoreplication is good experimental design, not fancier statistics. And I’m not saying having one blocking factor is a bad idea, but 3 or 4 blocking factors feels a bit adrift to me.

** Technically given the way contrasts usually work it is calculating estimates of site effects for one less site than I have or n-1 site effects, with the last site rolled into the intercept.

Hi Brian, interesting discussion! Personally, I’ve found it most helpful to think of “random effects” as either varying intercepts, slopes or both, following Gelman’s textbook (BDA3) and papers. The key is that they are given a group-level distribution, say N(0, sigma), where the variance is estimated, which engages the power of partial pooling for the separate estimates (s1, s2,..). I really think it’s an unfortunate disadvantage of the lme4 approach that it is very difficult to get confidence intervals for s1, s2, …, and this is one compelling reason to learn Stan! There are times when it seems best to let treatment effects vary by group (genotype, species, whatever), and we are specifically interested in inference for the separate estimates in addition to the estimate of the overall variance component.

My $0.02

Chris

You are of course correct about varying slopes and the assumption of N(0,sigma) (although I didn’t want to bring those complications into my discussion for reasons of brevity).

What I don’t understand is that in practice you can have varying slopes or intercepts that are either fixed or random. ANCOVA without random effects incorporates all of these concepts.

Yep. The key distinction as Gelman likes to say is “modelled varying coefficients” versus “unmodelled varying coefficients”. In the former case your Si ~ N(0, sigma) but in the latter you have Si ~ N(0, Inf) which is just “fixed effects” (what ANCOVA does). The difference in the estimates is precisely analogous to the impact of the prior on the posterior, in that an “uninformative” prior is set to have N(0, ~Inf) and it therefore doesn’t shrink the posterior estimate toward 0 at all (which isn’t a good thing IMO!). So, at least for hierarchical Bayes the modelled varying coefficients approach uses the estimated group-level mean as a sort of informative prior for each of the separate Si estimates. The shrinkage is greater with lower group-level variance, and/or noisier Si…

One thing I disagree with Jeremy’s comment below is that I don’t think that an effect being of interest in itself is a reason to treat is as “fixed”, or N(0, Inf). Partial pooling makes the separate coefficient estimates more statistically efficient, so my current understanding is to use N(0, sigma) wherever possible.

Of course it doesn’t always make sense. If I have an agricultural experiment with N and P applications, I don’t want the estimate of the P effect borrowing strength from (partially pooled with) the N estimate. BUT, if I’ve got multiple genotypes getting N and P, I would want their estimates partially pooled to the group-level means for N and P respectively. Obviously, block effects should also always be given a N(0, sigma) type model.

Chris

OK – that helps. That is a pretty specific STAN/Gelman/sort-of-Bayesian view of it. But I get where you’re coming from. I don’t know how many ecologists will have been exposed to that view, but I find it interesting. Thanks.

And I think your disagreement with Jeremy has more to do with your choice of definitions than any true disagreement. the linear mixed model world definitely has the distinction Jeremy made, not the one you made.

@Chris Wilson:

Brian’s summarized why I don’t actually disagree with you, at least I don’t think I do. My answer to SB’s question below took for granted some assumptions about the context of the question.

Hi guys, fair enough. I suppose from the standpoint of using lme4 (which is really useful in some cases and impressively fast, don’t get me wrong) any effect of interest needs to be treated as “fixed” since there’s no good way to get CI’s/p-values from the random estimates. I do think it’s unfortunate in that it confines random effects (hence any quantity estimated with partial pooling) to be sort of nuisance parameters. In several problems I’ve worked on, it made sense to have modeled varying coefficients, and it was really important to do inference with them, so we needed CIs. For me, this wasn’t a problem since I started learning all this stuff from a Bayesian perspective, so I’m comfortable with the mechanics of MCMC and specifying priors, etc. But I can see how there’s definitely a need to reach across backgrounds to make accessible tools, and statistics is a pluralistic discipline at any rate…

All true what you say. And I for one appreciated learning about the STAN/semi-Bayesian (is that a fair description of STAN? – its not meant as pejorative) view of mixed models. I didn’t know anything about it.

Good topic.

In many ecological experiments, the blocks (be they spatial blocks, temporal blocks, or whatever) are neither fixed nor random. They aren’t randomly sampled from the population of interest (indeed, the “population of interest” often can’t be defined all that precisely). But nor does the experiment include all possible levels of the blocking factor, or all of the levels of interest; the experimenter doesn’t even much care about the specific levels of the blocking factor included in the experiment. So one has to make a judgment call; you have to jam the square peg of your experiment into one of two round holes.

“They aren’t randomly sampled from the population of interest (indeed, the “population of interest” often can’t be defined all that precisely).”

According to Hurlbert (and in general good experimental design) they probably should be randomly sampled from the population of interest – but you’re of course right that they rarely are.

I’m with Chris on this. I found the whole fixed vs. random effects thing murky and confusing until I started reading Gelman and Hill with a study group. It turns out that the terminology is different across disciplines (social scientists do not use the term ‘random effect’), and I personally think the terminology ‘random’ is both misleading and confusing, as it doesn’t have anything to do with mathematical random variables. I find it much easier to think of the so-called ‘random’ variables as ‘grouping variables’. And when I think about the math, I think about partial (and no and full) pooling as an easy way to think about where the variation is being partitioned. Once I realized that ‘mixed-effect’ models are really just hierarchical models, it all became a lot clearer to me. Maybe we shouldn’t teach ‘mixed-effects’ modesl and instead should teach hierarchical models.

Also, I think I’m in Gelman’s camp on including variables: include everything, don’t do model selection, and learn what you can from your model. But I’ll caveat it with: it depends on the question you’re asking. Maybe I’ll write a blog post on this one day…

I agree – thinking about random effects as grouping is a very productive way to think about them. Not sure I agree about having nothing to do with random variables – as Chris pointed out they are exactly modelled as effect sizes as random samples from a N(0,sigma) random variable.

Your/Gelman’s camp of include everything and stop (i.e. no model selection) is my second favorite choice. I still prefer think carefully and do a simple model related to your question and stop. But I think both of those approaches have more in common than not when compared to the Crawley/Zuur approach of model selection and moving things in and out at will until you’re “happy”.

I agree with Margaret (and Chris) that partial pooling is almost always the way to go. Taking a hierarchical modeling perspective makes this problem more intuitive, IMO.

Though I also agree with Brian that building up from a simple model is usually safer and potentially more informative than starting off with the full Cadillac version. I think Gelman would argue that approach also, though he in practice would probably start with the most complex model that worked and only simplify when there are problems. There is a figure in Gelman and Hill (p. 416) illustrating a type of debugging scenario when one has trouble fitting a complex model, which applies to many situations beyond random effects.

Oh, and I’m with you on all of this Brian. Especially on only including one obvious blocking factor in most situations.

I second Margaret and Chris’ comments. I’ll just add that for me this discussion of fixed v. random should include an understanding how shrinkage towards the mean operates on group (or “random”) effects in a hierarchical model. Whether or not I think that shrinkage is appropriate often drives my decision on the modeling approach.

^ A key concept. Also, prediction differs between the two approaches. Both methods can predict future observations for observed groups. But, with “random” effect models, predictions can also be made for future groups using the parameter estimates for the sampling distribution of group level parameters.

I have heard the following:

a) If you have all levels of the factor in your design, then it is a fixed effect

b) If you have some levels of the factor missing, then it is a random effect

Is this a reasonable approach?

a is correct. A standard example in biology is studying differences between males and females in a species where those are the only two sexes. That’s a fixed effect.

Not quite sure I understand b, but on my provisional understanding I don’t think it’s correct. For instance, consider studying differences between males and females in a species that also has hermaphrodites. There are lots of reasons why the male-female difference might be of interest biologically, and it would be a fixed effect. Both because the difference between those two specific levels of the “sex” factor is of interest to you, and because as a practical matter you don’t want to be estimating a variance from only two levels. It’s a fixed effect even though you’re leaving out the “hermaphrodite” level of the “sex” factor.

I realize this is a discussion of statistics, ….but it turns out biologically that if you want to understand almost anything interesting that involves males, females and hermaphrodites in the same population , you will want to compare the males reproductive actions to the male function of the hermaphrodite, and ditto for the female side. Your statistics better be able to do this.

This is the old biometry advice, but it is incorrect. The design is irrelevant to whether you model an effect as fixed or random. The question is one of regularization. Practical considerations also matter. But it is definitely irrelevant whether or not some levels are omitted, because even when all levels are present, you will do better by regularizing (shrinking) the estimates. This will provide a better estimate of both the individual effects and the average effect. Page 20 of this Gelman paper has a nice list of some of the inconsistent ways that fixed and random are defined: http://www.stat.columbia.edu/~gelman/research/published/AOS259.pdf

Just wanted to add my 2 cents that I don’t think this definition is terrible in the linear mixed model context in my mind when the blog was written. There are some very different perspectives on modelling out there. I don’t think it is correct to say one is innately better than the other. They both seem useful to me.

“…so they think it has to be a random variable and cannot figure out the syntax to force R to treat a continuous variable as random (hint you can’t).”

YES!!! The number of times I’ve had to explain this…

I ran into an interesting situation recently that I’d love your opinion on. A colleague has been talking about including time (e.g., Year) as both a continuous effect *and* a random effect. The logic is they are looking for a temporal trend, but also acknowledge that the value might just vary randomly from year to year as well. So, something like:

population ~ 1 + year + (1|year)

From a model perspective, I see what they are doing. There’s both a long-term temporal trend and random variation from year to year. Year is just a label we put on a thing, and it’s treated here as both categorical and continuous. But this also seems very very fishy to me. In practice, all it does is widen the confidence interval on the temporal trend coefficient. The estimate does not change. I’ve tried to search for a reference that explains that this is either a valid or invalid model, but to no avail.

Thoughts?

I think your example raises a number of interesting points.

1) Grouping by a continuous variable. Year is of course special because you probably do only have 2 or 3 values, but to make a generic point I see people do y~something+1|mass – this will run, but it will treat every unique value of mass as a separate group – almost certainly not what was intended. In your example of year it probably will sort of work as intended – that is to say give a slope for a trend vs year and a random grouping by year as a sort of blocking

2) The issue I see in what you describe is that once you remove a variable intercept (group mean) for each year, how can you calculate a slope for year also? In my experience this will still run, mostly because there is still variance to run a regression vs year through after removing the average value for each year (assuming you have more than one observation per year), but that slope is likely to be close to zero.

3) Even more generally I see people often try to include things as a fixed and a random effect. This is really not that different from trying to put a variable into a model twice – you have two highly collinear variables so the question of which variable to assign the effect to is ill-defined and the resulting estimates likely arbitrary (even if it will often run for the same reasons as it does in #2).

I’d be curious to see if somebody who knows more than I do disagrees.

What I do think might be valid is to fit something like:

population~1+days+1|year where days is something like days since start of experiment (i.e. goes from 0 to 1000 over most of a 3 year period). This is effectively just a hierarchical model. Anybody else have an opinion on this?

The format:

population ~ 1 + year + (1|year)

is actually valid; think of the linear term (the first mention of year) as the trend, and the random effect models non-linear but consistent year effects around this trend. It should come out with consistent estimates (although it’s not how I would recommend fitting a non-linear year effect…). It should be similar to decomposing a time series into trend and random components.

It’s analogous to if you had multiple sites situated along a moisture gradient, so each site has a distinct and increasing level of moisture, fitting the following model:

population ~ 1 + moisture + (1|site).

Both models will return valid fits. Note that the fixed effects forms of these models(population ~ 1+moisture +site or population~ 1+ year + factor(year)) will not return stable estimates, as year and factor(year) will be co-linear. The hierarchical model structure restricts how much variability allowed for “site” or “factor(year)”, allowing you to fit both trend and group-level effects.

If you have a more continuous variable (say you had day of year for each point, for instance), you couldn’t model it as a random effect.

Hi Eric – thanks for weighing in. What you say makes a lot of sense if the fixed effects are estimated before the random effects. I take it you believe that is true? My understanding is admittedly limited or at the very least dated. I thought the estimation algorithm typically swapped back and forth between a fixed effects step conditional on the last estimate of the random effects then a random effects step conditional on the last fixed effects estimate and repeat, which wouldn’t lead to this kind of precedence. But I could *very* easily be wrong. Could you describe in more detail why you think the fixed effects take precedence over random effects? Thanks! I love all the things I learn in the comments.

Hi Brian,

The fixed effects and random effects are estimated simultaneously (or by iterative fitting); it’s the presence of the penalty term (the standard deviation of the random effect) that allows it to fit accurately. Basically, the algorithm will keep squeezing the random effects smaller and smaller, as long as all the differences between years fall on a straight line. If all the inter-year variability falls on a straight line, lme4 will shrink the random effects standard deviation to zero.

In a way, the fixed effect slope does take precedence, because it’s not penalized. As there’s no penalty term reducing the value of the year slope, it can take whatever value is needed to fit the year trend. Then the fitting procedure will shrink the random effects so they represent the best fitting deviations from this line.

Thanks! that makes a lot of sense. Not to take too much advantage of your knowledge, but I think what you say about penalization applies to lme4 but I think maybe not to nlme? Have I got that right?

I’ve worked less with nlme than with lme4, but from what I understand, they should both work that way. Basically, any random effect model works by penalizing the size of the random coefficients, with a penalty term chosen based on goodness of fit. There’s a very close connection between L2-regularization (ridge regression) and random effects modelling (although don’t ask me to show you the math… that’s above my pay grade).

nlme and lme4 use different algorithms for estimating their random effects (and can do different things using them), but they should give very similar results for normally distributed single-level hierarchical models like we’re talking about here.

Looks my understanding of how nlme/lme is fitting is wrong. You are right. Both lme and lmer work fairly well in the scenario Jarret gave. Here’s some R code to create a dummy dataset. They both estimate the fixed and random effects pretty well.

`set.seed(11213)`

#data shape noreps replications per year for noyrs years

noreps=5

noyrs=5

#parameters fixed effects y=a+bx, random year effect sig, noise esig

a=5

b=2

sig=2

esig=1

yreffect=rep(rnorm(noyrs,0,sig),each=noreps)

yr=rep(1992:2001-1992,each=noreps)

y=a+b*yr+yreffect+rnorm(noreps*noyrs,0,esig)

df=data.frame(y=y,yr=yr)

m0=lm(y~yr,data=df)

#summary(m0)

coef(m0)

require(nlme)

m1=lme(y~yr,data=df,random=~1|yr)

#summary(m1)

fixef(m1)

VarCorr(m1)

`require(lme4)`

m2=lmer(y~yr+(1|yr),data=df)

#summary(m2)

fixef(m2)

VarCorr(m2)

Ha, it’s funny to see the code, because I was working on something similar when I saw your demo…

Following your point, I also wanted to reply to jebyrnes’ point that this structure widens the confidence interval for the estimate of the year effect; I see that as a feature, not a bug. For why, see the code below; basically, if you have consistent year-to-year differences and enough observations per year, a straight linear regression will give you arbitrary significance for a linear trend, even if there’s no trend by construction. The hierarchical construction allows the model to recognize that most of that variation is idiosyncratic year-to-year differences, and that there’s no confidence in a directional trend. Basically, the hierarchical regression “recognizes” that it only has 6 independent data points to estimate the year slope, and won’t over-estimate the confidence in the slope no matter how many distinct within-year obs. you have.

set.seed(11213)

library(lme4)

#generate 6 years of data,

#w/ 10,000 data points per year

noreps = 10000

noyrs = 6

b = 0.0

sig = 0.1

esig =0.5

yr_vals = rnorm(noyrs, b*(1:noyrs), sig)

reg_data = as.data.frame(expand.grid(yr = 1:noyrs,

indiv = 1:noreps))

reg_data$obs = rnorm(noreps*noyrs, yr_vals[reg_data$yr],esig)

#fit a linear model and a hierarchical model

lm_fit = summary(lm(obs~yr, data=reg_data))$coefficients

lmer_fit = summary(lmer(obs~yr+(1|yr),

data=reg_data))$coefficients

#print coefs and standard errors

print(lm_fit)

print(lmer_fit)

Awesome! I definitely learned something new on this. Thank you Eric.

And just want to add one more thing – the year example works because it is treated as continuous in one place (fixed) and discrete in the other (random). I just went back and verified that my first response to Jarrett about putting a discrete factor/grouping in as both random and fixed (which I see people try to do) doesn’t estimate very well. As indeed I don’t think it could – they are essentially highly collinear at that point.

You can see this by adding the following code to the end of my example:

`require(lme4)`

m3=lmer(y~as.factor(yr)+(1|yr),data=df)

#summary(m2)

fixef(m3)

VarCorr(m3)

Thanks for the demo code and explanations Eric! I confess I don’t know much about the MLE/frequentist approach to random effects, so it’s always interesting to get a concise summary.

It seems that one way it’s different is that with full Bayes/MCMC there’s no ‘privileging’ the fixed effects, for lack of a better term. This means the random effect variance cannot be shrunk to zero, and that’s often considered a good thing in settings where literally zero makes no sense. The other distinction is that hyperprior uncertainty is propagated in full Bayes, whereas the point estimate of group-level variance is used in MLE/frequentist approach, for instance as the penalty term you described.

Chris

hmm, in the extreme that there is no year trend, then this model would reduce to

population = int + si + ei

where si ~ N(0, sigma.year)

and ei is error variance component.

Clearly no problem here.

On the other end if there’s a strong trend, the random year estimates si will be centered around the group mean given by

int + b.year*year[i]

I think the key statistical issue here is having sufficient # of observations per year to get a decent estimate for your si. In the extreme of only one observation per year, you wouldn’t be able to decompose error variance from the random effect. With several to many observations per year, that shouldn’t be a problem.

So it seems that model should be valid and work, but I may be missing something.

Chris

For those following this question avidly, be sure to check out Florian’s point#4 and link further below

So you don’t actually need to specify a different variable as fixed and random for this to work? lmer does it for you? E.g., if “year” is numeric (continuous), I was thinking you might need to do something like

yr1<-as.factor(year)

and then feed each variable (one continuous, one random) into the model specification, but looking at the example code in both places in this post, perhaps not.

Fantastic post, by the way.

lmer does it for you. Here’s the thing many don’t realize is that the anytime you use (1|x) in lmer it is basically like there is an as.factor() wrapped around x – lmer (and all linear mixed models) don’t know how to do anything except with a discrete variable so it is treated like a discrete variable whether it is or not. That’s why something like (1|mass) doesn’t give an answer most people want – it just uses every unique value of body mass as a group.

I’d like to give another vote to the “Gelman” perspective. If you want better estimates, use random effects. It’s a form of regularization. I have a forthcoming book that has a more approachable explanation: http://xcelab.net/rm/statistical-rethinking/. See Chapter 12.

To add something new to the discussion, it is indeed possible to treat continuous variable as random effects. These are known as Gaussian processes, and phylogenetic regressions are an example. Well, they are a simplified example. In principle, any continuous variable like patristic distance or age or size or kinship or agreement on movie ratings can be used in a shrinkage analysis. Gaussian processes accomplish this in a particular way. They are very popular in machine learning, but I haven’t seen them much in biology, outside of phylogenetic models. An important exception is this recent paper: http://arxiv.org/abs/1412.8081

Interesting to hear so many votes for the Gelman approach. In general, I agree that regularization in general is an underappreciated approach in ecology. So many ecologists are happy to treat as estimation as a black box but a perfect one with perfect estimates that the idea of regularization=pushing estimates off MLE/OLS for any reason is abhorent. But I think regularization is a very realistic solution to the problems that plague ecological data, including especially covariation.

Thanks for bringing up type II regression in a Gelman context.

Dynamic Ecology: come for the blog posts, stay for the R code…

Hi, a few comments:

1) Regardless of random or fixed effects, the group-level effects are always estimated, for mixed models they are just not displayed by default. You can get then in lme4 by typing

fm1 <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy)

ranef(fm1)

Hence, the mixed model above estimates the same number of variables as the fixed effect version

lm(Reaction ~ Days * Subject, sleepstudy)

2) The only difference between the two models is that the mixed version makes the assumptions that the group-level parameters (those reported in ranef(fm1) for subject) come from a normal distribution with variance sigma that is estimated as a hierarchical model. Hence, in the random version the estimates for each subject are dependent on each other, unlike in the fixed model version. The connection via the normal distribution leads to the shrinkage that was mentioned by someone else above – if you compare the fixed and random effect version of the same model, the random effects will typically be more strongly centered around the mean. An nice visualization of that is here

http://mbjoseph.github.io/2015/01/20/shrink.html

3) Fixed and random – it is your choice to treat a parameter estimate as nuisance or not. Specifically, you can interpret random effect estimates biologically, if you consider the underlying model biologically meaningful.

4) Ben Bolker and I have answered the year fixed / random question here https://biologyforfun.wordpress.com/2015/08/31/two-little-annoying-stats-detail/#comment-324

I should add that the shrinkage caused by connecting the random effect estimates via this normal distribution is why we do the mixed models – as a result, we loose less degrees of freedom and can add more variables in the model. How much less depends on the variance of the random effects. The fact that the degrees of freedoms are not fixed but variable leads to all these issues when trying to estimate CIs and the p-values.

Although I am very familiar with the idea of regularization generally, its application in mixed models is new to me. This is great.

Thanks Florian. All good points. And some great specifics and links to back them up.

Glad to see you and Ben confirmed quickly what I got to slowly to with Eric’s help on the year question. Namely it works when it is treated as a continuous fixed effect and categorical random effect (which is a bit of a special case but definitely true for year), but not when it is treated as categorical for both fixed and random (which I have seen many people try to do).

I think the reason for this is, as you say, that many people teach mixed models as a) fixed effects are the interesting variables, and b) random effects the nuisance variables.

I suppose this originates from the application of random effects for dealing with blocking in experimental designs, but if taught as a general rule it does not only lead to the issue with continuous nuisance variables you point to here, but it also prevents people from using random effects for things they might be interested in, e.g. species-specific or intra-specific variability of an effect , where you could fit models such as

lmer(Response ~ EcologicalFactor + ( EcologicalFactor | Species))

and still interpret the species-specific effects.

Do we need to be concerned with correlation between the fixed and random effect? Would high correlation give us less confidence in our coefficient estimates?

There are a 2 comments bringing up this concern within the discussion that Florian posted:

biologyforfun.wordpress.com/2015/08/31/two-little-annoying-stats-detail/#comment-324

In Eric’s example below we would expect moisture and site to be highly correlated. Would that be a valid model if the goal were to explain/interpret the effect of moisture on the population?

Eric Pedersen

on November 4, 2015 at 10:22 pm said:

—————————————————

It’s analogous to if you had multiple sites situated along a moisture gradient, so each site has a distinct and increasing level of moisture, fitting the following model:

population ~ 1 + moisture + (1|site).

Learned a lot from this post and the comments – thanks!

Hi, Florian so far gives the best description.

The “fixed/random” and “of interest/not of interest” dichotomies are not necessarily wrong, but are redundant and confusing. It is much easier to consider all models to be hierarchical, with the ordinary lm() just having one hierarchical level. If you want to go more complex, just draw whatever parameters from whatever distributions you consider reasonable. Just like a construction set.

I propose that this is how it should be taught in any applied stats course, or even in intro courses. The best way to do that is to ditch the classical jargon, and jump straight into teaching likelihood. I am always surprised how (relatively) smooth this is, and how students mostly get it. Because writing likelihood functions, and/or writing models in JAGS/STAN, is actually easier than memorizing set of guidelines, models, tests and buzzwords, or R model formula syntax.

Hi Petr,

I certainly agree with you about putting likelihood front and center. However, having taught stats to a lot of grad students, I have to say it depends on you math background. Many students still find memorizing rules of thumb more accessible than working through complex likelihoods (or simple likelihoods).

I am indeed exaggerating a bit — the best way is probably to combine a healthy dosage of recipes with a not-too-formula-heavy likelihood theory and a bunch of practical examples. But I’d still argue that you can teach likelihood without that much math. It is the beauty of having pnorm(), ppois(), pbinom() and alike (or JAGS for that matter) — they do the tedious math for you, yet you still get a firm grip over the exact meaning. And once a student understands that data are draws from distributions, and parameters of these distributions can be drawn from distributions, and so on, it is like a statistical satori, mixed-effect models are no longer a mystery and it all clicks together. I remember about half a dozen of such cases of profound instant understanding, it can be very formative experience for the student, and an extremely rewarding moment for the teacher.

Very interesting discussion and thoughts. But I wonder how declaring an effect as random instead of fixed should increase the degrees of freedom “left over” for the error variance. You correctly stated that the random effects are estimated, but hidden in the table. The degrees of freedom for these effects go into the estimation of the variance of the random factor. Although this is just a single parameter, it can eat a lot of degrees of freedom. These guys are just greedy.

Ute, it’s not as easy as that – basically, making a fixed effect random reduces the degrees of freedom of the model, because you make an additional restriction on the parameters. How much the RE parameters are restricted depends on the estimated variance, i.e. the degrees of freedom of a mixed / hierarchical model are not fixed. This is also the reason why doing MS / tests of RE structures is so problematic.

See, e.g., http://biomet.oxfordjournals.org/content/88/2/367.abstract

Coming late to this, and I have NOT read everything above, but I reckon that the fixed-random distinction boils down to this:

The random effects assumption is that the intercepts (or slopes) for different factor levels are independently drawn from some specific family of probability distribution (often normal distributions in practice, but this is not necessary). If that assumption is valid, and if the factor has enough levels to plausibly estimate the parameters of the distribution, then the random effects model is more efficient and is likely to produce better estimates not only of the population mean, but also of the individual intercepts (or slopes) (this is Bayesian shrinkage in a nutshell). Of course, if the random effects assumption is violated, then the model cannot be trusted to accurately represent reality. IMO, nothing else matters. If you’re willing to argue that the random effects assumption is valid, use a random effect. If you’re not, use (potentially inefficient) fixed effects.

The rub is that it’s difficult and non-standard to test a posteriori the validity of the random effects assumption–you can’t just plot the residuals and look for normality and homogeneity of variance, for example, which is how we check assumptions for simple linear models.

Hi Jacob,

That’s an interesting comment, but I don’t think I agree with your take on this; a random effects model should still tend to make as accurate or more accurate (as measured by mean squared error) estimates of group-level parameters than the fixed effect model as long as you have a relatively high number of groups.

The assumption about the distribution of random effects around the mean is the last assumption you should be worrying about when fitting a model (as with the assumption about the distribution of residuals in standard linear regression), as it’s violation will have very little effect on the coefficients that the model estimates for the random effects. If the coefficients really don’t lie near a normal distribution, a decent estimation algorithm will just inflate the st. dev. of the random effects to infinity, which will just give you your fixed effects estimates.

Now, if you’re trying to make predictions about new groups, then the distributional assumptions will start to become more important; However, switching to a fixed effect model won’t help there, as you just won’t be able to make predictions for new groups at all.

Thanks for this comment. I tentatively trust you that the random effects model is guaranteed to give more accurate inference about the individual groups, though I haven’t fully convinced myself. And of course with any model, some degree of violation of assumptions is expected and acceptable.

But I do think that severe violations of the distributional assumptions of the random effect lead to bad inference. This can happen *even if the random effect improves model accuracy* if the random-effects model overestimates its precision. If you generate data under some non-gaussian process, then assume that the data are normal, and then estimate the mean and variance, the variance does NOT tend to infinity, and you can get over-confident inference about the mean. See r-code pasted at the end of this post.

When it comes to the individual group-level intercepts (or slopes), there is also an issue. When each group has a lot of data, it doesn’t matter much, because the random-effects assumption won’t pull the group-level estimates around much. But consider a group with only a few data points. If the actual distribution of group means (or slopes) is non-normal, a normal random effect will shrink the estimate by the wrong amount–either too far or not far enough–and estimate a confidence interval (or credible interval) that is too narrow, because the model doesn’t understand that the estimate has been shrunk by the wrong amount.

Lastly, and most simply, note that if you fit models with different probability distributions for the random effect to the same data, they give different parameter estimates!

The r-code below generates data under a log-normal distribution with arithmetic mean 0, then uses lm() to generate a p-value for a test of the hypothesis that the mean is nonzero under the assumption of normality, and iterates the procedure 5000 times. At an alpha level of 0.1, the probability of type I error is nearly doubled at a sample size of 20, and inflated by ~15% at a sample size of 200.

m.test <- function(samp.size){

p <- rep(NA, 5000)

for(i in 1:5000){

n <- rlnorm(samp.size)-exp(0.5)

# logrormal with location 0 and scale 1 has arithmetic mean equal to exp(0.5)

# so n has mean zero.

m <- lm(n~1)

p[i] <- summary(m)$coefficients[,4]

# Estimate mean of p, and extract the p-value that the mean differs from zero

}

return(p)

}

p <- m.test(20)

hist(p)

# Check the distribution of p-values, and note that the model concludes that

# the mean is nonzero far too often.

p<- m.test(200)

hist(p)

# At sample-size 200 the issue is ameliorated, but not completely. There still

# seems to be a ~15% excess in p-values less than 0.1

The fact that a random effect model will give (almost always) lower mean-square-error predictions of the random effects than an equivalent fixed effect model is due to Stein’s paradox: when you have more than 2 groups (or parameters in general) there is always an estimator that has lower MSE than estimating each parameter individually; that is, pooling is always better at reducing your overall error (see this article for a great description of Stein’s paradox: http://statweb.stanford.edu/~ckirby/brad/other/Article1977.pdf)

I wasn’t saying that distributional effects aren’t something to check; just that they should be low down on the list of issues (below issues like “is my data non-linear” or spatial dependencies). In the situation you set up, violation of the distributional assumptions will make your p-values mis-estimate; in the case you use, though, a non-parametric bootstrap will correct for that. Generally, distributional violations will skew confidence intervals, and cause problems with prediction, but not affect bias of an estimator.

Also, while I’m not an expert on the math, it seems like the algorithm lme4 at least uses is pretty robust to non-normal random effects. I’ve posted an example before that shows that lme4, in the case of many groups with few data points each, gives lower total error for group coefficients while still having (approximately) accurate coverage for the 95% CIs for the group-level coefficients. So even with pretty strongly non-normal group level coefs, lme4 gives you less overall error and good coverage.

#R code for simulations

library(lme4)

library(arm)

calc_rmse = function(true, est) sqrt(mean((true-est)^2))

#generate 6 years of data,

#w/ 10,000 data points per year

noreps = 5

nogrps = 100

b = 0.0

grp_sig = 2

resid_sig =0.5

#generate group coefs from a log-normal dist.

grp_vals = rlnorm(nogrps,0,grp_sig)

reg_data = as.data.frame(expand.grid(grp = 1:nogrps,

indiv = 1:noreps))

reg_data$obs = rnorm(noreps*nogrps, grp_vals[reg_data$grp],resid_sig)

#fit a linear model and a hierarchical model

lm_fit = lm(obs~factor(grp)+0, data=reg_data)

lmer_fit = lmer(obs~(1|grp),data=reg_data)

lm_coef = coef(lm_fit)

lmer_coef = coef(lmer_fit)$grp[,1]

#calculates and prints the ratio of RMSE of lmer model to lm model.

#Smaller numbers mean lmer fits better

lm_rmse = calc_rmse(grp_vals,lm_coef)

lmer_rmse = calc_rmse(grp_vals,lmer_coef)

print(lmer_rmse/lm_rmse)

#Calculates standard errors of group coefs

#Determines if the diff between obs. and true random effects falls in the 95% CI

#with the expected level of coverage (95% of the time)

lmer_conf = se.ranef(lmer_fit)$grp[,1]

is_in_conf = abs(lmer_coef-grp_vals)<(1.96*lmer_conf)

print(mean(is_in_conf))

That paper on Stein’s example is great, thanks for sharing it!

I’m really late to this, but I don’t think anyone has brought up genotype. That is where the main debate is in the sorts of experiments I do — we often do experiments with multiple genotypes, where “multiple” ranges from 2 to dozens. When we’re on the dozens end of the spectrum, I think it makes sense to treat them as a random effect. And, with 2-3, a fixed effect seems to make more sense. Where the switch happens, though, is not clear. Reviewers often have opinions on this, but, fortunately, I’ve never had a case where it affected the qualitative results. (At least, not that I can recall.)

Oh, and I also learned the distinction that you give at the beginning from the statistical dark ages.

Hi, Thank you for the helpful discussion above. I am still having some issues understanding when to include a Random effect. I’m potentially at flow-chart point A) Can I be talked out of using the variable.

As an example of my data/study design: I am looking at how size of fish (continuous) changes with depth (also continuous) across rocky reefs in California. I also have some nominal habitat variables that I could include in my model. Let’s say I sampled fish at 10 sites across California.

It seems like I could include site as a fixed effect and get s1,s2,s3… etc, or include site as a random effect to get one variance and generalize the relationship between fish size and depth of capture across California.

Or (and here is the real question), could I not just forget about the sites collected altogether, pool the data, and then look at one overall relationship (regression) between size and depth? Is it ever more appropriate to take this latter approach- or do I really need to account for the sites where my fish were collected specifically as some factor in a model.

Hi Ryan – if you have multiple observations at one site (as I think I understand you do), then you will almost certainly get critiqued by reviewers of pseudoreplication. That is to calculating p-values under the assumption that the errors are independent when they’re not (errors within a site have some correlation). This is anti-conservative meaning you will get an artificially low p-value. For that reason it is probably expected you include site.

How serious the problem of pseudoreplication is really is a matter of debate. And if you’re not calculating p-values it technically doesn’t matter although many don’t recognize that. Bottom line it might be good statistical judgement to pool (and it might not – I would need to know more), but it will make your life hard with reviewers so I think you have to include site.

This is one of the most interesting discussion I’ve ever seen about this topic – thanks everyone for enlightening me.

I do have a question: I am trying to fit a discrete population growth model where I combine data from multiple populations repeatedly sampled through time. The populations can be partitioned into several groups (meta-populations), so I am fitting this is a mixed effects model with random effects for population ID nested within meta-population ID. The problem is that I have substantial temporal autocorrelation at the meta-population level (but not at the population level) that is retained in the residuals. I have tried fitting a correlation structure in lme, but it is fitted only within the lowest hierarchical level (the population level), which misses the point here. My question is, what are people’s thoughts about adding sampling time as a random effect? It works great (in terms of model performance and residual distributions), but the resulting BLUPs are obviously autocorrelated, which seem to me a violation of the underlying assumption that the random effects are i.i.d. Than again, how is that different from using year as a random effect in the presence of temporal autocorrelation?

This is a very old school response, not at all moderny statistics. But sometimes, if you have enough data and it is compatible with your question aggregating data (summing or averaging) up to a higher level (e.g. metapopulation in your case) can be a really good way to eliminate pseudoreplication, and get data of a simple enough structure now you can bring in an AR1 correlation structure if needed.

Thanks Brian, but I need to maintain the individual populations because I am after the local environmental attributes (fixed effects) that affect growth rate, so that’s not really an option

Pingback: mixed effect models | social by selection