I serve on a lot of graduate committees. I also end up as statistical consultant for a lot of students and faculty. So I see a lot of people thinking through their models and statistical approaches.
One trend I am noticing is that most people are staying within the linear framework (e.g. not GAMs or regression trees), but their models are becoming increasingly complex. That is to say, more and more terms are being included, and more and more of them are what I would call blocking/nuisance terms. I’m not talking about “big data” or exploratory or data-mining approaches where people have dozens of potential variables and no clue which to use.
I’m talking about traditional field experiments, behavioral/physiological observations of individuals, or small-scale observational studies. And I’m noticing that in the dark ages (i.e. when I was in graduate school, 11 years ago) there would be a 2- or at most 3-way ANOVA with maybe one or two interaction terms. Now everybody is running multivariate regressions or, more often, mixed-effect models. And there are often 4-5 fixed effects, 3 or 4 random effects (often with nesting), many more interaction terms, and sometimes people even want/try to look at interaction terms among random effects (an intensely debated topic I am not going to weigh in on).
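To make the contrast concrete, here is a sketch in R of the shift I am describing. The data and variable names are entirely hypothetical, purely for illustration:

```r
# Illustrative only -- dat, growth, treatment, etc. are made-up names.
library(lme4)

# The dark-ages version: a 2-way ANOVA with one interaction.
m_old <- aov(growth ~ treatment * watering, data = dat)

# The kind of model I increasingly see: several fixed effects with
# interactions, plus multiple (nested and crossed) random effects.
m_new <- lmer(growth ~ treatment * watering * soil + temperature + density +
                (1 | year) + (1 | site/plot) + (1 | observer),
              data = dat)
```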
As one example – in the past I almost never saw people who collected data over two or three years (i.e. virtually all PhD programs and grants) include year as an explanatory factor (fixed or random) unless there was really extreme variability that got turned into a hypothesis (e.g. El Niño vs La Niña years, which happened not infrequently in Arizona). Now everybody throws in year as an explanatory factor even when they don’t think there was meaningful year-to-year variability.
And for what it’s worth, putting even two crossed (as opposed to nested) random factors into the lme command in the nlme R package was somewhat arcane and of debatable advisability, while crossed random effects are easily incorporated in the newer lmer command in lme4. So it might just be evolving software, but I don’t really believe software capacity alone is driving this, because I’m also seeing the number of fixed factors going up, and I never used to hear people complaining about how hard it was to include two crossed random factors in lme. But it does show that the complexity of models has gone up, since the models I see as commonplace today weren’t even well supported by the software 3-5 years ago.
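For readers who haven’t wrestled with this distinction: the crossed vs. nested difference is purely a matter of the grouping structure, and lmer’s formula syntax handles both directly (again, the data and variable names here are hypothetical):

```r
# Illustrative only -- y, x, site, year, plot, dat are made-up names.
library(lme4)

# Crossed random effects (e.g. every site is observed in every year):
m_crossed <- lmer(y ~ x + (1 | site) + (1 | year), data = dat)

# Nested random effects (plot labels only have meaning within a site):
m_nested <- lmer(y ~ x + (1 | site/plot), data = dat)
```

The `random` argument of nlme’s lme, by contrast, assumes a nested grouping hierarchy, which (as I recall) is what forced the arcane workarounds for crossed designs.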
Now I am on record that the move to a multivariate regression framing instead of ANOVA is a very good thing. And I haven’t said it on this blog, but every time I teach mixed-effect models I say they’re one of the most important advances in statistics for ecology over the last couple of decades. So I’m not critiquing the modelling tools.
But I am suspicious of the marked increase in the number of factors, from approximately 2-3 with few interactions to 4-8 with many interactions (and again, this is not in an exploratory framework with dozens of variables and something like regression trees). I’m a notorious curmudgeon, suspicious of any increase in statistical complexity that is not strongly justified by a change in our ECOLOGICAL (not statistical) interpretations. But I’m clearly out of the mainstream. And although I can say some particular practices or motives around complex models are wrong, I cannot say that more complex models in general are wrong. So maybe I’m missing something here.
So please enlighten me by taking the poll below on why you think models have become more complex over the last 10 years. You can check up to six boxes, but checking 2-4 is probably more informative.
(I am going to offer my own opinions in a future blog post, but I don’t want to bias the poll because I am genuinely curious about what is driving this phenomenon – and by the same token I’m not going to be active in the comments on this post, but I hope you are.)
Do you have any good examples where your ecological understanding was greatly increased by having 4-8 factors instead of 2-3? Do you have an example of a killer interpretation of 4 factors in one model? Do you think you’re still in a hypothesis-testing framework when you have, for example, 5 fixed factors and 3 random factors? What about if you’ve done some model comparisons to get from 5 fixed/3 random down to 3 fixed/2 random?