I serve on a lot of graduate committees. I also end up as a statistical consultant for a lot of students and faculty. So I see a lot of people thinking through their models and statistical approaches.
One trend I am noticing is that most people are staying within the linear framework (e.g. not GAM or regression tree), but their models are becoming increasingly complex. That is to say more and more terms are being included. And they are more and more of what I would call blocking/nuisance terms. I’m not talking about “big data” or exploratory or data mining approaches where people have dozens of potential variables and no clue which to use.
I’m talking about traditional field experiments, behavioral/physiological observations of individuals, or small-scale observational studies. In the dark ages (= when I was in graduate school = 11 years ago) there would be a 2-way or at most 3-way ANOVA with maybe one or two interaction terms. Now everybody is running multivariate regressions or, more often, mixed effect models. There are often 4-5 fixed effects, 3 or 4 random effects (often with nesting) and many more interaction terms. Sometimes people even want to look at interaction terms among random effects (an intensely debated topic I am not going to weigh in on).
As one example – in the past I almost never saw people who collected data over two or three years (i.e. all PhD programs and grants) include year as an explanatory factor (fixed or random) unless there was really extreme variability that got turned into a hypothesis (e.g. El Nino vs La Nina which happened not infrequently in Arizona). Now everybody throws in year as an explanatory factor even when they don’t think there was meaningful year-to-year variability.
And for what it’s worth, putting even two crossed (as opposed to nested) random factors into the lme command in the nlme R package was somewhat arcane and of mixed recommendability, while crossed random effects are easily incorporated in the newer lmer command in lme4. So it might just be evolving software. But I don’t really believe software capacity alone is driving this, because I’m also seeing the number of fixed factors going up, and I never used to hear people complaining about it being hard to include 2 crossed random factors in lme. But it does prove the complexity of models has gone up, since the models I see as commonplace today weren’t even supported by the software 3-5 years ago.
Now I am on record that the move to multivariate regression framing instead of ANOVA is a very good thing. And I haven’t said it in this blog but every time I teach mixed effect models I say they’re one of the most important advances in statistics for ecology over the last couple of decades. So I’m not critiquing the modelling tools.
But I am suspicious of the marked increase of the number of factors from approximately 2-3 with few interactions to 4-8 with many interactions (and again this is not in an exploratory framework with dozens of variables and something like regression trees). I’m a notorious curmudgeon, suspicious of any increase in statistical complexity that is not strongly justified in changing our ECOLOGICAL (not statistical) interpretations. But I’m clearly out of the mainstream. And although I can say some particular specific practices or motives around complex models are wrong, I cannot say that more complex models in general are wrong. So maybe I’m missing something here.
So please enlighten me by taking the below poll on why you think models have become more complex over the last 10 years. You can check up to six boxes but 2-4 is probably more informative.
(I am going to offer my own opinions in a future blog post but I don’t want to bias the poll because I am really genuinely curious about what’s driving this phenomenon – and by the same token I’m not going to be active in the comments on this post but hope you are).
Do you have any good examples where your ecological understanding was greatly increased by 4-8 factors instead of 2-3? Do you have an example of a killer interpretation of 4 factors in one model? Do you think you’re still in a hypothesis testing framework when you have, for example, 5 fixed factors and three random factors? What about if you’ve done some model comparisons to get from 5 fixed/3 random down to 3 fixed/2 random?
I’m in the “exploratory/predictive” framework using information theoretic approaches with observational data for most of what I do. I prefer BIC because it avoids the situation with lots of complex models performing very similarly — in my experience that makes ecological understanding worse, not better.
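To make the AIC/BIC contrast in the comment above concrete, here is a minimal sketch using the standard definitions AIC = 2k − 2 ln L and BIC = k ln(n) − 2 ln L. The log-likelihoods, parameter counts and sample size are hypothetical numbers chosen purely for illustration, not results from any real analysis.

```python
import math


def aic(log_lik, k):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * k - 2 * log_lik


def bic(log_lik, k, n):
    """Bayesian (Schwarz) information criterion: k ln(n) - 2 ln L."""
    return k * math.log(n) - 2 * log_lik


# Hypothetical values, for illustration only (not from a real data set).
n_obs = 200
simple_model = {"log_lik": -310.0, "k": 4}
complex_model = {"log_lik": -307.5, "k": 7}

# Positive differences mean the complex model scores worse than the simple one.
delta_aic = (aic(complex_model["log_lik"], complex_model["k"])
             - aic(simple_model["log_lik"], simple_model["k"]))
delta_bic = (bic(complex_model["log_lik"], complex_model["k"], n_obs)
             - bic(simple_model["log_lik"], simple_model["k"], n_obs))

print("delta AIC (complex - simple):", round(delta_aic, 3))
print("delta BIC (complex - simple):", round(delta_bic, 3))
```

Each extra parameter costs 2 units under AIC and ln(n) units under BIC, so BIC is the stricter criterion once n is 8 or more (ln 8 ≈ 2.08). With these made-up numbers the two models are nearly tied under AIC but clearly separated under BIC, which is one way a set of similar-scoring complex models can collapse to a simpler one.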
Out of interest, I recently compared backwards selection using likelihood ratio tests with BIC based model averaging. A reviewer had asked us to use backwards selection to a single model because it would shrink the confidence limits on our model averaged predictions. Surprise! Quite the opposite.
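For readers unfamiliar with BIC-based model averaging, the sketch below shows one common way to turn BIC values into model weights, w_i ∝ exp(−ΔBIC_i / 2), and to average predictions with them. The BIC values and predictions are hypothetical, and the sketch does not attempt to reproduce the confidence-limit comparison described in the comment.

```python
import math


def bic_weights(bic_values):
    """Convert BIC values into normalized weights, w_i ∝ exp(-0.5 * delta_i)."""
    best = min(bic_values)
    raw = [math.exp(-0.5 * (b - best)) for b in bic_values]
    total = sum(raw)
    return [r / total for r in raw]


def model_average(predictions, weights):
    """Weighted average of per-model predictions for a single new observation."""
    return sum(p * w for p, w in zip(predictions, weights))


# Hypothetical BIC values and predictions from three candidate models.
bics = [100.0, 102.0, 110.0]
preds = [3.0, 3.6, 5.0]

weights = bic_weights(bics)
averaged = model_average(preds, weights)

print("weights:", [round(w, 4) for w in weights])
print("model-averaged prediction:", round(averaged, 4))
```

Selecting a single model amounts to putting weight 1 on one entry and 0 on the rest, so any uncertainty about which model is best no longer enters the prediction.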
This is an interesting point. I agree with you that statistical models are getting more complex and the question is why this should be the case. One thing that I think has happened is that we are much more aware nowadays of the various sources of non-independence in our data and we are more likely to try to correct them using statistical techniques. As an example, fifteen years ago it would have been rare to see someone including block as an explanatory variable, or to try to take family into account in a data analysis that wasn’t specifically genetic. Nowadays it’s much easier to do this, and is that necessarily a bad thing? If “family” is a source of non-independence in an experiment, and we can include family as a random factor, then I don’t think that is a bad thing at all – surely it’s better than not doing so?
Another reason why people are fitting more complex models is that more people are using software that allows it. Don’t forget that until the mid to late 90s even straightforward generalised linear models were seen as quite esoteric and you could really only do them properly in the notoriously horrible GLIM. The use of mixed effects models has really only become commonplace over the last few years as R has become widely adopted.
I’m too young to offer much insight on why models are more complex now than they were 10 years ago. However, I can say that what I have been taught, both in undergrad and now grad school, has generally been to start with a more complex model and then to whittle down to a simpler model using some sort of criteria (often AIC).
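The "start complex, then whittle down" workflow described above can be sketched as a greedy backward elimination. This is a generic illustration, not any particular package's procedure: `backward_select` and `toy_aic` are hypothetical names, and the toy score pretends each term adds a fixed amount to the log-likelihood, which a real analysis would replace with an actual refit of the model at every step.

```python
def backward_select(terms, score):
    """Greedy backward elimination.

    terms: iterable of term names in the full model.
    score: callable mapping a frozenset of terms to a criterion value
           where lower is better (for example AIC).
    Returns (selected_terms, final_score, path).
    """
    current = frozenset(terms)
    current_score = score(current)
    path = [(current, current_score)]
    while current:
        # Score every model that drops exactly one term from the current one.
        candidates = [(score(current - {t}), t) for t in sorted(current)]
        best_score, drop = min(candidates)
        if best_score >= current_score:
            break  # no single deletion improves the criterion
        current = current - {drop}
        current_score = best_score
        path.append((current, current_score))
    return current, current_score, path


# Toy stand-in for model fitting (an assumption for illustration only):
# each term adds a fixed, made-up amount to the log-likelihood.
GAIN = {"treatment": 12.0, "site": 4.0, "year": 0.3, "block": 0.6}
BASE_LOG_LIK = -100.0


def toy_aic(term_set):
    log_lik = BASE_LOG_LIK + sum(GAIN[t] for t in term_set)
    k = len(term_set) + 2  # terms plus intercept and residual variance
    return 2 * k - 2 * log_lik


selected, final_aic, path = backward_select(GAIN, toy_aic)
for model, value in path:
    print(sorted(model), round(value, 2))
```

In this toy setup a term survives only if it raises the log-likelihood by more than 1 unit, which is exactly the AIC penalty of 2 per parameter, so "year" and "block" are dropped while "treatment" and "site" stay.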
Maybe ecological theory and our mathematical and conceptual models have become more complex by considering more environmental factors and drivers or many individual traits simultaneously. So people measure more stuff and need to put these data into their analyses. In that way, I think, the bar may have been raised.
Perhaps this is related to the fear of misspecification bias? I think many of these complex models may lie in the GLHT arena where there may be only a few model terms that are of interest for interpretation. In a non-logistic regression framework it seems perfectly reasonable to me to include anything that may have potential explanatory power in order to reduce the variance of your estimates and avoid misspecification bias.
Including more explanatory variables (even if they are the right ones) will actually increase the variance of estimates, but decrease their bias.
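A small simulation can illustrate the bias/variance trade-off raised in this exchange. The sketch below assumes two correlated predictors (correlation 0.8) that both truly affect the response, and compares the estimated coefficient on x1 when x2 is omitted versus included. All parameter values and function names are assumptions made for illustration. The variance inflation shown here comes from the correlation between predictors; with uncorrelated predictors, adding a relevant variable can instead reduce the variance of the estimate, so the two comments above describe different situations.

```python
import random
import statistics


def slope_short(x1, y):
    """OLS slope of y on x1 alone (x2 omitted)."""
    m1, my = sum(x1) / len(x1), sum(y) / len(y)
    s11 = sum((a - m1) ** 2 for a in x1)
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    return s1y / s11


def slope_full(x1, x2, y):
    """OLS coefficient on x1 when x2 is also in the model (closed form)."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    return (s22 * s1y - s12 * s2y) / (s11 * s22 - s12 ** 2)


def simulate(n=50, reps=2000, beta1=1.0, beta2=0.5, rho=0.8, seed=1):
    """Repeatedly simulate data and estimate beta1 with and without x2."""
    rng = random.Random(seed)
    short, full = [], []
    for _ in range(reps):
        x1 = [rng.gauss(0, 1) for _ in range(n)]
        # x2 has unit variance and correlation rho with x1.
        x2 = [rho * a + (1 - rho ** 2) ** 0.5 * rng.gauss(0, 1) for a in x1]
        y = [beta1 * a + beta2 * b + rng.gauss(0, 1) for a, b in zip(x1, x2)]
        short.append(slope_short(x1, y))
        full.append(slope_full(x1, x2, y))
    return short, full


short, full = simulate()
mean_short, sd_short = statistics.mean(short), statistics.stdev(short)
mean_full, sd_full = statistics.mean(full), statistics.stdev(full)
print("x2 omitted : mean", round(mean_short, 3), "sd", round(sd_short, 3))
print("x2 included: mean", round(mean_full, 3), "sd", round(sd_full, 3))
```

Under these assumptions the model that omits x2 is biased toward beta1 + beta2 × rho = 1.4 but has the tighter sampling distribution, while the model that includes x2 is centered on the true value of 1.0 with a wider spread.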
Your post seems to imply that most people are including terms in hopes that they will prove to be useful explanatory variables. I do not disagree that many people include terms for this reason. However, a different, perfectly valid reason to include variables such as family, individual or site is to account for study design. Due to many logistical constraints, field ecologists are often unable to carry out fully balanced, replicated, randomized studies. Including extra “nuisance” variables in the analysis to account for the actual study design is just good practice in a traditional hypothesis-testing framework. I have done this many times, actually losing degrees of freedom (i.e., taking a penalty) to account for potential carryover or site effects that did not exist.
I look forward to reading your follow-up post! Discussions like these are critical to improving the statistical education of ecologists. Including extra variables in a model should be done with careful forethought, not in the slim hope that throwing something extra at the model will suddenly shift your P-value of interest across that magical 0.05 boundary.
Definitely didn’t mean to imply a motive. I honestly don’t know what people’s motives are – hence the poll. All I know is that 10 years ago I didn’t see people include things like year or even site and now they do and I’m trying to figure out why.
Personally, I think it is a mixture of better stats education in ecology, more awareness of pseudoreplication, and pressure from reviewers. Access to powerful software that can fit these sophisticated models (e.g. R) has made it easier to implement as well, but I’m not convinced that is the main driver. I’ve often been approached by people wanting to fit certain types of models, but without any idea how they can do so. Something else is pushing them to incorporate random or nested effects in their models.
Admittedly, this is a self-serving plug, but this feeling of a trend towards ever increasing complexity of the statistics presented in articles or requested by reviewers was the starting point of the research we describe in http://www.esajournals.org/doi/abs/10.1890/130230 (I appreciate that this article has already been mentioned in this blog). Though we could not discern between an increase in the number of models being fit and a rise in the number of variables being included in any given model, we did find a large increase in the number of p-values reported, which plateaued at 10.7 p-values/article or 5.2 p-values/article/author. In view of this post, the plateau or even recent decline in average reported p-values per article could be attributed to a decrease in null hypothesis testing, while the number of variables included in models may still be rising unabated but simply no longer accompanied by p-values. In discussing these and other trends, Caroline Tucker at the EEB & Flow suggests the need for a clear definition of success for ecology: http://evol-eco.blogspot.co.uk/2014/08/researching-ecological-research.html.
Your paper did cross my mind while writing this post. And I agree that your rigorous but non-specific finding of an increasing number of p-values is very consistent with, and suggestive of, my purely intuitive but specific hypothesis of increasing numbers of model terms. The decline in p-values might also be attributed to there being so many that they are now placed in tables rather than reported in ways your text search algorithm could find! And I’ll note that so far only 2 people out of 80 or so have chosen the “You’re wrong – there hasn’t been an increase in the number of model terms” option, so I’m pretty happy to accept it as fact that the number of model terms has increased.