Before the holidays, I ran a poll asking why people’s models have gotten bigger and more complex (i.e. more terms in regressions). First it is worth noting that essentially nobody disagreed with me that models have gotten more complex. So I am taking it as a given that my original characterization is accurate: in just the last 15 years, the typical model has grown from a 2-way ANOVA (maybe with an interaction term) to one with, say, 4-8 terms (several of which may be interaction terms or random factors).
Like every topic I place under the statistical machismo header, there is no one answer. No right or wrong. Rather it is a question of trade-offs where I hope to make people pause and question conventional wisdom which seems to always and only lead to ever increasing complexity. Here I definitely hope to make people pause and think about why they are building models with 5-8 terms. (NB the following is a typically long-winded blog post for me, feel free to skip to the bold summary paragraph at the bottom).
In econometrics this issue is taught under the title “omitted variable bias” (and it is frequently taught in econometrics and often in psychology). One can mathematically prove that if you leave out a variable which is correlated with the variables you include, this will bias your estimates of the slopes for the variables you did include. The trade-off is that including more variables in a regression leads to a loss of efficiency (bigger error bars around your slope estimates). This seems then to boil down to a classic bias vs variance trade-off. I’m personally not too sold on this viewpoint for this particular problem. First, the mathematical proof has a very unrealistic assumption that there is a single definitive set of variables which alone cause the dependent variable – but this is never the real world. Second, although omission might introduce bias, there is no way to know whether it biases slopes positively or negatively, which means in practice you don’t know how they’re biased, which in a weird meta way goes back to being effectively unbiased. The whole omitted variable bias argument is pretty decisively shredded in Clarke 2005.
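The mechanics of omitted variable bias are easy to see in a small simulation. This is a minimal sketch with hypothetical numbers (none of these values come from the post): the true model is y = b1·x1 + b2·x2 + noise with x1 and x2 correlated, so regressing y on x1 alone estimates b1 + b2·Cov(x1,x2)/Var(x1) rather than b1, while including x2 recovers both slopes.

```python
# Monte Carlo sketch of omitted variable bias (illustrative, hypothetical numbers).
import random

random.seed(1)
b1, b2 = 1.0, 2.0
n = 20000

x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [0.7 * a + random.gauss(0, 0.7) for a in x1]           # Cov(x1, x2) = 0.7
y = [b1 * a + b2 * b + random.gauss(0, 1) for a, b in zip(x1, x2)]

def slope(x, y):
    """Slope of the no-intercept simple regression of y on x."""
    return sum(a * c for a, c in zip(x, y)) / sum(a * a for a in x)

def slopes2(x1, x2, y):
    """No-intercept two-predictor least squares via 2x2 normal equations."""
    s11 = sum(a * a for a in x1)
    s22 = sum(b * b for b in x2)
    s12 = sum(a * b for a, b in zip(x1, x2))
    s1y = sum(a * c for a, c in zip(x1, y))
    s2y = sum(b * c for b, c in zip(x2, y))
    det = s11 * s22 - s12 * s12
    return ((s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det)

naive = slope(x1, y)            # omits x2: biased toward b1 + b2*0.7 = 2.4
adjusted = slopes2(x1, x2, y)   # includes x2: close to (b1, b2) = (1, 2)
```

The bias here is large and positive, but that is only because the simulation chose a positive correlation and a positive b2 – with real data you rarely know either sign, which is exactly the point above.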
Most ecologists, I think, are instead coming pretty much from Hurlbert’s extremely influential paper on pseudoreplication (which got a lot of confirmation in the survey). Hurlbert introduced the idea of pseudoreplication as a problem and made two generations of ecologists live in fear of being accused of it. However, nobody seems to recall that adding more terms to a regression is NOT one of the solutions Hurlbert suggested! And I’m willing to bet it would not be his suggested solution even with modern regression tools so easily available. His primary argument is for better, more thoughtful experimental design! There is no easy post hoc fix through statistics for bad experimental design, and I sometimes think we statisticians are guilty of selling our wares by this alluring but flawed idea. Beyond careful experimental design, Hurlbert basically points out that there are two main issues with pseudoreplication: confoundment and bad degrees of freedom/p-values. Let me address each of these issues in the context of adding variables to complexify a regression to solve pseudoreplication.
1) Confoundment – Hurlbert raises the possibility that if you have only a few sites you can accidentally get some unmeasured factor that varies across your sites, leading you to mistakenly think the factor you manipulated was causing things when in fact it’s the unmeasured factor that is confounded with your sites by chance. However, and this is a really important point – Hurlbert’s solution (and that of anybody who thinks for five minutes about experimental design) is to make sure your treatment is applied within sites, not just across sites, thereby breaking confoundment. Hurlbert also goes into much more detail about the relative advantages of random vs. intentional interspersion of treatments, etc. But the key point is that confoundment is fixed through experimental design. This is harder to deal with in observational data (one of the main reasons people extol experiments as the optimal mode of inference). But in the social sciences and medicine it is very common to deal with confoundment in observational data by measuring and building in known confounding factors. Thus nearly every study controls for factors like age, race, income, education, weight, etc. by including them in the regression. For example, propensity to smoke is not independent of age, gender or income, which in turn are not independent of health, so decisive tests of the health effects of smoking need to “remove” these co-factors (by including them in the regression). Either Hurlbert’s experimental design or social science’s inclusion of co-factors makes sense to me. But in ecology, we instead tend to throw in so-called nuisance factors like site (and plot within site) and year, but this does NOT fix confoundment (and is more motivated by non-independence of errors discussed below). To me confoundment is NOT a reason for the kinds of more complex models we are seeing in ecology. If you are doing an experiment, then control confoundment in the experimental design. And if it is observational, include more direct causal factors (the analogs of age and demographics) like temperature, soil moisture, vegetation height, etc. instead of site and year nuisance factors if you are worried about the confoundment problem of pseudoreplication.
2) Bad degrees of freedom/p-values – Hurlbert’s second concern with pseudoreplication (which is totally unrelated to confoundment and is not fixable by experimental design) relates to p-values. This is because non-independence of error terms violates assumptions and essentially leads us to think we have more degrees of freedom than we really have, which, since we divide by degrees of freedom to get p-values, leads us to think our p-values are lower than they really are (i.e. p-values are wrong in the bad way – technically known as anti-conservative). This is a mathematically true statement, so the debate comes in with how worried we should be about inflated p-values. If we decide we are worried, we can just stop using p-values (recall this was Hurlbert’s recommendation, but very few remember that part of the paper!). Nor does Hurlbert imply there are larger problems than the p-value inflation (and the confoundment raised above). In fact Hurlbert says pseudoreplication without p-values can be a rational way forward.
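How badly non-independence inflates apparent precision can be shown with a quick simulation. This is a hypothetical sketch (5 sites, 20 plots per site, with a shared site-level random effect – all numbers invented for illustration): treating the 100 plots as independent gives a naive standard error far smaller than the true sampling variability of the mean, and that is precisely what makes the resulting p-values anti-conservative.

```python
# Sketch of the pseudoreplication degrees-of-freedom problem (hypothetical setup).
import random
import statistics

random.seed(2)

def one_dataset():
    """5 sites x 20 plots; plots within a site share a site-level effect."""
    obs = []
    for _ in range(5):
        site_effect = random.gauss(0, 1)      # shared within the site
        for _ in range(20):
            obs.append(site_effect + random.gauss(0, 1))
    return obs

data = one_dataset()
# Naive SE pretends all 100 observations are independent:
naive_se = statistics.stdev(data) / len(data) ** 0.5

# True sampling SD of the mean, estimated by Monte Carlo over replicate datasets:
means = [statistics.mean(one_dataset()) for _ in range(2000)]
true_sd = statistics.stdev(means)

ratio = true_sd / naive_se   # comfortably above 1: naive p-values are too small
```

With only 5 truly independent sites, the effective sample size is much closer to 5 than to 100, which is why the ratio is well above 1.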
The question of whether to report p-values or not interacts in an interesting way with one of the main results of the survey. Many people feel like having more complex models is justified because they are switching to model selection approaches (i.e. mostly AIC). This approach is advocated by two of the best and deservedly most popular ecology stats books (Crawley & Zuur et al). But I have to confess that I am uncomfortable with this approach for several reasons. First, the whole point of model selection initially (e.g. Burnham and Anderson’s book) was to move away from lame models like the null hypothesis and compete really strong models against each other, as Platt (1964 Science) recommended in his ode to strong inference. Comparing a model with and without a 5th explanatory factor does not feel like comparing strongly competing models, so it does not feel like strong inference to me. Second, model selection is a great fit for prediction because it finds the most predictive model with some penalty for complexity (recall that in the world of normally distributed errors AIC is basically n·log(SSE/n) plus 2× the number of parameters, and SSE is also the quantity behind R2 (R2 = 1 − SSE/SST), making a precise mathematical link between AIC and R2). But model selection is a really bad fit for hypothesis testing and p-values (again as anybody who has read the Burnham and Anderson book will have seen, but few follow this advice). Although I don’t go as far as Jeremy and Andrew Gelman (I think doing one or two very simple pre-planned comparisons such as with or without interaction term and then reporting a p-value is probably OK), I strongly believe that one should not do extensive model selection and then present it as a hypothesis test. While I agree with Oksanen’s great take-down of the pseudoreplication paper that argues p-values are only courtesy tools, I don’t think most people using model selection and then reporting p-values treat them that way.
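The AIC–SSE link for Gaussian errors can be made concrete. The following is a minimal sketch with invented data (a real predictor plus a pure-noise predictor): adding any term always lowers SSE, but AIC = n·ln(SSE/n) + 2k charges 2 per extra parameter, so a noise term usually loses on AIC even though it “fits better.”

```python
# Sketch of the AIC <-> SSE link for Gaussian errors (hypothetical data).
import math
import random

random.seed(3)
n = 200
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [random.gauss(0, 1) for _ in range(n)]      # pure noise predictor
y = [1.5 * a + random.gauss(0, 1) for a in x1]   # x2 plays no real role

def sse_one(x, y):
    """SSE of no-intercept least squares on one predictor."""
    b = sum(a * c for a, c in zip(x, y)) / sum(a * a for a in x)
    return sum((c - b * a) ** 2 for a, c in zip(x, y))

def sse_two(x1, x2, y):
    """SSE of no-intercept least squares on two predictors (2x2 normal equations)."""
    s11 = sum(a * a for a in x1)
    s22 = sum(b * b for b in x2)
    s12 = sum(a * b for a, b in zip(x1, x2))
    s1y = sum(a * c for a, c in zip(x1, y))
    s2y = sum(b * c for b, c in zip(x2, y))
    det = s11 * s22 - s12 * s12
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return sum((c - b1 * a - b2 * b) ** 2 for a, b, c in zip(x1, x2, y))

def aic(sse, k):
    """Gaussian-error AIC up to an additive constant."""
    return n * math.log(sse / n) + 2 * k

aic1 = aic(sse_one(x1, y), k=1)       # x1 only
aic2 = aic(sse_two(x1, x2, y), k=2)   # x1 + noise predictor: SSE drops,
                                      # but the +2 penalty usually outweighs it
```

Note this selection step says nothing about p-values: having used AIC to pick a model, the usual p-values reported for that model no longer have their nominal meaning.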
I’m fine – more than fine – with pure exploratory approaches, but I think a lot of people are noodling around with really complex models and lots of model selection and then reporting p-values like they’re valid hypothesis tests. Indeed, I have had reviewers insist I do this on papers. This strikes me as trying to have your cake and eat it too, and I think it is one of the reasons I am so uncomfortable with the increasingly complex models – because they are so highly intertwined with model selection approaches.
I do think it is important to note that, whatever the motive, there are genuine costs to building really complex models. The biggest cost is the loss of interpretability. We know exactly what a model with one explanatory factor is saying. We have a pretty good idea what a model with two factors is saying. And I even have a really precise idea what a 2-way ANOVA with an interaction term is referring to (the interaction is the non-additivity). But I have yet to see a really convincing interpretation of a model with 5 factors (other than “these are things that are important in controlling the dependent variable”, at which point you should be doing exploratory statistics). And interaction terms (often more than one these days!) are barely interpretable in the best circumstances, like when the hypothesis is explicitly about interaction. And while mixed models with random effects are a great advance, I don’t see too many people interpreting random effects in any meaningful way (e.g. variance partitioning). Moreover, the most commonly used mixed model tool – lmer – pretty much guarantees you don’t know what your p-values are (for good reasons): the most common workarounds are wrong, and often anti-conservative, to such a degree that the authors of the package refuse to provide p-values (e.g. this comment and Bolker’s comments on Wald tests). Again – if you want to do exploratory statistics, go to town and include 20 variables. But if you’re trying to interpret things in a specific context of particular factor X has an important effect, you’re making your life harder with more variables.
Another big problem with throwing lots of terms in is collinearity – the more terms you have, the more likely you are to get some highly correlated explanatory variables. And when you have highly correlated explanatory variables, the “bouncing beta problem” means you are basically losing control of the regression: depending on arbitrary properties of your particular data, the computer algorithm can assign almost all of the explanatory power (i.e. slope) to either one or the other correlated variable – in other words, if you drop even one data point the answer can completely change.
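The bouncing beta problem is easy to reproduce. Here is a small sketch with hypothetical data: two predictors that are almost copies of each other. Their sum of slopes is pinned down well by the data, but how that total gets split between the two individual coefficients is nearly arbitrary, and deleting a single observation can shift the split substantially.

```python
# Sketch of the "bouncing beta" problem under near-collinearity (hypothetical data).
import random

random.seed(4)
n = 30
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [a + random.gauss(0, 0.02) for a in x1]    # nearly a copy of x1
y = [a + b + random.gauss(0, 1) for a, b in zip(x1, x2)]   # true slopes (1, 1)

def betas(x1, x2, y):
    """No-intercept least-squares slopes via 2x2 normal equations."""
    s11 = sum(a * a for a in x1)
    s22 = sum(b * b for b in x2)
    s12 = sum(a * b for a, b in zip(x1, x2))
    s1y = sum(a * c for a, c in zip(x1, y))
    s2y = sum(b * c for b, c in zip(x2, y))
    det = s11 * s22 - s12 * s12   # tiny when x1 and x2 are nearly collinear
    return ((s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det)

full = betas(x1, x2, y)
drop1 = betas(x1[1:], x2[1:], y[1:])   # same data minus one observation

# The sum of the two slopes stays near 2 in both fits; the individual
# slopes, by contrast, are essentially unconstrained and can swing wildly.
swing = abs(full[0] - drop1[0])
```

This is the precise sense in which the regression has lost control: the well-identified quantity is the combined effect, not the coefficient on either variable.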
So, in summary, adding variables is a very weak substitute for good up-front experimental design. It might be justified when the added variables are known to be important and are used to control for confoundment in an observational context. But that’s about it. And the techniques often invoked to make complex models viable, such as random effects and model comparison, pretty much guarantee your p-values are invalid. I find it very ironic that so many people go to great lengths including nuisance terms to avoid pseudoreplication (to ensure their p-values are valid), then guarantee their p-values are invalid by using random effects and model selection. And good luck interpreting your complex model, especially when coefficients are being assigned to collinear variables arbitrarily! So to my mind complex regression models straddle the fence very uncomfortably between clean hypothesis testing contexts (X causes Y and I hypothesized it in advance) and pure exploratory approaches – this fence-sitting complex model approach to my mind has the worst of both worlds, not the best of both worlds.
To put it in blunt terms, it would appear from popular answers in the survey that many people are complexifying their models in response to Hurlbert’s issues of pseudo-replication and Burnham & Anderson’s call for model comparison but seem to forget that both of them actually call for abandoning p-values to solve these problems. And that Hurlbert’s paper was really a call for better experimental design and Burnham & Anderson’s book was a call for a return to strong inference by competing strong models against each other not tweaks on regressions. So these were both calls for clear, rigorous thinking before starting the experiment, NOT for post hoc fixes by adding terms to regression models.
So, I have to at least ask, how much of this proclivity for ever more complex models is a result of peer pressure, fear of reviewers and statistical machismo? I was a little surprised to see that no small fraction of the poll respondents acknowledged these factors directly. So I urge you to think about why you are complexifying your model. Is it an observational study (or weakly controlled experimental study) where you need to control for known major factors? Should you really switch to an exploratory framework? Are you willing to give up p-values and the hypothesis testing framing? If not, say no to statistical machismo and keep your model simple!