Ecologists vary tremendously in how they interpret ANOVAs and in whether they use model selection. This post, the first joint post from Brian, Jeremy, and me, explores that variation and, where possible, recommends best practices.
To lay out some background: simply reading the literature or attending one afternoon’s worth of talks at ESA will reveal that there is substantial variation in statistical practices, even for something as seemingly basic as an ANOVA. Here, we’ll focus first on the question of how interaction terms should be interpreted. That is, do you interpret main effects first? Not at all if the interaction is significant? Something else? Then, we’ll address a second issue: should model selection be used? That is, should you drop non-significant terms from the model and then refit the model?
As a grad student, I was taught that, if the interaction effect is significant, you stop interpreting there (that is, don’t interpret the main effects if there’s a significant interaction). This is supported by Sokal & Rohlf, who say (page 336, Biometry, 3rd edition),
In the artificial example the effects of the two factors (main effects) are not significant in any case. However, many statisticians would not even test them after finding the interaction mean square to be significant, since in such a case an overall statement for each factor would have little meaning, (e.g. Kempthorne, 1975).
That paragraph continues on in a way that suggests that, if the interaction is significant, one should not interpret main effects.
My recent thinking about this, which led to this post where I polled readers, was spurred by a paper in which a scientist I respect a lot, and who I think of as being very careful with stats, interpreted the main effects even though there was a significant interaction. This got me thinking again about variation in this practice. My first reaction was to pull Sokal & Rohlf off the shelf again, but then I decided to google the topic. I was surprised to see this post from Andrew Gelman, which starts with “We all know to look at main effects first and then look for interactions.” Hmmm. Maybe all statisticians know that, but all ecologists (myself included) certainly don’t.
This made me go pull various books off my shelf to see what they say. The Gotelli and Ellison primer says (page 332*),
It is sometimes claimed that nothing can be said about main effects when interaction terms are significant. This statement is only true for very strong interactions, in which the profile curves cross one another. However, in many cases, there may be overall trends for single factors even when the interaction term is significant.
It seems reasonable to say both that there’s an interaction between food quality and parasitism, and that there is an overall effect of parasitism on fecundity. In addition, another problem with the “only interpret main effects if the interaction is not significant” approach is that, with a large enough sample size, even a very small interaction will almost always be statistically significant.
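To make the distinction between a crossing and a non-crossing interaction concrete, here is a minimal sketch in Python. The cell means are invented for illustration (they are not data from any study mentioned here), and the names `cell_means` and `parasite_effect` are hypothetical. It computes the simple effect of parasitism at each food level, the main effect as their average, and the interaction as their difference.

```python
# Hypothetical mean fecundity in a 2 x 2 design (food quality x parasitism).
# These numbers are made up purely to illustrate the arithmetic.
cell_means = {
    ("low food", "unparasitized"): 10.0,
    ("low food", "parasitized"): 8.0,
    ("high food", "unparasitized"): 20.0,
    ("high food", "parasitized"): 12.0,
}


def parasite_effect(food):
    """Simple effect of parasitism at one food level (parasitized minus unparasitized)."""
    return cell_means[(food, "parasitized")] - cell_means[(food, "unparasitized")]


effect_low = parasite_effect("low food")
effect_high = parasite_effect("high food")

# Main effect of parasitism: the simple effects averaged over food levels.
main_effect = (effect_low + effect_high) / 2

# Interaction contrast: how much the parasitism effect changes with food quality.
interaction = effect_high - effect_low

# With one line per parasitism status plotted across food levels, the lines
# cross only if the simple effects have opposite signs.
profiles_cross = effect_low * effect_high < 0

print(f"parasitism effect at low food:  {effect_low:+.1f}")
print(f"parasitism effect at high food: {effect_high:+.1f}")
print(f"main effect of parasitism:      {main_effect:+.1f}")
print(f"interaction contrast:           {interaction:+.1f}")
print(f"profiles cross:                 {profiles_cross}")
```

In this invented example parasitism lowers fecundity at both food levels, just by different amounts, so there is an interaction and also an overall trend that can be described. If the two simple effects had opposite signs, the average would hide more than it revealed.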
At that point, I decided it would be interesting to poll our readers, though I will admit that I worried I would find that there was lots of consensus and I was the odd ecologist out. Nope. With 210 respondents, the results were:
27%: I first look at the main effects, and then at the interaction. I don’t ignore the main effects if the interaction is significant.
18%: I first look at the main effects, and then at the interaction. I only ignore the main effects if the interaction is significant and very strong (the interaction profile plots cross).
23%: I first look at the interaction and, if it’s significant, I don’t interpret any of the main effects.
22%: I decide ahead of time whether I want to look at the interaction or main effects based on the question in which I’m interested. If I’m not interested in the interactions, I don’t include that term in the model.
10%: ANOVAs are evil.
Clearly there was no problem with there being too much consensus in the poll! And I should note that the last option (“ANOVAs are evil”) was, in part, a nod to this post of Brian’s. Brian feels that we should use ANOVA much, much less often (80-90% less often), since 1) most of the time, the independent/explanatory variable could be treated as continuous (e.g., nitrogen concentration, competing species abundance or biomass) and analyzed using regression, and 2) ANOVA encourages people to focus on p instead of R² and effect size.
So, what should we do, assuming you have a good reason to use ANOVA? The best approach is likely to be a little frustrating to many, because it’s not a cut-and-dried recommendation: decide ahead of time what model makes the most sense for the question in which you’re interested. In other words, it’s not possible to make a blanket statement that one should always interpret main effects first, or that one should always look at the interaction first and only interpret main effects if the interaction is not significant. Instead, you should decide ahead of time whether you care about the interaction term and/or the main effects, include only the terms of interest in the model, and then interpret them all together. Brian thinks that most truly well-formed hypotheses (e.g. hypotheses informed by theory) lead to a question either about main effects or about interactions but not both, and that the issue of which to interpret mostly comes up when you don’t really know what you’re asking from your data. I (Meg) am not sure I agree. I often am interested in both main effects and interactions, but I suppose that one could argue that it’s in cases where there isn’t particularly well-developed theory. One note, though: if you have a significant interaction, you can’t interpret non-significant main effects either way, as they often would become significant if you were to remove the interaction (but we are NOT suggesting you drop the significant interaction term!).
So, what if you run the model that you decided on ahead of time and the interaction term is not significant? Is it okay to drop it and refit the model? This relates to the general question of model selection/simplification. I think of this as something that has become especially popular because it is recommended in Mick Crawley’s hugely popular R Book (Table 9.2 in the first edition). Model simplification was not something that I ever did, but it came into my lab with postdocs who had been trained in R using The R Book. In the poll of our readers, 47% said they use model simplification, 35% said they don’t, and 17% said they do sometimes. I said “sometimes”, because I didn’t object to model simplification when my postdocs used it in the past, given that they were able to cite the Crawley book to support that practice. Dropping a non-significant interaction term is a form of model selection, though it’s a pretty mild version of it (it only adds one more test in a two-way ANOVA). So, it’s okay to go ahead and drop the interaction term and rerun the model (even though this makes Jeremy cringe a bit). As Brian says: “ignore the people who get too uptight about the sanctity of p-values – they’re only approximate and only approximately useful anyway.”
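For readers who want to see what dropping the interaction actually does, here is a rough sketch in plain Python, assuming a balanced two-way design with fixed factors. The data are invented and the names (`data`, `f_ratio`, `full`, `reduced`) are hypothetical. In a balanced design the sums of squares for the main effects do not change when the interaction is removed. The interaction’s sum of squares and degrees of freedom are simply pooled into the residual, which changes the denominator of the F ratios.

```python
from statistics import mean

# Invented balanced data: 2 levels of factor A x 2 levels of factor B,
# 3 replicates per cell. Not from any real study.
data = {
    ("a1", "b1"): [10.0, 12.0, 11.0],
    ("a1", "b2"): [14.0, 15.0, 13.0],
    ("a2", "b1"): [16.0, 18.0, 17.0],
    ("a2", "b2"): [22.0, 20.0, 21.0],
}

a_levels = sorted({a for a, _ in data})
b_levels = sorted({b for _, b in data})
n = len(next(iter(data.values())))  # replicates per cell (balance is assumed)

grand = mean(y for ys in data.values() for y in ys)
a_mean = {a: mean(y for (ai, _), ys in data.items() if ai == a for y in ys)
          for a in a_levels}
b_mean = {b: mean(y for (_, bj), ys in data.items() if bj == b for y in ys)
          for b in b_levels}
cell_mean = {key: mean(ys) for key, ys in data.items()}

# Sums of squares for a balanced two-way layout.
ss_a = n * len(b_levels) * sum((a_mean[a] - grand) ** 2 for a in a_levels)
ss_b = n * len(a_levels) * sum((b_mean[b] - grand) ** 2 for b in b_levels)
ss_ab = n * sum((cell_mean[(a, b)] - a_mean[a] - b_mean[b] + grand) ** 2
                for a in a_levels for b in b_levels)
ss_err = sum((y - cell_mean[key]) ** 2 for key, ys in data.items() for y in ys)

df_a = len(a_levels) - 1
df_b = len(b_levels) - 1
df_ab = df_a * df_b
df_err = len(a_levels) * len(b_levels) * (n - 1)


def f_ratio(ss, df, ss_resid, df_resid):
    """F statistic: mean square of the term over the residual mean square."""
    return (ss / df) / (ss_resid / df_resid)


# Full model: A + B + A:B.
full = {
    "A": f_ratio(ss_a, df_a, ss_err, df_err),
    "B": f_ratio(ss_b, df_b, ss_err, df_err),
    "A:B": f_ratio(ss_ab, df_ab, ss_err, df_err),
}

# Reduced model: A + B, with the interaction pooled into the residual.
ss_resid = ss_err + ss_ab
df_resid = df_err + df_ab
reduced = {
    "A": f_ratio(ss_a, df_a, ss_resid, df_resid),
    "B": f_ratio(ss_b, df_b, ss_resid, df_resid),
}

print("full model F ratios:   ", {k: round(v, 2) for k, v in full.items()})
print("reduced model F ratios:", {k: round(v, 2) for k, v in reduced.items()})
```

The sketch stops at F ratios because turning them into p-values needs the F distribution, which a statistics package would supply. With unbalanced data the main-effect sums of squares also depend on which terms are in the model and on the type of sums of squares used, so this pooling shortcut does not carry over.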
What about more involved model selection (stepwise regression)? Quoting Andrew Gelman again,
Stepwise regression is one of these things, like outlier detection and pie charts, which appear to be popular among non-statisticians but are considered by statisticians to be a bit of a joke. For example, Jennifer and I don’t mention stepwise regression in our book, not even once.
This one seems to be an issue where statisticians are in agreement that it should not be done, and Jeremy has cautioned against it in the past. Why? Quoting from the Gotelli and Ellison book (page 283):
The problem here is that coefficients – and their statistical significance – depend on which other variables are included in the model. There is no guarantee that the reduced model containing only the significant regression coefficients is necessarily the best model. In fact, some of the coefficients in the reduced model may no longer be statistically significant, especially if there is multicollinearity among the variables.
Want more reasons why? See also here for an accessible blog post with R code, here for a good overview, and here for a review in the context of ecology. The basic intuition is that it’s circular reasoning. You’re using the data to tell you what hypothesis to test in the first place, and then testing that hypothesis on the same data, which greatly inflates your actual type I error rate over the nominal level. Note, though, that our recommendation not to use stepwise model selection is only referring to cases where you use stepwise regression for hypothesis testing. As Brian has written about, it’s perfectly fine to use it for exploratory analyses. Jeremy’s advice – linked at the beginning of this paragraph – discusses this as well. Though Jeremy adds that if you want to do this, you should think about why you’re doing it.**
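The circularity can be illustrated with a small simulation, sketched here in plain Python under stated assumptions. This is not full stepwise regression; it is a deliberately simplified one-step selection in which the response and ten candidate predictors are all pure noise, the predictor most strongly correlated with the response is picked, and that predictor is then tested on the same data. The constants and names (`N_OBS`, `N_PREDICTORS`, `T_CRIT`, `simulate`) are choices made for this illustration, and the critical t value is taken from standard tables.

```python
import math
import random

random.seed(1)

N_OBS = 30          # observations per simulated data set
N_PREDICTORS = 10   # candidate predictors, all unrelated to the response
N_SIMS = 2000       # number of simulated data sets

# Two-sided 5% critical value of Student's t with N_OBS - 2 = 28 df
# (from standard tables), converted to a critical correlation.
T_CRIT = 2.048
R_CRIT = T_CRIT / math.sqrt(T_CRIT ** 2 + (N_OBS - 2))


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def simulate():
    """Return (false positive rate for a pre-specified predictor,
    false positive rate for the predictor selected using the data)."""
    prespecified_hits = 0
    selected_hits = 0
    for _ in range(N_SIMS):
        y = [random.gauss(0, 1) for _ in range(N_OBS)]
        xs = [[random.gauss(0, 1) for _ in range(N_OBS)]
              for _ in range(N_PREDICTORS)]
        abs_r = [abs(pearson_r(x, y)) for x in xs]
        # Hypothesis chosen before seeing the data: always test predictor 0.
        if abs_r[0] > R_CRIT:
            prespecified_hits += 1
        # Hypothesis chosen by the data: test whichever predictor looks best.
        if max(abs_r) > R_CRIT:
            selected_hits += 1
    return prespecified_hits / N_SIMS, selected_hits / N_SIMS


prespecified_rate, selected_rate = simulate()
print(f"pre-specified predictor: {prespecified_rate:.3f}")
print(f"data-selected predictor: {selected_rate:.3f}")
```

With independent noise predictors, the pre-specified test should reject at about the nominal 5% rate, while the data-selected test should reject at roughly 1 - 0.95^10, or about 40%. Real stepwise procedures differ in detail, but the same mechanism is at work.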
And, overall, that is the main message of this post: that it’s important to think through the analyses you want to do ahead of time, and think carefully about what terms you’re interested in. Construct a model that includes those terms. If that model includes an interaction term that ends up being non-significant, it’s fine to drop that term from the model and rerun the analysis. But please don’t go whole hog on stepwise model selection.
* Apparently, interaction effects are discussed on page 33X in stats books (n = 2).
** More thoughts from Jeremy on the topic: What do you hope to gain by sequentially dropping terms and then recalculating the model? You may think you’re getting better estimates of the remaining terms (less biased and/or more precise), but you almost certainly are not—indeed often just the opposite. And if what you want is just some broad sense of which predictor variables are “most important”, well, I’d say you’re probably better off getting that from the full model. In my anecdotal experience, people often do model selection without good reason, as if having non-significant terms in your model is somehow “bad”, or as if you’ve somehow failed if you haven’t selected “the” “best” model. And even if you do have a good reason to do model selection, I think there are other ways to do it that are just as easy to understand and implement as stepwise procedures, but that perform better. Just my two cents.