There is tremendous variation in ecology in how ANOVAs are interpreted, and in whether model selection is used. This post, which is the first attempt by Brian, Jeremy, and me at a joint post, is aimed at exploring that variation and, where possible, making recommendations for best practices.

To lay out some background: simply reading the literature or attending one afternoon’s worth of talks at ESA will reveal that there is substantial variation in statistical practices, even for something as seemingly basic as an ANOVA. Here, we’ll focus first on the question of how interaction terms should be interpreted. That is, do you interpret main effects first? Not at all if the interaction is significant? Something else? Then, we’ll address a second issue: should model selection be used? That is, should you drop non-significant terms from the model and then refit the model?

As a grad student, I was taught that, if the interaction effect is significant, you stop interpreting there (that is, don’t interpret the main effects if there’s a significant interaction). This is supported by Sokal & Rohlf, who say (page 336, Biometry, 3rd edition),

In the artificial example the effects of the two factors (main effects) are not significant in any case. However, many statisticians would not even test them after finding the interaction mean square to be significant, since in such a case an overall statement for each factor would have little meaning, (e.g. Kempthorne, 1975).

That paragraph continues on in a way that suggests that, if the interaction is significant, one should not interpret main effects.

My thinking about this recently, which led to this post where I polled readers, was spurred by reading a paper where a scientist I respect a lot and who I think of as being very careful with stats interpreted the main effects even though there was a significant interaction. This got me thinking again about variation in this practice. My first reaction was to pull Sokal & Rohlf off the shelf again, but then I decided to google the topic. I was surprised to see this post from Andrew Gelman, which starts with “We all know to look at main effects first and then look for interactions.” Hmmm. Maybe all statisticians know that, but all ecologists (myself included) certainly don’t.

This made me go pull various books off my shelf to see what they say. The Gotelli and Ellison primer says (page 332*),

It is sometimes claimed that nothing can be said about main effects when interaction terms are significant. This statement is only true for very strong interactions, in which the profile curves cross one another. However, in many cases, there may be overall trends for single factors even when the interaction term is significant.

That makes intuitive sense to me, and I have to say that the blanket “don’t interpret main effects when the interaction is significant” has always bothered me a bit. If your data look like this:

it seems reasonable to say both that there’s an interaction between food quality and parasitism, and that there is an overall effect of parasitism on fecundity. In addition, another problem with the “only interpret main effects if the interaction is not significant” approach is that, with an infinitely large sample size, there will surely always be a statistically significant interaction.
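To make the distinction concrete, here is a minimal sketch with made-up cell means (the numbers and labels are purely hypothetical, not data from any study). It separates the overall effect of parasitism from the interaction contrast and checks whether the profile lines would cross.

```python
# Hypothetical cell means for fecundity; all numbers are invented for illustration.
cell_means = {
    ("low food", "uninfected"): 10.0,
    ("low food", "infected"): 8.0,
    ("high food", "uninfected"): 20.0,
    ("high food", "infected"): 12.0,
}


def parasite_effect(food):
    """Infected minus uninfected mean fecundity at one food level."""
    return cell_means[(food, "infected")] - cell_means[(food, "uninfected")]


effect_low = parasite_effect("low food")
effect_high = parasite_effect("high food")

# Main effect of parasitism: the simple effects averaged over food levels.
main_effect = (effect_low + effect_high) / 2
# Interaction contrast: how much the parasite effect changes with food quality.
interaction = effect_high - effect_low
# The two profile lines cross only if the simple effects have opposite signs.
lines_cross = effect_low * effect_high < 0

print(main_effect, interaction, lines_cross)
```

With these numbers parasitism lowers fecundity at both food levels, so there is an interaction (the reduction is larger on high-quality food) and also an overall parasite effect, and the lines do not cross.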

At that point, I decided it would be interesting to poll our readers, though I will admit that I worried I would find that there was lots of consensus and I was the odd ecologist out. Nope. With 210 respondents, the results were:

27%: I first look at the main effects, and then at the interaction. I don’t ignore the main effects if the interaction is significant.

18%: I first look at the main effects, and then at the interaction. I only ignore the main effects if the interaction is significant and very strong (the interaction profile plots cross).

23%: I first look at the interaction and, if it’s significant, I don’t interpret any of the main effects.

22%: I decide ahead of time whether I want to look at the interaction or main effects based on the question in which I’m interested. If I’m not interested in the interactions, I don’t include that term in the model.

10%: ANOVAs are evil.

Clearly there was no problem with there being too much consensus in the poll! And I should note that the last option (“ANOVAs are evil”) was, in part, a nod to this post of Brian’s. Brian feels that we should use ANOVA much, much less often (80-90% less often), since 1) most of the time, the independent/explanatory variable could be treated as continuous (e.g., nitrogen concentration, competing species abundance or biomass) and analyzed using regression, and 2) ANOVA encourages people to focus on p instead of R² and effect size.

So, what should we do, assuming you have a good reason to use ANOVA? The best approach is one that is likely to be a little frustrating to many, because it’s not a cut-and-dried recommendation. The best option is to decide ahead of time what model makes the most sense for the question in which you’re interested. In other words, it’s not possible to make a blanket statement that one should always interpret main effects first, or that one should always look at the interaction first and only interpret main effects if the interaction is not significant. Instead, **you should decide ahead of time whether you care about the interaction term, the main effects, or both; include only those terms in the model; then interpret them all together**. Brian thinks that most truly well-formed hypotheses (e.g. hypotheses informed by theory) lead to a question either about main effects or about interactions but not both, and that the issue of which to interpret mostly comes up when you don’t really know what you’re asking from your data. I (Meg) am not sure I agree. I often am interested in both main effects and interactions, but I suppose that one could argue that it’s in cases where there isn’t particularly well-developed theory. One note, though, is that, **if you have a significant interaction, you can’t interpret non-significant main effects either way**, as they often would become significant if you were to remove the interaction (but we are **NOT** suggesting you drop the significant interaction term!).

So, what if you run the model that you decided on ahead of time and the interaction term is not significant? Is it okay to drop it and refit the model? This relates to the general question of model selection/simplification. I think of this as something that has become especially popular because it is recommended in Mick Crawley’s R Book, which is hugely popular, and which recommends this approach in Table 9.2 in the first edition. Model simplification was not something that I ever did, but it came into my lab with postdocs who had been trained in R and using The R Book. In the poll of our readers, 47% said they use model simplification, 35% said they don’t, and 17% said they do sometimes. I said “sometimes”, because I didn’t object to model simplification when my postdocs used it in the past, since they were able to cite the Crawley book to support that practice. Dropping a non-significant interaction term is a form of model selection, though it’s a pretty mild version of it (it only adds one more test in a two-way ANOVA). So, **it’s okay to go ahead and drop the interaction term and rerun the model** (even though this makes Jeremy cringe a bit). As Brian says: “ignore the people who get too uptight about the sanctity of p-values – they’re only approximate and only approximately useful anyway.”
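As a rough illustration of what “drop the interaction and rerun” does, here is a minimal sketch for a balanced two-way design with fixed effects. The data set is hypothetical, and the function name `two_way_anova` is just a label for this example. In a balanced design, refitting without the interaction leaves the main-effect sums of squares unchanged and pools the interaction sum of squares and its degrees of freedom into the error term. The sketch reports F ratios only; turning them into p-values would need an F distribution (for example `scipy.stats.f.sf`, not used here).

```python
from statistics import mean

# Hypothetical balanced 2x2 data set: (food quality, parasitism) -> replicates.
data = {
    ("low", "uninfected"): [9.0, 10.0, 11.0],
    ("low", "infected"): [7.0, 8.0, 9.0],
    ("high", "uninfected"): [14.0, 15.0, 16.0],
    ("high", "infected"): [11.0, 12.0, 13.0],
}


def two_way_anova(data):
    """Sums of squares and degrees of freedom for a balanced two-way design."""
    a_levels = sorted({a for a, _ in data})
    b_levels = sorted({b for _, b in data})
    n = len(next(iter(data.values())))  # replicates per cell (assumed equal)
    cell = {k: mean(v) for k, v in data.items()}
    grand = mean(cell.values())  # valid because the design is balanced
    a_mean = {a: mean(cell[(a, b)] for b in b_levels) for a in a_levels}
    b_mean = {b: mean(cell[(a, b)] for a in a_levels) for b in b_levels}
    ss = {
        "A": n * len(b_levels) * sum((a_mean[a] - grand) ** 2 for a in a_levels),
        "B": n * len(a_levels) * sum((b_mean[b] - grand) ** 2 for b in b_levels),
        "AxB": n * sum(
            (cell[(a, b)] - a_mean[a] - b_mean[b] + grand) ** 2
            for a in a_levels
            for b in b_levels
        ),
        "error": sum((y - cell[k]) ** 2 for k, ys in data.items() for y in ys),
    }
    df = {
        "A": len(a_levels) - 1,
        "B": len(b_levels) - 1,
        "AxB": (len(a_levels) - 1) * (len(b_levels) - 1),
        "error": len(a_levels) * len(b_levels) * (n - 1),
    }
    return ss, df


ss, df = two_way_anova(data)

# Full model: every term is tested against the within-cell error.
mse_full = ss["error"] / df["error"]
f_full = {t: (ss[t] / df[t]) / mse_full for t in ("A", "B", "AxB")}

# Additive model: the interaction SS and df are pooled into the error term.
mse_reduced = (ss["error"] + ss["AxB"]) / (df["error"] + df["AxB"])
f_reduced = {t: (ss[t] / df[t]) / mse_reduced for t in ("A", "B")}

print("full model F ratios:   ", f_full)
print("reduced model F ratios:", f_reduced)
```

Here the interaction F ratio is below 1, and pooling it changes the main-effect F ratios only slightly, which is the sense in which this is a mild form of model selection. With unbalanced data the sums of squares depend on the order and type of test, so this shortcut would not apply.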

What about more involved model selection (stepwise regression)? Quoting Andrew Gelman again,

Stepwise regression is one of these things, like outlier detection and pie charts, which appear to be popular among non-statisticians but are considered by statisticians to be a bit of a joke. For example, Jennifer and I don’t mention stepwise regression in our book, not even once.

This one seems to be an issue where statisticians are in agreement that it should not be done, and Jeremy has cautioned against it in the past. Why? Quoting from the Gotelli and Ellison book (page 283):

The problem here is that coefficients – and their statistical significance – depend on which other variables are included in the model. There is no guarantee that the reduced model containing only the significant regression coefficients is necessarily the best model. In fact, some of the coefficients in the reduced model may no longer be statistically significant, especially if there is multicollinearity among the variables.

Want more reasons why? See also here for an accessible blog post with R code, here for a good overview, and here for a review in the context of ecology. The basic intuition is that it’s circular reasoning. You’re using the data to tell you what hypothesis to test in the first place, and then testing that hypothesis on the same data, which greatly inflates your actual type I error rate over the nominal level. Note, though, that **our recommendation not to use stepwise model selection refers only to cases where you use stepwise regression for hypothesis testing**. As Brian has written about, it’s perfectly fine to use it for exploratory analyses. Jeremy’s advice – linked at the beginning of this paragraph – discusses this as well. Though Jeremy adds that if you want to do this, you should think about why you’re doing it.**
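The circularity can be seen in a small simulation. This is a simplified stand-in for stepwise selection, not a full stepwise procedure: the response and all candidate predictors are pure noise, and the sketch compares testing one predictor chosen in advance with testing whichever predictor looks best in the same data. The sample size, number of predictors, and seed are arbitrary assumptions. The critical value 2.048 is the two-sided 5% point of a t distribution with 28 degrees of freedom, which matches n = 30.

```python
import math
import random

T_CRIT = 2.048  # two-sided 5% critical value of t with 28 df (n = 30)


def abs_t_stat(x, y):
    """Absolute t statistic for the slope of a simple regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    r = sxy / math.sqrt(sxx * syy)
    return abs(r) * math.sqrt((n - 2) / (1 - r * r))


def simulate(n_sims=1000, n=30, k=10, seed=1):
    """False-positive rates when every predictor is unrelated to the response."""
    rng = random.Random(seed)
    hits_prespecified = 0
    hits_selected = 0
    for _ in range(n_sims):
        y = [rng.gauss(0, 1) for _ in range(n)]
        ts = [
            abs_t_stat([rng.gauss(0, 1) for _ in range(n)], y)
            for _ in range(k)
        ]
        # Hypothesis fixed in advance: always test the first predictor.
        hits_prespecified += int(ts[0] > T_CRIT)
        # Data-driven hypothesis: test whichever predictor looks strongest.
        hits_selected += int(max(ts) > T_CRIT)
    return hits_prespecified / n_sims, hits_selected / n_sims


rate_prespecified, rate_selected = simulate()
print("pre-specified predictor:", rate_prespecified)
print("best-of-10 predictor:   ", rate_selected)
```

The pre-specified test should reject at close to the nominal 5%. With ten independent noise predictors, the best-looking one would be expected to pass the same threshold in roughly 1 - 0.95^10, or about 40%, of data sets.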

And, overall, that is the main message of this post: that it’s important to think through the analyses you want to do ahead of time, and think carefully about what terms you’re interested in. Construct a model that includes those terms. If that model includes an interaction term that ends up being non-significant, it’s fine to drop that term from the model and rerun the analysis. But please don’t go whole hog on stepwise model selection.

* Apparently, interaction effects are discussed on page 33X in stats books (n = 2).

** More thoughts from Jeremy on the topic: What do you hope to gain by sequentially dropping terms and then recalculating the model? You may think you’re getting better estimates of the remaining terms (less biased and/or more precise), but you almost certainly are not—indeed often just the opposite. And if what you want is just some broad sense of which predictor variables are “most important”, well, I’d say you’re probably better off getting that from the full model. In my anecdotal experience, people often do model selection without good reason, as if having non-significant terms in your model is somehow “bad”, or as if you’ve somehow failed if you haven’t selected “the” “best” model. And even if you do have a good reason to do model selection, I think there are other ways to do it that are just as easy to understand and implement as stepwise procedures, but that perform better. Just my two cents.

Thanks for the poll, summary, and insights – I’m going to share this with my graduate stats-for-ecologists students. I think this will help them think carefully about their approach to research and recognize (again) that much of this practice is built on forethought. And I found it useful, too!

You write:

but ANOVA and regression are the same thing, a general linear model. Hence it would be simpler all round if we focussed on how we want to represent treatment or other variables in our model rather than concerning ourselves with artificial distinctions between methods that no longer really need concern us. The unification in the general linear model allows us to just fit a regression (linear model) and work from there.

Now, I think that this (continuous vs factor vs ordered factor representation) was the main point you were trying to make; it just struck me that we still maintain a terminology from way back in the day that no longer really applies in some (most?) software (especially R and its `lm()` function).

I just taught GLM (covering regression, ANOVA, ANCOVA) last week, so I agree it is how it should be taught. However, it is not how it IS taught (only one student in my class knew they were the same).
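The equivalence is easy to check by hand for the simplest case. The sketch below uses made-up numbers for two hypothetical groups. It fits a regression on a 0/1 dummy variable and a one-way ANOVA on the same data, then compares the two. The slope equals the difference in group means, and the ANOVA F ratio equals the square of the slope’s t statistic.

```python
import math

# Hypothetical fecundity values for two food-quality groups.
low = [8.0, 9.0, 10.0, 11.0, 12.0]
high = [13.0, 14.0, 15.0, 16.0, 17.0]

# Regression view: food quality coded as a 0/1 dummy variable.
x = [0.0] * len(low) + [1.0] * len(high)
y = low + high
n = len(y)
x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
slope = sxy / sxx
intercept = y_bar - slope * x_bar
rss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
se_slope = math.sqrt(rss / (n - 2) / sxx)
t_slope = slope / se_slope

# ANOVA view: between-group and within-group sums of squares.
mean_low = sum(low) / len(low)
mean_high = sum(high) / len(high)
ss_between = (
    len(low) * (mean_low - y_bar) ** 2 + len(high) * (mean_high - y_bar) ** 2
)
ss_within = sum((v - mean_low) ** 2 for v in low) + sum(
    (v - mean_high) ** 2 for v in high
)
f_anova = (ss_between / 1) / (ss_within / (n - 2))

print("slope:", slope, "difference in means:", mean_high - mean_low)
print("t squared:", t_slope ** 2, "ANOVA F:", f_anova)
```

With more than two levels the same idea holds with one dummy variable per extra level, though the single t statistic is then replaced by an overall F test for the set of dummies.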

Also, unless you run ANOVA using lm in R, in all likelihood you are doing different calculations (sums of squares) than you would do in GLM. And most of this software (other than lm in R) is probably not emphasizing effect sizes and R² – just a p-value, which is my biggest complaint.

But all that aside, to take Meg’s example above, putting food quality in as a binary dummy variable is much less interpretable than putting in “energetic content” or some other measure of food quality as continuous – even if it only has two levels, but once you start to think of it as continuous you probably don’t design it with two levels.

Must be the time of year to blog about P-values, model selection, etc. I just did so, from a slightly different perspective here: http://spiraclesandgills.wordpress.com/2014/09/30/still-a-place-for-p-values/

Didn’t know about the poll, but I would have slotted in with the 23% that chose “I first look at the interaction and, if it’s significant, I don’t interpret any of the main effects”. Why? Mostly because that’s how I was taught about it — thanks for this post for encouraging me to be a bit more flexible on this point.

The post reinforces my thinking on stepwise methods, which I really only “buy” when you are approaching the issue from an ANCOVA perspective, and I concur with Jeremy’s footnote on the issue.

Wait, you have a blog now? And you’re telling me by commenting on my blog? That probably says something profound about the age we live in, but I’m not sure what. 🙂

Like Twitter, it’s something I set up a long time ago and then did almost nothing with. Now that I’m fully engaged with Twitter, this just tagged along, as it were. It’s barely in existence, but I expect I’ll add to it on an irregular basis, as I find myself wanting to expand on reactions to Twitter (& other online) posts/articles.

I’m not 100% sure what it says either, but let’s go with: wow, how totally modern of us to share this news in this manner. 😉

By the way, what the heck is up with your profile picture? 🙂

I just combined two of your blog posts…we discussed this post, with everybody bringing in an example paper to compare that used a 2-way ANOVA for my lab meeting today! 🙂

Nice overview, thank you.

Comment: when I hear “stepwise model testing”, I think AIC-based selection criteria (rather than p-values). AIC-based model selection seems (thankfully) to be growing more common in Ecol & Evol.

R’s step() function provides very nice functionality for a coherent “sweep” of predictor space. With a large data set, for example, I might use forward model selection with a BIC criterion to favor a more parsimonious model, especially when my goal is to identify generalities rather than build the ideal predictive model.

Re: AIC, see this post:

https://dynamicecology.wordpress.com/2015/05/21/why-aic-appeals-to-ecologists-lowest-instincts/

Jeremy, thanks for that.

Just to clarify:

* using parameter estimate p-values for “stepwise model selection” is rarely a good idea, since standard errors and thus p-values are conditional on the overall model structure (i.e., inclusion or exclusion of other predictors)

* Using deltaAIC for “stepwise model selection” should not be applied blindly:

— use putative predictors, rather than all possible covariates

— pick a goal (parsimony, predictive power), and justify a model choice accordingly (“we use forward model selection with an AIC criteria, and pick the most parsimonious model with deltaAIC > 2 “).

Does this make sense?
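For readers who have not computed it by hand, here is a minimal sketch of an AIC comparison between two nested least-squares models. The residual sums of squares, sample size, and coefficient counts are hypothetical, and the helper `aic_least_squares` is a name invented for this example. It uses the usual Gaussian form of AIC up to an additive constant, n·ln(RSS/n) + 2k, where k counts the regression coefficients plus the residual variance.

```python
import math


def aic_least_squares(rss, n, n_coef):
    """AIC, up to an additive constant, for a Gaussian least-squares fit."""
    k = n_coef + 1  # regression coefficients plus the residual variance
    return n * math.log(rss / n) + 2 * k


n = 12  # hypothetical sample size

# Hypothetical fits: an additive model and the same model plus an interaction.
aic_additive = aic_least_squares(rss=8.75, n=n, n_coef=3)
aic_interaction = aic_least_squares(rss=8.0, n=n, n_coef=4)

delta_aic = aic_interaction - aic_additive
print("AIC additive:   ", round(aic_additive, 3))
print("AIC interaction:", round(aic_interaction, 3))
print("delta AIC:      ", round(delta_aic, 3))
```

In this made-up case the extra term lowers the residual sum of squares a little but not enough to offset the penalty of 2 per parameter, so the additive model has the lower AIC. The difference is under the commonly quoted rule-of-thumb threshold of 2, so under that convention the two models would be treated as having similar support, and a parsimony goal would favor the additive one.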

I would also add that even in the case of a strong interaction where your lines cross, in some cases you may still want to interpret the main effects and that can be perfectly valid, depending on the circumstances. For example, in an ANCOVA you can run a Johnson-Neyman test to determine the significance region. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3682820/