In ecology, it’s common to manipulate two factors at once – say, nutrient levels and herbivory. The standard way to analyze such a design is with a two-way ANOVA. What I’m interested in knowing (in part as the basis for a future post) is how you interpret the results.

And, while we’re on the topic of ANOVAs, another thing that varies is whether people use stepwise removal of model terms. I think of this as something that has become more common as a result of Crawley’s R book. Do you use stepwise removal of model terms?

I hope it doesn’t affect your poll (well, it affected my answer in a sense), but this was doing the rounds on twittersphere recently:

http://onlinelibrary.wiley.com/doi/10.1111/j.1365-2656.2006.01141.x/abstract

We were way ahead of the twittersphere on this one:

https://dynamicecology.wordpress.com/2011/06/08/advice-tips-for-talks-and-stats/ (see statistical tips 2-3) 🙂

(as were others, of course–these points are hardly original to me, though I agree that they’re not as widely known as they should be).

There’s also this comment from Andrew Gelman: “Stepwise regression is one of those things, like outlier detection and pie charts, which appear to be popular among non-statisticians but are considered by statisticians to be a bit of a joke.” (http://andrewgelman.com/2014/06/02/hate-stepwise-regression/)

Thank you! I hadn’t seen that going around.

Hey there,

during my BSc thesis last year I became a big fan of the multimodel inference approach (http://link.springer.com/article/10.1007/s00265-010-1029-6) for testing multiple macroecological hypotheses (i.e. in my case structural effects on the distribution-abundance hypothesis for herbivorous insects). The stepwise approach appeared to me kind of arbitrary, but maybe that’s more of a philosophical question on depicting “truth”.

I never saw any downsides of that approach, but maybe that’s due to my non-statistician background. What’s your opinion on that?

Greetings from Germany.
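For readers who haven't met multimodel inference, the core calculation is small enough to sketch. This assumes the AIC-based flavour of the approach; the model names and AIC scores are made up for illustration, and `akaike_weights` is a hypothetical helper rather than a function from any package.

```python
import math


def akaike_weights(aic_values):
    """Convert a list of AIC scores into Akaike weights that sum to 1."""
    best = min(aic_values)
    # Delta-AIC: how far each model sits from the best-supported one.
    deltas = [aic - best for aic in aic_values]
    # Relative likelihood of each model given the data.
    rel_likelihoods = [math.exp(-0.5 * d) for d in deltas]
    total = sum(rel_likelihoods)
    return [r / total for r in rel_likelihoods]


# Hypothetical AIC scores for three candidate models (invented numbers).
aics = {"model_a": 102.3, "model_b": 100.1, "model_c": 108.9}
weights = akaike_weights(list(aics.values()))
for name, w in zip(aics, weights):
    print(f"{name}: Akaike weight = {w:.3f}")
```

The weights are then read as the relative support for each model within the candidate set, which is why the choice of that set matters so much.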

Thanks for the links, Jeremy! They’re going straight in my “advice for PG students” file!

None of the above. First, I look at the interactions. If they are significant, but I’m still interested, a priori, in main effects, I separate the data and test for the relevant main effects in the subsets of the data separately, with appropriate control for multiple comparisons.
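A rough sketch of that workflow, for concreteness: split the data by one factor, test the other factor within each subset, and tighten alpha for the number of tests. The comment doesn't say which test is used within each subset, so the exact permutation test here is my substitution, chosen because it needs nothing beyond the Python standard library. The biomass numbers and the helper `exact_permutation_p` are invented for illustration.

```python
from itertools import combinations
from statistics import mean


def exact_permutation_p(group_a, group_b):
    """Two-sided exact permutation p-value for a difference in group means."""
    pooled = group_a + group_b
    n_a, n_b = len(group_a), len(group_b)
    observed = abs(mean(group_a) - mean(group_b))
    pooled_sum = sum(pooled)
    hits = total = 0
    # Enumerate every way of relabelling n_a of the pooled values as group A.
    for idx in combinations(range(len(pooled)), n_a):
        a_sum = sum(pooled[i] for i in idx)
        diff = abs(a_sum / n_a - (pooled_sum - a_sum) / n_b)
        # Small tolerance so floating-point ties count as "at least as extreme".
        if diff >= observed - 1e-12:
            hits += 1
        total += 1
    return hits / total


# Made-up biomass data: (low nutrient, high nutrient) at each herbivory level.
subsets = {
    "herbivores absent": ([4.1, 3.8, 4.4, 4.0, 3.9], [6.2, 5.9, 6.5, 6.1, 6.3]),
    "herbivores present": ([3.9, 4.2, 3.7, 4.1, 4.0], [4.0, 4.3, 3.8, 4.2, 3.9]),
}

# Bonferroni control: divide the nominal alpha by the number of subset tests.
alpha = 0.05 / len(subsets)
p_values = {name: exact_permutation_p(low, high)
            for name, (low, high) in subsets.items()}
for name, p in p_values.items():
    verdict = "significant" if p < alpha else "not significant"
    print(f"{name}: p = {p:.4f} ({verdict} at alpha = {alpha})")
```

Bonferroni is only one of several reasonable corrections; it is used here because it is the simplest to show.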

This old post is relevant too: https://dynamicecology.wordpress.com/2014/02/10/beating-model-selection-bias-by-bootstrapping-the-model-selection-process/

Isn’t an ANOVA (one-way, two-way…n-way) just another linear model with a normal distributional assumption? What about ANCOVA? If you specify a model with the form of a “two-way ANOVA” (i.e., write out the linear model with its predictor variables and associated beta coefficients) and change the distributional assumption to something like a Gamma distribution, aren’t the statistical hypotheses much the same? I can’t see how an ANOVA-class model is any more “evil” than a GLM based on a full factorial study design. The included model terms and distributional assumptions need to be justified in either case.
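To make the "ANOVA is a linear model" point concrete, here is a minimal sketch with invented, unbalanced 2x2 data and 0/1 (treatment) coding, which is an assumption on my part since the comment doesn't specify a coding. It writes the four beta coefficients in terms of cell means, then checks that the residuals are orthogonal to every column of the design matrix, which is the defining condition of a least-squares fit.

```python
# Made-up responses keyed by (nutrient, herbivory), each coded 0 or 1.
cells = {
    (0, 0): [4.0, 4.4, 3.9],
    (1, 0): [6.1, 5.8],
    (0, 1): [3.1, 2.9, 3.3, 3.0],
    (1, 1): [3.6, 3.9, 3.4],
}

cell_mean = {key: sum(vals) / len(vals) for key, vals in cells.items()}

# Under treatment coding the coefficients are simple functions of cell means.
beta = [
    cell_mean[0, 0],                    # intercept: reference cell
    cell_mean[1, 0] - cell_mean[0, 0],  # nutrient effect at herbivory = 0
    cell_mean[0, 1] - cell_mean[0, 0],  # herbivory effect at nutrient = 0
    cell_mean[1, 1] - cell_mean[1, 0] - cell_mean[0, 1] + cell_mean[0, 0],  # interaction
]

# Design matrix columns: intercept, nutrient, herbivory, nutrient x herbivory.
rows, y = [], []
for (a, b), values in cells.items():
    for v in values:
        rows.append([1, a, b, a * b])
        y.append(v)

fitted = [sum(x * coef for x, coef in zip(row, beta)) for row in rows]
residuals = [obs - fit for obs, fit in zip(y, fitted)]

# Normal equations: X'(y - X beta) should be zero in every component.
normal_eq = [sum(row[j] * r for row, r in zip(rows, residuals)) for j in range(4)]
print("beta:", [round(b, 3) for b in beta])
print("X'residuals:", [round(v, 10) for v in normal_eq])
```

As far as I understand it, swapping the normal error assumption for a Gamma one changes how the coefficients are estimated (and, with a non-identity link, the scale they live on), but not the layout of this design matrix.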

One more thought: I like the poll, but I think the responses are going to be weighted by voters’ experiences with stats software. Correct me if I’m wrong, but SAS’s “type III” sum-of-squares estimates interaction effects in the absence of main effects, unlike the sequential sum of squares that is the default in R. SAS’s method may not be a problem for orthogonal designs or when main effects are not of interest (which can be the case!), but I’ve noticed that SAS users tend to see interaction effects in a different light than some R users, especially with the least-square means comparisons available in SAS.

“Isn’t an ANOVA (one-way, two-way…n-way) just another linear model with a normal distributional assumption?”

Yes, but the question about the interpretation of interaction terms is one that many (most?) ecologists first encounter when they’re taught ANOVA (as opposed to more general cases like GLMs, GzLMs, etc.). Framing the question in the context of ANOVA keeps things simple and familiar.

You could well be right that voters’ experiences with SAS vs. base R* (which do indeed have different SS defaults) will shape their opinions. Which is a slightly depressing thought–that ecologists might tend to assume that the “right” approach (to anything) is “whatever is the default in whatever stats package I happen to use”.

*pedantic footnote: some R packages have different defaults than base R. For instance, the lmPerm package, for doing permutation tests on linear models, defaults to Type III SS. Presumably just to keep users on their toes. 🙂

Point taken on your first point.

“You could well be right that voters’ experiences with SAS vs. base R* (which do indeed have different SS defaults) will shape their opinions. Which is a slightly depressing thought–that ecologists might tend to assume that the “right” approach (to anything) is “whatever is the default in whatever stats package I happen to use”.”

I agree wholeheartedly. I use R but have to mimic the type III SS results that SAS produces for other people’s satisfaction. I’ve learned a lot about linear models in learning about SS and contrast matrices (you CAN make your own!), and now have the freedom to do what I want in R.

I’m eager to see where the polling results will take future discussions here.
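For anyone wondering what "making your own contrast matrix" involves, here is a small sketch of the idea in plain Python (not R's own interface). It assumes a balanced design and two orthogonal, sum-to-zero contrasts, in which case each regression coefficient is just a ratio of dot products. The data, level names, and contrast labels are invented.

```python
# Made-up balanced data for a three-level nutrient factor.
groups = {
    "control": [10.0, 11.0, 9.0, 10.0],
    "low": [12.0, 13.0, 11.0, 12.0],
    "high": [15.0, 16.0, 14.0, 15.0],
}

# Hand-made contrast matrix: one entry per level, in the order
# (control, low, high). Each contrast sums to zero.
contrast_matrix = {
    "low vs control": [-1, 1, 0],
    "high vs mean of others": [-1, -1, 2],
}

# Expand the level-wise codes into one design column per contrast.
y = []
columns = {name: [] for name in contrast_matrix}
for level_index, values in enumerate(groups.values()):
    for v in values:
        y.append(v)
        for name, codes in contrast_matrix.items():
            columns[name].append(codes[level_index])


def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


# The shortcut below is only valid if the columns are orthogonal to each
# other and to the intercept, so check that before using it.
names = list(columns)
assert dot(columns[names[0]], columns[names[1]]) == 0
assert all(sum(col) == 0 for col in columns.values())

intercept = sum(y) / len(y)  # grand mean
coefs = {name: dot(col, y) / dot(col, col) for name, col in columns.items()}
print(f"intercept (grand mean): {intercept:.3f}")
for name, value in coefs.items():
    print(f"{name}: coefficient = {value:.3f}")
```

With these codes the first coefficient is half the low-minus-control difference, and the second is a third of the difference between the high mean and the average of the other two, so the scaling of a contrast affects how its coefficient reads. With unbalanced data or non-orthogonal contrasts the dot-product shortcut no longer applies and a full least-squares fit is needed.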

The problem is that the classical ANOVA only works well for very simple experimental designs. Throw in any complexity (e.g., unbalanced sampling, nesting, crossing) and there is no good consensus on how to proceed, aside from scrapping the ANOVA approach altogether.

A perfect illustration is the confusion over the sum-of-squares options for 2-way ANOVAs with unbalanced designs. In the presence of a significant interaction, none of those options provides a clear interpretation of results anyway, which makes the software differences moot.
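The source of that confusion is easy to show numerically. The sketch below uses invented, unbalanced 2x2 data and a main-effects-only model, and computes the sum of squares for factor A twice: entered first (sequential, as in Type I) and entered after B. `solve_linear` and `rss` are hypothetical helpers written for this example, not functions from any stats package.

```python
def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * p for x, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def rss(columns, y):
    """Residual sum of squares after least-squares regression of y on columns."""
    k = len(columns)
    xtx = [[sum(u * v for u, v in zip(columns[i], columns[j])) for j in range(k)]
           for i in range(k)]
    xty = [sum(u * v for u, v in zip(columns[i], y)) for i in range(k)]
    beta = solve_linear(xtx, xty)
    fitted = [sum(beta[j] * columns[j][i] for j in range(k)) for i in range(len(y))]
    return sum((obs - fit) ** 2 for obs, fit in zip(y, fitted))


# Made-up unbalanced data keyed by (A, B), each coded 0 or 1.
cells = {
    (0, 0): [4.0, 4.4, 3.9, 4.1],
    (1, 0): [6.1, 5.8],
    (0, 1): [3.1, 2.9],
    (1, 1): [3.6, 3.9, 3.4, 3.5],
}
ones, a_col, b_col, y = [], [], [], []
for (a, b), values in cells.items():
    for v in values:
        ones.append(1.0)
        a_col.append(float(a))
        b_col.append(float(b))
        y.append(v)

# SS for A when it enters the model first (sequential).
ss_a_first = rss([ones], y) - rss([ones, a_col], y)
# SS for A when it enters after B (A adjusted for B).
ss_a_after_b = rss([ones, b_col], y) - rss([ones, b_col, a_col], y)
print(f"SS(A), A entered first:   {ss_a_first:.4f}")
print(f"SS(A | B), A entered last: {ss_a_after_b:.4f}")
```

The two numbers differ because A and B are correlated when cell counts are unequal; in a balanced design they would match. The A-adjusted-for-B quantity is what the Type II approach reports for A, while Type III additionally adjusts for the interaction, which this sketch leaves out.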

Andrew Gelman has a great paper illustrating how a hierarchical regression approach leads to a more comprehensive and straightforward analysis of the data, especially when there is complexity (which is typically the case for any field ecologist). Many concepts behind an ANOVA are still important (hence the title), but there are better procedures for fitting models and interpreting/presenting results. And it’s this last part that is particularly important – the output from an ANOVA does not typically address the interesting biological questions (which I think Brian has emphasized in the past).

What I like about SAS is that you can easily choose what type of SS to use and you can construct your own contrast matrices. My students use R and, like SYSTAT before it, it seems to dictate what can and cannot be done (or at least cannot be done without a lot of rigamarole). Of course, that perception probably has something to do with the fact that I’ve been using SAS for 38 years (starting with punch cards that, as an undergrad and, later, grad student, I walked across campus to the computer center) and so know a lot of its ins and outs, which I don’t for R.

You can specify your own contrasts with R too (IIRC via `contrasts()`, or possibly `C()` – don’t trust me on having recalled the correct function as I don’t do this very often) and there are add-on packages which can assist with this process or present a higher-level interface than hand-coding contrast matrices. The `Anova()` function in the car package springs to mind, which allows simple specification of which of the “types” of SumSqs to report.

I also learned in SAS, and my familiarity with how to do appropriate analyses there has kept me from moving fully to R. But, since my students and postdocs all learn R (which I think is the right move), I feel like I need to make the switch, too. But it’s painful!

Judging by this comment thread, perhaps we should’ve included another poll question: what type SS do you use for unbalanced designs?

If you had, I’m sure some R user would have been quick to link to Exegeses on Linear Models, a conference paper by Bill Venables (PDF), in Section 5 of which he discusses SAS’s Type III SS. Not everyone in the R community agrees with everything in that section (hence the `Anova()` function in car I mentioned above), but it did make me stop and think about what these analyses/representations were doing when I first read it.

Stewart-Oaten 1995 is a good discussion of this topic in an ecological context (and of the broader issue of “rules vs. judgement” in statistics):

http://www.jstor.org/stable/1940736

(EDIT: link fixed, thank you ucfagls)
