# Ecologists need to do a better job of prediction – part I – the insidious evils of ANOVA (UPDATED)

When I teach a graduate statistics class, I spend some time emphasizing that most statistical analyses can produce a p-value, an effect size and an R2. Students are quick to get that p<0.05 with an effect size of 0.1% and an R2 of 3% is not that useful. This is not a particularly novel insight, but it is not something many students realize fresh out of a first-semester stats class, where the emphasis is on p-values all the time. All three statistics have their roles. p-values are used in a hypothetico-deductive framework, telling us how probable a signal at least this strong would be if chance alone were at work (does nitrogen increase crop yield?). Effect size tells us the biological significance (how much does crop yield increase given a level of nitrogen addition?). This is the main focus in medicine: what is the increase in odds of survival for a new drug? And R2 tells us how much what we’re studying explains vs the other sources of variation (how much of the variation in crop yield is due to nitrogen). It tells us how close we are to done in understanding a system. I am not biased. I like all three summary statistics (and the different modes of scientific inference that they imply).
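
To make the three statistics concrete, here is a minimal sketch in Python using only the standard library. The data are simulated (the plot count, slope and noise level are made-up assumptions, not values from any study), and the p-value uses a large-sample normal approximation to the t-test rather than the exact t distribution. In practice a statistics package would report these numbers directly.

```python
import math
import random

def regression_summary(x, y):
    """Least-squares fit of y on x: returns (slope, R^2, two-sided p-value)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    slope = sxy / sxx                   # effect size: change in y per unit of x
    ss_resid = ss_total - slope * sxy   # residual sum of squares
    r2 = 1 - ss_resid / ss_total        # share of the variance explained by x
    se_slope = math.sqrt(ss_resid / (n - 2) / sxx)
    t_stat = slope / se_slope
    # Assumption: n is large, so the t statistic is treated as standard normal.
    p_value = math.erfc(abs(t_stat) / math.sqrt(2))
    return slope, r2, p_value

# Hypothetical data: a weak nitrogen effect buried in a lot of other variation.
random.seed(1)
n_plots = 5000
nitrogen = [random.uniform(0, 100) for _ in range(n_plots)]
crop_yield = [50 + 0.03 * n + random.gauss(0, 5) for n in nitrogen]

slope, r2, p_value = regression_summary(nitrogen, crop_yield)
print(f"effect size (slope): {slope:.3f} yield units per unit of nitrogen")
print(f"R2: {r2:.3f}")
print(f"p-value: {p_value:.2g}")
```

With thousands of plots the p-value comes out tiny even though nitrogen explains only a few percent of the variation in yield, which is exactly the combination the students are warned about.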

But if you ask me how they are used in ecology, then I think the answer is pretty clear that the p-value is way over-emphasized relative to the other two. You can find dozens of papers making this same claim (e.g. this and this). But the one that really pounds this home to me is this 2002 paper by Møller and Jennions. In this paper they conduct a formal meta-analysis of a random subset of papers published in ecology journals and look at the average R2 across these papers. Anybody want to guess what it is? We know it won’t be 80% or 90%. Ecology lives in a multi-causal world with many factors influencing a system simultaneously. Maybe 40% might be reasonable? Nope. 30%? Nope. At least 20%?! No. Depending on the exact method it is 2-5%! To me this is astonishing. Papers published in our top journals explain less than 5% of the variance (UPDATE: see footnote). In a less jaw-dropping result but very much in the same vein, Volker Bahn and I showed that we can predict the abundance of a species better by using spatial autocorrelation (basically copying the value measured at a site 200km away) than by advanced models incorporating climate, productivity, land cover, etc.

Whether you are motivated by basic or applied research goals, it seems clear that we ecologists need to do better than this! To be very specific we need to deliver on the predictive aspects of science that care about effect size and R2. From an applied side, policy makers and on-the-ground practitioners looking for recommendations for action care almost entirely about effect size and R2. Knowing that leaving retention patches in clear-cut forests increases species richness on average by 10 species (a large effect size) is the main topic of interest. As an additional nuance, if we further know that retention patches explain only 10% of the variation in species richness between cutting sites because it mostly depends on the history of the site and chance immigration events and weather in the first year of regrowth, then at a minimum we need to do more work and we might just pass on taking action. Knowing p is pretty immaterial once a certain credible minimum sample size threshold is passed. From a basic research side, there are also good arguments for focusing on effect size and R2. I don’t really believe science is about saying “I can show factor X influences measurable variable Y with less than a 5% chance of being wrong” and I don’t think most other people do either. I am a big fan of Lakatos, who rejects Popper’s emphasis on falsification and suggests that the true hallmark of science is producing “hitherto unknown novel facts” (elsewhere he raises the bar with words like stunning and bold and risky) and I would agree. Lakatos gives the example of Einstein’s general theory of relativity truly being accepted when the light of a star was observed to be bent by the gravity of the sun – a previously unimagined result. Ultimately, if all we’re doing is post hoc explanation, it is at best a deeply diminished form of science.

And even if one is rooted in the innate value of basic research, at some purely practical level there has to be a realization that if ecology as a whole (not individual researchers) doesn’t in some fashion step up and provide the basic tools to help us predict and navigate our way through this disastrous anthropogenic experiment known as global change, then society just might decide we deserve a funding level more like what the humanities receive.

So assuming you bought my above two arguments that:

• Ecology is bad at prediction
• Ecology needs to get better at prediction

then what do we do?

This question is going to be the topic of a series of posts (currently planned at three). In this post, beyond introducing the topic of prediction in ecology and arguing that we need to do a better job (i.e. the half of the post you just read), I want to examine one major roadblock to getting better at prediction in ecology, what I provocatively called in my title the “insidious evils of ANOVA”.

Now I want to be clear I have nothing against analysis of variance per se. It is a perfectly good technique for putting statistical significance on regressions. And indeed the basic ANOVA-derived F-ratio test on a univariate regression gives the same p-value as a t-test on the slope and has much in common with likelihood ratios (and hence with AIC*) for normally distributed errors (the –log likelihood of normally distributed errors is nothing more than a constant times the sum of squares, and hence a likelihood ratio is a ratio of sums of squares just like an F-test, barring a few constants and degrees of freedom). So to take away ANOVA sensu stricto you would have to take away most of our inferential machinery. And variance partitioning (including R2 that I am promoting here) is a great thing. I have published papers that use ANOVA and will continue to do so.
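
The equivalence between the F-ratio and the t-test on the slope can be checked numerically. The sketch below is illustrative only: the data are simulated with arbitrary, made-up parameters, and it compares test statistics rather than p-values. Because the F statistic on (1, n-2) degrees of freedom is the square of the t statistic on n-2 degrees of freedom, identical p-values follow. The last lines show the link to the likelihood ratio for normal errors, which depends on the data only through the ratio of sums of squares.

```python
import math
import random

# Simulated univariate regression data (all parameters are made up).
random.seed(2)
x = [float(i) for i in range(20)]
y = [3.0 + 0.5 * xi + random.gauss(0, 2) for xi in x]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Partition the variance: total = model + residual.
ss_total = sum((yi - mean_y) ** 2 for yi in y)
ss_resid = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
ss_model = ss_total - ss_resid

# t-test on the slope.
se_slope = math.sqrt(ss_resid / (n - 2) / sxx)
t_stat = slope / se_slope

# ANOVA F-ratio: model mean square (1 df) over residual mean square (n-2 df).
f_stat = (ss_model / 1) / (ss_resid / (n - 2))

# Likelihood-ratio statistic for normal errors (slope model vs. intercept only).
lr_stat = n * math.log(ss_total / ss_resid)

print(f"t^2 = {t_stat ** 2:.6f}, F = {f_stat:.6f}")
print(f"LR  = {lr_stat:.6f}, n*log(1 + F/(n-2)) = {n * math.log(1 + f_stat / (n - 2)):.6f}")
```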

What I object to is how ANOVA is typically applied and used (Click here for a humorous detour on ANOVA), starting from the experimental design and ending with how ANOVA is reported in journals. I’ll boil these down to two “evils” of traditional use of ANOVA:

1. ANOVA hides effect sizes and R2 – Again, in a technical sense, ANOVA is just the use of an F-statistic to get a p-value on a regression. And you can get R2 and effect size off of a regression. But in a practical sense as commonly used (and in every first year stats class), ANOVA specifically means a study where you have a continuous dependent (Y) variable and one or more discrete, categorical independent/explanatory/X-variables. The classic agricultural example is yield (continuous) vs. fertilizer added or not added (discrete). Or abundance of target species in the presence and absence of competition. There is again nothing per se wrong with this. It’s just that most software packages out there (including R for the aov command) only report a p-value when you run this kind of ANOVA. They don’t report an effect size (difference in mean yields for with vs without fertilizer). And they don’t report an R2. You can get these values out with a little extra work but they’re not in the default reports. And then reviewers and editors let the authors write the ms reporting only a p-value without an R2. This is how we end up with a literature full of p<0.00001 but R2=0.04. I would argue that every single manuscript should be required to report its R2 and effect size. I hypothesize this requirement alone would cause the average R2 of our field to go up. At least some people would be embarrassed to publish a paper with an effect size of 2% or an R2 of 3%. It would be painful, because this is the state of our field today, but it would be really healthy in the long run to mandate always publishing an R2 and an effect size alongside the p-value.
2. ANOVA experimental designs focus on the wrong question – A typical ANOVA setup asks the question does X have an effect on Y. In ecology it is not surprising that more often than not X does have an effect on Y (everything has an effect on Y when Y is something like abundance or birth rate or productivity). Indeed it would be shocking if X did not have an effect on Y. At that point it just becomes a game of chasing a large enough sample size to get p<0.05 and then voilà! it’s publishable. Instead we should be asking “how does Y vary with X”. This doesn’t require a drastic change in the experimental methods. Just a shift in thinking to response surfaces instead of ANOVA. A response surface measures Y for multiple values of X and then interpolates a smooth curve through the data. This is just as accessible as ANOVA – specifically it does not require an a priori quantitative model. However, it sure feeds into formulating new models or testing models somebody else developed. There was a nice recent review paper on functional responses by Denny and Benedetti-Cecchi that shows how powerful having a response surface is in ecology (although they focus on mechanism and scaling rather than statistics and experimental design). Having a tool and mindset that are more focused on prediction leads directly to questions of R2 (how big is the scatter around our line) and effect size (how much does the line deviate from a flat horizontal line) AND we get something that drives us immediately to models and we get a quantitative prediction (albeit phenomenological) even before we have a mechanistic model. This is a no-lose proposition.
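
The "little extra work" needed to get an effect size and an R2 out of a categorical design is small. The sketch below is one possible way to do it in plain Python, not what any particular package does: the helper names (`one_way_anova_summary`, `cohens_d`) and the simulated fertilizer data are hypothetical. For a one-way ANOVA, R2 is the between-group sum of squares divided by the total, and the raw effect size is simply the difference in group means (Cohen's d standardizes that difference by the pooled standard deviation).

```python
import math
import random
from statistics import mean

def one_way_anova_summary(groups):
    """Return (F, R^2) for a one-way ANOVA on a list of groups of numbers."""
    values = [v for g in groups for v in g]
    grand_mean = mean(values)
    group_means = [mean(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    ss_within = sum((v - m) ** 2
                    for g, m in zip(groups, group_means) for v in g)
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    r2 = ss_between / (ss_between + ss_within)  # variance explained by grouping
    return f_stat, r2

def cohens_d(treated, control):
    """Difference in means divided by the pooled standard deviation."""
    m_t, m_c = mean(treated), mean(control)
    pooled_ss = (sum((v - m_t) ** 2 for v in treated)
                 + sum((v - m_c) ** 2 for v in control))
    pooled_sd = math.sqrt(pooled_ss / (len(treated) + len(control) - 2))
    return (m_t - m_c) / pooled_sd

# Hypothetical yields with and without fertilizer (made-up numbers).
random.seed(3)
control = [random.gauss(50, 10) for _ in range(500)]
fertilized = [random.gauss(52, 10) for _ in range(500)]

f_stat, r2 = one_way_anova_summary([control, fertilized])
print(f"F = {f_stat:.2f}")
print(f"difference in means = {mean(fertilized) - mean(control):.2f}")
print(f"Cohen's d = {cohens_d(fertilized, control):.2f}")
print(f"R2 = {r2:.3f}")
```

Reporting the last three lines next to the F-test makes it obvious when a significant result explains very little of the variation.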

A few technical details and an example will make response surfaces more clear. One reason response surfaces have been limited in use is that they traditionally required using NLS (nonlinear least squares regression) unless the response surface was a boring straight line, but in this day and age of GAM (basically spline regression) this excuse is gone! The figure below shows a contrast between traditional ANOVA approach (the boxplot in subfigure A) and response surfaces (subfigure B and then a 2-D version in subfigure C). Other than a shift in mindset (plot a spline through the data instead of a box plot), the only practical implication is that we ought to emphasize number of levels within a factor (e.g. four levels of nitrogen addition or competition instead of just the two levels of control and manipulation). This can be cost free because we can reduce the number of replicates at each level proportionately (i.e. 4 levels with 2 replicates instead of 2 levels with 4 replicates) (again compare subfigure A with subfigure B).
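
The design trade-off described above (4 levels with 2 replicates instead of 2 levels with 4 replicates) can be sketched with simulated data. This is an illustration under stated assumptions, not the analysis behind the figure: the hump-shaped `true_yield` function, the noise level and the nitrogen levels are all made up, and a quadratic fitted by least squares stands in for the spline or GAM, simply to keep the example dependency-free.

```python
import random
from statistics import mean

def fit_quadratic(x, y):
    """Least-squares fit of y = b0 + b1*x + b2*x^2 via the normal equations."""
    power_sums = [sum(xi ** k for xi in x) for k in range(5)]
    a = [[power_sums[i + j] for j in range(3)] for i in range(3)]
    b = [sum(yi * xi ** i for xi, yi in zip(x, y)) for i in range(3)]
    # Gaussian elimination with partial pivoting on the 3x3 system.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, 3):
            factor = a[r][col] / a[col][col]
            for c in range(col, 3):
                a[r][c] -= factor * a[col][c]
            b[r] -= factor * b[col]
    coef = [0.0, 0.0, 0.0]
    for i in reversed(range(3)):
        tail = sum(a[i][j] * coef[j] for j in range(i + 1, 3))
        coef[i] = (b[i] - tail) / a[i][i]
    return coef

def true_yield(n):
    """Made-up response: peaks at n = 1 and falls back to baseline at n = 2."""
    return 40 + 40 * n - 20 * n ** 2

random.seed(4)
noise_sd = 3.0

# Design A: 2 nitrogen levels x 4 replicates (the classic ANOVA layout).
levels_a = [0.0] * 4 + [2.0] * 4
yield_a = [true_yield(n) + random.gauss(0, noise_sd) for n in levels_a]
diff_a = mean(yield_a[4:]) - mean(yield_a[:4])

# Design B: 4 nitrogen levels x 2 replicates (same number of plots).
levels_b = [lvl for lvl in (0.0, 0.7, 1.3, 2.0) for _ in range(2)]
yield_b = [true_yield(n) + random.gauss(0, noise_sd) for n in levels_b]
b0, b1, b2 = fit_quadratic(levels_b, yield_b)
fitted = [b0 + b1 * n + b2 * n ** 2 for n in levels_b]
ss_resid = sum((y - f) ** 2 for y, f in zip(yield_b, fitted))
ss_total = sum((y - mean(yield_b)) ** 2 for y in yield_b)
r2_b = 1 - ss_resid / ss_total
optimum = -b1 / (2 * b2)

print(f"Design A: difference in mean yield = {diff_a:.1f}")
print(f"Design B: curvature = {b2:.1f}, optimum nitrogen = {optimum:.2f}, R2 = {r2_b:.2f}")
```

In this made-up scenario the two-level design compares the two ends of a hump and so sees little difference, while the four-level design, with the same number of plots, recovers the shape of the response and an estimate of the optimum.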

Just to expand briefly on the benefits of a response surface, it serves as a wonderful interface to modeling. If we have a model, we can test if it produces the response surface found empirically; if we don’t have a model it immediately suggests phenomenological models and may even suggest mechanistic models (or at least whether the positive or negative feedbacks are dominating and whether there is an optimum). Further we have a prediction immediately useful in an applied context. If we have a response surface of, say, juvenile survival vs. predator abundance, we can immediately have an answer for that conservation biologist who says, given that I have limited dollars, what kind of benefit will I see if I invest in eradication of the invasive predator? Being scientists, we will immediately want to caveat this analysis with warnings about context dependence, year-to-year variation etc. That’s fine, but I bet the conservation donor will still be a lot happier with you if you have this response surface than if you have p<0.05! So, there are both basic and applied reasons to use response surfaces.

Of course there is lots of fine work in ecology that addresses both of my concerns. But without doing a formal meta-analysis I am pretty sure that the version of ANOVA I am picking on is a plurality if not an outright majority of studies published in major ecology journals. Occasionally the independent variable truly is categorical and unordered so a response surface won’t work, but this is rare and even then can often be improved to a continuous variable with a little thinking (and it doesn’t excuse not reporting a measure of goodness of fit and effect size!).

In conclusion, I did not get into science to conclude “Factor X has an unspecified but significant effect on factor Y”. This is all the traditional use of ANOVA tells us. Two simple proposed changes: 1) require reporting the R2 and effect size to publish and 2) shift to a response surface mentality (specifically more levels with fewer replicates and fitting a surface through the data) let us go past this question. Now we can ask “how does Y vary with X and how much of Y is explained by X?”. This is much more exciting to me and I hope to you! This subtle, labor cost neutral, but critical reframing lets mechanistic models in the door much more quickly and gives an ability to answer applied questions in the meantime. The opportunity cost of failing to do this is what I call “the insidious evil of ANOVA”. I am convinced that our field would progress much more quickly if we made this change. What do you think?

*don’t get me started on AIC – that is for another post, but note for now that focusing on AIC neatly dodges having to report an R2 and effect size

Footnote from Jeremy: Upon reflection, Brian and I have decided we ought to alert readers that Anders Møller has been subject to numerous very serious accusations of sloppiness and data falsification, one of which led to retraction of an Oikos paper. See this old post for links which cover this history. We leave it to readers to decide for themselves whether that history should affect their judgment of a paper (Møller and Jennions 2002) that has not, to our knowledge, been directly questioned. In any case, Brian and I believe that his post is robust: many R2 values reported in ecological papers are indeed low, few are extremely high, and the point of the post does not depend on the precise value of the average R2.

This figure shows classic agricultural yield data based on amount of fertilization. Data from Paris 1992 (The return of Von Liebig’s “Law of the Minimum”) with some random noise added by me. A) A traditional ANOVA approach showing yield vs. nitrogen addition with only two or a few levels. Here there is borderline non-significance (p=0.08). B) The much more revealing response surface (it turns out this data was from the 0 phosphorus addition set of the data and the plants actually fare poorly when too much nitrogen is added without phosphorus). C) The full fitted 2-D surface of yield vs nitrogen and phosphorus. The Paris paper actually fits different functional forms to test whether Liebig’s law of the minimum is true or not. This is a far cry from (A)!

This entry was posted in Instructional, New ideas by Brian McGill.

I am a macroecologist at the University of Maine. I study how human-caused global change (especially global warming and land cover change) affects communities, biodiversity and our global ecology.

## 69 thoughts on “Ecologists need to do a better job of prediction – part I – the insidious evils of ANOVA (UPDATED)”

1. Excellent post Brian! I hope you will expand on the different measures of effect size throughout this series. All too often I review analyses that simply throw many independent variables in a model and use AIC to determine the most parsimonious model. Of course, AIC will propose a top model but this model may yield poor prediction. You suggest that we may be embarrassed to publish a paper with small effect size and R^2. While I generally agree, I think there is a key point missing. I think if we can design our analyses to maximize inference as opposed to simply throwing everything into the mix, then we can extract more meaningful effect sizes. These may be small but in my opinion more meaningful as we will have more confidence that we are representing the “true” effect size of interest. A classic example would be when two independent variables interact to influence a dependent variable. In this case, we are interested in knowing the effect size of the interaction and not the main effects in our model. Consequently, we should design our analysis to reflect our interest in the interaction. I look forward to reading the next posts in this series!

• Thanks Shawn. You raise some good points about AIC/model selection and multivariate regression/collinearity/model selection. These are hot issues in the so called “big data” (i.e. data mining) movement. They are, I suppose, part of the dark side of what would happen if the pendulum in ecology were to swing to the opposite side and become obsessed with eking out 0.02 more on R2. I hadn’t planned a post on that, but it could be interesting.

2. Terrific post Brian! I have nothing to add, but wanted to leave a positive comment in hopes that it would encourage you to post more:) This sort of post is very helpful in providing me additional ways to get the point across about p-values, effect sizes and r2 to my students. I have mixed views on AIC (but I’m far from an expert), so would love to hear your take on when it is and isn’t appropriate to apply.

• Andrea: this old post links to a primer on model selection methods, including AIC. Shane Richards also has a nice Ecology paper from 2005 or so on AIC, explaining what it is, and using simulations to test its ability to reliably pick out the true model from a candidate set of models.

• Great, thanks Jeremy.

• I’d also add Symonds and Mousalli 2010 in Behavioral Ecology and for a big overview of alternative methods, Hobbs and Hilborn in Ecology in 2006.

3. One problem is that we stop our studies when we get p<0.05. This should merely be step 1. The next step is to use the prediction equations derived from step 1 to inform a quantitative prediction of an intervention in the system (i.e. if I change x to x' then y should be y'). Do/observe the intervention (experiment) and compare (quantitatively) how well your prediction fared. Then you have a solid claim of understanding the system. This exercise makes obvious both effect sizes and R^2. It also requires that you do a good job modelling the deterministic (mean for example) as well as stochastic parts (selecting reasonable prob. dists. for the response variables) of your system.

• I like the two step framing. It is essentially the same as the idea of a training dataset and a test dataset except the test dataset is a post hoc experiment. Very strong inferential method indeed!

4. Great post, Brian. This also overlaps with many of the issues that I enjoyed seeing addressed in Cottingham et al.’s 2005 piece about why we should design experiments based on regression rather than ANOVA. The argument there centered on power, and the authors were just focusing on OLS regression (although NLS & GLMs should produce the same results under nonlinear situations, I would imagine). What intrigued me about the responses (besides some weirdly out of touch vitriol in one) was the discussion of R2 – namely that increasing dispersion of treatment levels can increase R2.

I’m curious as to your thoughts along the lines of R2 as a valuable indicator of what we learn from an experiment when treatment dispersion, sample size, and (in the case of experiments) our ability to control extrinsic conditions can all play very large roles in R2.

For me, I think a combination of R2, effect size, an evaluation using a comparison of several models, and an evaluation of support for parameters being different from 0 represents a solid approach at evaluating the balance of evidence about a treatment of interest. What’s tricky is then moving forward with prediction (particularly when you’re working forward from an experiment, and not a field survey with all of the requisite noise).

All of which is a roundabout way of saying, I’m looking forward to where you go with this, and I think careful consideration of what influences all of the above quantities – both due to experimental design and the very concept of experiment versus observational study – is important.

• I totally agree. I wasn’t as clear about it as you just were, but “all of the above” is the correct approach (and multiple model comparison is also important).

5. This was a great read and I’m really looking forward to the rest of the posts in this series! In the past I’ve turned to Bayesian methods to avoid overly relying on p-values. In doing so, I better understood my dataset, what the data might be telling me, and eventually identified quantitative thresholds that are (hopefully) useful to resource managers. I think that the way you described the problem you’ve presented here will just “click” with many people. I’m very interested in hearing more about how we can use statistics in general to actually add value to our results and the applications of those results, rather than merely use p-value as a gatekeeper to getting published.

6. I agree with the usefulness of P, effect size and R^2 and I need to frequently remind myself of this. It’s easy to fall to the charms of AIC without stepping back to see that your best model or your averaged model doesn’t really explain much! Some thoughts though. Not sure I would call effect size a measure of biological significance. Effects can be significant with very small effect sizes if the effect is compounded over time (interest) or just simply allowed to have an effect over a long amount of time. Very small selection coefficients can produce large results given large time. And that is the crux of the problem with suggesting we should be embarrassed by R^2 = 0.03. If explanations are really highly multifactorial, each of small effect, then the only way to get a high R^2 is if most of the factors are in the model. But I’m happy to show that something has an effect, and be very confident of the magnitude of the effect (using P and/or standard errors) even if the effect size is small and the R^2 is small. Interpreting the effect size is really challenging but there are a few papers that have compiled effect sizes across different disciplines and these are useful if only in a comparative way. This gets to prediction. Is the goal of science predictability (high R^2) or understanding (low standard errors of parameter estimates with good causal model of why the estimates are what they are)? This probably depends on if you are in academic or applied science. I don’t need to make decisions so a high R^2 is not a goal. In fact a low one is exciting because it means there is lots of work to do to understand a complex system better. And obviously, a single factor may be a great predictor though not a causal factor at all but is a good predictor because of high correlation with one or more other predictors that you haven’t measured or were but were dropped from the “best model”.

• For me, the use of AIC (and ANOVA tables, come to that) is to help you find the best model (for whatever definition of best you want to use). Unfortunately people forget that getting the best model is only the start – you actually need to look at it and see what it’s telling you. Hence, the use of R^2 and the parameter estimates (i.e. the response surface, something even an ANOVA model gives you).

• I agree with most of what you say.

There is a tradition out there of calling a p-value the statistical significance and the effect size the biological significance. That was the context in which I used the term, but I agree it is somewhat confusing.

Your later point about understanding vs prediction is important. I like both goals. But have we understood very much when r2=0.03? Is not the ultimate most rigorous test of understanding the ability to make a priori prediction instead of post hoc explanation?

7. Wasn’t it Sir David Cox who commented that the only use for ANOVA tables was to get the degrees of freedom?

I must admit I do use ANOVA tables, but that’s to (a) eyeball where the variation is being partitioned, and (b) check to see if I need to worry about interactions. I’m not sure I’d ever present these in a paper, though: there are too many reasons why they could be wrong (there’s a really fun paper by Schey which shows this, including a data set where the ANOVA table shows that the second variable explains everything, no matter which variable is second).

BTW, there should be a paper about R^2 coming out soon in Methods in Ecology & Evolution.

8. Great post. Though I suspect the tides are changing/have changed with regards to reviewers and editors allowing p-values without effect sizes and r2. It was largely through the review process (rather than my stats courses, as you mention) that I became convinced of the need to emphasize these. In my experience (which is limited), it seems to be a common request from reviewers to include effect sizes, and rightfully so. Plus, they’re relatively simple to calculate, though I too wish they were more automated in outputs.

• Re: being asked for effect sizes, don’t figures basically show that (differences between treatment means, or whatever)? What sort of information have you been asked for, beyond what your figures show?

• Cohen’s d or r. Usually on the basis that someday, someone might use them in a meta-analysis, so they’ll need standardized effect sizes. Beyond that, figures are definitely a good (best?) way to show effect sizes, but for comparing data that aren’t in figures, or are from a different study, standardized measures are useful.

9. A clarificatory question, Brian, perhaps on something you were planning to address in future posts in the series. What do you mean by prediction? As I’m sure you know, that term can mean various things–do ecologists need to get better at all of them? There’s predicting the value of a new observation from the same population. There’s predicting the value of observations from a different population. And there’s the very different, non-statistical sense of predicting the existence of some phenomenon that we didn’t even think existed, as in the Lakatos-on-Einstein example. And probably other senses besides. I take it that in these posts you’re mostly interested in statistical senses of prediction, not “we need to think of more novel hypotheses that predict the existence of surprising phenomena” prediction?

And a minor comment re: failure to report or sufficiently emphasize effect sizes and R^2 values when reporting ANOVAs or other GLMs. Isn’t that what our figures show? When reporting an ANOVA, it’s pretty standard to accompany the report with a plot of the relevant means and standard error bars, or perhaps boxplots rather than just means +/- SEs. Doesn’t that illustrate the effect sizes, and also give some sort of visual sense of how much residual within-group variation is left unexplained? Similar remarks could of course be made about the reporting of results from other sorts of GLMs.

• Plots with means and SEs illustrate how well we’ve estimated the means but not the effect size. The estimate of the means could be really good because of large N but the effect size could still be very small, since that is generally considered a function of s.d. (see below example with selection coefficients). I like Cohen’s d and wish more people would report it so that I could build up an intuitive sense for what is pretty small v moderate v pretty large. Comparing effect sizes in my field with Cohen’s recommendations doesn’t make much sense since effect sizes can be very different with different phenomena. Again, look at selection coefficients: the selection coefficients with the best estimates had very small effect sizes (see the Kingsolver paper from early 2000s), so to compare these with studies from psychology doesn’t make any sense. A selection coefficient effect size (beta coefficient) of 0.1 is huge (especially if it’s been estimated well) but this is less than small according to Cohen. Not until people routinely report this can we build up a sense of what is small or big.

• Ok, though as other comments have noted, the notion of “effect size” isn’t really something that’s easily and sensibly summarized in any one universally applicable number. Think for instance of the large cumulative effect on phenotypic evolution that even a very small selection coefficient can have if it’s maintained for long enough. For many purposes, though obviously not all, I do think the sorts of plots with which ecologists usually accompany their statistical analyses provide a lot of useful information about how “strong” effects are in some biologically-relevant sense. I think Brian’s points about reporting of statistical results are well-taken, but I don’t know that things on that front are quite as bad as Brian suggests.

• Jeff raises a great point. The tradition of putting SE (standard error) error bars on plots is totally consistent with a p-value obsessed focus. Putting standard deviation or 95 percentile bars allows a rapid assessment of both r2 and effect size (in Jeff’s sense of Cohen’s d, which compares the difference in means with the amount of variability). No way to do this with standard error. All I can really tell from standard error is approximate p-value (which is the one thing already reported in the text) and how impressively big your sample size is (and thus impressively small your SE is).

• Great call for clarification Jeremy. This post has gotten rather statistical in nature but I hope to broaden out to a less statistical view in other posts.

I guess my definition of prediction is nothing more nor less than the common English language meaning. To make a statement about what will happen at some future time that can then be observed to be true or not. It’s hard to respect a science that can only do post hoc explanation. It’s also hard to respect a science that is not willing to get at least somewhat quantitative about its statements.

Your issue of interpolation (predict new individual in same population) vs extrapolation (individual in new population) in prediction is important and I’ll touch on it in my next post.

• Thanks for the clarification Brian. Though at the risk of being pedantic, even the “prediction vs. post hoc explanation” distinction isn’t as clear cut as one might think. In philosophy, this is known as the “old evidence” problem. Evidence that was already known at the time a theory is developed often is treated as a test of that theory’s predictions, rather than the theory being treated as a post hoc explanation. Which seems like a problem, because how could a theory fail to fit evidence that’s already known? A famous example is Einstein’s theory of relativity and its prediction of… (hope I’m recalling this correctly) the perihelion of Mercury, a phenomenon that was known well before Einstein, and of which I believe Einstein was aware. Philosopher Deborah Mayo suggests one plausible resolution to the “old evidence” problem (i.e. a reason why it can be legit to think of “predictions” of known evidence as true predictions in your sense). I talk about her work in a post scheduled for tomorrow, though I won’t be bringing up the application to the “old evidence” problem.

10. I agree it quickly gets sticky defining prediction. I tend to think of it as a balance scale – some successful predictions put a lot of weight on the side of a theory being true, others put less weight on the side of truth (but still add some weight). Predicting the precession of perihelion of Mercury was less impressive than predicting the bending of light, but it was still some evidence in favor of relativity. I like Rosenzweig’s dipswitch idea – make 5 different “old evidence” predictions and you start to look pretty good even when it is old evidence.

11. Another clarification question Brian (which I’m pretty sure you’ll be getting to in a future post, so you can just ignore it for now if you like). Is the problem that ecologists are bad at prediction, or *explanation*? For instance, if species abundances are spatially autocorrelated, so that I can say “I predict that abundance here will be the same as abundance nearby”, does that count as successful prediction for you? Or is the problem that that prediction isn’t made by appeal to some *explanatory* variable–it’s not a prediction like “I predict that abundance here will be X because this is a wet site and this species does well at wet sites”?

More generally, any system with sufficiently repeatable behavior will be predictable via “the straight rule of induction”: the future will be like the past. I predict the sun will rise tomorrow, because it’s risen every day in the past. Is that sort of prediction enough to address the concerns you raise? Because one could well make that sort of prediction without ever being able to produce, say, a regression model with a high R^2. (Whether the “straight rule of induction” would often work well in ecology is another question. Probably not. But that’s not my question…)

• Thanks for the probing questions Jeremy. You phrase it as prediction vs. explanation. I would phrase it as autocorrelation vs. exogenous variable prediction. I would agree that most people would find the autocorrelation prediction unsatisfying. For years, weather forecasters could not beat the “tomorrow will be the same as today” prediction. But they kept trying and now beat that prediction pretty solidly. Autocorrelation predictions also have limits – they cannot predict that far into the future. However, they do serve as benchmarks for success. Until we can beat those predictions, we need to be very humble about the fact that we have a lot more work to do.
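
The benchmark idea in this reply can be made concrete as a skill score. What follows is only a minimal sketch, assuming forecasts are scored by mean squared error against the “tomorrow will be the same as today” forecast; the function names and the toy series are hypothetical and not taken from the post.

```python
def mean_squared_error(observed, predicted):
    """Average squared difference between paired observations and predictions."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)


def skill_vs_persistence(series, model_forecast):
    """Return 1 - MSE(model) / MSE(persistence).

    `series` holds observations at times 0..n-1; `model_forecast` holds the
    model's forecasts for times 1..n-1. Positive values mean the model beats
    "next value = current value"; zero or negative means it does not.
    Undefined (division by zero) if the series never changes.
    """
    target = series[1:]         # the values to be forecast
    persistence = series[:-1]   # forecast each value by the one before it
    return 1.0 - (mean_squared_error(target, model_forecast)
                  / mean_squared_error(target, persistence))


# Toy series (made-up numbers) whose steps keep growing.
series = [1.0, 2.0, 4.0, 7.0, 11.0]
print(skill_vs_persistence(series, series[1:]))   # perfect forecast -> 1.0
print(skill_vs_persistence(series, series[:-1]))  # persistence itself -> 0.0
```

A model with a skill score at or below zero has not yet beaten the autocorrelation benchmark, however good its fit statistics look.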

I suspect whether one accepts autocorrelation (or the sun will rise tomorrow) type predictions depends on the goals and context. Personally, I am not in an applied enough setting to find them satisfying. So I posit there is a spectrum ranging from pure prediction to pure explanation. In my post I am arguing that we are close to the pure explanation end and need to move some distance to the prediction end. I suppose how far is a matter of personal taste and context. Personally, I go about 2/3 of the way down the scale. As a concrete example, neural nets and even modern regression tree techniques like boosting and bagging all do a little better than old-fashioned regression trees. But they turn the regression model into a black box. I am not comfortable with that. I will sacrifice the slight improvement in R2 to be able to have at least some window into the black box of how different variables matter, how they interact with each other etc. And I consider purely correlative models in general (like say a niche model using regression trees) to be a point along the path, not an endpoint. My research is actively trying to move us at least one step beyond purely correlative niche models for predicting species ranges under climate change.

• Yes – absolutely you can. But to me it is just one step too far towards a black box (e.g. most people don’t bother to analyze how the variables are interacting with each other). And since the boost in R2 is usually less than 0.05, it is not a good trade-off in my view. But that’s a personal preference. I know most people prefer these other methods.

• Understood – just giving some pushback to the “black box” dismissal. Many posts on this blog address what people don’t but should do in ecology; if you’re editing or reviewing a paper that doesn’t report variable contributions or relationships from machine learning, you can always request it. Any of the good review or introductory papers on the topic will point the author in directions to do so.

15. Hi Brian, this fits in well with the latest post about making bets…although I stumbled on the more recent post before finding this.
I think that people often feel that prediction is only important in very specific contexts (e.g. applied ecology) but I would assert that there is only one way to demonstrate understanding and it is through prediction. So, even the most theoretical of scientists interested in things completely without practical application at this point should still feel compelled to demonstrate that they have increased our understanding of the natural world. And the only way to do that is with prediction.
What I find interesting is that we usually evaluate empirical models based on how well they fit the data that were used to construct them. But, of course, the implicit assumption is that if the model fits the data that it was constructed from well then it is more likely to make good predictions about new data. I suspect that most ecologists see a high R2 as an end in itself but, in fact, it is only useful insomuch as it provides evidence that the model will predict new data well. If we knew that the model with the high R2 made bad predictions for all new data we would discard it as an anomaly. How often is the best model the model that predicts new data best? I don’t have the answer to this question but my guess is that it is less often than we might imagine. One reason I don’t know the answer is that we almost never do the test.
I think that predictions can be evaluated based primarily on two axes – accuracy/precision and generality. So, how close do we get to the right number if the prediction is continuous, and how often do we put things in the right box if the prediction is categorical? And, over how wide a range of conditions do the predictions hold? I haven’t figured out where the ‘unexpectedness’ of the prediction fits into the story.
The issue of prediction versus explanation is one I struggle with because it often seems to me that explanation is a series of nested Babushka dolls. If I can predict species richness well using latitude is this devoid of explanation? I’ve had colleagues suggest that ‘latitude is not an explanation for species richness – latitude is just numbers on a map.’ They have less problem if I use temperature as an explanation for species richness but it’s not clear to me how there is something qualitatively different between temperature and latitude as a predictor of species richness. We would have to dig deeper into temperature to identify what the mechanism buried within temperature is that results in increased species richness (in the same way we have to dig into latitude). To use the spatial autocorrelation example, I think if abundance at neighboring sites does a good job of predicting abundance at a new site then we understand abundance at the level of ‘how are abundances at neighboring sites related?’ But, we don’t know how much of that is due to similar environments, dispersal among sites, large-scale catastrophic events etc. But let’s say that we find that most of this is due to a common response to temperature and precipitation. We still might not know how temperature and precipitation are affecting the organisms. Is it because there are more food resources? More efficient use of the available resources? etc. And if it’s because there are more food resources is more food affecting survival or fecundity or some combination? And so on. There is probably a point where the box is completely unpacked but it often seems to be well below the point where we would already have been comfortable saying we had ‘explanation’.
I think Brian has identified one of the most critical issues facing ecology and I’m looking forward to seeing where this goes. Best.

Jeff Houlahan
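
The gap between fitting the construction data and predicting new data can be illustrated with a small simulation. This is only a sketch under stated assumptions: the data-generating rule (response = 2x + noise), the sample sizes, and the two models (a least-squares line and a deliberately over-flexible nearest-neighbour “memoriser”) are all hypothetical choices made for illustration.

```python
import random
from statistics import mean


def r_squared(observed, predicted):
    """1 - SS_residual / SS_total; this can be negative on new data."""
    centre = mean(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - centre) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for one predictor."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx


def nearest_neighbour_predictor(x_train, y_train):
    """Over-flexible model: copy the response of the closest training x."""
    def predict(x):
        nearest = min(range(len(x_train)), key=lambda i: abs(x_train[i] - x))
        return y_train[nearest]
    return predict


def simulate(rng, n):
    """Hypothetical data: response = 2 * x + Gaussian noise."""
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    ys = [2.0 * x + rng.gauss(0.0, 5.0) for x in xs]
    return xs, ys


rng = random.Random(1)
x_fit, y_fit = simulate(rng, 30)   # data used to build the models
x_new, y_new = simulate(rng, 30)   # independent data, never used in fitting

slope, intercept = fit_line(x_fit, y_fit)
models = {
    "straight line": lambda x: intercept + slope * x,
    "nearest neighbour": nearest_neighbour_predictor(x_fit, y_fit),
}
for name, model in models.items():
    r2_fit = r_squared(y_fit, [model(x) for x in x_fit])
    r2_new = r_squared(y_new, [model(x) for x in x_new])
    print(f"{name}: R2 on fitting data = {r2_fit:.2f}, R2 on new data = {r2_new:.2f}")
```

The nearest-neighbour model reproduces its own fitting data exactly, so its R2 on those data is 1 by construction. How it fares on the new data is the test that the fitting-data R2 cannot provide.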

• Hi Jeff – I *really* like your two axes of accuracy vs generalizability of prediction. I agree with you about “explanation” being a Babushka-dolls, where-do-we-stop issue. My next post is more or less on this topic, so I will just take this as a spur to write it (hopefully later this week). But I’m sure Jeremy will disagree with both of us ;-)

16. Reblogged this on Sam Clifford and commented:
The author of this post argues that 1) ecology is bad at prediction and 2) ecology needs to get better at prediction. I’d expand this beyond ecology to air quality and most other sciences based on observations of environmental phenomena. I’ve railed against ANOVA previously but haven’t spent as much time as the author delving into the philosophy of science. I like that they suggest GAMs. I don’t like the focus on only p values, R-squared and effect sizes but I do note that they’ve mentioned AIC, so perhaps this isn’t aimed at people getting PhDs in statistical modelling.

17. I found this post very interesting and informative. Highlighting the importance of the question of interest and its associated statistic is so important, interesting and neglected. The section about the evils of ANOVA is also interesting and I am looking forward to reading more in future posts. I was wondering how the logistical limitations of experimental design factored into your preference for response surface vs ANOVA designs. I was also wondering if such limitations may have played a role in the historic use of ANOVA designs in the literature (as opposed to strict calculation limitations such as the need for NLS). I would think many would prefer to have a range of values for a given experimental factor (with the associated more interesting question) but are in fact limited to two levels per factor of interest for physical or financial reasons. Personal example: I only had access to two identical chambers set up for CO2 control. Also, in a full factorial design, the number of experimental units rises as the number of levels to the power of the number of factors. Even with a small number of factors, doesn’t the number of units required for regression analysis rapidly become prohibitive? Thank you for the great post!
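
The growth described in this comment is simple arithmetic: a full factorial design needs levels raised to the power of the number of factors, times the number of replicates. A short sketch follows; the function name and the particular level and factor counts are chosen only for illustration.

```python
def full_factorial_units(levels_per_factor, n_factors, replicates=1):
    """Experimental units needed to fill every cell of a full factorial design."""
    return levels_per_factor ** n_factors * replicates


# Units needed with 1, 2 and 3 factors, one unit per cell.
for levels in (2, 5, 10):
    print(levels, [full_factorial_units(levels, k) for k in (1, 2, 3)])
# 2 [2, 4, 8]
# 5 [5, 25, 125]
# 10 [10, 100, 1000]
```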

• Excellent real-world questions! There are cases where number of replicates prohibits a response surface method (such as the example you give). But when I see only two levels within a factor and 10 replicates (which is common in ecology) then I would argue you really should have spread out to have more levels (5 levels with 4 replicates or 10 levels with 2 replicates). As for the question about many levels in many factors growing exponentially, it is true if you do a full factorial design, but it is not so important to do a full factorial design if you are using regression analysis. One can do an incomplete design and populate only a subset of all possible combinations (ideally maintaining balance but even this doesn’t matter so much for a regression design).
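
One way to populate only a subset of combinations while keeping balance is a Latin-square-style fraction. The sketch below is an assumption-laden illustration rather than a recommended design: it assumes 3 factors with 5 levels each (both numbers hypothetical) and keeps the combinations whose level indices sum to a multiple of 5.

```python
from collections import Counter
from itertools import product

N_LEVELS = 5    # hypothetical: 5 levels per factor
N_FACTORS = 3   # hypothetical: 3 factors

# Every combination of level indices: the full factorial design.
full_design = list(product(range(N_LEVELS), repeat=N_FACTORS))

# Keep only combinations whose level indices sum to a multiple of N_LEVELS.
# For each pair of levels of the first two factors, exactly one level of the
# third factor qualifies, so each pair of levels appears exactly once.
fraction = [combo for combo in full_design if sum(combo) % N_LEVELS == 0]


def level_counts(design, factor):
    """How many runs use each level of the given factor (0-based index)."""
    return Counter(combo[factor] for combo in design)


print(len(full_design), len(fraction))  # 125 25
for factor in range(N_FACTORS):
    print(factor, sorted(level_counts(fraction, factor).items()))
```

Each level of each factor is used in 5 of the 25 runs, so the fraction is balanced with a fifth of the units. The usual caveat applies: a fraction like this suits a regression on the main effects, and interactions among factors would be at least partly confounded.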

• Sorry for the lateness – I just read your post and found it very interesting. I have a question about this answer you gave: why have any replication of the factor levels at all? Why not instead go for as many values as possible, and would it be best for these values to be (if possible) randomly distributed within some reasonable bounds? I have thought of doing this in recent experiments but am wary of reviewer responses.

• What Brian said. If your interest is in the quantitative relationship between two continuous variables, then you do *not* want a design with only a few, highly-replicated levels of the independent variable. Heck, if it’s logistically feasible, you could just assign every experimental unit a different level of the independent variable, with no within-level replication.

And in most cases, you probably want the levels of your independent variable to be roughly evenly spaced, especially if you have no strong a priori hypothesis about the form of the relationship between the dependent and independent variables.

Of course, in your particular situation, where (if I understand correctly) logistical constraints force you to only have two levels of the independent variable, all these considerations about the optimal regression-type design go out the window. You’re constrained to only have two levels of the independent variable, so all you can do is just replicate each of them however many times you can.
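
The trade-off discussed in these replies can be summarised by counting distinct levels: a polynomial of degree d can only be fitted if there are at least d + 1 distinct values of the independent variable. The sketch below compares three ways of spending 20 experimental units; the range 0 to 100 and the helper names are hypothetical.

```python
def evenly_spaced(low, high, n_levels, replicates=1):
    """Evenly spaced levels between low and high, each repeated `replicates` times."""
    step = (high - low) / (n_levels - 1)
    return [low + i * step for i in range(n_levels) for _ in range(replicates)]


def design_summary(x_values):
    """What a set of assigned x levels can support in a polynomial regression."""
    n_units = len(x_values)
    n_levels = len(set(x_values))
    return {
        "units": n_units,
        "distinct_levels": n_levels,
        # A polynomial of degree d needs at least d + 1 distinct x values.
        "max_polynomial_degree": n_levels - 1,
        # Units beyond one per level are within-level replicates.
        "extra_replicate_units": n_units - n_levels,
    }


designs = {
    "2 levels x 10 replicates": evenly_spaced(0.0, 100.0, 2, 10),
    "5 levels x 4 replicates": evenly_spaced(0.0, 100.0, 5, 4),
    "20 levels x 1 replicate": evenly_spaced(0.0, 100.0, 20, 1),
}
for name, xs in designs.items():
    print(name, design_summary(xs))
```

With only two levels, nothing beyond a straight line can be estimated, so any curvature in the response is invisible no matter how many replicates are run. Spreading the same 20 units over more levels trades within-level replication for information about the shape of the relationship.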

18. “numerous very serious accusations of sloppiness and data falsification”
Hum, maybe not that numerous, and always by the same people. This would seriously raise suspicion about their own motivation for attacking.
The story behind the retracted Oikos paper looked at the time much more like AP Moller did not directly collect the data himself, and fraud could not be established. On the other hand, personal motives seemed to lie behind the attacks. Definitely not looking exactly like the regular cases of fraud/retraction currently documented on Retraction Watch…