It is my sense of the field that AIC (the Akaike information criterion) has moved past bandwagon status into a fundamental and still increasingly used paradigm in how ecologists do statistics. For some quick and dirty evidence I looked at how often different core words were used at least once in an article in Ecology Letters in 2004 and 2014. Regression was used in 41% and 46% respectively. Significance was used in 40% and 35%. Richness was 41% and 33%. And competition was 46% and 49%. Perhaps a trend or two in there but all pretty steady. AIC has gone from being in 6% of the articles in 2004 to 19% of the articles in 2014. So in summary – AIC has tripled in usage, is now found in nearly 20% of all articles, and is used more than half as often as the most widely used statistical technique of significance.

I have a theory about why this has happened which does not reflect favorably on how AIC is used. Please note the qualification “how AIC is used”. AIC is a perfectly valid tool. And like so many tools, its original proponents made reasonable and accurate claims about it. But over time, the community takes ownership of a concept and uses it how they want, not how it was intended.

And I would suggest how people want to use AIC is in ways that appeal to two low instincts of ecologists (and all humans for that matter). First, humans love rankings. Most newspapers contain the standings of all the teams in your favorite sport every day. We pay more attention to the ranking of a journal’s impact factor than to its absolute value. Any number of newspapers produce rankings of universities. It is ridiculous to think that something as complex as journal quality or university quality can be reduced to one dimension (which is implicit in ranking – you can’t rank in two dimensions). But we force it on systems all the time. Second, humans like to have our cake and eat it too. Statistics has multiple modalities or goals. These include: estimation of parameters, testing of hypotheses, exploration of covariation, prediction into new conditions, selecting among choices (e.g. models) etc. Conventional wisdom is you need to be clearly based in one goal for an analysis. But we hate to commit.

You can probably already see where I’m headed. The primary essence of what AIC delivers is to boil choices down to a single dimension (precisely, it provides one specific weighting of the two dimensions of likelihood and number of parameters to give a single dimension) and then rank models. And comparing AIC scores is so squishy. It manages to look like all 5 statistical goals at once. It certainly does selection (that is its claim to fame). But if you’ve ever assessed whether ΔAIC>2 you have done something that is mathematically close to testing whether p<0.05.

Just to be clear, likelihood also can be used towards all those goals. But they present much more divergent paths. If you’re doing hypothesis testing you’re doing likelihood ratios. If you’re doing estimation you’re maximizing. If you’re doing selection you can’t proceed unless you specify what criteria to use in addition to likelihood. You have to actually slow down and choose what mode of inference you’re doing. And you have to make more choices. With AIC you present that classic table of ΔAIC and weights and voila! You’ve sort of implied doing all five statistical goals at once.

I want to return to my qualification of “how AIC is used”. The following is a simple example to illustrate how I perceive AIC being used these days. Take the example of species richness (hereafter S). Some people think that productivity is a good predictor (hereafter prod). Some people think seasonality is a better predictor (hereafter seas). Some people suggest energy is the true cause (hereafter energ). And most people recognize that you probably need to control for area sampled (area). Now you could do full blown variable selection where you try all 16 models of every possible combination of the four variables and use AIC to pick the best. That would be a pretty defensible example of exploratory statistics. You could also do an analysis of variable importance with a similar goal by scaling all four variables and throwing them into one model and comparing coefficients or doing some form of variance partitioning. These would also be true exploratory statistics. You could also use AIC to do variable importance ranking (compare AIC of S~prod, S~seas, S~energ). This is at least close to what Burnham and Anderson suggested in comparing models. You could even throw in S~area at which point you would basically be doing hypothesis testing vs a null although few would acknowledge this. But my sense is that what most people do is some flavor of what Crawley and Zuur advocate which is a fairly loose mix of model selection and variable selection. This might result in a table that looks like this*:

Model | ΔAIC | weight |
---|---|---|
S~prod+seas+area | 0 | 31% |
S~prod+energ+area | 0.5 | 22% |
S~prod+energ | 1.1 | 15% |
S~energ+seas | 3.2 | 9% |
S~energ | 5.0 | 2% |

There are a couple of key aspects of this approach. It seems to be blending model selection and variable selection (indeed it is not really clear that there are distinct models to select from here, but it is not a very clear headed variable selection approach either). It’s a shame that people so rarely compete genuinely distinct models with AIC, as that was one of the original claims to the benefit of AIC (e.g. Wright’s area energy hypothesis S~energ*area vs. the more individuals hypothesis, an SEM with two equations: S~numindiv and numindiv~prod). I don’t encounter it too often. Also note that more complicated models came out ranked better (a near universal feature of AIC). And I doubt anybody could tell me how science has advanced from producing this table.

Which brings me to the nub of my complaint against AIC. AIC as practiced is appealing to base human instincts to rank and to be wishy washy about inferential frameworks. There is NO philosophy of science that says ranking models is important. It’s barely better than useless to science. And there is no philosophy of science that says you don’t have to be clear what your goal is.

There is plenty of good debate to have about which inferential approach advances science the best (a lot has happened on this blog!). I am partial to Lakatos and his idea of risky predictions (e.g. here). Jeremy is partial to Mayo’s severe tests which often favor hypothesis testing done well (e.g. here). And I’ve argued before there are times in science when exploratory statistics are really important (here). Many ecologists are enamored with Platt’s strong inference (two posts on this) where you compare models and decisively select one. Burnham and Anderson cite Platt frequently as an advantage of AIC. But it is key to note that Platt argued for decisive tests where only one theory survives. And arguably still the most mainstream view in ecology is Popperian falsification and hypothesis testing. I can have a good conversation with proponents of any of these approaches (and indeed can argue for any of these approaches as advancing science). But nowhere in any of these approaches does it say keeping all theories around but ranking them is helpful. And nowhere does it say having a muddled view of your inferential approach is helpful. That’s because these two practices are not helpful. They’re incredibly detrimental to the advance of science! Yet I believe that AIC has been adopted precisely because it ranks without going all the way to eliminating theories and because it lets you have a muddled approach to inference.

What do you think? Has AIC been good for the advance of science (and ecology)? Am I too cynical about why hordes are embracing AIC? Would the world be better off if only we went back to using AIC as intended (if so how was it intended)?

UPDATE – just wanted to say be sure to read the comments. I know a lot of readers usually skip them. But there has been an amazing discussion with over 100 comments down below. I’ve learned a lot. Be sure to read them.

*NB this table is made up. In particular I haven’t run the ΔAIC through the formula to get weights. And the weights don’t add to 100%. I just wanted to show the type of output produced.
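For reference, the formula mentioned in the footnote turns each ΔAIC into a weight by taking exp(−ΔAIC/2) and normalising over the set of models being compared. Here is a minimal Python sketch using the made-up ΔAIC values from the table (the function name `akaike_weights` is just an illustrative choice, not something from the post):

```python
import math

def akaike_weights(delta_aic):
    """Convert ΔAIC values into Akaike weights that sum to 1."""
    rel_lik = [math.exp(-0.5 * d) for d in delta_aic]
    total = sum(rel_lik)
    return [r / total for r in rel_lik]

# Made-up ΔAIC values from the illustrative table in the post
models = ["S~prod+seas+area", "S~prod+energ+area", "S~prod+energ",
          "S~energ+seas", "S~energ"]
deltas = [0.0, 0.5, 1.1, 3.2, 5.0]
weights = akaike_weights(deltas)

for name, d, w in zip(models, deltas, weights):
    print(f"{name:20s} ΔAIC={d:4.1f}  weight={w:.1%}")
```

Run on those five values, the weights come out at roughly 38%, 30%, 22%, 8% and 3%. Note that they are only relative to this particular set of five models; adding or dropping a model changes every weight.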

Very nice text, thank you for posting it. AIC has become a plague in Ecology. As you said, the problem is not the tool itself, but the misuse that has become commonplace in the past years. As a reviewer of manuscripts and theses, in my experience, something like 95% of the studies use AIC and model/variable selection as a replacement for deduction, induction, and abduction. Some model or variable, among dozens included in a given analysis, will surely be significant. Thus, many studies invent explanations for the results after making the analysis, instead of building a mind map with alternative scenarios before carrying out the research. This is a contemporary version of the old problem known as “ad hoc hypothesis”.

“and abduction”

Dynamic Ecology: the only ecology blog where commenters casually reference the ideas of Charles Sanders Peirce. 🙂

“Thus, many studies invent explanations for the results after making the analysis, ”

Yup. I’ll have more comments on this in the next linkfest…

Hi Brian – I thought you were about to argue against AIC at the start but now I see that your key statement that comes later in the post is this

“that AIC has been adopted precisely because it ranks without going all the way to eliminating theories and because it lets you have a muddled approach to inference.”

I definitely agree with your concerns here.

AIC will only be good for the advance of science (and ecology) if we use it to advance the development and predictability of theory.

Several years back in Simova et al. 2011 we utilised this approach http://onlinelibrary.wiley.com/doi/10.1111/j.1466-8238.2011.00650.x/abstract

– Brian

One hugely good thing in the paper you reference is you actually give an R2 for every model in addition to an AIC. You also (of course) are deriving the statistical models to be compared from a conceptual framework. These two things make a big difference, in my opinion.

Brian: I sense that another concern here is that AIC indulges what you’ve elsewhere called ecologists’ lack of a problem-solving mentality? Manifested here as an unwillingness to completely chuck some hypotheses?

On another note, I wonder if part of the attraction of AIC is that it encourages you to ignore the unhappy possibility that all of your models suck in some absolute sense, because you’re only ranking models relative to one another. It’s far from clear why one would care about identifying the best apple from a bad bunch. Recognizing the badness of the bunch surely is the most important thing in many contexts. I have an old post on this: https://dynamicecology.wordpress.com/2012/10/24/no-i-dont-need-to-propose-an-alternative-approach-in-order-to-criticize-yours/

Yup – you’ve pretty much succinctly nailed two of the biggest problems.

1) Nothing ever gets killed off by AIC – it just gets bumped down the list

2) You never assess how good your best AIC model is – in my experience the “best” model often has an r2 of 0.10 yet with AIC we get to all slap each other on the back and congratulate ourselves on finding the best model.

The phrase “best of a bad lot” comes to mind.

I wonder if a downside of big data is that it becomes easier to get p<0.05 by brute force and still publish on the basis of our "best model".

It is definitely true that p<0.05 doesn’t mean much in big data. I always tell people I have never had p>0.05 in my research (a stretch of the truth but not by much) and they always look shocked. But then I tell them I always work with 1000s of data points and it makes sense.

But I have to say good practitioners of big data don’t pay much attention to p<0.05. They mostly worry about prediction accuracy on held out data. Which will tell you a lot about my whole philosophy of science and why I like Lakatos and write blog posts on needing to care more about prediction.
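To make the contrast concrete, here is a small simulation sketch in Python. Everything in it is an assumption for illustration (a true slope of 0.05, unit-variance noise, 100,000 points, a normal approximation for the p-value); it is not drawn from any real data set. It fits a line on half the data and scores it on the held-out half.

```python
import math
import random
from statistics import NormalDist, fmean

def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b, standard error of b)."""
    n = len(x)
    mx, my = fmean(x), fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = my - b * mx
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    se_b = math.sqrt(sse / (n - 2) / sxx)
    return a, b, se_b

def r_squared(x, y, a, b):
    """R2 of the fitted line (a, b) evaluated on the data (x, y)."""
    my = fmean(y)
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - my) ** 2 for yi in y)
    return 1 - sse / sst

rng = random.Random(1)
n = 100_000
x = [rng.gauss(0, 1) for _ in range(n)]
y = [0.05 * xi + rng.gauss(0, 1) for xi in x]   # tiny true effect, lots of noise

half = n // 2
a, b, se_b = fit_line(x[:half], y[:half])        # fit on the first half
z = b / se_b
p_value = 2 * (1 - NormalDist().cdf(abs(z)))     # normal approximation, fine for large n
r2_test = r_squared(x[half:], y[half:], a, b)    # score on the held-out half

print(f"slope = {b:.3f}, p = {p_value:.2g}, held-out R2 = {r2_test:.4f}")
```

With this much data the slope is wildly "significant", yet the held-out R2 should stay at a fraction of a percent, which is the number a prediction-minded analyst would actually look at.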

I don’t know how you are judging models (good and bad). A model with a low R2 does not necessarily mean the model is not good. Even the *true* model can have a low R2 if inherent variance/uncertainty is large (we can easily think of a simple artificial scenario where simulated data are analyzed with the true model — if the variance used in the simulated data is high, R2 would be low). Similarly, having a model with a high R2 does not mean you understand the process very well because there are many other models with high R2.
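The artificial scenario described above is easy to sketch. In the Python snippet below the data are generated from a straight line and then fitted with a straight line, so the fitted model is the true one; the slope of 2 and the noise standard deviation of 10 are arbitrary assumptions chosen to make the point.

```python
import random
from statistics import fmean

rng = random.Random(42)
n = 5000
true_slope, noise_sd = 2.0, 10.0                 # assumed values, noise dominates
x = [rng.gauss(0, 1) for _ in range(n)]
y = [true_slope * xi + rng.gauss(0, noise_sd) for xi in x]

# Simple linear regression by hand
mx, my = fmean(x), fmean(y)
sxx = sum((xi - mx) ** 2 for xi in x)
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
syy = sum((yi - my) ** 2 for yi in y)

slope = sxy / sxx
r2 = sxy ** 2 / (sxx * syy)                      # squared correlation = R2 for one predictor

print(f"estimated slope = {slope:.2f} (true {true_slope}), R2 = {r2:.3f}")
```

The expected R2 here is about 4/(4+100), or roughly 0.04, even though the model form is exactly right and the slope is recovered well.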

Of course.

In the end, I think judging is best left to human subjective evaluation and having to justify choices to peers. I much prefer that to turning it over to a formula. But between the 3 broad categories of p-value, effect size and variance explained to judge models/variables, p-value is the least important in my opinion. AIC is a rather weird blend of R2 and p-value. While this is not the math exactly, it is not that far from using cut-offs on ratios of different models’ MSE (for normal errors), which is not that different from comparing R2 (but using cut-offs on the size of the difference to do so).

Mark Brewer talked about “The Cult of AIC” at the Methods in Ecology & Evolution symposium a few weeks ago. Hopefully the recording will be on the MEE pages in the next week or two.

I suspect AIC is used because (a) it was new and sexy, and (b) it gives simple answers that don’t require too much thought. It means we can chuck lots of variables into a big black box, shake it around, and get something out of it. If we want to look really cool we can use model averaging! That’s so much easier than fitting the full model and looking at the actual model (i.e. the coefficients).

Looking forward to hearing Mark’s presentation.

I would fully agree with your characterization! I can’t tell you how many times I’ve talked to people the last year who want to do model averaging (even though it was never a good fit to objectives in my opinion).

Is there any real-world circumstance in which model averaging is scientifically useful? What’s the best example you know of? I recall talking to a knowledgeable colleague about this years ago and his view was that model averaging is pretty much useless in practice.

There is a growing literature in, for example, climate modelling showing that model averaging can improve forecasts (although, not surprisingly, it only works if you are selective about which models you include in the average). But that is a whole different kettle of fish where the averages are across complex simulations.

In the current AIC world in ecology, where the “models” are just different subsets of variables (maybe with some interactions and quadratic terms thrown in), it is hard to imagine the benefits of model averaging vs taking the time to get the right model.

That makes total sense, thanks.

One distinction to make in model averaging is the averaging of the model coefficients and the averaging of the predictions. In theory, both can be good ideas, but multicollinearity can throw a monkey wrench into the averaging of the coefficients. I don’t see why model averaging the predicted values isn’t a good idea if you have competing models. I don’t understand what you mean by taking the time to get the model right. Model averaging is useful when you took the time to get the model right, but you can’t discriminate conclusively between “right-looking” models.

Model averaging is also useful when your (properly designed, post-hoc) models all have close delta AIC values and there may be several variables of importance. Therefore, for practical conservation and management concerns, understanding which variable, when manipulated, will give the greatest “bang for the buck” is a very useful calculation. Understanding, of course, that this is still just a model prediction.

Question: how much of the issue here traces back to how ecologists learn statistics? If people learned more about what AIC is doing “under the hood” (e.g., “It’s estimating relative–not absolute!–Kullback-Leibler information distance. Here’s what K-L distance is…”), would they be less wishy-washy about what AIC is for?

I actually doubt it, for various reasons, but thought I’d throw the question out there.

Jeremy (or anybody) – I’m too lazy to go do the research, but I know you pored over recent Mercer Award winning papers a while back for a post. Did any of them use AIC? I’m going to put out the hypothesis that AIC was not the central inferential tool in any of the winners. Anybody know of a counterexample?

I don’t recall any that used AIC off the top of my head, but I’m curious enough that I’ll go to the trouble of double-checking.

Of course, one problem with the argument that the best papers in ecology don’t use AIC is that most people don’t see Mercer Award winners as role models to be emulated. They admire and respect Mercer Award papers and the people who did them–but not in a way, or to such an extent, to want to do as they do. I think that’s in part because nobody wins the Mercer Award for using an easy-to-follow “recipe”–a “crank-the-handle” approach that anyone can just apply in their own work. The contrast with AIC is stark–“fit a bunch of models to your data, none of which is rigorously derived from theory, and then rank them with AIC” is precisely the sort of crank-the-handle approach anyone can easily apply in their own work.

Personally, I think the Mercer Award winners are well worth emulating. But emulating them is hard, and in order to emulate them you might have to, say, change study systems or questions in order to have a tractable, solvable problem. A problem where you can totally nail *the* correct answer. Which is something many, though not all, Mercer Award winning papers do.

It’s not the Mercer Award but a paper that heavily used AIC won the President’s Award in 2005 from The American Naturalist. http://www.amnat.org/awards.html

Thanks! Great paper (no surprise – President’s Award is definitely the same kind of category I was thinking of with Mercer).

I will note that you did a great job of starting with 3 distinct a priori, competing, strong hypotheses and comparing them. That I’m sure is why you got the award. And it’s very different from the muddy model/variable selection example I gave above. Which proves the point that it is more about how you use AIC than AIC itself.

“Which proves the point that it is more about how you use AIC than AIC itself.”

Techniques aren’t powerful; scientists are: https://dynamicecology.wordpress.com/2012/06/01/techniques-arent-powerful-scientists-are/

I generally agree that AIC is useful, but widely misused. However, I disagree with the idea that ranking models does not allow you to rule models out. I think this happens frequently using a criterion such as a delta AIC of 7 or more or some other threshold. The amount of evidence required to rule out a model is a subjective decision and the authors of the paper can set one threshold, but present the AIC table so readers can make their own judgement.

I think one misguided criticism of AIC is that it encourages a fishing expedition. In reality, I think it just encourages you to *admit* you went on a fishing expedition. I don’t think the average paper presenting an AIC table necessarily did any more data dredging than one presenting p values, maybe after step-wise model selection. I think the real problem here is that researchers are doing exploratory research and then writing up as if it was confirmatory.

Interesting and good points. I broadly agree.

On your first point, I think we get to the difference between how AIC is used vs how it should be used. I agree that AIC can be used to eliminate models from future consideration, I just don’t see many people do it.

On the 2nd, I agree AIC is way better than post hoc p-values on fishing expeditions. So that is a really interesting point that AIC has just let us reveal how often fishing expeditions happen. That is an argument for AIC improving science at least in the very long haul. But of course it raises the next question of should we be fishing that often?

In my field (wildlife science, broadly), the majority of studies are observational with many potential covariates. I think most of what is being done is exploratory and we’re kind of stuck with that. But maybe I’m too pessimistic.

I think the situation you describe is probably a world where an exploratory approach is justified very often. I just wish people would then go whole hog and do pure exploratory techniques (like full blown variable selection or machine learning like regression trees) rather than use mixed models (which are focused on hypothesis testing) and do AIC on an arbitrary subset of possible models.

Great post Brian. AIC has come to dominate much of the wildlife application literature. One additional casualty of the approach is that it also seems to coincide with less presentation of the data. This is just my general sense of the field, but as a reviewer I’m often asking for a better presentation of the data underlying major conclusions. This way the reader can see leverage points, general model fit, effect size, etc., all of which are not directly available from an AIC table. Granted, this becomes more difficult when there are many covariates/dimensions to the dataset, which is one of the areas where folks like the AIC rankings. But I think this just means we need to be savvier with our data presentation rather than not showing it at all.

Good point. Personally, if I had to choose between seeing a scatter plot of the data with an estimated regression line through it or getting all the statistics of slopes and p-values and r2 I would choose the first. This is of course a false dichotomy where we don’t have to choose, but your point on how AIC pushes us away from this is a good one.

I feel like this post could have been written about just about any statistical methodology. I don’t think any specific methodology can either move an entire field of study forward or slow its progress. Statistical methodologies are typically developed with specific problems in mind which have sub-optimal solutions available at the time and, therefore, are not even intended to move an entire field of study forward.

I think the problem is two-fold.

1) AIC was sort of “marketed” as a panacea for all of your p-value woes. I remember attending a seminar given by David Anderson which felt a lot like a sales pitch. He even got into a debate with one of the other attendees about the AIC>2/ p-value2 for p-value2 (and since I am not calculating a p-value I don’t need to worry about alpha inflation so I can do as many comparisons as I please).

Unfortunately, I have seen very few people capitalize on what I think is a major strength of using an information criterion: assessing model selection uncertainty.

I’m half with you and half not with you on saying this critique can be applied to any statistical technique. For sure a bottom line (as in my reply to Colin below) is that if you’re not doing philosophy while you’re doing statistics you’re probably doing the statistics wrong. But I do think AIC is more decoupled from philosophy than most other approaches.

I do think it is interesting that AIC (and probably Bayesian) came of age in the era where scientists became salespeople and pushed methods aggressively. A lot of the issues I have with these techniques (as I’ve made clear on both AIC and Bayesian) is not the techniques themselves but their sales jobs and the large numbers of people who buy into the sales job.

It would be very interesting (and difficult) to try to quantify if or how much “salesmanship” of methods or approaches has increased in science over time.

Forget where I read it, but I recently read a tweet or short blog post complaining about the increasing prevalence of “review” or “perspectives” pieces that are really just advertisements for the author’s preferred approach to science. Of course, such pieces aren’t common in an absolute sense, it’s not as if they’re crowding out other sorts of stuff. But they might be increasing in prevalence, and it would be interesting to document that and think about why it might be.

I wonder if it really is about salesmanship. What we’re seeing is what’s been seen in the more molecular fields time and again. A new method comes along, and looks really great. Everyone jumps on the bandwagon, but then people start finding problems. So there’s a wave of criticisms before people settle down to working out how to use it properly.

My Comment was clipped and rearranged for some weird reason. I think the salesmanship was only one part of it. The other part was people shifting their analyses without shifting their associated inferential objectives. I.e., people started using AIC to generate *essentially* p-values without really changing how they approached a question. I remember having a debate in our lab about whether you can get a paper published that has both p-values and AIC tables (addressing different questions of course). There was this idea that you either did everything with AIC (including things it wasn’t really suited for) or nothing. I think this aligns well with the idea of a new method getting introduced and the community figuring out where it fits in and what its strengths are.

Brian- all of this discussion is great, but I’m still left wondering. Can you give an example, using real data, of “correctly” and “incorrectly” using AIC, and demonstrate that you will come to significantly different conclusions? It is hard to tell if we’re getting riled up about philosophical differences in our approach to analyzing data and selecting models, when at the end of the day it may come out in the wash.

Fair question. A couple of points.

Personally, I am less worried about having philosophical differences than blindly applying a statistical technique *without* having a philosophical position. Statistics is only a tool for scientific inference and that larger contextualization needs to be at the front of the mind of every practitioner. P-values have a trillion problems, but at least its users know they’ve subscribed to a philosophical position and have some idea of the implications of that. That I guess is why I am picking on AIC – I’m not sure the technique even brings that much along with it.

I already gave an example of what I think is using AIC incorrectly (even if it wasn’t applied on real data I think the framework is pretty clear).

As examples of good use of AIC, here are two:

1) True full blown variable selection in an exploratory fashion – try all combinations and use AIC to pick the best (although personally I prefer BIC and find it overfits less often).

2) Use AIC in the original sense that is not about variable selection and tweaks on a single linear model. Start w/3 competing theories. Turn these into 3 statistical models. Then use AIC to compare them. Then go a little further than AIC. Report the R2 (or RMSE) of each model so we know how good (or bad) the best model is. Do it on held out data. And draw some strong conclusions on which theories we are done with and which we are inconclusive in distinguishing and need to do a better job on. My discussion with Brian Enquist above and the paper he links to is actually a pretty good example of that.
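As a rough illustration of the first option, combined with the held-out check suggested in the second, here is a self-contained Python sketch of all-subsets selection. The data are simulated under an assumed process in which only prod and area affect S, and the helper names (`ols`, `design`, `gaussian_ic`) are hypothetical conveniences, not functions from any package. AIC and BIC are computed for normal errors up to a constant shared by all models.

```python
import itertools
import math
import random

def ols(X, y):
    """Least-squares coefficients via the normal equations (fine for a few predictors)."""
    p = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(row[i] * row[j] for row in X) for j in range(p)]
         + [sum(row[i] * yi for row, yi in zip(X, y))] for i in range(p)]
    for c in range(p):                       # Gauss-Jordan with partial pivoting
        piv = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        for r in range(p):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][p] / A[i][i] for i in range(p)]

def design(rows, cols):
    """Design matrix with an intercept plus the chosen predictor columns."""
    return [[1.0] + [row[c] for c in cols] for row in rows]

def sse(X, y, beta):
    return sum((yi - sum(b * xi for b, xi in zip(beta, row))) ** 2
               for row, yi in zip(X, y))

def gaussian_ic(sse_val, n, k):
    """(AIC, BIC) for normal errors, dropping constants common to all models."""
    neg2_loglik = n * math.log(sse_val / n)
    return neg2_loglik + 2 * k, neg2_loglik + k * math.log(n)

rng = random.Random(0)
names = ["prod", "seas", "energ", "area"]
n = 400
data = [[rng.gauss(0, 1) for _ in names] for _ in range(n)]
# Assumed data-generating process: only prod and area matter
S = [1.0 + 0.8 * row[0] + 0.5 * row[3] + rng.gauss(0, 1) for row in data]

train, test = slice(0, 300), slice(300, n)
results = []
for size in range(len(names) + 1):
    for cols in itertools.combinations(range(len(names)), size):
        X_train = design(data[train], cols)
        beta = ols(X_train, S[train])
        k = len(beta) + 1                    # coefficients plus the error variance
        aic, bic = gaussian_ic(sse(X_train, S[train], beta), len(X_train), k)
        X_test = design(data[test], cols)
        rmse = math.sqrt(sse(X_test, S[test], beta) / len(X_test))
        label = "S~" + ("+".join(names[c] for c in cols) or "1")
        results.append((aic, bic, rmse, label))

best_aic = min(results)[3]
best_bic = min(results, key=lambda t: t[1])[3]
min_aic = min(results)[0]
for aic, bic, rmse, label in sorted(results)[:5]:
    print(f"{label:28s} ΔAIC={aic - min_aic:6.2f}  held-out RMSE={rmse:.3f}")
print("best by AIC:", best_aic, "| best by BIC:", best_bic)
```

The held-out RMSE column is the part an AIC table alone does not give you: it says how good the top-ranked model actually is, not just that it outranks the others.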

“Personally, I am less worried about having philosophical differences than blindly applying a statistical technique *without* having a philosophical position.”

I’ll second Brian on that. (Well, actually there are certain statistical philosophies that I just find pernicious, but they’re so rare among actual practicing scientists that they’re not worth worrying about.) What mostly worries me is scientists who think they’re being “pragmatic” if they just ignore philosophy and proceed in some vague, confused, ad hoc fashion.

Well Well Well… so much pooh-poohing of AIC. I for one was not aware there existed an AIC bandwagon in ecology… or an opposing force on the other side of the ball. I guess I jumped on without even knowing it, but these things happen when your brain has been damaged by a dozen years of biochemistry LOL!

So, I used AIC for the purpose of model selection, and believe it or not, it did not involve an ad hoc hypothesis. Which, by the way, I do not think is necessarily all that bad so long as that ad hoc hypothesis is tested in a new data environment after its development. I had reached a dead end in my analysis of community level data relative to biodiversity. The hypothesis (H1) I started out with went something along the lines of “Biodiversity is expected to decrease as more communities become stratified within habitats”. This expectation was based upon the casual observation that just one or a scant few species became dominant within stratified communities, whereas it appeared evenness and total diversity were improved among non-stratified communities.

Lo and behold, analyses suggested diversity, richness, abundance and dominance were all independent of structure. Obviously that outcome was counter-intuitive, and was perhaps biased by application of a categorical variable (structure), uniformly distributed data for structure and diversity, and in some cases at least an absence of homoscedasticity. Nonetheless a conundrum presented itself: none of the four primary models assessed using a variety of statistical approaches explained this result, unless of course we accept diversity is indeed independent of structure. I wasn’t willing to do that, so I plowed ahead and got up to speed on the various likelihood methods available to me.

It is, however, very important to know what AIC is, what it does, and how to apply it. AIC does not test a null hypothesis. AIC = 2k – 2ln(L), where k = # of parameters in the model & L = the maximal likelihood for the estimated model (Akaike 1974). The “preferred” model is the one evidencing the minimal AIC value. AIC estimates relative goodness of fit and relative amount of information lost in a model: it DOES NOT test a null hypothesis related to model fit. In practice, one uses AIC minimum in the equation to calculate AIC-RL. AIC min is the model where a minimum amount of information was lost compared to all other models considered. However, an unknown amount of information was lost in AIC min that cannot be accounted for. It is also necessary to establish a cut-off for the relative amount of information lost in models to distinguish between models communicating useful information and those not. That cut-off is arbitrary in nature and also precludes the ability to test the null.
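The formula above is short enough to write out directly. In this Python sketch I am assuming that “AIC-RL” refers to the relative likelihood exp((AIC_min − AIC_i)/2); the three (k, log-likelihood) pairs are hypothetical numbers, not results from the analysis described here.

```python
import math

def aic(k, log_lik):
    """AIC = 2k - 2 ln(L), where log_lik is the maximised log-likelihood."""
    return 2 * k - 2 * log_lik

def relative_likelihood(aic_values):
    """exp((AIC_min - AIC_i) / 2) for each model; the minimum-AIC model scores 1."""
    best = min(aic_values)
    return [math.exp((best - a) / 2) for a in aic_values]

# Hypothetical (number of parameters, maximised log-likelihood) for three models
fits = {"M1": (3, -120.4), "M2": (4, -119.9), "M3": (6, -119.5)}
scores = {name: aic(k, ll) for name, (k, ll) in fits.items()}
rl = dict(zip(scores, relative_likelihood(list(scores.values()))))

for name in fits:
    print(f"{name}: AIC = {scores[name]:.1f}, relative likelihood = {rl[name]:.3f}")
```

Here M1 has the minimum AIC, so its relative likelihood is 1 and the others are scaled against it. Nothing in the calculation says how much information M1 itself loses, which is the caveat raised above.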

The manner by which I applied outcomes among 11 models enabled me to proceed with investigation of the relationship occurring between diversity and structure. Likelihood estimates provided me direction when no other statistical approach “worked”. I was for all purposes trapped in a maze with no means of escape, and AIC provided me a very important life line. I was then able to apply that information to develop alternative statistical approaches allowing me to test the initial hypothesis.

Note, though, that I did not use AIC to test the hypothesis. Anyone doing so should be pooh-poohed, in spades. However, I fully endorse use of AIC under the proper conditions and do not dissuade ecologists from its application.

Interesting points.

I don’t think the line with hypothesis testing is as strong as you make it. If two of your models in the list you are calculating AIC for are nested (especially if one of them is credible as a null model) you are actually doing likelihood ratios. In particular, if there is one degree of freedom difference between the two models, then p<0.05 in the likelihood ratio test is asymptotically the same as a log-likelihood difference greater than 1.92, i.e. the larger model beating the smaller one by ΔAIC>1.84 (which is spookily and probably not coincidentally close to the ΔAIC>2 rule of thumb that is often used).
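A quick check of the arithmetic behind this correspondence, as a Python sketch. It assumes two nested models that differ by exactly one parameter and the usual asymptotic chi-square distribution for the likelihood ratio statistic; `lrt_p_value` is an illustrative helper name.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()

# With 1 degree of freedom a chi-square variate is a squared standard normal,
# so its 0.95 quantile is the square of the normal 0.975 quantile.
chi2_crit = STD_NORMAL.inv_cdf(0.975) ** 2   # about 3.84
loglik_gap = chi2_crit / 2                   # gain in ln(L) needed, about 1.92
delta_aic_at_p05 = chi2_crit - 2             # AIC(small) - AIC(large), about 1.84

def lrt_p_value(delta_aic):
    """p-value of the 1-df likelihood ratio test implied by
    delta_aic = AIC(smaller model) - AIC(larger model)."""
    lr_stat = max(delta_aic + 2.0, 0.0)      # add back AIC's penalty for the extra parameter
    return 2 * (1 - STD_NORMAL.cdf(math.sqrt(lr_stat)))

for d in (0.0, delta_aic_at_p05, 2.0, 7.0):
    print(f"ΔAIC = {d:4.2f}  ->  p = {lrt_p_value(d):.4f}")
```

Under these assumptions a ΔAIC of 0 (AIC indifferent between the two models) corresponds to p of roughly 0.16, and a ΔAIC of 2 to roughly 0.046.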

Ah, yes, I suppose if one had a credible null model to begin with, then AIC could be applied this way. Of course, this approach requires removal of ecological processes from data via simulation to randomize the null model, which is something I very seriously considered, and could have done with my data concerning structure and diversity. But this was one of those “philosophical” tenets I could not reconcile for myself, at least, Simberloff notwithstanding. Perhaps the day will come when I reach some sort of comfort level with data randomization, but it does feel “spooky” to me.

Woah! Now here’s a topic to get our teeth into…

I’ve spoken about the “Cult of AIC” at a few venues in recent months, and no-one has yet really argued back at me. It’s pretty easy to show (and if I ever get a chance to write the paper…) that the oft-quoted “optimality criterion” of AIC (i.e. it’s optimal in terms of prediction) is at best unreliable, as the result is an asymptotic one, which requires essentially that any new *covariate* data you obtain has to be pretty much indistinguishable from the original data; that’s like a kind of stationarity assumption. Any breach of that “stationarity” means that the rather lax penalty AIC applies is going to be too weak, and you’ve overfitted your model. It is commonly argued that BIC’s optimality criterion (of consistency in selecting the right model) is pointless as there is either (a) no such thing as a “right” model, or (b) the right model is so complex that you will never be able to find it in practice. I have a lot of sympathy with these objections, but to me this puts BIC and AIC on an equal footing – the theoretical optima really aren’t much use in practice.

Instead, my view is we should just see AIC and BIC as applying different strengths of penalty on the modelling. The more unstable the data (or the more unstable the model itself from, say, site to site, or from time to time) the stronger the penalty we should be applying. The “best” penalty for any one situation may in fact be provided by neither AIC nor BIC. I have some plans for this, and when my mythical “research time” appears I’ll test it and write it up.

Another thing, building on something Brian said earlier – if you *ever* “discount” or “rule out” a model using AIC comparisons, you’re effectively doing hypothesis testing. And as for AIC weights…we all know adding a completely random covariate *automatically* provides a model whose AIC is at most 2 units higher than the original model's, and which is therefore "competitive". How scientific is that?
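A quick simulation makes the random-covariate point concrete. This is only a sketch under my own assumptions: Gaussian errors, least-squares fits, an intercept-only model compared with the same model plus one pure-noise covariate, and made-up sample sizes and seeds. Because adding a covariate can never increase the residual sum of squares, the AIC of the bigger model can exceed the smaller one's by at most 2.

```python
import math
import random

def gaussian_aic(rss, n, k):
    """AIC of a least-squares fit with Gaussian errors, up to an additive
    constant shared by all models fitted to the same n observations."""
    return n * math.log(rss / n) + 2 * k

def delta_aic_with_noise_covariate(n, seed):
    """AIC(intercept + pure-noise covariate) minus AIC(intercept only)."""
    rng = random.Random(seed)
    y = [rng.gauss(0, 1) for _ in range(n)]
    z = [rng.gauss(0, 1) for _ in range(n)]   # unrelated to y by construction
    y_bar, z_bar = sum(y) / n, sum(z) / n
    syy = sum((yi - y_bar) ** 2 for yi in y)
    szz = sum((zi - z_bar) ** 2 for zi in z)
    szy = sum((zi - z_bar) * (yi - y_bar) for zi, yi in zip(z, y))
    rss_null = syy                      # intercept-only residual sum of squares
    rss_noise = syy - szy ** 2 / szz    # after regressing y on z
    # k counts the mean parameters plus the error variance.
    return gaussian_aic(rss_noise, n, 3) - gaussian_aic(rss_null, n, 2)

deltas = [delta_aic_with_noise_covariate(50, seed) for seed in range(200)]
print(f"largest delta AIC over 200 runs: {max(deltas):.3f}")
print(f"runs where the noise model wins outright: {sum(d < 0 for d in deltas)}")
```

So under a "within 2 units" rule, the noise model is always kept as a competitor, whatever the data say.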

I think I'm agreeing with most people here on this post+comments when I say that the way AIC is often used is popular because it removes (some of) the need to think.

I might defend some form of model averaging; that of predictions, in some contexts, where the "right" model is either non-existent or genuinely unknowable. In those cases, predictions then contain some model uncertainty, which is valid. But as for model averaging coefficients…given that necessarily, regression coefficients mean different things in different models, this is just bogus. What may be sensible is (taking a Bayesian approach) to plot a posterior density for a regression coefficient from multiple models…but even then it would need to be interpreted with caution.

Thanks for another great post Brian, and I think I now need to go and lie down for a while…

Thanks for the talk preview! I do think AIC has gotten a lot of mileage out of the claim of being optimal in some sense. As you say it relieves users of thinking about what they really want. But as you point out, the optimality claim is pretty shaky theoretically. And I’ve just never bought it practically. I have a paper where a model with zero free parameters (all parameters derived independently from literature instead of the data analyzed) has an R2 of 0.68 and one where all four parameters are curve fit to the data is R2=0.78. I would take the first model in a fit/parsimony trade-off in a split second. Yet AIC heavily favors the second.

Your thoughts about how parameter penalties need to vary by how consistent the best model is across time and space are really interesting. And this starts to converge a lot on the ideas in machine learning (which totally has to worry about too many parameters/too flexible a functional form) on ensuring that one is fitting just the signal and not the noise in the data. Of course testing on hold-out data is the main solution to date. But as I’ve shown in a paper with Volker Bahn, even that is problematic when there is autocorrelation in the data.

Very interesting. Thank you.

“[the] way AIC is often used is popular because it removes (some of) the need to think.”

Yup. This gets back to my comments above about how nobody emulates Mercer Award winners. For various reasons (some of them good, some of them less good), the field as a whole tends to gravitate towards seemingly-straightforward, “standard” approaches.

Brian- a minor point, but I was curious about your comment of rank indexes being one-dimensional. Would this also apply to composite indices, that under ideal circumstances utilize indicators that are independent of one another, and reliably predict other indicators of the variable? It seems to me anyway that composite indices behaving in this manner are multi-dimensional. Thanks.

But presumably you have to weight the various components of the composite in some fashion, thereby turning it into a single-axis ranking. Either that, or you’re just doing different rankings on different axes…

Great post, I wonder how such misuse of statistical tools might be prevented in the ecological literature. Some kind of database/journal of appropriate ecological analysis could advertise relevant analyses for a precise type of data and question. Rewarding case studies that appropriately use old as well as new statistical tools will also help. In the end however I fear that it boils down to the incentives that researchers perceive: if misused statistical tools still get you into Nature, Science or Ecology Letters, then the pragmatic approach is to apply a similar scheme in your research and focus on novelty rather than on the appropriateness of your analysis.

Personally I don’t think that the main problem here is incentives. I don’t think most of the vague, I-don’t-really-know-why-I’m-doing-this abuses of AIC that Brian identifies arise from people searching for novelty above all else, or reaching for papers in top-tier journals, or whatever.

Well different kinds of incentives. Being with the pack and getting published easily is I think an incentive. Personally, along with my Mercer hypothesis, I don’t think that many abusive uses of AIC get into Science or Nature either. Abusing AIC is not about being in the tail.

It is definitely easier to go with the flow. Every time I get into one of these situations with a student whose committee I’m on, I give them two answers. I tell them up front, here is what I think is right. Here is what the mainstream ecology community wants to see.

Really good points and discussion. Maybe worth noting that many people who use AIC for model selection have adopted AICc, as it penalizes the model by the number of parameters.

It matters whether one is trying to understand the fit of the model to one’s existing data, or to assess the predictive ability (out of sample) of the model. For the latter, my understanding is R^2 isn’t the best way to understand how good or crappy one’s model is. Isn’t it true that R^2 is going to increase with the number of coefficients and interactions (just like the AIC), as well as the range of x values?

It would be lovely if people more routinely plotted confidence or prediction regions for their various models. One can do this for fixed effects now in lme4!

Definitely the goal should be to pick the model that has the optimum point on an accuracy/parsimony trade-off.

R2 does suffer from the fact that it mathematically can only go up as you add parameters, so it’s not great for comparing models with different #’s of parameters (although when the # of parameters differs by 1 and the R2s are hugely different I am pretty comfortable drawing a conclusion).

A number of criteria exist to try to do this trade-off. In the dark ages adjusted-R2 and Mallows’s Cp were used (Cp actually is AIC in some circumstances). Then AIC was invented. Then BIC, AICc, QIC, TIC, etc. were invented.

AIC attempts to fix this by one very specific trade-off. The question is whether there really is just one right trade-off and if so whether AIC has it. I personally think AIC overfits (values fit too much over parsimony). I gave an example above of why. Lots of other people think this too.

Personally I am a fan of a table reporting R2 and # of parameters and then having the author argue why a particular model is the best trade-off for the application at hand and then letting the reader draw their own conclusion. If you only report AIC you can’t do this. I would be very happy if conventions evolved to reporting R2 and # parameters alongside AIC.

You may have meant that AICc accounts for the sample size as AIC already penalizes by the number of parameters (2k). AICc has a penalty that is a function of both the number of parameters and the sample size. This is a small sample correction that gives an estimator for the expected KL divergence that is unbiased with small n. AICc converges to AIC as n goes to infinity.
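For reference, a small sketch of the usual small-sample correction, AICc = AIC + 2k(k+1)/(n − k − 1). This is the standard form derived for Gaussian linear models; the AIC value of 100 and k = 5 below are made-up numbers, chosen only to show the correction shrinking as n grows.

```python
def aicc(aic, k, n):
    """Small-sample corrected AIC: AIC + 2k(k + 1) / (n - k - 1)."""
    if n - k - 1 <= 0:
        raise ValueError("AICc requires n > k + 1")
    return aic + 2 * k * (k + 1) / (n - k - 1)

# Hypothetical model: AIC = 100 with k = 5 parameters.
for n in (10, 30, 100, 10_000):
    print(n, round(aicc(100.0, 5, n), 3))
```

With n = 10 the extra penalty is 15 units; with n = 10,000 it is under 0.01, which is the convergence to AIC described above.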

The goal of AIC is to maximize the predictive accuracy, not necessarily choose the model closest to the one that generated the data. As I understand it, AIC overfits because the extra flexibility of a slightly overfitted model can better approximate the true distribution.

Ben, you’re right that’s what I was thinking. My data are often limited, unfortunately!

One more thought – there is a whole different approach where R2 (or RMSE) is measured on out-of-bag data (held-out data not used to fit the model). In that case you will see a hump-shaped plot of R2 vs model complexity and you pick the model complexity that gives you the highest R2 on the out-of-bag data (this is in contrast to using R2 on in-bag data, i.e. the data used to fit the model, where you will only see the R2 go up with model complexity). This is the dominant paradigm in machine learning and I think it has a lot going for it.
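Here is a toy version of that idea, as a sketch only. Everything in it is my own assumption: the simulated sine-plus-noise data, the sample sizes, and the model family (a piecewise-constant fit with 1, 2, 4, ... equal-width bins, so that each model nests the previous one and "number of bins" stands in for model complexity). The in-bag R2 can only go up as bins are added; the held-out R2 typically rises and then falls, though a single simulated run is not guaranteed to give a clean hump.

```python
import math
import random

def r_squared(y, pred):
    mean_y = sum(y) / len(y)
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))
    return 1 - ss_res / ss_tot

def fit_bin_means(x, y, bins):
    """Piecewise-constant fit on [0, 1): one mean per equal-width bin."""
    sums, counts = [0.0] * bins, [0] * bins
    for xi, yi in zip(x, y):
        b = min(int(xi * bins), bins - 1)
        sums[b] += yi
        counts[b] += 1
    overall = sum(y) / len(y)
    # Empty bins fall back to the overall mean of the fitting data.
    return [s / c if c else overall for s, c in zip(sums, counts)]

def predict(means, x):
    bins = len(means)
    return [means[min(int(xi * bins), bins - 1)] for xi in x]

rng = random.Random(1)

def simulate(n):
    x = [rng.random() for _ in range(n)]
    y = [math.sin(2 * math.pi * xi) + rng.gauss(0, 0.5) for xi in x]
    return x, y

x_fit, y_fit = simulate(60)   # "in bag": used to fit the model
x_out, y_out = simulate(60)   # "out of bag": held out, never used for fitting

results = []
for bins in (1, 2, 4, 8, 16, 32):
    means = fit_bin_means(x_fit, y_fit, bins)
    r2_in = r_squared(y_fit, predict(means, x_fit))
    r2_out = r_squared(y_out, predict(means, x_out))
    results.append((bins, r2_in, r2_out))
    print(f"bins={bins:2d}  in-bag R2={r2_in:.3f}  held-out R2={r2_out:.3f}")
```

One would then pick the bin count with the highest held-out R2, rather than the one with the highest in-bag R2 (which is always the most complex model).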

Just reread your post. You were already talking about out of bag. Sorry. To my knowledge the actual metrics of goodness of fit don’t change between in bag or out of bag testing. It’s just the way you trade them off with parsimony that differs. And I have a whole post on the difference between r2, R2, RMSE, etc.: https://dynamicecology.wordpress.com/2013/03/19/ecologists-need-to-do-a-better-job-of-prediction-part-iv-quantifying-prediction-quality/

I’m in a little over my head, but can’t resist a comment or two. Great discussion!

There have been a few comments about AIC being a wishy-washy approach that enables researchers to avoid firmly excluding hypotheses etc. Isn’t that also a potential benefit of AIC? It shows the relative support between models and thus gives a sense of model selection uncertainty, rather than forcing us to choose one model categorically.

However, maybe the problem is that we still want to interpret only one model (unless we’re model averaging predictions (not coefficients), which I think makes sense for prediction), and end up using the potentially sketchy delta AIC cut-offs. This isn’t much of a problem when there’s strong support for the top model, but how do we interpret a model set with ~equally ranked models? I don’t know. I’m not super familiar with alternative approaches, like LRT, but would alternatives be better able to deal with model selection uncertainty or similarly performing models?

Also, there was one point regarding the effect of random/useless predictors being added to models with useful predictors. Arnold explicitly addresses/critiques these “uninformative parameters” in his 2010 JWM paper (Uninformative Parameters and Model Selection Using Akaike’s Information Criterion), which I think can really help to refine AIC model selection and inference.

Of course, all of the models in any set may be crap so I strongly agree with some measure of performance (like r2) being done in tandem with AIC.

Thanks for the great discussion!

Joe

“It shows the relative support between models and thus gives a sense of model selection uncertainty”

As Brian’s post asks: why would you want that? I can imagine very specific contexts where you might. For instance, in another comment, Brian mentions a policy/management context where you’re legally obliged to make some management decision right now, based on the best science available. But in a lot of contexts, knowing the relative support for various models doesn’t actually tell you anything, scientifically. Especially if none of the models is all that great in an absolute sense.

I think one example of why you want to know the uncertainty is in response to people who have used stepwise regression and then only report and discuss the top model. Also, in this regard I think the human desire to declare a winner is stronger than the desire to make a list. And one of the selling points of AIC is that the list is more informative than just the winner, because it quantifies the uncertainty.

“I think one example of why you want to know the uncertainty is in response to people who have used stepwise regression and then only report and discuss the top model.”

Well, ok, although saying that approach X is an improvement over terrible approach Y kind of seems like damning with very faint praise! 🙂

Sure, but as about as valid as a posting damning a statistical approach because it is sometimes (or often) misused.

So, if you have a situation where you are interested in what affects the occurrence or count of a species and you have a list of different explanatory variables and the data, what statistical approach would you recommend? Note you could make a priori hypotheses (codified as alternative statistical models) that propose that the response variable is affected by soil, vegetation, temperature, etc. The old way of doing this is stepwise regression, but I’d like to hear about alternatives. I recognize that controlled experiments would be best, but in many cases that’s unrealistic.

@Barney Luttbeg:

Your statement of the problem isn’t sufficiently detailed for me to suggest an answer. I don’t say that as a criticism, just as a statement. “Affects” could mean “causes”, it could mean “predicts”, it could mean other things…

But in general, we have various old posts on why and how to do variable selection and model selection. Here are a few of them (some of which are just pointers to other sources):

https://dynamicecology.wordpress.com/2014/10/02/interpreting-anova-interactions-and-model-selection/

https://dynamicecology.wordpress.com/2014/02/10/beating-model-selection-bias-by-bootstrapping-the-model-selection-process/

https://dynamicecology.wordpress.com/2015/02/05/how-many-terms-in-your-model-before-statistical-machismo/

https://dynamicecology.wordpress.com/2011/12/17/advice-primer-on-alternative-methods-of-model-selection/

I’ll ask the question with “predict”. A state or federal agency wants to know what is the best prediction for when and where species X will occur.

I just published a paper in Ecology, Model Averaging and Muddled Multimodel Inferences (available as preprint in early online), where you can see some evidence of the extent to which strange analyses have been done by using AIC model averaging of regression coefficients, incorrectly making model averaged predictions from model averaged regression coefficients for models that are not linear in the parameters, and using sums of AIC weights for inferring relative importance of predictors (these really only tell you something about relative importance of models). Read the paper for all the gory statistical details (thought I sent you a copy Brian). Yes, AIC is just another likelihood ratio statistic that can be related directly to hypothesis tests, CI, and coefficients of determination and, yes, you need to think hard about whether it makes any sense to average the quantities of interest. The use of AIC and model averaging seems to have become ritualized by some without the practitioners actually trying to understand whether it actually made sense or provided any improved inferences. We might call this the allure of addressing model uncertainty – it seems like such a commendable statistical practice how could we not want to do it. But tables of candidate models, their AIC values, and AIC weights are just as stupefying as tables of ANOVA model comparisons, SS, and their P-values.

Thanks Brian. I did get the copy you sent me (and enjoyed it!). It is a great dissection of the current state of model averaging (and the confusion between parameter averaging and prediction averaging and the assumptions going into them).

“The use of AIC and model averaging seems to have become ritualized by some without the practitioners actually trying to understand whether it actually made sense or provided any improved inferences.”

Have to agree with you!

I’ve read Brian’s paper, and it does make some good points about abuses of information theoretic inference (I feel sorry for the authors of the paper he dissected!). But the point about averaging predicted values is that in nonlinear models, you can’t model average the coefficients and then use these to produce the predicted values. You can, however, average the predicted values from each model. I don’t know of anyone recommending the former.

Ben: You may be correct that no one recommended using model-averaged regression coefficients in nonlinear models (e.g., occupancy models and other logistic regression models, Poisson and other count model forms) to make model-averaged predictions but I see it done all the time. I’m guessing it is done because people were led to believe by Burnham and Anderson that the model-averaged regression coefficients made some sense and correctly accounted for model uncertainty. Neither is true. Then people make it worse by using them to estimate model-averaged predictions for nonlinear model forms. I know some people do it for computational convenience when using large databases associated with georeferenced data (e.g., GIS based predictors for species distribution models). Burnham and Anderson did point out that you could use model-averaged regression coefficients as a short-cut computation to getting model-averaged predictions for linear models (models that are linear in the parameters). People just seemed to miss the “linear model” restriction on this mathematical equivalence.
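The “linear model” restriction is easy to see numerically. The sketch below uses two hypothetical fitted models with made-up coefficients and made-up model weights; none of the numbers come from any paper. It compares averaging the predictions against predicting from averaged coefficients, once with an identity link and once with a logistic link. It says nothing about whether the averaged coefficients are themselves interpretable, which is a separate issue.

```python
import math

def linear(b0, b1, x):
    return b0 + b1 * x

def logistic(b0, b1, x):
    return 1 / (1 + math.exp(-(b0 + b1 * x)))

# Hypothetical fitted (intercept, slope) pairs and model weights summing to 1.
models = [(-2.0, 0.5), (1.0, 2.0)]
weights = [0.6, 0.4]
x = 1.5

avg_b0 = sum(w * b0 for w, (b0, _) in zip(weights, models))
avg_b1 = sum(w * b1 for w, (_, b1) in zip(weights, models))

for name, f in (("linear", linear), ("logistic", logistic)):
    avg_of_predictions = sum(w * f(b0, b1, x) for w, (b0, b1) in zip(weights, models))
    prediction_from_avg_coefs = f(avg_b0, avg_b1, x)
    print(f"{name:8s} averaged predictions={avg_of_predictions:.4f}  "
          f"from averaged coefficients={prediction_from_avg_coefs:.4f}")
```

The two routes agree for the linear model and disagree for the logistic one, because the inverse link is nonlinear and the average of a nonlinear function is not the function of the average.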

I played around with Bayesian variable selection and averaging a few years ago, and one thing that became clear was that the posterior distributions of some parameters could be really bimodal, with a peak at zero and another around the conditional mode (conditioned on the variable being ‘in’ the model). I found this really instructive, as you could then see that model averaged coefficients could be shrinking the estimate into a region with really small support. If you want to shrink your parameters like this, why not use a LASSO (or something similar), which is designed to do shrinkage?

Hm, perhaps I should aggressively sell this and replace the Cult of AIC with the LASSO Legions.

Really interesting point on the bimodal support.

I’d be more into joining LASSO Legions than a Cult of AIC if I were the mass group joining sort.

Yes, interestingly the issues with AIC model averaging of regression coefficients extend to Bayesian model averaging of regression coefficients. It amounts to averaging fractions with different denominators when there is any multicollinearity among predictor variables. And as you noted, bi- or multimodal distributions of estimates are possible such that a statistical average is not providing any meaningful information about the center of the distribution of the estimates associated with model uncertainty, even if you effectively standardize coefficient estimates (based on partial SD to account for multicollinearity) so that averaging them is numerically sensible.

Brian, I really enjoyed your paper. I certainly encourage those who challenge the “Cult of AIC” or what I’ve termed the “Bible of Burnham and Anderson”. But what I think would have been more informative is, rather than showing there are theoretical problems with certain applications of information theoretic inference, to show how using these methods compares with other options, such as ignoring model selection uncertainty. For example, if we model average moderately collinear predictors, are they more biased and do the resulting confidence intervals have worse coverage than if we just used the top model? Do you have any thoughts on this?

Ben: If you read my paper carefully, you’ll see that I’m not implying that the Burnham and Anderson model-averaged regression coefficients are more biased; I am stating that they are mathematically inadmissible (you can’t legitimately average fractions with different denominators without somehow first equating the denominators). Now it will be interesting to see if my suggested approach based on standardizing by partial SD to account for scale changes associated with multicollinearity yields good sampling distribution characteristics given model uncertainty. But some form of standardizing has to be done for this averaging to be legitimate. I’m not that interested in pursuing this myself, because the use of correctly standardized model-averaged regression coefficients would still only apply to a limited class of models, those with no interactions, no polynomials, etc. So currently, the best, safest recommendation is to only apply model averaging to quantities that are guaranteed to be the same across all models and that are based on each entire model, e.g., the estimated means in a regression model.

I understand that, but I see this as an assumption violation (rho!=0 between all predictors). Before I discard a tool from my statistical toolbox, I would like to know how much we can violate this assumption before I’m better off using another tool (less bias, better coverage) with problems of its own (e.g. only interpreting the top model).

Thanks so much for posts like this, it gives me plenty of reading material for my commutes. 😉 Seriously though, as a non-academic wildlife biologist (I work for an environmental consulting firm in California) these kinds of discussions really help me to think critically when reading the methods section of journal articles. I’ll second the above arguments that AIC is extremely common in the wildlife literature and since many endangered species policy and management decisions are supposed to be based on “sound science,” it’s absolutely critical that the people researching these species be up front about their assumptions and data analysis when publishing the results of their studies.

As per the above discussion, I think there are some good reasons for AIC in the wildlife world. Going with the best available science (e.g. best available model) is a good choice in the policy realm. I also don’t think it is a coincidence that Burnham & Anderson are in wildlife departments themselves (in terms of enhancing uptake within wildlife circles).

I’m happy to concede that there are many problems with AIC, but Brian, just for clarification, do you object to,

a) the use of AIC for ranking different models in general,

b) calculating AIC weights, or

c) interpreting them as probabilities à la Burnham & Anderson?

And if you object to weights, do you object philosophically to calculating relative weights of different models or theories in general (hence also to BIC, Bayes factors etc.), or just to the use of AIC to that end? You were asking about a philosophy of relative evidence – Chamberlin’s “The Method of Multiple Working Hypotheses” comes to mind, or Bayesian inference.

What I was missing among all the bashing of AIC is concrete advice about what to do instead. OK, you can say people shouldn’t analyze data for which they have no clear hypothesis, but this is not a really practical suggestion as many people do have a dataset and no fixed hypothesis, so they need to do something about it.

“OK, you can say people shouldn’t analyze data for which they have no clear hypothesis,”

You (or someone) can say that, but Brian didn’t! As this post and others make clear, he’s all in favor of exploratory, hypothesis-free analyses: https://dynamicecology.wordpress.com/2013/10/16/in-praise-of-exploratory-statistics/

What he’s against is mixing up or trying to combine exploratory and hypothesis-testing analyses. Or more broadly, vagueness about one’s scientific goals, why those goals are worthwhile, and how best to achieve them.

And see the comments higher up for a couple of examples in which Brian highlights some effective uses of AIC.

Sorry, I think I’m still not quite getting what you object to. You say Brian is happy that people calculate AIC weights *exploratory*, but they are not allowed to interpret them *inferentially*?

Compared to the BF, AIC certainly has the flaw of not including a prior on the working hypothesis, but otherwise I find it totally reasonable from the philosophical viewpoint to weight multiple working hypotheses by their ability to explain the data, which AIC does essentially (via KL distance).

Calculating p-values for multiple working hypotheses, on the other hand, does not improve any of the problems that I would see with AIC, and rather adds a few more on top of them, starting with the fact that the p-value is an even worse indicator of relative evidence than the AIC.

Jeremy pretty much nailed my thoughts (spookily so).

I’ll just add that my objection exists somewhere in the chain before A – adopting the goal of ranking.

As for Chamberlin, he is an interesting suggestion. And people have suggested a link between Chamberlin and model selection (http://bioscience.oxfordjournals.org/content/57/7/608.full).

But I don’t see it. My read of Chamberlin says he was mostly interested in:

a) Avoiding a false rush to certainty that let us over glibly confirm our hypothesis with weak tests (and multiple hypotheses as the way of bringing rigor). I agree with this. But it doesn’t say we should be content to rank them and stop. It still implicitly wants us to choose the right answer.

b) Chamberlin does also bring up the possibility that multiple hypotheses are correct. So my answer to Fred above is relevant to that. Basically I don’t see how doing random variable selection tells us which forces are acting in a given system.

Sorry, I posted the answer to Jeremy above without updating.

Hmm … OK, I can only say that I find it philosophically and statistically a completely reasonable question to ask which probabilities we should assign to the n alternative hypotheses that we have, given the data.

And if someone wants to do this, the question that we have is which method should we use.

@Brian:

“Jeremy pretty much nailed my thoughts (spookily so).”

Apparently, if you blog with someone long enough, you become like an old married couple. Able to complete one another’s sentences. 🙂

My current opinion is that model weights are fine for model averaging, but are unreliable indicators of variable importance, based on arguments here:

http://www.researchgate.net/profile/Francois_Xavier_Dechaume-Moncharmont2/publication/266266283_Ecologists_overestimate_the_importance_of_predictor_variables_in_model_averaging_a_plea_for_cautious_interpretations/links/545741600cf2cf516480620c.pdf

A very recent comment from Burnham on AIC and relative variable importance that may be relevant or an interesting addition to discussion:

http://warnercnr.colostate.edu/~kenb/pdfs/KenB/AICRelativeVariableImportanceWeights-Burnham.pdf

I definitely agree that problems in variable selection are huge, and a lot of nonsense is done here with AIC weights.

“Model weights are fine for model averaging” is a bit of a tautology as averaging necessarily requires weights, but I assume you mean AIC weights, and I agree AIC is typically doing fine, although not necessarily better than other options.

The question that remains is if one wants to interpret these weights as probabilities in favor of the respective model.

Do you see a problem with interpreting AIC weights as the probability that a model is the best model within the set of considered models?

Thanks Joe, I hadn’t seen that. One issue when interpreting model probabilities is that these are still subject to sampling variability. If you have a strongly preferred model, you’re probably fine saying it is the overwhelmingly most probable, but I would be cautious interpreting differences between models with similar probabilities.

Philosophically sure, because the prior is lacking.

In practice, model misspecification is probably the bigger concern with AIC and all other likelihood-based approaches, as Fred mentioned below.

Re the link to Burnham’s comment: nice to see him finally respond to the critique of AIC weights for measuring relative importance that I and others have made. Note that he is largely making post hoc arguments that they never originally made when developing this procedure. He also is using a definition of relative importance for predictors that few statisticians would use because his definition has to assign a relative importance of 1 to a predictor that occurs in all candidate models. I maintain my position – the sum of AIC weights is unlikely to be a very useful discriminating measure of relative importance of predictors. I also find it strange that he persists in describing the weights as indicating anything about probabilities for a model – this has pretty much been critiqued multiple times in the statistical literature.

Wow – I missed this whole thread of comments (probably when I was at my son’s baseball game). Very interesting! Thanks to all contributors.

“Do you see a problem with interpreting AIC weights as the probability that a model is the best model within the set of considered models?”

I cringed when I read that. It’s OK to make that sort of statement if you’re R.A. Fisher, but it requires waving your fiducial wand to invert the probabilities. At best I think it’s an approximation, under the assumption that all models have equal prior probabilities, but I’d like to see the maths on that.

Ok, just clarifying. How do you feel about people using bootstrapped data to estimate how often each alternative model is found to be the “best” model? Why would one need a prior to deem that a probability (not a belief) that a model is the best of the considered models?

@Barney:

“Do you see a problem with interpreting AIC weights as the probability that a model is the best model within the set of considered models?”

Yes. As Bob O’Hara notes, it requires you to adopt a pretty unusual philosophy of “probability”.

Following that logic, could we not say the same thing concerning development of a null model, for AIC or other tests? Yet, the body of literature is extensively littered with procedures for just that very purpose. Given the null model is generated via removal of ecological process from one’s data, I would view that as even more a “Hocus Pocus” endeavor than application of weights, and at least in many scenarios an ad hoc method. I would not be inclined to pursue either approach, but if someone put a gun to my head and forced a decision, I would likely go with weights before null models.

Which I think might get at another point observed throughout the thread that you have responded to: “Sometimes you have to pick the best smelling horse in the barn”. Unfortunately not everything we do in science delivers that ideal model, much less the pot of gold at the end of the rainbow. However I do see great value in picking the best smelling horse and publishing on it, because it has the very real potential of providing someone else guidance in their experimental designs.

Yes, interpreting the AIC model weights as probabilities does not seem to have any support from the statistical community. Barker and Link and Draper have made this point several times. David Draper in a long email to me years ago provided a pretty good explanation about why AIC had no place in assigning probabilities to models. If I’m recalling all this correctly it was similar to arguments made by Barker and Link – this inevitably becomes a Bayesian analysis. Quite frankly I just don’t think it is required to assign probabilities to models or hypotheses for the advancement of science. The ratio of each model’s AIC weight to the highest weight among the candidate models is readily interpreted as a measure of the relative importance of that model, where relative importance indicates the proportionate reduction in log likelihood per parameter of each model relative to the best model. This is an interpretation of relative importance that is consistent with other uses in statistics, e.g., it is related to comparing coefficients of determination.
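
For anyone who wants to see the arithmetic, here is a minimal sketch of how the weights and the ratio-to-best-weight are usually computed, assuming the standard formula w_i = exp(-Δ_i/2) / Σ exp(-Δ_j/2). The AIC scores are invented and the function name is just for illustration.

```python
import math

def akaike_weights(aic_values):
    """Return (delta_aic, weights) for a list of AIC scores."""
    best = min(aic_values)
    deltas = [a - best for a in aic_values]
    # Relative likelihood of each model given the data: exp(-delta/2).
    rel_lik = [math.exp(-0.5 * d) for d in deltas]
    total = sum(rel_lik)
    weights = [r / total for r in rel_lik]
    return deltas, weights

# Hypothetical AIC scores for three candidate models (made-up numbers).
aic_scores = [100.0, 101.5, 106.0]
deltas, weights = akaike_weights(aic_scores)

# Ratio of each weight to the largest weight; this equals exp(-delta/2).
ratios = [w / max(weights) for w in weights]

for d, w, r in zip(deltas, weights, ratios):
    print(f"delta={d:.1f}  weight={w:.3f}  ratio_to_best={r:.3f}")
```

Note that the weights always sum to 1 within whatever model set you feed in, which is part of why they are so tempting to read as probabilities; the ratios do not depend on which other models happen to be in the set.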

There is an article by P. Murtaugh in a recent issue of Ecology http://www.esajournals.org/doi/abs/10.1890/13-0590.1, showing how some criticisms of P-values can be applied to Delta AIC, and what is the link between the two for some simple examples. I found that was very instructive.

Often ecological problems might not fit the picture where only 2 or 3 distinct hypotheses are likely, as in your example with species richness, but also the theories themselves might not be so clear-cut that there is a well-defined unique model for each. This “inherent fuzziness” (for lack of a better expression) naturally leads to more models of similar nature being compared, because the practitioner simply can’t select the only 3 or 4 models that make sense. However, even in this context, I’m not sure about the claim in this post that keeping all theories around but ranking them is not helpful. Surely that is better than keeping them all without ranking them, which is the state of knowledge before analysis if you have no strong a priori…

I’m more concerned by the cases where AIC selects overcomplicated models (which can be shown with simulations) and why that happens. But the field is relatively young, compared to other areas in statistics, so I assume statisticians will learn more and more about what criterion to use for what purpose (if they don’t already have knowledge on the topic that we just don’t know about!).

Thanks for the link!

You raise interesting points. There are two main themes I’m getting out of your comments.

1) You’re right there often are no well defined hypotheses. I guess I would say if there are no well defined hypotheses maybe we should just go all in and do exploratory statistics?

2) You also raise the issue of what I call multicausality (3 or 4 forces acting on the same system at the same time). Which is nearly universal in ecology. I’m not sure AIC is the best solution to that. First of all, it does put you back in the world where you have 3 or 4 strong hypotheses so you should compete them (not turn it into a muddled variable selection exercise like my cartoon). Second I personally find variance partitioning more useful as a way to summarize the role of multiple simultaneously acting forces than AIC weights or ranks.

I totally agree with you about AIC overfitting.

Were I “The Decider,” then yes, I would say just about everyone should conduct exploratory statistics when data allow it. Certainly most of us have well defined hypotheses going forward, but I am reminded of “failing to stop and smell the roses”.

I believe there is such a tendency across the sciences to approach our work like a bloodhound on a scent that we very often, perhaps almost always, miss out on intriguing and revealing relationships in our data.

For many years now I have made time, significant time, to do exploratory work with my data, and I very much approach it like playing in the sandbox. For me anyway it is play, and really great fun, because it takes me back to a time when I did things just for the heck of it, with no particular end point in mind.

I would say, on average, about 9 out of 10 exploratory paths I take are dead ends. But it is that one of ten that makes you feel like the King of the Hill!

About 1): maybe exploratory statistics should indeed have a larger place in ecology. But, and this might be true for variance partitioning as well (not sure), some data are easier to visualize than others. Sometimes it is difficult to see the patterns without fitting models (e.g. estimating survival), so you already have a likelihood, and a criterion penalizing it becomes a natural choice (and then you’re faced with the question of what to choose: AIC, BIC, or something else even…). (I don’t have strong opinions on the matter.)

“This “inherent fuzziness” (for lack of a better expression) naturally leads to more models of similar nature being compared, because the practitioner simply can’t select the only 3 or 4 models that make sense.”

Yes. But in that sort of situation (which I agree is very common), how does using AIC to rank models help you? Heck, if you start with fuzzily-defined models, how is any statistical procedure going to tell you anything besides “you need to define your models more sharply before trying to distinguish between them”?

What helps you in this sort of situation, I think, is thinking hard and coming up with better models, that one can actually test. Or maybe setting the models to one side and doing something descriptive or exploratory. Or maybe collect data that will check the *assumptions* of your various models rather than test their predictions. Or maybe design a manipulative experiment that will distinguish the models. Or maybe go ask some other, more tractable question instead. Or go ask the same question in some other system, like a model system that makes the question tractable.

In general, if you’re struggling to learn what you want to learn with statistical approach X, or struggling to articulate exactly why you’re using statistical approach X, the solution often is not “learn to use statistical approach X better”. Or even “try some other statistical approach”. Struggling to learn what you want to learn with statistical approach X, or being unclear on what you’re learning from statistical approach X, often is a sign that there’s some more fundamental problem, of which your statistical problem is only a superficial symptom.

My point was that if you can write models where components (e.g. A, B, C, D) of the population growth of a focal species can all be affected by factors X, Y, Z (this makes for many possible combinations, and that’s the annoyance), and you see, comparing AICs, that factor Z almost always affects B in the highest ranked models but that component A is almost never affected by X, you still learn plenty.

Then you can go and design new experiments to see why your population declines through trait B because of an increase in Z, whilst A (let’s say recruitment) is never affected by X, despite a previous hypothesis that chemical X could have a profound impact on A. And you can check later why chemical X has no effect.

So assuming the ranking selects high-ranking models that have something in common and don’t have all the variables in, it can be useful (the example I took looks a little like a structural equation model I think, it would be interesting to know what the community using SEM chooses as criterion of fit).

Hi Brian, one reason I like AIC that I don’t think has been mentioned here is that the model that is identified as the best model by AIC is the one that would do the best in a Leave-one-out cross validation. This explicit connection to out-of-sample predictive ability seems to me to be an advantage. Of course, the problem is that this is true under some fairly restrictive assumptions and in the limit so while it is attractive conceptually I don’t know, in practice, whether it results in selecting models that are better at out-of-sample prediction than other methods of model selection.

The delta 2 method is a problem for reasons that have already been discussed but one that I don’t see mentioned often is that for nested models it is impossible to end up with a single model unless the full model (using all predictor variables) is the best model. For example, if you are fitting models that can have from 0-5 predictor variables and all the models are nested the only way you can have all models other than the best model with delta AIC >2 is if the model with 5 predictor variables is the best model. This is clear if you do a simple thought experiment. Let’s say the best model is the model with 3 predictor variables. Now add a 4th variable – the delta AIC compared to the model with three variables can’t be greater than 2. And it can only be 2 if the 4th variable explains absolutely none of the remaining variability in the dependent variable, which is never going to happen. So, the model with 4 variables will always have a delta AIC less than 2 even if that 4th variable explains almost none of the variability in the dependent variable. Under the AIC ‘rule of thumb’ you would be expected to treat those models as roughly equal.
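
The thought experiment can be checked numerically. Below is a rough sketch, assuming ordinary least squares with Gaussian errors (so the maximized log-likelihood depends only on the residual sum of squares) and a single extra predictor that is pure noise. The sample size, number of simulations and helper names are all made up for illustration.

```python
import math
import random

def gaussian_aic(rss, n, k):
    """AIC of a least-squares fit with Gaussian errors; k counts all estimated parameters."""
    sigma2 = rss / n  # maximum-likelihood estimate of the error variance
    log_lik = -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)
    return -2 * log_lik + 2 * k

def rss_intercept_only(y):
    mean_y = sum(y) / len(y)
    return sum((v - mean_y) ** 2 for v in y)

def rss_simple_regression(x, y):
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return sum((b - my - slope * (a - mx)) ** 2 for a, b in zip(x, y))

random.seed(1)
n = 50
aic_increase = []
for _ in range(1000):
    y = [random.gauss(0, 1) for _ in range(n)]
    x = [random.gauss(0, 1) for _ in range(n)]  # predictor unrelated to y
    aic_small = gaussian_aic(rss_intercept_only(y), n, k=2)     # mean + variance
    aic_big = gaussian_aic(rss_simple_regression(x, y), n, k=3)  # + slope
    aic_increase.append(aic_big - aic_small)

# The extra parameter can never lower the likelihood, so AIC rises by at most 2.
print(f"largest AIC increase from adding a noise variable: {max(aic_increase):.4f}")
```

Because the larger model can never fit worse than the model nested inside it, the AIC penalty of 2 per parameter is the most it can lose by, so the useless variable always lands within the "delta AIC < 2" band.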

The other problem I have is that I have been involved in a couple of conversations about what to do if you have to select one model and you have 1 or more models within 2 AIC units of the best model. I would say that you should just pick the best model. Others have said you should pick the simplest model from the group. But AIC already has a penalty for complexity – what are the grounds for adding a second penalty for complexity? Brian, you mentioned that there is some evidence that AIC errs on the side of selecting more complex models and that would be grounds for choosing one of the simpler models but in the discussions I’ve had, proponents of this approach aren’t making that point. They are simply stating Occam’s razor and using that as grounds for their decision. The logic for this seems weak to me when using a technique that explicitly penalizes complexity. They would make the same argument for models ranked by BIC and BIC adds an even heavier penalty for complexity and, I suspect, if anything errs on the side of picking models that are too simple. But both of these problems have to do with how AIC is used rather than anything fundamental about AIC.

Like Brian, I think most of the problems with AIC arise from how we use it rather than anything inherently wrong with it and I think it has at least one conceptual strength that most model selection techniques don’t. Best, Jeff H

On point 1) Burnham and Anderson in their book and Shane Richards (2008) have both argued for eliminating models that have “pretending variables” (i.e. when a more complex model has a higher deltaAIC than a simpler version of the model). When you do this you are often left with a shorter group of models receiving support from AIC. You can often end up with the null being the only supported model.

On the “other problem”, why pick one model? The AIC analysis is saying that both models are receiving substantial support from the data. You shouldn’t ignore the uncertainty in model selection and choose a winner.

Jeff, in the first situation you describe, I think Arnold (2010) makes a good argument for excluding nested effects that don’t increase the AIC by more than 2 units. But this is an example of information theoretic inference not having widely established and agreed upon rules, which might give you more researcher degrees of freedom.

http://onlinelibrary.wiley.com/doi/10.1111/j.1937-2817.2010.tb01236.x/abstract;jsessionid=25DBD80CCF06313095C2B88CAE2E17B9.f03t04

I have observed a debate in this string concerning “Bias” and “Penalty” as it concerns AIC. Anderson & Burnham posted a response to the issue online many years ago. I do not know for certain if what they claim is unequivocally true, as I am not a statistician, but they seem to make a logical argument:

“The so-called penalty term in AIC (i.e., 2K) is not a bias correction term. This is incorrect, see Chapter 7 in Burnham and Anderson (2002). There are certainly dozens of journal papers that clearly show that the maximized log(L) is a biased estimator of relative, expected K-L information and that to a first order a defensible asymptotic bias correction term is K, the number of estimable parameters in the model. So, E(K-L) = log(L) – K. To obtain his AIC, Akaike multiplied both terms by –2. Thus, AIC was –2log(L) + 2K. Note, the 2 is not arbitrary; it is the result of multiplying by –2 such that the first term in the AIC is the (well known) deviance, a measure of lack of fit of the model. The rigorous derivation of the estimator of expected relative K-L, without assuming the model is true, leads to a bias correction term that is the trace of the product of two matrices: tr(J·I⁻¹). If the model, g, in question is the “true” model, f, then this trace term equals K. If g is a good (in K-L sense) approximation to f then tr(J·I⁻¹) is not very different from K. Moreover, any estimator of this trace (hence, TIC model selection) is so variable (i.e., poor) that it’s better to take this trace term as K rather than to estimate it.”

But interestingly, other statisticians (Bozdogan 2000, Ripley 2004) have pointed out that Akaike’s (1973) derivation strictly requires nested models for the K in the AIC computation (–2log(L) + 2K) to correctly adjust for the bias in using the MLE as an estimate of the Kullback-Leibler divergence for selecting the model with minimum divergence. A paper by Schmidt and Makalic (2010) goes even further: it argues that the nested-model requirement is only satisfied when there is a single candidate model for each value of K, and it shows that the probability of selecting the wrong model increases as the number of candidate models for each value of K increases (their result is an upper bound). So this certainly suggests there are some serious issues with using AIC for all subsets regression where you might have multiple candidate models with K = 2, K = 3, K = 4 parameters, etc. I’m still trying to come to grips with the generality of the Schmidt and Makalic (2010) result. But it certainly calls into question the argument that Burnham and Anderson have made that AIC doesn’t require nested models for validity.

“What is statistically significant is not always biologically significant”. This really seems to be the root of all evils, whether we are discussing null hypothesis testing (NHT) or information-theoretic model comparison (ITMC). Both NHT & ITMC can be misused to test “silly nulls”. Both approaches lead to arbitrary inferences (effect size in NHT; confidence intervals in ITMC). Both can test a range of alternatives, via univariate or multivariate statistics, that do not include any *good* alternatives. NHT & ITMC are equally prone to data & model dredging. A posteriori attempts to improve models by adding parameters occur in either environment and in general are not good practice. However, exploratory data analysis is not only helpful, but probably should be applied in any developmental modeling process.

Jeremy’s point about abusing AIC to facilitate intellectual laziness is a good one, and I have seen this happen repeatedly. My suggestion, and it is an approach I force myself to use, is that the null should be at least as interesting and biologically significant as the alternative, if not more so. In fact, we ought to construct hypotheses such that we are motivated to support the null, rather than reject it. I believe this approach avoids many of the pitfalls described, and really forces investigators to focus on biological significance and not just statistical significance. Such an approach likely cures many of the ailments discussed today, whether it is NHT or ITMC.

Given the complaints about fuzzy thinking, unclear hypotheses, and data dredging, and interest in mitigating these factors, maybe someone should do a post about the feasibility of preregistering studies in ecology. This is increasingly being done in medicine (especially for pharmaceuticals) and in psychology. A recent popular article that draws from an interview with Brian Nosek, one of the biggest open science proponents, has an interesting section on how the process of preregistration has led researchers to realize just how much their hypotheses of interest shift during the research process.

http://nautil.us/issue/24/error/the-trouble-with-scientists

I think preregistration would be a huge plus, but there may be some unique aspects of ecology that would make it more difficult, such as the need to control for many factors, some of which aren’t realized ahead of time.

Just an idea, anyway.

I’ve been waiting for Jeremy to weigh in but I know he is at a conference today so maybe not. He has talked about this frequently in the blog and is a fan of the idea. I’m sure he’ll be along with links soon.

Yes to everything you and Brian just said. I’m a fan of the idea of preregistration, while admittedly never having done it myself. I think it would be a very interesting exercise to try in ecology, if only to help researchers realize just how much their “hypotheses” shift during the research process, and (importantly) how much this compromises their ability to determine whether those hypotheses are true or not.

I also think it would be an interesting way to try to resolve disagreements among different “camps” or “schools of thought” in ecology. Not necessarily in the eyes of the members of those different schools of thought, but in the eyes of outsiders. Get both sides to agree in advance on what study would resolve their dispute in favor of one side or the other, then go and do that study.

But yes, I am at a conference, so I’m afraid I won’t have much time to participate in the conversation further. Here are a few old relevant posts before I run:

https://dynamicecology.wordpress.com/2012/12/03/want-to-bet/

https://dynamicecology.wordpress.com/2015/01/22/book-review-the-bet-by-paul-sabin/

https://dynamicecology.wordpress.com/2013/03/29/friday-links-transparency-in-research/

I think this is interesting. One “easy” way of doing this would be to version control your working notebook or whatever, and if possible highlight hypotheses in specific commits (maybe with a tag).

Any sort of database requiring submission will be awkward. This solution could piggyback on the push for version control for other reasons.

I haven’t read all the comments, so perhaps someone has already said this.

My understanding is that AIC and other information criteria are focused on the goal of improving out-of-sample prediction without actually doing cross validation. Its job is not to find the “true” or “universally best” model form. It can’t do this, because AIC scores change with sample size. Remember, AIC will often rank false models over the true model (supposing you know what the true model is – maybe you simulated the data yourself). The idea is that if there is not enough information to properly estimate the parameters of the true model, then using the estimates from that model will do a worse job of prediction than a false but simpler model, which won’t overfit as much.

So, any use of AIC to try to find “the best model”, or to do hypothesis testing, seems like a fundamental misuse. Again, it’s a way of ranking how well a certain parametric model family will predict new data, given the current sample size. If you’re not trying to make predictions from your current model estimates, then you probably don’t want to use AIC or other information criteria.

Let me know if I’ve got this wrong.

Thanks. I have heard several commenters mention the out-of-bag nature of AIC (by which I assume you mean that it is asymptotically equivalent to jackknifing regression). I have to say I don’t think this has much to do with the initial impetus (I cannot find jackknife, hold-one-out or even hold out in Burnham & Anderson). As several commenters have pointed out, the relation to jackknifing is asymptotic and it’s not very clear how relevant it is in real conditions. But in the end I just have to say if out-of-bag tests are your goal, there are much better, stronger methods than asymptotically jackknifed approaches. Larger holdouts (e.g. 2/3 vs 1/3) or cross-validation are the norm in the machine learning world.

AIC is closely tied to likelihood and likelihood in normal variables is closely tied to SSE which is closely tied to R2 and RMSE. So I buy those links a little bit. But the particular parsimony penalty (2k) seems arbitrary to me (having read in full the detailed mathematical derivation of it being optimal notwithstanding). In general since AIC usually overfits I’m not sure how this fits the logic of your last sentence in your first paragraph. The idea that this is an out-of-bag test seems a reach to me personally. And I personally think it is almost always used to find the “best” model (and the “second best” and so on) but I guess that may depend on the literature you’re reading.

I think Burnham & Anderson explained the derivation of this multiplier (2*K) was not arbitrary but in fact represented the deviance. Were they just jerkin’ our chains?

It does align with deviance, but my understanding is that the 2*k comes originally from minimizing the Kullback-Leibler distance rather than from an analogy with deviance.

Could be. While I studied AIC somewhat deeply, I am by no means an expert on its history. Anderson & Burnham stated with authority that Akaike derived the multiplier as the deviance, and said it was by no means arbitrary. So I am left wondering who is right about the issue.

I’m not sure what you mean by saying that it “usually overfits.” Compared to what? It may well pick a model more complicated than the “true” model, if the sample size is quite large. That’s in part because AIC isn’t supposed to identify the true model, but provide good out-of-sample prediction (again, as I understand it). The MLE from an overly complex model family may do just as well as the MLE from the true model family, provided that the sample size is large enough to render superfluous variables approx 0. But if you know of clear examples where bad overfitting happens, I’d be curious to hear. (Forgive me if this has already been mentioned above.)

I do know from experience that AIC usually picks more complex models than BIC. I don’t know anything about BIC’s derivation or its intended job, so don’t have anything to say about the difference. I do know that Burnham and Anderson contrast the two in their book, and argue that they are intended to do different things. Haven’t read it in years, so I might misremember that.

To be clear: I don’t mean to provide a defense of current AIC practices, which I don’t know very well anyway. Cross-validation is the more obvious thing to do. My point is that if people are indeed using AIC to pick “the universally best model,” then that’s wrong, as I understand it.

I did give an example of what I consider overfitting above, but good luck finding it! Try searching within this page for “R2 of 0.68”.

Thinking about it more, I do see what you mean by designed to fit out-of-bag data. This is in the original derivation by Akaike. And it is rather imprecise. One source I saw said it provides an estimate of the fit to an independent replicate of the data used to fit the model. I have to confess I cannot quite get my head around what that really is! And it is only asymptotic and I think there are some pretty strong questions of whether it works out in real world data. I am much more comfortable with really holding data out-of-bag.

Hey Brian, the discovery that AIC picks the same model that would be picked by leave-one-out cross-validation is in Stone, M. 1977. An asymptotic equivalence of choice of model by cross-validation and Akaike’s criterion. JRSS Series B 39, 44-47. I have trouble getting my head around this too but my fuzzy understanding is that the model that would be the best by leave-one-out cross-validation can be identified analytically with certain assumptions. So, to me this takes the penalty out of the realm of arbitrary – it may not always perform as advertised but the theoretical foundation seems a little better than either p < .05 or BIC. Jeff.
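
If anyone wants to poke at how close the equivalence gets at modest sample sizes, here is a toy sketch. It assumes just two candidate models (intercept only versus a straight line), Gaussian errors, and brute-force leave-one-out refits. The slope, sample size and function names are invented for illustration, and it says nothing about more complicated model sets.

```python
import math
import random

def fit_line(x, y):
    """Least-squares intercept and slope."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return my - slope * mx, slope

def aic_gaussian(rss, n, k):
    """-2 log(L) + 2k for a Gaussian least-squares fit."""
    return n * (math.log(2 * math.pi * rss / n) + 1) + 2 * k

def loo_error_mean(y):
    """Leave-one-out squared prediction error of the intercept-only model."""
    n, total = len(y), sum(y)
    return sum((v - (total - v) / (n - 1)) ** 2 for v in y)

def loo_error_line(x, y):
    """Leave-one-out squared prediction error of the line model (refit n times)."""
    err = 0.0
    for i in range(len(y)):
        xs, ys = x[:i] + x[i + 1:], y[:i] + y[i + 1:]
        b0, b1 = fit_line(xs, ys)
        err += (y[i] - b0 - b1 * x[i]) ** 2
    return err

def picks_line_by_aic(x, y):
    n = len(y)
    my = sum(y) / n
    rss0 = sum((v - my) ** 2 for v in y)
    b0, b1 = fit_line(x, y)
    rss1 = sum((b - b0 - b1 * a) ** 2 for a, b in zip(x, y))
    return aic_gaussian(rss1, n, 3) < aic_gaussian(rss0, n, 2)

def picks_line_by_loo(x, y):
    return loo_error_line(x, y) < loo_error_mean(y)

random.seed(42)
n, n_sims, agree = 40, 500, 0
for _ in range(n_sims):
    x = [random.gauss(0, 1) for _ in range(n)]
    y = [0.2 * a + random.gauss(0, 1) for a in x]  # weak true slope (made up)
    agree += picks_line_by_aic(x, y) == picks_line_by_loo(x, y)
agreement = agree / n_sims
print(f"AIC and leave-one-out CV picked the same model in {agreement:.0%} of simulations")
```

Changing n or the slope is an easy way to see where the two criteria start to part company in this toy setting.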

As I understand it from all the comments, there are two claims about AIC measuring the best model on out of sample.

One goes back to Akaike himself and seems rather vague. The other is on jackknifing by Stone 1977 that you mention.

The Stone one seems much more concrete (although jackknifing is a much less powerful claim than best model on a whole new sample of data). But they are both true only asymptotically with large samples.

My understanding is neither of these claims necessarily get that close to being fulfilled on real data (too far from the asymptotic conditions). But I would be happy to be educated on this!

I would agree the theoretical foundation is better than BIC, it’s just that BIC works better in the real world in my experience. I would say the theoretical foundations for AIC and for calculating a p-value are at least equal (and I’ve actually argued in this post that I think there is a better foundation for p-values). However p<0.05 is best compared to ΔAIC<2 and both are completely arbitrary and without theoretical foundation (although I suspect that secretly ΔAIC<2 just comes from p<0.05; as mentioned above, p<0.05 is mathematically exactly the same as ΔAIC<1.92 under simple and again asymptotic conditions).
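
For the curious, the link between the two thresholds can be sketched in a few lines. This assumes the textbook asymptotic setup: two nested models differing by one parameter, compared with a likelihood ratio test whose statistic is chi-square with 1 degree of freedom. In that setup the test statistic equals ΔAIC + 2, where ΔAIC is how much lower the larger model’s AIC is. The function names are made up.

```python
import math

def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))

def p_value_from_delta_aic(delta_aic, extra_params=1):
    """LRT p-value when the larger model's AIC is lower by delta_aic."""
    if extra_params != 1:
        raise NotImplementedError("this sketch covers one extra parameter only")
    # AIC_small - AIC_big = LR - 2 * extra_params, so LR = delta_aic + 2.
    lr_stat = delta_aic + 2 * extra_params
    return chi2_sf_1df(lr_stat)

for d in (0.0, 1.84, 1.92, 2.0):
    print(f"delta AIC = {d:.2f}  ->  p = {p_value_from_delta_aic(d):.4f}")
```

On this reading, p = 0.05 corresponds to a likelihood ratio statistic of about 3.84, i.e. a log-likelihood difference of about 1.92 and a ΔAIC of about 1.84, so the exact number depends on which of those quantities you mean. Either way the two rules of thumb sit very close together.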

I haven’t read through all the comments, so excuse me if I’m harping, but I think the key is that when AIC first came to be used, people actually had to understand what the philosophical underpinnings meant, much like when the first (I imagine) significance tests were developed. However, as journal editors and others became enamored of it, fewer people understood how to properly execute it.

I read Burnham and Anderson, cover to cover, several times while writing my dissertation (only one chapter used AIC, but it was heavy in that chapter) yet since then, about 6 years only, I have met relatively few people who have even cracked it open. It has become as much of a “plug and chug” method as doing a t-test.

I would agree with you. B&A came out when I was in grad school and I read it cover to cover too. Yet as a tool gets adopted, people start to think they can dispense with such things. Well put.

Excellent point, Katie. And AIC is not the only instance of “Plug & Chug” ecology, which gets to Jeremy’s point about intellectual laziness. Ecology has gone through several phases of what I loosely call determinism. So for example, in the wake of the physics revolution spurred by Schrödinger, Bohr, Podolsky, Einstein and others, ecology went on an extended, decades-long search for deterministic and universal laws equivalent to what was observed in physics during the late 19th & early 20th centuries. When those efforts did not pan out, many threw their hands up and said, “well, complex biological systems at the level of the ecosystem or beyond are just too complex, too variable to ever identify things like universal mechanisms”.

If you dig into the literature, you find that “philosophy” repeated time & again. Thus, for example, when it comes to constructing composite indicators of living systems, in my opinion, we often plug & chug again. Often the variables thrown into the PCA or factorial hopper were decided upon by diverse, and often non-scientific, stakeholder groups. So, we toss sometimes hundreds of variables into these analyses, partition out the variance, then select a subset of variables encapsulating some majority percentage of the variance. Then we weight them according to the degree of variance they account for, build a composite index and call it a day.

Is it any wonder, then, that a recent study found 75% of 400+ restoration projects failed to attain their benchmarks??? No, of course it isn’t. Understanding the mechanics of complex living systems does not come out of these approaches, and the time is really long past due that we no longer continue to FUND these approaches. Thinking is hard, and thinking creatively is harder, and thinking in a manner that reveals a true understanding of how ecosystems function is really really hard.

Imagine where we would be today if Isaac Newton had opined, “Small things move fast, big things move slowly, and really big things don’t move at all… so the hell with the laws of motion, it’s just too darned complex”. Uh, yeah, it is complex and we need to separate the wheat from the chaff. I speak from experience, having spent the past 6 years investigating one relationship… and the composite index resulting from that endeavor did not come out of any PCA hopper, I assure you… although AIC provided some useful clues along the way.

David, could you please give a reference for the failed-restoration-projects-study? Working in restoration ecology and doing some ecological restoration myself that clearly interests me and I did not find it myself… Thanks in advance!

Albin- Last I checked, the publication had not yet been posted online. It is a pending paper by Margaret Palmer. Here is the link to her publications page:

http://www.palmerlab.umd.edu/publications.html

The following quote from Margaret was the source of my information:

A. Palmer at the University of Maryland reports that more than 75 percent of river and stream restorations failed to meet their own minimal performance targets. “They may be pretty projects,” says Palmer, “but they don’t provide ecological benefits.” (from: http://e360.yale.edu/feature/rebuilding_the_natural_world_a_shift_in_ecological_restoration/2747/).

I’m going to vote for Lasso Regression as a silver bullet to perform variable selection, reduce over-fitting, and let you assess significance and magnitude of effects simultaneously 🙂 http://www.stat.cmu.edu/~ryantibs/datamining/lectures/17-modr2.pdf

Just got around to reading this today. And no time for reading all the comments. But just wanted to say thanks for writing this post.

I’m sorry, I don’t have time to read all of these comments, and I’m sure I am going to repeat some of what someone else has said here.

I’m a bit worried that the opening tone of your post suggests that AIC is a bad stats method. ANY stats paradigm is going to be bad if it is not preceded by careful consideration of competing hypotheses. I love the use of AIC for the ability to compare among multiple competing hypotheses, especially since in ecology it is rare to study a phenomenon that may be affected by “this thing or nothing”. Step one of ANY study should be to carefully consider possible competing hypotheses, and for an AIC analysis you should then develop a considered model set (not dump an all permutations model set into the analysis). Burnham and Anderson clearly advocate for careful consideration of a limited model set, and not doing so is not a problem of AIC analysis, it is a problem of the scientist.

If it’s helpful to anyone, I follow these rules for myself when running an AIC analysis.

1) Limit my model set to logical, interpretable, competing hypotheses.

2) Always include a null model.

3) Always report an R-square value.

I think it says a lot about how we might need to improve stats education if ecologists are just doing a data dump into a model dump.

I think your 3 criteria go a long way.

Brianne- You highlight many important issues, not the least of which is improper application of statistical methodology. This post has been very helpful for me, and so I appreciate Brian putting it out for consumption, as my organization has been in the process of developing a new peer review model for a journal we intend to market & publish. Part of our new peer review process, which in general will be streamlined and transparent, is retaining statisticians to carefully review matters such as this. I think we can all agree there has been a mountain of statistical gibberish published in ecology.

I am curious about your comment: “it is rare to study a phenomenon that may be affected by “this thing or nothing””. If that is indeed the case, and I would argue in most cases it is not (but rather the complexity of the system obscures it), how then do you construct a valid null model? Thanks.

I’m not trying to be snarky, I just have struggled philosophically with this concept of a randomized null when, as you state, we rarely study phenomena modulated by a small set of variables.

“1) Limit my model set to logical, interpretable, competing hypotheses.

2) Always include a null model.

3) Always report an R-square value.”

I like these 3 rules (and the original blog and entire reply chain).

1) This is an important point that B&A (2002) emphasize repeatedly. And they disparage “all possible subsets” approaches, and yet AIC is widely used in such circumstances (and B&A 2002 inevitably gets cited).

2) Although they don’t come right out and say it in the book (they come close on p. 115), B&A seem to discourage null models (in 2001, Wildl. Research, and 2002, J. Wildl. Manage., they do come right out and say it). As others have commented earlier, AIC will find an AIC-best model, even if all of them suck. So I concur; always include a null model.

3) This works fine for GLMs but not for multinomial mark-recapture models, although ANODEV can perhaps function as an approximation to R2 in such circumstances. One of the most common “mistakes” I see with AIC in mark-recapture data is to interpret an ecologically vapid model (i.e. survival and recapture probability differ among years) as “AIC-best”, and ignore an ecologically relevant model (i.e. survival is a function of ecological covariate X). In circumstances like this, comparing the covariate model to the null and fully temporal model with ANODEV allows one to say that “30% of the deviance in survival is accounted for by covariate X”.
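
The ANODEV comparison can be sketched as follows, assuming the usual analysis-of-deviance ratio: deviance explained by the covariate model divided by deviance explained by the fully time-dependent model, both measured against the constant model. The function name and the three deviance values are hypothetical, chosen only so the arithmetic reproduces a 30% figure:

```python
def anodev_r2(dev_constant, dev_covariate, dev_time):
    """Fraction of the temporal deviance accounted for by a covariate.

    dev_constant  : deviance of the constant (null) model
    dev_covariate : deviance of the model where survival depends on covariate X
    dev_time      : deviance of the fully time-dependent model
    """
    explained_by_covariate = dev_constant - dev_covariate
    explained_by_time = dev_constant - dev_time
    if explained_by_time <= 0:
        raise ValueError("the time-dependent model must fit better than the constant model")
    return explained_by_covariate / explained_by_time

# Hypothetical deviances, not taken from any real data set.
r2_dev = anodev_r2(dev_constant=520.0, dev_covariate=505.0, dev_time=470.0)
print(f"{r2_dev:.0%} of the deviance in survival is accounted for by covariate X")
```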

B&A on one hand discourage all-subsets data dredging, but on the other hand model averaging basically requires that every explanatory variable receive equal representation in the set of considered models (otherwise the weighting reflects not only the evidence in the data, but also how many models each explanatory variable appeared in). So, they end up saying don’t dredge, but basically you have to if you’re going to do model averaging. I think this goes back to the beginning of the conversation of what is the goal of the analysis. If it’s prediction, then all-subsets data dredging probably isn’t that bad. If it’s inference about the effects of certain variables on a pattern, then you need a priori hypotheses.
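
A small illustration of the representation problem, using a hypothetical four-model set in which every model has the same AIC (so the data favour none of them). The variable names and AIC values are invented; the weights follow the standard Akaike-weight formula:

```python
import math

def akaike_weights(aic_values):
    """Akaike weights: exp(-delta/2) for each model, normalised to sum to 1."""
    best = min(aic_values)
    relative = [math.exp(-0.5 * (aic - best)) for aic in aic_values]
    total = sum(relative)
    return [r / total for r in relative]

# Hypothetical model set: variable A appears in three models, B in only one.
models = {
    ("A",): 100.0,
    ("A", "C"): 100.0,
    ("A", "D"): 100.0,
    ("B",): 100.0,
}
weights = dict(zip(models, akaike_weights(list(models.values()))))

# Summed weight per variable across every model that contains it.
summed = {}
for variables, weight in weights.items():
    for variable in variables:
        summed[variable] = summed.get(variable, 0.0) + weight

for variable in sorted(summed):
    print(variable, round(summed[variable], 2))
```

Variable A ends up with a summed weight of 0.75 against 0.25 for B purely because it sits in three models rather than one.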

I like your approach, Todd, but what do you think of the recent Bromaghin et al. (2013) and Doherty et al. (2012) that basically state when you have no a priori hypotheses regarding a particular set of covariates (i.e., you included them because they’re all biologically plausible) not evaluating all possible models (or some model-rich selection process) can result in biased selection results? I’m wrestling with this at the moment for some of that NZ stuff…

Bromaghin, J. F., T. L. McDonald, and S. C. Amstrup. 2013. Plausible combinations: An improved method to evaluate the covariate structure of Cormack-Jolly-Seber mark-recapture models. Open Journal of Ecology 3:11-22. doi:10.4236/oje.2013.31002

Doherty, P. F., G. C. White, and K. P. Burnham. 2012. Comparison of model building and selection strategies. Journal of Ornithology 152(2):317-323.

http://link.springer.com/article/10.1007%2Fs10336-010-0598-5

My first question is always “what is your question?”

If your primary goal is prediction, to obtain the best possible estimates (i.e. lowest MSE) of the response variable, then Doherty et al. conclusively demonstrated that model averaging over all possible subsets using AIC is the best approach. But I suspect that most readers don’t really care about the best estimates of absolute bobolink abundance in 2002, or bullsnake survival in 2007 (but I’m not trying to dis this either; for agency scientists, this might be the most relevant question).

If your primary goal is variable selection (i.e. what things should I continue to measure to best predict Y in future surveys?), then AIC has a lot to recommend it (with the caveat that 1 in ~6-7 retained predictor variables will be spurious, but so what?).

If your primary goal is model selection, i.e., “does the world work like A or B?”, then AIC might be useful as a precursor to asking should I interpret model A, model B, or both? (but parameter estimates and their uncertainty will ultimately answer this question).

Or if your primary goal is to identify and incorporate a host of nuisance variables that you suspect influence Y, but you really don’t care about (because your primary goal is to assess whether or not X! influences Y), then I think AIC is a good approach to winnow out what you should or shouldn’t include as nuisance variables, but a poor approach to concurrently assess the importance of X! (I would advocate a LRT for that).

But I think that most practitioners are employing AIC for variable selection, with the goal of assigning ecological significance to retained variables, and I don’t think that’s the greatest approach. I’m becoming a fan of running a single model (as large as your data will reasonably support), and interpreting it. Or 2 or 3 models if you have 2-3 different plausible alternatives of how the world works. And that isn’t too far from B&A 2002.

I’ve noticed somewhat of a tendency in the string of comments for “all-in” or “all-out” approaches to statistics of any kind. Personally I would not recommend anyone put all their eggs in one basket. I think we all realize any test has its strengths & weaknesses. Diversity indices, for example, to varying degrees assess abundance, richness and sometimes overall abundance. Recent iterations also include landscape metrics. Some are more or less sensitive to any of these variables.

So why would you limit yourself to just one diversity index, knowing none provide the gold standard? I’ve always preferred the approach of biochemistry. Most often, biochemists triangulate their hypotheses, using several independent lines of experimentation. That enables them to often avoid the oh so embarrassing publication of non-reproducible results.

Philosophically, I have a lot of issues with generation of a “null model” via randomization of data. It’s always seemed spooky to me. I instead advocate applying AIC in a relative framework, i.e., without a null model. However, I would never, ever rely on this as a gold standard, much less publish anything on that basis alone. No matter how one might apply AIC, or any other test, I would suggest alternative lines of questioning and reasoning before leaping to any conclusions.

A new paper that is particularly relevant:

Fieberg, J. & Johnson, D.H. (2015) MMI: Multimodel inference or models with management implications? The Journal of Wildlife Management

http://onlinelibrary.wiley.com/doi/10.1002/jwmg.894/abstract

“There is no unique best way to analyze most data sets.”

Mark Brewer’s talk about the Cult of AIC is (finally) online.

This is an excellent talk. Thank you Mark.

Thanks Brian – I borrowed the word “muddled” from you, of course!

Question I should’ve remembered to ask much, much earlier: Brian, what do you think of continuous model expansion as an alternative to model selection or model averaging? Andrew Gelman is big on this, but I only have the vaguest sense of what it is. Basically, you write down and estimate a “hyper-model” or “meta-model” that includes the various models you want to select among as special cases, right?

Great article and discussion. I work in a subfield of behavioral ecology in which AIC is becoming increasingly popular as a “tool” for selecting variables and interactions before inferential testing. Many reviewers in my field advocate for AIC in lieu of a priori hypothesis construction. I would like to preemptively defend my choice not to use AIC in my mixed-model hypothesis testing for many of the reasons that have been discussed here. Is anyone aware of a citable source that advocates against the misuse of AIC model selection? Thanks for your help.

Wait, behavioral ecologists are using AIC to select what terms to include in a model, and then doing null hypothesis significance testing on the selected model using the same data?! I sure hope I’ve misunderstood you, because dear lord that is a terrible idea. It’ll dramatically inflate your type I error rate above the desired nominal value. See here.
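
A rough simulation of why this inflates the error rate, under assumptions chosen only for illustration: 30 observations, 8 candidate predictors that are all pure noise, and AIC used to pick the best one-predictor model before a t-test on its slope using the same data. The critical value 2.048 is the two-sided 5% point of a t distribution with 28 degrees of freedom; all function names are hypothetical.

```python
import math
import random

def correlation(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def aic_one_predictor(r, n):
    """Gaussian AIC of a one-predictor regression, up to a constant shared
    by all candidate models (RSS is proportional to 1 - r^2; k = 3)."""
    return n * math.log(1 - r * r) + 2 * 3

def is_significant(r, n, t_crit):
    """Two-sided t-test of the slope, written in terms of the correlation."""
    t = abs(r) * math.sqrt((n - 2) / (1 - r * r))
    return t > t_crit

def simulate(n_sims=1000, n=30, n_predictors=8, t_crit=2.048, seed=42):
    rng = random.Random(seed)
    hits_selected = 0      # test applied to the AIC-selected predictor
    hits_prespecified = 0  # test applied to one predictor chosen in advance
    for _ in range(n_sims):
        y = [rng.gauss(0, 1) for _ in range(n)]  # response is pure noise
        rs = [correlation([rng.gauss(0, 1) for _ in range(n)], y)
              for _ in range(n_predictors)]
        best_r = min(rs, key=lambda r: aic_one_predictor(r, n))
        hits_selected += is_significant(best_r, n, t_crit)
        hits_prespecified += is_significant(rs[0], n, t_crit)
    return hits_selected / n_sims, hits_prespecified / n_sims

selected_rate, prespecified_rate = simulate()
print(f"false positive rate after AIC selection: {selected_rate:.2f}")
print(f"false positive rate, pre-specified term: {prespecified_rate:.2f}")
```

With eight independent noise predictors the selected test should reject far more often than 5% (roughly 1 - 0.95^8, about a third), while the pre-specified test should stay near the nominal rate.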

Hi Blake,

You may have seen in the comments above that Mark Brewer has given a presentation about AIC. This is working its way towards a paper and I hope it will be out sooner (months) rather than later. I think this would be an excellent citable paper on this topic.

It’s not citable on the same level but I know in reply to reviewers several people have had success citing blog posts like this one too.

It’s a sad day when you have to justify to reviewers why you used hypothesis testing instead of AIC! (but I have seen this too which is why I wrote this post).

Thanks Brian 🙂

The paper should appear as an “accepted article” any day now over at MEE…

Awesome! Looking forward to it.

NB Mark Brewer’s paper can be found here: https://dx.doi.org/10.1111/2041-210X.12541

Brewer, M.J., Butler, A., and Cooksley, S.L. (2016). The relative performance of AIC, AICC and BIC in the presence of unobserved heterogeneity. Methods Ecol Evol 7, 679–692.

Great article and discussion.

Personally I think ecology needs to move towards testing alternative models using out-of-sample predictions. This is a true test of our ability to create theories that are meaningful in the real world.

Of course, the models should ideally reflect plausible alternative hypotheses, rather than arbitrary combinations of variables. Comparing models using out-of-sample errors (e.g. implemented using cross-validation) gives you a ‘ranking’ if you desire it and a measure of predictive ability (e.g. the root mean square error), which is what Brian suggests we use R^2 for.

Note that the AIC is useful precisely because AIC model selection is asymptotically equivalent to selecting models using leave-one-out cross-validation (see http://robjhyndman.com/hyndsight/aic/). However, as Brian explains, just reporting AICs doesn’t tell us how good or bad the predictions were in an absolute sense.
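
A sketch of that comparison, assuming simulated data and two least-squares models (an intercept-only null and a straight line); the function names and data are invented and none of this comes from the linked post. AIC gives only a relative ranking, while leave-one-out cross-validation gives a ranking plus an error in the units of the response:

```python
import math
import random

def fit_line(x, y):
    """Least-squares intercept and slope for y on x."""
    n = len(y)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def fit(x, y, use_covariate):
    """Return (intercept, slope, k); the null model has slope fixed at 0."""
    if use_covariate:
        intercept, slope = fit_line(x, y)
        return intercept, slope, 3  # intercept + slope + variance
    return sum(y) / len(y), 0.0, 2  # intercept + variance

def gaussian_aic(x, y, use_covariate):
    """In-sample AIC, dropping constants shared by both models."""
    n = len(y)
    intercept, slope, k = fit(x, y, use_covariate)
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    return n * math.log(rss / n) + 2 * k

def loocv_rmse(x, y, use_covariate):
    """Refit without each point in turn and predict the held-out point."""
    squared_errors = []
    for i in range(len(y)):
        x_train, y_train = x[:i] + x[i + 1:], y[:i] + y[i + 1:]
        intercept, slope, _ = fit(x_train, y_train, use_covariate)
        squared_errors.append((y[i] - (intercept + slope * x[i])) ** 2)
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Simulated data (assumption): a clear linear effect plus unit-variance noise.
random.seed(7)
n = 40
x = [random.uniform(0, 10) for _ in range(n)]
y = [1.0 + 0.5 * xi + random.gauss(0, 1) for xi in x]

results = {
    name: (gaussian_aic(x, y, flag), loocv_rmse(x, y, flag))
    for name, flag in [("null", False), ("line", True)]
}
for name, (aic, rmse) in results.items():
    print(f"{name}: AIC = {aic:.1f}, leave-one-out RMSE = {rmse:.2f}")
```

Because every held-out point here comes from the same simulated population, this is the easy case; it says nothing about prediction under genuinely new conditions.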

I also wanted to add a link to Gelman’s blog on DIC for Bayesian analysts reading this post: http://andrewgelman.com/2011/06/22/deviance_dic_ai/

In short, if you are using the DIC for model selection be careful. It has the same philosophical issues as the AIC that are raised here in addition to other issues relating to its theoretical basis and calculation.

The problem with out of sample prediction in ecology is that if it is genuinely out of sample then there is almost certainly unexplained heterogeneity that means that AIC will tend to over-fit – see the Brewer et al. paper linked to above.

I think DIC is being replaced by wAIC now – it has a better theoretical basis (but still the same issues as AIC).

Two contradictory links to my work – wherein I agree with both of you:

http://onlinelibrary.wiley.com/doi/10.1111/oik.03726/full

https://dynamicecology.wordpress.com/2013/08/19/why-advanced-machine-learning-methods-badly-overfit-niche-models-is-this-statistical-machismo/

More seriously, I am broadly a huge fan of tests of out of sample prediction (1st link). It is to me one of the most rigorous modes of inference in science. But as Bob notes (and my 2nd link notes) it needs to be truly out of sample, which is not as easy in the autocorrelated world of ecology as it is in the machine learning world of a database of customers. I personally think still meaningful out of sample predictions can be arranged in ecology (http://onlinelibrary.wiley.com/doi/10.1111/j.1600-0706.2012.00299.x/full), but have indeed observed that many people will quickly dismiss these as “unfair” tests of extrapolating to new conditions. There are many philosophical questions as well as a need to clearly define goals and motivations in resolving this.

Thanks Brian.

I would still argue that k-fold cross-validation is a stricter test than in-sample tests, even though you are resampling from the same data-set.

It is interesting to think about how you would design a study with a ‘truly out of sample test’ as you say.

BTW, your final link is broken, I was interested to see what was on the other side.

Cheers,

Chris

thanks, will have to read up on the wAIC.

The key advantage of AIC over other classical tests is the ability to compare competing hypotheses, not test a single one. That is more in line with what Platt and Chamberlain had in mind. Yes, it does get misused, but you could say the same about experiments that use unrealistic treatment levels to produce a ‘statistically significant’ effect. Most folks don’t specify a priori what biologically significant effect size is, but are ok with it as long as p < some arbitrary standard.

And so what if folks are 'wishy-washy'? If they're being wishy-washy because they can't frame a biologically relevant reason for including a covariate in a model, fair point, but if it's because they're acknowledging the uncertainty that comes with doing observational studies, why is that bad? It seems like we might all benefit from being honest about the shortcomings of our studies.

As far as keeping competing hypotheses alive, maybe that's actually warranted in natural experiments or observational studies? In most cases, these aren't tests that can exclude alternative hypotheses, despite how they are presented, so why pretend like they do? Doesn't that instill a false sense of confidence, which is probably more detrimental to science than failing to exclude other potential hypotheses?

I’m pretty sure Chamberlain never meant a regression model with and without a quadratic term, or with and without an interaction term, or especially 5 regression models with more or less arbitrarily chosen subsets of variables, when he talked about competing theories. And invoking Platt in the context of AIC is even worse – what would he say about your last paragraph?