# How many terms should you have in your model before it becomes statistical machismo?

Before the holidays, I ran a poll asking why people’s models have gotten bigger and more complex (i.e. more terms in regressions). First, it is worth noting that essentially nobody disagreed with me that models have gotten more complex. So I am taking as a given my original characterization: the typical model has grown from a 2-way ANOVA (maybe with an interaction term) to, say, 4-8 terms (several of which may be interaction terms or random factors) in just the last 15 years.

Like every topic I place under the statistical machismo header, there is no one answer. No right or wrong. Rather it is a question of trade-offs where I hope to make people pause and question conventional wisdom which seems to always and only lead to ever increasing complexity. Here I definitely hope to make people pause and think about why they are building models with 5-8 terms. (NB the following is a typically long-winded blog post for me, feel free to skip to the bold summary paragraph at the bottom).

In econometrics this issue is taught under the title “omitted variable bias” (it is frequently taught in econometrics and often in psychology). One can mathematically prove that if you leave out a variable that affects the dependent variable and is correlated with the variables you include, your estimates of the slopes for the included variables will be biased. The trade-off is that including more variables in a regression leads to a loss of efficiency (bigger error bars around your slope estimates). This seems then to boil down to a classic bias vs variance trade-off. I’m personally not too sold on this viewpoint for this particular problem. First, the mathematical proof rests on the very unrealistic assumption that there is a single definitive set of variables which alone cause the dependent variable, but this is never the real world. Second, although leaving a variable out might introduce bias, there is no way to know whether it biases slopes positively or negatively, which means in practice you don’t know how it’s biased, which in a weird meta way goes back to being effectively unbiased. The whole omitted variable bias argument is pretty decisively shredded in Clarke 2005.
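To make the textbook claim concrete, here is a minimal simulation sketch, not taken from Clarke or any econometrics text. The variable names (`x`, `z`), the effect sizes and the correlation `rho` are all assumptions chosen for illustration. It fits the same data with and without the correlated variable `z`.

```python
import numpy as np


def simulate_omitted_variable(n=5000, rho=0.6, beta_x=1.0, beta_z=2.0, seed=0):
    """Return the slope on x with z included, and with z left out.

    All parameter values are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    # z has unit variance and correlation rho with x
    z = rho * x + np.sqrt(1 - rho**2) * rng.normal(size=n)
    y = beta_x * x + beta_z * z + rng.normal(size=n)

    full = np.column_stack([np.ones(n), x, z])   # intercept, x, z
    short = np.column_stack([np.ones(n), x])     # intercept, x only
    b_full = np.linalg.lstsq(full, y, rcond=None)[0]
    b_short = np.linalg.lstsq(short, y, rcond=None)[0]
    return b_full[1], b_short[1]


slope_full, slope_short = simulate_omitted_variable()
print(f"slope on x with z included: {slope_full:.2f}")
print(f"slope on x with z omitted:  {slope_short:.2f}")
```

In this setup the short regression's slope drifts toward `beta_x + rho * beta_z`, so flipping the sign of either `rho` or `beta_z` flips the direction of the bias. That is the practical difficulty raised above: if you have not measured `z`, you do not know which way you are off.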

Most ecologists, I think, are instead coming pretty much from Hurlbert’s extremely influential paper on pseudoreplication (which got a lot of confirmation in the survey). Hurlbert introduced the idea of pseudoreplication as a problem and made two generations of ecologists live in fear of being accused of pseudoreplication. However, nobody seems to recall that adding more terms to a regression is NOT one of the solutions Hurlbert suggested! And I’m willing to bet it would not be his suggested solution even with modern regression tools so easily available. His primary argument is for better, more thoughtful experimental design! There is no easy post hoc fix through statistics for bad experimental design and I sometimes think we statisticians are guilty of selling our wares by this alluring but flawed idea. Beyond careful experimental design, Hurlbert basically points out that there are two main issues with pseudo-replication: confoundment and bad degrees of freedom/p-values. Let me address each of these issues in the context of adding variables to complexify a regression to solve pseudo-replication.

1) Confoundment – Hurlbert raises the possibility that if you have only a few sites you can accidentally get some unmeasured factor that varies across your sites, leading you to mistakenly think the factor you manipulated was causing things when in fact it’s the unmeasured factor that is confounded with your sites by chance. However, and this is a really important point, Hurlbert’s solution (and that of anybody who thinks for five minutes about experimental design) is to make sure your treatment is applied within sites, not just across sites, thereby breaking confoundment. Hurlbert also goes into much more detail about the relative advantages of random vs. intentional interspersion of treatments, etc. But the key point is confoundment is fixed through experimental design. This is harder to deal with in observational data (one of the main reasons people extol experiments as the optimal mode of inference). But in the social sciences and medicine it is very common to deal with confoundment in observational data by measuring and building in known confounding factors. Thus nearly every study controls for factors like age, race, income, education, weight, etc. by including them in the regression. For example, propensity to smoke is not independent of age, gender or income, which in turn are not independent of health, so decisive tests of the health effects of smoking need to “remove” these co-factors (by including them in the regression). Either Hurlbert’s experimental design or social science’s inclusion of co-factors makes sense to me. But in ecology, we instead tend to throw in so-called nuisance factors like site (and plot within site) and year, but this does NOT fix confoundment (and is more motivated by non-independence of errors, discussed below). To me confoundment is NOT a reason for the kinds of more complex models we are seeing in ecology. If you are doing an experiment, then control confoundment in the experimental design.
And if it is observational, include more direct causal factors (the analogs of age and demographics) like temperature, soil moisture, vegetation height, etc. instead of site and year nuisance factors if you are worried about the confoundment problem of pseudoreplication.

2) Bad degrees of freedom/p-values – Hurlbert’s second concern with pseudo-replication (which is totally unrelated to confoundment and is not fixable by experimental design) relates to p-values. This is because non-independence of error terms violates assumptions and essentially leads us to think we have more degrees of freedom than we really have, which, since degrees of freedom enter the calculation of p-values, leads us to think our p-values are lower than they really are (i.e. p-values are wrong in the bad way – technically known as anti-conservative). This is a mathematically true statement, so the debate comes in with how worried we should be about inflated p-values. If we decide we are worried, we can just stop using p-values (recall this was Hurlbert’s recommendation, but very few remember that part of the paper!). Nor does Hurlbert imply there are larger problems than the p-value inflation (and the confoundment raised above). In fact Hurlbert says pseudoreplication without p-values can be a rational way forward.
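The size of the p-value problem is easy to see in a small simulation. This is a rough sketch under assumed values (4 sites per treatment, 10 subsamples per site, equal site and residual variance, no treatment effect at all), not a reanalysis of anything in Hurlbert. It compares a t-test that counts every subsample as independent with one run on site means.

```python
import numpy as np

SITES_PER_TREATMENT = 4
SUBSAMPLES_PER_SITE = 10
# Two-sided 5% critical values of Student's t, hard-coded for the sizes above
# so the sketch needs only NumPy.
T_CRIT_NAIVE = 1.991  # df = 2 * 4 * 10 - 2 = 78 (every subsample counted)
T_CRIT_SITES = 2.447  # df = 2 * 4 - 2 = 6 (one mean per site)


def pooled_t(a, b):
    """Pooled-variance two-sample t statistic."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var * (1 / na + 1 / nb))


def false_positive_rates(n_sims=2000, site_sd=1.0, resid_sd=1.0, seed=1):
    """Share of null simulations declared significant at the 5% level."""
    rng = np.random.default_rng(seed)
    shape = (2, SITES_PER_TREATMENT, SUBSAMPLES_PER_SITE)
    naive_hits = 0
    site_hits = 0
    for _ in range(n_sims):
        # Each site gets a random offset shared by all of its subsamples;
        # the treatment itself has no effect.
        site_offsets = rng.normal(0.0, site_sd, size=(2, SITES_PER_TREATMENT, 1))
        y = site_offsets + rng.normal(0.0, resid_sd, size=shape)
        # Pseudoreplicated analysis: all 40 subsamples per treatment treated as independent.
        naive_hits += int(abs(pooled_t(y[0].ravel(), y[1].ravel())) > T_CRIT_NAIVE)
        # Site-level analysis: one mean per site, n = 4 per treatment.
        site_hits += int(abs(pooled_t(y[0].mean(axis=1), y[1].mean(axis=1))) > T_CRIT_SITES)
    return naive_hits / n_sims, site_hits / n_sims


naive_rate, site_rate = false_positive_rates()
print(f"false positives, subsamples as replicates: {naive_rate:.2f}")
print(f"false positives, site means as replicates: {site_rate:.2f}")
```

The slope of the story does not change in either analysis: both estimate the same difference in means. Only the claimed certainty changes, which is why dropping the p-value is a coherent response.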

The question of whether to report p-values or not interacts in an interesting way with one of the main results of the survey. Many people feel like having more complex models is justified because they are switching to model selection approaches (i.e. mostly AIC). This approach is advocated by two of the best and deservedly most popular ecology stats books (Crawley & Zuur et al). But I have to confess that I am uncomfortable with this approach for several reasons. First, the whole point of model selection initially (e.g. Burnham and Anderson’s book) was to move away from lame models like the null hypothesis and compete really strong models against each other as Platt (1964 Science) recommended in his ode to strong inference. Comparing a model with and without a 5th explanatory factor does not feel like comparing strongly competing models so it does not feel like strong inference to me. Second, model selection is a great fit for prediction because it finds the most predictive model with some penalty for complexity (recall that in the world of normally distributed errors AIC is basically n times the log of SSE/n plus 2 times the number of parameters, and SSE is also the numerator of the unexplained fraction in R2, making a precise mathematical link between AIC and R2). But model selection is a really bad fit for hypothesis testing and p-values (again as anybody who has read the Burnham and Anderson book will have seen, but few follow this advice). Although I don’t go as far as Jeremy and Andrew Gelman (I think doing one or two very simple pre-planned comparisons such as with or without interaction term and then reporting a p-value is probably OK), I strongly believe that one should not do extensive model selection and then present it as a hypothesis test. While I agree with Oksanen’s great take-down of the pseudoreplication paper that argues p-values are only courtesy tools, I don’t think most people using model selection and then reporting p-values treat them that way.
I’m fine – more than fine – with pure exploratory approaches, but I think a lot of people are noodling around with really complex models and lots of model selection and then reporting p-values like they’re valid hypothesis tests. Indeed, I have had reviewers insist I do this on papers. This strikes me as trying to have your cake and eat it too and I think is one of the reasons I am so uncomfortable with the increasingly complex models – because they are so highly intertwined with model selection approaches.
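For readers who want the AIC and R2 link spelled out: under normally distributed errors, AIC can be written (up to an additive constant) as n·log(SSE/n) + 2k, while R2 = 1 − SSE/SST. The sketch below uses made-up data with one real predictor and one pure-noise predictor (both assumptions for illustration) to show that adding a term can never raise SSE, so R2 never falls, whereas AIC charges 2 units per extra parameter.

```python
import numpy as np


def ols_sse(X, y):
    """Residual sum of squares from an ordinary least squares fit."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    return float(resid @ resid)


def gaussian_aic(sse, n, k):
    """AIC under normal errors, dropping constants shared by all models.

    k counts the regression coefficients (including the intercept).
    """
    return n * np.log(sse / n) + 2 * k


rng = np.random.default_rng(2)
n = 200
x1 = rng.normal(size=n)
junk = rng.normal(size=n)            # hypothetical predictor unrelated to y
y = 1.5 * x1 + rng.normal(size=n)

X_base = np.column_stack([np.ones(n), x1])
X_big = np.column_stack([X_base, junk])

sst = float(((y - y.mean()) ** 2).sum())
sse_base, sse_big = ols_sse(X_base, y), ols_sse(X_big, y)
r2_base, r2_big = 1 - sse_base / sst, 1 - sse_big / sst
aic_base = gaussian_aic(sse_base, n, k=2)
aic_big = gaussian_aic(sse_big, n, k=3)

print(f"R2:  base {r2_base:.4f}, with junk term {r2_big:.4f}")
print(f"AIC: base {aic_base:.2f}, with junk term {aic_big:.2f}")
```

The AIC difference between the two models is exactly n·log(SSE_big/SSE_base) + 2, so the extra term is only favoured when it cuts SSE by enough to pay for its own penalty. That is a statement about predictive fit, not a hypothesis test.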

I do think it is important to note that whatever the motive, there are genuine costs to building really complex models. The biggest cost is the loss of interpretability. We know exactly what a model with one explanatory factor is saying. We have a pretty good idea what a model with two factors is saying. And I even have a really precise idea what a 2-way ANOVA with an interaction term is referring to (the interaction is the non-additivity). But I have yet to see a really convincing interpretation of a model with 5 factors (other than “these are things that are important in controlling the dependent variable” at which point you should be doing exploratory statistics). And interaction terms (often more than one these days!) are barely interpretable in the best circumstances, like when the hypothesis is explicitly about interaction. And while mixed models with random effects are a great advance, I don’t see too many people interpreting random effects in any meaningful way (e.g. variance partitioning). Moreover, the most commonly used mixed model tool – lmer – pretty much guarantees you don’t know what your p-values are (for good reasons), and the most common workarounds are wrong and often anti-conservative to such a degree that the authors of the package refuse to provide p-values (e.g. this comment and Bolker’s comments on Wald tests). Again – if you want to do exploratory statistics, go to town and include 20 variables. But if you’re trying to interpret things in a specific context of particular factor X has an important effect, you’re making your life harder with more variables.

Another big problem with throwing lots of terms in is collinearity – the more terms you have, the more likely you are to get some highly correlated explanatory variables. And when you have highly correlated explanatory variables, the “bouncing beta problem” means you are basically losing control of the regression (i.e. depending on arbitrary properties of your particular data, the computer algorithm can assign almost all of the explanatory power – i.e. slope – to either one or the other correlated variable – or in other words – if you drop even one data point the answer can completely change).
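The bouncing beta problem can be sketched in a few lines. The data here are invented for illustration (30 points, two predictors that differ only by a small amount of noise, both assumptions). The code refits the regression with each data point dropped in turn and records how far the slopes move.

```python
import numpy as np


def fit_slopes(x1, x2, y):
    """OLS slopes for x1 and x2 (intercept fitted but not returned)."""
    X = np.column_stack([np.ones(len(y)), x1, x2])
    return np.linalg.lstsq(X, y, rcond=None)[0][1:]


rng = np.random.default_rng(3)
n = 30
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.02, size=n)   # nearly a copy of x1
y = x1 + x2 + rng.normal(size=n)           # true slopes are 1 and 1

# Refit with each single data point left out.
loo = np.array([
    fit_slopes(np.delete(x1, i), np.delete(x2, i), np.delete(y, i))
    for i in range(n)
])

slope_range = loo.max(axis=0) - loo.min(axis=0)           # swing of each slope
sum_range = loo.sum(axis=1).max() - loo.sum(axis=1).min() # swing of their sum

print("range of slope on x1 across leave-one-out fits:", round(float(slope_range[0]), 2))
print("range of slope on x2 across leave-one-out fits:", round(float(slope_range[1]), 2))
print("range of (slope1 + slope2):", round(float(sum_range), 2))
```

The sum of the two slopes is pinned down well, because the data do know the combined effect of the correlated pair. How that total is split between the two variables is what bounces around.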

So, in summary, adding variables is a very weak substitute for good up front experimental design. It might be justified when the added variables are known to be important and are used to control for confoundment with sampling problems in an observational context. But that’s about it. And the techniques often invoked to make complex models viable such as random effects and model comparison pretty much guarantee your p-values are invalid. I find it very ironic so many people go to great lengths including nuisance terms to avoid pseudoreplication (to ensure their p-values are valid) then guarantee their p-values are invalid by using random effects and model selection. And good luck interpreting your complex model especially when coefficients are being assigned to collinear variables arbitrarily! So to my mind complex regression models straddle the fence very uncomfortably between clean hypothesis testing contexts (X causes Y and I hypothesized it in advance) and pure exploratory approaches – this fence sitting complex model approach to my mind has the worst of both worlds, not the best of both worlds.

To put it in blunt terms, it would appear from popular answers in the survey that many people are complexifying their models in response to Hurlbert’s issues of pseudo-replication and Burnham & Anderson’s call for model comparison but seem to forget that both of them actually call for abandoning p-values to solve these problems. And that Hurlbert’s paper was really a call for better experimental design and Burnham & Anderson’s book was a call for a return to strong inference by competing strong models against each other not tweaks on regressions. So these were both calls for clear, rigorous thinking before starting the experiment, NOT for post hoc fixes by adding terms to regression models.

So, I have to at least ask, how much of this proclivity for ever more complex models is a result of peer pressure, fear of reviewers and statistical machismo? I was a little surprised to see that no small fraction of the poll respondents acknowledged these factors directly. So I urge you to think about why you are complexifying your model. Is it an observational study (or weakly controlled experimental study) where you need to control for known major factors? Should you really switch to an exploratory framework? Are you willing to give up p-values and the hypothesis testing framing? If not, say no to statistical machismo and keep your model simple!

This entry was posted in Issues by Brian McGill.

I am a macroecologist at the University of Maine. I study how human-caused global change (especially global warming and land cover change) affects communities, biodiversity and our global ecology.

## 33 thoughts on “How many terms should you have in your model before it becomes statistical machismo?”

1. Great post! I was talking with a colleague this week about how I feel like experimental design doesn’t get nearly the attention it deserves. It was completely drummed into my head as a grad student that the time to think about how you will analyze your data is BEFORE you do the study. Of course, things end up changing (perhaps due to a demonic intrusion!), but it really helps to make sure the experimental design is sound.

I don’t think it’s more of a problem now that people aren’t carefully thinking about experimental design prior to doing a study — this point was made so emphatically and often when I was a grad student precisely because it was a common problem. But I do think the availability of more complex analytical tools (and more computing power) makes it so that it’s easier for people to try to resort to fancy stats to “fix” experimental design problems.

When I was discussing this with my colleague, we were discussing how it’s really important to discuss study design and statistics at the same time. But I think there’s been more of a shift towards classes that are focused on a specific programming language (e.g., an “R” course, or a Python course). Perhaps that shift has led to less of an emphasis on study design?

Finally, thanks for emphasizing that Hurlbert argues for better experimental design! I was reading a paper yesterday that misinterpreted this aspect of Hurlbert. Hurlbert does NOT say that studies like the whole lake manipulations that have been done are a problem. (If I recall correctly, he says there’s a lot to be gained from them.) Instead, he simply cautions about improperly applying statistical tests to those sorts of designs.

• “But I do think the availability of more complex analytical tools (and more computing power) makes it so that it’s easier for people to try to resort to fancy stats to “fix” experimental design problems.”

That is very astute Meg.

• I agree with everything you said.

In the grad stats course I teach I spend a decent chunk of time on experimental design (and another on scientific inference) and the students love it – they usually say it’s the most valuable part of the course. It means I cover fewer R functions but that is the right priority as we all agree. But teaching experimental design has gotten very rare. I’m curious how many people here had any formal coverage in a class of experimental design? (vs learning it by osmosis). Might be a separate post in there!

And yeah – everybody from field people desperate to fix their data to observational people desperate to be as rigorous as experiments to statisticians desperate to sell their wares likes to imply post hoc fancy statistics can fix anything. But they can’t.

And your memory is correct: Hurlbert fully acknowledges that good science can be done with pseudo-replication, he just hates the reporting of p-values on them. (And demonic intrusion is one of the all time great phrase coinages in ecology). And Oksanen then says good science can be done with pseudo-replication AND go ahead and report p-values as a matter of courtesy to your reader. Whichever you agree with, nobody says you can’t do good science with pseudo-replication.

• What would you think about reporting results from the simplest possible model (Y~X) and the most complex possible model (Y~”full”+X), where full doesn’t include the kitchen sink, but does include random effects for location and year, and other measured covariates known to influence Y, with a priori consideration given to how many df you can afford to spend given your sample size, whether those covariates are highly correlated with X, etc.?
Here’s a real example:

| model | predictor | beta | SE | t |
|---|---|---|---|---|
| Y~X | fire.freq | 0.656 | 0.110 | 5.96 |
| Y~full+X | fire.freq | 0.690 | 0.203 | 3.40 |

As an author, reviewer, and reader, I’m more confident that abundance of species Y is positively correlated with fire frequency after seeing the parameter estimate from both models (but I definitely don’t need to see every possible combination of covariates between simple and full).

2. Nice post!
“Indeed, I have had reviewers insist I do this on papers.”
Yes. Me too, and it’s annoying.
Regarding experimental design, one of the motivations to develop my ecological statistics course was that STAT 802 Experimental Design here at UNL was widely regarded as useless by our wildlife, fisheries & ecology students dealing with landscape scale data. Do you have any thoughts about experimental design at the landscape scale, e.g. where one of the factors is something like “% of forest within 1 km”?

• Hi Drew – interesting points and questions. You raise the important point that experimental design is not a monolithic topic and varies from subfield to subfield in ecology. Here at UMaine we have a great experimental design course but it is entirely focused on agricultural field treatments, which is of limited applicability to ecology. In my ecology class I try to teach things relevant to the average field ecologist (which basically involves agreeing and disagreeing with Hurlbert’s paper – it’s a really packed paper).

You would think as a macroecologist I would have a well articulated plan on experimental design at larger scales.

I do briefly teach experimental design for regression (where the goal is to have independent “x” values that are evenly spread out and cover as wide a domain as the desired domain of inference).

Otherwise, in my experience the big a priori design question is to really think through in some detail the exclusion criteria. Which points, species and time periods should be excluded from the analysis (again, obviously a priori)? There are always some weird outliers like invasive species, points on top of cities, etc. that will mess up your analysis (for reasons unrelated to the mechanisms you are trying to study), and deciding this a priori is highly desirable. As to how exactly to do this, knowing your data well and having experience are all I can suggest.

As far as pseudoreplication in large spatial contexts goes, I think there are two main approaches I have seen. One is to do downsampling. Put down a grid at a large scale where one thinks neighbor spatial autocorrelation ought to be trivial (e.g. 100 km x 100 km) and then randomly pick one point in each grid cell. This avoids a lot of the sampling biases (more points around cities, for example). The other is to use spatial regression. Raster analysis is one place where I think spatial regression really makes sense, but it is also one place where it is really hard to do (most packages kind of die or take a month to run when you throw in 10,000 points). I am increasingly leaning to the downsampling approach. That said, I also have quite a few studies where I just ignore spatial autocorrelation. As I’ve argued repeatedly, it doesn’t bias estimates, which is usually what I am interested in – it just messes up p-values, and p-values are so useless in macroecology. They always have 10 zeros in front because of the sample sizes. If one is careful this can be sold to macroecological reviewers.

Good question!
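A minimal sketch of the grid downsampling idea, for anyone who wants to try it. It assumes point coordinates are already in a projected system measured in kilometres. The function name `downsample_by_grid` and the toy data (a uniform scatter plus a dense cluster standing in for a city) are hypothetical, not from any package.

```python
import numpy as np


def downsample_by_grid(x_km, y_km, cell_km=100.0, seed=0):
    """Return indices of one randomly chosen point per occupied grid cell.

    Assumes planar coordinates in km; cell_km is the grid spacing.
    """
    rng = np.random.default_rng(seed)
    x_km = np.asarray(x_km, dtype=float)
    y_km = np.asarray(y_km, dtype=float)
    col = np.floor(x_km / cell_km).astype(int)
    row = np.floor(y_km / cell_km).astype(int)

    chosen = {}
    # Visit points in random order and keep the first one seen in each cell,
    # which gives every point in a cell the same chance of being picked.
    for idx in rng.permutation(len(x_km)):
        chosen.setdefault((int(row[idx]), int(col[idx])), int(idx))
    return np.array(sorted(chosen.values()))


# Toy data: 300 scattered points plus 700 clustered around one "city".
rng = np.random.default_rng(4)
xs = np.concatenate([rng.uniform(0, 1000, 300), rng.normal(500, 20, 700)])
ys = np.concatenate([rng.uniform(0, 1000, 300), rng.normal(500, 20, 700)])

keep = downsample_by_grid(xs, ys, cell_km=100.0)
print(f"kept {len(keep)} of {len(xs)} points")
```

The dense cluster contributes no more points than any other occupied cell, which is the whole appeal. The choice of cell size is the judgment call and should come from what you believe about the range of spatial autocorrelation in your system.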

• Phew. OK, those are the same thoughts I’ve had. This downsampling idea deserves some further thought; our students are not usually using continental scales, but the same idea would work with finer scale landscape analyses I think.
In my class I also bring up the “spread the points along the x-axis” design goal. Kathryn Cottingham’s review “Knowing when to draw the line: designing more informative ecological experiments” is my inspiration for that section.

3. Hi Brian. This is another great post that I’m largely in agreement with. But I’m going to be a little provocative and focus on two of your comments on omitted-variable bias. 1. “One can mathematically prove that if you leave a variable out which is correlated with the variables you include this will lead to a bias in your estimation of the slopes for the variables you did include.” This is only true if you are attempting to interpret the coefficients causally. But regression is a method used for description or prediction and omitted variable bias is irrelevant to this. Regression is BLUE. In using regression for description/prediction we shouldn’t refer to missing variates as confounders but simply as unmeasured covariates. Adding these will improve prediction as long as the signal of the added covariate is larger than the extra noise. (It will always improve prediction within the data, but whether it adds value in a second dataset is easily checked using something like cross-validation.)

2. “the whole omitted variable bias is pretty decisively shredded in Clarke 2005.” Thanks for this link as I was unaware of the paper. What do you mean by this statement? I inferred that you meant that omitted variable bias is not a concern, something that I would argue, and have argued ([2014 Evolution here](http://onlinelibrary.wiley.com/doi/10.1111/evo.12406/abstract?deniedAccessCustomisedMessage=&userIsAuthenticated=false)), rather provocatively against. But in reading the abstract of Clarke, it seems that Clarke is stating simply that chasing confounders may or may not decrease the bias. Again, this is only relevant if interpreting the coefficients as causal. So again, this is irrelevant to regression as description/prediction. The graphical modeling work of Judea Pearl is a better source for the wisdom of naively chasing confounders as he shows nicely using graphical models that adding some kinds of confounders can “unblock” edges and increase bias. Again, this is all part of causal modeling. In my Evolution paper I actually explored (slightly) the practice of naively adding confounders. Adding a random confounder has a slightly higher probability of decreasing bias. The point in my paper is that it’s probably not worth the trouble to chase confounders and you are naive to think that this may “control” confounding.

But the point about chasing confounders I agree with and maybe this is what you meant by “shredded”.

This all raises the question of how regression models are used in Ecology and Evolution. Within epidemiology, hundreds of papers are published every week that use an observational design and a regression model to link some environmental/genetic factor to some disease (or phenotype more generally). For example, here is one that is popular this week: [excessive running is bad for you](http://content.onlinejacc.org/article.aspx?articleID=2108914). The authors are usually careful to couch the language as if this is all about prediction (the X variables are called “predictors” for example) and there is always the statement “of course we have measured associations and further work needs to be done to infer causation”. But then they use causal language throughout including the word “confounder”. Confounding is only relevant in causal modeling.

I don’t read the ecology and evolution literature nearly as much as you or many readers of this blog but when I do, I would say this practice is generally the same, at least for publications of observational studies. Everyone uses “association” and “predictor” and “correlation doesn’t imply causation” but then the whole purpose or goal of the study is causal. Interventional. If this factor is manipulated by this much, this will be the expected result. Regression doesn’t do that for us unless the method is experimental.

The really big question that keeps me awake at night is, are we learning much from observational studies? I would argue that it is an illusion (or delusion) that we are. It is very sobering to view observational studies in E&E through the lens of epidemiology. A major difference between epidemiology and ecology/evolution is that many of the causal systems of interest in epidemiology are replicated ad nauseam, with little to show for it. Some systems show high diversity in results among studies. Others are more consistent. Are the consistent ones closer to the truth? There are numerous examples of experimental results that completely reverse the causal effects inferred from many, many years of consistent observational results. The effect size of smoking on risk of lung cancer is huge, yet it took many, many years of observational studies to be fairly confident that this effect is real and not spurious due to confounding.

• Jeff — “The really big question that keeps me awake at night is, are we learning much from observational studies?” Good question, but what if observation is all we have? I’m thinking here about policy related to wildlife conservation. We have resource selection functions (i.e. regression models based on observational data) that tell us bears like forest and avoid roads. We’ll never be able to do an experiment to confirm the causal effect of roads on bear behavior, but does that mean we should allow road construction in intact forest areas?

• Jeff – I pretty much agree with your points on omitted variable bias so don’t expect me to defend the concept or argue with you!

And I’m w/ Drew on observational studies. I even said in this post that observation is weaker than experiment, but what if that’s all you have? I’m not willing to just abandon scales and questions that aren’t amenable to experiment!

You are quite right that in fields like medicine it sometimes seems that every new analysis is independent of the prior ones on the same field.

I’m not convinced ecology is in the same boat. We tend to be a little more likely to say X affects Y because of some pretty clear mechanisms before we test them instead of just asking “I wonder if coffee affects cancer”. We also tend to use natural experiments often (not that these are without problems either). But whatever the reason, I think a lot of observational studies converge on reasonably consistent results. E.g. productivity is the primary driver of diversity at continental scales in plants. Related factors like temperature can be important in ectotherms. Mountainous topography is an important secondary factor. Body size is consistently inversely related to, but a weak predictor of, abundance. Metabolic rate (and life span, etc.) is primarily predicted by body size, with variance explainable by life style, habitat and phylogeny, which are all collinear.

• This is basically the argument of the folks at science-based medicine, who argue that SBM is not just evidence-based medicine (roughly equivalent to randomized controlled trials) but also includes well controlled observational designs + underlying physiological model that predicts the observational results. It is really this that keeps me up at night. Specifically, our brains are explanation machines and a consequence of this is this scenario of science: we observe a general pattern and then create a very very plausible mechanistic model. Then we look for correlations to “test” the model, but is this really a test, since the model itself was unconsciously (or consciously) developed to explain the correlation? Our brains will (quite naturally) focus on mechanisms that support our inchoate model and conveniently ignore mechanisms that do not support our inchoate model. Then we start the process of developing a more rigorous model but the path we take is already biased by a set of previous observations/explanations.

I’ll stop there instead of trying to flesh this idea out more but what I wonder is, how much of the mechanistic models that we develop are already biased to work in favor of an observation? Or, what kinds of mechanistic explanations are more susceptible to this type of bias? And which are more robust?

• I acknowledge what you describe can happen. But you didn’t address my point that regression-based fields in ecology (e.g. species richness, allometry) have a history of coming up with consistent, reasonably repeatable results across taxa, continents, etc.

• Hi Brian, the comparison with science-based medicine was addressing this point. I agree, given the consistent results and the many kinds of evidence that might be available (lab experiments, mesocosm experiments, computational experiments, theoretical models, in addition to the observational studies) it would seem absurd to state that we haven’t learned something about causal effects in these large-scale systems or that the regression models are not modeling real causal effects. More on this in a minute.

I’m going to throw this out, again, partly to be provocative. Do you think ecology is ahead of medicine simply because you are an ecologist and biased toward this position (confirmed by selective recall of the rigorous work in ecology that you are very familiar with in combination with the daily absurd stories that come out of epidemiology, like coffee and cancer?)

I’m actually in awe of the basic science and much of the statistical modeling in biomedical research. And its history provides many useful comparisons for ecologists. For example, is productivity as the driver of diversity really rock solid (I confess to not being able to answer this at all since it is far outside of what I read)? Take a similar rock solid theory in medicine – the “lipid hypothesis” of cardiovascular disease – that dietary lipids are a driver of cardiovascular disease (I would call this a theory that generates many testable hypotheses but that’s another conversation). There is abundant observational data in its favor. And abundant experimental data on the biochemistry and physiology of lipids and the development of atherosclerosis. Hundreds of thousands of research hours from many, many biochemistry and physiology labs. All doing excellent basic science, which has been used to develop a detailed model of the genesis of cardiovascular disease. It all makes perfect sense and it would be absurd to reject it based on the combination of the basic science and the observational data. Except that over the last 10 years, many RCTs are not finding good experimental evidence supporting the theory at the big macro level – that dietary lipids increase the risk of cardiovascular disease.

• I advocate exactly what you’ve written above: “more thoughtful experimental design! There is no easy post hoc fix through statistics for bad experimental design and I sometimes think we statisticians are guilty of selling our wares by this alluring but flawed idea.”

I’m also comfortable with the idea that there are lots of things in biology that we simply will not know (ever!), or at the very best, will know only very incompletely.

4. I can’t resist bringing up Bayesian statistics here (statistical machismo definitely not being my motivator, in case you are thinking that). I am curious what you think of Bayesian model averaging in more exploratory data analyses where there are several explanatory variables that cannot be tossed out as uninformative a priori (so in the modelling set-up the prior would be given as uninformative). In my limited experience, the end result is far more parsimonious than a step-wise model selection using AIC to evaluate competing models, and the output is much more informative. Having a range of error on a parameter estimate (which is based on the uncertainty in the fit and how the inclusion of different terms altered the parameter estimate) allows each explanatory variable to be considered on its own. Bayesian statistics are becoming more preferred in the health sciences over frequentist statistics, and I am wondering if that is the way ecology will move too. Are any of you teaching/using Bayesian methods?

• Re: model averaging, Shane Richards has concluded that it doesn’t work that well:

• Can you give the name of the second article, the link you gave does not work for me.

• Huh, not sure what’s wrong (EDIT: aha, doubled the http bit! fixed now). Here’s the doi: 10.1007/s00265-010-1035-8. It’s Richards et al. 2011 Behav Ecol Sociobiol

• I’m agnostic on model averaging. I’ve never used it myself but I can see scenarios where it works. It seems to be pretty successful in climate modelling. But there, and as I said in the post, model averaging of independent mechanistic models makes more sense to me than averaging across regressions. As to which tool to use (Bayesian or simpler like AIC weights) I guess it depends – I’m biased to simpler though.

But none of these are useful in a hypothesis testing framework! Model selection mostly (only?) applies to prediction.

• I agree that it is useful in prediction, and as far as being able to extend a predictive model outside of the data used to train it, the choices made in constructing the model(s) are important. As far as hypothesis testing between different models, I see your point. In my case, this is not my goal, as my hypotheses are about the nature of the predictions (types of errors, etc) made by different sets of models across several species.

As an aside, there are differences between Bayesian hypothesis testing and the more familiar p value size test of significance, etc.

• “As an aside, there are differences between Bayesian hypothesis testing and the more familiar p value size test of significance, etc.”

Of course!

5. I really enjoyed this post. This was timely for me as I’ve recently had quite a few students who come to me knowing that they “want to do model selection” (AIC). But beyond that, their goals are fuzzy. I just had a great conversation with a student last week and we walked through her plans as I asked her, did she intend to compare a few a priori hypotheses or was this analysis exploratory? It turns out that model selection with a few a priori models is a great idea for her experimental data and will allow her to ask higher order questions and compare interesting hypotheses rather than test a boring null. But her first instinct was a huge unguided complex model comparison despite the fact that she had quite a bit of biological mechanism by which to guide her hypotheses.

I also agree with Megan that experimental design courses are important. As someone who teaches a programming course for grad students I want to defend that practice and I see it as very separate from the important experimental design training. I make sure that students understand that my course is a programming course, not a stats course. That said, I do run into issues when students do not have a good background in theory. I am reminded of the danger zone in this venn diagram: http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram

Finally, as a fire ecologist I have to agree: if we threw out all pseudoreplicated fire studies we would have very little left. But I think of pseudoreplication as a logic and inference issue, not a statistical problem (other than accounting for levels of variance in nested data). Sometimes the scope and generalizability of inference arises from a logical argument based on information not in the experiment itself.

• “But her first instinct was a huge unguided complex model comparison despite the fact that she had quite a bit of biological mechanism by which to guide her hypotheses.”

I think your student was sitting in my office last week 🙂 You pretty much nailed my recent experiences and why I wrote this post.

6. Excellent topic… and one I have debated since the 80s. Experimental design is obviously crucial. Mess up here & you have your work cut out for you. I place great emphasis on two areas: hypothesis development & pilot studies. It is crucial to remember that pseudoreplication is dictated by your hypotheses… and could actually become replication by asking a question in a more nuanced way. It is also so very important to test your protocols in advance with pilot work, so I often refer folks to the publications of Elroy Rice & colleagues to get a real appreciation for that. And I always fall back on simple regression as my “bottleneck” step in any project. That is, if your data obey all four assumptions of simple regression, then the sky is the limit, because then they will obey the assumptions of almost any other test. Obviously my approach requires a lot of labor and effort in advance of actually conducting the work… but is so well worth it.

7. Hi Brian – Again, I agree completely with the point of the article (too many parameters) but I want to address a couple of what I perceive as misconceptions about omitted variable bias that I think lead researchers down the illusory path of false discovery.
1. “The trade-off is that including more variables in a regression leads to a loss of efficiency (bigger error bars around your slope estimates).” This is true only if one is doing a regression for prediction/description. If one has any causal interpretation, then the total error in your estimate includes that due to sampling AND that due to the bias. And this error can be very large (depending on the number of missing confounders and their correlations and effects) and effectively doesn’t decrease with sample size. So the standard errors, confidence intervals, and p-values of slopes spit back by stats packages are meaningless.
2. “there is no way to know whether it biases slopes positively or negatively which means in practice you don’t know how its biased which in a weird meta way goes back to being effectively unbiased”. True, but again, this bias adds substantial error above and beyond sampling error to the estimates. But when you run a regression in any stats package, it assumes no missing confounders and so the error it reports back assumes only sampling contributes to error. Do a regression on N=10^4 and your statistics package will report back really really small error but your actual error in the estimates is still very large because of the confounding. Again, only relevant for causal modeling.
3. “the mathematical proof has a very unrealistic assumption that there is a single definitive set of variables which alone cause the dependent variable – but this is never the real world”. True in some situations. But this doesn’t make the problem go away. It actually makes it worse. Because this is, yet again, another source of error not incorporated into standard errors/confidence intervals/p-values.
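
To make points 1 and 2 concrete, here is a minimal simulation sketch. Everything in it is an assumption chosen for illustration (the variable names `x`, `z`, `y`, the true slopes of 1.0, the 0.7 correlation structure and the hypothetical `ols` helper); it is not taken from any dataset discussed here. It fits a regression with N=10^4 while leaving out a confounder `z`, and compares the standard error the fit reports with the actual error in the slope.

```python
import numpy as np

def ols(X, y):
    """Return OLS coefficients (intercept first) and conventional standard errors."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    return beta, se

rng = np.random.default_rng(0)
n = 10_000
x = rng.normal(size=n)
z = 0.7 * x + rng.normal(size=n)             # confounder, correlated with x (assumed)
y = 1.0 * x + 1.0 * z + rng.normal(size=n)   # assumed true causal slope of x is 1.0

beta_short, se_short = ols(x, y)                        # z omitted
beta_full, se_full = ols(np.column_stack([x, z]), y)    # z included

print("slope of x, z omitted :", beta_short[1], "reported SE:", se_short[1])
print("slope of x, z included:", beta_full[1], "reported SE:", se_full[1])
print("actual error when z is omitted:", beta_short[1] - 1.0)
```

Under these assumptions the short regression reports a tiny standard error, yet its slope sits far from the true value of 1.0, and collecting more rows would shrink the reported standard error without shrinking that gap.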

What I find interesting is that these issues are not ignored in many fields that use regression, or at least the issues are covered thoroughly in the textbooks and there are a number of statisticians working out the problems. Within ecology and evolution it is absent from the textbooks and almost no one is working on the problem. I think we are wishing it away because of misconceptions about what it is. Here is a quote from Richard Lewontin, who I’d love to hate but he is a very good critic of science: “The observers then pretend to an exactness that they cannot achieve and they attempt to objectify a part of nature that is completely accessible only with the aid of subjective tools.”

8. Re: Model averaging: Some of the utility or lack thereof depends on what you are model averaging (true both for Bayesian and AIC model averaging). Model averaging predicted responses (e.g., estimated means on lhs of regression) based on entire model and some data can be useful. Model averaging individual regression coefficients (discrete pieces of models) as commonly done in ecology and biology is an unmitigated disaster. Anytime there is any amount of collinearity among predictor variables, the units in the denominator of the regression coefficients (remember they are a change in y divided by a change in X) change across models with different combinations of predictors such that averaging them is nonsensical. This approach of model averaging regression coefficients has been destructive to useful information in ecology while not even reasonably characterizing model uncertainty as proclaimed by its proponents. So trying to get away with more complex models with more parameters and predictors by using model averaging is not serving anyone well in my opinion.
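
One way to see the "changing denominator" point is the Frisch-Waugh-Lovell result: the coefficient on a predictor in a multiple regression equals the slope of y on the part of that predictor left over after regressing it on the other predictors. The sketch below uses simulated data with assumed names (`x1`, `x2`, a hypothetical `slopes` helper) and assumed true slopes of 1.0, purely for illustration.

```python
import numpy as np

def slopes(y, *predictors):
    """OLS slopes (intercept dropped) of y on the given predictors."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1:]

rng = np.random.default_rng(1)
n = 5_000
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)      # collinear with x1 (assumed)
y = 1.0 * x1 + 1.0 * x2 + rng.normal(size=n)

b_alone = slopes(y, x1)[0]         # change in y per unit change in x1
b_partial = slopes(y, x1, x2)[0]   # change in y per unit change in x1 with x2 held fixed

# Residualize x1 on x2: the part of x1 that x2 does not explain.
Z = np.column_stack([np.ones(n), x2])
gamma, *_ = np.linalg.lstsq(Z, x1, rcond=None)
x1_resid = x1 - Z @ gamma
b_fwl = slopes(y, x1_resid)[0]     # matches b_partial up to floating point error

print(b_alone, b_partial, b_fwl)
```

Here `b_alone` is a slope per unit of x1, while `b_partial` is a slope per unit of x1 with x2 partialled out, so a weighted average of the two mixes quantities with different denominators.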

• Excellent response! And, I would add that it is equally dangerous for folks to run a series of simple ANOVAs as opposed to a multi-factorial ANOVA.

9. A new relevant paper from econometrics points out that, if a variable Y truly is a function of variables A, B, and C, regressing Y on A alone ordinarily will give you a biased estimate of the Y-A relationship. You might think that, if you added in B as a second predictor, you’d have less bias in the estimated effect of A. You’d be wrong, unless the addition of B is “balanced”.

http://marcfbellemare.com/wordpress/11037

Will give this a shoutout in the next linkfest.
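
A toy simulation of how adding a predictor can increase bias, sketched under assumptions of my own (it is not taken from the linked post and does not use its definition of “balanced”): Y depends on A, B, and C with true slopes of 1.0, B is positively related to A and C is negatively related to A, so the two omitted-variable biases cancel when both are left out.

```python
import numpy as np

def slopes(y, *predictors):
    """OLS slopes (intercept dropped) of y on the given predictors."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1:]

rng = np.random.default_rng(2)
n = 10_000
A = rng.normal(size=n)
B = 0.5 * A + rng.normal(size=n)     # positively related to A (assumed)
C = -0.5 * A + rng.normal(size=n)    # negatively related to A (assumed)
Y = 1.0 * A + 1.0 * B + 1.0 * C + rng.normal(size=n)

b_A_only = slopes(Y, A)[0]       # B and C omitted: their biases offset each other
b_A_with_B = slopes(Y, A, B)[0]  # only C omitted: its bias is no longer offset

print("effect of A, Y ~ A    :", b_A_only)
print("effect of A, Y ~ A + B:", b_A_with_B)
```

With these particular numbers the regression on A alone lands near the true slope of 1.0, while adding B pulls the estimate toward 0.5. The cancellation is engineered, so this only shows that the intuition "more covariates means less bias" can fail, not how often it does.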