Model selection bias is a technical term for confounding exploratory and hypothesis-testing statistical analyses. Or even less technically, circular reasoning.* If you use the data to select the best-fitting model from some set of candidates, you can’t then use the same data to test hypotheses about the value of the estimated parameters of the best-fitting model. Doing so inflates your Type I error rate. Equivalently, model selection causes the confidence intervals for your fitted parameter values to be too narrow.
This is true no matter how you do the model selection, even if you do it very informally. For instance, if you’re regressing y on x, and you eyeball the data and decide that it looks like a quadratic curve might fit well, you can’t then do a valid hypothesis test of whether the true quadratic coefficient is significantly different from zero. Of course it’s going to be “significantly” different from zero (at least, it probably will be). If that wasn’t the case, you wouldn’t have decided to fit a quadratic model in the first place!
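To see the inflation concretely, here’s a quick R simulation of that scenario, using AIC-based model comparison as a crude stand-in for informal eyeballing (the sample size and number of simulations are arbitrary): the response is pure noise, a quadratic is fit only when it “looks better” than a straight line, and the quadratic coefficient is then tested at the nominal 5% level.

```r
# How often does "test the quadratic term you chose by looking at the data"
# reject when there is nothing there? AIC-based comparison stands in for
# informal eyeballing; y is pure noise, unrelated to x.
set.seed(1)
n_sims <- 2000
n      <- 30
chose_quadratic <- logical(n_sims)
p_quadratic     <- rep(NA_real_, n_sims)

for (i in seq_len(n_sims)) {
  x <- runif(n)
  y <- rnorm(n)                                   # no true x-y relationship
  fit_lin  <- lm(y ~ x)
  fit_quad <- lm(y ~ x + I(x^2))
  if (AIC(fit_quad) < AIC(fit_lin)) {             # "the quadratic looks better"
    chose_quadratic[i] <- TRUE
    p_quadratic[i] <- summary(fit_quad)$coefficients["I(x^2)", "Pr(>|t|)"]
  }
}

# Conditional on having picked the quadratic because it fit better, the
# nominal 5% test of the quadratic coefficient rejects far too often:
mean(p_quadratic[chose_quadratic] < 0.05)
```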
The standard way to deal with this problem is to do model selection and hypothesis testing on separate datasets, either by collecting new data or by holding back some of the original dataset for purposes of cross-validation. But in comments on an old post, I had a different idea:
Re: model selection bias, is it legit to take a randomization-based approach to correct for this? For instance, do a backwards elimination multiple regression (or whatever) on your original data. Then randomize the data 1000 times, each time running the same backwards elimination algorithm, to get distributions of the expected outcome of the entire model selection algorithm under the null hypothesis that there’s nothing but noise in the data. In particular, you’ll get a distribution of expected P values for the final selected model under the null hypothesis, to which you can compare the observed P value. You’ll get an estimate of the fraction of the time you’d expect any given predictor variable to be included in the final selected model under the null hypothesis. Etc.
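To make that recipe concrete, here’s a rough R sketch along those lines, using step() for the backwards elimination and permutation of the response as the randomization; the data frame and predictor names are just placeholders, and the same loop could also collect null P values for the final selected model.

```r
# Rough sketch: run backwards elimination on the observed data, then rerun the
# *entire* selection procedure on permuted (pure-noise) versions of the data to
# see what it would produce under the null hypothesis. Placeholder data:
set.seed(1)
dat  <- data.frame(y = rnorm(50), x1 = rnorm(50), x2 = rnorm(50), x3 = rnorm(50))
full <- lm(y ~ x1 + x2 + x3, data = dat)

observed  <- step(full, direction = "backward", trace = 0)
obs_terms <- attr(terms(observed), "term.labels")

n_perm     <- 1000
predictors <- c("x1", "x2", "x3")
inclusion  <- matrix(FALSE, n_perm, length(predictors),
                     dimnames = list(NULL, predictors))

for (i in seq_len(n_perm)) {
  perm_dat   <- dat
  perm_dat$y <- sample(dat$y)                    # break any real x-y association
  perm_fit   <- step(lm(y ~ x1 + x2 + x3, data = perm_dat),
                     direction = "backward", trace = 0)
  inclusion[i, ] <- predictors %in% attr(terms(perm_fit), "term.labels")
}

obs_terms            # what selection kept in the real data
colMeans(inclusion)  # how often noise alone would keep each predictor
```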
This seemed legit to me, but I wasn’t sure if there was some technical problem I hadn’t thought of. Nor did I know if anyone else had ever had the same idea. Nobody on the thread really knew either.
Well, now we know: it’s legit. In fact, it’s so legit that the guy who invented bootstrapping, Brad Efron, just published a paper on it! (ht Andrew Gelman) His approach actually is somewhat different from what I suggested: he suggests bootstrapping the entire model selection process to get valid confidence intervals, rather than using repeated randomization to conduct valid null hypothesis tests. And he also shows how to use bootstrapping as a form of smoothing or model averaging. But the core idea is the same (as far as I can tell; I’m not a statistician, so I can only get the gist of Efron’s paper).
UPDATE: the above paragraph is a poor summary of the paper; my bad. The paper isn’t the first to propose bootstrapping the model selection process, it’s proposing a new way to get standard errors for those bootstrapped estimates without increasing the computational burden. Thanks to a commenter for clarifying.
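For the curious, here’s a rough sketch of the brute-force “bagging”/“bootstrap smoothing” idea the paper builds on: rerun the whole selection procedure on bootstrap resamples of the data and average the results. (Efron’s contribution, cheap standard errors for those smoothed estimates, isn’t reproduced here, and the data and variable names below are just placeholders.)

```r
# Rough sketch of brute-force bagging of a model selection procedure: rerun
# backwards elimination on bootstrap resamples of the data and average the
# coefficients, counting terms that get dropped as zero. Placeholder data:
set.seed(1)
dat        <- data.frame(y = rnorm(50), x1 = rnorm(50), x2 = rnorm(50), x3 = rnorm(50))
predictors <- c("x1", "x2", "x3")

n_boot     <- 1000
boot_coefs <- matrix(0, n_boot, length(predictors),
                     dimnames = list(NULL, predictors))

for (i in seq_len(n_boot)) {
  boot_dat <- dat[sample(nrow(dat), replace = TRUE), ]     # resample rows
  boot_fit <- step(lm(y ~ x1 + x2 + x3, data = boot_dat),
                   direction = "backward", trace = 0)
  kept     <- intersect(names(coef(boot_fit)), predictors)
  boot_coefs[i, kept] <- coef(boot_fit)[kept]              # dropped terms stay 0
}

# The usual "select, then estimate" coefficients vs. the bagged (smoothed) ones:
coef(step(lm(y ~ x1 + x2 + x3, data = dat), direction = "backward", trace = 0))
colMeans(boot_coefs)
```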
Don’t get me wrong, I definitely cannot claim any credit for this idea! Indeed, Efron’s paper talks about other related ideas from the statistical literature, so it’s not like I’m the first person ever to think along these lines. I’m just foolishly pleased with myself for managing to think of a crude but recognizable version of a good idea that actual statisticians have had.** 🙂 Plus, I thought the paper would be of sufficiently wide interest to be worth sharing in a post.
Note that I have no idea about the advantages and drawbacks of this idea relative to cross-validation or other ways to deal with model selection bias.
NeRd race to write an R package implementing Efron’s approach! On your mark, get set…go!
*Brian calls this Freedman’s Paradox. It’s also known as the Texas sharpshooter fallacy.
**Why yes, I do get pleased with myself quite easily. Why do you ask? 🙂 And if it turns out I’ve misunderstood Efron’s paper and his idea really is totally different than that half-baked one I had, don’t tell me. 🙂
Hi Jeremy,
Thanks for pointing out this paper, which is in fact in press at JASA. The paper itself is not actually proposing the bootstrap + model selection framework (which the abstract itself refers to as ‘bagging’ or ‘bootstrap smoothing’; coining two names for something new would be kind of silly), but rather a way to provide standard errors for those estimators without increasing the computational burden (the paper mentions the brute-force alternative, which would be to run a bootstrap on each individual bootstrap sample).
Thanks very much for the clarifications Peter, much appreciated. I’ll update the post.
Thanks for this and other posts. Bootstrapping (BS) is an alternative to cross-validation (CV), but one should be aware that you need a clearly defined method for model selection in order to apply it. The trouble is precisely that exploratory data analysis usually uses “ad hoc” methods (such as “this time series has a strong bend in it, so I’ll use a method appropriate for such behaviour”), and sometimes people think they are accounting for the model selection step by including, following the example, a group of alternative kernels (all of which adapt to strong bends) but none that is very smooth or allows jumps, etc., which they might have used if the data had looked different at first sight. This just means that all steps of data-dependent model selection should be replicated in the bootstrap.
CV is not magical and cannot get around this problem either, but sometimes it is easier to (truly) just not look at some points during model selection (or better, not give them to your analyst) and only use them later for validation.
Another problem shared by BS and CV is that they assume independence between observations (which sometimes is indeed reasonable). With time series, many economists use block CV (which uses “windows” of data), but I have yet to see applied work in ecology or genetics that uses the appropriate methods to account for spatial or phylogenetic dependence in their CVs and thus keep complex models from overfitting (e.g., climate envelope modeling and genomic selection).
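To make the block-CV idea concrete, here’s a minimal sketch for a time series: hold out contiguous windows rather than randomly scattered points, so the autocorrelation can’t leak information from the training set into the test set. The function name and the simulated AR(1) example are just for illustration.

```r
# Minimal block cross-validation for a time series: the held-out sets are
# contiguous windows, so short-range autocorrelation can't leak information
# from the training data into the test data.
block_cv_mse <- function(y, x, n_blocks = 5) {
  n      <- length(y)
  blocks <- split(seq_len(n), cut(seq_len(n), n_blocks, labels = FALSE))
  errs   <- numeric(length(blocks))
  for (b in seq_along(blocks)) {
    test  <- blocks[[b]]
    train <- setdiff(seq_len(n), test)
    fit   <- lm(y ~ x, data = data.frame(y = y[train], x = x[train]))
    pred  <- predict(fit, newdata = data.frame(x = x[test]))
    errs[b] <- mean((y[test] - pred)^2)
  }
  mean(errs)
}

# Example: a weak trend plus AR(1) noise.
set.seed(1)
n <- 100
x <- seq_len(n)
y <- 0.02 * x + as.numeric(arima.sim(list(ar = 0.8), n))
block_cv_mse(y, x)
```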
Sorry for going on at such length. Overall these are good methods and I hope they will be used more, but as more people get to know them, I’d like to see them used well.
“…one should be aware that you need a clearly defined method for model selection in order to apply it.”
Yep! And you’re quite right that this is an argument for just holding back some of your data during model selection.
“With time series, many economists use block CV (which uses “windows” of data), but I have yet to see applied work in ecology or genetics that uses the appropriate methods to account for spatial or phylogenetic dependence in their CVs”
Yep. Ecologists rarely do moving-blocks bootstrapping or CV (though I once looked into it for a side project in grad school; I think ecological statistician Subhash Lele has an old paper on the moving-blocks bootstrap). Brian and I have talked about this issue in the past in the context of spatial autocorrelation, and the use of moving-blocks bootstrapping or CV to correct for it. It’s a pet peeve of Brian’s because many statistical methods for predicting species distributions turn out to implicitly rely on the strong autocorrelation in the data (I’m guessing that’s what you were noting as well in your remark about climate envelope models often being overfit?). Unfortunately, in practice it’s of course difficult to do moving-blocks bootstrapping or anything similar on spatially autocorrelated ecological data, because your blocks have to be 100s of km wide, which is often most of the extent of the study area.
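For anyone curious what moving-blocks bootstrapping looks like in practice, here’s a minimal time-series sketch (the spatial version is the same idea with blocks in two dimensions, subject to the wide-block caveat above); the function name and the AR(1) example are just for illustration.

```r
# Minimal moving-blocks bootstrap for an autocorrelated series: resample
# contiguous blocks with replacement and stitch them together, so each
# bootstrap series keeps the short-range dependence within blocks.
moving_blocks_resample <- function(y, block_len = 10) {
  n      <- length(y)
  starts <- sample(seq_len(n - block_len + 1),
                   size = ceiling(n / block_len), replace = TRUE)
  idx    <- unlist(lapply(starts, function(s) s:(s + block_len - 1)))
  y[idx[seq_len(n)]]                             # trim back to the original length
}

# The naive (iid) bootstrap understates the uncertainty in the mean of an
# AR(1) series; the moving-blocks version gets much closer to the truth.
set.seed(1)
y <- as.numeric(arima.sim(list(ar = 0.8), n = 200))
block_means <- replicate(2000, mean(moving_blocks_resample(y)))
naive_means <- replicate(2000, mean(sample(y, replace = TRUE)))
sd(block_means)   # larger: reflects (much of) the autocorrelation
sd(naive_means)   # too small: pretends the observations are independent
```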
I agree that it would be nice if these methods were more widely known and used, though as you say just mindlessly relying on them as a panacea for model selection bias would be going too far.
Model selection bias can be an issue even in very simple, seemingly innocent situations, like checking the assumption of proportional hazards in a Cox model. In this paper, we suggest bootstrap/permutation resampling may be a solution to consider:
https://onlinelibrary.wiley.com/doi/full/10.1002/sim.6021