Model selection bias is the technical term for conflating exploratory and hypothesis-testing statistical analyses. Or, even less technically, circular reasoning.* If you use the data to select the best-fitting model from some set of candidates, you can’t then use the same data to test hypotheses about the values of the estimated parameters of the best-fitting model. Doing so inflates your Type I error rate. Equivalently, model selection causes the confidence intervals for your fitted parameter values to be too narrow.
This is true no matter how you do the model selection, even if you do it very informally. For instance, if you’re regressing y on x, and you eyeball the data and decide that it looks like a quadratic curve might fit well, you can’t then do a valid hypothesis test of whether the true quadratic coefficient is significantly different from zero. Of course it’s going to be “significantly” different from zero (at least, it probably will be). If that wasn’t the case, you wouldn’t have decided to fit a quadratic model in the first place!
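For concreteness, here’s a minimal R sketch of the sort of thing I mean. It’s a toy simulation of my own (not anything from Efron’s paper): the “eyeballing” step is stood in for by an AIC comparison, and all the numbers (sample size, number of simulations) are arbitrary choices for illustration.

```r
# Toy simulation: select a quadratic model only when it fits better than a
# straight line (a stand-in for "eyeballing" the data), then test the
# quadratic coefficient. Under pure noise, the nominal 5% test rejects far
# more often than 5% among the datasets where the quadratic model was chosen.
set.seed(1)
n_sims <- 2000
n <- 30
chose_quadratic <- rejected <- logical(n_sims)

for (i in seq_len(n_sims)) {
  x <- runif(n)
  y <- rnorm(n)                      # pure noise: no real relationship with x
  fit_lin  <- lm(y ~ x)
  fit_quad <- lm(y ~ x + I(x^2))
  chose_quadratic[i] <- AIC(fit_quad) < AIC(fit_lin)   # data-driven model choice
  if (chose_quadratic[i]) {
    p <- summary(fit_quad)$coefficients["I(x^2)", "Pr(>|t|)"]
    rejected[i] <- p < 0.05
  }
}

# Type I error rate, conditional on having selected the quadratic model:
mean(rejected[chose_quadratic])      # well above the nominal 0.05
```

The point of the simulation is just that the test is only ever run on datasets that already “looked quadratic,” which is exactly the circularity described above.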
The standard way to deal with this problem is to do model selection and hypothesis testing on separate datasets, either by collecting new data or by holding back part of the original dataset as a validation set. But in comments on an old post, I had a different idea:
Re: model selection bias, is it legit to take a randomization-based approach to correct for this? For instance, do a backwards elimination multiple regression (or whatever) on your original data. Then randomize the data 1000 times, each time running the same backwards elimination algorithm, to get distributions of the expected outcome of the entire model selection algorithm under the null hypothesis that there’s nothing but noise in the data. In particular, you’ll get a distribution of expected P values for the final selected model under the null hypothesis, to which you can compare the observed P value. You’ll get an estimate of the fraction of the time you’d expect any given predictor variable to be included in the final selected model under the null hypothesis. Etc.
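Here’s my own rough translation of that comment into R, just to make the idea concrete. It is a sketch, not a vetted procedure: the `run_pipeline()` function, the simulated dataset `dat`, and the choice of backwards elimination by AIC via `step()` are all my own illustrative assumptions.

```r
# Rough sketch of the randomization idea from the comment above: run the whole
# backwards-elimination pipeline on the real data, then re-run the identical
# pipeline on data with the response permuted (so any structure is destroyed),
# and compare what the pipeline reports to its null distribution.
run_pipeline <- function(dat) {
  full  <- lm(y ~ ., data = dat)
  final <- step(full, direction = "backward", trace = 0)  # backwards elimination by AIC
  coefs <- summary(final)$coefficients
  coefs <- coefs[rownames(coefs) != "(Intercept)", , drop = FALSE]
  list(kept  = rownames(coefs),
       min_p = if (nrow(coefs) > 0) min(coefs[, "Pr(>|t|)"]) else NA)
}

set.seed(2)
n <- 50
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
dat$y <- 0.5 * dat$x1 + rnorm(n)     # one real predictor, three noise predictors

observed <- run_pipeline(dat)

n_perm <- 1000
null_min_p <- numeric(n_perm)
null_kept  <- vector("list", n_perm)
for (i in seq_len(n_perm)) {
  perm <- dat
  perm$y <- sample(dat$y)            # permute y: the null of "nothing but noise"
  res <- run_pipeline(perm)
  null_min_p[i] <- res$min_p
  null_kept[[i]] <- res$kept
}

# Corrected P value: how often does the null pipeline produce a P value at
# least as small as the one the real pipeline reported?
mean(null_min_p <= observed$min_p, na.rm = TRUE)

# How often each predictor survives elimination under the null:
table(unlist(null_kept)) / n_perm
```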
This seemed legit to me, but I wasn’t sure if there was some technical problem I hadn’t thought of. Nor did I know if anyone else had ever had the same idea. Nobody on the thread really knew either.
Well, now we know: it’s legit. In fact, it’s so legit that the guy who invented the bootstrap, Brad Efron, just published a paper on it! (ht Andrew Gelman) His approach actually is somewhat different from what I suggested: he proposes bootstrapping the entire model selection process to get valid confidence intervals, rather than using repeated randomization to conduct valid null hypothesis tests. And he also shows how to use bootstrapping as a form of smoothing or model averaging. But the core idea is the same (as far as I can tell; I’m not a statistician, so I can only get the gist of Efron’s paper).
UPDATE: the above paragraph is a poor summary of the paper; my bad. The paper isn’t the first to propose bootstrapping the model selection process; rather, it proposes a new way to get standard errors for those bootstrapped estimates without increasing the computational burden. Thanks to a commenter for clarifying.
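To give a flavor of the bootstrap version of the idea, here’s another small R sketch. To be clear, this is only the general “bootstrap the whole selection process” approach that Efron’s paper builds on, not his actual contribution (the cheap standard errors and smoothing formulas). The simulated data, the `fit_and_predict()` helper, and the focal point `x0` are all arbitrary choices of mine for illustration.

```r
# Sketch of bootstrapping the entire model selection process. The quantity of
# interest here is the fitted value at a focal point x0 (an arbitrary choice).
set.seed(3)
n <- 50
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$y <- 0.5 * dat$x1 + rnorm(n)
x0 <- data.frame(x1 = 1, x2 = 0, x3 = 0)

fit_and_predict <- function(d) {
  final <- step(lm(y ~ ., data = d), direction = "backward", trace = 0)
  predict(final, newdata = x0)
}

n_boot <- 1000
boot_est <- replicate(n_boot, {
  resampled <- dat[sample(nrow(dat), replace = TRUE), ]
  fit_and_predict(resampled)          # selection is redone inside every replicate
})

mean(boot_est)                        # "smoothed" estimate, averaging over the selected models
quantile(boot_est, c(0.025, 0.975))   # percentile interval that reflects selection uncertainty
```

The key feature is that the model selection step sits inside the bootstrap loop, so the resulting interval reflects the uncertainty introduced by selection itself, not just the uncertainty in the coefficients of one chosen model.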
Don’t get me wrong, I definitely cannot claim any credit for this idea! Indeed, Efron’s paper talks about other related ideas from the statistical literature, so it’s not like I’m the first person ever to think along these lines. I’m just foolishly pleased with myself for managing to think of a crude but recognizable version of a good idea that actual statisticians have had.** 🙂 Plus, I thought the paper would be of sufficiently wide interest to be worth sharing in a post.
Note that I have no idea about the advantages and drawbacks of this idea relative to cross-validation or other ways to deal with model selection bias.
NeRd race to write an R package implementing Efron’s approach! On your mark, get set…go!
**Why yes, I do get pleased with myself quite easily. Why do you ask? 🙂 And if it turns out I’ve misunderstood Efron’s paper and his idea really is totally different from that half-baked one I had, don’t tell me. 🙂