There has been a lot of discussion of researcher degrees of freedom lately (e.g. Jeremy here or Andrew Gelman here – PS by my read Gelman got the specific example wrong because I think the authors really did have a genuine a priori hypothesis but the general point remains true and the specific example is revealing of how hard this is to sort out in the current research context).
I would argue that this problem comes about because people fail to be clear about their goals in using statistics (mostly the researchers, this is not a critique of Jeremy or Andrew’s posts). When I teach a 2nd semester graduate stats class, I teach that there are three distinct goals for which one might use statistics:
- Hypothesis testing
- Prediction
- Exploration
These three goals are all pretty much mutually exclusive (although there is some overlap between prediction and exploration). Hypothesis testing is of course the most common scenario, and I’ve already pontificated extensively on how ecology needs more prediction (see these posts: I, II, III, IV for more on this topic). Here I want to focus on hypothesis testing vs exploration.
Here is the key point – the goal should be determined a priori, before starting to analyse the data. You can actually use the same technique (for example, OLS linear regression) to do hypothesis testing, prediction or exploration. But you CANNOT use it to do all three goals on one dataset! It is a statistical and even ethical no-no to sit down, do some data mining (also called data dredging) until you find a statistical relationship, then test that same relationship for statistical significance on the same data.
I think of this as Freedman’s paradox (Freedman 1983). Freedman wrote a paper showing, through some very heavy analytical methods and also through some fairly intuitive simulations (I force my students to read this paper and they all bog down on the analytical section but get and appreciate the simulations), that if you do variable selection first and then significance testing on the selected variables, the statistical significance will be wildly overstated, i.e. your p-values will come out far too small (and yes for prediction folks, even the r² will be too high). In fact, he explicitly shows that if you take random variables that have no correlation with your dependent variable and do variable selection on enough of them, you will essentially ALWAYS get a p<0.05 (and a high r²). This is a fairly intuitive extension of the fact that if you randomly generate two unrelated variables and regress one on the other, you will get a significant result 5% of the time (on any test that has its Type I error rate calibrated correctly, such as ANOVA/regression). Well, if you have dozens of variables, you will have hundreds of possible combinations, and one of them will come home as significant by chance.
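To make the intuition concrete, here is a minimal sketch in Python. It is NOT Freedman's actual setup (he screened variables in a multiple regression); it is a simplified one-variable-at-a-time version, and every name and number in it (`simulate`, 100 observations, 50 noise predictors, 200 replicates) is an assumption chosen for illustration. The p-value uses the Fisher z approximation for a correlation coefficient rather than the exact t-test, so the rates are approximate.

```python
import math
import random
from statistics import NormalDist


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def approx_p_value(r, n):
    """Two-sided p-value for r via the Fisher z approximation."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def simulate(n_obs=100, n_predictors=50, n_reps=200, alpha=0.05, seed=1):
    """Return (a priori rate, post-selection rate) of 'significant' results.

    Everything is pure noise: y and all predictors are independent draws.
    """
    rng = random.Random(seed)
    hits_a_priori = 0
    hits_selected = 0
    for _ in range(n_reps):
        y = [rng.gauss(0, 1) for _ in range(n_obs)]
        rs = []
        for _ in range(n_predictors):
            x = [rng.gauss(0, 1) for _ in range(n_obs)]
            rs.append(pearson_r(x, y))
        # A priori: the predictor was fixed before looking at the data.
        if approx_p_value(rs[0], n_obs) < alpha:
            hits_a_priori += 1
        # Post-selection: pick the best-looking predictor, then "test" it.
        best = max(rs, key=abs)
        if approx_p_value(best, n_obs) < alpha:
            hits_selected += 1
    return hits_a_priori / n_reps, hits_selected / n_reps


if __name__ == "__main__":
    a_priori, selected = simulate()
    print(f"a priori test significant:      {a_priori:.2f}")
    print(f"selected-then-tested significant: {selected:.2f}")
```

The a priori rate should hover near the nominal 5%, while the select-then-test rate should be far higher, even though there is no real relationship anywhere in the data.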
This is why people say (correctly) that you should never report a p-value if you’ve already done a variable selection process (whether it is stepwise OLS or something fancier and more machine-learning like such as regression trees or MARS). And this is the big critique of “researcher degrees of freedom” – that researchers, usually unintentionally and subconsciously, make decisions (including variable selection but also including data selection, etc) that land them squarely in Freedman’s paradox, producing meaningless but seemingly statistically significant results.
What to do about this? Well there are two choices really. One is to go towards option #1 – hypothesis testing. Anytime you are publishing a p-value you are claiming to test a hypothesis. But the point of this post, researcher degrees of freedom, and Freedman’s paradox is that you only get to test one hypothesis. Not test 37, then report 1. Indeed if you test 37 and only report 1, that is an ethical lapse. If you test 37, and report all 37, then you are at least being honest, but your reviewers are then likely to mutter about corrections for multiple comparisons and toss you out, as they should. The temptation is to let this pull you back into testing 37 but only reporting 1, but DO NOT DO IT! You only get to test 1 hypothesis EVER for a question for a set of data. This means if you’re going to spend 4 seasons in the field leading to one test, it probably ought to be a really good hypothesis. In principle, the statistics are fine if you conceive your hypothesis after your field seasons but before you run any statistics (assuming the data are complex enough that your brain is not calculating statistics while you’re in the field). But in practice (in part because of the danger of calculating statistics in your head while you collect data and in part because you only get one test), it ought to be an A PRIORI test conceived before you start collecting data. And to make this rigorously honest, you probably ought to write down that hypothesis and communicate it to others before you start collecting data. That way if anybody ever questions your integrity on this, you can prove you have done it correctly.
Writing the hypothesis down in advance would have nicely resolved the debate between Gelman and the researchers he criticized. They were able to point to prior examples of the hypothesis in the literature, which in my mind goes a long way to validating their claim to hypothesis testing, but wouldn’t their answer have been that much stronger if they could say “we wrote the hypothesis in a grant and stated this was the only hypothesis we would test”?
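The arithmetic behind "test 37, report 1" is easy to sketch. The snippet below assumes the 37 tests are independent and that every null hypothesis is true, which real analyses rarely satisfy exactly; the function names are made up for illustration, and Bonferroni is just one common correction a reviewer might ask for.

```python
def familywise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one false positive across n independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test threshold that keeps the familywise rate at or below alpha."""
    return alpha / n_tests


if __name__ == "__main__":
    print(f"P(at least one p<0.05 in 37 tests): {familywise_error_rate(37):.2f}")
    print(f"Bonferroni per-test threshold:      {bonferroni_threshold(37):.5f}")
```

Under those assumptions, 37 tests of pure noise give you roughly an 85% chance of at least one "significant" result, which is why reporting only the winner is so misleading.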
Sadly the genuine a priori hypothesis is rare, as we all know and as nicely written up here by Smallpond. Many researchers have been led to believe that option #1, hypothesis testing of an a priori hypothesis, is the only acceptable form of science. So when they don’t do it (for good reasons or bad), they fake it. And it is not always the author’s fault. I am a co-author on a paper where we had no clue which of 20 variables would be most important, since we were looking at something nobody had looked at before. So we used exploratory techniques, never reported a p-value and never claimed to test a hypothesis, but did explain why we found some really cool results that should be tested further. Even so, one reviewer said it needed to be reframed in a more hypothetico-deductive framework (this despite their knowing we did not start out with a priori hypotheses).
My proposed solution is that we need to better educate people (and reviewers) about option #3 – exploratory statistics (and option #2, prediction, but I’ve already written those posts). Exploratory statistics is a perfectly valid approach to science. Indeed in fields at the frontier where we know little, it may be the only valid approach. Yet we are so obsessed with hypothesis testing and #1 that even when we do #3 (and are careful not to report p-values) we are still expected to dress it up as #1.
If exploratory statistics weren’t treated like the crazy uncle nobody wants to talk about and everybody is embarrassed to admit being related to, science would be much better off. People could start being up front about using exploratory statistics. Then research projects could be designed in one way or the other from the beginning (I dare you to try and write an NSF grant that is up front about using exploratory statistics these days). And we could start having real conversations about whether hypothesis testing or exploratory statistics was more appropriate in the context of a particular question and paper. And we would stop seeing people doing exploratory statistics and trying to pretend like they were doing hypothesis testing and getting in trouble with their statistics (e.g. reporting p-values on variable selection methods).
What would bringing exploratory statistics out in the open look like? Well mostly it would involve being honest about how we thought about and tackled a question. It would involve candid conversations about whether we had an a priori hypothesis or if we had no clue when we started. It would also involve different types of reporting (no p-values). It would probably involve breaking our obsession with p-values a bit (google “p-values are evil” if you want some good blog entry points on why this is a good thing). It would also open up the stats toolbox to include more use of things like regression trees and spline regression and principal component analysis (all of which to my knowledge you still, thank goodness, cannot get a p-value out of in R). So we could stop having to answer questions like this (which was properly answered). But it would also continue using lots of techniques that also work in hypothesis testing, like linear regression (although it would probably tip the balance toward something like AIC rather than p-values) – in these cases the difference would be the conversation and the reporting of results, not the analysis.
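As a rough illustration of what "AIC rather than p-values" could look like in an exploratory write-up, here is a small Python sketch. The data are simulated and entirely hypothetical (one predictor, `x_signal`, really drives the response; the `x_noise` ones do not), and the AIC formula is the usual Gaussian least-squares form with constants dropped. It is one way of doing this, not a prescription.

```python
import math
import random


def ols_rss(x, y):
    """Residual sum of squares from a simple regression of y on x."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    return sum((b - intercept - slope * a) ** 2 for a, b in zip(x, y))


def gaussian_aic(rss, n, n_params):
    """AIC for a Gaussian least-squares model, up to an additive constant."""
    return n * math.log(rss / n) + 2 * n_params


def rank_by_aic(n_obs=60, n_noise=5, seed=3):
    """Rank candidate one-predictor models by AIC (lower is better)."""
    rng = random.Random(seed)
    signal = [rng.gauss(0, 1) for _ in range(n_obs)]
    y = [2.0 * s + rng.gauss(0, 1) for s in signal]
    candidates = {"x_signal": signal}
    for j in range(n_noise):
        candidates[f"x_noise{j}"] = [rng.gauss(0, 1) for _ in range(n_obs)]

    my = sum(y) / n_obs
    null_rss = sum((b - my) ** 2 for b in y)
    # Parameter counts include the error variance: 2 for the null, 3 with a slope.
    table = [("intercept only", gaussian_aic(null_rss, n_obs, 2))]
    for name, x in candidates.items():
        table.append((name, gaussian_aic(ols_rss(x, y), n_obs, 3)))
    return sorted(table, key=lambda row: row[1])


if __name__ == "__main__":
    for name, aic in rank_by_aic():
        print(f"{name:15s} AIC = {aic:8.2f}")
```

The output is a ranked table of candidate models with no p-value in sight; the honest framing is "these are the relationships worth testing on new data", not "these are significant".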
So, go ahead, make an open and public confession. I use exploratory statistics and I’m proud of it! And if I claim something was a hypothesis it really was an a priori hypothesis. You can trust me because I am out and proud about using exploratory statistics.