There has been a lot of discussion of researcher degrees of freedom lately (e.g. Jeremy here or Andrew Gelman here – PS by my read Gelman got the specific example wrong because I think the authors really did have a genuine *a priori* hypothesis but the general point remains true and the specific example is revealing of how hard this is to sort out in the current research context).

I would argue that this problem comes about because people fail to be clear about their goals in using statistics (mostly the researchers, this is not a critique of Jeremy or Andrew’s posts). When I teach a 2nd semester graduate stats class, I teach that there are three distinct goals for which one might use statistics:

- Hypothesis testing
- Prediction
- Exploration

These three goals are all pretty much mutually exclusive (although there is some overlap between prediction and exploration). Hypothesis testing is of course the most common scenario, and I’ve already pontificated extensively on how ecology needs more prediction (see these posts: I, II, III, IV for more on this topic). Here I want to focus on hypothesis testing vs exploration.

Here is the key point – the goal should be determined *a priori*, before starting to analyse the data. You can actually use the same technique (for example, OLS linear regression) to do hypothesis testing, prediction or exploration. But you CANNOT use it to pursue all three goals on one dataset! It is a statistical and even ethical no-no to sit down, do some data mining (also called data dredging) until you find a statistical relationship, then test it for statistical significance.

I think of this as Freedman’s paradox (Freedman 1983). Freedman wrote a paper showing, through some very heavy analytical methods and also through some fairly intuitive simulations (I force my students to read this paper; they all bog down on the analytical section but get and appreciate the simulations), that if you do variable selection first and then significance testing on the selected variables, the statistical significance will be wildly overstated: the p-values come out far too small (and yes, for prediction folks, even the r² will be too high). In fact, he explicitly shows that if you take random variables that have no correlation with your dependent variable and do variable selection on enough of them, you will essentially ALWAYS get a p<0.05 (and a high r²). This is a fairly intuitive extension of the fact that if you randomly generate two independent variables and regress one on the other, you will get a significant result 5% of the time (on any test whose Type I error rate is calibrated correctly, such as ANOVA/regression). Well, if you have dozens of variables, you will have hundreds of possible combinations, and one of them will come home as significant by chance.
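To make the intuition concrete, here is a minimal sketch in R (it is not Freedman’s own simulation). It assumes 20 pure-noise predictors and 50 observations, numbers I picked only for illustration, and compares the false positive rate of a single pre-specified test against that of the “best” predictor chosen after looking at all 20. The names `slope_p`, `best_p` and `honest_p` are hypothetical.

```r
set.seed(1)      # arbitrary seed, only so the sketch is reproducible
n_obs  <- 50     # observations per simulated dataset (assumed)
n_vars <- 20     # candidate predictors, all pure noise (assumed)
n_sims <- 300    # number of simulated datasets

# p-value for the slope in a simple regression of y on a single predictor x
slope_p <- function(x, y) summary(lm(y ~ x))$coefficients[2, 4]

# Data dredging: try every predictor and keep only the smallest p-value
best_p <- replicate(n_sims, {
  y <- rnorm(n_obs)
  X <- matrix(rnorm(n_obs * n_vars), nrow = n_obs)
  min(apply(X, 2, slope_p, y = y))
})

# Honest version: one predictor, chosen before seeing the data
honest_p <- replicate(n_sims, slope_p(rnorm(n_obs), rnorm(n_obs)))

false_pos_selected <- mean(best_p < 0.05)    # should land near 1 - 0.95^20, about 0.64
false_pos_single   <- mean(honest_p < 0.05)  # should land near the nominal 0.05
print(c(selected = false_pos_selected, single = false_pos_single))
```

Nothing in the data differs between the two versions; the only difference is whether the variable was picked before or after looking.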

This is why people say (correctly) that you should never report a p-value if you’ve already done a variable selection process (whether it is stepwise OLS or something fancier and more machine-learning-like such as regression trees or MARS). And this is the big critique of “researcher degrees of freedom” – that researchers, usually unintentionally and subconsciously, make decisions (including variable selection but also including data selection, etc.) that have the effect of running afoul of Freedman’s paradox and producing meaningless but seemingly statistically significant results.

What to do about this? Well, there are two choices really. One is to go towards option #1 – hypothesis testing. Anytime you are publishing a p-value you are claiming to test a hypothesis. But the point of this post, researcher degrees of freedom, and Freedman’s paradox is that you only get to test one hypothesis. Not test 37, then report 1. Indeed if you test 37 and only report 1, that is an ethical lapse. If you test 37, and report all 37, then you are at least being honest, but your reviewers are then likely to mutter about corrections for multiple comparisons and toss you out, as they should. The temptation is to let this pull you back into testing 37 but only reporting 1, but DO NOT DO IT! You only get to test 1 hypothesis EVER for a question for a set of data. This means if you’re going to spend 4 seasons in the field leading to one test, it probably ought to be a really good hypothesis. In principle, the statistics are fine if you conceive your hypothesis after your field seasons but before you run any statistics (assuming the data are complex enough that your brain is not calculating statistics while you’re in the field). But in practice (in part because of the danger of calculating statistics in your head while you collect data and in part because you only get one test), it ought to be an A PRIORI test conceived of before you start collecting data. And to make this rigorously honest, you probably ought to write down that hypothesis and communicate it to others before you start collecting data. That way if anybody ever questions your integrity on this, you can prove you have done it correctly. That would have nicely resolved the debate between Gelman and the researchers he criticized (they were able to point to prior examples of the hypothesis in the literature, which in my mind goes a long way to validating their claim to hypothesis testing, but wouldn’t their answer have been that much stronger if they could say “we wrote the hypothesis in a grant and stated this was the only hypothesis we would test”?).
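The arithmetic behind “test 37, report 1” is short enough to sketch in R. This assumes the 37 tests are independent and all nulls are true, which real tests on one dataset rarely are, so treat the number as a rough guide. The `raw_p` values are made up for illustration.

```r
alpha   <- 0.05
n_tests <- 37

# Chance that at least one of 37 independent tests of true nulls comes out "significant"
familywise_error <- 1 - (1 - alpha)^n_tests

# Bonferroni: the per-test threshold that keeps the familywise rate at or below alpha
bonferroni_threshold <- alpha / n_tests

# The same correction applied to p-values (hypothetical values, illustration only).
# n = n_tests tells p.adjust that these four came out of 37 tests in total.
raw_p      <- c(0.001, 0.02, 0.04, 0.30)
adjusted_p <- p.adjust(raw_p, method = "bonferroni", n = n_tests)

print(round(familywise_error, 2))
print(signif(bonferroni_threshold, 2))
print(adjusted_p)
```

Under those assumptions the chance of at least one spurious “hit” is around 85%, which is why reviewers mutter.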

Sadly the genuine *a priori* hypothesis is rare, as we all know and as nicely written up here by Smallpond. Many researchers have been led to believe that option #1, hypothesis testing of an a priori hypothesis is the only acceptable form of science. So when they don’t do it (for good reasons or bad), they fake it. And it is not always the author’s fault. I am a co-author on a paper where we had no clue which of 20 variables would be most important since we were looking at something nobody had looked at before, so we used exploratory techniques and never reported a p-value and never claimed to test a hypothesis but did explain why we found some really cool results that should be tested further, but one reviewer said it needed to be reframed in a more hypothetico-deductive framework (this despite their knowing we did not start out with a priori hypotheses).

My proposed solution is that we need to better educate people (and reviewers) about option #3 – exploratory statistics (and option #2, prediction, but I’ve already written those posts). Exploratory statistics is a perfectly valid approach to science. Indeed in fields at the frontier where we know little, it may be the only valid approach. Yet we are so obsessed with hypothesis testing and #1 that even when we do #3 (and are careful not to report p-values) we are still expected to dress it up as #1.

If exploratory statistics weren’t treated like the crazy uncle nobody wants to talk about and everybody is embarrassed to admit being related to, science would be much better off. People could start being up front about using exploratory statistics. Then research projects could be designed in one way or the other from the beginning (I dare you to try and write an NSF grant that is up front about using exploratory statistics these days). And we could start having real conversations about whether hypothesis testing or exploratory statistics was more appropriate in the context of a particular question and paper. And we would stop seeing people doing exploratory statistics and trying to pretend like they were doing hypothesis testing and getting in trouble with their statistics (e.g. reporting p-values on variable selection methods).

What would bringing exploratory statistics out in the open look like? Well, mostly it would involve being honest about how we thought about and tackled a question. It would involve candid conversations about whether we had an a priori hypothesis or if we had no clue when we started. It would also involve different types of reporting (no p-values). It would probably involve breaking our obsession with p-values a bit (google “p-values are evil” if you want some good blog entry points on why this is a good thing). It would also open up the stats toolbox to include more use of things like regression trees and spline regression and principal component analysis (all of which, to my knowledge, you still, thank goodness, cannot get a p-value out of in R). So we could stop having to answer questions like this (which was properly answered). But it would also continue using lots of techniques that also work in hypothesis testing, like linear regression (although it would probably tip the balance toward something like AIC rather than p-values) – in these cases the difference would be the conversation and the reporting of results, not the analysis.

So, go ahead, make an open and public confession. I use exploratory statistics and I’m proud of it! And if I claim something was a hypothesis it really was an *a priori* hypothesis. You can trust me because I am out and proud about using exploratory statistics.

Nice post, admirably clear as always. Hopefully will help drum into people that “exploratory statistics” means more than just “trying out random analyses with no hypothesis in mind”. I think that’s what’s hard about getting people to see the issue with researcher degrees of freedom – they think the issue is much narrower than it actually is. (Which is why I do have one small quibble re: Gelman’s specific example. Yes, the researchers did have an a priori scientific hypothesis – but they didn’t have a priori, data-independent plans on how to analyze the data. Put another way, their a priori scientific hypothesis wasn’t sufficiently specific to fully specify the statistical analysis. That’s my understanding, anyway.)

I guess my question is, in practice how do you encourage people to be honest about the exploratory nature of many of their analyses? Because even if you don’t report a p-value, it’s certainly possible for you to talk about an exploratory result as if it’s a real discovery, rather than a possibility that needs to be tested by future work. I suppose one way is for journals to develop a new category of paper, called “exploratory papers”. Or maybe for someone to start a new journal called “Exploratory ecology”. Of course, then the question is if people would value such papers enough to write them, review them, care about them even a little when making hiring and tenure decisions, etc. I think they might value them if such papers developed really interesting new hypotheses for further testing, but probably not otherwise.

Teaching intro biostats for the first time, I’m still thinking about how best to convey this idea to students in an intuitive way. I think I’m going to go for a multi-pronged approach – make the same point in several different ways, some more formal than others. Simulate an illustrative multiple comparisons example. Show them that famous xkcd cartoon on multiple comparisons. Tell them that it’s circular reasoning to let the data tell you how they should be analysed, and then do that analysis on those data. Tell them about the Texas Sharpshooter Fallacy (shoot a rifle at the side of a barn, then go up to the barn, find a cluster of bullet holes, draw a target around it, and declare yourself a sharpshooter). Have them read Simmons et al. on researcher degrees of freedom. Have them read a couple of blog posts on fMRI brain imaging in neuroscience (apparently neuroscientists routinely bias their analyses by the way in which they pre-filter their data). And of course, have them read your post too. :-)

Hi Jeremy, to my mind it is all about changing the culture so that people aren’t embarrassed about using exploratory statistics (and don’t feel superior about dressing things up as a hypothesis test when it is not). As with any culture change I think this has to come from senior researchers with tenure. I think we all just need to stick our necks out and be honest. But I love your idea of an exploratory section of a journal!

And you are right about not just having a prior hypothesis but a prior analysis plan. I think good macroecologists know this. You set up your data exclusion rules a priori. And if you didn’t, you report it. In my 2008 paper on the energetic equivalence rule, I missed a really obvious data exclusion rule (i.e. didn’t think of it a priori to analysis – it just jumped out at me how stupid I was not to have thought of it when I did the analysis – namely invasive starlings and house sparrows are exceptions). I reported the analysis with and without this exclusion rule and acknowledged it as a post hoc removal. This is the only way to do it ethically (and not surprisingly nearly every reader I’ve talked to was convinced it was legitimate). But this depended on my being honest about my thought process. It would have been the easiest thing in the world to just drop them and make it look a priori.

Been thinking further about changing the current culture, in which people feel obliged to present everything as a pre-planned hypothesis-testing analysis and so end up presenting incorrect P-values. I totally agree that we don’t want people publishing exploratory analyses dressed up as hypothesis-testing analyses (and from the reactions you’re getting to the post it sounds like nobody – at least nobody on Twitter – wants to publish such analyses either!). But do we want people publishing exploratory analyses as exploratory analyses? My question here does *not* arise from questioning the scientific value of exploratory analyses. But I question whether everything that’s of scientific value should be published. (Joan Strassmann recently asked much the same question about peer reviews and early draft mss: http://sociobiology.wordpress.com/2013/09/30/it-isnt-always-useful-to-publish-early-drafts-or-to-share-reviews/.) For instance, I can definitely see the value of publishing exploratory analyses that lead to the development of interesting new hypotheses. But isn’t the resulting hypothesis the thing that’s most useful for other people to read? And don’t we already value and publish papers developing new hypotheses (by whatever means)? Are there any other reasons why others would want to read about your exploratory analyses? And as for grant applications, don’t NSF and NIH already put high value on exploratory analyses, in the form of preliminary data? After all, what’s preliminary data but data that suggests a hypothesis or conclusion, that you plan to test or confirm by collecting further data? Or should exploratory analyses be valued for, and published and included in grants for, other purposes in addition to hypothesis generation?

I guess I’m asking you (and other readers) to keep following the interesting line of thought you started at the end of the post. In a world in which we all value exploratory analyses and don’t hide them in the closet or dress them up as hypothesis-testing analyses, what do we do with them, and why?

Tangent re: your remark that, “as with any culture change, it has to come from senior researchers with tenure”. I can totally see why you say this. But are you sure it’s true? I’m thinking of data sharing, for instance. Would you say that the push for that came primarily from senior researchers with tenure? I have no idea, actually. But I had the vague impression that that culture change came more from younger folks, or at least not primarily from senior folks.

You raise a good question. At the most simplistic level, I don’t think all exploratory analysis is interesting or successful or deserving to be published by any stretch! Of course nor do I think that about hypothetico-deductively framed work either.

As per my conversation below with Steve, stuff that gets published should be novel, interesting, important and move science forward. Really all I’m saying is that collecting data, exploring it, and then producing a hypothesis described in discussion instead of the introduction should be publishable in its own right if that is how things happened (without dressing it up as a hypothesis test).

I personally don’t know anybody who frames preliminary data in a grant as I explored it and came up with a hypothesis which I now want you to pay me to test. Every grant I see presents it as, I had a prior hypothesis and then I collected a little data to prove to you it is worth paying me to collect a lot of data to test it.

RE who produces change. You may be right. But it comes at much greater cost to junior academics. It would be nice if senior academics actually stepped up to the plate once in a while. The power dynamics are pretty severe in academia. I sometimes think academia is one of the most conservative, change-resistant institutions in society.

Interestingly, judging by early comments on Twitter, the response to your post is going to be larger and more positive than for any of mine. Which I suppose is a lesson in “messaging” for me. I’ve been telling people about stuff they shouldn’t do. You said exactly the same things about what they shouldn’t do–but at the end also told them something they could do (do exploratory analyses and report them as such, without P-values). Your way seems to be going over much better than my way.

Not to be critical of your posts, Jeremy, because they generate a tremendous amount of thoughtful discussion, but I agree that Brian’s post here has a very nice positive message. He does a great job of framing the post in terms of good approaches to science and analysis while still identifying the things that must be avoided. Great job, Brian! I think this in combination with a couple of the prediction posts and/or associated primary literature would make for a great lab group discussion.

Thanks for the great post. I often see people do AIC then report p-values for the covariates of the best model. If the AIC model selection was really the *a priori* hypothesis test, as often used in wildlife studies, then should that be the single test from that data set (i.e. Model A has a better balance of fit and complexity than Models B or C)? Would you consider adding the secondary hypotheses relating to whether each covariate differs significantly from zero as a problem? If AIC is used in this fashion, should the coefficients and their confidence intervals be reported as a measure of effect size but without p-values?

Hi Daniel – good practical question!

Using AIC to choose a model is a form of variable selection. As such, any later tests will have inflated significance (the p-values will come out too small). Therefore testing which variables are statistically significantly different from zero is wrong.

I’m not aware of a specific p-value-correction technique to use after variable selection has happened, although there might be one embedded in all of Freedman’s math (and a sufficiently complex resampling process could probably get you there). But like all p-value corrections it is qualitative rather than quantitative. Which is to say if your p-value is p<0.000000001 it would probably still be significant after the correction, but I can’t tell you where the line is, making this a slippery slope. It’s also a matter of degree – if you selected 2 variables out of 4 the problem is qualitatively less than if you selected 7 variables out of 25. This leads to greater or lesser corrections.

Personally, I am happy with the view that a p-value is a courtesy statistic conveying information to the reader to use as they see fit (which is very different than the rigorous p<0.05 kind of frame of mind) – it is a mix of effect size and sample size which is useful and familiar even if it is not strictly rigorous. But this is not how they are conventionally interpreted! One needs to be very explicit that this is what one is doing when reporting them this way (indeed I would call this dangerous waters, but still worth doing sometimes).

In short the safest would be to do as you say and only report effect sizes. But you could report p-values and make a clear statement about how they should (informational but approximate) and should not (i.e. hypothetico-deductive) be interpreted.
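As a rough sketch of what the “effect sizes only” option might look like in R: the toy data, the three candidate models A, B and C, and the variable names below are all made up for illustration. Note that confidence intervals computed after selection inherit some of the same optimism as the p-values, so they too are best read as descriptive.

```r
set.seed(42)   # arbitrary seed, only so the sketch is reproducible
n   <- 80
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$y <- 1 + 0.8 * dat$x1 + rnorm(n)   # in this toy dataset only x1 matters

# Candidate models, standing in for a set written down before the analysis
candidates <- list(
  A = lm(y ~ x1, data = dat),
  B = lm(y ~ x1 + x2, data = dat),
  C = lm(y ~ x1 + x2 + x3, data = dat)
)

# Compare by AIC and keep the model with the lowest value
aic_values <- sapply(candidates, AIC)
best       <- candidates[[which.min(aic_values)]]

# Report coefficients with 95% intervals as effect sizes, and no p-values
effect_sizes <- cbind(estimate = coef(best), confint(best, level = 0.95))
print(round(aic_values, 1))
print(round(effect_sizes, 2))
```

The output is a table of AIC values plus estimates with intervals, which is all the reader needs to judge the size of the effects.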

That makes sense and seems like a very reasonable approach. Thanks again.

This was my exact question. Thanks for the great post and responding to these thoughtful responses.

And a second thought – just to follow the goals a little, once you’ve explicitly admitted the goal is exploratory (which of these models is best – I don’t have a prior hypothesis) what is the value of testing whether a chosen variable is statistically significantly different than zero? (other than of course having a sentence “p<0.05" to get the article published, sigh). Once you've admitted you're exploring and chosen the best model, that ought to be a good stopping point. It then becomes a hypothesis which can be usefully used a priori in future work. That would be my philosophical take but I recognize it might not be the easiest route to get published pushing this philosophy. The p-value as informative but not accurate and test-worthy idea in my last reply might be a compromise.

Love this: “If exploratory statistics weren’t treated like the crazy uncle nobody wants to talk about and everybody is embarrassed to admit being related to, science would be much better off. “

This will be a great link because it precisely sums up my continuous struggle with faculty in how we train students and as a reviewer for manuscripts. It doesn’t help that NSF-Ecology requires proposals to be “hypothesis-driven” (this is not true of other panels, at least explicitly). The consequence of pigeon-holing a study into hypothesis-driven framework is that very weak hypotheses are formulated. These are generally trivial (“I hypothesize a difference between A and B”) or inductive (“study A found this effect so my hypothesis is that I’m going to find it too”) instead of a strong hypothesis constructed from good (and generally quantitative) theory. And since the study wasn’t really designed to test a strong hypothesis, there is very little ability to reject alternative hypotheses (that is, the data could fit multiple hypotheses).

I hadn’t seen Freedman’s paper. I think much of researcher degrees of freedom is a little different from the random noise. Also, his results are very sample size dependent. A dataset with N=100 and p=50 is pretty sparse so R^2 has to be high. Just write an R script to re-do his analysis with N=1000 or 10000. R^2 drops very quickly. And the number of significant P-values in round 2 drops to the expected (alpha) value (at least as a fraction of p and not the number of “retained” variables in the second pass). I don’t know how to format the results but here they are for 1000 iterations at each level of N (.1 refers to pre-selection and .2 refers to post-selection, “n” refers to the number of variables with P < .25 or .05)

| N | R2.1 | n25.1 | n05.1 | R2.2 | n25.2 | n05.2 |
|---|---|---|---|---|---|---|
| 100 | 0.50 | 12.38 | 2.374 | 0.304 | 10.7 | 5.391 |
| 1000 | 0.05 | 12.36 | 2.533 | 0.035 | 11.6 | 2.810 |
| 10000 | 0.005 | 12.53 | 2.491 | 0.0036 | 12.2 | 2.536 |

Good point about sample size in Freedman’s paradox. Very relevant to the evaluation of researcher degrees of freedom. And R code always appreciated!

I’m not the most efficient R scripter!

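```r
# freedman: regress noise on noise, screen at P <= 0.25, then refit
niter  <- 1000                   # iterations per sample size
narray <- c(100, 1000, 10000)    # sample sizes N
p      <- 50                     # number of pure-noise predictors
sumtable <- data.frame(NULL)
for (which_n in seq_along(narray)) {
  n <- narray[which_n]
  mytable <- data.frame(NULL)
  for (iter in 1:niter) {
    x <- matrix(rnorm(n * p), nrow = n)   # predictors unrelated to y
    y <- rnorm(n)
    # Pass 1 (".1"): full model with all p predictors
    res   <- summary(lm(y ~ x))
    R2.1  <- res$r.squared
    pvals <- res$coefficients[2:(p + 1), 4]
    inc   <- which(pvals <= 0.25)         # variables retained for pass 2
    n25.1 <- length(inc)
    n05.1 <- sum(pvals <= 0.05)
    # Pass 2 (".2"): refit with only the retained predictors
    # (assumes at least one variable is retained, which is all but certain with p = 50)
    res   <- summary(lm(y ~ x[, inc]))
    R2.2  <- res$r.squared
    pvals <- res$coefficients[2:(n25.1 + 1), 4]
    n25.2 <- sum(pvals <= 0.25)
    n05.2 <- sum(pvals <= 0.05)
    mytable <- rbind(mytable, data.frame(N = n, R2.1 = R2.1, n25.1 = n25.1, n05.1 = n05.1,
                                         R2.2 = R2.2, n25.2 = n25.2, n05.2 = n05.2))
  }
  # Average over iterations: one summary row per sample size
  sumtable <- rbind(sumtable, apply(mytable, 2, mean))
  colnames(sumtable) <- colnames(mytable)
}
```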

Thanks Jeff! I’ll bet you that code gets run often (it is a great way to develop intuition about the topic at hand).

By the way, I wonder how long it took Freedman’s tech to write the code and run 10 iterations of N=100?

I don’t have a background in statistics or experimental methods (beyond undergrad physics training), so maybe this question is completely naive but why don’t people just do cross-validation? Whenever you come back from your field study, randomly partition your data into sets, do whatever crazy exploratory statistics you want on the first set to generate an awesome hypothesis, and then do hypothesis testing of that with your second set. This is usually the most basic step in machine learning, why can’t it be used in experimental methods? Or am I missing something obvious? It looks to me like replicating a study or testing a hypothesis that already appeared in the literature is basically this same cross-validation except it is easier to check that the two sets were separated fairly.

I think it’s a sample size issue. Especially for field data where you have to put a lot of effort into a sample of 50 or 100.

I think ATM has given the main practical problem with this idea, but I do think ecologists ought to be much more open to the practice of cross-validation. It is a very powerful technique, and it doesn’t have to be a 50-50 split – oftentimes 70/30 (test on the 30% held out) is good enough.
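For anyone who wants to try the split-sample idea from this thread, here is a minimal sketch in R under assumed conditions: 200 observations, 10 candidate predictors that are all pure noise, and a 70/30 split. The data and names are hypothetical. The one rule that matters is that the held-out rows are touched exactly once, for the single hypothesis that exploration produced.

```r
set.seed(7)   # arbitrary seed, only so the sketch is reproducible
n <- 200
p <- 10
dat   <- as.data.frame(matrix(rnorm(n * p), nrow = n))   # columns V1..V10, all noise
dat$y <- rnorm(n)

# 70% for exploration, 30% held out and not looked at until the very end
explore_rows <- sample(n, size = round(0.7 * n))
explore <- dat[explore_rows, ]
holdout <- dat[-explore_rows, ]

# Exploration: dredge freely on the exploration set and pick the best-looking predictor
predictors <- setdiff(names(dat), "y")
explore_p <- sapply(predictors, function(v) {
  summary(lm(reformulate(v, response = "y"), data = explore))$coefficients[2, 4]
})
chosen <- names(which.min(explore_p))

# Confirmation: exactly one pre-specified test on data that played no part in the choice
confirm_fit <- lm(reformulate(chosen, response = "y"), data = holdout)
confirm_p   <- summary(confirm_fit)$coefficients[2, 4]
print(c(explore = min(explore_p), confirm = confirm_p))
```

Because every predictor here is noise, the exploration p-value will usually look much better than it deserves, while the held-out p-value behaves like an honest single test. The cost is the one ATM points out: with 50 or 100 field samples, giving up 30% hurts.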

I agree, sample size is the main problem, I like the rules of thumb in “Regression modeling strategies” by Harrell. I’d like to point out that when I have sufficient data, I like to do a classic validation on data never used for calibration (not cross-validation). My point is that for prediction purposes, at some point you want to predict data for which you do not have the response variable (prediction, of course). So, at some point, I want to have a confirmation that my model is predicting well. You would be surprised how mistakenly confident you are about the predictive capabilities of your model after cross-validation, but before “real” validation.

Hi Simon – good points. Especially the distinction between cross-validation and validation on separate (or permanently held out) data. I believe in theory these should give the same results, but in practice, for reasons such as my post on spatial autocorrelation and machine learning, cross-validation definitely overfits in many real-world situations (happening to me now on a project for spatial interpolation of climate as well).

This is a great conversation to have because my sense is that most people aren’t nefariously misusing exploratory statistics but rather don’t quite understand the implications of what they do prior to publication. Practically speaking, it can be especially difficult for students because as one proceeds through a thesis or dissertation, writing small grant proposals, giving talks, etc., one is forced to meet and greet his/her data in a way that will inform future analyses, and my experience suggests that most students are simply not prepared to have all their ducks in a row before they get in the field or are required to run preliminary analyses for one reason or another.

One other potentially valuable use of exploratory statistics in wildlife and ecology relates to your study objectives. If you are getting paid to understand, say, elephant resource selection within a national park, then one could argue that you owe it to the park to stray from a priori hypothesis testing. In essence, in such a case you might want to err on the side of over- rather than under-fitting your data.

Agree on both points. I think you’re right. You have to be an experienced researcher who has been bitten once or twice by failing to make a decision a priori (be it about variables or data subsetting, or …) to really get how much thought you need to put into those up front.

And a great example of where the goal of exploration aligns with the people paying for your research.

Great post. I love your vision, but think you are giving us the hard sell without pointing out what we might lose if we follow it. In particular, we’ll have a lot more honesty, but also a lot more subjectivity. Instead of publishing an exploratory analysis if p is less than 0.05 (which is bad and dishonest), we would need to switch to publishing it if it’s interesting (which is subjective). I think Jeremy was getting at this above… ultimately, we would need to get back to something that can be objectively tested. As I’ve said before, I’m neither arguing for nor against p-values, but just that it’s nice to have a clear criterion for measuring wrongness (e.g. p greater than 0.05), regardless of the degree to which the criterion is philosophically justifiable. It’s like how it’s good for countries to have both a system of law and order (~ p-values), as well as a democratic right to civil disobedience (~ data exploration).

Also re: “1. Hypothesis testing 2. Prediction 3. Exploration”. What about estimation? I think if an exploration can make arguably precise estimates, then that’s the best of both worlds — good measures of wrongness (are the interval estimates really wide?) and no need to pretend like you had a specific quantitative hypothesis all along. Not that you can’t have researcher degrees of freedom problems with estimation (e.g. cherry pick covariates that give you the estimates you want).

It’s a tough topic.

(Moderator’s note: Steve originally posted a comment with typos and then posted a corrected version in reply to himself. I’ve taken the liberty of replacing the original comment with the corrected version to clean up the thread.)

Hi Steve – as always, important points.

To your first point, I would say simply, yes, we should focus on publishing interesting stuff (which is inherently subjective) (and regardless of whether it is exploratory or hypothesis testing), and stop using objective criteria like p<0.05 to decide what to publish. I can't tell you how many uninteresting papers I've read that have p<0.05 but vs a null hypothesis of something along the lines of fertilizer had an effect, p<0.05. I'd way rather read an exploratory paper that takes us somewhere new. It does, I have to acknowledge, require more trust in reviewers to decide what is interesting. But in this day and age of way too many papers to ever read them all, isn't that part of what reviewers could and should be doing anyway? I feel like I must be missing some of your point going with such a one dimensional answer, but there it is.

Estimation I would argue can be used in the service of any of the three goals. But it certainly is friendly to the prediction and exploration goals. And in other fields like physics you can make a career getting a better estimate of the gravitational constant or what not. I wouldn't mind seeing that happen in ecology. (of course then we'd have to figure out what was so important to estimate so precisely, which we haven't done yet I think).

Thanks for the response. In terms of viewing interestingness as being an intrinsically more important quality than objectivity (sorry if that’s not an accurate paraphrase of your position), I just can’t agree. As for the ecological gravitational constant, I don’t think we need anything that fundamental in order to focus on estimation. For example, estimating the density of a rare species in a nature preserve seems like something ecologists should care about that doesn’t involve hypothesis testing or prediction or exploration.

In thinking further about your proposal, I am excited by how empowering the message is (further evidenced by the length of the Twitter feed). But when taken to its logical extreme, your proposal seems to say that if you can tell good stories you don’t need to demonstrate that those stories have some sort of repeatability (devil’s advocate).

I’m reminded of an argument recently made by Gelman, that the problem isn’t that we are testing statistical hypotheses, but rather that we are testing the wrong ones. Usually we test the null hypothesis (no effect) against the constant effect model (the same effect everywhere). What we need, he argues, are hypotheses that include varying effects (e.g. mixed effects models, hierarchical Bayesian models). In ecology, we know that the null and constant effects models are always wrong. And that’s a source of the problem with p-values: they usually compare two models we know are always wrong. For example, we know that the strength of competition is different in different contexts. Therefore, a more honest and objective approach would be to estimate variation in the strength of competition.

Maybe there is a false dichotomy between hypothesis testing and exploration? Rather, we need to explore hypotheses that are more reasonable than the null and constant effect models. Mixed models can help here.

Hi Steve,

I certainly agree that testing a null model of “there is no difference” is pretty boring (and often useless).

As for the larger point, I guess I would push back on the dichotomy that p-values are objective while exploratory stats are a story. I can get p<0.05 in ways that are well done or badly done (e.g. data dredging, which is part of the point of this post, but also being a big-name PI who can afford 10,000 replicates that give p<0.05 even for the smallest effects, or testing a cheap null hypothesis, e.g. using chi2 to test for a checkerboard pattern rather than one of the more sophisticated null models). To push it to the extreme, and I guess this is Jeremy's point about researcher degrees of freedom, two different researchers with the same dataset might report different p-values, so how objective are they? Conversely, exploratory stats don't have p-values but there are objective metrics (e.g. R2, % of variance explained by PCA axes, etc). There are also ways in which exploration can be well done or poorly done (e.g. cross-validation or not, a priori thinking about biologically important variables to include even if this doesn't rise to a full hypothesis).
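To put rough numbers on the point about replicate counts, here is a minimal sketch of how sample size alone drives the p-value for a fixed, tiny effect. It assumes a simple two-group z-test with equal group sizes and a known common standard deviation, and it assumes the observed difference happens to equal 0.03 standard deviations; the function name `two_sample_p` and the specific numbers are illustrative choices, not anything from the post.

```python
import math

def two_sample_p(effect_sd_units, n_per_group):
    """Two-sided p-value from a z-test comparing two group means.

    Assumes equal group sizes and a known, common standard deviation,
    with the observed difference expressed in standard-deviation units.
    """
    # Standard error of a difference of two means is sqrt(2 / n) in SD units.
    z = effect_sd_units * math.sqrt(n_per_group / 2)
    # Two-sided tail area of the standard normal distribution.
    return math.erfc(abs(z) / math.sqrt(2))

# Same tiny effect (0.03 SD), increasingly large numbers of replicates.
for n in (20, 200, 10_000):
    print(n, round(two_sample_p(0.03, n), 3))
```

Under these assumptions the same negligible effect is nowhere near significant with 20 replicates per group and falls below 0.05 with 10,000, which is why the p-value on its own says little about whether the effect matters.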

But in the end, all I can really say is that when I read a paper and walk away, my total sense of how good/important the paper is cannot be reduced to a 1-D assessment of a single statistic (p-value or otherwise). It is a gestalt of how well done the statistics are, how big the effect sizes are/amount of variance explained, whether the results seem biologically credible, whether it changes how I view the ecological world, etc. (and yes, there are problems with each of those individual criteria). p<0.05 is on that list but fairly far down.

Fair enough. I guess all I can say is that I wish we had clearer guidelines for how to explore successfully. When testing hypotheses we have pretty clear guidelines…it’s just that, as you point out, we rarely have real a priori hypotheses. That’s why I make the link with estimation…because it doesn’t require pseudo-a priori hypotheses and we do have fairly good guidelines for when it is successful (i.e. narrow interval estimates), although it is not immune to researcher degrees of freedom (RDF). I think our common ground is that we both don’t like the dishonesty that is encouraged by a culture of phrasing all scientific questions in terms of standard statistical null hypotheses. In the end you may be right…the best general guideline could just be ‘whatever works’. Thanks for the discussion. It’s been really useful for me.
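As a rough illustration of the "narrow interval estimates" criterion for estimation, the sketch below computes a percentile bootstrap interval for a mean density from quadrat counts. Everything specific here is an assumption made for the example: the counts are simulated rather than real data, the function name `bootstrap_mean_ci` is hypothetical, and the percentile bootstrap is just one of several ways to build such an interval.

```python
import random
import statistics

def bootstrap_mean_ci(values, n_boot=2000, level=0.95, seed=1):
    """Return (mean, lower, upper) using a percentile bootstrap interval."""
    rng = random.Random(seed)
    n = len(values)
    # Resample the data with replacement and record the mean each time.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_boot)
    )
    lo_idx = int((1 - level) / 2 * n_boot)
    hi_idx = int((1 + level) / 2 * n_boot) - 1
    return statistics.fmean(values), means[lo_idx], means[hi_idx]

# Simulated (not real) counts of a rare species per quadrat.
sim = random.Random(0)
few_quadrats = sim.choices([0, 0, 0, 1, 2], k=20)
many_quadrats = sim.choices([0, 0, 0, 1, 2], k=400)

for label, counts in (("20 quadrats", few_quadrats), ("400 quadrats", many_quadrats)):
    mean, lower, upper = bootstrap_mean_ci(counts)
    print(label, round(mean, 2), round(lower, 2), round(upper, 2))
```

The interval width, rather than a p-value, is what tells you whether the survey effort was enough, and no null hypothesis has to be invented to report it.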

I have to agree the tools are both better and better communicated/understood around p-values than around exploration approaches. My point is that this won’t change until we start accepting exploration as valid. But you are 100% right that work is needed to bring exploration up to the level of sophistication we have for p-values. Bit of a chicken and egg problem, I suppose.

Re: model selection bias, is it legit to take a randomization-based approach to correct for this? For instance, do a backwards elimination multiple regression (or whatever) on your original data. Then randomize the data 1000 times, each time running the same backwards elimination algorithm, to get distributions of the expected outcome of the entire model selection algorithm under the null hypothesis that there’s nothing but noise in the data. In particular, you’ll get a distribution of expected P values for the final selected model under the null hypothesis, to which you can compare the observed P value. You’ll get an estimate of the fraction of the time you’d expect any given predictor variable to be included in the final selected model under the null hypothesis. Etc.

Is this legit, or is there some technical problem with it that I’m missing? And if it’s legit, why does no one do it? It wouldn’t be that computationally intensive for many ecological applications. Is it just that nobody’s ever taught to do it?
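For concreteness, the randomization idea above might be sketched as follows. This is only one possible reading of it, under several assumptions of mine: backward elimination is done by AIC rather than by p-values, the null is generated by permuting the response, and the final model's R2 stands in for its P value as the statistic compared against the null distribution. All function names are hypothetical and the data are simulated.

```python
import numpy as np

def fit_rss(X, y, cols):
    """Residual sum of squares of an OLS fit (with intercept) on the given columns."""
    design = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return float(resid @ resid)

def aic(X, y, cols):
    """Gaussian AIC up to an additive constant."""
    n = len(y)
    return n * np.log(fit_rss(X, y, cols) / n) + 2 * (len(cols) + 1)

def backward_eliminate(X, y):
    """Drop one predictor at a time for as long as doing so lowers AIC."""
    cols = list(range(X.shape[1]))
    best = aic(X, y, cols)
    while cols:
        scores = [(aic(X, y, [c for c in cols if c != d]), d) for d in cols]
        candidate, drop = min(scores)
        if candidate >= best:
            break
        best = candidate
        cols.remove(drop)
    return cols

def r_squared(X, y, cols):
    tss = float(((y - y.mean()) ** 2).sum())
    return 1.0 - fit_rss(X, y, cols) / tss

def selection_null(X, y, n_perm=1000, seed=0):
    """Rerun the whole selection procedure on permuted responses."""
    rng = np.random.default_rng(seed)
    obs_cols = backward_eliminate(X, y)
    obs_r2 = r_squared(X, y, obs_cols)
    null_r2 = np.empty(n_perm)
    inclusion = np.zeros(X.shape[1])
    for i in range(n_perm):
        y_perm = rng.permutation(y)  # breaks any real X-y relationship
        cols = backward_eliminate(X, y_perm)
        null_r2[i] = r_squared(X, y_perm, cols)
        for c in cols:
            inclusion[c] += 1
    # Fraction of null runs whose selected model fits at least as well.
    p_value = (1 + np.sum(null_r2 >= obs_r2)) / (n_perm + 1)
    return obs_cols, obs_r2, p_value, inclusion / n_perm

# Simulated example: 8 candidate predictors, only the first one matters.
rng = np.random.default_rng(42)
X = rng.normal(size=(40, 8))
y = 3.0 * X[:, 0] + rng.normal(size=40)
cols, r2, p, incl = selection_null(X, y, n_perm=200)
print("selected:", cols, "R2:", round(r2, 2), "corrected p:", round(p, 3))
print("null inclusion rates:", np.round(incl, 2))
```

The last line gives the fraction of noise-only runs in which each predictor survived selection, which is the "how often would this variable be included under the null" quantity mentioned above. As noted in the reply that follows, none of this accounts for how the initial list of candidate variables was chosen.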

I’ve never seen anybody attempt this, but it does seem like it would add some value. However, I fear it doesn’t go far enough. It randomizes across data selection, but it doesn’t randomize across the list of variables included to begin with. Which starts to get back to some deep philosophy and what is the claimed scope of a hypothesis test. But still I would give somebody who ran this test more benefit of the doubt than somebody who didn’t.

@Brian:

“It randomizes across data selection, but it doesn’t randomize across the list of variables included to begin with.”

Sure. But I doubt there’s any statistical way to correct for “variables I could’ve included in the analysis but chose not to”! As you say, dealing with the subtler aspects of “researcher degrees of freedom” ultimately takes one into very deep and murky waters, philosophically.

This reminds me of the old multiple-comparisons joke of controlling your career-wise error rate. Maybe we should randomize hypotheses among scientists, and see what distribution of p-values we get under that null! 8)

@stevencarlilewalker:

“Maybe we should randomize hypotheses among scientists, and see what distribution of p-values we get under that null!”

Is your null hypothesis that all scientists are equally good at finding statistically-significant effects? Because that strikes me as a great example of an obviously-false null hypothesis! ;-)

Touché Jeremy. 8)

Actually the more I think about it…the more I like my idea, in a funny sort of way. If we all had to spend some of our time testing other people’s hypotheses, we wouldn’t feel so emotionally attached to whether they were true or not and therefore might be less likely to torture the data to give us the result we want (a null model of a world without researcher degrees of freedom??). Maybe instead of peer review as a service, we should test other people’s hypotheses? On the other hand, we might be less persistent at pursuing an ultimately correct hypothesis. Hmmm… It’s obviously a silly idea, given that just testing a few of your own ideas is usually laborious enough, but I kind of like it.

+1

I may be missing something too, but I really like this idea.

Aha! It is legit! And feasible! So sez Brad Efron himself!:

http://statweb.stanford.edu/~ckirby/brad/papers/2013ModelSelection.pdf

I’m totally doing a post on this just so I can claim to have independently come up with the same* idea as Brad Efron.

*For a suitably-broad value of “same”.

But is doing an exploratory analysis and then claiming you were testing a specific, a priori hypothesis really all that common or big of a problem?

I see a lot of problems in science practice, but this is not one that I would put even in the top five or ten.

I don’t have any hard data (I don’t think anybody does). But judging from the reaction to this post, it seems like a lot of people think it’s very common and a real problem.

FWIW, a reviewer on my most recent paper eyeballed the x-y data in one of my figures and on that basis asked me to test whether a breakpoint regression would fit the data…

I don’t know that people often do completely open-ended exploratory analysis and then try to pass it off as a priori hypothesis testing. But I do think it’s quite common for people to fail to pre-specify some data-processing and analytical decisions. Which is just a milder version of the same problem.

Many thanks for this post! I really appreciate the explanation, but I don’t really understand your explanation of Freedman’s paradox. Why is reporting a p-value wrong? Many thanks.
