Does peer review ever increase “researcher degrees of freedom” and compromise statistical rigor?

“Researcher degrees of freedom” is an umbrella term for all the data-processing and analytical choices researchers make after seeing the data. It’s something I and many others are increasingly worried about. Insofar as your analytical choices–which data to include, which hypotheses to test, which predictor variables to include, etc.–depend on how the data happened to come out, you run a serious risk of compromising the severity of your statistical inferences (e.g., by inflating your Type I error rate). Not always, of course. For instance, it’s legit to check whether your residuals conform to the assumptions of the analysis, and if not, to transform the data or otherwise modify the analysis to improve the residuals (not to “improve” the P value!). But increasing evidence suggests that researcher degrees of freedom compromise our analyses more often and more seriously than many researchers care to admit.

Recently, I had a rather scary thought: what if peer review is part of the problem here?

Think about it. What do reviewers often do, after having had a look at your results? They suggest alternative ways to process and analyze your data. They ask you to include or exclude certain data, because those desert sites are “obviously” going to be different or that data point is “clearly” an outlier or whatever. They suggest that you test your hypothesis using different analyses. They suggest that you analyze each subgroup of your data separately because there might be heterogeneity among subgroups. They question whether your result is mainly due to one or a few influential data points and so ask you to redo the analysis with those points excluded. They notice an apparent pattern in your data that you didn’t discuss, and ask you to test whether it’s significant. They ask you to include an additional predictor variable in your analysis, because from eyeballing Fig. X it looks like that variable might matter. Etc. etc. Probably most people who’ve written an ecology paper have gotten comments like this and not seen anything problematic about them. I mean, you might or might not agree with the comment, but you probably don’t consider it problematic to get this sort of comment. I certainly haven’t considered such comments problematic, until recently. And probably many of you have made such comments when acting as reviewers; I certainly have.

But aren’t these sorts of comments statistically problematic? Insofar as analytical decisions that aren’t pre-specified compromise your analyses, it doesn’t matter whether those non-pre-specified decisions are made by you or a reviewer. And crucially, reviewers don’t ordinarily ask that you respond to their analytical suggestions by collecting new data, planning in advance to analyze that new data as they’ve suggested. Rather, they suggest that you implement their analytical suggestions on the data you already have–the very same data that often inspired their suggestions in the first place.
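To make the worry concrete, here is a minimal simulation sketch. It is an illustration under stated assumptions, not an analysis from any real paper: the data are pure noise with known unit variance, the "pre-specified" analysis is a single z-test on the pooled data, and the "flexible" analysis also runs the same test within two hypothetical subgroups (A and B) and reports whichever P value is smallest. All function names and parameter values are invented for the example.

```python
import math
import random


def two_sided_p(z):
    """Two-sided P value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def z_diff(treated, control):
    """z statistic for a difference in means, assuming known unit variance."""
    diff = sum(treated) / len(treated) - sum(control) / len(control)
    return diff / math.sqrt(1 / len(treated) + 1 / len(control))


def simulate(n_sims=4000, n_per_cell=20, alpha=0.05, seed=1):
    """Return (pre-specified, flexible) false-positive rates under the null."""
    rng = random.Random(seed)
    fixed_hits = 0
    flexible_hits = 0
    for _ in range(n_sims):
        # Pure noise: there is no treatment effect in either subgroup.
        cells = {
            (group, arm): [rng.gauss(0, 1) for _ in range(n_per_cell)]
            for group in ("A", "B")
            for arm in ("treated", "control")
        }
        # Pre-specified analysis: one test on the pooled data.
        p_all = two_sided_p(z_diff(
            cells[("A", "treated")] + cells[("B", "treated")],
            cells[("A", "control")] + cells[("B", "control")],
        ))
        # Analyses added after seeing the data: one test per subgroup.
        p_a = two_sided_p(z_diff(cells[("A", "treated")], cells[("A", "control")]))
        p_b = two_sided_p(z_diff(cells[("B", "treated")], cells[("B", "control")]))
        if p_all < alpha:
            fixed_hits += 1
        if min(p_all, p_a, p_b) < alpha:
            flexible_hits += 1
    return fixed_hits / n_sims, flexible_hits / n_sims


fixed_rate, flexible_rate = simulate()
print(f"pre-specified analysis only: {fixed_rate:.3f}")
print(f"best of pooled + subgroups:  {flexible_rate:.3f}")
```

By construction the pre-specified rate should land near the nominal 0.05, while the flexible rate can only be equal or higher, because every extra test is another chance for noise to cross the threshold. It makes no difference to the arithmetic whether the subgroup tests were the author's idea or a reviewer's.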

I emphasize that I am not trying to concern troll here. I’m a big fan of pre-publication peer review; it continues to be a great thing for my own papers and for science as a whole. Seriously, this post is NOT an attack on pre-publication peer review! I also emphasize that it’s perfectly natural for reviewers to think about alternative analyses and suggest them to the author. That’s what you do after seeing someone’s results and thinking about them. I’m not questioning the statistical competence of reviewers who make these sorts of analytical suggestions (as I said, I’ve made such suggestions myself as a reviewer). Finally, I emphasize that some statistical suggestions reviewers make do not compromise statistical validity if followed by the authors. I am not saying that reviewers can’t ever legitimately question anything about the authors’ statistics after having read the paper! For instance, if the author made a flat-out statistical mistake, like treating a nested design as a factorial design, it’s totally legit for a reviewer to point that out, and for the author to redo the analysis correctly. If the residuals aren’t distributed in anything like the way assumed by the analysis, it’s totally legit for a reviewer to point that out and for the author to redo the analysis so as to fix the residuals. Etc. And there might even be times when it’s worth somewhat compromising statistical rigor at the request of a reviewer for the sake of some larger scientific goal. All I’m saying is that, if making data-dependent analytical choices can compromise one’s statistics–and it often does–then why should it matter if those data-dependent analytical choices are made by reviewers as opposed to authors?

If these sorts of reviewer comments are statistically problematic, I think that’s another argument for disclosure requirements: requiring researchers to disclose all the data processing they did and all the analyses they did, including analyses that got “left on the cutting room floor” (e.g., exploratory analyses) and all analyses performed at the request of reviewers.

I guess another option would be for authors to respond to reviewer requests for alternative analyses by saying something like “The reviewer makes an interesting suggestion. However, because the suggestion was not pre-specified, we are unable to pursue it in a statistically-rigorous way using the data reported in the ms.” That response would probably be best-justified if the authors had pre-registered their study design and planned statistical analyses. Indeed, if I understand correctly (please correct me if I’m wrong), that’s more or less how the authors of drug trials would be entitled to respond if a reviewer asked them to deviate from their pre-specified analytical plans (see here for background). I believe subatomic physics (e.g., the LHC folks who discovered the Higgs boson) is another field where data processing and analytical decisions are entirely pre-specified and aren’t ordinarily changed at the request of reviewers (again, please correct me if I’m wrong on this).

What do you think? Is peer review an important source of “researcher degrees of freedom”? If so, what if anything should be done about it?

22 thoughts on “Does peer review ever increase “researcher degrees of freedom” and compromise statistical rigor?”

  1. Hm. But these are the sorts of things we should be doing as statisticians – I agree they should be reported, but why give reviewers stick for suggesting the author does a proper job?

    • Not sure what you mean Bob, can you elaborate? I certainly didn’t intend to give reviewers stick here. And the sort of data-dependent analytical suggestions I’m discussing don’t necessarily (or even usually) arise because authors have done an improper job.

  2. There is the (probably obvious) difference that the reviewer does not usually have access to the raw data or workflow that led to the published analyses, so their suggestions won’t be guided by what previous analytical attempts seemed to suggest. So the opposite could also happen: reviewers may in many cases reduce researcher degrees of freedom, by suggesting (without knowing) that some steps which the authors had originally considered, but perhaps found some reason to not include, really do need to be taken into account despite what the researcher’s “gut” says.

    Also, another way of thinking about the issue is: rather than trying to ignore or avoid researcher df’s, maybe we should be conscious of our subjectivity, and even try to measure its influence? For example, if certain conclusions are robust to different analytical approaches, or if they are challenged by them, maybe that is actually informative? This would be an argument for the importance of data/code/workflow deposition, and for adding rather than trying to avoid researcher df’s.

    • Yes, it is possible that sometimes reviewers end up unwittingly obliging authors to report analyses that the authors had done but for whatever reason decided not to report.

      There certainly are situations in which a conclusion is robust to different analytical approaches, and I agree that that’s reassuring. I’d only add that, in ecology, such robustness can be hard to come by sometimes. In part because different statistical analyses often test quite different claims rather than acting as different ways of testing the same claim. And in part because many ideas of empirical interest in ecology take the form of loosely-defined verbal models. One key source of “researcher degrees of freedom” in ecology is “freedom to decide exactly how I want to define some vaguely-defined concept from some verbal model”. You can’t test the robustness of a conclusion to alternative statistical analyses unless you can first get everyone to agree on sufficiently-precise definitions of terms.

      Re: subjectivity: I see statistics, done well, as a set of techniques to keep us from fooling ourselves, keep us from mistaking noise for signal (which our brains really, really want to do). It’s a set of techniques for reducing (not eliminating, because that’s impossible, but reducing) subjectivity. I see reforms like pre-registration of study design and statistical analyses (which are required for drug trials) as a way to reduce subjectivity in our analyses.
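The robustness check discussed in this thread can be sketched as running every combination of a few defensible analytical choices and reporting all of the resulting estimates, rather than only the preferred one. The sketch below is a toy under stated assumptions: the data are simulated, and the three choices (log-transforming the response, dropping the largest 5% of responses, restricting the predictor range) and every name in the code are hypothetical stand-ins for whatever choices arise in a real analysis.

```python
import itertools
import math
import random


def ols_slope(xs, ys):
    """Ordinary least-squares slope of y on x."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def make_toy_data(n=60, seed=2):
    """Simulated data: a positive response that rises with x on the log scale."""
    rng = random.Random(seed)
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [math.exp(0.5 + 0.1 * x + rng.gauss(0, 0.3)) for x in xs]
    return xs, ys


# Hypothetical analytical choices, each of which could be defended after the fact.
CHOICES = {
    "log_response": (False, True),
    "drop_extreme_y": (False, True),  # drop roughly the largest 5% of responses
    "x_range": ("all", "x<8"),
}


def run_one(xs, ys, log_response, drop_extreme_y, x_range):
    """Apply one combination of choices and return the fitted slope."""
    pairs = list(zip(xs, ys))
    if x_range == "x<8":
        pairs = [(x, y) for x, y in pairs if x < 8]
    if drop_extreme_y:
        cutoff = sorted(y for _, y in pairs)[int(0.95 * len(pairs))]
        pairs = [(x, y) for x, y in pairs if y < cutoff]
    if log_response:
        pairs = [(x, math.log(y)) for x, y in pairs]
    return ols_slope([x for x, _ in pairs], [y for _, y in pairs])


def multiverse(xs, ys):
    """Fit the model under every combination of choices."""
    keys = list(CHOICES)
    results = {}
    for combo in itertools.product(*(CHOICES[k] for k in keys)):
        results[combo] = run_one(xs, ys, **dict(zip(keys, combo)))
    return results


xs, ys = make_toy_data()
results = multiverse(xs, ys)
for combo, slope in results.items():
    print(combo, round(slope, 3))
```

Slopes on the raw and log scales are not directly comparable in magnitude, so in this toy only the sign and rough strength of the relationship can be compared across all eight specifications. That limitation echoes the reply above: different analyses often test somewhat different claims.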

  3. This can be easily resolved by use of preprints. If you are unsure whether some study’s analysis is sound, just look at the pre-review draft the authors uploaded, and see if they have introduced a bunch of new analyses since. It also makes the “value added” of reviewers more obvious if you can see the original submission in the archives.

    Also, I am not sure if results like LHC’s go through the same sort of peer-review or benefit that much from it. Most of the review seems to come from inside the ridiculously huge and heterogeneous team.

    • Re: preprints, yes. Though of course that doesn’t help with “researcher degrees of freedom” that act before the preprint is posted.

      Yes, results like the LHC’s do go through a lot of internal peer review. But I raised that example to make a different point. I’m under the impression (and I could be wrong) that the LHC team pre-specified how they were going to analyze their data, and that that’s key to the statistical rigor of their results. It’s the pre-specification I wanted to highlight.

  4. I have to confess I am not sure I get the big deal about researcher degrees of freedom. Here are several reasons:
    1) While researchers certainly do make up hypotheses post hoc, they actually do have a priori hypotheses they’re testing a fair amount of the time. We may not have a registry like drug trials, but a great deal of research these days starts with either a grant proposal or a student dissertation proposal. The whole Andrew Gelman Slate smackdown looked to me (we’ll never know for sure) like he picked an article that actually did have a firm a priori hypothesis.
    2) Other researchers (graduate committees, co-authors) have a pretty good nose for when the original hypothesis was “saved” by finding out it was true, but only if you examine just the data collected on Sundays while standing on your left leg. While peer reviewers and journal readers have less information, they can often detect this too.
    3) Science is at its core a social enterprise. Not much gets accepted as really foundationally true until it is reproduced by other people in other systems. And the very nature of reproducing a result means you are using an a priori hypothesis.

    I do have one suggestion for a “fix” to this problem (which I am not too worried about but I would like to see the fix happen anyway): take exploratory statistics out of the closet. I teach my students that there are two fundamental modes of statistics, hypothesis testing and exploratory, and that you need to know up front which you are doing and stay in that mode. If you’re doing exploratory statistics, you should acknowledge it and not report p-values. There are enough frontiers in ecology where we know so little that exploratory research (natural history with statistics?) is perfectly appropriate (although it hopefully leads eventually to hypothesis-based research). But when I tell my students this they all give me that sad-you’re-a-little-touched-in-the-head look, because we drill into students all the time that hypothesis testing=good science. Nobody is going to write a grant or go to their advisory committee saying “this is a domain we know so little about that I’m going to collect data and do exploratory statistics.” But that is really unfortunate, because that is exactly the right way to do things. You don’t always have to have a hypothesis a priori to do rigorous science. But you do have to be intensely intellectually honest about whether you are doing hypothesis testing or exploratory statistics.

    So if you’re really concerned about researcher degrees of freedom, start a campaign to take exploratory statistics back out of the closet and make them socially acceptable!

    • Brian, I disagree with you that researcher degrees of freedom aren’t a problem. I’ve seen analysts essentially act like walking/talking 20th-order polynomials. However, I give you a big thumbs up for your suggestion on how to deal with them. As a fellow observer, I completely agree. Confirmatory analysis is great, so is exploratory analysis.

      Here’s where I see a practical difficulty, though. With confirmatory analysis, there is a very clear criterion for success: p < 0.05. I'm not sarcastically dissing p-values here. Seriously, this criterion is helpful. No it doesn't ensure that type I errors are controlled at a rate of 0.05 — essentially because of researcher degrees of freedom — but it really does help in many cases. But — and here's the practical problem — with exploratory analysis there is no such clear criterion. Reviewers can't say something like 'well the coefficient of exploration is pretty high, so I don't think we can publish this exploration'. And because we don't have this, exploring authors can always just respond to criticism by saying 'hey man…chill…I'm just exploring'.

      I think the important distinction isn't confirmatory versus exploratory, but rather hypothesis testing versus estimation. If your estimates have terrible precision and/or don't cross-validate well, then there would be grounds for not publishing an exploration. Because in such a case you've neither tested anything, nor described/estimated anything!

      More generally, explorers need to be able to ask compelling questions that can be convincingly answered with exploratory analysis, and that's the challenge. Statistical natural history could be great if you can ask and answer interesting questions. In politics/economics/sports this is quite doable, because people really care about an estimate of, say, the difference between American health before and after ObamaCare. I'm getting off topic…

      • I agree there are people out there acting like a “walking 20 degree polynomial”. The question though is are other people being fooled or calling them on it? Or in other terms what is the level of analysis (individual paper, researcher, or all of science). I tend to think that while not perfect, people get called out (or their results ignored) for this kind of behavior.

        I do agree that estimation/prediction is another worthy goal that is somewhat independent of testing hypotheses and exploration (I actually teach this as a valid mode in my stats class too – I oversimplified in my last post to stay on topic – I’ve said too much already on prediction in this blog).

        Interesting point about not having a p-value-like threshold for exploratory statistics. I’m writing a post for next week on exploratory statistics. I’ll have to think about it.

      • “I agree there are people out there acting like a “walking 20 degree polynomial”. The question though is are other people being fooled or calling them on it? Or in other terms what is the level of analysis (individual paper, researcher, or all of science). I tend to think that while not perfect, people get called out (or their results ignored) for this kind of behavior.”

        Fair enough. I hope you’re right. Though it would be nicer if people weren’t ignoring each other, and instead developing strategies for helping the research community in general avoid over-interpretation of data. Ben Bolker is absolutely amazing in this regard.

      • @Jeff:

        Great comment, you’ve articulated much better than I did exactly what “researcher degrees of freedom” is all about. Perhaps I should farm the next post on this topic out to you!

    • Brian: researcher degrees of freedom are as much a problem for hypothesis-driven confirmatory research as for exploratory research. I know my research the best, so I’ll use an example: I have a very good biomechanical model of how body shape affects swimming performance – say the “escape” or evasion ability of a fish to rapidly accelerate out of the path of a striking predator. So I “test” this model by measuring body shape and fast start performance and then see if the magnitude and sign of the regression coefficients match my model. Simple hypothesis-driven proposal. But when the actual work is done there are many, many decisions to be made on how to process the data. Fish are different sizes, so I need to adjust for size as a nuisance covariate. Do I add size to the model or do I use a ratio of the morphometric measures over a size measure? And what size measure – length of the fish, mass (or cube root of mass), the geometric mean of all morphometric measures, or the geometric mean of length, depth, and breadth? And what performance measure? I can measure distance traveled over some time period or maximum velocity or mean velocity over some time or maximum acceleration or mean acceleration. And if I choose any of these over some window of time, how do I choose that window (15 ms or 20 ms or 100 ms)? Do I analyze the sexes together or separately? Do I include all covariates or a subset? Do I delete outliers? I can go on and on. These choices matter to the results (depressingly so) and, more importantly, I can justify any of them. In a typical analysis session I might try a dozen combinations of these as I’m battling with my mind over the right path to choose (by path I mean set of decisions). But maybe my brain is justifying a path because it’s seen the results and likes them? This is the issue with researcher degrees of freedom. Even with what seem like very clean experiments, the number of possible combinations of data-processing decisions can very quickly become astronomically large.
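As a rough illustration of how quickly the paths in the fish example multiply, the sketch below counts the combinations of choices listed in the comment above. The encoding is an assumption made for the example: each decision is collapsed to the handful of options actually named, the mass-based size measures are counted as two options, "a subset of covariates" is counted as a single option, and the time window is applied to every performance measure. A real analysis would have many more paths.

```python
import itertools
import math

# Hypothetical, deliberately coarse encoding of the decisions named in the comment.
CHOICES = {
    "size_adjustment": ["size as covariate", "ratio to size"],
    "size_measure": [
        "length",
        "mass",
        "transformed mass",
        "geometric mean of all morphometric measures",
        "geometric mean of length, depth, breadth",
    ],
    "performance_measure": [
        "distance traveled",
        "maximum velocity",
        "mean velocity",
        "maximum acceleration",
        "mean acceleration",
    ],
    "time_window_ms": [15, 20, 100],
    "sexes": ["pooled", "separate"],
    "covariates": ["all", "subset"],
    "outliers": ["keep", "delete"],
}

# Each path is one complete set of decisions, i.e. one way to run the analysis.
paths = list(itertools.product(*CHOICES.values()))
n_paths = len(paths)

# Sanity check: the count is the product of the number of options per decision.
assert n_paths == math.prod(len(options) for options in CHOICES.values())
print(n_paths)  # prints 1200
```

Even this coarse version yields 1200 distinct analyses, so the dozen combinations tried in a typical session would be a tiny, and possibly results-guided, sample of the defensible paths.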

    • I love the suggestion to promote exploratory analyses (and had commented with a similar thought on an earlier post on data sharing). A colleague and I have just submitted a paper that is almost only data exploration and hypothesis-generation, on a common but completely unstudied frog species in India–we’ll see what the reviewers make of it! I think of such papers primarily as vehicles for getting the data out there. With so many completely unstudied species, especially in the tropics, it would be a waste of time and resources to re-collect data that had been already collected by someone else, who chose not to publish the data simply because they did not fit a neat hypothesis generated after the fact.

  5. @ Brian and Steve:

    I don’t have much to add to your very good discussion. I would quibble with Brian’s confidence that nothing gets accepted as really foundationally true in science until it’s been replicated. There are such things as zombie ideas! And while having a priori hypotheses certainly goes a long way towards reducing researcher degrees of freedom, I worry that it doesn’t go as long a way as most of us (including me until recently) think it does.

    I freely admit that it’s hard to quantify just how serious a problem “researcher degrees of freedom” is, either in general or as arising specifically from reviewer suggestions. I admit that my own concern about it has largely, though not entirely, anecdotal sources.

    There’s kind of a catch-22 here, in that the reforms discussed in the post aren’t likely to be implemented unless lots of people come to see researcher degrees of freedom as a serious problem. But implementing those reforms is one of the few ways I can think of to really get a handle on how serious a problem researcher degrees of freedom is. Can anyone think of any other ways to get a better handle on the scale of the problem?

    EDIT: and yes, “Get everybody to do as Ben Bolker would do” would, if implemented, be a great way to address the problem. 🙂 Perhaps we can try to make that a meme: WWBBD? (“What would Ben Bolker do?”) The meme takes advantage of the fact that Ben looks a bit like Jesus. 🙂

  6. I agree with Brian, especially his point about science being a social enterprise. It is mostly impossible to use one study to confirm or deny anything. A well-controlled experiment will have less of a problem with researcher df and may provide strong evidence for a certain hypothesis, but that hypothesis will likely be relatively simple, with the results lacking generality. An analysis using observational data will have more issues with researcher df but also a reduced ability to provide strong evidence. Yet it could provide empirical evidence for theories/hypotheses that warrant further examination (hence the social enterprise).

    I also think exploratory analyses need to be given more credit. They fall on the spectrum between experimental hypothesis-testing and pattern description; this entire spectrum is useful for providing information. Exploratory analyses can identify hypotheses worth testing and if more were published, more scientists would be exposed to potentially new ideas. Information is information, right? The internet has taught us that.

    Gelman’s example underscores the difficulty in sharing scientific studies with the general public, especially small studies with conclusions that need to be taken with a grain of salt. That study was published in a highly respected journal and media outlets were quick to report it as fact (as they always do). Scientists know that these conclusions are potentially conditional on the data and will be properly skeptical, but the general public will not. So researcher df makes it even harder for the average person to interpret information from a scientific study given that decisions made during analysis will not be obvious. I’m not sure this hinders science, necessarily.

