Data sharing is all the rage these days. Journals in ecology and evolution increasingly require authors to make their raw data freely available online. One benefit of this is that it makes it possible for others to try to reproduce published analyses, possibly catching serious errors in the process. And there are many other good reasons to share data; Tim Poisot has a fine summary.
But there’s a downside to data sharing too: it’s probably going to lead to publication of more false results.
Problems with scientific reproducibility don’t just arise from clear-cut analytical mistakes like those described in the above link, or from publication biases. They also arise from common scientific practices that compromise the validity of our statistical analyses. Indeed, I suspect this is the most important source of irreproducibility in science.
Statistician Andrew Gelman, quoting Simmons et al., calls it “researcher degrees of freedom”:
The culprit is a construct we refer to as researcher degrees of freedom. In the course of collecting and analyzing data, researchers have many decisions to make: Should more data be collected? Should some observations be excluded? Which conditions should be combined and which ones compared? Which control variables should be considered? Should specific measures be combined or transformed or both? It is rare, and sometimes impractical, for researchers to make all these decisions beforehand. Rather, it is common (and accepted practice) for researchers to explore various analytic alternatives, to search for a combination that yields “statistical significance,” and to then report only what “worked.” The problem, of course, is that the likelihood of at least one (of many) analyses producing a falsely positive finding at the 5% level is necessarily greater than 5%.
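The quoted point is easy to make concrete with a back-of-the-envelope calculation (mine, not from Simmons et al.): if a researcher runs k independent tests on pure noise, each at the 5% level, the chance of at least one false positive is 1 − 0.95^k, which grows quickly with k.

```python
# Chance of at least one false positive among k independent tests,
# each run at significance level alpha, when every null is true.
def family_wise_error(k, alpha=0.05):
    return 1 - (1 - alpha) ** k

for k in (1, 5, 10, 20):
    print(k, round(family_wise_error(k), 3))
# 1 test:  0.05;  5 tests: 0.226;  10 tests: 0.401;  20 tests: 0.642
```

With twenty analytic alternatives to try, a "significant" result on noise is more likely than not.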
For instance, if you first explore a dataset–even just by eyeballing the data–to get a sense of what patterns might be there, and then do a statistical test for the “pattern” you just spotted, your test is invalid. It’s circular reasoning: you’re letting the data tell you what hypotheses to test in the first place, and then testing those hypotheses using the same data. It greatly inflates your chance of getting a false positive. Which doesn’t stop many people from doing just that, and then pretending that they planned to test that hypothesis all along.

Or maybe your analysis of the full dataset doesn’t come out how you expected or hoped. So you decide to divide the data into subsets and see if maybe the results are being dictated by the behavior of some unusual subset. The problem, of course, is that different subsets of your data will always look different, and you can always find some post-hoc reason to focus on certain subsets over others.

Or maybe the results seem to be heavily influenced by a few “weird” data points–perhaps they’re outliers, or data points from one particular site or species, or whatever. So you redo the analysis with those points excluded, and find that it comes out differently, in a way that makes much more sense to you. Which amounts to letting your hypotheses tell you which data are right, rather than the other way around. After all, if those sites or species or whatever were so “weird”, how come you included them in your study in the first place? And if the data from those sites or species really were so “weird”, how come you only decided to exclude them after doing some preliminary analyses? Etc. (And as an aside, let me emphasize that I’ve done this sort of thing too.)
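These temptations can be simulated. Here's a minimal sketch (my own illustration, with a normal-approximation p-value standing in for a proper t-test): generate pure noise, then compare an honest, pre-planned test of "the mean is zero" against a "flexible" analysis that also tries each half of the data and a version with the two most extreme points excluded, reporting significance if any of them "works".

```python
import math
import random

def p_value_mean_zero(xs):
    """Two-sided test that the true mean is zero, using a normal
    approximation (a simplification; a real analysis would use a t-test)."""
    n = len(xs)
    m = sum(xs) / n
    var = sum((x - m) ** 2 for x in xs) / (n - 1)
    z = m / math.sqrt(var / n)
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

random.seed(1)
n_sims, n = 2000, 40
hits_honest = hits_flexible = 0
for _ in range(n_sims):
    xs = [random.gauss(0, 1) for _ in range(n)]  # pure noise: no real effect
    # Honest analysis: one pre-planned test on the full dataset.
    if p_value_mean_zero(xs) < 0.05:
        hits_honest += 1
    # "Flexible" analysis: also try each half, and the data with the
    # two most extreme points dropped; report whichever "works".
    subsets = [xs, xs[:n // 2], xs[n // 2:], sorted(xs, key=abs)[:-2]]
    if any(p_value_mean_zero(s) < 0.05 for s in subsets):
        hits_flexible += 1

print("honest false-positive rate:  ", hits_honest / n_sims)
print("flexible false-positive rate:", hits_flexible / n_sims)
```

The honest analysis comes out near the nominal 5%; the flexible one comes out well above it, even though every extra analysis seems individually defensible.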
In an ideal world, statistics–both Bayesian and frequentist–is a set of procedures to keep us from fooling ourselves, from mistaking noise for signal. But those tools only work if they’re used properly, and there’s increasing evidence that they’re often not. This has led to calls for reforms that reduce “researcher degrees of freedom”, such as study registries and disclosure requirements. Those reforms would force authors to build a firewall between exploratory and confirmatory analyses, and to reveal all the exploratory analysis and data massaging that everyone does but doesn’t ordinarily report.
Increasing data sharing will only make these problems worse, I think, though I freely admit that how much worse is open to debate. Data sharing increases temptation and opportunity. There’s all this data out there to explore, and it’s a waste not to squeeze everything out of it we can, right? And let’s be honest: exploring already-existing data is often a route to a quick paper. For instance, somebody publishes an analysis that doesn’t come out as you would’ve expected, or that for whatever reason you just don’t like–it’s not the way you would’ve done it. Well, now you can just download the data, do the alternative analysis that you prefer (or explore a bunch of alternatives until you find one that gives what seems to you to be the “right” answer, or the “best” answer), and publish it. Or somebody puts a huge, amazing dataset online for anyone to explore. And so lots of people do–and because of researcher degrees of freedom they all come to false conclusions. The fewer people who explore a dataset, the fewer false conclusions get drawn.
Now, one possible response to this is to suggest that the “signal” of the truth will always emerge from the “noise” in the long run as more alternative analyses are done and more data are made available. That’s how science is supposed to work. But I’m not so sure that’s how it actually does work. Again, if the scientific process is really so good at converging on the truth, how come most published research findings are false? And how come corrections, refutations, and retractions are so little-noticed?
In his recent book, Nate Silver criticizes the notion that having “Big Data” is always helpful, on the grounds that there’s only a certain amount of truth in the world, no matter how much data you have. Having more data may just increase the size of the “haystack” you need to search to find the “needle” of truth. My worry here is similar.
Now having said all that, I don’t think the solution is to ban data sharing! But I confess I don’t know what the solution is, beyond what’s suggested in the old posts linked to above.
Advocates of data sharing have won the argument for its value, and rightly so. It’s time to start having the next argument: how do we ensure that all this newly-available data is analyzed effectively?
UPDATE: In correspondence published in this week’s Nature, ecologists David Lindenmayer and Gene Likens express much the same concern, writing:
Large open-access data sets offer unprecedented opportunities for scientific discovery — the current global collapse of bee and frog populations are classic examples. However, we must resist the temptation to do science backwards by posing questions after, rather than before, data analysis.
(UPDATE #3: In the comments, Brian notes that he read the L&L correspondence rather differently than I did. Brian reads L&L as singling out for special criticism those who analyze data collected by others, as opposed to collecting their own data. I didn’t read it that way, but now that Brian’s pointed it out I can see how it could be read that way. So just to be clear: the reason I quoted L&L is because I took them to be saying the same thing I tried to say in my post. And as I hope was clear from the post and the comment thread, I don’t think people who analyze data collected by others are worse at statistics than people who collect their own data. Let me also clarify that I quoted L&L not in an attempt at proof by authority, but simply because their letter happened to be published the same day my post was, so I figured I’d point the letter out.)
UPDATE #2: And in another piece of correspondence in this week’s Nature, Jason McDermott discusses “red flags” that may signal irreproducible research. One of which is doing a bunch of statistical tests and failing to correct for multiple comparisons, an issue I raised in the post. Man, it’s like everyone who wrote to Nature this week read my mind!
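The fix for that particular red flag is standard. As a sketch (my own example, not McDermott's), the simplest correction is Bonferroni: when you run m tests, compare each p-value against α/m rather than α.

```python
def bonferroni(p_values, alpha=0.05):
    """Bonferroni correction: reject only those tests whose p-value
    falls below alpha divided by the number of tests performed."""
    m = len(p_values)
    return [p < alpha / m for p in p_values]

# Ten tests, two of which look "significant" at the naive 0.05 cutoff;
# only p = 0.003 clears the corrected bar of 0.05 / 10 = 0.005.
ps = [0.003, 0.021, 0.08, 0.15, 0.22, 0.31, 0.44, 0.56, 0.71, 0.90]
print(bonferroni(ps))
```

Bonferroni is conservative; less stringent procedures (e.g. Holm or false-discovery-rate control) exist, but any of them beats reporting ten uncorrected tests as if each stood alone.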