Data sharing is all the rage these days. Journals in ecology and evolution increasingly require authors to make their raw data freely available online. One benefit of this is that it makes it possible for others to try to reproduce published analyses, possibly catching serious errors in the process. And there are many other good reasons to share data; Tim Poisot has a fine summary.
But there’s a downside to data sharing too: it’s probably going to lead to publication of more false results.
Problems with scientific reproducibility don’t just arise from clear-cut analytical mistakes like those described in the above link, or from publication biases. They also arise from common scientific practices that compromise the validity of our statistical analyses. Indeed, I suspect this is the most important source of irreproducibility in science.
Statistician Andrew Gelman (CORRECTION: Andrew Gelman, quoting Simmons et al.) calls it “researcher degrees of freedom”:
The culprit is a construct we refer to as researcher degrees of freedom. In the course of collecting and analyzing data, researchers have many decisions to make: Should more data be collected? Should some observations be excluded? Which conditions should be combined and which ones compared? Which control variables should be considered? Should specific measures be combined or transformed or both? It is rare, and sometimes impractical, for researchers to make all these decisions beforehand. Rather, it is common (and accepted practice) for researchers to explore various analytic alternatives, to search for a combination that yields “statistical significance,” and to then report only what “worked.” The problem, of course, is that the likelihood of at least one (of many) analyses producing a falsely positive finding at the 5% level is necessarily greater than 5%.
For instance, if you first explore a dataset–even just by eyeballing the data–to get a sense of what patterns might be there, and then do a statistical test for the “pattern” you just spotted, your test is invalid. It’s circular reasoning–you’re letting the data tell you what hypotheses to test in the first place, and then testing those hypotheses using the same data. It greatly inflates your chance of getting a false positive. Which doesn’t stop many people from doing just that, and then pretending that they planned to test that hypothesis the whole time. Or, maybe your analysis of the full dataset doesn’t come out how you expected or hoped. So you decide to divide the data into subsets and see if maybe the results are being dictated by the behavior of some unusual subset. Of course, the problem is that different subsets of your data will always look different, and you can always find some post-hoc reason to focus on certain subsets over others. Or maybe the results seem to be heavily influenced by a few “weird” datapoints–perhaps they’re outliers, or data points from one particular site or species, or whatever. So you redo the analysis with those points excluded, and find that it comes out differently, in a way that makes much more sense to you. Which amounts to letting your hypotheses tell you which data are right, rather than the other way around. After all, if those sites or species or whatever were so “weird”, how come you included them in your study in the first place? And if the data from those sites or species was really so “weird”, how come you only decided to exclude it after doing some preliminary analyses? Etc. (And as an aside, let me emphasize that I’ve done this sort of thing too.)
In an ideal world, statistics–both Bayesian and frequentist–is a set of procedures to keep us from fooling ourselves, from mistaking noise for signal. But those tools only work if they’re used properly. And there’s increasing evidence that they’re often not. Leading to calls for reforms that reduce “researcher degrees of freedom”, such as study registries and disclosure requirements. Those reforms would force authors to build a firewall between exploratory and confirmatory analyses, and to reveal all the exploratory analysis and data massaging that everyone does but doesn’t ordinarily report.
Increasing data sharing will only make these problems worse, I think, though I freely admit that how much worse is open to debate. Data sharing increases temptation and opportunity. There’s all this data out there to explore, and it’s a waste not to squeeze everything out of it we can, right? And let’s be honest: exploring already-existing data is often a route to a quick paper. For instance, somebody publishes an analysis that doesn’t come out as you would’ve expected, or that for whatever reason you just don’t like–it’s not the way you would’ve done it. Well, now you can just download the data, do the alternative analysis that you prefer (or explore a bunch of alternatives until you find one that gives what seems to you to be the “right” answer, or the “best” answer), and publish it. Or somebody puts a huge, amazing dataset online for anyone to explore. And so lots of people do–and because of researcher degrees of freedom they all come to false conclusions. The fewer people who explore a dataset, the fewer false conclusions get drawn.
Now, one possible response to this is to suggest that the “signal” of the truth will always emerge from the “noise” in the long run as more alternative analyses are done and more data are made available. That’s how science is supposed to work. But I’m not so sure that’s how it actually does work. Again, if the scientific process is really so good at converging on the truth, how come most published research findings are false? And how come corrections, refutations, and retractions are so little-noticed?
In his recent book, Nate Silver criticizes the notion that having “Big Data” is always helpful, on the grounds that there’s only a certain amount of truth in the world, no matter how much data you have. Having more data may just increase the size of the “haystack” you need to search to find the “needle” of truth. My worry here is similar.
Now having said all that, I don’t think the solution is to ban data sharing! But I confess I don’t know what the solution is, beyond what’s suggested in the old posts linked to above.
Advocates of data sharing have won the argument for its value, and rightly so. It’s time to start having the next argument: how do we ensure that all this newly-available data is analyzed effectively?
UPDATE: In correspondence published in this week’s Nature, ecologists David Lindenmayer and Gene Likens express much the same concern, writing:
Large open-access data sets offer unprecedented opportunities for scientific discovery — the current global collapse of bee and frog populations are classic examples. However, we must resist the temptation to do science backwards by posing questions after, rather than before, data analysis.
(UPDATE #3: In the comments, Brian notes that he read the L&L correspondence rather differently than I did. Brian reads L&L as singling out for special criticism those who analyze data collected by others, as opposed to collecting their own data. I didn’t read it that way, but now that Brian’s pointed it out I can see how it could be read that way. So just to be clear: the reason I quoted L&L is because I took them to be saying the same thing I tried to say in my post. And as I hope was clear from the post and the comment thread, I don’t think people who analyze data collected by others are worse at statistics than people who collect their own data. Let me also clarify that I quoted L&L not in an attempt at proof by authority, but simply because their letter happened to be published the same day my post was, so I figured I’d point the letter out.)
UPDATE #2: And in another piece of correspondence in this week’s Nature, Jason McDermott discusses “red flags” that may signal irreproducible research. One of which is doing a bunch of statistical tests and failing to correct for multiple comparisons, an issue I raised in the post. Man, it’s like everyone who wrote to Nature this week read my mind! 🙂
Is one potential solution to encourage the publication of exploratory analyses as distinct from hypothesis-testing analyses? This would enable maximum use of existing datasets and provide preliminary results for future hypothesis-driven research. It would also help avoid the propagation of false results that you caution against, since such papers would generate particular hypotheses rather than come down in support of them.
There are several concerns that arise (what constitutes a thorough exploratory analysis? where would such papers be published?), but I think that the benefit of gaining the most that we can from already-collected data makes exploring this option worthwhile.
“Increasing data sharing will only make these problems worse” – I can see the point if we speak about sharing of unconnected data. However, the result of data-sharing is that we have more and more publicly documented datasets for the same phenomenon, and if reviewers insist that analyses are done with all available data (sets), I would think that cherry-picking as described above becomes harder, not easier.
I completely agree with Florian. We move from a case-study approach to studying phenomena or theories, to examining the general support for them.
I think you and Morgan are probably right–but only if those multiple datasets on the same phenomenon are analyzed properly. For instance, do exploratory analyses on some datasets, and then test the hypotheses you’ve generated using other datasets. That’s analogous to cross-validation. It’s a very good way to build a firewall between exploratory and hypothesis-testing analyses, and so keep from fooling yourself. And you need multiple datasets (or one really big dataset) to do it. But just having a bunch of datasets on the same phenomenon isn’t a cure for researcher degrees of freedom in and of itself–it’s all in how you analyze the data.
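Here’s a minimal sketch of the firewall I have in mind (illustrative only; the variables are hypothetical and everything is simulated noise). Explore one half of the data to pick a hypothesis, then test only that one hypothesis on the held-out half:

```python
# Illustrative sketch of an exploratory/confirmatory firewall, analogous to a
# train/test split. All "predictors" here are pure noise, so any hypothesis
# generated in the exploratory phase should fail to confirm on the holdout.
import numpy as np

rng = np.random.default_rng(1)
n, n_predictors = 200, 50
X = rng.normal(size=(n, n_predictors))  # candidate predictors (all noise)
y = rng.normal(size=n)                  # response with no true relationship

explore, confirm = slice(0, n // 2), slice(n // 2, n)

# Exploratory phase: snoop freely, pick the predictor most correlated with y.
r_explore = [abs(np.corrcoef(X[explore, j], y[explore])[0, 1])
             for j in range(n_predictors)]
best = int(np.argmax(r_explore))

# Confirmatory phase: test only that one pre-registered hypothesis, on data
# the exploratory phase never touched.
r_confirm = np.corrcoef(X[confirm, best], y[confirm])[0, 1]
print(f"best predictor in exploration: #{best}, |r| = {max(r_explore):.2f}")
print(f"same predictor on holdout:     r = {r_confirm:.2f}")
```

On noise, the exploratory correlation looks impressive (it’s the winner of 50 tries) while the holdout correlation is typically near zero: the firewall catches the snooping, which is exactly what you want.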
I have to admit Jeremy, I find this argument against data sharing a little….odd. It isn’t a problem with sharing per se, it’s a problem with analysis. I think one of the advantages of data sharing is that others can assess your data too and see how robust your conclusions are using other approaches, which can then allow a discussion about the confidence in the results. I think this story about debt ratios is a good example of why broader scrutiny and reanalysis can be very important in science:
I think the better discussion for us to have is how we should be thinking about false positives, not what your post seems to be implying: that the risk of false positives should preclude data sharing.
If you didn’t get it then it means I didn’t write the post well, Morgan. My bad. Let me try again:
To be clear, I didn’t say that the risk of false positives should “preclude” data sharing. I mean what I said in the post: data sharing is a good thing on balance. But saying that data sharing is a good thing on balance isn’t the same as saying it’s an unalloyed good, all upside with no downside whatsoever.
Others analyzing your data using other approaches might be a way to assess the robustness of your conclusions. The case in economics that you link to is the same one I linked to in the post and have discussed in previous posts. That’s a case where reanalysis by others discovered out-and-out errors. As I said in the post, that’s a great argument for data sharing.
But others analyzing your data using other approaches can also be a way to add noise rather than signal. For instance, I think this is happening with the Adler et al. dataset on plant diversity as a function of productivity. Lots of people don’t like the conclusion Adler et al. drew. So they’re going back and doing their own preferred analyses until they find one that yields the conclusion they want, then arguing that their analysis is the “right” one and all the various analyses Adler et al. did are the “wrong” ones. That doesn’t strike me as testing robustness. That strikes me as data-dredging.
Does that help?
It’s also possible that she doesn’t get it because she doesn’t agree. You made your argument well. It will convince some; however, since it’s not *purely* a matter of fact, some will disagree.
I took Morgan’s comment to be a mix of “I don’t get it” and “I get it but I disagree”. Insofar as someone like Morgan (who’s very sharp, a regular reader, and a friend) doesn’t get it, that means I could’ve written the post better. Insofar as she disagrees, yes, that’s absolutely fair enough. As you say (and I agree 100%), the issue I’m raising is not a pure matter of fact, and so there’s scope for reasonable disagreement.
Whether or not I got it and agreed or disagreed really depends on whether Jeremy is arguing that false positives constitute a valid reason for not sharing data (in which case I got it but disagree). But since Jeremy says his argument was not that this precludes data sharing, just that he’s highlighting an issue that is inherent to data analysis generally, then I didn’t get it and I agree with Jeremy.
Do others not liking a paper and reanalyzing the paper’s data create noise? Yes. But as Jeremy points out most issues are about the balance of positive and negative. I wonder whether the increased noise over something controversial is more detrimental than someone publishing a result with a flawed analysis that is highly influential and difficult to independently assess w/o spending our limited $$$ to conduct follow-up experiments? We all know influential experiments that set entire research areas on fire. I’d rather there was open assessment of the robustness of the original study before people spend their careers studying it….
Jeremy, just to clarify: it wasn’t Andrew Gelman making the statement quoted at the beginning of the post. The quote comes from the paper by Simmons et al that is linked to from the Gelman site. Great site by the way (yours that is)
Good catch, will correct the post. And thanks, glad you like the blog!
It seems to me that one big issue is whether data sharing leads to the data being used for the same types of questions over and over or not. As a practicing macroecologist, my tendency is to ask a question, go find 5-10 datasets (or in some cases 100 datasets) that could be used to answer that question. In most scenarios (90%+) the datasets were never analyzed with my question in mind. As Florian & Morgan are getting at, I think this is actually a pretty strong path to avoiding false positives. If I ask my question before I have selected my data, and then get multiple datasets, and the datasets were collected for goals other than my question, there is essentially zero chance of the circular false positives you are worried about. Indeed, probably less than the traditional one ecologist who collects their own data to answer one question.
The major danger in this scenario is post hoc filtering of datasets, which messes up statistics just as much, and is just as unethical, as post hoc data-point removal from a single dataset (unless you clearly document that you’ve done it, in which case it’s ethical – but you have the uphill battle of convincing the readers it was justified and that you’re not introducing these types of false positives).
Perhaps one way to put the point of the post is to say that data sharing is an amplifier. As you say, if you’re doing your stats well, having more data increases your ability to separate signal from noise. But if you’re doing your stats poorly, data sharing amplifies that, too, by making it more likely that you’ll get a false positive.
So at one level you could say, as Morgan does, that the “real” problem is people doing stats badly rather than data sharing. I think that’s a reasonable stance to take. On the other hand, given the strong evidence that a lot of people do their stats badly, I think it’s reasonable to worry about amplifying that problem. Data sharing makes it more urgent than it already was to address “researcher degrees of freedom”.
p.s. And yes, among the issues here is people using the same data over and over to address the “same” question but in different ways. I put “same” in scare quotes because in practice it’s never the “same” question, not from a statistical perspective. That’s why in another comment I raised the example of re-analyses of the Adler et al. 2011 dataset that are starting to come out. The biological claim people are interested in isn’t so precisely defined as to place many constraints on how one processes and analyzes the data. So you’re seeing people do different analyses that are all purportedly addressing the “same” biological hypothesis (“is diversity a humped function of productivity?”) but which from a statistical point of view are totally different (e.g., a conventional regression and a quantile regression are two totally different statistical analyses). That those different analyses come out differently doesn’t mean that the conclusions aren’t “robust”, it means that different statistical analyses test totally different hypotheses, and the statistics can’t tell you which of those hypotheses is the “right” one to test from a biological perspective. I have an old post on this: https://dynamicecology.wordpress.com/2012/05/02/advice-on-choosing-among-different-indices-of-the-same-thing/
“The fewer people who explore a dataset, the fewer false conclusions get drawn.”
One could also just say, the fewer people who explore a dataset, the fewer conclusions get drawn. Period.
In many cases, the uses of shared data are, believe it or not, more than just creating an academic discussion about generalizable theories and frameworks. The data drives decision making. Coming at ecology from an applied angle (restoration ecology – what Nick Gotelli might refer to as “environmental science”), both researchers and managers need to be able to explore the data that is gathered. Often, the applicable conclusion may not be that which was drawn at the initial scale of interest. Therefore, data needs to be in the open for all to see and for all to explore.
For example, let’s say someone has collected riparian vegetation data at reaches across the entire Columbia Basin, analyzing it at the same scale. Another individual works for the BLM and is interested in riparian rangeland health, and another individual coordinates a local watershed council, and wants to look at relationships between vegetation and logging or some other disturbance. These individuals aren’t interested in data at the initial scale of inquiry – they want to analyze the data collected on their agency’s land or the data within their targeted sub-watershed. Because the questions that these on-the-ground folks may ask are at much different scales than the original question, they probably will draw different conclusions than the original researcher. These conclusions will hopefully guide their applied actions. This is one of the ideal things about data sharing and exploration – if you provide the raw data, people can ask and answer different questions, some of which, as Brian mentions, were not the questions initially asked during data collection and publication. How can you change the spatial resolution of big data to scales meaningful for the applied realm if the data isn’t made available?
In an era of rapid global change, sharing scientific data may be one of the few ways that ecologists can do applied science that matters to stakeholders in real time. If you can’t detect the cause of a problem, you can’t propose a solution, especially if there are geographic issues or issues of scale in how much data you need to meet your applied science and management objectives.
Sure. I don’t disagree; see reply to Brian’s comment.
I’d only note that, analyzing a dataset to address a different question than the one the data was originally collected to address is neither necessary nor sufficient, in and of itself, to ensure that you’re not making the sort of statistical errors I describe in the post. For instance, your BLM manager who wants to look at the relationship between vegetation and logging disturbance only on the agency’s land will still be making a mistake if for instance he first correlates a bunch of different measures of logging with a bunch of different vegetation metrics and then decides to manage based on the nominally-significant correlations. Even managers, and even people using data for new purposes, have many “researcher degrees of freedom” available to them.
As you mention, there are still those analytical “degrees of freedom.” I may be giving the BLM too much credit in assuming that they have the people for the task, but that’s another tangent entirely. Besides, they’d get sued and have to do a NEPA based on the study’s conclusions anyway no matter the conclusion.
I concede that the assumptions of a given data set being right for other questions and that data will be correctly analyzed are big ones, but without open data (which is what this is really about), no answers can be had.
The title of the post is unduly antagonistic. The points you raise have very little to do with data sharing other than that it increases the availability of data that people can and will abuse.
Most of your points are valid criticisms of poor standards in statistical practice and issues with the publishing system – if you can get a paper published by showing a reanalysis of the data using a different approach (but not why the original results are wrong) then there is something wrong there too.
Your linking this to data sharing is flawed though. Sharing data doesn’t make people any more likely to abuse statistical practices. It will increase the opportunities for those things to happen, but that is a different point. The post seems set up to point the finger squarely at data sharing when a thorough reading reveals that this is not the entirety of your argument.
In your reply to @Morgan above, you try to argue that “on balance” data sharing is a good thing. If on the other side of the scales you are weighing bad statistical practice, data dredging, etc., should the same criticism be levelled at statistics and the statisticians who come up with the methods we deploy? No, that would be absurd. Is statistics “on balance” a good thing rather than a profoundly good thing? Of course it isn’t; that people abuse statistics doesn’t undermine at all the benefits of having those methods available in the first place.
I think the post and my replies to the other commenters address most of your points, so I won’t repeat myself here.
If a thorough reading of my post makes clear what my argument is, well, that’s why I wrote the post! No post can be fully summarized in the title. Yes, I choose titles that I think will draw readers, but I disagree that my title here is misleading. Indeed, I try to avoid misleading titles, because I want people to read, understand, and engage with my posts. A title that misrepresents the post gets in the way of that, by annoying the reader. Unfortunately, despite my best efforts sometimes some readers will find a post title misleading, as you did in this case. But there’s no way to prevent that entirely except by having completely uninformative post titles. So sorry you didn’t like the title, but that happens. Thank you for nevertheless reading through to the end and taking the time to comment.
Afraid I don’t understand your criticism of my reply to Morgan. Yes, statistics is “on balance” a good thing. Not sure why you see that as an absurd claim, I think it’s a rather obvious claim. But I suspect we’re talking past one another here.
I think the quote from Lindenmayer & Likens is rather revealing, and actually insulting. I don’t know a single person who uses shared datasets that does the analysis before the question! In fact, as I already outlined, the fairly high cost of downloading and making usable a shared dataset almost guarantees nobody is doing this. On the other hand, I think a very high fraction of people who analyze data they collected themselves do exactly what Lindenmayer & Likens attack all the time – changing hypotheses to fit what the data shows – as Terry pointed out in his nice post you linked to. So they have it exactly backwards.
I know this is not where you’re coming from Jeremy, but the Lindenmayer & Likens quote kind of exudes a snobbish judgement of/prejudice against people who use shared datasets that is entirely divorced from facts. It is a not-uncommon attitude that people who use shared data face all the time in career-impacting ways (see recent moans by distinguished field ecologists that NCEAS is destroying ecology). If you’re getting more pushback on this post than you expected, I think that is why.
Ah. I read L&L as raising more or less the same issue I raised, rather than singling out those who analyze data they didn’t collect themselves for special criticism. But yes, now that you point it out, I can see how one could read L&L in the way you do. I agree with you that there’s no reason to think that, as a group, people analyzing other people’s data are either more or less prone to statistical mistakes than people collecting their own data.
I suppose if I’d recalled some previous things Lindenmayer & Likens have written (complaining about what they see as the growth of meta-analysis at the expense of primary data collection), I might have read their correspondence as you did.
I’ll update the post to clarify.
EDIT: Here’s an old post of mine making fun of Lindenmayer & Likens’ complaint that mathematics and meta-analysis are destroying the place-based, natural-historical “culture” of ecology: https://dynamicecology.wordpress.com/2011/10/05/mathematics-vs-natural-history/ I strongly disagree with their view that analyzing data others have collected (or doing mathematical theory) is somehow inherently inferior to, or competitive with, collecting and analyzing one’s own data.
And here’s an old post linking to actual data on what mix of work is published in ecology, and how that mix has changed over time:
Bottom line: there’s basically no sign that meta-analyses and modeling are driving out papers reporting new data collected by the authors.
I’ll just use my favorite reply here: I agree with what Brian said and there’s nothing else intelligent for me to add to it. 🙂
“If you’re getting more pushback on this post than you expected, I think that is why.”
I actually expected a *lot* of pushback, so I haven’t gotten more than I expected. If anything, I’ve gotten less. But I didn’t expect the pushback to take quite the form it’s taken so far.
I expected more people to misread me as opposing data sharing. But not many people have misread me that way as far as I can tell, and those who have have been satisfied with my clarifications.
I expected a lot of denial that the “researcher degrees of freedom” issue even exists, much less that it might be amplified by data sharing. But so far it seems like most everyone agrees the researcher degrees of freedom issue is a serious problem, just disagreeing on whether data sharing is likely on balance to make the problem worse. And while my own suspicion is still that data sharing will make it worse (because I expect to see a lot more of the sort of problematic analyses described in the post), you and Morgan and other commenters are absolutely right to highlight cases where data sharing has helped solve “researcher degrees of freedom” problems rather than making them worse. Cases where data sharing allowed others to catch somebody abusing their researcher degrees of freedom. Cases where it allowed researchers to better separate noise and signal by giving them bigger sample sizes. Etc.
I didn’t expect that the post would be read as a specific attack on the statistical or scientific competence of those whose work is based on data collected by others. That’s not at all how I meant the post (or the update quoting L&L), and hopefully my most recent update clarifies that.
I really like the topic, food for thought! Regarding own versus others’ data: I suppose the time-investment argument can work both ways. The data collector will have invested more time on any single dataset (can’t argue with that?), so will be more inclined to start snooping around for patterns not initially intended. The analyser, on the other hand, has such massive statistical power that he/she can scan more data and find patterns that may in some instances be caused by obvious confounding factors not of primary interest, “lost in translation” (and which might be known to the persons collecting the data). Not deciding who has the biggest problem here, though, and pro data-sharing.
Hi Jeremy, A thoughtful post as always! I can’t help but jump in as I suspect one of my posts at BioDV lent some inspiration for this post.
I just want to reiterate one point (with an example) and make one new point:
(1) the one that has already been made: that data sharing != poor statistics. In fact, I would argue the opposite. Making data available may increase the chances that someone will do an even better job than you at analyzing it. An interesting example: https://en.wikipedia.org/wiki/Netflix_Prize. Yes, you may get many people who do it worse, so the issue here is separating the wheat from the chaff. But that is another issue entirely… (perhaps one of you would kindly share data before publication and let us all take a crack at it? 😉 Mad props if you do — I’m not sure I could be so brave!)
(2) Putting aside the questions-first vs. questions-later approach*, let us not forget that patterns exist independently of statistics. I can look at two means, 10 +/- 1 and 10 billion +/- 100, and tell you they are different without having to run a t-test. Similarly, I can look at a plot and tell you whether or not a relationship is linear vs. non-linear (ok, in some circumstances). Just because I fit a model that doesn’t support my interpretation doesn’t mean the pattern doesn’t exist; it’s simply that I have failed to describe it within a certain tolerance of error using the predictors I have chosen and my dataset. This may be one argument for “big data.” By incorporating more information and using techniques that guard against overfitting (e.g., model selection via BIC), you can potentially explain the “noise” and evaluate whether the underlying pattern is statistically supported or not. I think you are taking the other track, which is to say: if the pattern is not supported with a simple model, then how much do we really gain by explaining away the noise with a more complex model? We’re still left with a small (but “significant”) effect relative to everything else. I have mixed feelings about this. Nature is complicated and one effect size does not fit all. But generally I’m envisioning large-scale questions; I would shy away from this approach for, say, experimental data.
*for the record, I have always advocated a questions-first approach. However, that doesn’t mean the questions won’t change as I work through the data; that is simply a product of insight gained through increased access to information. Had I known about a certain relationship beforehand, wouldn’t I have structured my hypothesis differently? What if my goal is prediction? Wouldn’t changing the model after seeing the data be more appropriate than sticking with the original model?
Thanks for your comments. Re: data sharing being more likely to increase the odds that somebody will reanalyze the data but with better stats, sure, it can happen. But I think that’s actually fairly rare, at least in ecology. I think it’s much more common for others to reanalyze the data using *different* stats and then for people to get into inconclusive arguments about whose stats are “better”. So I don’t think “separating the wheat from the chaff” is actually a totally separate issue–I think it’s very much part of the issue here.
Not sure what you mean by saying “patterns” exist independently of statistics. If I randomly generate 100 independent random variables and then do all possible pairwise correlations, I’ll probably find numerous very strong correlations. And yes, those correlations are there, in that particular dataset. But I wouldn’t call those correlations “patterns” because they’re meaningless noise. If you did the same simulation again, you wouldn’t expect to find those same variables to be strongly correlated. But I know you know this and so I’m not quite sure what you’re getting at here. And yes, there are estimation and model selection techniques that can help us control overfitting (AIC, cross validation, lasso and other shrinkage methods, etc.), although those techniques are of course not foolproof.
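Jeremy’s thought experiment takes only a few lines to run. A minimal sketch, assuming numpy (the sample sizes and the 0.5 “strong” threshold are my own choices): generate 100 independent variables, compute all pairwise correlations, and count how many look impressively large despite being pure noise.

```python
import numpy as np

rng = np.random.default_rng(0)
n_vars, n_obs = 100, 20
data = rng.normal(size=(n_vars, n_obs))  # 100 truly independent variables

# All pairwise correlations: 100 * 99 / 2 = 4950 of them
corr = np.corrcoef(data)
pairwise = corr[np.triu_indices(n_vars, k=1)]
strong = np.abs(pairwise) > 0.5  # "strong" by eyeball standards

print(f"{strong.sum()} of {pairwise.size} pairs look strongly correlated")
```

None of these “strong” correlations would be expected to reappear in a second run with a different seed, which is the sense in which they are noise rather than patterns.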
You raise a very important and challenging issue at the end, to do with how one updates or modifies one’s hypotheses in light of the data and then goes on to further test those modified hypotheses. This is what we all try to do as scientists, but it’s trickier than we often think to do it well. For instance, if you have some theoretical model that fails to fit some data, no matter what parameters you pick, that’s good evidence against that model. And if you then modify that model in a way that seems like it might help the model fit the data better, that’s good too. But if you then take the improved fit of the modified model to the data as evidence for the model, you’ve slipped into circular reasoning, since the whole reason the modified model fits better is that it was constructed to fit better. That’s not a good idea, I don’t think, even if your goal is prediction rather than something else.
But like I say, it’s a tricky issue, much deeper and trickier than that little hypothetical model fitting example suggests. For instance, there’s a whole philosophical literature on the “old evidence” problem. Why is it that scientists sometimes take old evidence that was already known as evidence in favor of a newly-proposed hypothesis? When does the fact that the evidence was already known compromise our inferences, and when does it not? Deborah Mayo is one philosopher who’s written about this, but there are many others.
I agree that data sharing leading to successful reproduction of results is rare in ecology, probably because confirmation of results rarely leads to a fancy new publication. Is there a role for blogs in all of this? A Netflix Prize for ecology? One must also consider whether the data are appropriate for the question. For instance, with the Adler 2011 data you raised, I’m not sure it adequately addresses the question posed (though it is probably one of the best datasets we have available to answer it: catch-22). That is compounded by the many different kinds of analyses that have been thrown at it (all of which, I should note, have been developed in a hypothesis-testing framework, which in my opinion is not the best one when attempting to understand functional forms).
Haha, my second point got away from me a little bit, but I think it’s still relevant. You suggest seeing (or not seeing) patterns in the data and bending over backwards to validate them statistically (or wrangle some meaningful pattern from the ashes). My previous comment was about statistically validating patterns. Say there is a strong positive effect of X on Y, but the standard error on the estimate is quite high. Ok, so you did a bad job collecting enough samples, or the system is just inherently noisy. I don’t think it’s irresponsible to attempt to construct a better model to help identify sources of noise in the system, in the same way it’s not irresponsible to bootstrap to get around sample size issues. One way to accomplish the former may be to bring in additional predictors from other sources (i.e., “big data”) that are relevant to the relationship at the time and place the data were collected (I’m thinking primarily environmental data). (My comments re: overfitting were to offset the inevitable “Oh but of course a model with more predictors is a better model!”) What is not OK is to cherry-pick subsets of data to support a certain viewpoint or throw out outliers without justification, as you suggest.
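The bootstrap mentioned above can be sketched in a few lines. A toy example, assuming numpy (the sample, seed, and number of resamples are mine): resample a small dataset with replacement many times to put a confidence interval on its mean.

```python
import numpy as np

rng = np.random.default_rng(42)
sample = rng.normal(loc=10, scale=3, size=25)  # a small, noisy sample

# Resample with replacement many times; the spread of the resampled
# means gives a confidence interval despite the small n.
n_boot = 10_000
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(n_boot)
])
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
print(f"mean = {sample.mean():.2f}, bootstrap 95% CI = ({ci_low:.2f}, {ci_high:.2f})")
```

Of course, resampling can’t manufacture information that isn’t in the sample; it just gives an honest picture of the uncertainty that comes with the sample you have.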
As far as separating the wheat from the chaff, the metaphor has not been followed through to its logical conclusion. Ok, open data will lead to a proliferation of analyses, some better, some worse. Which make it into the lit? From what I can tell, your concern is that the market gets flooded with *both* types of analyses, and that just muddies the issue (maybe irrecoverably?). Is this a failure of the journal/peer-review system? I’m not sure it is, but I’m not sure it’s not. In other words, I need to think on it some more!
You make an interesting point about old evidence, which is something that has always soured me on Bayesian inference. Sometimes I feel like not constraining my expectations based on prior information may be wiser (e.g., taxonomy informing phylogeny), but I’ve yet to see a compelling discussion of when Bayesian methods may or may not be appropriate (well, except in the case where you have no priors!).
Thanks for the convo, Jeremy!
I think this is an excellent point, but you have to keep in mind the denominator too: increased data sharing will increase the total number of papers, as well as the number of wrong papers. I imagine that the fraction of wrong papers will decline. Imagine that Group A finds a result, writes a paper about it, and shares the data. Group B reanalyzes the data using a different technique, perhaps combines the data with other shared data sets, and gets a different result. Since the data have now been analyzed twice using different techniques, shouldn’t we expect the fraction of bad results to decrease?
In any case this highlights the fact that a single paper should never, never, never be taken as fact. Facts are not established until an idea has been retested (several times, by several groups, using several experimental designs). In that sense, I think the positive effect of data sharing on our ability to generate hypotheses has to outweigh the likelihood that data sharing will increase the number of incorrect results.
“Imagine that Group A finds a result, writes a paper about it, and shares the data. Group B reanalyzes the data using a different technique, perhaps combines the data with other shared data sets, and gets a different result. Since the data have now been analyzed twice using different techniques, shouldn’t we expect the fraction of bad results to decrease?”
Not necessarily, for the reasons discussed in other comments. Having said that, maybe you’ll turn out to be right that data sharing will lead to a decreasing rather than increasing fraction of papers reporting bad results. I’m modestly pessimistic on that, but I’m just guessing, so I could well be wrong.
Your point about how a single paper should rarely be taken as fact is a good one. Unfortunately, it happens a lot. Early papers that say one thing can become conventional wisdom pretty quickly, with researchers mostly ignoring or refusing to believe contrary results published subsequently.
I’d hate to disappoint you by not getting as much pushback as you thought, so let me add a bit more. I already agree with many of the comments made by Gavin, Morgan, Brian and others. It seems that the root of your criticism is twofold. The first is that ecologists are bad statisticians and don’t approach their data with proper philosophical or statistical rigor via a priori hypotheses. The second is that in an era of scientific credibility measured by the number of publications a researcher has, this creates more opportunity for a quick paper grab. In the current academic climate, it is easy to see how the latter drives the former. Yet these things don’t change at all whether ecologists are downloading data or counting plants in quadrats. At best, small-scale science just slows the pace of these two nefarious forces. Shutting off access to big data hardly solves the root problem; you’re just turning big data into a straw man.
I think the difference between yourself and those giving you pushback is that you are taking a cynical perspective on ecologists. Ecologists have all these bad habits (statistically) and they’re all grubbing for easy publications, so if we make less data available, we’ll slow the rate of bad science. I think the real divide here is between those who have a cynical (perhaps they might argue more realistic) view and those who are scientific optimists. The optimists see all this data as a way of testing general theories, asking bigger and different questions, etc. Who knows what insights lie in the mountains of data that are out there? Certainly I don’t, but I do know that if it’s not open and accessible, no one will ever find out.
Well, since you’re echoing other commenters I’ll let my replies to the other commenters stand. I’ll just take the opportunity to re-emphasize that I’m no saint here. In raising the issue I’m raising I certainly don’t mean to put myself up on a pedestal as the only ecologist who knows how to do statistics! For instance, I have papers which don’t report exploratory analyses that got left on the cutting room floor. And I’ll re-emphasize (again!) what I’ve already said in the post and in other comments: I’m for data sharing. I don’t want to ban it. You’ve read the post and the comments where I’ve repeatedly and explicitly said that I’m in favor of data sharing and don’t want to ban it. So I’ll admit I’m a little miffed to see you raising the specter of “shutting off access to big data”.
I don’t know that I’m a pessimist or a cynic (or a realist) and that the folks pushing back are optimists, though I can see why it might seem that way. I’d say I and the folks pushing back have different backgrounds and experiences, and so tend to worry about different things. Which is fine (Brian and I have talked in other contexts about how we worry about different things, for instance). The other thing that’s going on is that I’m a contrarian, though hopefully not in a pointless way (because you can always find spurious reasons to criticize anything). Like I said, I knew I’d get pushback on this post. And I think the post has led to a productive conversation, which makes me hopeful that at least in this case I’ve been productively rather than pointlessly contrarian. Though of course everyone has to judge that for themselves. Perhaps some readers find this whole conversation totally unproductive, I don’t know…
Hi Jeremy, it’s clear that as a general principle we think more data is better than less, but Nate Silver’s point (made earlier by another cultural icon, Malcolm Gladwell) about there being a limited amount of truth is a critical one. More information is always a positive thing, but more data… not necessarily so. The assumption among people who don’t work with data regularly (and perhaps some who do) is that data and information are synonymous, but data that are not relevant to the question, or that are poorly measured, add noise rather than signal and detract from our ability to make proper inferences.
This brings me back to my pet point – the way you distinguish between data and information is by identifying which data can be used to make better predictions on an independent data set. If we took that message to heart, it would be conceptually (perhaps not operationally) simple to find the needle in the haystack. And, of course, we want to get as close as we can to the entire haystack, because we want to be sure the needle is in there. Best, Jeff.
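Jeff’s criterion (data count as information only insofar as they improve prediction on an independent data set) can be sketched with a simple holdout check. This is a toy Python example with made-up simulated data, not anything from the comment: fit ordinary least squares on a training half, score it on a held-out half, and compare a model with an informative predictor against one that also carries a pure-noise column.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.uniform(0, 10, n)
y = 2.0 * x + rng.normal(0, 1, n)  # x carries information about y
noise = rng.normal(size=n)         # "data" with no information in it

train = np.arange(n) < 150         # fit on the first 150 points
test = ~train                      # judge on the held-out 50

def holdout_mse(predictors):
    """Ordinary least squares on the training half, scored out of sample."""
    X = np.column_stack([np.ones(n)] + predictors)
    beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
    resid = y[test] - X[test] @ beta
    return float(np.mean(resid ** 2))

mse_signal = holdout_mse([x])
mse_with_noise = holdout_mse([x, noise])
print(mse_signal, mse_with_noise)
```

Both out-of-sample errors should come out near the irreducible noise variance: the extra column improves the in-sample fit a little, as any extra column must, but buys essentially nothing on the independent half, which is the operational sense in which it is data rather than information.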
On the whole I support more data sharing, primarily from the perspective that most research is taxpayer funded. However, I don’t think I’ve ever done an analysis that was without a bug on the first run through. Usually those bugs are so egregious that the analysis is clearly wrong, but sometimes they can produce counter-intuitive results that are still plausible. This tends to prompt me to do more checking of the data and code before I believe the result. Someone less familiar with the system, using the data posted online, may be less conservative about accepting counter-intuitive results. Does knowing the system and the data collection save you from more goofs? How does this trade off against biasing your results towards just the things you expected?
I believe there are probably a lot of undetected goofs in the literature. In the Reinhart-Rogoff case data sharing led to catching the error, but I wonder if this is more than offset by new errors arising from others analyzing data they know little about and having a less conservative threshold for accepting counter-intuitive results. Cheers, Paul