Alternative post title: tell me again what data sharing is for?
Recently I polled y’all on whether providers of data on public repositories such as Dryad are entitled to co-authorship of any future papers that use those data. I was motivated to poll on this because of my sense that both data-sharing practices and authorship practices in ecology are changing. I’m interested in the interplay of those changes. And whenever practices or norms are in the midst of changing, we can expect substantial disagreement about the changes. Anyway, here are the poll results, along with some commentary.
Sample size and profile of the respondents
We got 280 responses (thanks everyone who responded!). Not a random sample from any well-defined population obviously, not even “the population of readers of Dynamic Ecology.” But still, it’s a larger and probably more representative sample of ecologists than you could get just by reading social media or asking your friends’ opinions, so it seems worth talking about.
This blog has lost grad student readership over the years, so the respondents skewed more senior than respondents to our polls used to. Respondents were 41% faculty, 28% postdocs, and 17% grad students. Also 8.5% non-academic professional ecologists, and 5% other.
We didn’t ask about geography, but in all our past polls the geographic distribution of respondents has closely matched the geographic distribution of our pageviews. So it’s safe to assume that about 50% of respondents are currently based in the US, 10% in Canada, the rest in other countries (primarily but not exclusively European countries, Australia, and Brazil).
The poll asked “Do you think that those who collected data shared on public repositories such as Dryad are entitled to co-authorship of papers that use those data?” Respondents were pretty evenly split between those who answered “No” (42%) and those who answered “Maybe/not sure/it depends” (47%). Only 11% answered “Yes”.
More senior respondents were more likely to say “Yes”. 17% of faculty and 13% of non-academic professional ecologists said “Yes”, vs. only 8% of postdocs and 4% of grad students. It’s possible that’s just a sampling error blip, but I highly doubt it.
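For anyone who wants a rough check on the “sampling error blip” possibility, here is a minimal sketch in Python. It rests on assumptions: the poll’s raw counts aren’t reported here, so the code backs out approximate counts from the rounded percentages above and the 280 total, and those counts could be off by one or two. It compares only the two ends of the seniority gradient (faculty vs. grad students) using a one-sided Fisher-style exact calculation. The function names are made up for illustration, and this is not an analysis that was actually run on the poll data.

```python
from math import comb

TOTAL_RESPONSES = 280  # stated in the post

# Shares of respondents and "Yes" rates as reported in the post (rounded).
faculty_share, faculty_yes_rate = 0.41, 0.17
grad_share, grad_yes_rate = 0.17, 0.04


def approx_counts(share, yes_rate, total=TOTAL_RESPONSES):
    """Back out approximate (group size, number of 'Yes') from rounded percentages.

    These are reconstructions, not the real counts.
    """
    n = round(share * total)
    return n, round(yes_rate * n)


def prob_at_most(k, group_n, total_yes, total_n):
    """Hypergeometric P(X <= k).

    Chance that a group of size group_n contains at most k of the total_yes
    'Yes' answers, if answering 'Yes' were unrelated to group membership.
    """
    denom = comb(total_n, group_n)
    numer = sum(
        comb(total_yes, i) * comb(total_n - total_yes, group_n - i)
        for i in range(k + 1)
    )
    return numer / denom


faculty_n, faculty_yes = approx_counts(faculty_share, faculty_yes_rate)
grad_n, grad_yes = approx_counts(grad_share, grad_yes_rate)

# One-sided question: if seniority didn't matter, how often would grad
# students give this few "Yes" answers (or fewer) by chance alone?
p_one_sided = prob_at_most(
    grad_yes, grad_n, faculty_yes + grad_yes, faculty_n + grad_n
)

print(f"faculty: {faculty_yes}/{faculty_n} Yes, grad students: {grad_yes}/{grad_n} Yes")
print(f"one-sided p = {p_one_sided:.3f}")
```

A small p-value here would be consistent with the gradient being more than a blip, but with reconstructed counts and a single two-group comparison chosen after seeing the data, it’s a sanity check rather than a formal test.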
These results indicate substantial yet circumscribed disagreement among ecologists as to the question asked. Substantial disagreement in that there’s no single view that commands a substantial majority of opinion. Rather, the bulk of ecologists are split pretty much down the middle between those who think that data on public repositories are fair game for others to use as they see fit without involving the original data providers, and those who think that there are circumstances in which the original data providers are entitled to co-authorship. Circumscribed disagreement in that few ecologists (and even fewer junior ecologists) think that you’re entitled to co-authorship of every paper based on data you uploaded to a public repository. (So if that is your view, well, whether you’re right or wrong, I think you’re going to need to find some way to reconcile yourself to the fact that most ecologists don’t agree with you.)
I can imagine various hypotheses to explain the gradient of opinion with respect to seniority. Senior ecologists are more likely to have accumulated long-term datasets that will form the basis of many future papers, and so are more likely to worry about being scooped if they have to share those data and give up any claim to co-authorship. Senior ecologists also are more likely to have formed their views on data sharing and authorship during a time when data sharing wasn’t widely encouraged or practiced. A colleague suggested to me that more junior ecologists may be more likely to see their own research programs (and perhaps scientific progress as a whole) as data-limited, and so highly value easy access to (and easy use of) data. Whereas more senior ecologists are more likely to worry about being time-limited rather than data-limited (Meghan has a good post on this). Any other hypotheses?
There may of course be other variables besides seniority that might predict differences in opinion on this issue. I wasn’t sure how to poll on them, plus I wanted to keep the poll short. Comments welcome on this.
The poll results don’t surprise me. I think they resonate with results of our old poll asking about whether providing data originally collected for another purpose is on its own a sufficient contribution to a paper to merit co-authorship. There’s substantial disagreement on that, with the level of disagreement depending on how much data is provided.
The poll results do raise a big question for me, which I hope commenters will help me think through. If you think that providers of data on public repositories are sometimes or always entitled to co-authorship of any paper using the deposited data, what do you think public data sharing is for? Because if you put your data on a public repository, yet others have to get your permission to use the data for their own purposes (at least in some circumstances), isn’t that basically the same as the old system of “I keep my data private; if you want to use my data, email me and we’ll talk”?
I mean this as a sincere question, not a rhetorical one. I tried to answer my own question–what is public data sharing for, if not to make the data available to others to use without having to get permission from the data providers? Here are the answers I came up with, in no particular order. Did I miss any?
1. Publicly sharing your data is a way of advertising it to potential collaborators. It’s a bit like a company offering a “try before you buy” option on its products. People who have a look at your data and like what they see can then contact you to offer you co-authorship in exchange for your permission to use the data.
2. Publicly sharing your data (and analysis code, if there is any) allows others to reproduce the results in your papers. This can help them better understand those results, and can reassure them that the results are free of (certain kinds of) errors.
3. Publicly sharing your data (and analysis code, if there is any) allows others to check for anomalies, particularly anomalies that might indicate fraud. In theory, this increases the probability of detecting anomalies because you can get lots of eyes on the same dataset. Although in practice, public data usually only gets checked for anomalies if there’s some independent reason to suspect anomalies. Public data sharing also increases the speed with which purported anomalies can be investigated, because investigators don’t have to wait around for the author to provide the data. And in theory, public data sharing deters fraudsters, because they’ll know their data might be scrutinized, and that they won’t be able to claim that the data have been lost. Although in practice it’s not clear how many scientific fraudsters are deterred by having to post data publicly. Possibly, it’s only people who would never commit fraud who see public data sharing as a deterrent to fraud. Fraudsters often make choices that look irrational to others.
4. Publicly sharing your data on a repository might be convenient for you. You can just tell people who have your permission to use the data “go download it from the repository”, rather than having to email it to them.
5. Publicly sharing your data on a repository might signal your personal commitment to (one form of) “open science”. You might want to signal that commitment as a matter of personal integrity, and/or because you hope that others will observe your behavior and do the same themselves.
To my mind, #1, #4, and #5 are all purely personal reasons for putting your data on a public repository. That’s not to downplay them, it’s just to say that they’re not reasons for any journal or funding agency to encourage or require public data deposition. #2 and #3 are reasons why journals and funding agencies might encourage or require public data sharing. But journals and funding agencies could achieve goals #2 and #3 in other ways besides encouraging or requiring public data deposition. They could for instance require authors to provide their data and code to the journal data editor. The data editor would check it for reproducibility and anomalies, and hold the data in escrow in case any questions were raised about it in future. So I dunno. I’m struggling to understand this, and I’m looking forward to learning from commenters with different perspectives on this.