Poll on co-authorship of papers using publicly available data

We talked recently about how Am Nat and other leading EEB journals are giving their data sharing policies teeth. Going forward, they’re going to require data sharing. And of course, for years now increasing numbers of ecologists have been posting the data underpinning their papers in public repositories such as Dryad, even if not strictly required to do so by journal policies.

This is an area in which scientific practices and norms are changing fast. I’m old enough to remember when there was no expectation that you’d share data you’d collected, much less that you’d be obliged to share it!

I’ve been wondering lately how data sharing rules are affecting people’s views on authorship. My own view is that data on public repositories are fair game. Anyone can download them and use them in any way, including in their own papers (for instance, meta-analyses), without offering co-authorship to those who originally collected and deposited the data. I think this view is consistent with the spirit of Dryad’s policies; data deposited on Dryad are made available under a Creative Commons Zero license. But it occurred to me that I have no idea if my own view is widely shared. And we know from past polling data that ecologists’ views on authorship issues don’t have much to do with official journal authorship policies. So perhaps ecologists’ views on authorship in relation to data on public repositories don’t have much to do with the repositories’ official policies.

So to get some anecdata on this, here’s a very short anonymous poll. I’ll publish a summary of the responses in a future post.

26 thoughts on “Poll on co-authorship of papers using publicly available data”

  1. If data I’ve archived were used in a meta-analysis or some other large study with lots of smaller studies included, I’d not expect any collaboration. Recently, this happened and the authors asked a bunch of questions in order to use the data effectively in their meta-analysis; but even extensive correspondence shouldn’t make me a collaborator/coauthor. I feel like that’s just making sure the data is useful. In contrast, if someone made my dataset the focus of their entire study, or even a large part of it, then I think collaboration/coauthorship is totally justified.

  2. Best practice for sharing data through any data repository is to provide a data use, authorship and acknowledgement policy with the data. Just because your data are in a public data repository does not mean that they are fair game, and data users should consult those policies to understand how to proceed. Copyright can and probably should be assigned to data, and many people will use a with-attribution copyright, which means that data contributors must be credited in any follow-on work. I have seen my own data used without credit even though this explicitly contradicts the copyright and data use guidelines. My feeling is that it is always best practice to contact the primary data contributor to check about all of this. Often there will be specific acknowledgements that you should be including, such as to funders or to local peoples, in addition to issues of attribution and authorship, that the primary data contributor can make you aware of if that information isn’t clearly indicated with the data. I personally prefer to use GitHub for data and code sharing because it is easier to share all of this information through readme files, data use documents and licences associated with the GitHub repository, which can also have a DOI to facilitate the appropriate referencing of the data. I would really encourage ecologists working with other people’s data to do some research on data use best practice, carefully investigate the data use policies of the data they are working with, and start conversations with data contributors to offset issues. Personal views on this topic may or may not reflect the best practice standards!

    • “Best practice for sharing data through any data repository is to provide a data use, authorship and acknowledgement policy with the data. ”

      Well, except that Dryad provides that for you. I don’t think you can override Dryad’s license by including in your upload a note that says (for instance) “if you want to publish these data, you have to first offer me co-authorship” or “these data cannot be used commercially”. Are you suggesting that authors try to override Dryad’s license somehow? Or are you suggesting that they avoid using Dryad because of the Creative Commons Zero license?

      “I would really encourage ecologists working with other people’s data to do some research on data use best practice and really carefully investigate the data use policies of the data they are working with…Personal views on this topic may or may not reflect the best practice standards!”

      I confess I don’t understand your claim to be describing “best practice” here. It sounds to me like it’s quite unclear what “best practice” is, or at least that there’s a lot of disagreement about what “best practice” is. After all, unless I’ve misunderstood (in which case my sincere apologies), it sounds like your claimed “best practices” directly contradict the practices of one of the most widely-used repositories in ecology! So rather than calling on ecologists to educate themselves on (what you say are) “best practices”, I think it might be better to say something like “this is an issue on which there’s no agreement as to what ‘best practices’ are, so you should take care to choose a repository that allows you to follow your own personal choices”.

      • Contacting authors is about more than authorship and goes beyond the official copyright assigned to a dataset. Getting in contact is about checking data use, acknowledgments, interpretation, correct citations, funding, etc. Just because someone uses a database like Dryad and sets a copyright with limited restrictions doesn’t mean that they don’t want to be contacted or that they don’t have relevant information to share. The only way to find out is to contact those data contributors and ask. I also personally use more flexible ways to make my data public relative to Dryad.

        I would say best practice is to contact the authors in almost all cases, even when working with large datasets. That is what we do in our lab as best we can. We also make authorship inclusion criteria really clear – data contributors who contributed X% to the overall dataset are offered authorship, data contributors who contributed <X% of the dataset are acknowledged, all data contributors are contacted about the study.

        Here is a recent blog post from my group on the topic of how much data cleaning is enough. Rigorous data cleaning is difficult to do without contacting at least some data contributors:

        Sharing is Caring: Working With Other People’s Data
        https://methodsblog.com/2020/09/04/sharing-is-caring-working-with-other-peoples-data/

        Here are two key articles that explore the complexity of making data public. There is also a fair bit of information on the Open Science Framework (https://osf.io/) that is relevant to this discussion.

        Open Science Isn't Always Open to All Scientists – Current efforts to make research more accessible and transparent can reinforce inequality within STEM professions.
        https://www.americanscientist.org/article/open-science-isnt-always-open-to-all-scientists

        Parker, T.H., Forstmeier, W., Koricheva, J., Fidler, F., Hadfield, J.D., Chee, Y.E., Kelly, C.D., Gurevitch, J. and Nakagawa, S., 2016. Transparency in ecology and evolution: real problems, real solutions. Trends in Ecology & Evolution, 31(9), pp.711-719.
        https://www.sciencedirect.com/science/article/pii/S0169534716300957

      • “Contacting authors is about more than authorship and goes beyond the official copyright assigned to a dataset.”

        Fair enough. I agree. But the post deliberately asks a narrow, specific question that I think is worth asking. Sometimes I think it’s helpful to talk about large, complicated, multi-faceted issues. But sometimes I think it’s helpful to just talk about one narrow aspect of large, complicated, multi-faceted issues. Sometimes trying to talk about all aspects of a large, complicated, multi-faceted issue at once leads to an unproductive conversation. People talk past one another, or struggle to agree on what the conversation is even about.

        Of course, narrow focused conversations can have their own drawbacks. People might miss the forest for the trees. Might waste time discussing an issue that isn’t worth discussing, or that could be easily resolved, if seen in a broader context.

        Here, I posed a narrow question because I don’t have a complete, fully-formed view on all aspects of data sharing, collaboration, and authorship. So I decided it would be worth learning what others think about one narrow but important aspect of the broader complicated issues. I did so knowing that others might prefer to comment about broader questions, and I appreciate you taking the time to do so.

  3. While I agree that co-authorship might be justified when it is the core of a subsequent study, I don’t think that co-authorship should be the default. In some cases, it might even be better to specifically avoid co-authorship.

    For instance, if a subsequent study used my data to support my original findings, then I think the generality of the findings would be strengthened if the later study is independent of my inputs. By contrast, if a subsequent study contradicted my original findings, then it also makes sense that the authors remain independent of my work (so that I don’t water-down their paper to make it more consistent with my original).

    In a third case, where my data is used in a wholly new way, then I would only expect co-authorship if I was already working on the same questions. Collaboration should be preferred over competition.

    As also mentioned by ‘teamshrub’ above, it is common decency to contact the owner of the data before publishing anything. If the data are free-use, then contacting the author is a way to avoid unnecessary competition or duplication.

    • “As also mentioned by ‘teamshrub’ above, it is common decency to contact the owner of the data before publishing anything. ”

      Is it though? I don’t feel like many meta-analysis authors ordinarily contact the authors of every data set included in the meta-analysis, especially not if the datasets can be downloaded from public repositories, grabbed from figures, or retyped from published printed tables. Does that mean that every meta-analyst is indecent?

      • Sorry, I wasn’t clear. I was referring specifically to studies where a single dataset is the core of a follow-up study. In such instances, it is decent behaviour to reach out to the authors just to confirm that they aren’t already pursuing the same idea. I do think it is indecent to scoop someone using their own dataset.

        In terms of meta-analysis, I think it would still be a good idea to contact the original authors, especially if the metadata or original paper aren’t 100% clear about the limitations of the data. But this is less about courtesy and more about ensuring that you’re not blind to the limits of the dataset.

        Aggregating data for meta-analyses is a skill in its own right. So, the owners of individual datasets can’t claim they are being scooped unless they are also aggregating data (in which case, they are doing exactly the same thing they are complaining about).

      • Ok, I’m with you now. I see where you’re coming from here. Not sure if I agree in the case of a single dataset collected by one person or group forming the core of a follow-up study by another person or group. But I’m unsure just because I’ve never written that kind of paper myself, and can’t think of many examples of such papers. Do you have examples in mind of papers in which somebody published a dataset, and then later they were scooped by someone else who published a paper based on that same single dataset? Honest question–I feel like I need a better “search image” for the kind of paper you’re thinking of.

      • Hi Jeremy, I’m replying here because wordpress won’t allow me to comment lower in the thread.

        “Do you have examples in mind of papers in which somebody published a dataset, and then later they were scooped by someone else who published a paper based on that same single dataset?”

        I don’t know of any instances where a research group was scooped using their own dataset, but I think it is possible for field data on metacommunities. For instance, a field campaign might repeatedly sample 25 ponds, collecting data on water chemistry, phyto- and zooplankton, fish, functional traits, spatial location, pond surface areas etc. Such a dataset could easily be published across multiple papers, testing different aspects of metacommunity theory.

        In such a scenario, the collectors of the data would realistically start by publishing a general descriptive paper that outlines the field methods in detail, so that they had something to refer back to in subsequent papers. However, if this first paper has open data, then there is a real possibility that other researchers could scoop some of the ideas.

        There are dozens of examples where the exact same metacommunity dataset has been recycled by others to test new hypotheses or statistical approaches. But off the top of my head, I don’t know of any cases where this was done at the expense of the owners of the original data.

  4. I co-authored a paper (Mills et al, Trends Ecol Evol) where we explained why we are not very keen to share our long-term field data. There is a myriad of motives: in my case, I live in Spain, where funding for field ecology is extremely scarce. I mean it. I have therefore paid for some of my campaigns out of my own salary, simple as that. That said, I’m more than happy to share my datasets with other colleagues, especially if there is a topic of common interest where I may participate beyond the role of data supplier.
    One further thought: I guess there is an age-dependent effect, i.e. young researchers tend to value the collection of long-term data less (my own students do), and the contrary holds for senior researchers who have spent part of their lives collecting those data.
    A final thought: Can someone from abroad correctly interpret the data I’ve collected over the years? In other words, I know the systems where I collect the data, the physical, biological and anthropogenic drivers, the occurrence of perturbations and extreme events, and the relationships with other species. I have no problem sharing all this information either, but I’m a bit concerned about the reliability of some meta-analyses or the like that simply take datasets as raw data matrices. The “classic” paper by Sibly et al published many years ago in Nature on density-dependence is a good example of how picking up data without knowing the systems and the organisms (e.g. life histories) may end up in a biased and flawed study (highly cited though).

    • Maybe this is getting off topic, but do you think that early career folks have less intrinsic interest in long term data (i.e. they think it is of less value to science), or is it that there are external/systematic constraints (publications for a job, tenure etc.) that dissuade them from investing time and resources in such projects? As an early career academic, I see tremendous value in LT studies, but feel like I need to be choosy about what I invest my time in at this point.

  5. Depends.

    For meta-analyses using data from multiple sources, no. No need for co-authorship.

    If using data from a single source to answer a new question, then one should probably at least contact the corresponding author behind the data. This should be in the interest of the new study – there’s always something additional you’d need to know about the data, so better contact the original author (and then it might be a good idea to offer coauthorship…?)

    If taking data from a database and publishing it as a major part of a new database (“data paper”), then definitely offer co-authorship, regardless of the license on the data. Whoever compiled the original database has definitely contributed enough…

  6. I almost entirely agree with Falko Buschke and Joacim Näslund, but want to point out one more thing: in the US, you can’t copyright data at all. If someone can access your data, they can use your data as they see fit. Dryad specifying CC-0 is useful for non-data things that are uploaded alongside tabular data, but as far as anyone working in the USA is concerned, even CC-0 does not apply to tabular data in the repository — and repositories attempting to specify other licenses aren’t actually able to do so.

    Now as other people have said, there’s a difference here between what’s legal and what’s right, and I do think that any analysis which is predominantly based on a single data source (or data from a single author) should at least reach out to the person publishing the data. You should cite the data source. But authorship isn’t justified unless they become much more involved in the subsequent work.

  7. I don’t believe the copyright terms of the data are relevant to the question of authorship. In US Law, copyright covers creative works. Organizations like Dryad use public domain declarations because they consider the data deposited to be “facts” and not “creative works” (recent examples of possibly fraudulent data notwithstanding). Additionally, I believe that authorship practices are more analogous to citation practices (i.e. governed by academic norms) than being governed by copyright. Darwin’s works are out of copyright and now part of the public domain, but that doesn’t mean they can be used without citation.

    On authorship, I believe most journals would not consider only contributing data to meet the grounds for authorship. Thought experiment: can a deceased individual, say Darwin, be my co-author if my analysis depends tightly or exclusively on his data? Most journals require authors to have actual input on the conclusions of the manuscript. I agree with all the comments that suggest data reuse is often hard or impossible without input from those authors (which I believe is largely a reflection of poor metadata practices), but I think it would be inaccurate to suggest that the ‘data collection’ had contributed to authorship. The authorship was earned by the intellectual contribution of the person who explained the correct context or interpretation of the data, not for the act of collecting it. I’m not suggesting that this is how things should be — I think we should give more credit to publication of data and less credit to publication of articles — I’m only suggesting that authorship “for providing data” alone reads to me as a violation of how most authorship policies are stated.

    Lastly, I think an analogy to software products is also instructive. There are certainly software tools (one might even argue, theoretical techniques) out there that are so complex and/or poorly documented that most researchers would have very little chance of using them correctly without the direct involvement of their original authors.

    • “On authorship, I believe most journals would not consider only contributing data to meet the grounds for authorship. ”

      You’re right. And I agree with you, and with those journals–authorship should involve a substantial intellectual contribution to the ms. But as I noted to Mike Mahoney above, many ecologists disagree, at least when “a lot” of data is contributed: https://dynamicecology.wordpress.com/2016/07/28/views-on-authorship-and-author-contribution-statements-poll-results-part-1/

      “I agree with all the comments that suggest data reuse is often hard or impossible without input from those authors (which I believe is largely a reflection of poor metadata practices)”

      Heh. I’ll just be over here munching popcorn while the “you should contact me to understand my data” crowd has it out with the “you should write decent metadata so I don’t have to contact you” crowd. 🙂

      • “Heh. I’ll just be over here munching popcorn while the “you should contact me to understand my data” crowd has it out with the “you should write decent metadata so I don’t have to contact you” crowd”

        Personal experience with a particular study system may not be easy to summarize as metadata… Unless “see my previous 10 papers on the system for details” would be OK…

  8. In general, if the data’s already in a (peer-reviewed in a journal) publication, it can be cited and that’s enough for me; if it’s not, then it’s the first publication of the data, and they should be offered the chance to be an author.

    There’s some discussion above that merely collecting data isn’t enough for authorship, and that’s often a policy of journals. But if it’s data they have, presumably they designed and performed the experiment, didn’t just mechanically collect data, which is much farther along the “can be an author” path – offering an opportunity to participate in the final analysis/writing, and approve the final manuscript, can pretty much get you there, in my eyes.

    Of course, standards vary by field, and I’m an astronomer, where it’s well understood that authors who’re dozens, hundreds, or thousands of places down the author list have contributed significantly less to that particular manuscript. A lot of the standards seem to come from medicine, where people can typically name all their co-authors.

  9. I don’t think co-authorship is an appropriate expectation for openly shared data, but I do expect attribution. That’s my discomfort with Dryad’s CC-0 policy. Are there journals that require use of Dryad specifically, or just open posting of data used to produce the manuscript? PANGAEA, figshare, Mendeley Data and I’m sure others allow CC-BY licenses that at least declare an expectation of attribution.

  10. Pingback: Poll results on co-authorship of papers using publicly available data | Dynamic Ecology
