Poll results on co-authorship of papers using publicly available data

Alternative post title: tell me again what data sharing is for?

Recently I polled y’all on whether providers of data on public repositories such as Dryad are entitled to co-authorship of any future papers that use those data. I was motivated to poll on this because of my sense that both data-sharing and authorship practices in ecology are changing. I’m interested in the interplay of those changes. And whenever practices or norms are in the midst of changing, we can expect substantial disagreement about the changes. Anyway, here are the poll results, along with some commentary.

Sample size and profile of the respondents

We got 280 responses (thanks everyone who responded!). Not a random sample from any well-defined population obviously, not even “the population of readers of Dynamic Ecology.” But still, it’s a larger and probably more representative sample of ecologists than you could get just by reading social media or asking your friends’ opinions, so it seems worth talking about.

This blog has lost grad student readership over the years, so the respondents skewed more heavily senior than respondents to our past polls did. Respondents were 41% faculty, 28% postdocs, and 17% grad students. Also 8.5% non-academic professional ecologists, and 5% other.

We didn’t ask about geography, but in all our past polls the geographic distribution of respondents has closely matched the geographic distribution of our pageviews. So it’s safe to assume that about 50% of respondents are currently based in the US, 10% in Canada, the rest in other countries (primarily but not exclusively European countries, Australia, and Brazil).

Poll results

The poll asked “Do you think that those who collected data shared on public repositories such as Dryad are entitled to co-authorship of papers that use those data?” Respondents were pretty evenly split between those who answered “No” (42%) and those who answered “Maybe/not sure/it depends” (47%). Only 11% answered “Yes”.

More senior respondents were more likely to say “Yes”. 17% of faculty and 13% of non-academic professional ecologists said “yes”, vs. only 8% of postdocs and 4% of grad students. It’s possible that’s just a sampling error blip, but I highly doubt it.


These results indicate substantial yet circumscribed disagreement among ecologists as to the question asked. Substantial disagreement in that there’s no single view that commands a substantial majority of opinion. Rather, the bulk of ecologists are split pretty much down the middle between those who think that data on public repositories are fair game for others to use as they see fit without involving the original data providers, and those who think that there are circumstances in which the original data providers are entitled to co-authorship. Circumscribed disagreement in that few ecologists (and even fewer junior ecologists) think that you’re entitled to co-authorship of any paper based on data you uploaded to a public repository. (So if that is your view, well, whether you’re right or wrong, I think you’re going to need to find some way to reconcile yourself to the fact that most ecologists don’t agree with you.)

I can imagine various hypotheses to explain the gradient of opinion with respect to seniority. Senior ecologists are more likely to have accumulated long-term datasets that will form the basis of many future papers, and so are more likely to worry about being scooped if they have to share those data and give up any claim to co-authorship. Senior ecologists also are more likely to have formed their views on data sharing and authorship during a time when data sharing wasn’t widely encouraged or practiced. A colleague suggested to me that more junior ecologists may be more likely to see their own research programs (and perhaps scientific progress as a whole) as data-limited, and so highly value easy access to (and easy use of) data. Whereas more senior ecologists are more likely to worry about being time-limited rather than data-limited (Meghan has a good post on this). Any other hypotheses?

There may of course be other variables besides seniority that might predict differences in opinion on this issue. I wasn’t sure how to poll on them, plus I wanted to keep the poll short. Comments welcome on this.

The poll results don’t surprise me. I think they resonate with results of our old poll asking about whether providing data originally collected for another purpose is on its own a sufficient contribution to a paper to merit co-authorship. There’s substantial disagreement on that, with the level of disagreement depending on how much data is provided.

The poll results do raise a big question for me, which I hope commenters will help me think through. If you think that providers of data on public repositories are sometimes or always entitled to co-authorship of any paper using the deposited data, what do you think public data sharing is for? Because if you put your data on a public repository, yet others have to get your permission to use the data for their own purposes (at least in some circumstances), isn’t that basically the same as the old system of “I keep my data private; if you want to use my data, email me and we’ll talk”?

I mean this as a sincere question, not a rhetorical one. I tried to answer my own question–what is public data sharing for, if not to make the data available to others to use without having to get permission from the data providers? Here are the answers I came up with, in no particular order. Did I miss any?

  1. Publicly sharing your data is a way of advertising it to potential collaborators. It’s a bit like a company offering a “try before you buy” option on its products. People who have a look at your data and like what they see can then contact you to offer you co-authorship in exchange for your permission to use the data.
  2. Publicly sharing your data (and analysis code, if there is any) allows others to reproduce the results in your papers. This can help them better understand those results, and can reassure them that the results are free of (certain kinds of) errors.
  3. Publicly sharing your data (and analysis code, if there is any) allows others to check for anomalies, particularly anomalies that might indicate fraud. In theory, this increases the probability of detecting anomalies because you can get lots of eyes on the same dataset. Although in practice, public data usually only gets checked for anomalies if there’s some independent reason to suspect anomalies. Public data sharing also increases the speed with which purported anomalies can be investigated, because investigators don’t have to wait around for the author to provide the data. And in theory, public data sharing deters fraudsters, because they’ll know their data might be scrutinized, and that they won’t be able to claim that the data have been lost. Although in practice it’s not clear how many scientific fraudsters are deterred by having to post data publicly. Possibly, it’s only people who would never commit fraud who see public data sharing as a deterrent to fraud. Fraudsters often make choices that look irrational to others.
  4. Publicly sharing your data on a repository might be convenient for you. You can just tell people who have your permission to use the data “go download it from the repository”, rather than having to email it to them.
  5. Publicly sharing your data on a repository might signal your personal commitment to (one form of) “open science”. You might want to signal that commitment as a matter of personal integrity, and/or because you hope that others will observe your behavior and do the same themselves.

To my mind, #1, 4, and 5 are all purely personal reasons for putting your data on a public repository. That’s not to downplay them, it’s just to say that they’re not reasons for any journal or funding agency to encourage or require public data deposition. #2 and 3 are reasons why journals and funding agencies might encourage or require public data sharing. But journals and funding agencies could achieve goals #2 and #3 in other ways besides encouraging or requiring public data deposition. They could for instance require authors to provide their data and code to the journal data editor. The data editor would check it for reproducibility and anomalies, and hold the data in escrow in case any questions were raised about it in future. So I dunno. I’m struggling to understand this, and I’m looking forward to learning from commenters with different perspectives on this.

43 thoughts on “Poll results on co-authorship of papers using publicly available data”

  1. I think #2 and 3 are valid reasons for publishing data. I don’t believe any journal editors can be paid enough to literally check and validate code or data for every paper that comes across their virtual desk (Do any?). I have seen cases where this would have been helpful when the author is claiming a more revolutionary result than might be expected (thus the need for being able to check for error) or some other anomaly was noted.

    Data are hard fought and won. They are the product of uncounted amounts of sweat, blood, and tears. Watching someone slurp up your data, uninvited, would be a bit demoralizing. Certainly, we have all realized that there may be a very interesting and exciting answer lurking in our old data, and we have all scurried back to the books to pull out the old data and look. I can see a lot of reasons for hanging on to it.

    Somehow, I feel that being an ecologist, particularly a field ecologist, means more than whether you can run R code. That new field ecologists might be granted PhDs or tenure based only on others’ data that they have surfed off the repositories gives me pause. A big pause.

    As an evolutionary ecologist, I ask: what mechanism is there that at least ensures that everyone pays into the databases to the same degree that they take out of them? Parasitism would seem to be a newly available academic niche and strategy.

    I much prefer the old fashioned, interpersonal methods of contacting colleagues and negotiating a collaboration for shared data. It builds a lot more than just another quick publication on the spoils of someone else’s toils. But that boat has now sailed. We aren’t going back, but we could do a much better job of managing the future of data sharing.

  2. I am surprised to see the results of the poll. For me, the answer seems an unequivocal “grant authorship to those whose data you use”. My logic is, one: would I give authorship to the person in my lab if s/he were to generate that dataset? Or to the person outside my lab from whom I sought data? My answer to that question would be a ‘yes’. The second reason is: you may want to consult with the person who generated the dataset about whether it is suitable for the hypothesis you are testing. S/he may have something pertinent to say which might not have appeared in the M&M section. Moreover, s/he may give some additional insight into the system and the question which I myself may not have. This is how traditional collaborations worked, and they worked.

    Why this aversion to granting authorship when authorship lists are anyway becoming longer by the day? Authorship credits are often given at the drop of a hat, so why not to the person(s) who have partaken in an important part of the study? If one really wants to be that strict and fair regarding authorship, are we making sure that every author in the author list deserves to be credited? Aren’t a lot of people in many papers given authorship for a lot less?

    Furthermore, if we really go down this path, I believe it will likely devalue the empirical part of eco-evo research even further. Designing experiments, troubleshooting, and collecting data is a dying art because of the tediousness of generating real data and the clerical connotations around data collection. Data collection is already considered the least important part of research work. Many times the task is given to interns or BS/MS students, the implication being that experts like grad students, postdocs and faculty are meant for more superior tasks. This attitude of not giving credit to the data collector seems similar.

    Finally, in tune with the recent burst of paper retractions due to inexplicable/fraudulent datasets, should the collector of data not be held responsible for the data s/he has generated if there were to be questions around it in future? What if the scientist in question in these recent retractions was not a part of those papers? Where then would the buck stop?

    I think for all these reasons, authorship needs to be given to the source.

    • “My logic is, one: would I give authorship to the person in my lab if s/he were to generate that dataset?”

      According to the old poll results linked to in the post, only a minority of ecologists would grant authorship to someone in their lab whose sole contribution was to collect data as directed by someone else. And the authorship policies of many (not all) journals forbid granting authorship to someone whose sole contribution to the paper was to collect data as directed by someone else.

      “Finally, in tune with the recent burst of paper retractions due to inexplicable/fraudulent datasets, should the collector of data not be held responsible for the data s/he has generated if there were to be questions around it in future?”

      They ordinarily *are* held responsible, at least in the most egregious cases: https://dynamicecology.wordpress.com/2020/08/10/what-happens-to-serial-scientific-fraudsters-after-theyre-discovered/. Afraid I don’t quite understand your point here. Are you suggesting that someone who fabricates data should suffer additional penalties if other people use those data (say, in a meta-analysis on which the fraudster is not a co-author)?

  3. I have been involved as a collaborator on several projects that use genuine long-term data sets. The experience leads me to wonder whether what is missing here is a critical look at what “the data” from “the paper” actually are.

      • As an example, consider half a century of intensive mark-recapture data, including a great deal of other measurements, on Antarctic seabirds. A single paper from those researchers is like dipping a toe into a very large lake. What constitutes “the data” that should be made available? I get the impression that some aspects of this issue have a subconscious image of data as consisting of a few measurements of a single thing taken over the course of a typical NSF grant to evaluate a single hypothesis. (Kind of like the view that underlies the IMRaD outline of papers.) I suspect that if the data are huge, multidimensional, addressing a host of issues and hypotheses, the questions become more complicated.

  4. #1 and #2 are both good answers.
    But I think you are ignoring the possibility that “Maybe/not sure/it depends” includes lots of cases where people are very happy to have their data used.
    For example I suggest most are happy if their data goes into some larger synthesis (where their whole dataset is one of dozens being used together to try to understand a more general/global pattern). Support of such analyses is one reason to share your data in an easily usable format. But you might be less comfortable with a study being published solely about your data in which you are not an author and thus not contributing to the interpretation. For example they may publish incorrect conclusions about your system if they don’t really understand the ecology and make biologically unrealistic assumptions in their analysis.

    • “But you might be less comfortable with a study being published solely about your data in which you are not an author”

      Does that ever happen? How often? Honest questions. I feel like that’s extremely rare.

      • It is rare because many people think the person responsible for the data should be a coauthor in these cases. Thus the person responsible for the data usually IS a coauthor. It is not rare that someone writes a paper using previously collected data (and including the data collector as a coauthor).

      • Personal anecdote: I had a group publish a paper in which my data formed 25% of the paper (one of four species analyzed). They 1) didn’t contact me, and 2) published it in a very high impact journal (far higher impact than anything I’ve published in).
        I thought it was great! Frankly, they were testing a hypothesis I would never have tested, so conceptually the paper was theirs, not mine. Sure, I collected that data set, but I also shared it so it could be used by others.

        As Hal was getting at above, I also faced a question of which data to share (I agree with Hal that it’s not always obvious). My raw data was sound recordings, but the foundation of the analysis was “annotations”, i.e. tags demarcating what happened, when, in each recording. In the analysis, the sound recording itself isn’t needed, only the annotations… So I could easily have just shared those as “the data”. Yet, I shared the raw sound recordings as well as the annotations. This was riskier in the sense that it opened up the possibility of being scooped on future papers, but it’s also much more useful to others.

        Overall, I’d say I experienced a net benefit for three reasons: 1) my published work was cited and heavily relied upon in a high impact paper, which helps it get into the “canon”; 2) I learned things about my own study system that I might not have learned otherwise (and I can talk about their results in future talks, etc.) ; 3) I feel good about sharing data, because it’s in the spirit of science.

        I’m not 100% behind the idea that “everyone should be forced to share their data all the time”, but I think this experience made me focus less on the negatives of sharing and more on the positives. I think people treat the act of sharing as a sacrificial act, when sharing useful datasets can actually pay dividends beyond authorship.

  5. Honest question: I want to hear from people who’ve been scooped by putting their data in a public repository.

    By “scooped” I don’t mean “somebody used my data in a way I think is mistaken; they wouldn’t have made mistakes if they’d contacted me first to ask me about the dataset”. And I don’t mean “somebody used my data in a meta-analysis without inviting me on board as a co-author.” I mean “somebody wrote a paper that I possibly/probably/definitely would’ve written myself, based solely or primarily on my data.”

    I ask in part because both my own anecdotal experience, and data from other fields, indicates that getting scooped is both very rare, and doesn’t have nearly as big an effect on one’s career as most people think. See https://dynamicecology.wordpress.com/2011/12/07/on-getting-scooped-in-ecology/ and https://dynamicecology.wordpress.com/2019/11/01/friday-links-eagles-vs-sms-roaming-charges-should-you-worry-about-getting-scooped-and-more/

  6. Missing from this discussion so far are 1) attribution through citation and 2) the extent to which the data are a public rather than a private good.

    Attribution: Any published data should, by definition, be citable with a DOI, and doing so gives credit to the authors that collected the data. I think it is generally good form to consult with the data generators, who will often know more about the data than the user does. In many cases, the data collector can add value to the data in ways that would rise to the level of an invitation for authorship. But I don’t see how it would make sense to obligate coauthorship for using published data that can be cited.

    Public good. In most cases, taxpayers fund data collection. For decades, I have routinely used public data (weather, topography, sat imagery) without inviting the data providers to be authors. For instance, long-term temperature records from NOAA are considered a public good, and this means that I don’t need to put out my own weather stations. Because weather data are collected by an agency, users don’t think about offering NOAA meteorologists co-authorship. But is a dataset on pollinators generated by a faculty member using NSF funds really that different from NOAA data? Both data sets were obtained with public funds, and are not the private property of the individual that collected them. Certainly, the public that supports our work believes that they own the data they fund, not us. We accept their funds in exchange for giving up exclusive control.

    So, if you accept public money to collect data, publish the data and expect to be cited in return. If you use someone’s data, cite them and consider inviting their collaboration. For those who don’t want to share the data they collect, perhaps a career in private industry makes more sense.

    • Personally, I agree with you re: attribution.

      I have mixed feelings about the idea that, if the data were collected using public funding, they should be published in a public repository, with no constraints imposed on their use by others. Your example of NOAA long-term temperature records is a good one. On the other hand, it was long the case that there was no general expectation–not even on the part of NSF–that anyone would share raw data collected using (say) NSF funding. Your obligation to share with the public the results of NSF-funded work was (and still is, I think?) satisfied by providing a mandatory report to NSF summarizing the outcome of your research. So I don’t think that the fact that data collection was publicly funded necessarily implies that the raw data ought to be made public. Any more than it implies that papers based on publicly-funded data collection shouldn’t be published in paywalled journals. I think there are other, stronger arguments for encouraging or mandating data sharing than “the public paid for the data collection”.

      Relevant recent-ish post: https://dynamicecology.wordpress.com/2020/10/29/scientific-fraud-vs-financial-fraud-is-there-a-scientific-equivalent-of-a-market-crime/

      • I would mention two things. One is that the “public good” argument becomes more complicated when the data and the investigators are international. The other is that there is a movement in Europe to require that papers based on publicly-funded research indeed can’t be published in paywalled journals.

      • NOAA temperature records reminded me of the resistance, about 5 years ago, by scientists to releasing communications regarding the pre-processing of climate data. In this case it seemed to be mostly a worry over comments or decisions being taken out of context for political reasons (a corollary of the “I don’t release my data so people don’t misuse it” rationale above). Now I wonder if I am being logically consistent between my staunch advocacy of open science and my sympathy toward the position of the scientists at NOAA.


        It also brings up a question: what is “raw” data? In genetics, “data” can be thought of as a continuum: assembled genomes/contigs, raw reads, individual chromatogram/sequencing files, or the raw electrical signal output. I suppose in many ecological studies, linear measurements or categorical data are pretty clear cut.

    • Regarding the public good and the fact that data collection is usually funded by taxpayers: one issue is whether the parts of data collection that are funded justify the data being fully public. For instance, the funding I received did not include things such as life insurance (but fieldwork is dangerous), and even when I had funding for transport, I used my own car and my own resources to pay for repairs. And I think it’s rather common to use one’s own resources to pay for part of data collection, as there are restrictions on what can or cannot be funded and also because the funding available is not always sufficient.

  7. I recall having two of my large datasets used by others. In the first instance, I was asked if I was okay with them using my data. I had no intention of using them myself, so I said yes, and was acknowledged in the publication. In the second case, I only became aware of the use of my data when I was sent the paper to review for the journal to which it had been submitted. My data constituted the majority of the data used in this study. I was cited, but since they had not asked me if I was okay with them using my data, nor had I been invited to collaborate, I was not acknowledged. As it turned out, their use of my data in conjunction with the smaller amount of their own was logically flawed, as both I and another reviewer (anonymous to me) spotted. They were asked by the editor not to use my data, and in a subsequent iteration of the paper my data were not included. Whether I would have agreed to collaborate or not, if they had asked me, I don’t know.

    The key issue to me is an ethical one. Obviously we are moving towards open access to data, but I would hope that, just out of politeness, those who put in the hard work (proposal writing, huge amounts of field work, perhaps followed by hours of grad student/postdoc/etc lab work) ought to be asked at least whether they were planning, or even currently undertaking, a similar use of the data, or perhaps even had a paper submitted or accepted that did so. And ideally, whether the original person who generated the data was comfortable with the use of their data, and whether they would care to collaborate. As there was more than one person (six in all) who generated the data in this second case – the first author was a grad student – ideally all authors of a study should be asked if it would be okay for the data to be used (not necessarily all would feel the need to collaborate).

    So I would fall in the middle camp – it may be okay to use my data (if I am not planning to use them myself), but it would be ethically appropriate to ask before simply going ahead. Full disclosure – I am a university researcher in the ‘senior’ category.

    • I was in the “maybe” category on this because of ambivalence, but I think from your story I have found a convention that works for me: the poster of the data used need not be a co-author, but if not they really should be a reviewer.

  8. It’s probably worth pointing out that your poll went beyond the ecologist community. For example, I’m a civil engineer and heard about your poll through the Cambridge University Data Champions mailing list! This might be an extra caveat for the interpretation of this straw poll, though the participant pool is probably still mostly limited to ecologists and others who are interested in data reuse.

    • Yes, we always get a modest number of non-ecologists responding to our polls. They’re always a small enough fraction of respondents that removing them wouldn’t change the results much.

  9. This result exemplifies something much deeper–and yet more mundane–than the intellectual property and use-rights of datasets. It concerns all academic careers and has been a given reality since possibly the dawn of publicly-funded research: the capacity to pursue our research goals and keep our very livelihoods depends on how institutions evaluate our CVs, and particularly on how many (and how popular) our publications are. I see this as the main factor behind most rationalisations about why one should be protective of datasets and thus claim/push for co-authorship when data is re-used. From an individual’s perspective, that might make the difference when applying for the next position/grant/etc.

    That said, I believe that our job as researchers is NOT to write and publish papers but, in fact, to do the research. Doing research, whatever its nature, means to search for knowledge, answers (or better questions) on a subject that you are utterly and personally invested in. Part of progressing this goal collectively is to share your experience and findings with as many people and through all channels as possible. Papers should not be an end but only one of the means through which researchers communicate with each other and the public.

    I agree with Kevin Lafferty’s point about public accountability of publicly-funded research, which is the basis for all open access policies. Moreover, the “no” answer to the co-authorship conundrum is also supported from the strictly literary/editorial perspective: one should not be considered an author if one did not author the article which cites a published dataset, the same way one would cite Darwin in a textbook on evolution, not add him as co-author.

    Last, I believe it is worth remembering that it is the current scenario we live in that exacerbates competition: a multimillion-dollar scientific publishing sector, escalation of the academic degrees and certifications required for jobs in academia, the growing mass of PhD and postdoctoral researchers working under precarious contracts, hours poured by the best among us into administrative procedures and self-promotion to get funded, etc. We must remind ourselves and our institutions of our true shared mission as researchers and act upon it, even though it might cost us individually in terms of competitive disadvantage in the job or funding market. Funders (and we ourselves, as peer evaluators) will not bother to give credit to data generation, stewardship, or any other scientific activity that is not directed toward a paper (such as scientific blogging) if everyone is desperately collecting publications through incidental co-authorship. Some of us need to stop in order to change it. This is of course easier said than done. As a non-tenured researcher myself, I feel this is the great daily challenge we face as a generation.

  10. Sorry Jeremy, just wanted to let you know that there may have been some non-ecologists who have crashed the party. I forwarded this poll and the results to the Cambridge data champions (https://www.data.cam.ac.uk/intro-data-champions) because I thought it was of interest to us.
    Most of the participants seem to come from elsewhere, according to your entry here, but there may still be a slight bias because of this (sorry!)

  11. I would make more of an economics argument. The more valuable publications are to an individual researcher, the less likely they will be to want to lose credit through data sharing. So instead of a split along seniority lines, it would split along percent research appointment or perceived dollar value per publication.

  12. Very interesting discussion in the comments. I would like to add the following point. I am a modeller, and in my lab we had a discussion about models and co-authorship. I think there is a good parallel to make here. If someone uses a model from someone else for their own research, no one expects them to add that person as a co-author, or even to ask them permission. The model is published and accessible to anyone, despite the authors having put a lot of thought, effort, and funds into it. The model can be misused, like a dataset could. The authors could do more with it, like test a different hypothesis, analyze it in a different context, … But if someone does it before them, everyone will probably think it is fair game in most cases. Moreover, restricting access to a model is almost impossible, as it should be described in detail in the Materials and Methods in order for the paper to be reproducible. And I think all of this is ok: the authors get credit for their model, but not for work done with it that they had nothing to do with. Why should it be different for a dataset?

  13. To expand on a point Carl Boettiger made on the posting of the poll: we are all going to be dead, and quite soon in the big scheme of things. Don’t we want our data to be used after we die? And doesn’t it seem absurd to expect post-mortem coauthorship? Doesn’t it also seem absurd to make our expectations of data-use etiquette conditional on whether the data collector is still alive or not? Beyond the taxpayer funding issue, I would argue that it is our responsibility to the collective enterprise of science to make our data available to others, within some reasonable time frame, along with sufficient metadata to minimize the risk of data misuse.

    • I have slightly mixed feelings about this argument. On the one hand, yeah, I too feel the same professional responsibility you articulate. On the other hand, nobody wants to use my data right now as best I can tell! And after I die, the odds that anyone will want to use it are only going to decrease further, surely. I can hardly be alone in this. Most people’s data just doesn’t matter much to “the collective enterprise of science”. That’s why even quite high-profile papers that turn out to have been based on bad data can sit in the literature for years before being retracted, without those retractions undermining any other work besides that of the authors of the now-retracted papers (https://dynamicecology.wordpress.com/2020/10/05/how-much-damage-do-retracted-papers-do-to-science-before-theyre-retracted-and-to-who/).

      All of which I guess is a long-winded way of saying that the professional obligation I feel to make my data available in some reasonably usable form, within some reasonable time frame, is an obligation to do something that’s almost certainly of little benefit to either me or science as a whole. But it’s also little cost to me. And I do think science as a whole would be appreciably worse off if *nobody* shared their data.

  14. The people who believe authorship is always warranted in exchange for their data would never put it in a public repository if it wasn’t required by funders or journals, so the reasons you list probably aren’t relevant to them. I think it’s astounding that ecologists funded by taxpayers to do their work, even in some cases government employees, feel that the data they collect in these activities are their personal property. Of course, as Kevin Lafferty says, attribution is important, and there is a grace period for publishing. In my experience, the in-between people often think that their data are going to be misused or misinterpreted somehow. That’s possible, but it sometimes suggests that they are worried that conclusions they’ve drawn from the data will be challenged. That’s healthy. In my opinion, data ought to speak for themselves, without needing expert opinion to interpret them, which may be biased toward a specific view.

    • Re: worrying that your data will be misinterpreted or otherwise abused by others: that’s the others’ problem, not yours. At least, that’s my view.

      I speak as someone who developed an analytical method that subsequently was badly misused in an Ecology paper. I’ve also been badly miscited a few times–people citing papers of mine in support of claims that my papers in no way support. I find that annoying–but no more annoying than I find any other mistaken bit of science that doesn’t involve “my” data or methods. I wasn’t asked to review any of the papers in question, so it’s not my responsibility if someone else abused a method of mine, or miscited my work. Mistakes happen in science! The fact that someone used a method I developed, or miscited me, without first seeking my guidance may have been unwise on their part (though it’s not as if “seek guidance from me” is either necessary or sufficient to avoid mistaken applications of my methods). But people sometimes make unwise choices, me very much included. That’s just life! I don’t see why I should expect, demand, or even hope that other people will check with me first before using “my” data or methods, lest they use them unwisely. Or why they should check with me before citing me, lest they miscite me.

      • I totally agree. A different view can be found here:

        I would argue that this is just normal healthy scientific exchange. No one should be obligated to check with data providers to make sure they understand all the subtleties the providers think they should, IMO. To do so is to stifle science and it will become more and more untenable as data accumulates over time.

      • To clarify, I do think it’s my responsibility, when sharing data, to provide sufficient metadata that my data is reasonably interpretable and usable by others. If I don’t do that, I haven’t really shared my data at all. And yes, there’s always some scope for disagreement as to exactly how much or what sort of metadata is “sufficient”. But if I’ve done my bit and provided decent metadata, I don’t see how it’s my problem if others still misinterpret or misuse my data.

  15. I am from a very different field (neuroscience) and am not a scientist, but I would say the most basic thing that should be done is very clear recognition of where the data came from. If the data are the sole basis of the paper, that recognition needs to be right at the very top. If not, then in the acknowledgements and bibliography. But even then, contact the body/institution providing or hosting the data (if it is not on a centralised repository), or the individual listed on the repository, to let them know something is coming out using their work. And provide them with the AAM as well as the finalised version, for their own REF compliance/promotion/assessment.

  16. I am not an ecologist. I am a health sociologist (now retired). I have advocated for a very long time that all research data should be saved and archived, particularly if it has been paid for by public funding. Then they can be shared, and the conditions of sharing can be negotiated, though one hopes that they will be shared widely. There are numerous reasons for this, including public good (if publicly funded), plus the usual science integrity arguments. Also, in the health area there are many meta-analyses where access to multiple data sets is of immeasurable benefit to the research community and the beneficiaries of that work (e.g. patients). But the problem is that very few people routinely save, archive and share their research data. So, for me, one of the strongest arguments for allowing some kind of acknowledgement, and potentially co-authorship, is actually to increase the incentive to share data. Otherwise, why would you do it? I mean, you have sweated blood to get a grant, collect the data, and analyse it; why allow others to access those data without at least some acknowledgement, and possibly co-authorship? There is also the issue that such data sets are not usually straightforwardly understood, so people accessing them will need your help in working around the inevitable pitfalls in using the data. I also think data producers should gain something like a citation credit for the use of their data, so that the creativity, imagination, and effort of the originator is acknowledged. That’s the least they can expect. Again, otherwise why share, if your efforts are not recognised?
Increasingly, funders are requiring data producers to share their data as a condition of funding (after initial publication of resultant work), and this should be encouraged. But there is no reason why the originators of the data should not receive some kind of acknowledgement, citation credit, and possibly co-authorship (depending on the amount of contribution they make to any publication).

  17. Hi all,

    I’m a long time reader, and very interested in this topic as Editor in Chief of Ecological Applications, a journal with a strong open data policy.

    I have many thoughts, many of them already stated above, but my one overriding comment, as the editor of an applied journal (which nonetheless publishes some very fundamental science in “Pasteur’s Quadrant” (https://en.wikipedia.org/wiki/Pasteur%27s_quadrant), research that is both fundamental and rapidly applicable), is that the discussion has so far raised only issues within “the academy”. The Ecological Applications policy is motivated by the real-world significance of much of what ecologists do. When scientific results are used to support management, policy, and decisionmaking, having the results be freely and easily available is now essential. It is often essential for all parties to have access to the data, and to be able to convince themselves of the result, or not. Sometimes this is adversarial; sometimes the decisionmaker needs to ask a question of the data that the author didn’t think of, or couldn’t fit in. If the data are unavailable except on request, or the author is for some reason no longer available, a citable home for the data is crucial. Doing this as a matter of course greatly increases the credibility of the science and the author; not doing it, the converse. This is not hypothetical, it happens a lot.

    I also have a foot in the climate community. The long-term unavailability of the raw data documenting the warming trend in surface observations, which reflected not only the compilers’ attitudes but also those of the met services that contributed data on the condition that the raw data not be shared, contributed significantly to the contentious debate around those data. It reduced the credibility and status of climate scientists, who were perceived, rightly or wrongly, as being more worried about their academic standing than the well-being of the world, or as hiding things.

    Ecological Applications and the other ESA journals grant many exceptions to open data, for human subjects, endangered species, commercial fish landings, but the argument that “I have three more papers” never washes. And I agree with the comments above that scooping is incredibly rare. See various editorials over the years, and Powers et al. (Powers SM, Hampton SE. Open science, reproducibility, and transparency in ecology. Ecological Applications. 2019;29(1):e01822).

    Also, many of the most used data sets now have dozens if not hundreds of authors. We need innovative ways to give credit for providing data! I encourage publication of data papers for certain types of data sets, so they are fully citable, but this doesn’t satisfy all situations.

    I don’t disagree with any of the other arguments: data paid for with public funds, the value of synthesis, the need for reproducibility. But to that I’d add: our work matters, and not just within the academy.

    I personally have worked within NASA Earth Science’s open data policy since 1985, including for field and experimental work, not just remote sensing, and it has opened many more doors than it has shut.


    • Thank you for your comments David. I agree with you that the conversation about this post has been very academia-focused so far. Your comments broaden it in interesting and important ways. Very good points.

      • Thanks, Jeremy! I always find your posts interesting and useful, and this is a big subject! I think we all need to take our work more seriously! It matters. Not that credit and reward don’t matter too; without them the whole machine would grind to a halt. But what we do, even quite theoretical-seeming work, really matters! In my group (I’m the equivalent of a department chair at JPL), we give explicit credit in promotion for data, not just papers, and you know, it’s not so hard; no harder than counting papers and subjectively assessing their significance, maybe easier. If that were more widespread, it would reduce much of the discomfort.

  18. Hi Jeremy, interesting results! Would you be interested in re-running the survey in another field (neuroscience/neuroinformatics)? I could help with distribution and nagging people to respond.
