My first experience with GitHub for sharing data and code

As I’ve written about recently, I’ve been trying to think more about how my lab archives and shares data and code. After getting a little bit of experience with GitHub at a Software Carpentry workshop in January, I decided to try using that to archive data and code related to a paper that was just accepted this morning. (Hooray!) I figured I’d share the experience here, especially since the only thing that saved me from giving up on the whole experience was a well-timed tweet.

Prior to submitting the manuscript the first time, I wanted to upload the data and code to GitHub, so that reviewers could access it if they wished. I made sure the code was reasonably well annotated. (Well, I think it is. I have no real training in this, so it’s possible I’m wrong.) I organized the data files and code in a single folder on my computer, and I created a text file with the metadata. (Again, I’m not sure whether my metadata file is up to par, but I think it gives the important information.) This whole process was useful just for making me go through everything one last time, making sure everything looked good. To me, that incentive to go through everything with a fine-toothed comb one more time is one of the biggest benefits of sharing data and code.

Having made sure I had all the files in order, I then went to GitHub and created a repository. So far, so good. Next, I went to RStudio and opened a new project. I then proceeded to bash my head against the wall for a while. Not literally, of course. But I couldn’t get GitHub and RStudio to talk with one another. It was incredibly frustrating. After a bit of head-bashing, I took to Twitter to ask for help. People offered lots of suggestions, which I appreciated, but I still wasn’t getting anywhere. Fortunately, Pat Schloss then replied with a link to this assignment from the course he teaches. Once I followed those directions, I got GitHub and RStudio talking. I was then able to stage and commit the files. Success! My experience with GitHub wasn’t as frustrating as Emilio Bruna’s was (thanks to Pat’s well-timed tweet), but I was also left with the feeling that this probably wasn’t going to become mainstream in ecology unless the whole process seemed a lot less intimidating.

One thing I wasn’t sure of was whether the reviewers would be worried about accessing a GitHub repository – I wasn’t sure if they would think I could see who they were if they accessed it. So, I added this to the Acknowledgments:

Data and code related to the field survey and life table studies can be found at https://github.com/duffymeg/BroodParasiteDescription. (For reviewers, note that it is possible to look at and pull files from github anonymously.)

Fast forward about a month. (It was a fast review process!) There were three reviewers on the manuscript. None of them mentioned anything about the data or code, so I have no idea if they looked at it. When I look now, it says the repository has had 33 views from 5 unique visitors in the past two weeks. So, some people looked at it, but that may have been in response to me tweeting about it. That seems especially likely given the big pulse on 5/20:

[Screenshot: GitHub traffic graph for the repository, showing the spike in views on 5/20]

Now on to editing: One of the reviewer comments prompted me to report summary statistics a little differently. I modified the code to do that, and then pushed the modified code to GitHub. That part went totally smoothly. Hooray!

The last thing I wanted to do was to get a DOI for the repository, so that the data would be citable. To do that, I followed this guide, which was really straightforward. So, now my data and code have a DOI. I updated the Appendix in the revised version of the manuscript to say:

Data and code related to the field survey and life table studies can be found at http://dx.doi.org/10.5281/zenodo.17804

One thing I haven’t done yet is add a license to the GitHub repository, mainly because I don’t know how to do that or what license to use, and I haven’t invested the time in figuring that out. I’m not entirely sure what the lack of a license means right now in terms of other people’s ability to use the data. I should probably figure that out, too. (Links/info welcome in the comments!)

So, overall, I’d say my experience went reasonably well, once I had the magic tutorial from Pat Schloss on how to get RStudio and GitHub talking. (I may well have given up if I hadn’t received that.) If I hadn’t taken a Software Carpentry workshop, I think I still would have found GitHub too intimidating. And I still haven’t quite done everything I should have done, given that I haven’t added a license. But, overall, I’m optimistic that my lab will use GitHub to share data and code for manuscripts in the future.

12 thoughts on “My first experience with GitHub for sharing data and code”

  1. Congrats on the acceptance (your week has gone better than mine) and congrats on pushing the code and data to GitHub! I use it almost daily (for ~ 1 year) and am not sure where I’d be without it now.

    Couple of suggestions.

    1.) Learn some basic git for the command line. I still use the RStudio Git integration, but I also use the command line just as much. Learning some of the basics (pull, push, add, commit) can help a lot when RStudio isn’t working the way you want. A good place to start is https://try.github.io/levels/1/challenges/1 (HT to the Software Carpentry mailing list)

    2.) Licenses are tricky and I won’t pretend to understand them all. What we use (I’m a fed and we can’t actually “license” our work) is Creative Commons Zero (CC0). It essentially says you can do what you want with the code, data, ideas, etc. without any restrictions. Most Creative Commons licenses are not appropriate for code, but CC0 is. Another option with almost no restrictions is the MIT License. Both of these are easy to add to a repo. You can choose a license when you create the repo or add it later. Look at https://help.github.com/articles/open-source-licensing/#how-can-i-go-back-through-my-public-repositories-and-give-them-licenses for details. Another good place to read up on this is Karl Broman’s page on licensing R packages at http://kbroman.org/pkg_primer/pages/licenses.html. It has info relevant beyond packages. And a last word on licenses: if you have questions, I am sure your institution’s legal staff would be willing to take a look. I had to do this before assigning licenses to my work. It was a useful learning process for me.

    And the offer still stands… Questions about R or GitHub, I am willing to help (I missed your tweet).
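    As a rough illustration of the basics named in point 1.) above (pull, add, commit, push), here is a minimal sketch. It is written in Python only so that the sequence can be scripted and checked; at a terminal you would type the same git commands directly. The file name and commit message in the usage note are hypothetical, the remote name `origin` is git's conventional default, and the branch name `master` is taken from the repository URL quoted further down this thread.

```python
import subprocess


def git_commands(files, message, remote="origin", branch="master"):
    """Build a basic pull / add / commit / push sequence as argument lists.

    Nothing is executed here; the function only returns the commands.
    """
    return [
        ["git", "pull", remote, branch],   # fetch and merge changes from GitHub
        ["git", "add", *files],            # stage the files you edited
        ["git", "commit", "-m", message],  # record the staged changes locally
        ["git", "push", remote, branch],   # send the new commit to GitHub
    ]


def run_all(commands, repo_dir="."):
    """Run each command inside repo_dir, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, cwd=repo_dir, check=True)
```

    For example, `run_all(git_commands(["analysis.R"], "Report summary statistics differently"))` would run the four commands in the current directory, assuming git is installed and the directory is already a clone of the GitHub repository.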

  2. Timothee Poisot just led a small workshop at CSEE on sharing code. He suggested http://choosealicense.com/ as a helpful site and said the MIT license was one of the more commonly used ones for scientific code. I hope that is helpful.

    I completely agree about GitHub. It is a great tool but it was a frustrating experience when I learned how to use it. That Pat Schloss link seems like a really good tutorial though.

  3. Maybe it’s worth adding a comment about the fact that a commercial website like GitHub isn’t a permanent repository for your repositories. We all know that eventually GitHub will go out of business and their repositories could disappear, making the link in the published paper useless. Hopefully your institution or library provides some kind of truly permanent online archive for data and code as they appear in the final version of the paper that can be referenced in the published version. GitHub is a convenient place for the working copies, and serves as a secondary backup, but it shouldn’t be treated as a permanent archive for the finalized published data and code in future decades.

    • GitHub isn’t the permanent repository. Zenodo is; it’s the issuer of the DOI and they archive a snapshot of the GitHub code.

  4. Interesting about GitHub and RStudio. I don’t use RStudio, but you should know that you can use GitHub directly to archive your R scripts (like I do), without needing it to talk to RStudio. (And without needing to use command line git, which has its own learning curve. GitHub provides a nice GUI interface for both Windows and Mac.)

    Conversely, I’m glad to see that you had no trouble integrating with Zenodo. I just tried to use Zenodo for the first time to archive a GitHub repository a few weeks ago, and ran into all sorts of problems. I eventually gave up, because you can use Zenodo OR FigShare to create DOIs for GitHub repositories. I used FigShare and it was straightforward.
    https://mozillascience.github.io/code-research-object/

    (However, I later found 7 bugs in one attempt to use FigShare to archive another repository, so it’s not necessarily any better than Zenodo. Both organizations have been receptive to my feedback. But I think, on balance, I’ll give Zenodo another shot because it’s not-for-profit, unlike FigShare.)

    As for licenses, you’re supposed to add a license *before* you publish (i.e. get a DOI). Think about if you published a book and then later decided you wanted a particular type of copyright. At the very least, a license holds you not liable if anyone else uses your data or code; you should pick something rather than nothing. As mentioned above, http://choosealicense.com/ is useful for choosing which one. And also, as mentioned above, I’ve seen it encouraged that CC0 be used for *data*, but I wouldn’t recommend it for code. (And I don’t even like using it for data.) Code is copyrightable; actually there are some fascinating legal cases about the status of computer code. Worth a blog post in its own right. If you want to require that someone give you credit for using your data or code (whether modified or not), but don’t care much about anything else, you should probably just use the MIT license.

    You might also think about archiving data and code separately. I realize that in this case the code is tightly coupled to the data, so it makes some sense to archive them together. But I see them as different sorts of entities. I wonder what other people think about this.

    Another comment, looking through your repository: your metadata.txt file uses rich text instead of plain text, which might cause all those quotation marks to become weird characters in some cases. See, for example,
    https://raw.githubusercontent.com/duffymeg/BroodParasiteDescription/master/Metadata.txt
    The content of your metadata.txt file is good. You describe each data file and each column in each data file. I haven’t read your paper (obviously), but I have a decent idea of what the data is that you collected.
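    One way to tidy up the curly quotation marks mentioned above is to replace them with plain ASCII characters before committing the file. The following is only a minimal sketch: the function names and the character table are made up for illustration, and it assumes the file can be read as UTF-8, which may not be true of a file saved by a rich-text editor.

```python
# Hypothetical mapping from common "smart" punctuation to plain ASCII.
SMART_TO_PLAIN = {
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark (also used as apostrophe)
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u2013": "-",  # en dash
}


def to_plain_text(text):
    """Return text with smart punctuation replaced by ASCII equivalents."""
    return text.translate(str.maketrans(SMART_TO_PLAIN))


def clean_file(path, encoding="utf-8"):
    """Rewrite the file at path in place with plain punctuation.

    Assumes the file is readable with the given encoding.
    """
    with open(path, encoding=encoding) as handle:
        cleaned = to_plain_text(handle.read())
    with open(path, "w", encoding=encoding) as handle:
        handle.write(cleaned)
```

    Calling `clean_file("Metadata.txt")` would then rewrite the metadata file in place. Saving the file as plain text from the editor in the first place avoids the problem entirely.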

    And your R code I would describe as “reasonably” commented. (And I do have training!) You could comment more to make it easier for everyone to understand each step, but it’s clear enough what each section of code does. In particular, I imagine that it’s quite clear what your code does once you’ve read the manuscript! I personally would divide that R file up into smaller files, each with its own analysis. But that’s due to my coding style trained in the software engineering world where you develop huge amounts of software and putting it all in one file would be crazy. I suspect that many ecologists just dump everything in one R file.

    Also, I think you’ve surmounted the learning curve, so I think it will be pretty easy for you to use a similar process in the future without much headache. Congrats!

    • Great post Meg! I found it very helpful. Git is something I am trying to tackle now – I would love to have it part of my workflow, both for the version control and then later for sharing code.

      Margaret, I would love to hear more about your workflow. I found RStudio to be a nice GUI, especially when I was first learning R, but currently I prefer not to use it, so I am really interested in your process. What text editor do you use? Where do you recommend archiving data?

  5. First of all, great post, and I’m glad to see you taking the leap and embracing these nontrivial new technologies.

    A suggestion I have is to put your code and data in a git repository from the start of the project. This will give you all the benefits of version control in case you ever need to go back to some older version.

    This doesn’t mean that your code and data need to be public from the start. You can create private repositories on GitHub with a paid account. Alternatively, you can create an organization account for your lab and request a free plan at https://education.github.com . You’ll get 5 private repositories for free. Once the paper is submitted you can open source the repository and have it accessible to reviewers. This is the approach we’ve taken at my lab, http://www.pinga-lab.org . Most papers have a git repository with code, data, and the LaTeX source files.

    There are many groups experimenting with these things and there is no real standard yet. Looking forward to reading more of your experiences on this!

  6. Cool post! One quick addendum to the “frustration-rant-post” of mine you linked to. That wasn’t actually my first experience with GitHub; I’d already been using it for a while to store and version-control my code and frozen versions with Zenodo for papers (though not much beyond that). I’d been doing this mainly within RStudio, and for some reason on one particular day I couldn’t push anymore, and couldn’t figure out why even with help. THAT’S what frustrated me over the edge. As soon as the advice becomes “go to the command line and type [complicated git statements]”, I think many, many users will drop out. But some will go and learn how to use Sourcetree, which is on my to-do list for the summer.

    Don’t forget to add to the list of Git Resources for n00bs! http://brunalab.org/blog/2014/08/07/resources-for-github-n00bs/

  7. To those who are just beginning… I learned the basics of git and GitHub from the following YouTube intros. I found that learning the background terminology and processes by using the command line helped me understand how it all works. In an afternoon you should be able to handle your own repo, while committing a few days got me to the point where I could collaborate with others on the same code and resolve conflicts. Hope these help!

  8. Pingback: resources for @GitHub n00bs (small list, please add)
