As I’ve written about recently, I’ve been trying to think more about how my lab archives and shares data and code. After getting a little bit of experience with GitHub at a Software Carpentry workshop in January, I decided to try using it to archive the data and code for a paper that was just accepted this morning. (Hooray!) I figured I’d share the experience here, especially since the only thing that saved me from giving up on the whole thing was a well-timed tweet.
Prior to submitting the manuscript the first time, I wanted to upload the data and code to GitHub, so that reviewers could access it if they wished. I made sure the code was reasonably well annotated. (Well, I think it is. I have no real training in this, so it’s possible I’m wrong.) I organized the data files and code in a single folder on my computer, and I created a text file with the metadata. (Again, I’m not sure whether my metadata file is up to par, but I think it gives the important information.) This whole process was useful just for making me go through everything one last time, making sure everything looked good. To me, that incentive to go through everything with a fine-toothed comb one more time is one of the biggest benefits of sharing data and code.
Having made sure I had all the files in order, I then went to GitHub and created a repository. So far, so good. Next, I went to RStudio and opened a new project. I then proceeded to bash my head against the wall for a while. Not literally, of course. But I couldn’t get GitHub and RStudio to talk to each other. It was incredibly frustrating. After a bit of head-bashing, I took to Twitter to ask for help. People offered lots of suggestions, which I appreciated, but I still wasn’t getting anywhere. Fortunately, Pat Schloss then replied with a link to this assignment from the course he teaches. Once I followed those directions, I got GitHub and RStudio talking. I was then able to stage and commit the files. Success! My experience with GitHub wasn’t as frustrating as Emilio Bruna’s was (thanks to Pat’s well-timed tweet), but I was also left with the feeling that this probably isn’t going to become mainstream in ecology unless the whole process becomes a lot less intimidating.
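For readers who would rather see the stage, commit, and push steps spelled out than click through RStudio's Git pane, here is a minimal Python sketch of the same sequence. It is only an illustration under stated assumptions: the file names, the commit message, the remote name `origin`, and the branch name `master` are all hypothetical, and it assumes git is installed and the folder is already a clone of the GitHub repository. It is not the procedure from Pat Schloss's assignment.

```python
import subprocess


def git_commands(files, message, remote="origin", branch="master"):
    """Return the git commands that stage, commit, and push the given files.

    The remote and branch defaults are assumptions; newer repositories
    often use "main" instead of "master".
    """
    return [
        ["git", "add", *files],            # stage the files
        ["git", "commit", "-m", message],  # record them in the local history
        ["git", "push", remote, branch],   # send the commit to GitHub
    ]


def run_all(commands, repo_dir):
    """Run each command inside repo_dir, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, cwd=repo_dir, check=True)


if __name__ == "__main__":
    # Hypothetical file names and message, for illustration only.
    cmds = git_commands(
        ["metadata.txt", "analysis.R"],
        "Add data and code for manuscript",
    )
    # Print the commands rather than running them, so nothing is changed.
    for cmd in cmds:
        print(" ".join(cmd))
```

To actually run the commands, call `run_all(cmds, "path/to/your/repository")` from inside a cloned repository. Keeping the command list separate from the code that runs it makes it easy to check what will happen before anything is pushed.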
One thing I wasn’t sure of was whether the reviewers would be worried about accessing a GitHub repository – I wasn’t sure if they would think I could see who they were if they accessed it. So, I added this to the Acknowledgments:
Data and code related to the field survey and life table studies can be found at https://github.com/duffymeg/BroodParasiteDescription. (For reviewers, note that it is possible to look at and pull files from github anonymously.)
Fast forward about a month. (It was a fast review process!) There were three reviewers on the manuscript. None of them mentioned anything about the data or code, so I have no idea if they looked at it. When I look now, it says the repository has had 33 views from 5 unique visitors in the past two weeks. So, some people looked at it, but that may have been in response to my tweeting about it. That seems especially likely given the big pulse in views on 5/20.
Now on to editing: One of the reviewer comments prompted me to report summary statistics a little differently. I modified the code to do that, and then pushed the modified code to GitHub. That part went totally smoothly. Hooray!
The last thing I wanted to do was to get a DOI for the repository, so that the data would be citable. To do that, I followed this guide, which was really straightforward. So, now my data and code have a DOI. I updated the Appendix in the revised version of the manuscript to say:
Data and code related to the field survey and life table studies can be found at http://dx.doi.org/10.5281/zenodo.17804
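As a small aside on how a DOI like this one is put together, the sketch below splits it into its prefix and suffix and rewrites a `dx.doi.org` link into the `https://doi.org/` form that the DOI resolver now prefers. The function names are hypothetical and the code does nothing beyond string handling; it does not contact Zenodo or check that the DOI resolves.

```python
def bare_doi(text):
    """Strip a resolver address or 'doi:' label, leaving just the DOI."""
    text = text.strip()
    known_starts = (
        "https://doi.org/",
        "http://doi.org/",
        "https://dx.doi.org/",
        "http://dx.doi.org/",
        "doi:",
    )
    for start in known_starts:
        if text.lower().startswith(start):
            return text[len(start):]
    return text


def doi_to_url(text):
    """Return the https://doi.org/ link for a DOI or an older DOI link."""
    return "https://doi.org/" + bare_doi(text)


def split_doi(text):
    """Split a DOI into (prefix, suffix) at the first slash."""
    prefix, suffix = bare_doi(text).split("/", 1)
    return prefix, suffix


if __name__ == "__main__":
    link = "http://dx.doi.org/10.5281/zenodo.17804"
    print(doi_to_url(link))  # https://doi.org/10.5281/zenodo.17804
    print(split_doi(link))   # ('10.5281', 'zenodo.17804')
```

The prefix identifies the organization that registered the DOI (here, Zenodo) and the suffix identifies the specific archived record.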
One thing I haven’t done yet is add a license to the GitHub repository, mainly because I don’t know how to do that or what license to use, and I haven’t invested the time in figuring that out. I’m not entirely sure what that means right now in terms of other people’s ability to use the data – I should probably figure that out, too. (Links/info welcome in the comments!)
So, overall, I’d say my experience went reasonably well, once I had the magic tutorial from Pat Schloss on how to get RStudio and GitHub talking. (I may well have given up if I hadn’t received that.) If I hadn’t taken a Software Carpentry workshop, I think I still would have found GitHub too intimidating. And I still haven’t quite done everything I should have done, given that I haven’t added a license. But, overall, I’m optimistic that my lab will use GitHub to share data and code for manuscripts in the future.