Your primary collaborator is yourself 6 months from now, and your past self doesn’t answer emails
The quote above*, which is commonly used in Software Carpentry workshops, is a great, succinct way of reminding people to annotate code well, and to give thought to how they organize (and name!) data files and code. It is similar to something I emphasize to everyone in my lab regarding lab notebooks: write things down in so much detail that it pains you, because, in six months, you will have completely forgotten** all the things that seem totally obvious now. But it’s probably not something I’ve emphasized enough when it comes to analyses, and I should fix that. For me, the biggest unanticipated bonus of my shift to R has been how much easier it has made redoing analyses and remaking figures.
As I whine about on Twitter periodically, I have slowly shifted from using Systat and SAS as my primary stats and graphics tools to using R. The main motivation for this shift was that it seemed obvious to me that my students should learn R, given that it is powerful and open source. My students working in R meant that I felt like I needed to learn R, too. It’s been a sloooooow process for me to shift to R (it is so frustrating to learn how to do something in R when I know I could have the results after 5 minutes in SAS!), but I finally view R as my default stats program. When I first made the shift, I mainly just wanted to get to the point where I could do the same things in R that I could already do in Systat and SAS. But I now see that a huge advantage of the shift is that my analyses and figures are much more easily reproduced in R.
Prior to shifting to R, my basic approach was to enter data in Excel, import the data into Systat, and then use the command line to do a bunch of manipulations there. I generally viewed those as one-off manipulations (e.g., calculating log density), and, while I could have saved a command file for those things, I didn’t. This meant that, if I discovered an error in the original Excel file, I needed to redo all of that manually. For making figures, I would again use the command line in Systat; in this case, I would save those command files. I would then paste the figure, and the command file I had used to make it, into Powerpoint, and would get the figure to publication quality in Powerpoint. (For example, the tick marks on Systat figures never show up as straight once the figure has been exported, so I would manually draw a new tick mark over the old one so that it appeared straight. That sort of thing was clearly really tedious.) For analyses, if the analysis was really straightforward (e.g., a correlation), I would do it in Systat (again, usually saving the command file). But if the analysis was more complicated, I would go to SAS, which meant importing the Excel file there and doing the analyses in SAS. I would then paste the output of those analyses into a Word file, along with the code (which I also saved separately).
Overall, on a scale from completely hopeless to completely reproducible, my analyses were somewhere in the middle. I at least had the command files (making things more reproducible than if I had used the GUI to do everything), but I would end up with a folder full of a whole ton of different command files and different Systat data files plus the original Excel files, with some results in a Word file, some in a Powerpoint file, and some just left as output in Systat. And, if I later needed to change something (or if a reviewer asked for a change in an analysis), it took a lot of effort to figure out which command and data files were the relevant ones, and I would have to go back through and manually redo a whole lot of the work. One paper of mine has a ton of figures in the supplement. I no longer recall exactly what change a reviewer wanted, but that change meant that I had to remake all of them. It took days. And, yes, surely some of this could have been improved if I’d come up with a better workflow for those programs, but it certainly wasn’t something that arose naturally for me.
Now, with R, I can much more easily reproduce my analyses. I think I do a pretty good job of annotating my code so that I can figure out in the future what I was doing (and so that others who are looking at the code can figure out what I was doing). I recently was doing a long series of analyses on field data and, after working on them for a while, realized I had forgotten an important early filtering step.*** With my old system, this would have been immensely frustrating and resulted in me having to redo everything. With my new system, I just went back, added one line of code, and reran everything. It was magical.
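To give a flavor of what that looks like, here is a minimal sketch (not my actual code) of that kind of early filtering step in base R. The file name and column names are made up for illustration; the actual rule is the one described in the third footnote below.

```r
## Minimal sketch of an early filtering step, with made-up names.
## 'inf' is assumed to have one row per animal, with columns
## lake, date, host, and infected (0/1).
inf <- read.csv("infection_data.csv")  # hypothetical file name

## Count how many individuals were analyzed in each
## lake-date-host combination
counts <- aggregate(infected ~ lake + date + host, data = inf, FUN = length)
names(counts)[4] <- "n.analyzed"

## The forgotten step: keep only combinations with at least 20 individuals
keep <- counts[counts$n.analyzed >= 20, c("lake", "date", "host")]
inf <- merge(inf, keep)  # inner join drops the too-small combinations

## ...all the downstream analyses and figures then rerun
## from this filtered data frame.
```

Because the whole analysis lives in one script, adding those few lines near the top and re-running the script regenerates everything downstream.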
But I still haven’t reached full reproducibility yet, and I am surely far from many people’s ideal for reproducibility. (For starters, I haven’t used GitHub or anything similar yet.) For the manuscript I’m working on now, I still exported figures to Powerpoint to arrange them into panels. I know that, in theory, I can do this in R, and I could get the basic arrangement worked out there, but I couldn’t figure out how to arrange the panels without a lot more white space between them than I wanted. I imagine that, with enough time, I could have figured that out. But, at the time, it didn’t seem worth the effort. Of course, if something comes up and I need to remake all of them, I might come to a different conclusion about whether it would have been worthwhile! (Update: I decided to add two new panels to the figure, and spent all day Monday working on it. I got to the point where I could get them arranged nicely in R, but never did figure out why two of the y-axis labels weren’t centering properly. So, that last step still happened in Powerpoint. Sigh. I was so close!)
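In case it’s useful to others, here is a minimal sketch (with made-up data and labels) of the sort of base-graphics approach I was wrestling with: shrinking each panel’s own margins (mar) and putting shared, centered axis labels in an outer margin (oma) is one way to cut down the white space between panels.

```r
## Minimal sketch: a 2 x 2 panel figure with reduced white space.
## The data and axis labels are made up for illustration.
set.seed(1)
x <- 1:10

par(mfrow = c(2, 2),       # 2 x 2 grid of panels
    mar = c(2, 2, 1, 1),   # small margins around each individual panel
    oma = c(3, 3, 0, 0))   # outer margin to hold shared axis labels

for (i in 1:4) {
  plot(x, x + rnorm(10), xlab = "", ylab = "")
}

## Shared axis labels, centered on the figure region as a whole
mtext("Sampling date", side = 1, outer = TRUE, line = 1)
mtext("Infection prevalence", side = 2, outer = TRUE, line = 1)
```

The trick is that mar controls the space around each individual panel, while oma reserves space around the whole grid, so each axis label can be centered once for the full figure rather than once per panel.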
That brings me to the topic of my post for tomorrow: how do you learn new analyses and programming skills? I’m looking forward to hearing what other people do! And next week I’ll come back to the topic of making figures.
* This is the version that Christie Bahlai used when she taught at the Women in Science and Engineering Software Carpentry workshop at UMich in early January. The quote has become a standard in SWC workshops. In an email thread that included an amusingly expanding number of SWC instructors, Paul Wilson pointed me to a tweet by Karen Cranston as the original motivation for the quote.
** It is embarrassing to me how often I forget not just details of experiments, but entire experiments. For example, for the manuscript I am working on now, I forgot that we had done an experiment to test for vertical transmission of the parasite. Fortunately, the undergrad who has been working on the project remembered and had it in his writeup!
*** I remove lake-date-host species combinations where we analyzed fewer than 20 individuals for infection. Our goal is to analyze at least 200 individuals of each host species from each lake on each sampling date, but sometimes there are fewer than 200 individuals in the sample. If we have fewer than 20, I exclude that lake-date-host combination because infection prevalence is based on so few animals that it is impossible to have much confidence in the estimate.