Your primary collaborator is yourself 6 months from now, and your past self doesn’t answer emails
The quote above*, which is commonly used in Software Carpentry workshops, is a great, succinct way of reminding people to annotate code well, and to give thought to how they organize (and name!) data files and code. It is similar to something that I emphasize to all people in my lab regarding lab notebooks: write things in there in so much detail that it pains you, because, in six months, you will have completely forgotten** all the things that seem totally obvious now. But it’s probably not something I’ve emphasized enough in terms of analyses, and I should fix that. For me, the biggest unanticipated bonus of my shift to R has been how much easier it has made redoing analyses and remaking figures.
As I whine about on Twitter periodically, I have slowly shifted from using Systat and SAS as my primary stats and graphics tools to using R. The main motivation for this shift was that it seemed obvious to me that my students should learn R, given that it is powerful and open source. My students working in R meant that I felt like I needed to learn R, too. It’s been a sloooooow process for me to shift to R (it is so frustrating to learn how to do something in R when I know I could have the results after 5 minutes in SAS!), but I finally view R as my default stats program. When I first made the shift, I mainly just wanted to get to the point where I could do the same things in R that I could already do in Systat and SAS. But I now see that a huge advantage of the shift is that my analyses and figures are much more easily reproduced in R.
Prior to shifting to R, my basic approach was to enter data in Excel, import the data into Systat, and then use the command line to do a bunch of manipulations in there. I generally viewed those as one-off manipulations (e.g., calculating log density), and, while I could have saved a command file for those things, I didn’t. This meant that, if I discovered an error in the original Excel file, I needed to redo all of that manually. For making figures, I would again use the command line in Systat; in this case, I would save those command files. I would then paste the figure and the command file I had used to make it into Powerpoint, and then would get the figure to publication quality in Powerpoint. (For example, the tick marks on Systat figures never show up as straight once the figure has been exported, so I would manually draw a new tick mark over the old one so that it appeared straight. That sort of thing was clearly really tedious.) For analyses, if it was really straightforward (e.g., a correlation), I would do it in Systat (again, usually saving the command file). But if the analysis was more complicated, I would go to SAS, which meant importing the Excel file there and then doing the analyses there. I would then paste the output of those analyses into a Word file, along with the code (which I also saved separately).
Overall, on a scale from completely hopeless to completely reproducible, my analyses were somewhere in the middle. I at least had the command files (making it more reproducible than if I had used the GUI to do everything), but I would end up with a folder full of a whole ton of different command files, different Systat data files plus the original Excel files, and with some results in a Word file, some in a Powerpoint file, and some just left as output in Systat. And, if I later needed to change something (or if a reviewer asked for a change in an analysis), it took a lot of effort to figure out which was the relevant command and data file, and I would have to go back through and manually redo a whole lot of the work. One paper of mine has a ton of figures in the supplement. I no longer recall exactly what change a reviewer wanted, but that change meant that I had to remake all of them. It took days. And, yes, surely some of this could have been improved if I’d come up with a better workflow for those programs, but it certainly wasn’t something that arose naturally for me.
Now, with R, I can much more easily reproduce my analyses. I think I do a pretty good job of annotating my code so that I can figure out in the future what I was doing (and so that others who are looking at the code can figure out what I was doing). I recently was doing a long series of analyses on field data and, after working on them for a while, realized I had forgotten an important early filtering step.*** With my old system, this would have been immensely frustrating and resulted in me having to redo everything. With my new system, I just went back, added one line of code, and reran everything. It was magical.
But I still haven’t reached full reproducibility yet, and I am surely far from many people’s ideal for reproducibility. (For starters, I haven’t used github or something similar yet.) For the manuscript I’m working on now, I still exported figures to Powerpoint to arrange them into panels. I know that, in theory, I can do this in R, and I could get the basic arrangement worked out in there, but I couldn’t figure out how to get it to arrange them in a way that didn’t include a lot more white space between the panels than I wanted. I imagine that, with enough time, I could have figured that out. But, at the time, it didn’t seem worth the effort. Of course, if something comes up and I need to remake all of them, I might come to a different conclusion about whether it would have been worthwhile! (Update: I decided to add two new panels to the figure, and spent all day Monday working on it. I got to the point where I could get them arranged nicely in R, but never did figure out why two of the y-axis labels weren’t centering properly. So, that last step still happened in powerpoint. Sigh. I was so close!)
That brings me to the topic of my post for tomorrow: how do you learn new analyses and programming skills? I’m looking forward to hearing what other people do! And next week I’ll come back to the topic of making figures.
* This is the version that Christie Bahlai used when she taught at the Women in Science and Engineering Software Carpentry workshop at UMich in early January. The quote has become a standard in SWC workshops. In an email thread that included an amusingly expanding number of SWC instructors, Paul Wilson pointed me to this tweet by Karen Cranston as the original motivation for the quote:
** It is embarrassing to me how often I forget not just details of experiments, but entire experiments. For example, for the manuscript I am working on now, I forgot that we had done an experiment to test for vertical transmission of the parasite. Fortunately, the undergrad who has been working on the project remembered and had it in his writeup!
*** I remove lake-date-host species combinations where we analyzed fewer than 20 individuals for infection. Our goal is to analyze at least 200 individuals of each host species from each lake on each sampling date, but sometimes there are fewer than 200 individuals in the sample. If we have fewer than 20, I exclude that lake-date-host combination because infection prevalence is based on so few animals that it is impossible to have much confidence in the estimate.
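For illustration, the kind of one-line filtering step described above might look like this in base R (the column names and numbers here are made up, not the actual field data):

```r
# Hypothetical prevalence data: counts by lake, date, and host species.
prev <- data.frame(
  lake       = c("North", "North", "South"),
  date       = as.Date(c("2014-06-15", "2014-06-15", "2014-06-15")),
  host       = c("dentifera", "retrocurva", "dentifera"),
  n_analyzed = c(214, 12, 198),
  n_infected = c(31, 1, 6)
)

# The filtering step: drop lake-date-host combinations where fewer
# than 20 individuals were analyzed for infection.
prev <- subset(prev, n_analyzed >= 20)

# Prevalence estimates are then based only on adequately sampled groups.
prev$prevalence <- prev$n_infected / prev$n_analyzed
```

Because the filter lives in the script, rerunning the whole analysis after adding (or forgetting!) it is just a matter of sourcing the file again.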
I agree completely about the reasons for moving to R and how hard it is to motivate myself to slog through the R learning when I could run the *%&# thing in SAS in 10 seconds.
But I think I disagree about reproducibility: I find R analyses less reproducible, not more. I think your discussion confounds two things: R vs SYSTAT/SAS, and command-line vs. script execution. You can run R from the command line and not have any record of what you did; or you can run SAS and keep your code, and it’s 100% reproducible (that was my technique). The difference you are finding is not because you’ve switched to R, but because you’ve switched to using, annotating, and saving code.
If that were all it was, R and SAS would be equally reproducible. But R is in an earlier stage of evolution than SAS or SYSTAT, and that means that code that runs today may not run a year from now after an update or two. SAS was like that once, but it matured! This is not a huge deal, but it can be annoying.
But like you, I will stick with it, because it’s the Way Our Field Is Going To Work (TM)!
I saved scripts in Systat and SAS, but I ended up with lots of different scripts for a single analysis (which I completely agree is partially just because of how I used them). For me, saving everything as a single script has been much easier in R than in Systat and SAS, leading to increased reproducibility.
Do you make figures in SAS? I don’t know anyone who does that. So, at a minimum, I needed two files — the one in SAS for the statistical analyses, and the Systat one where I plotted things (and, in my case, a third file where I usually did the more basic statistical analyses).
So, yes, I’m confounding script analysis with R, but that’s because I find it easier to have everything in one script in R. I wasn’t really expecting that when I shifted.
Good point about figures. I never used SAS for that, and I don’t use R for it either. I really like SigmaPlot (cue chorus of hissing from open-source folks, including my own students!)
Core R commands are pretty stable already.
If you really worry about reproducibility after updates—a problem that now mostly affects R packages, not base R, which has been mature for quite some time now—you should definitely take a look at the Reproducible R Toolkit provided by Revolution Analytics, notably the checkpoint package:
I haven’t tried it myself (yet), but the idea is that you can install locally the exact versions of the packages used for an analysis. Quite neat. To make this possible, the “Managed R Archive Network” (MRAN) now keeps daily snapshots of CRAN packages, which makes sure that any version of a package published on CRAN can be retrieved (the project started on September 17, 2014). Of course, it means relying on Revolution Analytics’ servers (let’s hope the initiative gets some support from CRAN).
The `packrat` package from RStudio is a good option if you want to make sure your code keeps working over time. http://rstudio.github.io/packrat/
Oh, of course! I forgot about the packrat package… Gives a very flexible approach to deal with package versions locally. In an ideal world, packrat and checkpoint would merge to use a snapshot of your R install either based on already installed or remote packages (think export your install for others vs. replicating locally another state)…
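Even without checkpoint or packrat, one lightweight habit in the same spirit (just a sketch, base R only) is to record the exact package versions used for an analysis alongside the scripts, so a future reader knows what the code ran against:

```r
# Snapshot the installed package versions into a CSV next to the
# analysis scripts.
versions <- installed.packages()[, c("Package", "Version")]
write.csv(as.data.frame(versions), "package-versions.csv", row.names = FALSE)

# R's own version is worth recording too.
writeLines(R.version.string, "r-version.txt")
```

This doesn’t let you reinstall old versions the way checkpoint or packrat do, but it at least documents the state of the system when the analysis was run.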
The problem with re-running R code from a few years ago is not really with R any longer, but with R packages. It’s important to differentiate the two, as R Core is notoriously stubborn about adding new things to base R because they want a stable base.
I agree that some package authors aren’t as careful about not breaking backwards compatibility (I know, I did it in one of my packages just this weekend and was thankfully talked around by a colleague). Another problem is that some package authors don’t maintain their code, and R does change (adding features that CRAN then expects you to rely upon, like imports and namespaces), so packages that don’t adapt get pulled from CRAN. That can affect reproducibility.
As others have mentioned here, there are tools that can help with maintaining local package libraries specifically for analyses/projects, which should aid in at least being able to repeat an analysis. Containers are another way that analyses can be packaged to maintain an air of repeatability, but perhaps not something a novice R user is going to be doing on a Windows PC just yet, for example.
The counter argument to the one you make is that SAS and SYSTAT are closed source; how can you trust that there aren’t bugs in any of the code they wrote that you used?
checkout the R packages ‘checkpoint’ and ‘packrat’ that address your point on ever-changing packages
Great post! Thanks.
I was also worried about that last year, for two reasons: not only myself in six months’ time, but also the readers of my papers. Then I decided to invest some time in learning the R package knitr for making my analyses reproducible. I totally recommend it. It took me a lot of time to figure out all those LaTeX commands in the beginning, but it was worth it later on.
I second Diogo’s suggestion: knitr is the way to go. Instead of commenting code, which can still be rather obscure several months/year later, the idea is to embed code in a more detailed narrative… And if you can “compile” the knitr file, that means the script works entirely!
* Bonus point #1: You can easily export in PDF for a report, in HTML for a website, etc.
* Bonus point #2: With R Studio, it’s now super easy to use it—no excuse!
If you are just doing a grandiosely annotated version of the analysis, consider using Markdown instead of LaTeX. Where LaTeX is exceedingly verbose, Markdown is succinctness itself. You also don’t need the power that LaTeX provides for this basic sort of report/annotation.
The rmarkdown package now makes it much easier to do some of this stuff too; you write in Markdown, embed R code as you would with knitr (but using the Markdown code-block syntax, rather than the noweb/Sweave one), and then Pandoc renders the Markdown to PDF, HTML, DOCX, you name it.
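As a sketch, a minimal R Markdown file of the kind being described looks like this (the title, output format, and chunk contents are placeholders):

````markdown
---
title: "Analysis notebook"
output: html_document
---

Narrative text explaining *why* each step happens goes here, in plain
Markdown, instead of living in code comments.

```{r summary-example}
summary(cars)  # this chunk's code and output appear in the rendered document
```
````

Rendering the whole file (e.g., with `rmarkdown::render("notebook.Rmd")`) reruns every chunk from scratch, so a document that compiles is a script that still works.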
I linked to this in a recent linkfest, but thought I’d throw it out again here because it’s relevant. It’s interesting that your primary reason for wanting reproducible analyses is so that *you* can reproduce them. Not so that others can. This kind of jibes with a recent post on Simply Statistics about how reproducibility (i.e. making it so that others can reproduce your analyses) is overrated:
Yes, I’m talking about self-reproducibility here. And, more specifically, ease of self-reproducibility. Having to spend days redoing everything in response to a suggestion from a collaborator or reviewer can be so frustrating!
Don’t worry too much about working with graphics in R. Despite what people like to claim, it is not really a program for graphics. If it were, graphic designers would be working at the command line constantly. But they are not, and the reason is that efficient design of graphical elements is (for most people) done with reference to the visual layout of the product. In this way, your gravitation to PowerPoint is natural and quite logical. PowerPoint actually has a fairly capable, and more importantly, intuitive, vector graphics engine built in.
If we are talking about 2D plots, R is great for arranging the elements in a way that makes sense for the data, but for all further alterations, a vector graphics package such as Adobe Illustrator is much more capable. Changing line weights, selecting colors and fonts, hiding or showing elements, all of these can be done with mouse clicks, and more importantly, with reference to the product.
I recommend working in R to produce the “meat” of the plot, exporting in a vector format (e.g. PDF) and then taking care of design in Illustrator or the open source alternative Inkscape. All further stylistic changes are better accomplished with these types of tools.
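The export step is a one-liner around the plotting code; a minimal sketch (the filename and plot are placeholders, using a built-in dataset):

```r
# Draw the "meat" of the figure in R and write it as a vector PDF,
# which Illustrator or Inkscape can then open with editable elements.
pdf("figure1.pdf", width = 5, height = 4)
plot(Sepal.Length ~ Sepal.Width, data = iris,
     xlab = "Sepal width (cm)", ylab = "Sepal length (cm)")
dev.off()
```

Because the output is vector rather than bitmap, line weights, fonts, and colors stay editable in the illustration program.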
I have a post about this in the queue for next week! When talking about this recently, I’ve had several people be like “Oh my god, I’m so glad to hear someone else also exports things to powerpoint and finishes up the figure there.” So, I made a post for next week asking people how they do this. I use ppt for it, but I know lots of people prefer Illustrator (in part because of the ability to maintain the figure in a format other than a bitmap). But it’s good to hear that you advocate this approach. I know I am not alone among those switching to R, in having it feel like a failure not to be able to do everything in R!
You will feel the pain when you find some error in your raw data or your code if you modify figures using Illustrator. If you really need that kind of stylistic change, do it after your paper has been accepted.
And if you really need to do it “manually”, for the sake of reproducibility (which includes access to software), try to use Inkscape instead of Illustrator! (and svg instead of PDF, much easier to manipulate)
Still a pain if you have to do a bunch of manipulations repeatedly, but at least even your students could do it since Inkscape is FLOSS.
This is a good thread. Like Meg, I’ve switched to doing analyses in R, so that I have an easy way to go back to what I did before. But I also agree that R is *horrible* at graphics. That said, I disagree that you should do graphics in a non-reproducible way. I am saving a *ton* of time doing revisions on a current modeling paper by having my graphics coded up. The numbers have changed, but all I have to do is change the name of the file in my code and then rerun to reproduce the graphic exactly. Because R is so painful when it comes to graphics (and just about everything else, really), I’m migrating to using Python, which has a couple really elegant graphics libraries. After using it once, I swore I’d never make a graph in R again, it was that easy.
Don’t think I’ve ever before heard the opinion that “R is horrible at graphics”! By graphics do you mean vector illustrations or figures and charts? If the former I understand your anguish, it’s the wrong tool for the job, but for the latter I fail to see how anyone could call matplotlib elegant and R base/lattice/ggplot2 horrible—I’d be very interested to hear your reasoning!
Whilst I can see your point if one were producing figures for a magazine or suchlike, I can recall only two types of instance where I’ve needed to resort to a vector graphics package to touch up a plot. The first is with messy ordination diagrams – vegan can help with decluttering them, but it is often impossible for the label algorithm we use to find sufficient free space to stop labels overlapping. The second is when I needed to do more of a layout figure for a paper in a glamour-mag journal. I did as you suggested and got the basic plot ready, but I waited until the last possible moment to finalise the figure in Inkscape, because it would have been very frustrating to have to go back to the R plot or update data in the final figure because our analyses changed during the review process (which they did).
So, to be the dissenting voice, I would caution against doing what you suggest routinely: unless I’m producing types of figures no-one else is, I doubt you need to go to Inkscape or Illustrator for most publication quality figures required by journals.
I really do think that most journal figures can be done relatively easily in R if you are prepared to brave the multitude of graphical parameters in `?par` and battle with `split.screen()`. Lattice, and more recently ggplot2, have also reduced the amount of time I needed to fiddle with multiple panels, but there is more again to learn with those packages.
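For a concrete (if minimal) example of the kind of multi-panel fiddling meant here, sketched with a built-in dataset:

```r
# A two-panel figure in base graphics; par(mfrow = ...) handles
# equal-sized panels, while split.screen() is for unequal or nested layouts.
pdf("panels.pdf", width = 8, height = 4)
par(mfrow = c(1, 2),       # 1 row, 2 columns
    mar = c(4, 4, 1, 1))   # trim the default margins between panels
hist(iris$Sepal.Length, main = "", xlab = "Sepal length (cm)")
plot(Petal.Length ~ Sepal.Length, data = iris,
     xlab = "Sepal length (cm)", ylab = "Petal length (cm)")
dev.off()
```

Getting spacing right is mostly a matter of iterating on the `mar` (and sometimes `oma`) values, which is exactly the fiddly part being described.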
Great post, Meg! I feel like all the blogs I read are starting to converge!
As for my way of learning new tools, it is very much like you describe and it is how I also first learned R, “do something new each time around.”
An all-or-nothing approach would have failed for me. So I essentially forced myself to do one more thing with R than I did the last time, and eventually got to the point where almost all the work I now do is in R (perhaps more than it should be?!?!?).
As an aside, are you doing your plots in base graphics or ggplot2? I have found ggplot2 to be a bit more verbose for quick graphics, but it is much easier to tweak plots than in base R.
Again, nice post.
Thanks! 🙂 I’m using ggplot2. When I was first switching to R, Jarrett Byrnes and others recommended that. The recommendation, as I recall it, was that it might take a little longer to get the hang of it, but would be so much more flexible and powerful once I had.
Good advice from Jarrett. I only recently switched, but that is because I had many years invested in base; it is true, though, that the flexibility is fantastic. While you can do similar things with base graphics, it is MUCH more difficult to figure out and the syntax is painful.
Good luck with your continued transition to the daRk side! Any questions, give a holler. I am always excited to talk R.
@Jeff: You might regret that offer! 😉
I do most of my plotting in R using R Commander. If you just need a plot of means and standard errors, that’s the easiest way to do it. And it shows you the commands you would’ve typed if you’d been doing the typing yourself instead of clicking buttons, which is how I learned what little graphing in R that I know. And then if I need to fiddle with line weights or arrange a multipanel plot the way I want it or etc., I pull the figures into PowerPoint.
What is it using to do the graphics? Is it the base graphics program? I didn’t know this was an option! Systat had a feature like this, which was really handy for learning how to do things.
Not sure if it’s using base R or some package to do the graphics; can’t recall if there’s a graphics package among R Commander’s dependencies.
Megan: it depends on what packages are being used under the hood. Some will use base, some lattice, perhaps others ggplot. It really depends what package a GUI plugin is using to do the analysis.
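For reference, a plot of means and standard errors like the one mentioned takes only a few lines of base R (sketched here with a built-in dataset; R Commander generates something along these lines for you):

```r
# Group means and standard errors, then a barplot with error bars.
pdf("means-se.pdf")
means <- tapply(iris$Sepal.Length, iris$Species, mean)
ses   <- tapply(iris$Sepal.Length, iris$Species,
                function(x) sd(x) / sqrt(length(x)))
mids <- barplot(means, ylim = c(0, max(means + ses) * 1.1),
                ylab = "Sepal length (cm)")
arrows(mids, means - ses, mids, means + ses,
       angle = 90, code = 3, length = 0.05)  # code = 3: caps on both ends
dev.off()
```

Seeing the generated commands, as the GUI shows them, is indeed a nice way to learn what the underlying functions are.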
I’m working toward R scripts for each project in which I hit “run” and the entire analysis including publication ready plots are done. I use ggplot2 for this and find its flexibility awesome. I would never go back to a canned plotting package (or systat like software).
I agree that reproducibility is most valuable for oneself. I stalk lots of people’s methods and try to reproduce their results frequently (usually as a way for me to learn about either their method or their data). So sometimes I’ll simply simulate data that roughly matches their real data to see how it behaves with different analyses. I always code from scratch to see if I can reproduce their results, although I might borrow a very specific algorithm from GitHub. I haven’t really gained anything if I get someone’s code, hit “run”, and get the same results as they did.
Related to this, a link to this article showed up on R-bloggers yesterday: https://districtdatalabs.silvrback.com/intro-to-r-for-microsoft-excel-users
and it’s related to reproducibility. They have the script:
Referencing a variable by column number instead of column label puts one at risk of accidentally processing the wrong columns. What if at some later date, I have written code that re-arranges the column order, or inserts or deletes columns? Then I get different results…it’s not reproducible. Much better is to use the column label and not number to index what you want. I prefer data.table, which doesn’t even allow you to index by column number:
means_table <- diamonds[, list(caret=mean(caret), depthperc=mean(depthperc), table=mean(table), price=mean(price), length=mean(length), width=mean(width), depth=mean(depth), cubic=mean(cubic))]
More code, but the columns can be in any order, so I think it’s more reproducible (and yes, using the column numbers, the means are associated with column labels, so one should see right away that the wrong columns were processed, but there are many scenarios where this isn't the case).
Ugh, that’s horribly verbose, not to mention an inefficient use of repeated calls to `mean()`. This is much simpler:
(You could put the vector of names inside the brackets, but I find this is easier to read.)
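One version along those lines, sketched with a toy data frame rather than the diamonds data (not necessarily the exact code meant above):

```r
# Index columns by label and compute all the means in one call.
df <- data.frame(caret     = c(0.2, 0.3, 0.4),
                 price     = c(326, 350, 400),
                 depthperc = c(61.5, 59.8, 62.0))
cols <- c("caret", "price", "depthperc")
means_table <- colMeans(df[cols])
```

Keeping the column names in a single vector means there is exactly one place to update if the variables of interest change.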
True! (Still, the point was avoiding misreferencing columns, not elegance or speed, so I wasn’t too focussed on those.)
I am learning how to organize a project from raw data to final manuscript. And here (https://github.com/daijiang/workflow_demo) is what I have learned so far. Hopefully it will be helpful for others also. Suggestions and comments are more than welcome.
Great post Meg! I am at the early stages of investigating R and look forward to your future posts!
FWIW: There is a package being developed that can read SAS data into R to make the transition easier for SAS-y people (a C compiler is needed to install it currently): https://github.com/hadley/haven
Great post, but it seems that you’re giving R more credit than it deserves. It seems your primary increase in ‘reproducibility’ is that you’re creating and saving scripts for analyses rather than running in the command line. You can definitely do this in SAS. I go from raw data to complicated analyses with a single hit of the run button, and keep well organized, well annotated code in SAS, so that I can always manipulate data in an earlier step and rerun analyses with ease. I’m not saying switching to R was a bad idea, it just seems that you’re confounding two things here! But you’re right that SAS sucks for graphics, so I do analyses in SAS and graphics in R. I am in the process of switching to R for all, but as you say, it is SOOO painful!!!
This came up earlier in the comment thread; see Stephen Heard’s comment and Meg’s reply above.
Interesting, the lack of R love in these comments despite its popularity in ecology and in developing new analytic tools! I’m going to try to balance this by stating that I love R, thought it was easy to migrate to, and find ggplot2 graphics outstanding because of their flexibility and power. A little googling solves everything. This quote from Li, “You will feel the pain when you find some error in your raw data or your code,” is spot on, but I’ll add that it doesn’t take an error. In my projects, it takes a lot of experimenting to get a graph to represent the idea that I want to represent (how to illustrate effect as a function of multiple parameters); in R this is a slight change to a few ggplot2 parameters. In fact, I’ll often go back and forth between graphs and analysis (that is, the graph will drive the analysis). Other than a quick histogram or scatterplot, I don’t bother with base graphics but instead go right to ggplot2.
I think that Jeff Walker just raised another good reason to migrate to R: the immense amount of information and help forums you can find on the web. The process of migrating to R as your ‘only’ stats program can be painful, especially if you don’t know other programming languages, and I could not thank enough the people that contribute to these forums. Sometimes it takes some time to find the answer, but I’ve learned a lot by browsing through StackOverflow and others.
As for tools to learn new programming skills, well, R is a language, and as such it takes practice, practice and practice. But for beginners, I found some of the courses on data analyses in Coursera extremely helpful. I particularly appreciated several stat courses by Johns Hopkins University (https://www.coursera.org/jhu).
This comment is great, and fits well with my new post from today! I’m going to link to it from there. 🙂
The pipe operator (%>%) from the magrittr package is also very helpful for making your R code more reproducible (for yourself). Not in all cases, but in a lot.
With it you can write R commands in a chain, which makes your code simpler, like `dataset %>% lm(response ~ explanatory, data = .) %>% summary()`.
It is also available in the dplyr package, which is a really, really great package for data subsetting and makes it really easy.
I think it is definitely worth looking at and if you google it you will find a lot of blogs/intros about it.
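A small example of the chaining being described (this assumes the magrittr package is installed; the dataset and model are just illustrative):

```r
# Each %>% feeds the left-hand result into the next call; the dot marks
# where it should go when it isn't the first argument.
library(magrittr)

fit <- mtcars %>%
  subset(cyl == 4) %>%         # keep only the 4-cylinder cars
  lm(mpg ~ wt, data = .) %>%   # "." = the subsetted data frame
  summary()
```

Reading top to bottom, the chain documents the analysis steps in order, with no intermediate variables to keep track of.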
+1 to all of this. Using dplyr can completely transform your R experience for the better.
I just started using dplyr last week! I really like the pipe command. I haven’t tried the magrittr package yet, but have heard others say good things about it, too. I should check it out.
Pingback: How do you learn new skills in R? | Dynamic Ecology
Pingback: How do you make figures? | Dynamic Ecology
Pingback: Collecting, storing, analyzing, and publishing lab data | Dynamic Ecology
I’ll just leave this here:
Pingback: Friday links: vaccine prioritization vs. conditional probability, Am Nat vs. PubPeer, and more | Dynamic Ecology