My (and other people’s) best random R tips

Last week, some of us here at Calgary had a “best R tips” meeting. Everyone was asked to bring their best tips for using R. Here’s a compilation. Add yours in the comments. Because the intertubes can never be clogged with too much R content. 🙂


Hadley Wickham’s free book, R For Data Science.

esquisse package – plot with drag & drop options. Works with ggplot2.

praise – the praise function just gives you random praise

The Rcmdr package makes R menu-driven for most common statistical operations and associated graphing. And it has a window that shows you the code you would’ve typed to do what you just did by clicking buttons in drop-down menus. I used to use it more than I do these days, but I still like it for making simple plots of group means and their standard errors (say, to illustrate an ANOVA). It’s the only easy way I know to make a plot of means and standard errors in R. It also makes it easy to make a scatterplot with differently colored/shaped points for different groups, without having to learn ggplot2 or do any complicated coding in base R.

If you’re going to be filling in a matrix/array/etc. by repeatedly running some chunk of code (say, in a for loop), first create an empty matrix/array/etc. (or one filled with obviously-wrong data) outside the for loop. Then fill it in inside the for loop. This is much faster than creating the first row of the matrix/array/etc. inside the for loop and then growing it by one row each time through the loop. Because when you “grow” a matrix/array/etc. inside a for loop in R, R actually creates a whole new matrix/array/etc. from scratch rather than just adding the new row onto the existing matrix/array/etc.
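A minimal sketch of the preallocation tip. The names (`n_rows`, `results`, `grown`) and the toy calculation in the loop body are made up for illustration; swap in whatever your loop actually computes.

```r
n_rows <- 1000

# Preallocate: create the full matrix once, filled with NA so that any
# row the loop fails to fill is obviously wrong.
results <- matrix(NA_real_, nrow = n_rows, ncol = 2)
for (i in seq_len(n_rows)) {
  results[i, ] <- c(i, i^2)  # fill row i in place
}

# The slow pattern, for comparison: rbind() builds a whole new matrix
# on every pass through the loop.
grown <- NULL
for (i in seq_len(n_rows)) {
  grown <- rbind(grown, c(i, i^2))
}

all(results == grown)  # same contents either way
```

With only 1000 rows the difference is hard to notice; it tends to matter once the loop runs many thousands of times.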

ICEC – “ignore code evaluate comments”. The principle that the code should be interpretable just from the comments. Also: try writing all the comments first, and then write the code that does what the comments say it should do.

If you just want to find all the elements of a vector that match a condition, you don’t need the which() function, you can just put the condition in brackets. For instance, x[x>8] will return all the elements in the vector x that are >8. And if you want to count the number of elements in a vector that match a condition, you can just use something like sum(x>8).
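For instance, with a small made-up vector (the names `big` and `n_big` are just for illustration):

```r
x <- c(3, 12, 7, 9, 15)

big <- x[x > 8]      # the elements of x that are greater than 8
n_big <- sum(x > 8)  # how many there are; each TRUE counts as 1

big
n_big
```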

If you’re in a hurry to pull some data from an Excel spreadsheet into R, say for a quick exploratory analysis, just copy it and then read it in from the clipboard using x <- read.table(file = "clipboard", sep = "\t", header = TRUE). Don’t @ me, reproducibility zealots. 🙂

Write a function for anything you’re going to do at least 3 times. Especially for ggplot2, if you’re making the same kind of plot over and over.
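As a small illustration of the idea (not a ggplot2 example), here is a sketch of a helper you might otherwise retype every time you need a standard error. The name `std_error` and its `na.rm` default are my own choices, not anything built into R.

```r
# Hypothetical helper: standard error of the mean, dropping NAs by default.
std_error <- function(x, na.rm = TRUE) {
  if (na.rm) x <- x[!is.na(x)]
  sd(x) / sqrt(length(x))
}

# Reuse it wherever it's needed, e.g. once per group with tapply().
se_by_species <- tapply(iris$Sepal.Length, iris$Species, std_error)
se_by_species
```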

If you have a bunch of functions you like, put them all in one R script, then import it at the start of every session: source(path.to.your.R.script)

%ni% is the negation of %in% (aside: My jaw hit the floor when I learned this. How many other R commands or functions are negated by spelling them backwards?)

styler package: makes your ugly R code look pretty. It’s an add-in in RStudio. Splits long lines of code for you. Makes other aesthetic changes.

To comment out a block of code in R Studio: highlight it, ctrl-shift-c

If you download devtools, you can then run a function from a package without loading the package. For instance, lme4::lmer() means “run the lmer function from lme4 without loading lme4”. This keeps you from having to load a whole package, which will mask functions you don’t want masked.

list.files() will list all the files in your directory. Can use it to read in all the files in your directory, or all the unique files using the unique() function.
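A sketch of one way to read in every file in a directory. The directory, file names and contents here are invented so that the example is self-contained; in practice you would point `list.files()` at your own data folder.

```r
# Hypothetical setup: write two small CSV files into a temporary directory.
data_dir <- file.path(tempdir(), "r_tips_demo")
dir.create(data_dir, showWarnings = FALSE)
write.csv(data.frame(id = 1:2, value = c(1.5, 2.5)),
          file.path(data_dir, "a.csv"), row.names = FALSE)
write.csv(data.frame(id = 3:4, value = c(3.5, 4.5)),
          file.path(data_dir, "b.csv"), row.names = FALSE)

# List only the CSV files, with full paths so read.csv() can find them.
csv_files <- list.files(data_dir, pattern = "\\.csv$", full.names = TRUE)

# Read each file into a list of data frames, then stack them into one.
tables <- lapply(csv_files, read.csv)
combined <- do.call(rbind, tables)
combined
```

The `pattern` argument takes a regular expression, which is why the dot is escaped.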

There’s an R package to make your graphs look like xkcd. Yes, really. Yes, this stretches the definition of “best” tips. 🙂

R Easter eggs:

  • Run ????""
  • Run example(readline)

Looking forward to hearing your best tips. Also looking forward to the first comment saying that one of the tips in the post is the Wrong Way To Do It. 🙂

41 thoughts on “My (and other people’s) best random R tips”

  1. Hi! A caveat about not needing to use “which()”: x[x>8] will return all the elements in the vector x that are >8 *or NA*, whereas x[which(x>8)] will exclude those NAs. I’ve found this to be an important distinction at times!
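The difference is easy to see with a made-up vector containing one NA (the variable names are just for illustration):

```r
x <- c(3, NA, 12, 9)

with_brackets <- x[x > 8]      # NA 12 9: comparing NA to 8 gives NA, and the NA is kept
with_which <- x[which(x > 8)]  # 12 9: which() returns only the positions that are TRUE

# Counting works the same way: without na.rm = TRUE, the sum is NA.
n_plain <- sum(x > 8)
n_no_na <- sum(x > 8, na.rm = TRUE)
```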

  2. I think my one frustration after switching to Linux is that read.table(“clipboard”) doesn’t work. I think there is a way but it seems so much more complicated. (I’d love to hear suggestions, if anyone has them!)

    And I think I’m going to learn ggplot just to use the xkcd package. I so very much need it.

    One piece of advice I always give is to not use attach. Just use the data argument, or $, or good old indexing with brackets, as attach is more likely than not to lead to some confusion in the future.

    Another tip would be to find a text editor you like, learn to use it and write your code in it. I use Sublime Text (and I just don’t feel comfortable with R Studio, sorry!) 🙂

  3. Coincidence! Did some plotting of group means with PG students today.
    One answer in base R is tapply() & arrows():

    mu <- tapply(mydata$response.variable, mydata$grouping.variable, mean)
    SE <- tapply(mydata$response.variable, mydata$grouping.variable, sd) / sqrt(tapply(mydata$response.variable, mydata$grouping.variable, length))
    CI = 1.96*SE

    hh <- barplot(mu, ylim = c(0.9 * min(mu - CI), 1.1 * max(mu + CI)), ylab = "Change in leaf area, cm^2")
    # tapply keeps grouping names so no need to mess around with names.arg
    # might need to fiddle with ylim if working with neg values.
    arrows(hh, mu-CI, hh, mu+CI, length = 0.05, angle = 90, code = 3)

    I know there's a big discussion about ggplot being better in the long-run than base R, but I honestly can't see how having to learn the whole new gg-syntax is easier than the above for new users.

    • This feels like confessing a crime, but I struggle to wrap my head around tapply(). So if I ever want to make a plot of means and SEs without Rcmdr I’m probably just going to copy and paste your code! 🙂

      As long as I’m confessing things, I’ll put my hand up as someone who’s never learned to use ggplot. Though of course I can google for chunks of ggplot code to modify/steal just like anyone else.

    • The gg-syntax is simply more efficient because it can execute those tapply lines in simple statements with the stat_summary() function.

      You could do the same in ggplot with:

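      library(ggplot2)
      # `fun` replaces `fun.y`, which was deprecated in ggplot2 3.3.0; mean_cl_normal needs the Hmisc package
      ggplot(mydata, aes(x = grouping.variable, y = response.variable)) +
      stat_summary(fun = mean, geom = "bar") +
      stat_summary(fun.data = mean_cl_normal, geom = "errorbar", width = 0.3)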

      New R users should obviously learn how to summarize data sets and calculate summary statistics by hand. But ggplot provides some nice shortcuts for visualizing statistical summaries of data, among other nice features that would be more tedious in base plot (e.g., faceting).

  4. I’d mention that %ni% is not a function in base R. However, it’s dead-simple to create it as a new function:

    `%ni%` = function (x, table) match(x, table, nomatch = 0L) == 0L
    1:10 %ni% 5:15
    #this will give this as a response:
    [1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE

    That’s just the inverse of what `%in%` is doing as a function. It’s also handy to know how to write your own binary operators, and to understand that any of the binary operators (like %in%, or +, or *) are really just functions which you can pass arguments to in a special way. So

    `*`(10,2)

    is the same as:

    10*2

  5. Using RMarkdown in R Studio is the best. It gives the ability to combine formatted text, code, and results into a very readable HTML file. You can also add LaTeX-formatted equations and even put the code in drop-down windows to alternately hide/show it. Great for communicating with collaborators who do not use R. Great for lecture notes. I even have students use it to submit assignments. I can read their code and comments easily in HTML and only open the R scripts when something seems wrong and it’s not immediately clear what the problem is.

  6. I would add 3 tips to the list:
    First, organize each scientific project (article, thesis chapter) into an R Project. If you open R via the R Project, you won’t have to set the working directory (R will already know it) and won’t need to use the attach function. Also (without wanting to look like a reproducibility zealot), if you send a zipped folder with the R Project to someone else, they will be able to run all the scripts on any computer without changing anything at all.
    Second, use the Tab key to save typing and avoid mistakes while programming in R. For example, when using read.csv(file = ""), press Tab with the cursor inside the quotes. R will give you a list of all the folders and files in the directory, and then you just need to click on the one you want to read.
    Third, use pipes for nested or long data manipulations. They make code much more understandable, including for the person writing it.
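To illustrate the third tip, here is the same calculation written nested and piped. This sketch assumes R 4.1 or later, because it uses the native `|>` pipe to avoid loading a package; the commenter may well have had the magrittr `%>%` pipe in mind, which reads the same way. The `mtcars` example and the variable names are mine.

```r
# Nested version: has to be read from the inside out.
nested <- mean(subset(mtcars, cyl == 4)$mpg)

# Piped version: reads top to bottom, one step per line.
piped <- mtcars |>
  subset(cyl == 4) |>
  with(mean(mpg))

identical(nested, piped)
```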

  7. My tips, which are more about using R over your lifetime than a quick tip to get something done now. The concern is, starting out with limiting resources just makes it harder to learn more flexible and powerful tools later. I coached high school xc skiing and most of my time was spent de-training their cerebellum from the horrid techniques they picked up in middle school.
    1) skip Rcmdr – I know it’s comfortable but it is very limiting with real data sets and impedes progress learning R. For this reason, I don’t teach beginners Rcmdr
    2) skip base graphics. Yes people do amazing things with base graphics. But ggplot2 and all of the packages that build on it are an amazing resource. Any time spent learning base graphics impedes progress with ggplot2, which you’ll probably end up using anyway. ggpubr (which uses ggplot2) is probably all that many researchers need. Using ggpubr, it is very easy to create plots of means and standard errors, for example https://www.middleprofessor.com/files/applied-biostatistics_bookdown/_book/plotting-models.html#unpooled-se-bars-and-confidence-intervals
    3) learn data.table. Once you are comfortable with it, you’ll understand why I suggest it.

    • Completely agree with #1 Jeff. We don’t tell our undergrad biostats students that Rcmdr exists. As I say, I mostly use it for producing a few specific kinds of plots that it’s good at making and that I often want to make.

      Re: 2: I’m old enough that it may be too late for me to use that tip. But I hadn’t heard of ggpubr and will try to look into it.

    • Two more thoughts about Rcmdr
      1. Reproducibility: archiving the sequence of actions to get from raw data to output is just good scientific practice... I cannot tell you how many years I’ve lost to scratching my head over obscurely titled, semi-processed data files or results files where I had no clue how the numbers in the files got there. R Markdown files achieve this archiving beautifully. Even Excel is okay at it. This practice should start at the very beginning of biology education. Maybe Rcmdr does this. If so, kudos for implementing it. If not, it’s a problem with teaching (and using) Rcmdr.
      2. Scripting is a part of modern science. I recognize that it’s out of the wheelhouse of many biologists my (and your) age, but that doesn’t mean that we shouldn’t emphasize it from the very beginning. Using ez crutches at the beginning just reinforces false comforts.

      • Rcmdr does have a window showing you the code you would’ve typed to get R to do what you made it do by choosing menu options in Rcmdr. So I do think you could incorporate Rcmdr into a reproducible workflow. But I’m a bad person to ask about this because the Data Dryad archives associated with my papers don’t always include a completely reproducible workflow. I do try to teach my students to be better than me, and I am gradually getting better about it myself. But I am far from a shining inspiration to others on this!

    • Just a little note on the (otherwise amazing) ggplot2: It’s not that good for reproducibility and long-term stability of code. This holds for a lot of packages from the “tidyverse” – they are under constant development and so year-old code might not work anymore, because some of the key functions have changed a bit.

      ggplot2 and tidyverse are nice for the actual analyses and for data explorations, and for publication-ready graphics, but you might not want to use them in R packages or repositories of code which are intended to last, or to be a solid base for something else. Base graphics has the advantage that it is rock solid. It may seem like a specialized thing, but it is a dilemma that I have with almost every paper for which I am making my code open.

  8. I suppose a general tip when it comes to teaching R (at least, at university) is to consider your audience very carefully.

    If they’re all PhD students who will likely spend many hours on relatively complex analysis & plotting issues over many years, you might consider starting to teach them following one particular direction.

    If they’re all undergrads who will use it less frequently for less complex problems, another starting direction might be more suitable.

    Many of the arguments about what is the ‘right’ way to teach/learn R seem to be (implicitly) based on how much time users have to spend on learning/using it.

    I don’t expect most of my group of mixed Masters and PhD students (with 0 to intermediate previous experience) to use R in the same way I do, so I prefer to keep things as simple as possible for them (while still allowing them to develop the required skills/comprehension). This basic foundation can then easily be built upon later if required.

  9. I’m alone in this, but I use (1) base R, (2) Notepad++, and (3) NppToR. Advantages: *all* the code tabs and easy parallelization (NppToR executes highlighted text in the R console you last clicked on).

    I also enjoyed writing my own package of core functions (very easy nowadays). Partly for speed and coding efficiency, but mostly for a satisfying distraction. 🙂
    (probably more useful for theory, given that data-focused packages are excellent)

    • *NppToR executes highlighted text in R console you last clicked on, so you can run code in different workspaces simultaneously

    • I used NppToR until about 5 years ago, when I switched to a Linux environment and RStudio added some really useful features. The more recent versions of RStudio let you open a large number of command terminals, which you can send code to directly with Ctrl + Shift + Enter; I generally use this for long-running background tasks I don’t want to run in my main R session.

  10. A quick note: accessing a function from an unloaded package with `::` is built into base R and doesn’t require devtools.

    For me, realizing that you can use lists as data frame columns was revolutionary. This isn’t particularly easy to do in base R, but the tidyverse’s tibble package makes it a lot easier. You can use it to nest your data, so that one column is a list of subset data frames:
    library(tidyverse)

    iris %>% group_by(Species) %>% nest()

    # A tibble: 3 x 2
    # Groups: Species [3]
      Species    data
                 <list>
    1 setosa     [50 × 4]
    2 versicolor [50 × 4]
    3 virginica  [50 × 4]

    This is particularly powerful with the pmap() function (also in the tidyverse), which lets you apply a function to each row of a data frame.

    # save_species_plot <- function(...) { ... }   (the function definition was lost from the original comment)
    iris %>% group_by(Species) %>% nest() %>% pmap(save_species_plot)
    # Creates and saves one scatterplot per species
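The definition of save_species_plot did not survive in the comment above, so what follows is only a guess at what such a function could look like, not the commenter's code. It assumes the dplyr, tidyr and purrr packages are installed, uses base graphics and PDF files written to a temporary directory, and adds an `ungroup()` step; `out_dir`, the choice of plotted columns, and the file names are all hypothetical.

```r
library(dplyr)
library(tidyr)
library(purrr)

out_dir <- tempdir()  # hypothetical output location

# pmap() passes each row's columns as named arguments, so the argument
# names must match the columns of the nested tibble: Species and data.
save_species_plot <- function(Species, data) {
  path <- file.path(out_dir, paste0(Species, ".pdf"))
  pdf(path)
  plot(data$Sepal.Length, data$Sepal.Width,
       main = as.character(Species),
       xlab = "Sepal length", ylab = "Sepal width")
  dev.off()
  path  # return the file path so the result can be checked
}

paths <- iris %>%
  group_by(Species) %>%
  nest() %>%
  ungroup() %>%
  pmap(save_species_plot)
```

`paths` ends up as a list with one file path per species.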

    • Another thing I found helpful when I was first getting deeper into R is do.call(). Instead of calling a function the normal way (e.g., rnorm(n = 10, mean = 2, sd = 3)), you can use it to call a function with a list of arguments: do.call(rnorm, list(n = 10, mean = 2, sd = 3))

      One other note: you can combine or add to lists with c(); e.g.,

      base_list = list(a = 1, b = 2, c = 3)
      c(base_list, d = 4, e = 5)

      # Output:
      $a
      [1] 1
      $b
      [1] 2
      $c
      [1] 3
      $d
      [1] 4
      $e
      [1] 5
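The two ideas combine nicely. A small sketch (variable names are made up) showing that do.call() with a list gives the same result as the ordinary call, and that the argument list can be built up with c() first:

```r
rnorm_args <- list(n = 10, mean = 2, sd = 3)

set.seed(1)
direct <- rnorm(n = 10, mean = 2, sd = 3)

set.seed(1)
via_do_call <- do.call(rnorm, rnorm_args)

identical(direct, via_do_call)  # same seed, same arguments, same draws

# Build the argument list in pieces with c(), then pass it on.
more_args <- c(list(n = 5), list(mean = 0, sd = 1))
draws <- do.call(rnorm, more_args)
length(draws)
```

This is handy when the arguments come from somewhere else, such as a settings list shared across several calls.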
