Something I’ve heard very little about in ecology (despite the field’s rapid and near-complete embrace of R) is how quickly R is forking into multiple versions. For those who aren’t software engineers, forking is when the body of code implementing a program splits into two different, alternate, even competing versions or “branches” (with different people behind each branch).
After years of social cohesion around the “R Core Group” version, R has recently seen a number of forks and alternative implementations: pqR, CXXR, Renjin, FastR, Riposte and TERR.
There are a number of differences between these versions. Renjin and FastR are being rewritten in Java (original R is written mostly in C). CXXR is being rewritten in C++. This might not matter to most ecologists, but should lead to some performance and memory advantages. TERR is also a bit of an outlier in that it is being developed commercially and is targeted at big data (bigger than fits in memory), which is a well-known weakness of R (yes, before I get flamed, I know there are a number of open source packages in R that try to deal with this, but it is not built into R from the ground up). Some are clearly at more advanced stages than others (e.g. FastR and Riposte just take you to GitHub source code pages, while the others have friendlier home pages with explanations). pqR and CXXR are building on top of core R and therefore have very high odds of working with whatever package you want to use. TERR and Renjin are not innately compatible but have put a lot of effort into building compatibility with common R packages. FastR and Riposte don’t yet seem to have good answers on package compatibility, but they are still in early stages. In general, pqR is the most conservative – just tweaking the core R product for speed – and probably the best at compatibility. A nice review of these six alternatives (if you’re a programmer) is found at 4D Pie Charts in part 1, part 2 and part 3 (skip to part 3 if you want the bottom line and not all the computer programming details).
The one thing they all have in common is trying to speed up R. This matches my own experiences (and is why I never use R for my own personal research unless pressured into it by group consensus in, say, a working group). It is just really slow. Not a big deal if you have a field dataset with a few hundred rows. But even the comparatively small datasets I work with, like the Breeding Bird Survey and US Forest Inventory (a few million rows), really slow R to a crawl (and again yes, I know there are tools, but I have better things to do with my time). Matlab and Python are both noticeably faster on most real world tests (no programming language is fastest on every test). Recently I was implementing a relatively complex MLE (Maximum Likelihood Estimation) routine (on detection probabilities – so a complex formula, but still analytically formulated), something you would think R would be awesome at – and to my surprise the same code in Matlab ran 10-100 times faster than in R (subsecond vs 30 seconds).
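For readers who want to run this kind of comparison themselves, here is a minimal sketch in Python of timing a maximum likelihood fit. It is not the routine I described above: it assumes the simplest possible model (one constant detection probability across sites, binomial detections over repeated visits), and every function name in it is made up for illustration. The point is only the pattern: write the negative log-likelihood, minimize it, and wrap the fit in a timer so the same thing can be repeated in another language.

```python
import math
import random
import time


def neg_log_likelihood(p, detections, visits):
    """Binomial negative log-likelihood for one constant detection probability p.

    Assumed toy model: at each site the species is detected d times in n visits.
    The binomial coefficient is dropped because it does not depend on p.
    """
    total = 0.0
    for d, n in zip(detections, visits):
        total -= d * math.log(p) + (n - d) * math.log(1.0 - p)
    return total


def golden_section_minimize(f, lo, hi, tol=1e-8):
    """Minimize a unimodal function f on [lo, hi] by golden-section search."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    fc, fd = f(c), f(d)
    while b - a > tol:
        if fc < fd:
            # Minimum lies in [a, d]: shrink from the right.
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = f(c)
        else:
            # Minimum lies in [c, b]: shrink from the left.
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = f(d)
    return (a + b) / 2.0


def simulate_surveys(n_sites, n_visits, true_p, seed=0):
    """Simulate detection counts (hypothetical data, not a real survey)."""
    rng = random.Random(seed)
    visits = [n_visits] * n_sites
    detections = [
        sum(rng.random() < true_p for _ in range(n_visits))
        for _ in range(n_sites)
    ]
    return detections, visits


def timed_fit(detections, visits):
    """Return the MLE of p and the wall-clock seconds the fit took."""
    start = time.perf_counter()
    p_hat = golden_section_minimize(
        lambda p: neg_log_likelihood(p, detections, visits),
        1e-6,
        1.0 - 1e-6,
    )
    elapsed = time.perf_counter() - start
    return p_hat, elapsed


if __name__ == "__main__":
    detections, visits = simulate_surveys(n_sites=500, n_visits=10, true_p=0.3)
    p_hat, elapsed = timed_fit(detections, visits)
    print(f"estimated p = {p_hat:.4f}, fit took {elapsed:.4f} s")
```

For this toy model the answer is also available in closed form (total detections divided by total visits), which gives a handy check that the optimizer is behaving before you trust any timing numbers. Timings from a sketch like this will vary by machine and say nothing about the 10-100x figure above, which came from a much more complex likelihood.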
To me, the most fascinating aspect of this is the social one. Why all these forks? Now forks happen in almost every open source project (people disagreeing and taking their toys and going home is human nature). But as a long-term watcher of open source, I would say the number and seriousness of the forks is unusual. I don’t think this is a coincidence. As stated in the last paragraph, core R has not risen to the challenge of performance, which is something people crave, and there is only so much you can do to fix this with add-on packages. Or put in different terms, the rate of innovation in the 3rd party packages of R has been exceptionally high (one of the reasons for its uptake) but the rate of innovation in the core (affecting fundamental issues like ease of use, performance, memory management and size of data) has been slow*.
As an example, consider pqR. Although pqR author Radford Neal has taken the high road in what he says, he very clearly offered his first batch of improvements to the core R group and they mostly got rejected with not much more than a “we don’t do things that way here” attitude (http://tolstoy.newcastle.edu.au/R/e11/devel/10/09/0813.html and http://tolstoy.newcastle.edu.au/R/e11/devel/10/09/0823.html – and, reading between the lines, good old-fashioned “not invented here” syndrome). His frustration clearly reached a point where he decided to work from the outside rather than the inside for a while. While I have never offered patches to the core, my own experiences with trying to submit a package to CRAN shocked me. The number of people involved was ridiculously small (Brian Ripley replied to one of my emails), the openness to different ways of thinking was zero (the pedantry of “you have to do X” even when X makes no sense in my context, coming from obviously smart people, surprised me), and the rudeness levels were extreme, epic and belonged in a television soap opera, not science.
All of this has created a muddle for the poor person who just wants to do some statistics! Now you have to figure out which version to use (and package/library writers have to make their software work with several versions of R).
I have not yet tried any of the alternatives (as noted, I mainly use R for teaching and Python and Matlab for personal research – I’m not taking students to the bleeding edge of technology without a reason). But, given that I basically don’t know what I’m talking about 🙂, my recommendations would be:
- If R is fast enough for you – ignore the whole thing and wait for the dust to settle (which it will in 2-3 years, probably either with one clear widely accepted alternative or with changes rolled back into the R core after they decide they’d better start listening to other people)
- If you really need a faster R right now, try pqR and Renjin (and maybe CXXR if you’re gung ho). Both are freely available, seem to offer real speed improvements, seem to have high compatibility with packages (although pqR’s is probably higher) and are moderately mature.
I am really curious to hear from our readers. Have you heard about the alternative implementations of R? Do you care about them or are they off your radar? Has anybody tried one?
*The GUI (graphical user interface) is another area where internal innovation has been slow, but fortunately 3rd parties can step up and eventually have. If you don’t already use RStudio, you should.