Something I’ve heard very little about in ecology (despite the field’s rapid and near-complete embrace of R) is the recent proliferation of forks of R. For those who aren’t software engineers, forking is when the body of code implementing a program splits into different, alternate, even competing versions or “branches” (with different people behind each branch).
After years of social cohesion around the “R Core Team” version, R has recently seen a number of forks:
- pqR
- Renjin
- FastR
- CXXR
- TERR
- Riposte
There are a number of differences between these versions. Renjin and FastR are being rewritten in Java (original R is written mostly in C), while CXXR is being rewritten in C++. This might not matter to most ecologists, but it should lead to some performance and memory advantages. TERR is a bit of an outlier in that it is being developed commercially and is targeted at big data (bigger than fits in memory), which is a well-known weakness of R (yes, before I get flamed, I know there are a number of open source packages in R that try to deal with this, but it is not built into R from the ground up). Some are clearly at more advanced stages than others (e.g. FastR and Riposte just take you to GitHub source code pages, while the others have friendlier home pages with explanations). pqR and CXXR are building on top of core R and therefore have very high odds of working with whatever package you want to use. TERR and Renjin are not innately compatible but have put a lot of effort into building compatibility with common R packages. FastR and Riposte don’t yet seem to have good answers on package compatibility, but they are still in early stages. In general, pqR is the most conservative – just tweaking the core R product for speed – and probably the best at compatibility. A nice review of these six alternatives (if you’re a programmer) is found at 4D Pie Charts in part 1, part 2 and part 3 (skip to part 3 if you want the bottom line and not all the computer programming details).
The one thing they all have in common is trying to speed up R. This matches my own experiences (and is why I never use R for my own personal research unless pressured into it by group consensus in, say, a working group). It is just really slow. Not a big deal if you have a field dataset with a few hundred rows. But even the comparatively small datasets I work with, like the Breeding Bird Survey and US Forest Inventory (a few million rows), bring R to a crawl (and again yes, I know there are tools, but I have better things to do with my time). Matlab and Python are both noticeably faster on most real-world tests (no programming language is fastest on every test). Recently I was implementing a relatively complex MLE (maximum likelihood estimation) routine (on detection probabilities – so a complex formula, but still analytically formulated), something you’d think R would be awesome at – and to my surprise the same code in Matlab ran 10–100 times faster than in R (subsecond vs 30 seconds).
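To give a flavor of the kind of interpreter overhead I mean, here is a toy illustration (not my actual MLE code – `loop_sum` is a made-up example function):

```r
# Toy illustration of base-R interpreter overhead: summing a million
# numbers with an explicit loop vs. the vectorised sum() built-in.
n <- 1e6
x <- runif(n)

# explicit loop, evaluated element-by-element by the interpreter
loop_sum <- function(v) {
  s <- 0
  for (i in seq_along(v)) s <- s + v[i]
  s
}

system.time(loop_sum(x))  # interpreted loop: much slower
system.time(sum(x))       # vectorised: runs in compiled C code
```

The two calls return the same answer; the difference is purely in how much work happens in the interpreter versus in compiled code, which is exactly the gap the forks are chasing.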
To me, the most fascinating aspect of this is the social one. Why all these forks? Forks happen in almost every open source project (people disagreeing and taking their toys and going home is human nature). But as a long-term watcher of open source, I would say the number and seriousness of these forks is unusual. I don’t think this is a coincidence. As stated in the last paragraph, core R has not risen to the challenge of performance, which is something people crave, and there is only so much you can do to fix this with add-on packages. Or put in different terms, the rate of innovation in the 3rd-party packages of R has been exceptionally high (one of the reasons for its uptake), but the rate of innovation in the core (affecting fundamental issues like ease of use, performance, memory management and size of data) has been slow*.
As an example, consider pqR. Although pqR author Radford Neal has taken the high road in what he says, he very clearly offered his first batch of improvements to the core R group and they mostly got rejected with not much more than a “we don’t do things that way here” attitude (http://tolstoy.newcastle.edu.au/R/e11/devel/10/09/0813.html and http://tolstoy.newcastle.edu.au/R/e11/devel/10/09/0823.html – and, reading between the lines, good old-fashioned “not invented here” syndrome). His frustration levels clearly reached a point where he decided to work from the outside rather than the inside for a while. While I have never offered patches to the core, my own experiences with trying to submit a package to CRAN shocked me. The number of people involved was ridiculously small (Brian Ripley replied to one of my emails), the openness to different ways of thinking was zero (the pedantry from obviously smart people – you have to do X even if X makes no sense in my context – surprised me), and the rudeness levels were extreme, epic and belonged in a television soap opera, not science.
All of this has created a muddle for the poor person who just wants to do some statistics! Now you have to figure out which version to use (and package/library writers have to make their software work with several versions of R).
I have not yet tried any of the alternatives (as noted I mainly use R for teaching and Python and Matlab for personal research – I’m not taking students to the bleeding edge of technology without a reason). But, given that I basically don’t know what I’m talking about 🙂 , my recommendations would be:
- If R is fast enough for you – ignore the whole thing and wait for the dust to settle (which it will in 2-3 years, probably either with one clear widely accepted alternative or with changes rolled back into the R core after they decide they’d better start listening to other people)
- If you really need a faster R right now, try pqR and Renjin (and maybe CXXR if you’re gung ho). Both are freely available, seem to offer real speed improvements, seem to have high compatibility with packages (although pqR is probably higher) and are moderately mature.
I am really curious to hear from our readers. Have you heard about the alternative implementations of R? Do you care about them or are they off your radar? Has anybody tried one?
*The GUI (graphical user interface) is another area where internal innovation has been slow, but fortunately 3rd parties can and eventually have stepped up. If you aren’t already using RStudio, you should be.
As someone who splits his time between R, Matlab and Python, I have never understood the enthusiasm a lot of people seem to have for R (other than the general observation that people tend to pretty consistently over-hype their favorite programming languages). For pretty much everything I use these tools for, Matlab is light years ahead of R. Plotting (scatter plots) is better, curve fitting is better, data import/export via the GUI (very useful if you are copying and pasting small or intermediate chunks of data back and forth between a spreadsheet or some other piece of software) is better, the coding experience (cell mode, the profiler, etc.) is better, and on and on. The other thing that really does it for me is the documentation. When I google around for R help I usually end up with either 1) archives of mailing lists, 2) step-by-step instructions for how to do task “x,” with no or very little explanation of what the commands are doing or what the intermediate data structures being created are (and therefore no way to extend my knowledge of the process to some other similar task), or 3) the “official” documentation for whatever function I am using, which ends up being basically a tersely written unix man page. When I google for Matlab help I end up on the MathWorks website with professionally written documentation, examples, and explanations of everything. Then there are the actual pdf/HTML manuals, which are basically full books that go through in detail how things work and how to do the more common tasks. I mean, they obviously spend a huge amount of money producing this stuff. There is no comparison with R. Of course the downsides are it’s proprietary and it costs a huge amount of money. These are pretty significant downsides, and they keep me using R for at least a subset of my work.
None of this is to say that R is /bad/ or that everyone should use something else, it’s just that when I hear R users talk about R they seem to gush over it like their lives were empty before it and it’s the best thing out there. I feel this is not the case.
I suspect this comment will start a lively debate! I’ve already tipped my hand that I’m not going to be the one to disagree with you though. All I’m going to add is that I 100% agree with you about the importance of and the differences between tools in documentation. The amount of time I waste figuring out how to do something new in R is vastly larger than the amount of time I spend figuring out how to do it in Matlab (with Python being intermediate).
Brian, I think you overplay the importance of, and the traction that, these forks/rewrites of R have in practice within the community. At most, among the cognoscenti that I listen to on such things, pqR is seen as interesting, but not something you are going to use day-to-day. Your characterisation of the original receipt of Radford’s patches is spot on. A later approach to R Core with well-tested code and examples of the issues being solved has been received more gratefully, with (IIRC) Duncan Murdoch offering to look at and port to R some or all of Radford’s changes. This is complicated because Radford is working from the sources of the 2.15.x branch of R while R Core are working on the 3.1.x branch for development purposes, and there have been lots of internal changes to allow “64bit” vectors in R over that time.
Something you don’t mention is that R Core now see R as a mature and stable platform for statistical computing. There aren’t going to be large, sweeping changes to the language because they see stability as a key feature now. So it isn’t the fastest for the 10 million+ row data sets, but what proportion of people actually have problems that large? It is a small proportion. I have had little issue with R in terms of its speed of computation for data sets (and I have worked with some huge data sets that required weeks of computing on the cores of my workstation).
One “version” of R you missed off is Revolution’s R flavour which has “Big Data”… (sorry excuse me whilst I vomit)… capabilities and is more widely established than the other forks, and has been pretty good at feeding back to the community.
One development that you don’t mention that I think renders a lot of this debate somewhat moot (mooted?) is Rcpp and the various add-ons (like RcppArmadillo and RcppEigen for fast linear algebra). This allows users to write C++ code and interface it to R far more simply and quickly than before. The functions and macros provided by Rcpp allow one to write C++ in a manner that is not too far removed from R (you can treat vectors like you would in R, for example), so you don’t have to fully learn a new language to appreciate the benefits. Rcpp is now used by a large number of packages on CRAN and I see that trend continuing. The speed-ups you can achieve are quite staggering compared to base R.
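As a minimal sketch of what this looks like in practice (assuming the Rcpp package and a C++ compiler are installed; `sumC` is just an illustrative name):

```r
# Compile a tiny C++ function and call it from R via Rcpp::cppFunction().
library(Rcpp)

cppFunction('
double sumC(NumericVector x) {
  double total = 0;
  // plain C++ loop, but NumericVector behaves much like an R vector
  for (int i = 0; i < x.size(); ++i) total += x[i];
  return total;
}')

sumC(c(1, 2, 3))  # once compiled, it is called like any R function
```

The point Gavin is making is visible even in this toy: the C++ body reads almost like the equivalent R, but the loop runs at compiled speed.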
I don’t buy the documentation argument that @ben put forward above. There are tonnes of documentation for R and its packages, of varying quality. The Rd pages (the unix man pages @ben mentions) are not meant as documentation in the sense of “learning how to use”; they are there to document what arguments a function takes and what object it returns. There are a plethora of books and other contributed documentation for R now, and mailing lists and websites like StackOverflow (where there are well over 45,000 questions on R alone) mean that there is a sizable resource base out there. Unfortunately, “R” itself does not make for the best search term… Given that one doesn’t pay anything for an R licence, it is not unreasonable that glossy books/manuals are not also provided for free.
Python has begun to catch up to R in terms of higher-level functions and infrastructure for working with data. One of the truly nice aspects of the R language is just how easy and expressive it is to work with data; it doesn’t feel like programming. This of course comes with a price and that is efficiency.
For all the talk of R and its forks, or even Python, I don’t think any of these will be the next big data science / stats language. The heir apparent currently is Julia, which has C-like speed in a language much more expressive than C and closer to R. It is far from ready for everyday use unless you like coding up basic stats routines, but things are developing fast. I’d look to Julia as the longer-term successor to R’s crown rather than Python, the R forks, etc.
Not sure we really disagree about the forks. I said in my first sentence they’re not a significant player in the ecology user community right now. But I do think they’re a commentary on some dissatisfaction with how R is managed, and I hope they spur needed change. I’ve been in the software world for 30 years now (my first paid job was in 1982). While stability is important (and it has been a problem for R in my opinion), in my experience, if you don’t innovate you get passed by. And performance is a central feature. I am seeing hints of R maybe getting passed by if they don’t do something different.
Towards this end, I’ll stand by my claim that Radford Neal got blown off when he offered small but useful improvements (what you say R is looking for right now) and all of his changes would have been lost (not exactly a positive advertisement for open source) if he hadn’t taken his own time and energy to go public with a fork (which is a positive advertisement for open source, but also kudos to Neal).
I don’t see Rcpp as changing anything for the 98% of ecologists who don’t know and never will want to know how to write in C/C++.
I can only speak from my own experience, but as a frequent user of Matlab, Python and R I am going to 100% back Ben’s claim that having carefully written (covers Python) and especially professionally written (Matlab) documentation is worth money. The only “real” documentation in R is all of the books being published in the Springer series now. Some of those are great but some of them are terrible. And they’re not free – it’s pretty easy to sink several hundred dollars into getting adequate coverage of R documentation, while I can get it free online if (big if of course) I’m at a university with a site license to Matlab (and all of Python’s higher quality documentation is free online). Having a top-down push for good documentation as the face of the product is worth a lot in the currency I care most about (time).
Thanks for the pointer about Revolution R.
Within the R community, I don’t think these forks are that influential or have much bearing yet; I doubt they will for a good while either. And I don’t agree that they reflect broad dissatisfaction within the R community. Perhaps R is not great for a small proportion of people with large data sets, but that doesn’t mean that R is suddenly no longer a great tool for those with up to a million records, for example.
As I mentioned, Radford’s changes & demonstration of the improvements and bug fixes in pqR are now being taken on board by R Core, so they are learning. R’s development model is not the epitome of openness, I agree. And note that I agreed with you about the way his original approach was handled by R Core.
Writing C++ in Rcpp with the templating and sugar that Dirk and Romain have added really means that you probably aren’t writing C++ of old. There are significant proportions of ecologists out there who will never want to know how to write R or Python or Matlab code, so I’m not convinced by that argument.
You misunderstand my point about documentation. Good documentation is important, and you can expect it where you or your institution has paid thousands of dollars for licences for Matlab. You can have no expectation of getting it for free just because R or Python are free. I’m sure you can buy crappy documentation for Matlab; R is no different here. I really don’t follow your point about sinking $100s into R books; you’ve sunk nothing into the software, so why complain about spending far less than a Matlab licence on a few R books to get you going? You seem to be ignoring the fact that someone is paying MathWorks some pretty big wodges of cash so you can access this documentation on-line. I can access all of Springer’s UseR books online because of my library’s subscriptions, so by your reckoning I get those for free too!
“I don’t see Rcpp as changing anything for the 98% of ecologists who don’t know and never will want to know how to write in C/C++.”
I might say the same thing, “I don’t see pqR as changing anything for the 98% of ecologists who don’t know and never will want to switch to Linux and build R from source.”
Actually, I say that rather tongue in cheek, not only because I think it can be installed on a Mac (Unix environment), but because 98% of ecologists probably don’t notice the problem of speed that most forks are intended to address. R is fantastic because of the central repository of user contributed packages (CRAN). R sucks because of speed, memory use, and excessive flexibility creating a MASSIVE learning curve that doesn’t level off for a long, long time. Researchers working with big data and computationally challenging manipulations/analyses have a number of solutions through R but none are entirely satisfying. In those cases, it is probably best to try out a better fork or go to Matlab/Python/C++ depending on the specific problem (desire for a specific R package, not wanting to learn another language, comfort with parallel processing in R, etc.).
I think you are absolutely right in that the number of forks represents a problem with the developer side of things. The rigidity and tone has and will continue to push developers away. I will be very curious to see how that plays out over the next decade with R as well as with Julia.
Thanks for another great, thought-provoking post Brian!
Not sure where Brian overplayed the forking trend – he was simply bringing it to attention and wondering whether ecology folks were aware. I personally had not seen any of this forking stuff, and most solutions to computation speed that I’ve read about and used myself address the hardware side (e.g., bigger/faster servers or parallel processing). And Python does not necessarily result in significant gains, depending on the details of the computation.
I do agree that R documentation has improved drastically over time and with sites like StackOverflow, most users can easily find answers to just about any problem. The up-voting structure is great at sorting the good answers and, for problems with multiple solutions, there is often a thorough discussion of the how and why. But, part of the reason there can be so many solutions to an R problem has to do with inefficiencies in the language which cannot be readily fixed. I guess the forks represent attempts to fix those inefficiencies without abandoning the language altogether and losing the valuable developer contributions (and knowledge base) that have accrued.
I like to see the many solutions to a given problem as reflecting the dual nature of R. It is a programming language, so there are lower-level functions and approaches that are more complex to use but gain you efficiencies. Then there are the higher-level functions that are designed for people to work interactively with data, which need to be more expressive and intuitive, but that comes at a price. Then there are the add-on packages that add more sugar or add efficiencies; for example, Hadley Wickham’s plyr package is great and useful but dog slow on moderately sized data and above, or the data.table package, which is less intuitive to get to grips with at first but is super fast, or Hadley’s newer beta package dplyr, which adds data.table-like speed to plyr’s user-friendliness.
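A concrete taste of that duality: the same grouped mean written three ways (a sketch, assuming the data.table and plyr packages are installed; the toy data frame is mine, not from the discussion above):

```r
# One grouped summary, three idioms.
df <- data.frame(g = c("a", "a", "b"), y = c(1, 3, 5))

# base R: always available, a bit clunky
aggregate(y ~ g, data = df, FUN = mean)

# data.table: terse and fast, but a syntax of its own
library(data.table)
dt <- as.data.table(df)
dt[, mean(y), by = g]

# plyr: readable, but slow on big data
library(plyr)
ddply(df, "g", summarise, mean_y = mean(y))
```

All three give groups “a” and “b” with means 2 and 5; the trade-off is entirely in speed and expressiveness, which is Gavin’s point.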
Where I feel the point was overplayed is in the suggestion that R is now this divided community with competing factions. It’s like the media’s oft-portrayed image of climate science consensus: a few insignificant forks/contrarians don’t make for a totally divided community. (By insignificant I mean in their user base etc., not in what they are trying to achieve, and I certainly don’t mean it in a derogatory manner.)
Until those forks work seamlessly with the huge number of add-on packages for R, I don’t see that they will gain much traction inside the R user-base. It is just too much hassle to have to port or recode things yourself.
None of this is to say that R doesn’t have its faults, it does.
I just wanted to note that there’s clearly an analogy here to coexistence theory in community ecology or evolutionary biology. Different versions of R (and more broadly, different programs and languages like R, Python, Julia, C/C++, and Matlab) are subject to trade-offs among different “fitness components”, with no global optimum. Which suggests that none is likely to ever “sweep to fixation”. It might be fun (for some value of “fun”) to try to pursue this analogy as far as it can go. For instance, thinking about “forking” of R by analogy with lineage splitting and diversification in evolution. Thinking about “anagenic” change within each “R lineage” in terms of things like “frozen accidents of history”. Etc.
Ok, less frivolously, one reason I think core R is going to be rather hard to displace in ecology, especially for the majority of ecologists who aren’t working with million-line datasets or whatever, is that undergraduate biostats and ecology courses increasingly are taught in R. This is new, I think, though I freely admit I only have my own anecdotal impressions to go on. It used to be that all sorts of different stats packages were used in undergrad biostats and ecology courses, but that’s increasingly not the case (and insofar as some variety remains, I suspect it’s because of holdouts who use older commercial stats packages like JMP, not because some people are teaching undergrad biostats students Python or Julia or whatever). And people who aren’t “power users” are going to be much less inclined to change to some completely different software package down the road. So while I suppose it’s possible that Python or Julia or some R fork will mostly replace core R among power users at some point down the road, it’s harder for me to see core R being displaced among “garden variety” users.
But all of the above is written by someone who is not only not a power user of R or anything else, but who still uses something called MathCad from time to time, so take it with a grain of salt. 🙂
There is conventional wisdom in business that at any one time there are only two dominant players in a market (think Coke and Pepsi, or Hertz and Avis). There can be niche refinement (think Dollar rent-a-car for cheap rentals), and as a result others can hang around, and sometimes something can come on strong and replace one of the two incumbents (like HP and Dell did to IBM in the PC market). I rather suspect this applies to programming languages in a community too. It is interesting to think about whether this is the same or different in ecological coexistence.
I agree that R is going to dominate in mainstream ecology for a long time (and as I noted I suspect we will see a renormalization of the R forks, probably with Core R back on top after being prodded to change and be more open). What I think is less obvious is what is going to dominate in the hard core community (ecoinformaticians or data scientists or whatever you want to call them). And in the very long view, what they adopt usually comes around to dominate globally (in 10-20 years – you can laugh but it is what happened with R). Personally my money is on Python but betting on R or Julia is also rational.
I’d like to +1 Gavin’s comments. The ways I’ve dealt with speed issues in R are as follows. 1) Parallelize it and forget it. Sometimes I do this formally, sometimes I just write PBS jobs that read parameter sets and farm out the same routine to different nodes. Running big jobs on the cluster overnight solves this for me. 2) As Gavin says, just profile your code and use Rcpp. Here’s my question, though: if you’re using such huge datasets that you actually need speed boosts and are concerned about memory, why not collaborate with a colleague who has those skills?
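The informal version of the “parallelize it and forget it” approach can be sketched with base R’s parallel package (a minimal sketch; `fit_one` is a hypothetical stand-in for an expensive per-parameter-set routine, and mclapply forks processes, so on Windows you’d use parLapply instead):

```r
# Farm the same routine out over a list of parameter sets.
library(parallel)

# stand-in for a slow model-fitting routine (hypothetical)
fit_one <- function(p) p^2

param_sets <- 1:8
results <- mclapply(param_sets, fit_one, mc.cores = 2)
unlist(results)
```

For cluster-scale work the same pattern holds, except each PBS job reads its own slice of `param_sets` and writes results to disk rather than returning them in one R session.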
Vis-à-vis R documentation, I also have to disagree with Ben. At rOpenSci we work hard to create good documentation, package webpages and tutorials. Others also work hard. The lavaan package has fantastic docs, as does all of Dirk Eddelbuettel’s Rcpp family of packages. Hadley also has fantastic docs (http://docs.ggplot2.org/current/). I think many good package authors work hard to provide good documentation as well as support.
I don’t want to drift away from Brian’s original point though. I agree that the R core project is curmudgeonly, and I also don’t really appreciate the snide comments I’ve gotten back on CRAN package submissions. Another way to think about it, though, is that ecologists need to take ownership of their computational skills. The reason CRAN has such a stranglehold over the package infrastructure is that in my experience ecologists use R in a very prescribed way that is more cookbook-following than being creative chefs. It’s also worth noting that Python has a variety of iterations to meet specific needs (IronPython, Jython, Cython) as well.
Ted – I agree with you about documentation on individual packages. To my mind there is enormous innovation in the package world. Probably a whole paper there on applying Darwin’s postulates to software – where is there variance and mutation and selection…? There are some great individual packages. You mention some. I’ll also highlight Jari Oksanen’s vegan and Robert Hijmans’s dismo and raster.
It would be interesting to use evolutionary principles to figure out optimal design of open source projects …
And I agree that speed has not so far driven people to massively adopt R forks (hence the lack of selection pressure, hence the lack of evolution…). But I do think in the long run R will need to open up or be replaced.
Thanks for the nod to Vegan’s documentation; we do try to make the documentation located via `?foo` contain more than just simple descriptions of inputs and outputs, plus Jari took time to write some vignettes to provide a tutorial and explain some of the more esoteric implementation details. Writing this documentation is difficult and time consuming to do well. It is always the last thing that I get round to with my packages…
I doubt R will ever open up to the extent that I think you are referring to. It would take a huge change in the mindset (not quite the right word, sounds too negative) of the development team, who increasingly have come to view R as a stable platform that they add to with things that pique their interests, with the user community providing the extra things.
One main reason why I don’t think R will change radically is that there may be some tweaks here and there that can add some efficiency, but they probably aren’t worth much effort or the risk of instability in what is a mature platform. Conversely, starting from scratch learning the lessons of what worked well in R & other languages, and what didn’t, seems a better approach. People are free to explore new approaches without being weighed down by the legacy of 20+ years of S, S-PLUS, R. One of the two originators of R (I forget which one now) is on record in print as saying as much and has ideas about how to improve scalar operations that R is painfully slow at. Julia is another, tangible, example of this. Python is a general programming language with the data analysis stuff being layered on top as a secondary thought. Julia is striving for an expressive interface/syntax but native C like speed – the best of both worlds. You probably don’t want to be doing day-to-day work with it just yet though.
Despite agreeing with Brian and Ben regarding both the documentation shortcomings and the real issue of speed (even field ecologists like me can run into large data sets or time-consuming simulations!), I find myself using R more and more for programming tasks I would have previously tackled in Python or C++. The major reason: I want to write code that my students can understand and modify. If they are going to learn one language, it is going to be R. Maybe someday it will be Julia, but we’ll see.
I even teach a programming course for grad students that uses R. I would love to teach the course in Python, but the students are already being exposed to R in statistics courses, by advisors, etc.
You describe my own situation pretty well. Both when interacting with students and when attending working groups with peer scientists I usually end up using R. But so far I have been dissatisfied enough with it on various fronts that I’ve been willing to bear the costs of working in multiple languages to go down other paths when working on my own.
One thing that may help some is an increasing number of students are learning Python to program ESRI’s ArcGIS.
Good point about python hooks for ArcGIS — that has been a point of entry for some students.
A part of the reason I’ve moved more tasks to R is simply that I forced myself to do so. I needed a deeper understanding of the nutty language so I could teach my programming course (and the decision to teach the course in R was based on social factors). Until I did that, there was a lot about the language I didn’t really understand despite having used it for graphs and stats for years. Yes, there are packages with great documentation, but I had somehow missed a lot of details on the base language itself (blame it on my own haphazard approach to R). I still prefer and use Python and I rarely do anything in C/C++ anymore. I may go a long time between coding, so my monkey brain has trouble keeping multiple languages and associated idioms easily accessible.
When I compare how I’ve taught my R course to its predecessor course, I do believe my course does a better job of being a general programming course than its predecessor, which was originally taught with Matlab and moved to R. My colleague did a great job with that course and he is a hard-core Matlab user. But I think that one can teach a general programming course in R despite some of its nuttiness. I hope that students can then take those lessons on to other languages. Although I think that R strikes the right balance for most of the students, for some, Python would make a lot more sense as a first language.
Interestingly, I am the opposite way. It’s those “other sorts of programming tasks” that initially pushed me away from Matlab/R and into Python. As John Cook says on his blog:
“I’d rather do math in a general-purpose language than do general-purpose programming in a math language.”
I agree with that general sentiment and I think Dylan does too (and I’ll stipulate that Python is also a much better general purpose language than Matlab too). We were both just commenting on another factor – social pressure – which in ecology all points to R.
Of course R started out as a successor to S+, which was a successor to S, which was intended to be the “best-ever general purpose programming language,” so it is all in the eye of the beholder. Personally, having programmed in dozens of languages, including some outliers like Forth and Lisp and assembly, I find the actual programming language of R to be one of the least intuitive I’ve encountered. But that is a personal opinion. Others rave about it. I also find C++ counterintuitive and wish that Objective-C had beaten out C++, and I mostly stick to just plain old C, so that pretty much just proves that I’m just weird.
In the end for the vast majority of us that don’t spend 90% of our time programming, the single biggest driving force is what we have time to learn and who we can share code with.
P.S. here’s a recent article comparing some of the many Python versions. It seems to me that the variety (and lack of interoperability) between them is much higher than between versions of R. I’m not aware, for example, of a flavor of R that can be compiled, or a version with JIT baked into the interpreter. http://www.toptal.com/python/why-are-there-so-many-pythons
Thanks, that’s a helpful reference.
To me the interesting thing isn’t the number of forks. It’s the sociology of the forking process and the types of forks. Rewriting a beloved language in Java to get access to the Java stack (or similarly for .NET) makes a lot of sense to me (whether it’s Python or R), as does turning an interpreted language into a JIT or compiled one. (Although, for what it’s worth, Matlab has managed to accomplish all of these things except compilation without forking or losing backward compatibility. I also was once the chief software architect of a programming language, and we made similar scopes of change without forking. Forking, to me, is kind of a cheap and unfortunate way out.)
To me, it’s the ones like pqR and, to a lesser extent, the “big data” versions – those tackling the lack of innovation in the core R product – that are somewhat unique to the R context.
To be fair, Matlab’s proprietary license makes it hard to fork 🙂
Ruby has a pretty similar history to R, where the “official” Ruby is Matz’s interpreter, also called the “reference implementation,” but there are others like Rubinius that are designed to be faster and are fairly widely used.
I think this pattern is so common that it’s just a natural occurrence once a (open source) language reaches a certain critical mass.
But my main point is forking is actually suboptimal from an end user point of view and unnecessary from a technical point of view.
Unlike most, from what I infer, I jumped into the thing by immediately trying to program pretty complicated simulations with lots of iterated calculations, loopings, conditional statements, etc., without ever having used it for just basic routine statistics or graphing. This was, shall we say, a mistake of certain proportions, but I’m committed now. I did learn a lot though, including that there’s no way in purgatory or hotter that I’m going through that process again to learn something else, unless it’s very similar. R it is, and R it shall be, till the end of time, amen.
I just follow the common tricks for keeping things fast, like heavy use of the “apply” group of functions instead of loops, and other whatnot, and I run anything really time consuming at night if need be. Other options involve swearing, drinking, avoidance, and outright denial.
Complete tangent but unfortunately the apply() function isn’t faster than a for() loop in R (although maybe lapply is a bit faster): http://yusung.blogspot.com/2008/04/speed-issue-in-r-computing-apply-vs.html
If you’re stuck with R they have some high performance suggestions: http://cran.r-project.org/web/views/HighPerformanceComputing.html
Hmmm, don’t know what I’m thinking of then. Pretty sure I ran a test looping through a big array, vs some other method, and there was a big speed difference, but it was a couple years ago and I don’t know if I documented it.
Under the hood `apply()` is just a `for()` loop. The only thing it does extra for you is allocate memory for the result. You can always be as quick as `apply()` by preallocating storage – creating an object of the correct size – then looping and filling in that object at each iteration.
@Daniel mentions that `lapply()` can be a bit faster, but only because the `for()` part is actually coded in C, calling back to R code. This only really makes a difference if the overhead of the loop itself is expensive relative to the code executed in the body of the loop.
A lot of the hatred of `for()` loops comes from S and S-PLUS days, when they were at one time slow, but this hasn’t been true in R for as long as I have been using it (pre version 1.0). People still think R’s loops are slow, but one quickly realises this is because they don’t preallocate, grow objects at each iteration, and/or don’t use vectorised operations. The advice to use `apply()` et al. has actually gone far too far: people now try to stuff complex code into functions that would be much easier to read, write and understand if they’d written out the steps in a `for()` loop. As I have said many times, there is a time and a place for both `for()` and `apply()`, and people should learn to use both properly.
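The preallocation point can be sketched in a few lines of R (a toy squaring loop, purely illustrative):

```r
n <- 1e4

# Growing the result with c() copies the whole vector at every
# iteration, so the loop does O(n^2) work in total.
res_grow <- numeric(0)
for (i in 1:n) res_grow <- c(res_grow, i^2)

# Preallocating once and filling in place does O(n) work and is
# as fast as the apply() family.
res_pre <- numeric(n)
for (i in 1:n) res_pre[i] <- i^2

# sapply() performs the same computation; it just handles the
# allocation for you.
res_apply <- sapply(1:n, function(i) i^2)

identical(res_pre, res_apply)  # TRUE
```

All three produce the same result; only the time spent copying differs.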
Thanks Gavin, I just learned something important. The importance of pre-allocating the size of the result-holding array rings a definite bell now.
One way to look at the proliferation of forks and add-ons is that a language is mature and finding more specialized uses; another is that the language is reaching the limits of its original design. To be just a little flip, will every script I write for the rest of my career need to have “import scipy” or “library(plyr)” on the first line?
There are a few mentions of Julia up-thread, but I’ll make an explicit plug for it here. I’m a heavy R and Python user, but over the past year or so I’ve been experimenting with Julia and moving some of my work there. It’s still in beta form, but already has a lot of great features, not least of which is a rapidly-developing and well-organized ecosystem of contributed packages. Julia isn’t for everyone yet, but if you’re an R user who regularly finds yourself frustrated by performance (or any of R’s kludgier “features”), it’s worth checking out.
I have heard a lot lately regarding the resistance of the R Core team to change, rigid guidelines for packages, and getting “Ripley-d” when violating CRAN rules. I wonder what everyone thinks about Hadley Wickham’s devtools package, which, with its install_github() function, provides a way for developers to distribute packages via GitHub and bypass CRAN altogether. Is that a net positive or negative?
It is really nice as an end user to have a one-stop shop for published packages for R. I think GitHub is a little more “developery” than most users want to get and also it will expose to you lots of stuff not ready for prime time.
Nonetheless, I think having only one place to publish, where three or four people have absolute power over what gets published, is not healthy for the community. And it’s not even as if that tight control guarantees quality (unlike, say, the idea behind the Apple or Google app stores). All they gate-keep on is whether the package has been compiled a certain way (a good thing, as it allows standard installations and user expectations) and whether it successfully runs a unit test (a bad thing in my case: my package was intended to let Matlab easily call R, and there was no way to come up with a meaningful unit test on their machines, which didn’t have Matlab).
It is important for end users to know that being on CRAN is no guarantee of statistical or numerical accuracy. I can name (but won’t) several packages on CRAN that produce egregiously wrong answers. Reputation of the package and authors is still important.
So bottom line – my opinion – yes alternative sources are good. There’s nothing to stop people from just posting an R package on their website either – it is only slightly harder to install than off of CRAN. That is what I intend to do when I get the time.
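For what it’s worth, both alternative routes are one-liners in R (the package name and URL below are placeholders, not real packages):

```r
# From GitHub, bypassing CRAN entirely (requires the devtools package):
# install.packages("devtools")
devtools::install_github("username/mypackage")  # "user/repo" form

# From a source tarball posted on any website:
install.packages("http://example.com/mypackage_1.0.tar.gz",
                 repos = NULL, type = "source")
```

`repos = NULL` tells install.packages() to treat the argument as a file or URL rather than look it up on CRAN.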
As an alternative something I would strongly advocate for CRAN is to take themselves out of the police role and leave it to crowdsourcing (i.e. let anybody upload anything but put in machinery for users to vote for or against particular packages and add comments).
A little late to the game, but I want to comment anyway as I work for the company behind Renjin.
There are many reasons why you may want to fork R. Increasing efficiency by improving execution speed and reducing memory usage is one of them and, as @eloceanografo mentioned, adapting the interpreter for specialized uses is another. I would say pqR definitely is focused on greater efficiency and speed. This also holds for the data.table package, although that package also introduces new syntax that the author believes improves the ease of programming. Most of these third-party packages will require you to use a specialized syntax, which reduces code portability.
Besides improved efficiency and speed, Renjin’s design has two other goals which are equally important:
1. integration into other (Java) applications, in particular web applications that can scale such as Google App Engine, and
2. greater abstraction between the code and the data.
The second point is important to us, because it allows us to better separate the modelling aspects from the implementation. In other words: we want analysts to focus on their model and not worry about performance and data storage. These issues can be dealt with in the backend by others with a tailor-made solution or using an integration with, for example, a database or Hadoop. These integrations are currently not generally available, but are planned for the future.
For now, Renjin should be of most interest to people who are looking into using R in a Java (web) application. At some point we hope to increase Renjin’s compatibility with GNU R and to make it easier to do interactive analyses, such that the interpreter will also be interesting for “regular” R users. And in the meantime we continue to work with GNU R and submit patches to R Core, which will remain the reference implementation for obvious reasons.
Interesting that you mention Matlab but fail to mention Julia, which is syntactically close to Matlab but many times faster. Go figure!
I think it’s a mistake to evaluate just the programming language. Users care about the whole ecosystem, including libraries and documentation, and on both these fronts R and Matlab are way ahead of Julia. Julia is an interesting phenomenon that I’m watching closely, but I doubt the average reader of this blog will use Julia anytime soon.
Pingback: When Julia replaces R as the preferred open source stats package for ecologists… | LUKE BARRETT | Ecologist