There are too many overspecialized R packages

I use R. I like it. I especially like the versatility and convenience it gets from add-on packages. I use R packages to do some fairly nonstandard things like fit vector generalized additive models, and simulate ordinary differential equations and fit them to data.

You can probably tell there’s a “but” coming.

There are a lot of R packages now.* And in ecology, I’ve noticed that many of them are very specialized. I’d say overspecialized. For instance, there is now an R package to fit a small number of functional response models to predator feeding rate data. As another example, there’s now a package to simulate the dynamics of the Yodzis-Innes food web model. Many other examples could be given (aside: I emphasize I’m not writing this post to pick on the authors of any particular package. I’m interested in what seems to me to be a broad-based trend.)

I call these packages overspecialized because they’re just doing a narrow subset of the things that existing, broader packages can do. For instance, fitting predator functional response data is just a special case of nonlinear regression. R already has packages for parametric, nonparametric, and semiparametric nonlinear regression. If you want to fit a nonlinear regression to your functional response data in R, you should use a nonlinear regression package. The convenience you gain from using a highly specialized package is a false savings. Instead of having to think about what functional response model(s) you might want to fit, learn how to specify the nonlinear regression(s), and then evaluate the fit(s), you just let the package authors do your thinking for you by restricting yourself to the limited range of options they offer and sticking with their (often debatable) choices of defaults. Which isn’t how you learn. As another example, simulating the Yodzis-Innes food web model is just a special case of simulating ordinary differential equations, a task for which R packages already exist.
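To make the nonlinear-regression point concrete, here’s a minimal sketch of what fitting a functional response with a general-purpose tool looks like, using base R’s `nls()`. The data, model choice (Holling type II), and starting values below are all made up for illustration:

```r
# Made-up predator feeding trial data: prey densities and numbers eaten.
set.seed(1)
prey  <- rep(c(2, 4, 8, 16, 32, 64), each = 5)
eaten <- rpois(length(prey), lambda = 0.5 * prey / (1 + 0.5 * 0.1 * prey))

# Holling type II functional response: eaten = a*N / (1 + a*h*N),
# with attack rate a and handling time h. You choose the model and
# the starting values yourself -- which is rather the point.
fit <- nls(eaten ~ a * prey / (1 + a * h * prey),
           start = list(a = 0.3, h = 0.05))
coef(fit)  # estimated attack rate and handling time
```

From there you’d fit competing models (type I, type III) the same way and compare them with `AIC()` – exactly the thinking a one-button package invites you to skip.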

I also worry about the often-debatable choices of package authors becoming field-wide defaults, purely by virtue of the package’s convenience. For instance, the Yodzis-Innes model is a perfectly good food web model for many purposes–but so are lots of other food web models. If you need to simulate a food web model as part of whatever project you’re doing, you should think about which one you want and why. You shouldn’t just pick whichever one is most convenient to simulate because it happens to have a dedicated R package.
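Here’s a sketch of what simulating a food web model of your own choosing looks like with the general-purpose deSolve package (assuming you have it installed). The model below is a Rosenzweig-MacArthur consumer-resource model rather than Yodzis-Innes, with illustrative parameter values of my own:

```r
library(deSolve)

# Rosenzweig-MacArthur consumer-resource model: logistic resource growth
# plus a type II functional response for the consumer.
rosenzweig_macarthur <- function(t, y, parms) {
  with(as.list(c(y, parms)), {
    dR <- r * R * (1 - R / K) - a * R * C / (1 + a * h * R)
    dC <- e * a * R * C / (1 + a * h * R) - m * C
    list(c(dR, dC))
  })
}

parms <- c(r = 1, K = 5, a = 1, h = 0.4, e = 0.5, m = 0.3)
out <- ode(y = c(R = 1, C = 0.5), times = seq(0, 100, by = 0.1),
           func = rosenzweig_macarthur, parms = parms)
head(out)
```

Swapping in a different model is just a matter of editing the derivative function – the general-purpose solver doesn’t care which model you chose, so nothing nudges you toward whichever one happens to have a dedicated package.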

More broadly, package authors often say that they pick defaults so as to prevent or discourage inexperienced users from making technical mistakes. But what if the package authors themselves are the ones who are mistaken? The dream of reducing the rate of technical mistakes in the literature by imposing default choices on inexperienced statisticians and modelers is a false dream, I think (see here for further discussion). You reduce errors by teaching users good judgement. I don’t know if R packages can be written so as to help teach users good judgement–but I’m pretty sure that writing highly-specialized packages with debatable defaults doesn’t help.

By the way, I say this as someone who’s written code that could (I assume) be converted into a highly-specialized R package. For instance, I have a bunch of code I wrote for simulating various standard simple discrete-time metapopulation models. I could convert that code into a R package called DiscrMpop. And these days, I could even get a paper out of it too, in Methods in Ecology and Evolution. But I confess that I have no urge to do so. Because I think the people who really would benefit from the slight convenience of being able to use that hypothetical package are vastly outnumbered by the people who would be better off having to think about which metapopulation model they want and then code it up themselves.
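For what it’s worth, the sort of thing that hypothetical DiscrMpop package would automate really is only a few lines of code. Here’s a from-scratch sketch of the simplest case, a discrete-time Levins model (parameter values arbitrary):

```r
# Discrete-time Levins metapopulation model, coded from scratch.
# p = fraction of occupied patches; col, ext = colonization and extinction rates.
simulate_levins <- function(p0, col, ext, tmax) {
  p <- numeric(tmax)
  p[1] <- p0
  for (t in seq_len(tmax - 1)) {
    p[t + 1] <- p[t] + col * p[t] * (1 - p[t]) - ext * p[t]
  }
  p
}

p <- simulate_levins(p0 = 0.1, col = 0.4, ext = 0.1, tmax = 200)
tail(p, 1)  # approaches the equilibrium 1 - ext/col = 0.75
```

Writing those few lines yourself forces you to decide which metapopulation model you want and why – which is exactly the benefit I’d hate for users to lose.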

Convenience is great. So is relying on code written by people who write much more reliable code than you ever could. I’m certainly not arguing that everyone should stop using R and its packages and go write their own code in assembly language! But the gains in convenience that come from using some highly specialized package that only does a subset of the jobs of some more general-purpose package are pretty minor or even nonexistent, I think. And I’m not sure there are any gains in reliability either. Indeed, I suspect just the opposite, because I bet R’s popular general-purpose packages are among its most reliable.

My worry here is related to but slightly different than Ben Bolker’s worry about whether statistical software is harmful. Ben was worried about people treating powerful, flexible software as a black box without really knowing what it’s doing under the hood. My worry is about people treating inflexible, highly-specialized software as a black box without really knowing what it’s doing under the hood.

I am aware that I have probably just annoyed some large fraction of you. Looking forward to learning why I’m wrong in the comments.

p.s. Here’s a live shot of me right now. 😉

*Thanks Captain Obvious!

45 thoughts on “There are too many overspecialized R packages”

  1. Via Twitter:

    I’m sure this is right as a description of how R works. The question is whether it’s worth worrying about. I think it is. There’s no way you’re ever going to stop people writing overspecialized packages, because you can’t and wouldn’t want to stop people from writing packages. But at the margins you might be able to slightly slow the rate at which people write overspecialized packages, by encouraging critical thought about what sort of package to write.

  2. One reason to write packages is standardization and reproducibility.
    Most people still do not post their code, let alone their data, online, so you have to go by the description provided in the paper. At best these descriptions are understandable; at worst they are completely opaque. It wouldn’t be the first time I’ve had to debug code as described in a paper.
    I often assume that this “opaqueness” is there to keep an edge on the rest of the community. You raise the threshold for others to use the software / routine / analysis by not providing enough information to easily and transparently recreate your model / function. The same goes for code which is provided without any documentation.
    This opaqueness has to end. And a good way to do this is to formalize things in packages, if only for personal use and to increase transparency to others. So yes, I will continue writing R packages, which are formalized and will occasionally get published if they represent a sizeable piece of work / research. Should people use them? That’s up to them. But at least the routines used are transparent and well documented – which increases the speed of my research and leaves nothing to chance.

    • “I often assume that this “opaqueness” is there to keep an edge on the rest of the community.”

      I would not assume this. Nobody is so Machiavellian as that. Ok, maybe *somebody* is, but that doesn’t warrant an assumption that people are “often” that Machiavellian! People not posting their code and data, or doing so in a form that’s not easy for others to make sense of, just reflects the fact that we’re currently in a transition period. Not so long ago, nobody was expected to post their data and code. Now, everybody is (which is great, btw). So lots of people are having to adopt new ways of working. Some of those people are going to be reluctant to adopt new ways of working, or are going to make mistakes because they’re learning something new.

      EDIT (sorry, had another thought just as I hit ‘reply’): “Should people use them? That’s up to them.”

      As I said in reply to other commenters, I think that’s a defensible stance. Different users have different needs, and unfortunately sometimes users are mistaken about their own needs, but there’s no way to address those issues with software, beyond everybody just writing and sharing whatever software they want to write and share and then letting users pick and choose for themselves. I do think it’s healthy (rather than concern trolling) to recognize the issues raised in the post even if there’s no way to address them by writing software differently. If I were going to rewrite the post after having read the comments, I’d probably write something like “the reason you write specialized R packages is the convenience of automating your own workflow and to make your analyses reproducible by others, not to try to encourage others to adopt your workflow or prevent them from making what you see as technical mistakes.”

      • ” Ok, maybe *somebody* is, but that doesn’t warrant an assumption that people are “often” that Machiavellian! ”

        My experience is mainly with the remote sensing community when it comes to inquiries about sharing code, and it has been nothing but Machiavellian. Maybe this field is an exception, as their funding rides on algorithm development, hence they do not share. Sadly, by excluding others from seeing their code they slow down research. Contrast that with the field of CS, where sharing is the norm.

        My inquiries in the RS field (for code) have always been met with hostility, disdain or utter silence. In hindsight I’ve ‘solved’ some of these issues myself, but at the cost of excluding people from potentially fruitful collaborations. Ironically, the RS community is often the first to come knocking on a field ecologist’s door to ask for validation data (e.g. inventories of all sorts).

        This behaviour often shows up as a line in publications that nominally share code: “For access to the code used in this manuscript please contact author xyz.” This is a cop-out, I feel, and leaves author xyz as gatekeeper of the code / secret sauce. Hence, although it’s often promulgated as “open code”, it isn’t (by far). If I can’t see it, it ain’t open. It can’t be reviewed, it can’t be reproduced.

        So to conclude this rant: I’d rather see people post R packages (or code in general) in heaps than use opaque practices where authors serve as gatekeepers. Obviously some people will misuse the code if they do not comprehend what they are doing, but this goes for all statistics and isn’t inherent to packaged software. Again, all the more reason to look under the hood and share code openly.

  3. “The dream of reducing the rate of technical mistakes in the literature by imposing default choices on inexperienced statisticians and modelers is a false dream…” – Couldn’t agree more, and I’m also an R enthusiast. In addition to that, something that worries me a lot is the growing bad habit of ordering take-out code from colleagues. Many people run those magic codes without even thinking about the adequacy of the analysis, models, and parameters.

  4. I agree with Koen. This post seems to miss the whole purpose of an R package, and how that differs from a script. You create a package when you have one or more routines that are repeated and want a simplified and organized workflow — and this threshold is much lower than you might think (given how easy it is to create a package). The beneficiaries of a custom package can be any number of people ranging from 1 to the R user base.

    Now, if you want to simply vent about how many packages are currently available on CRAN, then so be it. I don’t see why that number matters since any Google search can help you separate the wheat from the chaff. There are many more R packages being shared in places other than CRAN (e.g., GitHub) or not shared at all. I guess it’s not clear why this number matters.

    There is no question that people will learn more about an analysis when they have to code it themselves from scratch. But as with everything, the types of tasks and the reasons for doing those tasks in R fall on a spectrum. You could code up some matrix algebra instead of using the “lm” function for a regression but aside from learning what’s the point? Taking that a step or two further, some analyses are standardized enough that if you don’t want to reinvent the wheel and simply want to learn from your data, you might choose to use a relevant package.

    I do agree that any unvetted R package could potentially have coding mistakes or default choices that don’t make sense in some broader context. That’s the beauty of sharing with the user base: your troubleshooting is crowdsourced! Also, as Koen said, you can’t beat the transparency.

    • ” You create a package when you have one or more routines that are repeated and want a simplified and organized workflow”

      Fair enough. Except that that’s not the stated motivation of at least some package authors. Some package authors come right out and say “We see others in the literature doing highly specialized task X [e.g., fitting a nonlinear regression to functional response data] differently than we think it should be done. So we wrote an R package that does task X as we think it should be done, so as to encourage others to do task X in this way.” And why write a paper in Methods In Ecology and Evolution about your package if your only purpose in writing the package is to automate your own workflow and then share that workflow so that others can reproduce the results in a paper that used that workflow? But I freely admit that I don’t have data on the motivations of R package authors, so perhaps I’m worrying about the motivations of a minority.

      “Now, if you want to simply vent about how many packages are currently available on CRAN, then so be it. ”

      That’s not what I’m venting about. I agree that people can just Google to find the package they want. I do wonder a little if the sensible attitude of “I can just google for it” feeds into, or plays badly with, the less-sensible attitudes of “I don’t need to think too hard about what this package is doing or why it’s doing it” and “I don’t want to stop and think about what task I should be googling for in the first place” (e.g., I’d want a student analyzing functional response data to google for “nonlinear regression R package”, not “fitting functional response models R package”.) But as I said in reply to Dylan, if you want to argue that there’s no way to write software so as to nudge people away from those less-sensible attitudes, I think that’s a defensible stance.

      • Jeremy: with your last point, aren’t you making a huge assumption that users are just using the defaults or not being critical about what they are doing/fitting? (I mean, I’m sure there are people out there doing that, but that’s not the pkg author’s fault – unless their package is flawed.) Unless you have numbers to bring to the table I don’t see why one has to follow the other. Users may have thought long and hard about topic X and decided what they want is option Y, which happens to be conveniently coded in package foo. So why go to the trouble of coding up said model yourself?

      • “aren’t you making a huge assumption that users are just using the defaults or not being critical about what they are doing/fitting?”

        As I said in my reply to Dylan, I see undergrad and grad students at my uni do this all the time, and I have no reason to believe that they’re atypical. But no, I don’t have data on how often this happens. And yes, it’s certainly defensible to say that however common or uncommon this problem is, it’s not one that can be fixed by discouraging people from writing certain sorts of R packages.

      • ” And why write a paper in Methods In Ecology and Evolution about your package if your only purpose in writing the package is to automate your own workflow and then share that workflow so that others can reproduce the results in a paper that used that workflow?”

        Because often others want to benchmark your own model against theirs! Also don’t dismiss the potential community effect.

        In CS there are model zoos, and people share and compare each other’s models. This requires a framework. If you look at deep learning, Caffe and TensorFlow are such frameworks. No equivalent exists for ecological problems, which might merit one.

        You could argue that this doesn’t require an R package, but you quickly end up with a bundle of scripts to achieve the same thing (i.e. a package).

      • “Because often others want to benchmark your own model against theirs! Also don’t dismiss the potential community effect.”

        I don’t follow, can you elaborate? Why can’t people benchmark their own model against yours if you just publish your code as an R package on CRAN or github or etc.? Why does there also have to be an MEE paper? And what do you mean by “community effect”?

        Perhaps we’re talking past one another a little. My question “why write an MEE paper about your R package” really is specific to MEE papers. That’s not a shorthand way of asking “why publish your R code at all?” or “why write an R package at all?” I’m asking “why *also* write an MEE paper publicizing your R package, as opposed to *just* writing your R package and putting it on CRAN or github where people can find it?”

      • “I don’t follow, can you elaborate? Why can’t people benchmark their own model against yours if you just publish your code as an R package on CRAN or github or etc.? Why does there also have to be an MEE paper?”

        You could, but you might not get the visibility – especially early career people.

        “And what do you mean by ‘community effect’?”

        Whereas in OSS people can not only use your package but also contribute. It’s often forgotten that people will contribute (or catch bugs, for that matter). Building a community has to start somewhere, with someone providing a framework.

  5. I share this worry. I have been surprised sometimes when I ask a student why they used a particular approach, and the answer was: “that’s how the package works”. But on the other hand, learning from other people’s code is a great way to learn and packages do provide cleaned up code. I’ve encouraged students to look at the actual package code, think about their own problem, and perhaps remove that dependency.

    • Yup. Makes faculty tear their hair out when (as happens not-uncommonly at my uni) a grad student gives a talk in which they say that their method was to use R package X. As if R package X *was* a method, as opposed to an implementation of a method. And then they can’t explain what the method does, or why they chose that method rather than some other one.

      In fairness, one could argue that this is a problem that cannot be made even a bit better (or even a bit worse) by people writing fewer R packages, or writing packages and the associated documentation differently. You could argue that the only way you can force people to stop and think is to…force them to stop and think. Which is something that supervisors and supervisory committees and peer reviewers can do, but not something that software can do.

      • I would just like to note that it is easy to point fingers at grad/ugrad students and say that faculty tear their hair out, but I would be very rich if I had a nickel for every time I saw an exalted *faculty* member do exactly what you are describing. In fact, many professors who have little coding experience may be more guilty of this (but are challenged less often).

      • Could be! I have no idea. My experience is that faculty tend to be better about this than students, but I freely admit my experience is a small and non-random sample of the universe of faculty and students.

  6. This whole thread so far, in one tweet:

    I guess I’d only add that, if education can encourage package users to stop and think about what they’re doing and why, it can presumably do so for package authors as well. Before you write that R package, or that MEE paper, stop and think about your goals. Are you trying to automate your own repetitive workflow? Make the code for the analyses in your last paper available to others? Or are you trying to get others to do a very specific analysis, and to do it in a very specific way? If the latter is your rationale, you should probably think again.

    • You keep mentioning software papers in MEE; you aren’t going to get a trivial R package into MEE easily at all. Having reviewed enough of these and related things at MEE and elsewhere, I don’t see going this route as a trivial thing to get done.

      • Interesting. I’d like to hear more. What are the criteria by which MEE reviewers and editors evaluate papers reporting R packages? In particular, is there any attempt to evaluate how broadly useful the package is likely to be? (analogous to how selective journals ask if the paper is “interesting”, “novel”, and “important”) If memory serves, there’s an MEE paper about at least one of the packages mentioned as examples in the post.

      • From my experience publishing a method/R package in MEE (though I do not think ours was an overspecialized R package; it had broader application): this is by no means an easy task.
        First, the reviewers/editors were all amazing, but very strict. They all had concerns, and I think the paper would have been rejected if we had not addressed those concerns. They really took the time to check everything.
        Second, regarding the evaluation: we had to show that our method/R package is an improvement over other similar tools and that it could be an analytical step forward that can create novel insights (of course novel insights are not dependent on the method, but I hope you understand what I mean). So from my experience, there is indeed an attempt to assess how broadly useful a package is.

  7. Gavin Simpson says what I’m seeing in the post is the Unix philosophy at work:

    Could be in some cases. In other cases, that’s not the package authors’ stated rationale. But as I said in a previous comment, I lack data on the frequency of different motivations among R package authors.

  8. This is the first post of ours that I can remember about which the Twitter conversation is at least as active as the conversation here in the comments. Usually there’s no conversation about our posts on Twitter, just people sharing them and liking them (perhaps with non-substantive comments like “good post” or “lots to think about here”). Presumably that just shows that lots of R users are on Twitter and like tweeting about R. Anyway, going to keep pulling bits of the Twitter conversation into this thread.

    Colin Robertson disagrees with a previous commenter (and me) and says that the problem is that having too many R packages makes it too hard to discover the right one for you:

    And Kirsty Lees agrees with the general thrust of the comment thread so far:

    • As the author of the Environmetrics Task View on CRAN, which was started to address the problem of the rapidly increasing number of packages being hosted on CRAN, I can attest to this discovery problem. I’m months behind on adding new pkgs to the task view, and that’s just for pkgs that people email me about. And the Task Views don’t even begin to cover other code repos, like github, where some of my ecologically-focussed pkgs reside.

      This is a real problem for users and they may just end up using pkg bar with bad defaults because that’s what they found and either couldn’t or wouldn’t want to write the code themselves.

      • Interesting to hear this perspective. As you’ve seen in this thread, we’ve been getting the full gamut of comments on this point so far. Perhaps I’m overgeneralizing from my own experience (and Carl’s and that of a couple of other commenters) when I say that package discovery is a solved problem, and that the solution is googling?

      • Googling can work, but you do have to work quite hard at times to narrow down the results to a useful subset. I doubt you’ll have issues looking for an R package that includes the term “Yodzis-Innes” (although I just came across two pkgs in the first few hits I see, which is one more than I was expecting to see). For other models or approaches, the problem can be that there are too many implementations, so which one is a user to go with? Googling doesn’t solve that problem, which is part of the broader discoverability problem. Task Views were thought up to tackle this broader issue; given n pkgs that do thing X, which would the Task View author feel merits particular attention? They’re supposed to be a curated list — which was initially easy as there wasn’t much to see on CRAN of an environmetric nature. It’s quite a different proposition now.

        Anyone want to volunteer to help out?

  9. Hi Jeremy,

    You raise a lot of very good issues here. As you probably felt to begin with, the real issue has very little to do with the raw number of R packages on CRAN, but rather the more subtle issues of exactly what those packages do, how they are implemented, and how they are used.

    For instance, I think we can agree that there’s really not an issue about how many packages exist or how specialized packages are in general, via the following thought experiment: I don’t think you’d have any objection if the maintainers of deSolve or another package you see as well scoped and well implemented decided to split some of its functionality out into 10 separate sub-packages, all on CRAN, which were all then imported by the main `deSolve` package? (Splitting a large package into separate packages is a common practice to streamline maintenance and assist other developers.) Clearly this means more packages and more specialized packages, and I think equally clearly this has very little impact on users.

    In my experience, methods that I think are trivial to implement in code often aren’t. It’s not uncommon to find issues (some unlikely to impact results, some more so) even in some of these packages on CRAN that so annoy you. If mistakes can creep into implementations from someone taking the trouble to publish their package on CRAN, how much more so will such issues appear when everyone does their own implementation (probably without even the benefit that the R package structure and check mechanism provides in catching the most common bugs)?

    As an aside, another objection you make is that people are lazy and choose to use methods for which an implementation already exists (say, the Yodzis-Innes food web model). One might conclude this is an argument for more packages (e.g. implementing more alternative models), not fewer, as the most expedient way of increasing representation of other model forms in people’s studies. While eliminating any specific implementation of a food web model might decrease the bias of students/literature toward working with a particular model, it’s not clear if that happens only by also decreasing the denominator.

    • Yes, as I said in reply to another commenter, I don’t think number of packages per se is an issue at all.

      Something I should perhaps have clarified in the post is that I’m not even concerned with specialized packages that automate some little task that lots of people may well want to do. For instance, if base R didn’t already have a command for unwrapping a matrix into a vector, I think it’d be great for somebody to write a package just to do that little task. I take it that’s the sort of thing that Gavin Simpson was thinking of in his tweet that I copied into this thread, about the Unix programming philosophy. What I’m concerned with is people writing specialized packages to automate some specialized task *that’s specific to work on a particular narrow scientific topic*. For instance, “fit a nonlinear regression *to functional response data*”.
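      To be concrete about the scale of task I mean – in base R, unwrapping a matrix is a one-liner:

      ```r
      m <- matrix(1:6, nrow = 2)
      as.vector(m)     # column-major unwrap: 1 2 3 4 5 6
      c(m)             # equivalent shorthand
      as.vector(t(m))  # row-major unwrap instead: 1 3 5 2 4 6
      ```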

      “As an aside, another objection you make is that people are lazy and choose to use methods for which an implementation already exists (say, the Yodzis-Innes food web model). One might conclude this is an argument for more packages (e.g. implementing more alternative models), not fewer, as the most expedient way of increasing representation of other model forms in people’s studies.”

      I was wondering when someone would make that argument! And I guess it’s defensible, though it leads to some weird places. At least, they seem weird to me. For instance, do we ideally need different packages for fitting nonlinear regressions to functional response data, each of which does model selection in a different way? So that users can’t just default to using whatever model selection method is implemented in the one specialized package for fitting functional response data? I dunno, I guess this just reinforces the conclusion that you can’t nudge users to think by how you write software.

  10. The words “I used package X” and “I used a pipeline” immediately throw a flag in any presentation. Andy Royle’s MEE paper on MaxEnt is a great example of the phenomenon you write about. Pretending MaxEnt is not a GLM subject to sampling bias, as novice MaxEnt’ers often are wont to do, is ridiculous and wrong. But a GUI, some specialized R packages, and pipelines have made pulling data, poorly implementing a statistical routine, and making pretty maps exceptionally easy, with the end result being a proliferation of crappy science. There is a struggle in turning hypotheses into math and math into code. That struggle is necessary for learning. As many commenters pointed out, the burden is not so much on R packagers but instead on teaching how to think about analyzing. The current graduate environment, from the two I’ve been intimately exposed to and the many others I have interacted with, is failing students in this regard. As Spiderman’s Uncle Ben said, “with great power comes great responsibility”.

  11. Thanks for posting this Jeremy; whilst I clearly have a different viewpoint on this to you, it has certainly been interesting to think about this a little more deeply.

    I’ve thought about this a lot, on and off, over quite a few years, and not just in relation to ecology-focussed pkgs; I originally had concerns about the proliferation of pkgs now in the tidyverse for example and whether we should be teaching those higher-level tools instead of the, admittedly inconsistent, implementations of the same ideas already available in base R. Your concern is similar in nature to this.

    My main observation is that this post and related concerns really don’t have anything to do with R packages; the issue is really more about how we go about our scholarly endeavours. A student or colleague using pkg foo because that’s what was available, or someone just sticking to the defaults, has nothing, or at least very little, to do with R packages or their number. It has a lot more to do with things like training in stats, math, and computing, about expectations (perceived or real) as to the productivity, output, and focus required to make the next step on the career path, and about the time people have available to just think without worrying about those expectations.

    How do we work to solve those issues?

    Others have raised several rebuttal points that I would have made;

    * having code in the open is a good way to get eyes on it and fix issues, or as a jumping-off point to implement other methods; your script locked away is an unknown entity and of little use to anyone but yourself
    * R pkgs or functions within pkgs often exist for pedagogic reasons; decorana() is in vegan not because Jari thinks it is a good thing (he in fact thinks it is a terrible method to use), but people asked for it for a variety of uses (I use it bc I like to show what it does in trying to remove the arch from a CA)
    * There can often be debate or discussion as to the right way (or even whether there is a right way) to do a particular approach.

  12. As a fish ecologist who tends to find himself listening to birds, turning over rocks and logs in the riparian zone to see what lies beneath, and generally wondering what goes on in the areas beyond my “zone of interest”, I tend to believe that your specific example is simply a symptom of a larger illness that besets our discipline. To make yourself stand out among the multitude of colleagues striving for funding, recognition, or that first full-time position requires laser focus and a clearly identifiable niche in your field. The opportunities to reflect on the broader world around us are increasingly constrained, and as a result the opportunities for true breakthroughs in our own understanding of that world (in the Kuhnian sense) are few and far between. I realize that this comment is a bit beyond the intent of the original posting, but the semester is over, the halls are quiet, the coffee tastes good, and the mind moves where it wants.

    • Similar thoughts here. For many early career people a published R-package in Methods in Eco & Evo will be one of their highest Impact Factor papers for quite a while. Not hard to see the incentive here. (although this should by no means imply that writing & publishing a package is easier than any other type of publication)

      • I share this worry, but my perhaps-overoptimistic hope is that this incentive is somewhat self-damping in the long run. As people write more and more R packages, and get more and more MEE papers out of doing so, other people (including authors of other R packages and MEE papers!) are going to be increasingly less impressed with you for merely having written an R package, or an MEE paper about your R package. People are only going to be impressed if that R package is widely used, or that MEE paper is heavily cited.

        Which arguably already happens to an extent. One way to push back against my worries in this post would be to point to how many users there are for general-purpose packages like lme4 and nls2, vs. for highly-specialized packages that only do small subsets of the things those general-purpose packages do. And to point out that, anecdotally, authors of widely-used general-purpose packages like lme4 seem to be much higher-profile and much more widely-regarded as leaders at the interface of ecology, statistics, and software (think Ben Bolker) than authors of little-used, highly-specialized packages.

  13. This has some echoes of the never-ending, mostly good-natured Twitter debate about base plotting versus ggplot2. To paraphrase the pro-base plotting stance: “it’s good to make things hard, because it makes people think”. 😆 In which I reveal my allegiance to ggplot2.

    “You shouldn’t just pick whichever one is most convenient to simulate because it happens to have a dedicated R package.” This is a really interesting point for me. Because one interpretation is that a lot of our statistical methods don’t “truly exist”. Especially the more advanced ones. They exist in some technical sense, but having a reasonably convenient implementation, with readable docs, is arguably part of what it means to really, really exist. So one interpretation of the proliferation of specialized packages is that it’s too damn hard to take those models you see in a book or paper and implement them on real data. Why is that? Is it a lack of statistical understanding or programming skill? Or both?

    I’m not in ecology so can’t make an informed comment about the specialized package phenomenon. But I wonder if these packages are usually tethered to a specific paper and they are basically documenting computational work in a single publication, using the R package format. I think we are starting to see people using R packages and CRAN for a really diverse set of goals.

    • “So one interpretation of the proliferation of specialized packages is that it’s too damn hard to take those models you see in a book or paper and implement them on real data. ”

      I see what you mean, though in the case of the sorts of packages I’m thinking of in the post that’s definitely not what’s going on. The cases I’m thinking of are specialized implementations of fairly-to-very *basic* methods, not advanced methods.

      “I wonder if these packages are usually tethered to a specific paper and they are basically documenting computational work in a single publication, using the R package format.”

      As other commenters have pointed out, yes, I’m sure that’s often the case. Though at least in some cases the package authors state a different rationale.

      “I think we are starting to see people using R packages and CRAN for a really diverse set of goals.”

      Yes, this thread has made that very clear to me!

  14. I tend to think there are two issues here:

    1) Many people think appearing on CRAN is more of a stamp of validity than it really is – R packages are very much buyer beware. If I don’t know the reputation of the author, I won’t use a package until I test it myself. And believe me, I’ve found several on CRAN that produced garbage.

    2) The issue of defaults and code (regardless of source) that lets people do things they don’t really understand. In Jeremy’s example, there are a lot of ways for nls to go wrong, and a careful scientist needs to diagnose them. I don’t know the package, but it is quite possible it hides the internals enough that people think they got results when convergence was poor, just for example (again, hypothetically: I don’t know this particular package, but I do know this issue has occurred in some other packages on CRAN). I’m not sure that letting people reach over their heads, with some chance it worked and some chance it went wrong without their noticing before publishing, is good for science.
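    A minimal R sketch of the kind of checking meant here (the data are simulated and the Holling type II fit is purely illustrative, not taken from any particular package):

```r
## Simulate feeding counts from a Holling type II functional response
set.seed(1)
prey  <- rep(c(2, 5, 10, 20, 40, 80), each = 5)
eaten <- rpois(length(prey), (1.2 * prey) / (1 + 1.2 * 0.05 * prey))

## Fit by nonlinear least squares; poor starting values can fail quietly
fit <- nls(eaten ~ (a * prey) / (1 + a * h * prey),
           start = list(a = 1, h = 0.1))

## Checks that a wrapper can hide from the user:
summary(fit)            # standard errors, not just point estimates
fit$convInfo            # did the optimizer actually converge, and how?
plot(prey, resid(fit))  # residual pattern, e.g. wrong model form
```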

  15. I have a question about something not directly related to your topic, but I thought with all the stats geeks tuning in I might get some help. I’ve been hunting for a reliable tool for calculating Moran’s I. I don’t feel comfortable trying to write code for this statistic myself, as it is rather involved. I’ve tried a couple of available tools but get disparate results, so I don’t know whether I can trust one over another.

    Does anyone have a suggestion? It’s the last piece I need for a particular project.

    Thanks!
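    • For what it’s worth, Moran’s I is straightforward to compute directly from its definition, which gives a baseline against which to check any tool. A minimal R sketch (the data and weight matrix below are made-up toys; note that many tools row-standardize the weights by default, a common reason two implementations disagree):

```r
## Toy data: values x at n locations, binary neighbour weights w
set.seed(1)
n <- 10
x <- rnorm(n)
w <- matrix(rbinom(n * n, 1, 0.3), n, n)
diag(w) <- 0

## Moran's I from its definition:
## I = (n / sum(w)) * sum_ij w_ij (x_i - xbar)(x_j - xbar) / sum_i (x_i - xbar)^2
z <- x - mean(x)
I <- (n / sum(w)) * sum(w * outer(z, z)) / sum(z^2)

## Compare against an established implementation such as ape::Moran.I(x, w);
## exact agreement may require matching how that tool normalizes the weights.
```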

  16. I’ve recently been thinking about this in terms of a paper that I just had published in Ecology (http://onlinelibrary.wiley.com/doi/10.1002/ecy.1802/abstract; I hope you don’t see the abstract and yell ‘statistical machismo!’ 😉 ). Often, when I would tell others about this project, I would get a response like ‘Oh, are you going to make an R package for that analysis?’ or ‘A lot more people will actually use the method you are suggesting if you make an R package’. I don’t necessarily disagree that people are probably more likely to use the method proposed in the paper if I were to build an R package to implement it, but I ultimately decided not to for various reasons that I think touch on some of your comments about specialized R packages.
    1) If anyone wants to use the method, all of the data that were analyzed in the paper, the code for running the models used in the paper, the code for simulating data, and the code for analyzing simulated data are all in the supplementary material. Therefore, adapting the code to one’s own data is hopefully straightforward and circumvents needing a package.
    2) Bayesian hierarchical models, especially those like the ones implemented in the paper, are complicated, both in terms of fitting the models and in terms of analyzing the output. I wouldn’t feel comfortable creating a package where people weren’t putting deep thought into the parameters estimated by the models, what they mean, etc. Not to mention possibly ignoring MCMC diagnostics to evaluate how well the model ended up being fit in the first place.
    3) One of the main advantages of the sort of models proposed is their flexibility. I feel like if I were to write a package, I would have to sacrifice some of that flexibility (the functional response package is a good example here; it only fits ‘standard’ functional response models to relatively simple, run-of-the-mill functional response experiments). Moreover, one of the exciting things about this sort of method is that hopefully people will end up extending it to new analyses. If I were to create a package and people weren’t defining models themselves, some of the creative extensions to the models might not happen.
    I don’t necessarily have a problem with specialized R packages (I in fact use one to implement a different method for analyzing the data), but I just wanted to give my perspective as someone who was considering writing a specialized R package and decided against it.

  17. On the other hand, a benefit to the open-source culture of R and its many packages is that it may encourage dialogue between package developers and novice users. I’ve been under the impression that it is fairly standard practice to contact package authors with questions, and in my (limited) experience, they typically are happy to help, since it means others are benefitting from their work. Further, this dialogue can benefit both parties: packages may be improved through the identification of bugs or by expanding their applicability, and novice users may learn a lot from discussing their analyses with more experienced users. Ultimately, it seems that such interactions might benefit the entire community and are part of what makes R so great. Of course, this requires that people actually care about getting their analyses right, but I believe this to be true of the vast majority of people. (*caveat: I have no experience with R packages that I don’t believe to be uniquely useful and widely applicable; but I wonder whether most would say that about the packages they use?)
