Note from Jeremy: This is a guest post from good friend and ace quantitative ecologist Ben Bolker. A while back I joked that the way to make any difficult statistical decision is to ask “What would Ben Bolker do?” The great thing about asking that question is that sometimes Ben Bolker himself will answer! 🙂 Today’s guest post is a case in point. Ben takes on a question I’ve been bugging him to post on: are there downsides to making powerful statistical software widely available, and if so, what can we do about it? Ben writes R packages, so this is something he’s thought a lot about. Thanks for taking the time to share your thoughts, Ben!
I liked Brian McGill’s post on statistical machismo, although I didn’t completely agree with it.* It inspired some thoughts about the perspectives of statistical developers, rather than users, and particularly about the pros and cons (from a developer’s point of view) of providing easy-to-use implementations of new methods.
Regular readers of this blog probably agree that any researchers who use a method should provide enough tools for a (sufficiently committed) reader to reproduce their results (paging rOpenSci…). But what about researchers who propose new methods? What is their responsibility to provide code that implements their method? Is there such a thing as too much software?**
The author(s) have a choice of providing:
- no code, just the equations (common in technical statistical papers), perhaps with general discussion about implementation issues
- a text file with code implementing the method, perhaps commented
- an R package (or one for Python, or Julia, or … but R is overwhelmingly the most likely case), possibly with useful examples, user’s guide (e.g. a vignette in an R package), etc.
- a graphical user interface/standalone program
Is easier always better?
Obviously, providing more complete code is more work for the developer, and they might benefit themselves more by spending their time working on better methods rather than on software development. There is a completely selfish calculation here: as a method developer, will I get more fame/fortune/glory by doing more technical work (e.g., I work in a statistics department where my colleagues will be most impressed by papers in the Journal of the American Statistical Association) or by having lots of people actually use my methods?
However, I want to ask whether friendlier software is necessarily better for science (whatever that means). At one extreme, methods that never actually get applied to data may help the developer’s career, and may inspire other statisticians to produce useful stuff, but by definition they aren’t doing any good for science. At the other extreme, though, friendly software can just make things too easy for users, enabling them (in the psychological sense) to fit models that they don’t understand, or that are silly, or that are too complex for their data; that is, to display statistical machismo. It’s a slippery slope; every step in convenience increases the number of people who might use your methods (which could be good), but it will also dilute the savvy of the average user. (A colleague who develops a powerful but underused (non-R-based) tool routinely laments that “R users are idiots”. He’s right, but if his tool had 500,000 users***, many of them would probably be idiots too.) Is there a virtue in making methods difficult enough that users will have to put some effort in to use them, or is this just a ridiculous, elitist point of view? Andrew Gelman attributes to Brad Efron the idea that “recommending that scientists use Bayes’ theorem is like giving the neighbourhood kids the key to your F-16”. (I’ve heard a similar comment attributed to Jim Clark, but referring specifically to WinBUGS, but I couldn’t track it down, so I may have made it up.) I wouldn’t dream of banning Bayes’ theorem, or WinBUGS, but I can certainly appreciate the sentiment.
A lot depends on the breadth of cases that a particular method can handle. It seems hard to go too horribly wrong with a generalized linear model or a BLAST search (although I’m sure readers can suggest some examples…). WinBUGS is on the other end of the spectrum — it can be used to construct a huge range of models, although it does also require at least a little bit of training to use. I worry about automatic software for phylogenetic or generalized linear model selection; while in principle they only alleviate the tedium of procedures we could do by hand in any case, and they can certainly be used sensibly, they are also easy to misuse.
You might say that the developers should just put in more safeguards to prevent users from doing silly things; I know from personal experience that this is really hard to get right, and has its own set of drawbacks. First, it’s hard to balance the sensitivity and specificity of such tests: you can accidentally prevent a more knowledgeable user from doing something unusual but sensible, or worry users unnecessarily with false-positive warnings. (John Myles White comments on the tradeoff between usability and correctness in statistical software.) Second, “make something foolproof and they’ll invent a better fool”. Third, I worry about risk compensation: the more safety you try to engineer into your software, the less your users will feel they have to think for themselves. (That said, good documentation, with worked examples and discussion of best practices, is invaluable, if incredibly time-consuming to write…)
For example, fitting a mixed model when there are fewer than 5 levels per random effect (e.g., you have data from only three sites but you want to treat site as a random effect) is usually a bad idea. It’s analogous to estimating a variance from a small number of data points; it will often result in a singular fit (i.e., an estimate of zero variance for the random effect), and even when it doesn’t the estimate will be very uncertain, and probably biased. At one point I made `lme4` warn users in this case, thinking that it would cut down on questions and help users use the software correctly, but there were enough complaints and questions (“how do you know that 4 levels are too few but 6 are enough? If you’re going to warn about this, why don’t you warn about (… other potential misuse …)?”) that we eventually removed the warning. John Myles White’s post referenced above goes even further, suggesting that the package might at this point issue a mini-lecture on the topic… Partly out of desperation, and partly because I think it’s better for people to learn statistics from humans than from software packages, I’ve mostly given up on trying to get `lme4` to encourage better practice. But I still worry about all the errors that I could be preventing, for example by warning users when fitting a GLMM with strong overdispersion (but how strong is strong?).
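To make the dilemma concrete, here is a minimal sketch of what such a guard might look like. It is written in Python purely for illustration and is not the code that was in `lme4`; the function name `check_grouping_levels` and the default threshold of 5 are assumptions that simply echo the rule of thumb above.

```python
import warnings

# Hypothetical threshold echoing the "fewer than 5 levels" rule of thumb;
# any fixed cutoff like this is arbitrary, which is exactly the problem.
MIN_LEVELS = 5


def check_grouping_levels(groups, min_levels=MIN_LEVELS, name="group"):
    """Warn if a grouping factor has few distinct levels; return the count."""
    n_levels = len(set(groups))
    if n_levels < min_levels:
        warnings.warn(
            f"grouping factor '{name}' has only {n_levels} levels "
            f"(< {min_levels}); its variance may be poorly estimated",
            UserWarning,
            stacklevel=2,
        )
    return n_levels


# Three sites, each sampled twice: this triggers the warning.
sites = ["A", "B", "C", "A", "B", "C"]
check_grouping_levels(sites, name="site")
```

The check itself is trivial to write; the hard part is the choice of `min_levels`. Exposing it as an argument lets a knowledgeable user relax it, but it does nothing to answer the question of why the default should be 5 rather than 4 or 6.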
In the end, individual incentives will determine what software actually gets written. However, there are incentives that are more altruistic than “how can I get a lot of citations and get a job/tenure/eternal glory?” There’s nothing like a broad user base for finding new, exciting applications, and having people use your methods to do interesting science may be the best reward.
Some questions for the crowd:
- what characteristics of software encourage you to do good science?
- should software try to teach you what to do, or should it provide tools and let you use them as you see fit?
- have you encountered tools that are particularly prone to abuse?
- how should we decide when to warn and when to remain silent?
- if it were up to you, how should developers allocate scarce resources between new or better functionality vs improving safety or user-friendliness?
*I don’t remember if anyone made the point in the voluminous comment threads, but my main critique was that much of Brian’s argument seemed predicated on the idea that most ecologists would be testing strong effects (e.g. p<0.001) where adjustments for spatial/phylogenetic correlation etc. etc. wouldn’t make much difference … although perhaps we should all be doing Strong Inference and testing strong effects, in my experience that’s not generally true. That said, I do agree with Brian’s (and Paul Murtaugh’s) general opinion that we shouldn’t let our statistics get too fancy and that old-fashioned methods often are just fine.
**There’s a cliché in computer science about things that are “considered harmful”, dating back to a seminal rant in computer science; this genre is so well established that there are now meta-rants about “considered harmful”.
***a wild guess, but in line with these (also wild) guesses from 2009