Note from Jeremy: This is a guest post from good friend and ace quantitative ecologist Ben Bolker. A while back I joked that the way to make any difficult statistical decision is to ask “What would Ben Bolker do?” The great thing about asking that question is that sometimes Ben Bolker himself will answer! 🙂 Today’s guest post is a case in point. Ben takes on a question I’ve been bugging him to post on: are there downsides to making powerful statistical software widely available, and if so, what can we do about it? Ben writes R packages, so this is something he’s thought a lot about. Thanks for taking the time to share your thoughts, Ben!
I liked Brian McGill’s post on statistical machismo, although I didn’t completely agree with it.* It inspired some thoughts about the perspectives of statistical developers, rather than users, and particularly about the pros and cons (from a developer’s point of view) of providing easy-to-use implementations of new methods.
Regular readers of this blog probably agree that any researchers who use a method should provide enough tools for a (sufficiently committed) reader to reproduce their results (paging rOpenSci…). But what about researchers who propose new methods? What is their responsibility to provide code that implements their method? Is there such a thing as too much software?**
The author(s) have a choice of providing:
- no code, just the equations (common in technical statistical papers), perhaps with general discussion about implementation issues
- a text file with code implementing the method, perhaps commented
- an R package (or one for Python, or Julia, or … but R is overwhelmingly the most likely case), possibly with useful examples, user’s guide (e.g. a vignette in an R package), etc.
- a graphical user interface/standalone program
Is easier always better?
Obviously, providing more complete code is more work for the developer, and they might benefit themselves more by spending their time working on better methods rather than on software development. There is a completely selfish calculation here: as a method developer, will I get more fame/fortune/glory by doing more technical work (e.g., I work in a statistics department where my colleagues will be most impressed by papers in Journal of the American Statistical Association) or by having lots of people actually use my methods?
However, I want to ask whether friendlier software is necessarily better for science (whatever that means). At one extreme, methods that never actually get applied to data may help the developer’s career, and may inspire other statisticians to produce useful stuff, but by definition they aren’t doing any good for science. At the other extreme, though, friendly software can just make things too easy for users, enabling them (in the psychological sense) to fit models that they don’t understand, or that are silly, or that are too complex for their data — that is, to display statistical machismo. It’s a slippery slope; every step in convenience increases the number of people who might use your methods (which could be good), but it will also dilute the savvy of the average user. (A colleague who develops a powerful but underused (non-R-based) tool routinely laments that “R users are idiots”. He’s right — but if his tool had 500,000 users***, many of them would probably be idiots too.) Is there a virtue in making methods difficult enough that users have to put some effort into using them, or is this just a ridiculous, elitist point of view? Andrew Gelman attributes to Brad Efron the idea that “recommending that scientists use Bayes’ theorem is like giving the neighbourhood kids the key to your F-16”. (I’ve heard a similar comment attributed to Jim Clark, but referring specifically to WinBUGS — I couldn’t track it down, so I may have made it up.) I wouldn’t dream of banning Bayes’ theorem, or WinBUGS, but I can certainly appreciate the sentiment.
A lot depends on the breadth of cases that a particular method can handle. It seems hard to go too horribly wrong with a generalized linear model or a BLAST search (although I’m sure readers can suggest some examples…). WinBUGS is on the other end of the spectrum — it can be used to construct a huge range of models, although it does also require at least a little bit of training to use. I worry about automatic software for phylogenetic or generalized linear model selection; while in principle they only alleviate the tedium of procedures we could do by hand in any case, and they can certainly be used sensibly, they are also easy to misuse.
You might say that the developers should just put in more safeguards to prevent users from doing silly things; I know from personal experience that this is really hard to get right, and has its own set of drawbacks. First, it’s hard to balance the sensitivity and specificity of such tests — you can accidentally prevent a more knowledgeable user from doing something unusual but sensible, or worry users unnecessarily with false-positive warnings.
(John Myles White comments on the tradeoff between usability and correctness in statistical software.) Second, “make something foolproof and they’ll invent a better fool”. Third, I worry about risk compensation — the more safety you try to engineer into your software, the less your users will feel they have to think for themselves. (That said, good documentation, with worked examples and discussion of best practices, is invaluable, if incredibly time-consuming to write…)
For example, fitting a mixed model when there are fewer than 5 levels per random effect (e.g., you have data from only three sites but you want to treat site as a random effect) is usually a bad idea. It’s analogous to estimating a variance from a small number of data points; it will often result in a singular fit (i.e., an estimate of zero variance for the random effect), and even when it doesn’t the estimate will be very uncertain, and probably biased. At one point I made `lme4` warn users in this case, thinking that it would cut down on questions and help users use the software correctly, but there were enough complaints and questions (“how do you know that 4 levels are too few but 6 are enough? If you’re going to warn about this, why don’t you warn about (… other potential misuse …)?”) that we eventually removed the warning. John Myles White’s post referenced above goes even farther, suggesting that the package might at this point issue a mini-lecture on the topic… Partly out of desperation, and partly because I think it’s better for people to learn statistics from humans than from software packages, I’ve mostly given up on trying to get `lme4` to encourage better practice. But I still worry about all the errors that I could be preventing, for example by warning users when fitting a GLMM with strong overdispersion (but how strong is strong?).
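The few-levels problem is easy to demonstrate for yourself. Here is a minimal R sketch with simulated data (all variable names are invented for illustration): with only three sites, the between-site variance frequently collapses to exactly zero.

```r
library(lme4)

set.seed(101)
# Simulate data from only 3 sites -- too few levels to estimate
# a between-site variance reliably
dd <- data.frame(
  site = factor(rep(1:3, each = 20)),
  x    = runif(60)
)
dd$y <- 2 + 1.5 * dd$x + rnorm(3, sd = 0.2)[dd$site] + rnorm(60, sd = 1)

fit <- lmer(y ~ x + (1 | site), data = dd)
VarCorr(fit)      # the site standard deviation is often estimated as 0
isSingular(fit)   # recent lme4 versions flag this as a singular fit
```

Re-running with different seeds shows how unstable the variance estimate is; that instability, rather than any hard cutoff at 5 levels, is the real problem.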
In the end, individual incentives will determine what software actually gets written. However, there are incentives that are more altruistic than “how can I get a lot of citations and get a job/tenure/eternal glory?” There’s nothing like a broad user base for finding new, exciting applications, and having people use your methods to do interesting science may be the best reward.
Some questions for the crowd:
- what characteristics of software encourage you to do good science?
- should software try to teach you what to do, or should it provide tools and let you use them as you see fit?
- have you encountered tools that are particularly prone to abuse?
- how should we decide when to warn and when to remain silent?
- if it were up to you, how should developers allocate scarce resources between new or better functionality vs improving safety or user-friendliness?
*I don’t remember if anyone made the point in the voluminous comment threads, but my main critique was that much of Brian’s argument seemed predicated on the idea that most ecologists would be testing strong effects (e.g. p<0.001) where adjustments for spatial/phylogenetic correlation etc. etc. wouldn’t make much difference … although perhaps we should all be doing Strong Inference and testing strong effects, in my experience that’s not generally true. That said, I do agree with Brian’s (and Paul Murtaugh’s) general opinion that we shouldn’t let our statistics get too fancy and that old-fashioned methods often are just fine.
**There’s a cliché in computer science about things that are “considered harmful”, dating back to Dijkstra’s seminal 1968 rant “Go To Statement Considered Harmful” — the genre is so well established that there are now meta-rants about “considered harmful”.
***a wild guess, but in line with these (also wild) guesses from 2009
Nice post Ben, thanks for this. In particular, I’d never really thought about optimal design of warning messages in stats software, hadn’t realized how difficult that is to get right.
As I’ve mentioned to you (and will now mention to others), your thoughts resonate with Jeff Leek’s over at Simply Statistics: http://simplystatistics.org/2014/02/14/on-the-scalability-of-statistical-procedures-why-the-p-value-bashers-just-dont-get-it/
Great post Ben.
My favorite analogy for the danger of uninformed users using fancy stats software was when Art Winfree said that point-and-click stats packages were like “letting monkeys play with razor blades”.
I don’t know that I have the answers to your questions about software. It was interesting to hear the history of warnings in lme4, which I didn’t know but could totally imagine the entire progression of. Was there ever any discussion of a level-of-sophistication switch (could be an argument to the function or a global variable) that a user could set to receive High Advice, Low Advice, No Advice?
I do think the phrasing of your question, “I want to ask whether friendlier software is necessarily better for science?”, is exactly the right question to ask. The problem of course is that it requires somebody who knows both enough science and enough stats to appropriately weigh the balance.
I usually think about this issue not so much from a software point of view as from an education point of view. I think one of the core problems is that we teach stats as if there are cookbook recipes and not as if it is a complex subject that requires mastery and judgement. I teach a 2nd semester stats course. And while I teach R and all sorts of statistical procedures (some of which certainly cross my own statistical machismo boundary), more than anything else I try to teach stats as a topic that requires judgement. On assuming normality vs non-parametric and how non-normal the data has to be before you start to worry, on dealing with or ignoring non-independence in the data, on Type I vs II vs III vs MLE ANOVA, MLE vs REML vs penalized likelihood/regularization, type I vs type II regression and on and on and on. On every one of these questions there is no “best” answer – the closest you’ll get is an “it depends”, and even there it often depends on subtle interactions between the structure of the data and residuals and the model and the goals. On the whole it succeeds – students come in disappointed I’m not giving them recipes, but by the end they get that there is always an argument on either side, and that there are different goals of statistics and you have to know what you really want to prioritize. They may not always feel confident exercising the judgement in complex cases, but they no longer feel stupid asking for advice. Just unlearning the recipe mentality is a major step forward.
Re: rules vs. judgement, in an old thread I recall making the analogy to teaching children good behavior. You start by teaching them black-and-white rules: “Don’t lie!” “Don’t hit your brother!” Then as they get older they learn that the world’s not black and white, that there are exceptions to every rule, and that good behavior requires good judgement (e.g., about when you *should* lie–or even when you should hit your brother!)
The analogy isn’t perfect, in that even beginning biostats students are much better able than small children to make judgement calls. So I approach the teaching of the subject much as you do.
“Was there ever any discussion of a level of sophistication switch (could be an argument to the function or a global variable) that a user could set to receive High Advice, Low Advice, No Advice?”
That seems like a great idea! A *lot* of work to implement, though (I assume).
And see this old linkfest post for discussion of the need for a “deterministic statistical machine”: statistical software that gives the user essentially no options. You input your data, input the question you want to address, and the software performs a completely pre-specified analysis along with associated assumption checks. A radical idea, but makes the not-so-radical point that maybe we need different software (or different “advice level” options, as Brian suggests) for users with differing levels of expertise. If giving a teenager the keys to your F-16 is a bad idea, well, one response is to make sure you can give the teenager the keys to a 2004 Honda Civic instead. Or even give the teenager a bus ticket or money for a taxi fare. 🙂
Of course, where this analogy to giving teenagers your keys breaks down is that, once powerful, flexible software is available, you can’t prevent people from using it. Users choose software in large part based on what other people use. And since everybody uses R these days, it’s hard to see how you could get less statistically-savvy users to go use some other software instead.
[composed before I read other replies below]
Clever as they are, I get a little bit tired of the analogies to (monkeys|teenagers|fools) using (razor blades|F-16s|power tools) (although Neal Stephenson has a brilliantly funny take on this — search for “Hole Hawg”). As I said in a previously published comment: “These warnings are like Homeland Security threat advisories: they warn of danger, but provide little guidance.”
We have thought about level-of-sophistication switches, at least in passing. There are a few hard parts here: (1) simply figuring out exactly which warnings and advice go in at each level, and implementing the damn thing (and the time-and-energy tradeoff with other bug-fixing/polishing/feature implementation). The ‘high advice’ setting in particular would take a lot of work … (2) A certain amount of the time we simply **don’t know what advice to give**: this is perhaps specific to GLMMs (where a lot isn’t known), or perhaps we are just a bit more painfully aware than other statistical software developers how much we don’t know?
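For concreteness, Brian’s proposed switch might look something like the sketch below. To be clear, this is purely hypothetical — neither the `lme4.adviceLevel` option nor a `checkAdvice()` function exists in lme4; it just illustrates the mechanics (a global option plus tagged advice messages):

```r
# Hypothetical sketch only: 'lme4.adviceLevel' and checkAdvice() are
# invented names, not part of lme4.
options(lme4.adviceLevel = "high")   # "high", "low", or "none"

checkAdvice <- function(msg, level = c("low", "high")) {
  level   <- match.arg(level)          # "low" = important, "high" = verbose
  setting <- getOption("lme4.adviceLevel", "low")
  show <- switch(setting,
                 none = FALSE,
                 low  = (level == "low"),  # only the most important advice
                 high = TRUE)              # everything, mini-lectures included
  if (show) warning(msg, call. = FALSE)
}

# e.g., inside a fitting function:
# if (nlevels(grp) < 5)
#   checkAdvice("random effect has < 5 levels; its variance may be poorly estimated",
#               level = "low")
```

The implementation is the easy part; as noted above, deciding which messages belong at which level (and what they should say) is where the real work would be.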
(Of course) I agree with you about the need for judgement. I think about teaching a lot, but a lot more people use my software than I will ever teach, unless I go down the MOOC road … The hard part about judgement is how to give people reasonable rules to follow. My favourite example here is assessing normality. Me: “Don’t mindlessly apply a statistical test for normality, you should look at the Q-Q plot instead and judge for yourself whether it is far enough from normality to worry about.” Student: “Can you tell me how I should decide when looking at a particular Q-Q plot whether the deviation from normality is large enough that I should worry about it (i.e., that it may materially affect my *biological* conclusions)?” Me: “Uhhh … I’m not sure. The one you’re showing me now looks OK — there are a few deviations, but they’re not systematic and this is a fairly small data set so I would expect some noise.” Student: “But how could I draw that conclusion for myself?” Me: “Uhhh …” (Then I mutter something about how they should gradually calibrate their sense of how big a deviation is a problem as they work with more experienced people.) I have at least gotten to the point where I’m *honest* with students about this unterminated regress …
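One semi-honest answer I can give students is to calibrate by simulation: plot the observed Q-Q plot alongside Q-Q plots of genuinely normal samples of the same size, and ask whether the real one stands out. A minimal R sketch (the `resid_obs` vector is a stand-in for your actual model residuals):

```r
set.seed(42)
resid_obs <- rnorm(30)   # stand-in for your model residuals

op <- par(mfrow = c(3, 3))
qqnorm(resid_obs, main = "observed")
qqline(resid_obs)
# Eight reference panels: truly normal samples of the same size.
# If "observed" doesn't stand out from these, the deviations are
# probably within sampling noise.
for (i in 1:8) {
  z <- rnorm(length(resid_obs))
  qqnorm(z, main = "simulated normal")
  qqline(z)
}
par(op)
```

This doesn’t terminate the regress — you still have to judge whether the observed panel “stands out” — but it at least replaces a vague eyeball test with a comparison against known-normal noise.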
You are right of course that snide analogies about the dangers are not constructive.
While not denying the complexity, I personally think advice-giving stats software would be a good thing. Code-checking software (e.g. lint) is commonplace, and since it’s advisory, people are free to ignore it or even turn it off; but most good programmers I know see it as a useful tool in their toolkit.
And – yeah – the problem with judgement is it is complex and hard (impossible) to teach in one semester. Honestly, I am happy when I get students to the point where they know what the questions are and go to consult with somebody. It is perhaps ironic that the main accomplishment of my 2nd semester stats class is getting people to unlearn their recipes and know that it is complicated enough that they need an expert (they also learn the types of analyses possible and the questions to ask, and hopefully when they’re in the safe zone and when they need to ask, so I’m being a bit facetious). But that may be the optimal path. We don’t buy a house without experts in real estate transactions, home inspectors etc. We don’t even buy a used car without a mechanic checking it out. Yet we spend sometimes millions of dollars on data collection without bringing in an expert statistician. This to me may be the hidden cost of “easy to use software” – it’s not the possible mistakes – it’s the illusion that expert people don’t need to be involved. I expect I’m preaching to the choir given some of your comments about people and not software teaching stats.
And I suppose that is what my original statistical machismo post was about in a nutshell – specifically reviewers who haven’t learned enough to unlearn their beliefs that there are recipes and absolute guidelines even in complex situations.
– What characteristics of software encourage you to do good science? Good documentation, and a strong user community helps. This documentation is preferably linked to a publication that outlines the methods in detail, along with the usual help files, etc., that explain what the different commands do.
– Should software try to teach you what to do, or should it provide tools and let you use them as you see fit? In my opinion, as long as the documentation is good (as described above), I’m fine with it just letting me use it. It’s even better if there is a published paper involving the programmer that gives some use cases that I can work through, to make sure I understand how things click.
– Have you encountered tools that are particularly prone to abuse? Pipettes. DNA sequencers. Digital calipers. Data notebooks. CT scanners. And statistical software of all sorts. This is a somewhat roundabout way of saying I think that although statistical software is very visible in science, it IS but a tool like any other. And just like any tool, when humans use it things can go awry. In my domain (paleontology), phylogenetic software often is used without thought–e.g., just using the default settings or copying the settings used by another study without thought. That said, the software is widely used enough that most reviewers are savvy to any pitfalls. Now that phylogenetic “corrections” are widely available, you see a fair bit of misuse of software that implements PICs, etc. But again, this is often (not always) caught by reviewers. Probably the best way to short-circuit this is to enhance education – make sure that people are aware of limitations and assumptions built into methods. As journal editors, we can make sure that papers make their way to folks who have deep experience and understanding of methods.
– how should we decide when to warn and when to remain silent? “Big stuff” is nice to warn about – and probably easy to incorporate (e.g., when TNT doesn’t save all of the possible trees, or runs out of memory, or whatever).
– if it were up to you, how should developers allocate scarce resources between new or better functionality vs improving safety or user-friendliness? Documentation, documentation, documentation. There is nothing that annoys me more as a software user when I can’t dig in and learn about the method, or learn what options are even available. This is an area where I think R packages generally blow other software out of the water! If I grab something from the R repository, I know that it will probably come with a ridiculously detailed (to me, at least), literature referenced PDF explaining what everything does, with examples of how to implement it. I’ll take this any day over a shiny standalone GUI program that is accompanied by a worthless help file.
A great post that nicely expands on Brian’s previous discussions.
I think an interesting example of a tool that was created with good intentions but came under scrutiny due to its black-box construction is MaxEnt. It’s been discussed here before, and Jeremy has previously linked to a blog post that outlined the debate.
In this case, MaxEnt allows users to easily model and map species distributions with presence-only data despite the importance that various assumptions can have on the resulting inferences. Ultimately it was determined that MaxEnt is mathematically equivalent to a point process GLM with a lasso penalty. Knowing this allows for a transparent assessment of the advantages and disadvantages of MaxEnt, which previously had not been fully described or understood. It seems this was partly because its authors did not fully appreciate what it was doing (yikes?). Fortunately, the consensus appears to be that MaxEnt is useful when properly implemented.
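That equivalence can even be sketched in a few lines of R, following the down-weighted Poisson regression (“Berman–Turner”) formulation described by Renner & Warton (2013). This is a rough illustration only — `pres_covariates`, `bg_covariates`, and `region_area` are hypothetical objects standing in for real presence points, background (quadrature) points, and the study-region area:

```r
library(glmnet)

# Hypothetical inputs: covariate matrices for presence locations and for a
# large sample of background (quadrature) points, plus the region's area.
X <- rbind(pres_covariates, bg_covariates)
y <- c(rep(1, nrow(pres_covariates)),             # 1 = presence point
       rep(0, nrow(bg_covariates)))               # 0 = background point

# Berman-Turner quadrature weights: tiny weight for presence points,
# area-share weights for background points.
w <- ifelse(y == 1, 1e-6, region_area / nrow(bg_covariates))

# Down-weighted Poisson regression with a lasso penalty -- the point-process
# GLM that MaxEnt turns out to be (approximately) fitting.
fit <- glmnet(X, y / w, family = "poisson", weights = w)
```

Seen this way, MaxEnt’s “features” and “regularization” are just basis expansions and a lasso penalty, which makes its assumptions much easier to scrutinize.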
Anyway, in this case the documentation was adequate for use by anyone with some data points, but not thorough enough to allow proper scrutiny by statisticians. I suppose the analogy here would be giving the neighborhood kids the key to your Boeing 787 Dreamliner without knowing whether it can fly safely even with an expert pilot at the helm.
The MaxEnt situation is more complicated than that. A lot of users simply throw their data into it, and use the defaults without thinking about their data. I keep on coming across people who have presence/absence data (sometimes even with replicated samples) and use MaxEnt. The problem is that it is easy to use, and produces pretty maps.
I’m presently implementing SDMs as point processes using INLA, including a spatial random field. Anyone who’s tried the spatial stuff in INLA will appreciate how likely that is to get general use if I don’t write some wrappers for the analysis.
“what characteristics of software encourage you to do good science?”
Clarity of method and ease of use, in that order. No other criteria are even close really.
As for R, I consider it to be similar to Wikipedia: “anybody can contribute”…yeah, and it shows. Even the functions in base R can be a total pain in the *** to use, e.g. the merge function. Given what I do, if I could start over I’d spend much more time finding out about Python and other languages before jumping in. But I’m +/- committed at this point and have literally worn the cover off Ben’s book.
I’m totally in the KISS camp as far as methods go. Always sacrifice power for simplicity and clarity. Complex methods only confuse people in most cases.
The tricky question is whether GLMMs (for example) count as “simple” or “complex” … I like to think that they hit a good spot on the power-for-complexity tradeoff curve. Taken to the extreme, does your advice lead to sticking to t-tests, ANOVA, Mann-Whitney, and Kruskal-Wallis tests and forgoing everything else?
I exaggerated my point for effect– there are definitely methods I will use that are fairly complex if need be. As a general principle, when in doubt, I go with the simplest and/or most robust method. But I recognize that the judgement on exactly what methods constitute simplicity and/or robustness, will vary as a function of several factors, including personal experience, available software, etc.
Ben, excellent post on a complex issue; both your thoughts and many of the comments above resonate with my own thinking.
Google recently announced that it will be making its own self-driving cars, rather than modifying those of others. Unlike the earlier versions, these ones won’t have steering wheels and pedals. Just a button that says stop and a button that says go. What does this tell us about user-friendly software?
I think it is particularly interesting and instructive that the quote Gelman attributes to Efron is about a mathematical theorem rather than about software. That makes it a rather different statement than Clark’s comment about winbugs in my mind. Even relatively simple statistical concepts like p values can cause plenty of confusion, statistical package or no. Consequently, it is unclear that the spectrum between a mathematical appendix and an R package is the right question to ask.
I am very wary of the suggestion that we should address concerns of appropriate application by raising barriers to access. Those arguments have been made about knowledge of all forms, from access to publications, to raw data, to things as basic as education and democratic voting.
There are many good reasons for not creating a statistical software implementation of a new method, but I argue here that fear of misuse just is not one of them.
1) The barriers created by not having a convenient software implementation are not an appropriate filter to keep out people who can misinterpret or misuse the software. As you know, a fundamentally different skillset is required to program a published algorithm (say, MCMC) than to correctly interpret the statistical consequences.
We must be wary of a different kind of statistical machismo, in which we use the ability to implement a method by oneself as a proxy for interpreting it correctly.
1a) One immediate corollary of (1) is that, like it or not, someone is going to build a version of the method that is “easy to use”, i.e. remove the programming barriers.
1b) The second corollary is that individuals with excellent understanding of the proper interpretation / statistics will frequently make mistakes in the computational implementation.
Both mistakes will happen. And both are much more formidable problems in the complex methodology of today than when computer was a job description.
So, what do we do? I think we should abandon the false dichotomy of John Myles White between what he calls “usability” and “correctness.”
A software implementation should aim first to remove the programming barriers rather than statistical knowledge barriers. Best practices such as modularity and documentation should make it easy for users and developers to understand and build upon it. I agree with Ben that software error messages are poor teachers. I agree that a tool cannot be foolproof, no tool ever has been.
Someone does not misuse a piece of software merely because they do not understand it. Misuse comes from mistakenly thinking you understand it. The premise that most researchers will use something they do not understand just because it is easy to use is distasteful.
Kevin Slavin gives a fantastic TED talk on the ubiquitous role of algorithms in today’s world. His conclusion is neither panacea nor doom, but rather that we should seek to understand and characterize them, learning their strengths and weaknesses the way a naturalist studies a new species.
More widespread adoption of software such as BUGS & relatives has indeed increased the amount of misuse and false conclusions. But it has also dramatically increased awareness of issues ranging from computational aspects peculiar to particular implementations to general understanding and discourse about Bayesian methods. Like Kevin, I don’t think we can escape the algorithms, but I do think we can learn to understand and live with them.
Ben, you mentioned looking for the Clark quote. All I can recall is that a perspective in ESA from 2009 (by Ben somebody) claims that Clark 2007 (the Models for Ecological Data textbook) says that ‘‘turning non-statisticians loose on BUGs is like giving guns to children.’’
I wouldn’t argue for intentionally raising barriers to access; rather, I would say that concerns about misuse have sometimes sapped my energy/curbed my enthusiasm for investing lots of effort in making things easier. For example, consider the formula interface in R. If someone can’t figure out how to write ‘response ~ predictor1 + (1|group_var)’, it’s hard for me to get excited about creating a GUI so that they can select fixed and random effects from a menu — it’s not that I think they’re unworthy of using the software, it’s just that if they can’t overcome that barrier I’m not sure I’m comfortable that they will be able to overcome the other technical hurdles required to use GLMMs safely. (More unworthily, I’m not sure I have the time and patience to handle support e-mails from that group of users …)
I *do* think that, all other things equal, someone who’s capable of implementing a method for themselves is *more likely* to interpret the results carefully — at least I know they’ve invested more …
As for the “if you don’t do it someone else will” argument — I do feel that pressure acutely, and I think it’s one of the biggest dangers of adopting a purist stance. I’m occasionally a little horrified by some of the “easy-bake” R packages that put simple wrappers around complex tools, but I have to admit they’re satisfying a demand …
Most researchers do not properly understand the theoretical & technical background of the statistical menu in Excel. Surely there are people with a very strong understanding, but they are the exception. Using this starting point and Dijkstra’s statement we should not let people touch statistical software; however, we still do. Perhaps we hope that people will learn to run their own analyses—and some do very well—and we will lift quantitative understanding for a whole discipline. I think that today we now have a more vocal group of people that have come a long way on understanding, but the majority still doesn’t have a clue of what’s going on behind the code or menus.
As pointed out by another commenter software is just a tool, and as many other tools, is potentially subject to abuse. As an example, this week I am coming to terms with wasting over $10,000 because a postdoc did not properly read the instructions of a couple of tools used in the field. The ‘assessments’ are not much more than random noise. He was supposed to have plenty of experience in the topic and the friggin’ manual was available; still, mistakes. Most statistical software comes with copious amounts of documentation and very often we see the wrong techniques/models/approaches used in publications.
Perhaps most researchers should accept that we are statistical amateurs and should build collaborations with professionals on statistical analyses. Every university/research institute worth its salt has a statistical consulting unit. Why not work with them? There is no substitute for experience, and every time we set guidelines the majority of practitioners take them as rigid rules — from P < 0.05, to how you read a Q-Q plot, to when something is random or fixed (the term represents a sampling or randomization process, I’d say, but 3 levels is probably too few).
These thoughts are quite reasonable, but implementing them will require big cultural changes. As I recall there was some recent discussion on the blog about squeezing budgets, and what things researchers feel they might be able to give up as their budgets shrink. I don’t know what it will take to build a consensus among PIs and granting agencies that statistical consultancy is something worth shelling out for — it certainly won’t happen as long as people can get away with publishing crappy statistical analyses … (And for what it’s worth, my university doesn’t have a statistical consulting unit, at least not that I’m aware of [and I’m in the math & stats department] …)
“Every university/research institute worth its salt has a statistical consulting unit,”
Actually, the vast majority don’t, sorry.
Your university doesn’t have statistical consultants? That sucks.
Well, there’s a mathematics and statistics department, which of course most universities have. And perhaps some of the profs there do some freelance statistical consulting. But there’s no consulting unit or other program to match up scientists to statisticians, and it’s not the job of anyone in the math & stats dept. to collaborate with or consult for anyone else at the university.
As I say, I don’t bemoan this because I think it’s the usual state of affairs. It’s hard to feel too bad about the lack of something most people don’t have.
More broadly, I don’t know that there are enough statisticians in the world for every scientist who might want or need to consult with one to be able to do so…
I’m guessing that stat consulting units are more common at universities that have a significant agricultural research presence (possibly also biomedical, although those are probably more often squirreled away within the hospital/health center and not available to the general/academic research public …) I’ve also heard my share of horror stories about “but the statistical consultant told me to do it this way”. (In defense of the consultants, I don’t know whether I’m hearing a misinterpretation or an outright lie in these cases; misinterpretation sometimes seems the most likely possibility, as I can’t imagine a trained person actually giving the advice I’m hearing.)
Like other commenters I’d say documentation is key to good use, and the software help should definitely try to teach rather than just help you push buttons. That’s pretty much the strategy in Matlab. It’s not statistical software per se, but one can do most statistics with it (and some statisticians do use it; I suspect that’s because the optimization functions are pretty good).
In R, by comparison, some vignettes/docs are excellent while others are really sparse: huge inequality. In Matlab I find that almost all of them are really good, at least all those I have had to use; the undocumented stuff does not make it through. Which brings me to another point that helps good science: a clear organization, with a main distribution, toolboxes, … If you’re a skilled user you can offer your own code on your webpage or in repositories, but that does not automatically turn into a package one can upload. It can be done for open projects too; see for instance the many packages of the Linux distributions. But that requires lots of programmers (see below).
have you encountered tools that are particularly prone to abuse?
-> This might be too confounded with the statistics skills of the users in that field. I feel there is a bit less abuse in e.g. Matlab than in R and other stats software, but that might just be because Matlab is used mostly by engineers and a few areas of physics, and they get real quantitative training.
how should we decide when to warn and when to remain silent?
-> I’d say that depends on who the method is for.
if it were up to you, how should developers allocate scarce resources between new or better functionality vs improving safety or user-friendliness?
-> A simple solution could be a two-step process, with implementation in easy-to-use stats packages not done by the original method developer. In some areas of R this is the case, but not all I think.
Bottom line: having a team of dedicated programmers/engineers who check, re-check and organize the code/software for users would probably do a lot for good science.
R (and further RStudio, and Github for slightly more techie ecologists) have significantly lowered the bar for creating easy-to-install packages, which is a good thing — however, no one can force you to write good documentation. (If you thought the CRAN submission process was annoying now, imagine if there were editors on staff reviewing your package’s documentation for clarity and completeness …) Like the “more users means more idiots” dynamic I alluded to in my post, easier packaging also means more crappy packages …
The question about “having a team of dedicated programmers / engineers” is: who’s going to create the incentives for that? Someone needs to step up and pay for it, one way or another …
Well, who pays is a good point, but universities already pay crazy sums for software like Word and Excel, even though the quality is… debatable. They also pay for software like Matlab or Mathematica. Now some universities have funds to invest in open-access science of various kinds.
So I am wondering whether the money “saved by using R” couldn’t be calculated to create some incentive for donations or for the allocation of programmer time.
There’s a list on the R site of Donor/Supporting Institutions ( http://www.r-project.org/foundation/memberlist.html ) and it looks rather thin compared to how many institutions are actually using R. I’m a little unsure how R contributions work right now, but it seems that some of the heavier R contributors have permanent positions at research/teaching institutions. Looking at the list, they are surprisingly few! Doesn’t it seem reasonable that any institution with ~1000 R users, including students, could allocate at least one programmer or engineer to work on R development nearly full-time?
I’m not sure there’s any solution to these general questions. Even though I bitch about R sometimes, the fact remains that it’s highly powerful and freely available for all major operating systems, and it’s my responsibility to deal with the frustration and to know what the hell I’m doing. There has to be decent documentation in the help files, but in the end a significant degree of self-education is necessary for most of us as to how to do good analyses, and that probably means you’re going to have to buy a book or two, use the R-help forum, and get after it. Joseph Adler’s book (R in a Nutshell) and Ben’s books have been absolutely indispensable for me.
I use one of those easy GUI stats tool (Maxent) and I’m pretty happy with it except I don’t find the Maxent documentation especially helpful once you get beyond the ‘here’s how you make it work’ stage. There are over 30 options to pick between.
I think because Maxent was so new and the methods weren’t clear, too much time was spent on critiquing the tool and too little on how to use it. I know our results would be better if we had presence-absence data or random sampling, but we don’t, and we still need results. I would really appreciate it if someone critiqued AUC and also offered alternative metrics. Or worked on the model selection process. Or determined under what conditions all feature types were used. Unfortunately I feel like, now that Maxent is so easy to run, the statistically inclined people have left it behind to do other things, so the fine-tuning of the settings will be left to less knowledgeable people.
Conservation efforts are going to be made anyway, based on science or politics or money. Maxent lets us put at least a little science into the process.
Unless I’m mistaken, everything you’re calling for to evaluate and use MaxEnt exists (AUC critiques and alternatives; model selection process, etc.). As just some examples:
Model selection and complexity / parsimony:
Criticisms of AUC:
Alternative measures of model performance (calibration as opposed to discrimination):
And there is much, much, much more out there… and those are all valuable pieces of work on actually using MaxEnt (or similar presence-only SDMs) from awfully knowledgeable people.
Oh yes, I mentioned all of those topics because I’ve read some really good papers on them. But there are only a limited number of papers per topic. The AUC issue, for example, is still being hashed out – many papers that use Maxent to solve a problem (rather than to evaluate Maxent itself) say ‘we used this threshold/these thresholds (citation), although there has been some criticism (citation)’. Which is probably as good as one can do these days. Model comparison using AIC seems like a great tool, but it’s not integrated into Maxent at the moment. Running all variables for all species is pretty common.
But I guess this gets back to using powerful stats programs – running the defaults on Maxent is probably fine because the creators were pretty thoughtful about the process. But you can get yourself into trouble pretty quickly.
I guess that circles back to the original post: should GUIs (or R packages, etc.) for advanced methods be developed? What should their documentation include? How often should/must they be updated?
MaxEnt (for correlative SDM work) is a good example, given that its rapid ascent to popularity over other methods was likely attributable to its very user-friendly GUI. But I think it’s something of an unfair expectation that the original GUI and its documentation should have everything you would ever need for implementation without referring to the current (and developing) literature on the method, or potentially needing to eventually go deeper/more customized on your own.
To answer the title, I’d say yes, it can be.
The real harm is that user-friendly tools can drive statistical analysis towards whatever method is in vogue at the moment. Given that all people are subject to the sociological vagaries of wanting to be successful, make money, get a good job, etc., good science can easily be sacrificed on the altar of ease and rapid publication. Easy software can pigeonhole our way of thinking, leading us to always see an analysis as a place to use method X because that’s what we have software for. It’s already been touched on quite a bit, but the MaxEnt discussions are a great example in the ecological world. But I’d also say there are other examples. One would be Structure, which has struggled with reproducibility (doi: 10.1111/j.1365-294X.2012.05754.x). Another recent instance of Structure being possibly misused is in the book “A Troublesome Inheritance” (outlined here: http://www.molecularecologist.com/2014/05/troublesome-inheritance/).
In all of these examples, people are able to access complex algorithms (MaxEnt, clustering) without much understanding of them. When you have software that makes it easy to cluster, everything becomes something you can cluster (to borrow the hammer/nail analogy). I think that’s the real danger of statistical software. I’d also say that this sort of misuse is impossible to stop entirely. The best place to stop it would be peer review, but then I’m not sure Ben has enough time to do all those reviews :). In the end there are far more monkeys than zookeepers (to not mix metaphors), and peer review or post-publication peer review are really the best tools to prevent this kind of “software chooses how we think about a problem” problem.
To finish out answering the questions though.
– what characteristics of software encourage you to do good science?
I’ll echo a common sentiment that good documentation probably goes the farthest in terms of encouraging good science. WinBUGS is a good (negative) example of this: it’s full of esoteric errors that are almost impossible to interpret, with not much help in the docs. The software does provide lots of examples, though, which one could (perhaps improperly) try to adapt for one’s own analyses.
– should software try to teach you what to do, or should it provide tools and let you use them as you see fit?
Software shouldn’t teach.
– have you encountered tools that are particularly prone to abuse?
I don’t mean to be a WinBUGS hater, but it seems prone to abuse, or at least to making you want to claw your eyes out. Also MaxEnt and Structure.
– how should we decide when to warn and when to remain silent?
I’d be prone to warn liberally. It doesn’t really harm anyone, and if you help even a few people I think it’s worth it. If users want to ignore at their own peril they can.
– if it were up to you, how should developers allocate scarce resources between new or better functionality vs improving safety or user-friendliness?
Ted, I agree with almost[^1] every point you make, but I still reach a different conclusion. Even though easy-to-use software means more bad papers, I don’t think the software is harmful. I don’t think progress in science is measured by the number of bad papers, or even by the ratio of bad papers to good papers. I think progress is driven by good papers, and software enables that by making methods widely accessible to empiricists. Without that happening, we go nowhere.
Meanwhile, weak or widely misused methods eventually attract attention and debate. I think that is most productive when it spurs people to expose flaws and build upon existing methods, as we’ve seen, for example, over the past decade or so in comparative phylogenetics.
While popular software may increase the number of mistakes, it reduces the variety of them, making them easier to catch. If everyone wrote their own methods, even with fewer people involved, it would become much harder to debug both the results and the code.
Lastly, I think in this thread we continue to confound “coding skills” and “statistical skills”. (not your comments, but the general assumption that people who can code things themselves get the stats correct and those who use someone else’s software wouldn’t understand the stats).
Anyway, thanks for great points and examples. I too find it hard not to blame the software!
[^1]: Warnings are good, but warning too liberally becomes like trying to teach, and may give a false sense of security when no warnings are encountered. I’m honestly not sure how to draw the line. Like you say, though, I’d vote for allocating effort to new methods (or human teaching) over writing warning messages.
The BUGS language has its fair share of issues (some of which have been addressed by JAGS), but the tremendous functionality outweighs the lack of ease and safety, especially when you consider the transparency that is possible with shared code. And it addresses Carl’s point about coding vs statistical skills – you need both to use BUGS beyond toy examples. Sure, the MCMC sampler is a black box but with any software there is eventually a point beyond which only developers go.
I got distracted making my point about developing collaborations with stats experts, so I didn’t answer your questions:
> what characteristics of software encourage you to do good science?
A clear explanation of the limits of the methods implemented in the software. Something like ‘this package works for data that look like X & Y, coming from designed experiments or sampling using Z. Results are probably invalid if your data do not meet these criteria.’
> should software try to teach you what to do, or should it provide tools and let you use them as you see fit?
Short answer: no. Longer answer: you can probably teach/prevent abuse with your choice of defaults; a while ago I wrote about this, and (unsurprisingly) users default to the defaults. Some defaults mislead users into a false sense of security: should anyone really believe that P values with 6 decimal places apply to their analyses?
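To make that “6 decimal places” point concrete, here’s a tiny sketch (in Python, with function names I just made up for illustration, not from any real package) of how an output-formatting default can convey false precision, and what a more defensive default might look like:

```python
def format_p(p, digits=6):
    """Default-style formatting: reports p to `digits` decimal places."""
    return f"p = {p:.{digits}f}"


def format_p_honest(p, threshold=0.001):
    """A more defensive default: round coarsely and floor tiny values,
    since the extra digits rarely reflect real certainty."""
    if p < threshold:
        return f"p < {threshold}"
    return f"p = {round(p, 3)}"


p = 0.000123456
print(format_p(p))         # "p = 0.000123" -- looks impressively exact
print(format_p_honest(p))  # "p < 0.001"    -- same decision information
```

The particular cutoffs don’t matter; the point is that whatever format the developer picks as the default is what most users will end up reporting.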
> have you encountered tools that are particularly prone to abuse?
Of course any open-ended tool will be used in wrong ways: ask my screwdrivers. “With great power comes great responsibility” applies to the user. It is very hard to predict all the possible wrong ways people will use your software; I don’t think you should spend a lot of time imagining how users will torture your software in terms of statistical application. People are too creative.
> how should we decide when to warn and when to remain silent?
IMO software should remain silent unless you are very close to boundary conditions that make results unstable or plain wrong.
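As a rough illustration of that policy (the function and tolerance below are hypothetical, invented for this sketch, not any real package’s API): stay silent for ordinary inputs, and warn only when an estimate lands on the boundary of parameter space, where the usual standard errors fall apart.

```python
import warnings


def estimate_proportion(successes, trials, boundary_tol=0.0):
    """Estimate a simple proportion; warn only at the boundary.

    Silent for interior estimates; warns when the estimate hits 0 or 1
    (or comes within `boundary_tol` of them), where Wald standard errors
    and confidence intervals are unreliable.
    """
    p_hat = successes / trials
    if p_hat <= boundary_tol or p_hat >= 1 - boundary_tol:
        warnings.warn(
            "Estimate is on the boundary of parameter space; "
            "standard errors and intervals are unreliable here.",
            stacklevel=2,
        )
    return p_hat


print(estimate_proportion(3, 10))   # 0.3, silently
print(estimate_proportion(10, 10))  # 1.0, with a warning
```

The design choice is exactly the trade-off discussed here: no chatter for routine use, but a loud signal at the one place where the numbers returned are likely to be plain wrong.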
> if it were up to you, how should developers allocate scarce resources between new or better functionality vs improving safety or user-friendliness?
User-friendliness > functionality > safety. I wouldn’t use software that doesn’t achieve a minimum of friendliness; having achieved friendliness, I want as much functionality as possible. I can live with the onus of making sense of the analyses (or asking someone to help me do it).
Pingback: Is statistical software harmful? | Betteridge’s Law
Just remembered one inarguable example of harmful statistical software: software that actively *encourages* mistakes as opposed to trying to prevent them, never mind remaining silent!
On the contrary (or just to be a contrarian 😉 ), this is evidence that the ability to write software does not imply statistical understanding. We could take away all the statistical software that exists and folks would continue to make statistical mistakes, even when it requires programming to make them. I sometimes worry that in decrying user-friendly statistical software there is the implicit assumption that those who can program methods without using someone else’s software will have a better understanding of the statistics and be less likely to misuse it. Or maybe that was never implied?
In an example like this, where the underlying idea is just wrong, it’s not clear that the software itself is what is actively harmful (one could blame the internet for making the bad idea ‘available to the masses’ just as much as the software; the risk is only to those with no appreciation for statistics). It does raise the question of what the platonic example of ‘harmful’ software would be. I suspect it would be something that is statistically valid and can be used correctly, but is programmed in a way that even those with a decent understanding of the statistical and computational processes involved are likely to use it incorrectly. Perhaps we already have some good candidates in the examples cited above?
This comment thread is so good that even my attempts at jokes draw serious, thoughtful responses! 🙂
Pingback: Notes from France | theoretical ecology
Pingback: Code-Based Methods and the Problem of Accessibility | methods.blog