The starting point for this post is an old remark of statistician Jeff Leek (sorry, can’t find the link just now) that no statistical technique works at scale. He was defending frequentist statistical techniques like P-values and confidence intervals against the accusation that they’re widely misunderstood or misused, and that we should therefore use Bayesian approaches instead. Jeff’s counter-argument is that if Bayesian approaches were used as widely as P-values and confidence intervals currently are, they’d be just as widely misunderstood and misused.
Does that argument generalize? Is it true that any statistical or other scientific technique gets used increasingly badly on average as the number of people using it rises? Or are there some for which the quality of the average application holds steady or even improves as the number of users increases? And are there some techniques for which the quality of the average application declines only slowly or asymptotically as the number of users increases, vs. other techniques for which the decline is much steeper?
I was also wondering how the relationship between average quality of application and number of users affects the number of users. Are there some techniques that are hard to use well, but because they’re hard to use well only end up getting used by a small number of people who use them well? Versus other techniques that are hard to use well but easy to think you’re using well, that end up getting used badly by lots of people? Or maybe the number of people using any given technique mostly depends on other factors?
Off the top of my head, I can think of some techniques that do work just as well “at scale” as they do for “early adopters”. Pipetting for instance. The vast majority of the time, when somebody pipets some liquid, they pipet the desired amount. And I doubt the pipetting error rate has increased appreciably over time as more and more scientists and trainees have been doing more and more pipetting. Same for weighing stuff using balances. Etc. What those examples of “scalable” techniques have in common is that they’re routine. The user doesn’t need to exercise any thought, interpretation, or judgment. So one working hypothesis is that the more thought and judgment a technique requires in order to work well, the worse it will scale to mass use.
If that working hypothesis is right, then the scalability of a technique won’t be solely a matter of how difficult it is to teach or learn the technique in any purely technical sense. For instance, my sense is that widespread abuse and misinterpretation of P-values is mostly not a matter of “purely” technical mistakes. It’s not that lots of people are miscalculating their P-values or don’t know what a P-value literally means. Brian made a similar point in his old post on why AIC appeals to ecologists’ lowest instincts. Unhelpful applications of AIC in ecology mostly aren’t a matter of people making purely technical mistakes in the calculation of AIC values, or being unaware of technical facts about AIC.
I emphasize that my interest in these questions is purely academic curiosity. I do not think that our choice of statistical or other scientific techniques should be dictated by worries about how well they scale up to mass use. And nothing in this post is a criticism of how people use or teach statistics or other scientific techniques. All we can do is use whatever techniques seem best, and teach others to do the same. Perhaps the only practical reason to discuss the issues raised in this post is to identify the “failure modes” of different techniques: the ways in which they tend to be misunderstood or misused, when they are misunderstood or misused. If you know the most common misunderstandings or abuses of a technique, you can aim to avoid or counter them in your own work and teaching.
Related old posts
Which big ideas in ecology were successful, and which were unsuccessful? The same questions this post asks about statistical techniques can also be asked about scientific ideas. Big ideas in ecology vary in how successful they’ve been. In at least some cases you can argue that the success, or comparative lack thereof, is related to how widely the idea was taken up.
Which ecological theories are widely misunderstood even by experts?
Techniques aren’t powerful, scientists are
While perhaps not entirely what you had in mind, large-scale data collection through citizen science projects was the first thing that sprang to mind. For instance, in the UK there is the Big Garden Birdwatch run by the RSPB each year, and my personal favourite, Project Splatter, run by Cardiff University. These types of projects could be performed by a small group of highly trained people, but then you would lose the larger geographical scale. So, in terms of frequency of application, as more people get involved the average identification ability should decrease.
Citizen science is interesting to think about in this context. There are multiple sorts of scaling involved. The scaling involved in adding more citizens to a single project, and the scaling involved in more investigators starting more citizen science projects.
We have some old posts on ensuring data quality in citizen science projects from Margaret Kosmala, who was heavily involved in the Snapshot Serengeti project using citizen science to ID the animals in a bunch of camera trap photos.
It’s interesting to consider the ‘black box’ complaint in this context. Arguably something like pipetting is a ‘black box’ process where you don’t have to know how it works and there are only a limited number of options. In the statistics world, ‘black box’ processes like Maxent are looked down upon for the same reasons. I’d argue that ‘black box’ methods are much more scalable than the alternative, although someone will argue back, ‘what is even the point of doing the analysis if you don’t totally understand it?!’ And I’d reply that sometimes you want a starting point, or the decision is going to be made on other grounds, as in the case of protected areas designed by Marxan, another ‘black box’ process.
This is how comment sections work, right? I make my own counter-arguments?
“It’s interesting to consider the ‘black box’ complaint in this context. ”
That’s a really interesting point/counterpoint. I don’t really have anything to add, I’m still mulling it over myself. Hopefully others will jump in.
“This is how comment sections work, right? I make my own counter-arguments?”
Now it is. I’ll be over here if you need me. 🙂
Ultimately the “black box” is what one is striving for when developing a technique, right? The less judgement required on the part of the user, the more robust the technique – at least insofar as the technique accurately depicts the real world and its assumptions are satisfied or at least easily testable.
It seems that the issue with the application of many statistical tests is that the assumptions aren’t clear or explicit, and often not even mentioned, much less thoughtfully considered and tested.
IMO your idea about the amount of judgement involved is the key. Technical expertise is relatively easy to obtain and is frequently “reusable” across many techniques (e.g. pipetting).
In geology, U-Pb isotopic dating is very successful even though it is technically difficult. The simultaneous use of two isotopic systems (238U and 235U) with different decay rates shows unequivocally whether or not the system has been disrupted, so users can see for certain whether the assumption of no loss of daughter isotopes has been satisfied. There is no need for the geologist to guess about this fundamental assumption using equivocal or uncertain means.
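The built-in cross-check described above can be illustrated with a toy calculation. This is only a sketch, not real geochronology software: the decay constants are the standard published values, but the sample ratios and the 10%-lead-loss scenario are invented for illustration.

```python
import math

# Standard decay constants (per year) for the two uranium systems.
LAMBDA_238 = 1.55125e-10  # 238U -> 206Pb
LAMBDA_235 = 9.8485e-10   # 235U -> 207Pb

def age_from_ratio(daughter_parent_ratio, decay_constant):
    """Age implied by a measured daughter/parent isotope ratio,
    assuming a closed system: t = ln(1 + D/P) / lambda."""
    return math.log(1.0 + daughter_parent_ratio) / decay_constant

def is_concordant(r206_238, r207_235, rel_tol=0.01):
    """The cross-check: the two systems should yield the same age
    if no daughter isotopes were lost. Returns (agree?, age1, age2)."""
    t1 = age_from_ratio(r206_238, LAMBDA_238)
    t2 = age_from_ratio(r207_235, LAMBDA_235)
    return abs(t1 - t2) <= rel_tol * max(t1, t2), t1, t2

# A hypothetical closed-system sample that is 1 billion years old:
t_true = 1.0e9
r206 = math.exp(LAMBDA_238 * t_true) - 1.0
r207 = math.exp(LAMBDA_235 * t_true) - 1.0
print(is_concordant(r206, r207)[0])               # -> True (ages agree)

# The same sample after losing 10% of its lead: because the two decay
# constants differ, the two apparent ages now disagree, flagging the
# open-system behaviour without any guesswork by the user.
print(is_concordant(0.9 * r206, 0.9 * r207)[0])   # -> False (discordant)
```

The key design feature, as the comment notes, is that the assumption check is automatic: a user cannot accidentally skip it, because discordance shows up in the numbers themselves.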
No doubt, the mean level of stats knowledge is currently higher in the Bayesian than in the frequentist crowd, because of various entry barriers (technical, institutional, educational).
And yes, if we assume that misuse / misinterpretation of a method becomes less likely the more you understand about it, we should think that Bayesian methods will be misused more often once they are as easily accessible as frequentist methods (I think we see this already happening, e.g. with the BayesFactor package).
Still, it would be surprising if every single statistical method had the same propensity for being misunderstood. The idea that we should not care about the number of errors made by a typical user of a method, because users always make errors, sounds to me like a classic example of the nirvana fallacy (seat belts are not useful because people will still die in car crashes).
Based on this thought, I think it IS actually useful to think about the human error rate of a stats method, at least from the point of view of a stats developer / teacher. All other things being equal, I would argue that a method that is less frequently misunderstood is a better method, for all practical purposes. The tricky case is a method that is more powerful, but also more likely to be misused or misinterpreted; I’m not sure that is the case for the Bayesian / frequentist controversy …
Indeed, I see the more meaningful distinction as being between ‘modelers’ and those trained to view statistics as a toolbox of ‘tests’ for making binary decisions, the latter of which is of course easier to scale in the sense of push-button software. Now, it just so happens that the Bayesian framework lends itself a little more naturally to the former, but I agree that is secondary to questions of scalability.
I think this depends on the statistical technique, in the sense of how well it appears to “magically” solve a problem. For example, whether you use p-values or delta-AIC, there is a magical number below (or above) which you have a “significant” or “meaningful” difference, and on the other side of which you have no evidence for a difference. This is of course not how p-values or AIC should be used, and there’s a lot of criticism of these arbitrary thresholds, but they’re still widely used, often without any good reason, such as using a delta-AIC < 2. It’s just so much easier to say a yes or a no than to discuss how much evidence you have, etc.
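To make the delta-AIC point concrete, here is a minimal sketch (with made-up numbers) comparing an intercept-only model to a straight-line model by least squares. Delta-AIC comes out as a continuous measure of relative support; the familiar “delta < 2” cutoff is a convention layered on top of it, just like p < 0.05.

```python
import math

def gaussian_aic(residuals, n_params):
    """AIC = 2k - 2*lnL for a least-squares fit with Gaussian errors,
    using the MLE of the error variance (k counts that variance too)."""
    n = len(residuals)
    sigma2 = sum(r * r for r in residuals) / n
    log_lik = -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)
    return 2 * n_params - 2 * log_lik

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b

# Invented data with a modest upward trend (fixed numbers, no simulation).
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.1, 1.8, 2.6, 2.4, 2.9, 2.7, 3.3, 3.1]

# Model 1: intercept only (2 params: mean, error variance).
mean_y = sum(ys) / len(ys)
aic_null = gaussian_aic([y - mean_y for y in ys], 2)

# Model 2: straight line (3 params: intercept, slope, error variance).
a, b = fit_line(xs, ys)
aic_line = gaussian_aic([y - (a + b * x) for x, y in zip(xs, ys)], 3)

delta = aic_null - aic_line
print(f"delta-AIC = {delta:.2f}")
# delta is just a number on a continuous scale of relative support;
# nothing special happens as it crosses 2.
```

Reporting delta itself (here, roughly 10 in favor of the line) conveys the strength of evidence; collapsing it to “above/below 2” throws that information away, which is exactly the yes-or-no shortcut described above.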
And another thing is how some methods make it seem like you can “magically” solve a problem. Got non-normality? No problem, just use a non-parametric test! Got a non-linear pattern? Just use a GAM! Did you forget to read Hurlbert and your study is pseudoreplicated? That’s cool, just use mixed effects models and include autocorrelation if it makes you feel better! As these techniques are easy to apply and make us think we’re solving our problems even when we’re not, they become quite attractive and therefore easily misused.
As a counter-example, I think that regression trees are harder to interpret and don’t give us a “magic” number such as a delta-AIC, and therefore don’t have that much appeal. Or maybe they do; I just don’t know enough about them.