Last week Jeremy linked to yet another study where expert researchers (social scientists in this case) were asked to analyze the same dataset. The key findings were: a) that the experts had broad disagreements about the best way to analyze the data, and b) that these differences were consequential in leading to totally different outcomes (positive, negative or no statistically significant effect). This should hardly be news; earlier studies have found this in another social science dataset and in fMRI scans in neurobiology.
You really have to pause and let that sink in. Established researchers each think a different method of analysis is best, and these different methods give completely different, even opposite, answers for the same data on the same question. Even controlling for subtly different questions or experimental designs, the answer you get depends entirely on which person you give the data to for analysis!
This should be the death of the “one right way” approach to statistics
It is hard to read these studies any other way than as a major blow to the view/approach that there is one right way to do statistics. Or at least it should be. I assume some will continue to think that there really is “one right way” (theirs) and that the variance occurs because most of the researchers (everybody else) are just plain wrong. But that is a bit too egocentric and lacking in perspective to my mind. I know I’ve offended people over my blogging history on this point (which has not been my intent), but I just find it really impossible to accept that there is only one right way to analyze any statistical problem. Statistics are (probabilistic) models of reality, and it is impossible to have a perfect model, thus all models involve tradeoffs. And these new studies just feel like a sledgehammer of evidence against the view that there is one right way (even as they make us aware that it is really inconvenient that there is not just one right way).
When you look at the differences among researchers’ approaches reported in the studies, they’re not matters where one could dismiss an approach as being grossly wrong. They’re the kinds of things people debate all the time. Log transform or square root transform or no transform (yes, sometimes log or no-log is clearly better, but there are a lot of datasets out there where neither is great and it is a matter of judgment which is better – I’ll give an example below). Or OLS vs. logistic vs. some other GLM. Multivariate regression vs. principal component regression vs. regression trees. AIC-based vs. automated vs. researcher-driven variable selection. Include a covariate or leave it out. And so on. There is no such thing as the “one true right way” to navigate these. And as these meta-studies show, they’re not so trivial that we can ignore these differences of opinion either – conclusions can change drastically. So, again, these results really should give us pause. Ninety percent of our published articles might have come to an opposite conclusion if somebody else had done the stats, even with the same data! (And “one person was smart and the other dumb” is not really a viable explanation.)
Is ecology and evolution different?
Or maybe the ecology literature is safe? For those of us in ecology and evolution, our time to find out is coming. A similar study is underway right now. I suppose logically there could be two outcomes. 1) Same results as previous studies – total researcher-dependent chaos. 2) Different results – the chosen question and dataset has a strong answer and lots of different methods recover the same answer (qualitatively – small changes in effect sizes and p-values are OK). A lot of people in response to Jeremy’s question of what to do about these studies seemed to be really thinking (hoping?) that ecology was different and would come out with outcome #2.
Personally, I doubt it. I don’t think fields are that different. Different questions within a field are the important difference. All fields sometimes chase large effect sizes, which will give outcome #2 (when you can see the pattern visually in the data, methods aren’t going to change the story), and sometimes chase small effects, which will give outcome #1 (when the effect sizes are small and you have six control variables, it matters a lot how you analyze the data). But here’s the key: after we’ve completed our study with a single analysis path, we don’t know whether our question and results are in outcome #1 (different paths would give different answers) or #2 (different paths would give similar answers). If we knew that, we wouldn’t have done the study! Sometimes studies of weak effects come up with an estimate of a strong effect, and sometimes studies of a strong effect come up with an estimate of a weak effect. So trying to use statistics to tell us whether we are in #1 or #2 is circular. This is a really key point – it might seem that the only way to tell whether we are in #1 or #2 is to do some giant meta-study where we get a couple of dozen researchers to analyze the same question on the same dataset. That hardly seems practical! And the study being done on evolutionary ecology and conservation ecology questions could end up either in #1 or #2 (in my hypothesis, depending on whether they are giving researchers weak-effect or strong-effect datasets/problems), so that is not a comprehensive guide for all of ecology and evolution. What we really need is a meta-meta-study that does several dozen of these meta-studies and then analyzes how often #1 vs. #2 comes up (Jeremy has had these same thoughts). I’m willing to bet pretty heavily that ecology and evolution have both publications that are safe (scenario #2) and publications that are completely dependent on how the analysis was done (scenario #1). In my own research in macroecology, I have been in scenarios where #1 is true and in scenarios where #2 is true.
Couldn’t individual authors just explore alternative analysis paths?
If we can’t afford to conduct a meta-analysis with a few dozen researchers independently analyzing each dataset for each unique question (and we surely can’t!), then what alternative is there? There is an obvious alternative. An individual researcher can explore these alternatives themselves. A lot of researchers already do this. I bet every reader of this post has at one time tried an analysis with and without a log transform, or compared OLS against a GLM with different assumptions about the residuals. And, nominally, a strong majority of ecologists think such robustness checks are a good thing, according to Jeremy’s poll. So it’s hardly a novel concept. In short, yes, it is clearly possible for a single author to replicate the main benefits of these meta-analyses by individually performing a bunch of alternative analysis paths.
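Just to make that concrete, here is a minimal sketch in R of what that kind of single-author robustness check might look like. It assumes a purely hypothetical data frame dat with a positive response y and a predictor x – it is an illustration of the idea, not an analysis of any particular study.
#hypothetical example: same question, three reasonable analysis paths
m_raw <- lm(y~x, data=dat)                                #untransformed OLS
m_log <- lm(log(y)~x, data=dat)                           #log-transformed response
m_glm <- glm(y~x, data=dat, family=Gamma(link="log"))     #GLM alternative to transforming
#compare sign, effect size and p-value for x across the three paths
lapply(list(raw=m_raw, log=m_log, glm=m_glm), function(m) coef(summary(m))["x",])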
But there is a deep aversion to doing this in practice. It is labelled with terrible names like “p-hacking” and the “garden of forking paths” (with its implicit reference to temptation in the Garden of Eden). I know in my own experience as a reviewer, I must have had a dozen cases where I thought the outcome reported was dependent on the analysis method and asked for a re-analysis using an alternative method to prove me wrong. Sometimes, the editor backs that request up (or the authors do it voluntarily). But a majority of the time they don’t. Indeed, I would say editors are much more receptive to “the stats are wrong, do them differently” than “the stats are potentially not very sturdy, try it a 2nd way and report both”.
Thus even though, in a private poll, we seem to think such single-author exploration of alternative analysis approaches is a good idea, in the published literature and the peer review process the practice remains murky, cloaked in secrecy, and met with disapproval from others.
And there are of course some strong reasons for this (some valid, some definitely not):
- The valid reason is that if an author tries 10 methods and then picks the one with the most preferred results and only reports that, then it is really unethical (and misleading and bad for science), although in private most scientists admit it is pretty common.
- The invalid reason is that doing multiple analyses could take a seemingly strong result (p < 0.05 is all that matters, right?) and turn it into a murky result. It might be significant in some analyses and not significant in others. What happens if the author does the requested robustness check by analyzing the data a second way and loses statistical significance? This is a really bad, but really human, reason to avoid multiple analyses. Ignorance is NOT bliss in science!
So how do we stay away from the bad scenario (the selective reporting in reason #1 above) while acknowledging that the avoidance described in reason #2 is bad for science in the long run (even if it is optimal for the individual scientist in the short run)?
Well, I think the solution is the same as for exploratory statistics: take it out of the closet, celebrate it as an approach, and brag about using it! If we’re supporting and rewarding researchers for using this approach openly, they’re going to report it, and the selective-reporting scenario in #1 goes away. Unlike exploratory statistics, which at least had a name, this statistical approach has been so closeted it doesn’t even have a name.
Sturdy statistics – a better approach to statistics than “one true way”
So I propose the name/banner that I have always used in my head: “sturdy statistics”. (Robust statistics might be a better name, but that term has already been taken over for a completely different context: using rank statistics and other methods to analyze non-normal data.) The goal of sturdy statistics is to produce an analysis that is, well, sturdy! It stands up against challenges. It weathers the elements. Like the folk tale of the three little pigs – it is not a house/model made of straw that blows over at the first big puff of challenge (different assumptions and methods). I seek to be like pig #3, whose statistics are made of brick and don’t wobble every time a slightly different approach is used, and – important point – not only am I not afraid to have that claim tested, I WANT it tested.
A commitment to sturdy statistics involves:
- Running an analysis multiple different ways (an experienced researcher knows what alternative ways will be suggested to them, and we can help graduate students learn these).
- If the results are all qualitatively similar (and quantitatively close), then great! Report that the analyses all converged, so the results are really, truly sturdy.
- If the results are different, then this is the hard part where the commitment to ethics comes in. I think there are two options:
- Report the contrasting results (this may make it harder to get published, but I’m not sure it should – it would be more honest than making results appear sturdy by publishing only one analysis path and brushing off reviewers who request alternative ones)
- A more fruitful path is likely digging in to understand why the different results happened. This may not pay off, essentially leaving you at the previous option (reporting the contrasting results). But in my experience it very often leads to deeper scientific understanding, which can then lead to a better article (the forking paths should still be reported, but they don’t have to take center stage if you really figure out what is going on). For example, it may turn out the result really depends on the skew in the data and that there is interesting biology out in that tail of the distribution.
- As a reviewer or editor, make or support requests for alternative analyses. If the results come back the same, you know you have a really solid result to publish. If the authors come back saying your suggestion gave a different answer and they now understand why, then judge that on its merits as an advance for science. Either way, you’ve done your job as a reviewer and improved science.
Sturdy statistics – An example
I’m going to use a very contrived example on a well-known dataset. Start with the Edgar Anderson (aka Fisher) iris dataset. It has measurements of sepal length and width and petal length and width (4 continuous variables) as well as species ID for 50 individuals in each of 3 species (N=150). It is so famous it has its own Wikipedia page and peer-reviewed publications on the best way to analyze it. It is most classically used as a way to explore multivariate techniques and to compare/contrast e.g. principal component analysis vs. unsupervised clustering vs. discriminant analysis, etc. However, I’m going to keep it really simple and parallel to the more common linear model form of analysis.
So let’s say I want to model Sepal.Length as a function of Sepal.Width (another measure of sepal size), Petal.Length (another measure of overall flower size and specifically length) and species name (in R, Sepal.Length~Sepal.Width+Petal.Length+Species). As you will see, this is a pretty reasonable thing to do (r2>0.8). But there are some questions. If I plot a histogram of Sepal.Length it looks pretty good, but clearly not quite normal (a bit right-skewed and platykurtic). On the other hand, if I log-transform it, I get something else that is not terrible but platykurtic, bimodal and a bit left-skewed (by the way, Box-Cox doesn’t help a lot – any exponent from -2 to 2 is almost equally good). One might also think including species is a cheat or not, so there is a question about whether that should be a covariate. And of course we have fixed vs. random effects (for species). I can very easily come up with 6 different models to run (see R code at the bottom): simple OLS as in the formula already presented; the same but with Sepal.Length log-transformed; the same as OLS but with species removed; species treated as random instead of fixed (one would hope that is not too different); or a GLM with a gamma distribution, which spans normal and lognormal shapes (here the link function also has to be chosen – log is a common choice for a gamma GLM, but maybe it should be identity). And tellingly, you can find the iris data analyzed in most if not all of these ways by people who consider themselves expert enough to write stats tutorials. Below are the results (the coefficients for the two explanatory variables – I left out the species intercepts for simplicity – and r2 and p-values where available, i.e. not for the GLMs or the mixed model).

What can we make of this? Well, Sepal Width and Petal Length both covary pretty strongly and positively with Sepal Length and combine to make a pretty predictive (r2>0.8) and highly statistically significant model. That’s true in any of the 6 analyses. Log transforming doesn’t change that story (although the coefficients are a bit different and remain so even after back-transforming, but that’s not surprising). Using gamma-distributed residuals doesn’t really change the story either. This is a sturdy result! Really the biggest instability we observe is that the relative strengths of Petal Length and Sepal Width change when species is or isn’t included (Petal Length appears more important with species, but Sepal Width is relatively more important without species*). So the relative importance of the two variables is conditional on whether species is included or not – a rather classic result in multivariate regression. If we dig into this deeper, we can see that in this dataset two species (virginica and versicolor) largely overlap (shared slope and intercept, at least), while setosa has a higher intercept but a similar slope vs. Sepal Width; vs. Petal Length, however, the slope for setosa also differs substantially from the other two, so slope estimates will vary depending on whether species is controlled for (and maybe a variable slope and intercept model should be used). So that one weak instability (non-sturdiness) is actually pointing a bright red sign at an interesting piece of biology that I might have ignored if I had only run one analysis (and at additional statistical directions I am not going to pursue in an already too long blog post). This is simultaneously a sturdy result and a case where the sturdiness analysis caused me to dig a bit deeper into the data and learn something biologically interesting. Win all around!
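For readers who want to poke at that last point themselves, here is a minimal sketch of the variable-slope check (not one of the six models above, just an illustration of the follow-up I have in mind):
#does letting the Petal.Length slope differ by species improve the fit?
m_add <- lm(Sepal.Length~Sepal.Width+Petal.Length+Species, data=iris)
m_int <- lm(Sepal.Length~Sepal.Width+Petal.Length*Species, data=iris)
anova(m_add, m_int)  #F-test comparing the additive and interaction models
coef(m_int)          #the Petal.Length:Species terms show how much the slopes differ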
And in a peer review context, that exercise hopefully saves time (reviewers not needing to request additional analyses), is fully transparent on the analyses done (no buried p-hacking), and draws convincing conclusions that leave science in a better place than if I had just chosen one analysis and doubled down on insisting it was right.
Conclusions
TL;DR: Sometimes the answer to a question on a dataset is sturdy against various analysis approaches. Sometimes it’s not. We can’t know a priori which scenario we are in. The logical solution to this is to actually try different analyses and prove our result is “sturdy” – hence an inference approach I call “sturdy statistics”. To avoid this turning into p-hacking it is important that we embrace sturdy statistics and encourage honest reporting of our explorations. But even if you don’t like sturdy statistics, we have to get over the notion of “one right way” to analyze the data and come up with a solution to finding out if multiple, reasonable analysis paths lead to different results or not, and what to do if they do.
What do you think? Do you like sturdy statistics? Do you already practice sturdy statistics (secretly or in the open)? Do you think the risk of sturdy statistics leading to p-hacking is too great? Or is the risk of p-hacking already high and sturdy statistics is a way to reduce its frequency? What needs to change in peer review to support sturdy statistics? Is there an alternative to sturdy statistics to address the many, many reasonable paths through analysis of one data set?
*NB: to really do this just by looking at coefficients I would need standardized independent variables, but the mean and standard deviation of the two variables are close enough and the pattern is strong enough and I am only making relative claims, so I’m going to keep it simple here.
R Code
data(iris)
str(iris)
#simplest model
mols<-lm(Sepal.Length~Sepal.Width+Petal.Length+Species,data=iris)
# log transform sepal length?
mlogols<-lm(log10(Sepal.Length)~Sepal.Width+Petal.Length+Species,data=iris)
#role of species as a covariate?
mnosp<-lm(Sepal.Length~Sepal.Width+Petal.Length,data=iris)
#species as random instead of fixed (shouldn't really differ except d.f.)
library(lme4)
mrndsp<-lmer(Sepal.Length~Sepal.Width+Petal.Length+(1|Species),data=iris)
#Gamma residuals (a good proxy for lognormal)
# with a log link (a common choice for gamma GLMs)
mgamlog<-glm(Sepal.Length~Sepal.Width+Petal.Length+Species, data=iris, family=Gamma(link="log"))
#same but with an identity link instead of a log link
mgamident<-glm(Sepal.Length~Sepal.Width+Petal.Length+Species,
data=iris, family=Gamma(link="identity"))
# Is Sepal.Length better log transformed or raw?
hist(iris$Sepal.Length)
hist(log(iris$Sepal.Length))
#hmm not so obvious either way
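#Box-Cox check mentioned in the text (a sketch using MASS::boxcox):
#the profile likelihood is nearly flat, so any exponent from about -2 to 2 is almost equally good
library(MASS)
boxcox(mols, lambda=seq(-2,2,0.1))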
#do these choices matter?
#pick out relevant pieces of the result from lm, glm, or lmer objects
#return a row in a data frame
report1asdf <- function(mobj) {
co=coef(mobj)
if (!is.numeric(co)) {co=as.numeric(co$Species[1,]); co[1]=NA} #lmer: coef() returns per-species rows; take one and blank the varying intercept
s=summary(mobj)
#handle GLM with no p/r2
if (is.null(s$r.squared)) s$r.squared=NA
if (is.null(s$fstatistic)) s$fstatistic=c(NA,NA,NA)
data.frame(
#CoefInt=co[1],
CoefSepW=co[2],
CoefPetL=co[3],
r2=s$r.squared,
p=pf(s$fstatistic[1],s$fstatistic[2],s$fstatistic[3],lower.tail=FALSE)
)
}
#assemble a table as a dataframe then print it out
res<-data.frame(CoefSepW=numeric(),CoefPetL=numeric(), r2=numeric(),p=numeric())
res<-rbind(res,report1asdf(mols))
res<-rbind(res,report1asdf(mlogols))
res<-rbind(res,report1asdf(mnosp))
res<-rbind(res,report1asdf(mrndsp))
res<-rbind(res,report1asdf(mgamlog))
res<-rbind(res,report1asdf(mgamident))
row.names(res)=c("OLS","OLS Log","OLS No Sp.","Rand Sp","GamLog","GamIdent")
print(res)
Thanks Brian, this is an important issue. On a side note, please don’t use the iris dataset if you take this forward for a publication or other demonstration. There are good reasons for retiring data that originally appeared in the Annals of Eugenics, especially when equally good data are readily available elsewhere. See https://armchairecology.blog/iris-dataset/
Hi Markus – I agree with your basic point about Fisher. I came very close to adding in the word eugenicist, as in “Edgar Anderson (aka eugenicist Fisher) iris dataset”. However, my understanding is that the iris data was originally published by Edgar Anderson and does not have any association with eugenics in that context. Do you have a different understanding? I suppose one could argue that the association alone is bad enough.
The blog post I linked to (and the associated comments) give a good review of the concerns. Also note that the first link on the Wikipedia page about the Iris dataset is to Annals of Eugenics. We could discuss at length whether data about flowers can be tainted by association, but there’s nothing about the Iris dataset that makes it uniquely valuable, so there’s no particular reason for us to keep using it. We can’t entirely disentangle it from the purposes to which it was put.
Lots to chew on here Brian.
I agree with the thrust of your post, so I’ve been trying to stress-test the argument in my own head. I’ve been drawing on my admittedly-sketchy second-hand knowledge of the economics literature. As you know, economics papers routinely report the robustness of the conclusions to what economists call “alternative specifications.” Alternative specifications are alternative ways of doing a statistical analysis. Say, treating a variable as a fixed effect in one analysis, and as a random effect in another analysis. Including covariates A, B, and C, or not. Etc. So it seems like one way to summarize your post would be “ecology papers should be written more like economics papers”.
My outsider’s impression is that economists themselves have somewhat mixed feelings about robustness checks. One complaint is that robustness checks make papers too long, and therefore too difficult and unpleasant to read (econ papers often are >50 pages long; >100 pages is far from unheard of). To which, I guess one could respond that that’s fine so long as the editor and reviewers read the robustness checks carefully. The other complaint is that robustness checks produce a false sense of security–that it’s too easy to consciously or subconsciously choose to perform and report the alternative analyses that support your preferred conclusions, without doing and reporting the alternative analyses that don’t support your preferred conclusions.
My other thought is that, maybe we need guidance as to how to write interesting papers about non-robust results. What’s the “hook” or the compelling “narrative” for a paper for which the results aren’t robust? I guess one narrative might be to lay out alternative hypotheses, then say “here’s a great dataset that you’d think would be able to distinguish those hypotheses for reasons X, Y, Z”, then say “but it turns out those hypotheses can’t be distinguished, you favor one hypothesis or the other depending on how you do the analysis”, then talk about what research would be needed to distinguish the hypotheses. I feel like that kind of paper would fly in leading EEB journals?
It would probably help if the alternative hypotheses are already out there in the literature, so readers don’t feel like you’re setting up straw men. You want to be able to say “here’s a current controversy in the literature; I show that it *can’t* be resolved with currently-available data”.
I think you raise good questions about the mechanics of publishing sturdiness analyses (on which I tend to agree with your poll results – the supplemental material is a fine place; really only reviewers or the very curious need to read them). And personally I’m pretty happy with a table – it doesn’t need 5 pages of narrative text. Just put the R alternatives and a summary table in the supplemental and then briefly state that “sturdiness analysis was performed on model specification and variable transformations, with results not differing meaningfully across analysis paths (see supplemental material)”.
On the question of whether we can fool ourselves by doing a lot of analyses and still missing the really tough ones: probably. But I guess that’s why we have reviewers.
As far as what happens when results are not sturdy, I really believe that is an opportunity/invitation to dig into the data and understand why. Some of the time it boils down to “a few outliers” which you can’t do much about, but in my experience very often it leads to improved understanding and a better albeit possibly different paper.
“And personally I’m pretty happy with a table – doesn’t need 5 pages of narrative text. Just put the R alternatives and a summary table in the supplemental and then briefly reference ”
I increasingly like that option too. I was bummed it wasn’t more popular in the poll I did.
Jeremy – I know you’re a fan of Mayo’s severe testing paradigm. I’m curious how you see this fitting in (or not)? To my more superficial knowledge it seems to have a lot of parallels (although it doesn’t get at the a priori nature of much of the notion of a severe test).
Re: the connection between robustness analysis, and Mayo’s notion of a severe test, good question. To answer it, I think you need to distinguish between a severe test of a specific statistical null hypothesis, and a severe test of some broader scientific claim. Different statistical analyses often test different null hypotheses with different interpretations. For instance, if you’re testing whether the slope of the relationship between the mean of Y and X1 is zero or not, the interpretation of that slope parameter changes if you include covariate X2 in your analysis. So I don’t know that you can compare the severity of those two statistical null hypothesis tests; they’re arguably not tests of the same hypothesis.
But the point of robustness checks is usually to test some broader scientific claim. A claim that itself probably can’t be fully evaluated just with a single statistical null hypothesis test. I do think robustness checks can increase the severity with which we evaluate broad scientific claims. A severe test is a test that a true claim would pass with high probability, and that a false claim would fail with high probability. I do think a broad scientific claim that’s true (or approximately true, or true in most circumstances, or etc.) should pass robustness checks with high probability, and that a false scientific claim should fail them. There are some scientific claims that have passed robustness checks in some of these many analysts, one dataset exercises, IIRC.
Re: the need for severe tests to be a priori, that’s often desirable but it’s not strictly essential. As long as you avoid circularity or cherry-picking that compromises the severity of your test, you’re ok. For instance, in these many analysts, one dataset exercises, the analysts don’t ordinarily pre-register their analysis plans in advance of seeing the data. Rather, each analyst explores the data and eventually settles on some preferred analysis. Quite possibly, each individual analyst is thereby compromising the severity of their own analysis a bit. But if the scientific claim turns out to be robust, each of those individually non-severe tests will lead to the same conclusion, and that conclusion could be said to have passed a severe test, I think. Passing a bunch of independent non-severe tests amounts to passing a severe test.*
But these are just my off the cuff thoughts. I’d be very curious what Deborah Mayo herself would say about these many analysts, one dataset exercises.
*Some care is needed here; this logic doesn’t always fly. For instance, if a bunch of published low-powered statistical tests of the same H0 all reject that H0, you shouldn’t infer that the H0 is false, you should infer that there’s publication bias. Because even if the H0 really is false, you wouldn’t expect it to *always* be rejected by low-powered tests.
Do you seriously believe that “Ninety percent of our published articles might have come to an opposite conclusion if somebody else did the stats even with the same data!”? If not, it is probably not a responsible statement from a scientist when there is significant public distrust of science and its role in forming policy.
If you do believe it, then I suggest some serious blogging about how we should form policy.
It is sort of the current conclusion from several recent studies, which is what I was recapping. But you did notice the word “might”? And the next sentence (but for a literal parenthetical thought) says “Or maybe the ecology literature is safe?” and then goes on to a fairly detailed and nuanced analysis of how likely that statement is to be true? Personally, I think one would have to be pretty myopic to pull that headline out of this blog post.
But I don’t think we do ourselves any favors with long-term credibility in policy if we are radically pushing and overinterpreting results that are wobbly, and pulling out weak effect sizes that cannot stand up even to alternative statistical analyses, let alone different ecological systems, scales, etc. So yes, I think that question is an important one to ask.
As for the larger question: does science gain or lose trust when we publicly discuss how we scientists are not perfect and could become more rigorous? We may not agree on the answer, but it’s not a slam-dunk obvious question.
Thank you for your comments James, but please keep the tone more professional. “I disagree with this post, and am concerned about what I see as its implications, for reasons X Y Z” is totally fine. “This post is irresponsible and is going to have dire consequences for the world at large” is not. Brian’s “ninety percent” remark is a reasonable inference drawn from published “many analysts, one dataset” exercises. If you’re disturbed by the fact that different analysts often reach different conclusions from the same data, don’t rip Brian–he’s just the messenger here.
Re: future posts about the policy implications of the post, readers are welcome to suggest post topics, but we are under no professional or moral obligation to take up those suggestions. I’m guessing you haven’t seen Brian’s various past posts on the role of science in policymaking in a world with significant public distrust of science (which if you haven’t is fair enough; we don’t expect anyone to have read our back catalog). So I’ll take this opportunity to link to some of them:
Hi James, as somebody who is also convinced that we know very little about how much we know in ecology, and so, believes it’s very possible that 90% of our published articles contain little actual knowledge – why would that mean it falls to me to talk about how we form policy? Just because the consequences of what I believe to be true might not be good for society doesn’t obligate me to either (1) solve the problem or (2) change my beliefs.
Further, you imply that people’s distrust of science and its role in forming policy is completely unwarranted. Science gets it wrong often enough that it is not crazy to be somewhat skeptical of scientific claims and the role those claims have in forming policy. To suggest that all of our concerns about the validity of our science should stay ‘inside science’ and that we should present an outward face of complete confidence is exactly the wrong prescription for gaining public trust. I think many disciplines ‘oversell’ themselves, and I would include ecology among those disciplines. I think we would gain public trust if we could provide clear evidence for what we do know and acknowledge the vast amount we don’t. Your goal of gaining public trust is an admirable one, but we need to think carefully about how to do that.
@Brian
Thank you for this perspective. You made an important point in one of your old posts, which was that when one has a robust dataset, the choice of a model is often a trivial decision. I have tested this assertion with various datasets and I found that to be the case. So, your #2 scenario is perhaps where we need to do a better job.
Also, I think some ecologists are unnecessarily emotionally invested in their hypotheses or have an illogical need to conform with some previous studies. Some would discard robust datasets or analyses because they didn’t conform with the findings of Fox et al. 1854 – the famous guy. Such has been suggested to me on more than one occasion. Not because they thought the data or the analyses were wrong, but because they thought I would have a hard time publishing them. And they were right. With all that is at stake in current academic science, I can understand why some scientists might want to take the path of least resistance. For me, this is a #3 scenario.
While I like and practice the idea of “sturdy analyses”, my experiences haven’t been that positive, whether with collaborators or the peer-review process. I just had a collaboration break down because I wanted a particular result presented as is, letting readers judge it on its merits or lack thereof. The collaborator in question had taken a different position in one of his previous papers and did not care that we had now assembled a much larger dataset with longer time series, hierarchical analyses, etc.
Thank you for your comments. One hopes that discussions about the best way to do statistics are not intermixed with egos, but scientists being human beings, they often are.
Thanks Brian. This discussion does blend in rather seamlessly to the notion of what null model one uses (See Colwell and Winkler (1984) “A null model for null models”). Each test represents a set of assumptions and logic that often have at least some basis in the biology of the system. An analysis of a series of such tests using the same dataset, especially when they yield different conclusions, can yield, at the very least, new working hypotheses about mechanism that are worthy of following up.
One example might be found in “Phylogenetic corrections”, starting with Felsenstein (1985) and its descendants. Each model attempts to generate a null about how evolution works. We weren’t there to watch the tree of life diversify. It has always seemed reasonable to compare the conclusions of different null models—including the one that says each branch tip is generating variable individuals that are subject to ongoing natural selection (that is, statistically, n=the number of species)—that Felsenstein was rebelling against.
Really cool connection to the idea of multiple null models. My suspicion is that if one null model fits better, that could be more informative about mechanism than if one statistical model fits better, but maybe I’m wrong.
Thanks Brian for this great post. I must admit that I am sometimes scared when I get additional data (like a new year or new locations) that the results coming out will be much more complex to explain than those from a more restricted dataset. Academia being a competitive place, you do want to publish regularly to make it through as an early career researcher. In this context I guess that sturdy stats, which take more time in the writing than just picking out your favorite models, would really take off if supported by the editors of leading journals (what about an editorial in GEB on that?).
On a more technical side I was also wondering how you see multi-model averaging or similar techniques coming into sturdy stats? One way to get around the burden of having to report conflicting results would be to just average across all fitted models and say: voila we took into account uncertainty in modelling choices, so our results are robust / sturdy.
And finally I was wondering if you have read “Bayesian workflow” by A. Gelman: https://arxiv.org/abs/2011.01808, chapter 8 seems pretty fitting to this discussion: “The key aspect of Bayesian workflow, which takes it beyond Bayesian data analysis, is that we are fitting many models while working on a single problem.”
I’m not convinced sturdy stats take that much more time in the writing. See my conversation above with Jeremy. Personally I think just a table and a few sentences is enough.
I’m not a fan of model averaging. It’s fine if you average predictions to make a prediction. But I don’t think it helps with inferences of the hypothesis-testing kind, and it is definitely not appropriate for variable selection – and those are what I think scientists most often want to do. But then, as a scientist, I’ve always been innately more interested in the variability than the mean.
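For concreteness, here is a minimal sketch of the one use I do think is fine – averaging predictions (not coefficients), illustrated with two of the iris models from the post, purely as an example:
m_ols <- lm(Sepal.Length~Sepal.Width+Petal.Length+Species, data=iris)
m_gam <- glm(Sepal.Length~Sepal.Width+Petal.Length+Species, data=iris, family=Gamma(link="identity"))
#average the two models' fitted predictions on the response scale
pred_avg <- rowMeans(cbind(predict(m_ols), predict(m_gam, type="response")))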
Thanks for the link – it sounds like it does have some convergence of ideas.
I’ve really enjoyed this post and the discussions around it, Brian.
What role should ‘model fit’ play in subsequent discussions of the models? In your example, the R2’s are similar for the top three models – what if they had been very different? If one of those three models had an R2 of 0.15, would that be grounds for not mentioning it? And should it bother us that we don’t have an estimate of model fit for the bottom three models? Should we use one of the indices of model fit for GLMs?
Also, what do we do when the theoretical grounds for using one model are better than for any of the others but the model fit is much better for one of the models with less theoretical support? For example, examining the effect of wetland size on amphibian species richness –
Model 1: OLS, log(ASR) ~ Wetland size
Model 2: GLM, ASR ~ Wetland size, family = Poisson, link = log
Model 3: GLM, ASR ~ Wetland size, family = Negative Binomial, link = log
mean ASR = 3.7; Variance ASR = 3.9
ASR is count data with a hard floor of zero and the mean is not a long ways from zero, so OLS seems like a bad choice. The mean and variance are close so Poisson seems like a likely choice but NB is fine because it will, more or less, reduce to a Poisson when the mean is close to the variance. But when we analyse the data, the OLS provides a much better fit – let’s say 10x more likely than either Models 2 or 3. How much should the a priori theoretical considerations weigh into our interpretations?
In your example, you used a random intercept model and considered a random slope model, but because there are only three species you know you are getting poor estimates of the intercepts and slopes. If one of the ‘random’ models was, by far, the ‘best’ model, would it bother you that you might make a theoretical argument for not using a random intercepts or random slopes model?
I think this is an important recommendation for ecologists and now I’m down into the weeds of figuring out how to apply it.
Some good questions!
I’m all for R2, so yes – in a more real-world application I would provide some pseudo-R2 for the GLMs (although I might also report an RMSE or MAE). As for your example where OLS is a better empirical fit but a worse theoretical fit: can I cop out and say I’m not convinced it can happen – at least not to the degree of your example (I’m half expecting you to come back and tell me that it is a real-world example …)? But if it does happen, I think you have to dig into why. To me that’s the harder but most fun part of sturdy stats. It’s great when they all line up – you’ve got something real. But when they don’t line up, dig in. You’re modelling the data, so what makes the models behave differently when you thought they would behave the same? Maybe in your example it could be that OLS can handle disproportionately large variances while Poisson can’t (but negative binomial can, though maybe with its own downsides).
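For instance, just a sketch of one option – a deviance-based pseudo-R2 plus an RMSE, shown here for the gamma/log model from the post (other pseudo-R2 definitions exist and might be preferable):
mgamlog <- glm(Sepal.Length~Sepal.Width+Petal.Length+Species, data=iris, family=Gamma(link="log"))
pseudo_r2 <- 1 - mgamlog$deviance/mgamlog$null.deviance   #proportion of deviance explained
rmse <- sqrt(mean((iris$Sepal.Length - predict(mgamlog, type="response"))^2))
c(pseudo_r2=pseudo_r2, rmse=rmse)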
Same with your random example. I’m all into model understanding rather than model selection. Why are models that seem like they should be better failing? Admittedly that is a much deeper dive than most people make statistically today, but that is where I think the real benefit of demanding sturdy analysis comes in.
You’re right Brian, it was a real-world example (although from memory, so I could have it a little wrong). But your recommendation to dig in got me thinking, and my memory is that this was a rare case where the variance was smaller than the mean – maybe in the rare cases where variance < mean, the assumption of Poisson that the variance equals the mean (and of NB that the variance is at least as large as the mean) causes bigger estimation problems than I might have thought.
I’m late to the party on this interesting discussion, but want to raise one point. It seems to me that “analyze the same data set with different methods” is an inappropriate comparison. It leaves out the question being asked. Lots of the examples pointed to are asking different questions. Two questions addressed with two methods leading to two answers does not seem to me like cause for concern. The different models underlying different statistical tests are asking different questions … even in as simple a case as just transforming variables (as mentioned in the iris example). Is it possible that carefully formulating the model underlying the analysis would remove some of the problem?
My own guess is that to a statistician they may seem like different questions, but to the average scientist they do not. Certainly in the meta-studies that I started by discussing, the analysts were asked one question (e.g. “Does skin color of a player influence the rate of red cards in professional soccer?”). Within this context, debates about log transforming or using logit or OLS were different faculty members’ interpretations of that question. Similarly, although my example with species as a covariate was a bit oversimplified or contrived, in the soccer example the analysts were given about 6 covariates, and in the context of control variables it might legitimately be seen as pursuit of the same question with different control variables included or excluded (as occurred in the study, and as regularly obsessed over by economists). Similarly in ecology, where we are more prone to including site/replicate random factors than covariates, I think people might legitimately think they are pursuing the same question but just putting in different random factors as controls for nuisance pseudoreplication.
“It seems to me that “analyze the same data set with different methods” is an inappropriate comparison. It leaves out the question being asked.”
For all of these many analysts, one dataset projects, the analysts are all given the same scientific question to address. For instance, the ongoing ecological many analysts, one dataset project gives the following instructions to analysts of one of their datasets: “Download these data and analyse them to find the answer to the question “How does grass cover influence Eucalyptus spp. seedling recruitment?” using what ever methods you feel are most appropriate. This includes choosing which variables to use for ‘grass cover’ and ‘ eucalypt seedling recruitment'”