Last week Jeremy linked to yet another study where expert researchers (social scientists in this case) were asked to analyze the same dataset. The key findings were: a) that the experts had broad disagreements about the best way to analyze the data, and b) that these differences were consequential in leading to totally different outcomes (positive, negative or no statistically significant effect). This should hardly be news; earlier studies have found this in another social science dataset and in fMRI scans in neurobiology.

You really have to pause and let that sink in. Established researchers each think a different method of analysis is best, and these different methods give completely different, even opposite, answers for the same data on the same question. Even controlling for subtly different questions or experimental designs, the answer you get depends entirely on which person you hand the data to for analysis!

## This should be the death of the “one right way” approach to statistics

It is hard to read these studies any other way than as a major blow to the view that there is one right way to do statistics. Or at least it should be. I assume some will continue to think that there really is “one right way” (theirs) and that the variance occurs because most of the researchers (everybody else) are just plain wrong. But that is a bit too egocentric and lacking in perspective to my mind. I know I’ve offended people over my blogging history on this point (which has not been my intent), but I just find it impossible to accept that there is only one right way to analyze any statistical problem. *Statistics are (probabilistic) models of reality, and it is impossible to have a perfect model, thus all models are involved in tradeoffs.* And these new studies feel like a sledgehammer of evidence against the view that there is one right way (even as they make us aware that it is really inconvenient that there is not just one right way).

When you look at the differences among researchers’ approaches reported in the studies, they’re not matters where one could dismiss an approach as being grossly wrong. They’re the kinds of things people debate all the time. Log transform or square root transform or no transform (yes, sometimes log or no-log is clearly better, but there are a lot of datasets out there where neither is great and it is a matter of judgment which is better – I’ll give an example below). Or OLS vs. logistic vs. other GLM. Multivariate regression vs. principal component regression vs. regression tree. AIC vs. automated vs. researcher-driven variable selection. Include a covariate or leave it out. And so on. There is no “one true right way” to navigate these. And as these meta-studies show, they’re not so trivial that we can ignore these differences of opinion either – conclusions can change drastically. So, again, these results really should give us pause. Ninety percent of our published articles might have come to an opposite conclusion if somebody else had done the stats, even with the same data! (And “one person was smart and the other dumb” is not really a viable explanation.)

## Is ecology and evolution different?

Or maybe the ecology literature is safe? For those of us in ecology and evolution, our time to find out is coming. A similar study is underway right now. I suppose logically there could be two outcomes. 1) Same results as previous studies – total researcher-dependent chaos. 2) Different results – the chosen question and dataset have a strong answer, and lots of different methods recover the same answer (qualitatively – small changes in effect sizes and p-values are OK). A lot of people, in response to Jeremy’s question of what to do about these studies, seemed to be really thinking (hoping?) that ecology is different and would come out with outcome #2.

Personally, I doubt it. I don’t think fields are that different. Different questions within a field are the important difference. All fields sometimes chase large effect sizes, which will give outcome #2 (when you can see the pattern visually in the data, methods aren’t going to change the story), and sometimes fields chase small effects, which will give outcome #1 (when the effect sizes are small and you have six control variables, it matters a lot how you analyze the data). But here’s the key: after we’ve completed our study with a single analysis path, we don’t know whether our question and results are in outcome #1 (different paths would give different answers) or #2 (different paths would give similar answers). If we knew that, we wouldn’t have done the study! Sometimes studies of weak effects come up with an estimate of a strong effect, and sometimes studies of a strong effect come up with an estimated weak effect. So trying to use statistics to tell us whether we are in #1 or #2 is circular.

This is a really key point – it might seem that the only way to tell if we are in #1 or #2 is to do some giant meta-study where we get a couple of dozen researchers to analyze the same question on the same dataset. That hardly seems practical! And the study being done on evolutionary ecology and conservation ecology questions could end up either in #1 or #2 (under my hypothesis, depending on whether they are giving researchers weak-effect or strong-effect datasets/problems), so that is not a comprehensive guide for all of ecology and evolution. What we really need is a meta-meta-study that does several dozen of these meta-studies and then analyzes how often #1 vs. #2 comes up (Jeremy has had these same thoughts). I’m willing to bet pretty heavily that ecology and evolution have both publications that are safe (scenario #2) and publications completely dependent on how the analysis was done (scenario #1).
In my own research in macroecology, I have been in scenarios where #1 is true and in scenarios where #2 is true.

## Couldn’t individual authors just explore alternative analysis paths?

If we can’t afford to conduct a meta-analysis with a few dozen researchers independently analyzing each data set for each unique question (and we surely can’t!), then what alternative is there? There is an obvious alternative. An individual researcher can explore these alternatives themselves. A lot of researchers already do this. I bet every reader of this post has at one time tried with and without a log-transform or OLS vs. GLM assumptions on residuals. And, nominally, a strong majority of ecologists think such robustness checks are a good thing according to Jeremy’s poll. So it’s hardly a novel concept. In short, yes, it is clearly possible for a single author to replicate the main benefits of these meta-analyses by individually performing a bunch of alternative analysis paths.
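As a minimal sketch of such a single-author robustness check (using the iris dataset that features later in this post), one can fit the same model under OLS-normal and Gamma-GLM residual assumptions and ask whether the qualitative story survives:

```r
data(iris)
# Same mean structure, two different assumptions about the residuals
m_ols <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Species, data = iris)
m_glm <- glm(Sepal.Length ~ Sepal.Width + Petal.Length + Species,
             data = iris, family = Gamma(link = "identity"))
# The robustness question: do both models tell the same qualitative story
# (positive slopes for both predictors)?
coef(m_ols)[c("Sepal.Width", "Petal.Length")]
coef(m_glm)[c("Sepal.Width", "Petal.Length")]
```

The exact coefficients differ a little, but the qualitative conclusion (which direction, roughly how strong) is the thing the check is testing.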

But there is a deep aversion to doing this in practice. It is labelled with terrible names like “p-hacking” and the “garden of forking paths” (with its implicit reference to temptation in the Garden of Eden). I know in my own experience as a reviewer, I must have had a dozen cases where I thought the outcome reported was dependent on the analysis method and asked for a re-analysis using an alternative method to prove me wrong. Sometimes the editor backs that request up (or the authors do it voluntarily). But a majority of the time they don’t. Indeed, I would say editors are much more receptive to “the stats are wrong, do them differently” than to “the stats are potentially not very sturdy, try it a 2nd way and report both”.

Thus even though it seems we kind of think such single author exploration of alternative analysis approaches is a good idea in a private poll, it’s pretty murky and cloaked in secrecy and disapproval from others in the published literature and peer review process.

And there are of course some strong reasons for this (some valid, some definitely not):

- The valid reason is that if an author tries 10 methods and then picks the one with the most preferred results and only reports that, it is really unethical (and misleading and bad for science) – although in private most scientists admit it is pretty common.
- The invalid reason is that doing multiple analyses could take a seemingly strong result (p<0.05 is all that matters, right?) and turn it into a murky result. It might be significant in some analyses and not significant in others. What happens if the author does the requested robustness check by analyzing the data a 2nd way and loses statistical significance? This is a really bad, but really human, reason to avoid multiple analyses. Ignorance is NOT bliss in science!

So how do we stay away from the bad scenario (reason #1 above) while acknowledging that motive #2 is bad for science in the long run (even if it is optimal for the individual scientist in the short run)?

Well, I think the solution is the same as for exploratory statistics: take it out of the closet, celebrate it as an approach, and brag about using it! If we’re supporting and rewarding researchers using this approach, they’re going to report it. And scenario #1 goes away. Unlike exploratory statistics, which at least had a name, this statistical approach has been so closeted it doesn’t even have a name.

## Sturdy statistics – a better approach to statistics than “one true way”

*So I propose the name/banner that I have always used in my head: “sturdy statistics”.* (Robust statistics might be a better name, but that has already been taken over for a completely different context of using rank statistics and other methods to analyze non-normal data.) The goal of sturdy statistics is to produce an analysis that is, well, sturdy! It stands up against challenges. It weathers the elements. Like the folk tale of the three little pigs – it is not a house/model made of straw that blows over at the first big puff of challenge (different assumptions and methods). I seek to be like pig #3, whose statistics are made of brick and don’t wobble every time a slightly different approach is used, and – important point – not only am I not afraid to have that claim tested, I WANT it tested.

A commitment to sturdy statistics involves:

- Running an analysis multiple different ways (an experienced researcher knows what alternative ways will be suggested to them, and we can help graduate students learn these).
- If the results are all qualitatively similar (and quantitatively close), then great! Report that the analyses all converged, so the results are really, truly sturdy.
- If the results are different, then this is the hard part where commitment to ethics comes in. I think there are two options:
  - Report the contrasting results (this may make it harder to get published, but I’m not sure it should – it would be more honest than making results appear sturdy by only publishing one analysis path thanks to shutting off reviewers who request alternative analysis paths).
  - A more fruitful path is likely digging in to understand why the different results happened. This may not be rewarding and may essentially leave you at the previous option. But in my experience it very often leads to deeper scientific understanding, which can then lead to a better article (although the forking paths should be reported, they don’t have to take center stage if you really figure out what is going on). For example, it may turn out the result really depends on the skew in the data and that there is interesting biology out in that tail of the distribution.

- As a reviewer or editor, make or support requests for alternative analyses. If they come back the same, then you know you have a really solid result to publish. If the authors come back saying your suggestion gave a different answer and they now understand why, then that is an honest, open result to be judged on its advancement of science. And if the results simply come back different, well, you’ve still done your job as a reviewer and improved science.
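The first step of this workflow can be sketched very compactly. Here is a hypothetical helper (the function name `same_sign` is mine, not from any package) that takes a list of candidate models sharing a predictor and reports whether its slope keeps the same sign throughout, using the iris models from later in this post:

```r
# Hypothetical helper: do all candidate models agree on the sign of a slope?
same_sign <- function(models, term) {
  signs <- sapply(models, function(m) sign(coef(m)[[term]]))
  length(unique(signs)) == 1
}

data(iris)
# Three of the alternative analysis paths discussed in this post
models <- list(
  ols    = lm(Sepal.Length ~ Sepal.Width + Petal.Length + Species, data = iris),
  logols = lm(log10(Sepal.Length) ~ Sepal.Width + Petal.Length + Species, data = iris),
  nosp   = lm(Sepal.Length ~ Sepal.Width + Petal.Length, data = iris)
)
same_sign(models, "Sepal.Width")   # TRUE: the positive slope survives all three
```

Agreement on sign (and rough magnitude) is the qualitative convergence the first bullet asks for; a `FALSE` here is exactly the signal to move to the hard, ethics-laden branch.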

## Sturdy statistics – An example

I’m going to use a very contrived example on a well-known dataset. Start with the Edgar Anderson (aka Fisher) iris dataset. It has measurements of sepal length and width and petal length and width (4 continuous variables) as well as species ID for 50 individuals in each of 3 species (N=150). It is so famous it has its own Wikipedia page and peer-reviewed publications on the best way to analyze it. It is most classically used as a way to explore multivariate techniques and to compare/contrast e.g. principal component analysis vs. unsupervised clustering vs. discriminant analysis, etc. However, I’m going to keep it really simple and parallel to the more common linear-model form of analysis.

So let’s say I want to model Sepal.Length as a function of Sepal.Width (another measure of sepal size), Petal.Length (another measure of overall flower size, and specifically length) and species name (in R, Sepal.Length~Sepal.Width+Petal.Length+Species). As you will see, this is a pretty reasonable thing to do (r²>0.8). But there are some questions. If I plot a histogram of Sepal.Length it looks pretty good, but clearly not quite normal (a bit right-skewed and platykurtic). On the other hand, if I log-transform it, I get something else that is not terrible but platykurtic, bimodal and a bit left-skewed (by the way, Box-Cox doesn’t help a lot – any exponent from -2 to 2 is almost equally good). One might also think including species is a cheat or not, so there is a question about whether it should be a covariate. And of course we have fixed vs. random effects (for species). I can very easily come up with 6 different models to run (see R code at the bottom): simple OLS as in the formula already presented; the same but with Sepal.Length log-transformed; the same as OLS but with species removed; species treated as random (one would hope that is not too different); or a GLM with a gamma distribution, which spans normal and lognormal shapes (though the default link function for gamma is log, and maybe it should be identity, giving two variants). And tellingly, you can find the iris data analyzed most if not all of these ways by people who consider themselves expert enough to write stats tutorials. Below are the results (the coefficients for the two explanatory variables – I left out species intercepts for simplicity – and r² and p-values where available – i.e. not for the GLMs).
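The Box-Cox claim above is easy to check directly with `MASS::boxcox`; a quick sketch (the nearly flat profile log-likelihood across the lambda grid is the point, not any single “best” exponent):

```r
library(MASS)   # for boxcox()
data(iris)
mols <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Species, data = iris)
# Profile log-likelihood over candidate transformation exponents
bc <- boxcox(mols, lambda = seq(-2, 2, 0.1), plotit = FALSE)
bc$x[which.max(bc$y)]   # the nominally best lambda in [-2, 2]
range(bc$y)             # how flat the profile is across the grid
```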

What can we make of this? Well, Sepal Width and Petal Length both covary pretty strongly and positively with Sepal Length and combine to make a pretty predictive (r²>0.8) and highly statistically significant model. That’s true in any of the 6 analyses. Log transforming doesn’t change that story (although the coefficients are a bit different and remain so even with back-transforming, but that’s not surprising). Using Gamma-distributed residuals doesn’t really change the story either. This is a *sturdy* result! Really the biggest instability we observe is that the relative strengths of Petal Length and Sepal Width change when species is or isn’t included (Petal Length appears more important with species, but Sepal Width is relatively more important without species*). So the relative importance of the two variables is conditional on whether species is included or not – a rather classic result in multivariate regression. If we dig into this deeper, we can see that in this dataset two species (virginica and versicolor) are largely overlapping (shared slope and intercept, at least), while setosa has a higher intercept but a similar slope vs. Sepal Width. Vs. Petal Length, however, the slope for setosa also differs substantially from the other two, so slope estimates will vary depending on whether you control species out or not (and maybe a variable-slope-and-intercept model should be used). So that one weak instability (non-sturdiness) is actually pointing a bright red sign at an interesting piece of biology that I might have ignored if I had only run one analysis (and at additional statistical directions I am not going to pursue in an already too long blog post). This is simultaneously a sturdy result and a case where the sturdiness analysis caused me to dig a bit deeper into the data and learn something biologically interesting. Win all around!
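One way to chase down that instability (a sketch, not the only way) is to let the Petal.Length slope vary by species and ask whether per-species slopes improve on a common slope:

```r
data(iris)
# Common slope vs. per-species slopes for Petal.Length
m_add <- lm(Sepal.Length ~ Petal.Length + Species, data = iris)
m_int <- lm(Sepal.Length ~ Petal.Length * Species, data = iris)
anova(m_add, m_int)  # F-test: do per-species slopes improve the fit?
coef(m_int)          # setosa is the reference level; the interaction terms
                     # measure how the versicolor/virginica slopes differ
```

This is the “variable slope and intercept” direction mentioned above, kept to fixed effects for brevity.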

And in a peer review context, that exercise hopefully saves time (reviewers not needing to request additional analyses), is fully transparent on the analyses done (no buried p-hacking), and draws convincing conclusions that leave science in a better place than if I had just chosen one analysis and doubled down on insisting it was right.

## Conclusions

*TL;DR: Sometimes the answer to a question on a dataset is sturdy against various analysis approaches. Sometimes it’s not. We can’t know a priori which scenario we are in. The logical solution is to actually try different analyses and show our result is “sturdy” – hence an inference approach I call “sturdy statistics”. To avoid this turning into p-hacking it is important that we embrace sturdy statistics and encourage honest reporting of our explorations. But even if you don’t like sturdy statistics, we have to get over the notion of “one right way” to analyze the data and come up with a way of finding out whether multiple, reasonable analysis paths lead to different results or not, and deciding what to do if they do.*

What do you think? Do you like sturdy statistics? Do you already practice sturdy statistics (secretly or in the open)? Do you think the risk of sturdy statistics leading to p-hacking is too great? Or is the risk of p-hacking already high and sturdy statistics is a way to reduce its frequency? What needs to change in peer review to support sturdy statistics? Is there an alternative to sturdy statistics to address the many, many reasonable paths through analysis of one data set?

*NB: to really do this just by looking at coefficients I would need standardized independent variables, but the mean and standard deviation of the two variables are close enough and the pattern is strong enough and I am only making relative claims, so I’m going to keep it simple here.
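(For completeness, the footnote’s caveat is easy to address: wrapping the predictors in `scale()` standardizes them, so the coefficients become per-standard-deviation effects and are directly comparable. A sketch:)

```r
data(iris)
# Standardized predictors: coefficients are now change in Sepal.Length
# per one standard deviation of each predictor
m_std <- lm(Sepal.Length ~ scale(Sepal.Width) + scale(Petal.Length) + Species,
            data = iris)
coef(m_std)[c("scale(Sepal.Width)", "scale(Petal.Length)")]
```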

## R Code

```
data(iris)
str(iris)
#simplest model
mols<-lm(Sepal.Length~Sepal.Width+Petal.Length+Species,data=iris)
# log transform sepal length?
mlogols<-lm(log10(Sepal.Length)~Sepal.Width+Petal.Length+Species,data=iris)
#role of species as a covariate?
mnosp<-lm(Sepal.Length~Sepal.Width+Petal.Length,data=iris)
#species as random instead of fixed (shouldn't really differ except d.f.)
library(lme4)
mrndsp<-lmer(Sepal.Length~Sepal.Width+Petal.Length+(1|Species),data=iris)
#Gamma residuals (a good proxy for lognormal)
# with the default log link
mgamlog<-glm(Sepal.Length~Sepal.Width+Petal.Length+Species, data=iris, family=Gamma(link="log"))
# with an identity link instead (no log)
mgamident<-glm(Sepal.Length~Sepal.Width+Petal.Length+Species,
data=iris, family=Gamma(link="identity"))
# Is Sepal.Length better log transformed or raw?
hist(iris$Sepal.Length)
hist(log(iris$Sepal.Length))
#hmm not so obvious either way
#do these choices matter?
# pick out relevant pieces of the result from lm, glm, or lmer objects
# and return them as one row of a data frame
report1asdf <- function(mobj) {
  co <- coef(mobj)
  if (!is.numeric(co)) { # lmer: coef() returns per-group data frames
    co <- as.numeric(co$Species[1, ])
    co[1] <- NA # the intercept varies by species, so don't report it
  }
  s <- summary(mobj)
  # glm and lmer summaries carry no r^2 or F statistic
  if (is.null(s$r.squared)) s$r.squared <- NA
  if (is.null(s$fstatistic)) s$fstatistic <- c(NA, NA, NA)
  data.frame(
    #CoefInt=co[1],
    CoefSepW = co[2],
    CoefPetL = co[3],
    r2 = s$r.squared,
    p = pf(s$fstatistic[1], s$fstatistic[2], s$fstatistic[3], lower.tail = FALSE)
  )
}
#assemble a table as a dataframe then print it out
res<-data.frame(CoefSepW=numeric(),CoefPetL=numeric(), r2=numeric(),p=numeric())
res<-rbind(res,report1asdf(mols))
res<-rbind(res,report1asdf(mlogols))
res<-rbind(res,report1asdf(mnosp))
res<-rbind(res,report1asdf(mrndsp))
res<-rbind(res,report1asdf(mgamlog))
res<-rbind(res,report1asdf(mgamident))
row.names(res)=c("OLS","OLS Log","OLS No Sp.","Rand Sp","GamLog","GamIdent")
print(res)
```