Book review: The Knowledge Machine by Michael Strevens

Last spring I read and discussed Michael Strevens’ The Knowledge Machine: How Irrationality Created Modern Science with my lab group. Here’s my review.

tl;dr: a scientist who wants to dip a toe into philosophy of science could do much worse than to start with this book.

The death knell for the “one right way” approach to statistics? and the need for Sturdy Statistics

Last week Jeremy linked to yet another study where expert researchers (social scientists in this case) were asked to analyze the same dataset. The key findings were: a) that the experts had broad disagreements about the best way to analyze the data, and b) that these differences were consequential in leading to totally different outcomes (positive, negative or no statistically significant effect). This should hardly be news; earlier studies have found this in another social science dataset and in fMRI scans in neurobiology.

You really have to pause and let that sink in. Established researchers each think a different method of analysis is best, and these different methods give completely different, even opposite, answers for the same data on the same question. Even controlling for subtly different questions or experimental designs, the answer you get depends entirely on which person you hand the data to for analysis!

This should be the death of the “one right way” approach to statistics

It is hard to read these studies as anything other than a major blow to the view that there is one right way to do statistics. Or at least it should be. I assume some will continue to think that there really is “one right way” (theirs) and that the variance occurs because most of the researchers (everybody else) are just plain wrong. But that is a bit too egocentric and lacking in perspective to my mind. I know I’ve offended people over my blogging history on this point (which has not been my intent), but I just find it impossible to accept that there is only one right way to analyze any statistical problem. Statistics are (probabilistic) models of reality, and it is impossible to have a perfect model, so all models involve tradeoffs. And these new studies feel like a sledgehammer of evidence against the view that there is one right way (even as they make us aware of how inconvenient it is that there is not just one right way).

When you look at the differences among researchers’ approaches reported in these studies, they’re not matters where one could dismiss an approach as grossly wrong. They’re the kinds of things people debate all the time. Log transform or square root transform or no transform (yes, sometimes log or no-log is clearly better, but there are a lot of datasets out there where neither is great and it is a matter of judgment which is better – I’ll give an example below). Or OLS vs. logistic vs. some other GLM. Multiple regression vs. principal component regression vs. regression tree. AIC-based vs. automated vs. researcher-driven variable selection. Include a covariate or leave it out. And so on. There is no such thing as the “one true right way” to navigate these choices. And as these meta-studies show, the differences of opinion are not so trivial that we can ignore them either – conclusions can change drastically. So, again, these results really should give us pause. Ninety percent of our published articles might have come to an opposite conclusion if somebody else had done the stats, even with the same data! (And “one person was smart and the other dumb” is not really a viable explanation.)
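
To make this concrete, here is a minimal sketch (simulated data, not from any of the studies above) of the kind of judgment call I mean – none of the three options below is obviously wrong, and reasonable analysts could defend any of them:

#Minimal sketch with simulated data: log, square root, or no transform?
set.seed(1)
x<-runif(100,1,10)
y<-exp(0.3*x+rnorm(100,sd=0.6)) #right-skewed response
m_raw<-lm(y~x)        #no transform
m_log<-lm(log(y)~x)   #log transform
m_sqrt<-lm(sqrt(y)~x) #square root transform
#residual plots look tolerable for more than one of these; which is "best"
#is exactly the sort of defensible disagreement described above
par(mfrow=c(1,3))
plot(m_raw,which=1); plot(m_log,which=1); plot(m_sqrt,which=1)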

Is ecology and evolution different?

Or maybe the ecology literature is safe? For those of us in ecology and evolution, our time to find out is coming. A similar study is underway right now. I suppose logically there could be two outcomes. 1) Same results as previous studies – total researcher-dependent chaos. 2) Different results – the chosen question and dataset have a strong answer, and lots of different methods recover the same answer (qualitatively – small changes in effect sizes and p-values are OK). A lot of people, in response to Jeremy’s question of what to do about these studies, seemed to be really thinking (hoping?) that ecology is different and will come out with outcome #2.

Personally, I doubt it. I don’t think fields are that different. Different questions within a field are the important difference. All fields sometimes chase large effect sizes, which will give outcome #2 (when you can see the pattern visually in the data, methods aren’t going to change the story), and sometimes chase small effects, which will give outcome #1 (when the effect sizes are small and you have six control variables, it matters a lot how you analyze the data). But here’s the key: after we’ve completed our study with a single analysis path, we don’t know whether our question and results are in outcome #1 (different paths would give different answers) or #2 (different paths would give similar answers). If we knew that, we wouldn’t have done the study! Sometimes studies of weak effects come up with an estimate of a strong effect, and sometimes studies of strong effects come up with an estimate of a weak effect. So trying to use statistics to tell us whether we are in #1 or #2 is circular. This is a really key point – it might seem that the only way to tell whether we are in #1 or #2 is to do some giant meta-study in which we get a couple of dozen researchers to analyze the same question on the same dataset. That hardly seems practical! And the study being done on evolutionary ecology and conservation ecology questions could end up in either #1 or #2 (depending, per my hypothesis, on whether the researchers are given weak-effect or strong-effect datasets/problems), so it is not a comprehensive guide for all of ecology and evolution. What we really need is a meta-meta-study that does several dozen of these meta-studies and then analyzes how often #1 vs. #2 comes up (Jeremy has these same thoughts). I’m willing to bet pretty heavily that ecology and evolution have both publications that are safe (scenario #2) and publications that are completely dependent on how the analysis was done (scenario #1). In my own research in macroecology, I have been in scenarios where #1 is true and in scenarios where #2 is true.

Couldn’t individual authors just explore alternative analysis paths?

If we can’t afford to conduct a meta-analysis with a few dozen researchers independently analyzing each data set for each unique question (and we surely can’t!), then what alternative is there? There is an obvious alternative. An individual researcher can explore these alternatives themselves. A lot of researchers already do this. I bet every reader of this post has at one time tried with and without a log-transform or OLS vs. GLM assumptions on residuals. And, nominally, a strong majority of ecologists think such robustness checks are a good thing according to Jeremy’s poll. So it’s hardly a novel concept. In short, yes, it is clearly possible for a single author to replicate the main benefits of these meta-analyses by individually performing a bunch of alternative analysis paths.
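
As a minimal sketch of what that looks like in practice (using the built-in mtcars data purely for illustration – not any of the ecological examples above), you can fit a handful of defensible variants and ask whether the qualitative conclusion about a focal predictor holds up:

#Sturdiness-check sketch: do the sign and significance of the wt coefficient
#survive a few reasonable changes in model assumptions?
data(mtcars)
fits<-list(
  ols=lm(mpg~wt+hp,data=mtcars),
  logresp=lm(log(mpg)~wt+hp,data=mtcars),
  gammaglm=glm(mpg~wt+hp,data=mtcars,family=Gamma(link="log"))
)
sapply(fits,function(m){
  cf<-summary(m)$coefficients["wt",]
  c(sign=unname(sign(cf["Estimate"])),p=unname(cf[4]))
})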

But there is a deep aversion to doing this in practice. It gets labelled with damning names like “p-hacking” and the “garden of forking paths” (a term borrowed from Borges, though it is usually invoked with overtones of temptation). I know in my own experience as a reviewer, I must have had a dozen cases where I thought the reported outcome depended on the analysis method and asked for a re-analysis using an alternative method to prove me wrong. Sometimes the editor backs that request up (or the authors do it voluntarily). But a majority of the time they don’t. Indeed, I would say editors are much more receptive to “the stats are wrong, do them differently” than to “the stats are potentially not very sturdy, try it a second way and report both”.

Thus, even though in a private poll we seem to think such single-author exploration of alternative analysis approaches is a good idea, in the published literature and the peer review process the practice remains murky, cloaked in secrecy, and met with disapproval from others.

And there are of course some strong reasons for this (some valid, some definitely not):

  1. The valid reason: if an author tries 10 methods, picks the one with the most preferred result, and reports only that, it is really unethical (and misleading and bad for science) – although in private most scientists admit it is pretty common.
  2. The invalid reason: doing multiple analyses could take a seemingly strong result (p<0.05 is all that matters, right?) and turn it into a murky one. It might be significant in some analyses and not significant in others. What happens if the author does the requested robustness check by analyzing the data a second way and loses statistical significance? This is a really bad, but really human, reason to avoid multiple analyses. Ignorance is NOT bliss in science!

So how do we stay away from the bad scenario (reason #1 above) while acknowledging that motive #2 is bad for science in the long run (even if it is optimal for the individual scientist in the short run)?

Well, I think the solution is the same as for exploratory statistics: take it out of the closet, celebrate it as an approach, and brag about using it! If we’re supporting and rewarding researchers who use this approach, they’re going to report it. And scenario #1 goes away. Unlike exploratory statistics, which at least has a name, this statistical approach has been so closeted it doesn’t even have one.

Sturdy statistics – a better approach to statistics than “one true way”

So I propose the name/banner that I have always used in my head: “sturdy statistics”. (Robust statistics might be a better name, but that term has already been taken over for a completely different context: using rank statistics and other methods to analyze non-normal data.) The goal of sturdy statistics is to produce an analysis that is, well, sturdy! It stands up against challenges. It weathers the elements. Like the folk tale of the three little pigs, it is not a house/model made of straw that blows over at the first big puff of challenge (different assumptions and methods). I seek to be like pig #3, whose statistics are made of brick and don’t wobble every time a slightly different approach is used – and, important point, not only am I not afraid to have that claim tested, I WANT it tested.

A commitment to sturdy statistics involves:

  1. Running an analysis multiple different ways (an experienced researcher knows what alternative ways will be suggested to them, and we can help graduate students learn these).
  2. If the results are all qualitatively similar (and quantitatively close), then great: report that the analyses all converged, so the results are really, truly sturdy.
  3. If the results differ, then this is the hard part, where commitment to ethics comes in. I think there are two options:
    1. Report the contrasting results (this may make it harder to get published, but I’m not sure it should – it is more honest than making results appear sturdy by publishing only one analysis path and shutting down reviewers who request alternatives).
    2. A more fruitful path is likely digging in to understand why the results differ. This may not pay off, essentially leaving you at option 3a. But in my experience it very often leads to deeper scientific understanding, which can then lead to a better article (although the forking paths should be reported, they don’t have to take center stage if you really figure out what is going on). For example, it may turn out the result really depends on the skew in the data, and that there is interesting biology out in that tail of the distribution.
  4. As a reviewer or editor, make or support requests for alternative analyses. If the analyses come back the same, then you know you have a really solid result to publish. If the authors come back saying your suggestion gave a different answer and they now understand why, then judge the new finding on its advancement of science. And if the analyses simply come back different, well, you’ve still done your job as a reviewer and improved science.

Sturdy statistics – An example

I’m going to use a very contrived example on a well-known dataset. Start with the Edgar Anderson (aka Fisher) iris dataset. It has measurements of sepal length and width and petal length and width (4 continuous variables), as well as species ID, for 50 individuals in each of 3 species (N=150). It is so famous it has its own Wikipedia page and peer-reviewed publications on the best way to analyze it. It is most classically used as a way to explore multivariate techniques and to compare/contrast e.g. principal component analysis vs. unsupervised clustering vs. discriminant analysis, etc. However, I’m going to keep it really simple and parallel to the more common linear model form of analysis.

So let’s say I want to model Sepal.Length as a function of Sepal.Width (another measure of sepal size), Petal.Length (another measure of overall flower size, and specifically length) and species name (in R: Sepal.Length~Sepal.Width+Petal.Length+Species). As you will see, this is a pretty reasonable thing to do (r2>0.8). But there are some questions. If I plot a histogram of Sepal.Length it looks pretty good, but clearly not quite normal (a bit right-skewed and platykurtic). On the other hand, if I log-transform it, I get something else that is not terrible but platykurtic, bimodal and a bit left-skewed (by the way, Box-Cox doesn’t help a lot – any exponent from -2 to 2 is almost equally good). One might also think including species is a cheat or not, so there is a question about whether that should be a covariate. And of course we have fixed vs. random effects (for species). I can very easily come up with 6 different models to run (see R code at the bottom): simple OLS as in the formula already presented; the same but with Sepal.Length log-transformed; the same as OLS but with species removed; species treated as random (one would hope that is not too different); and a GLM with a gamma distribution, which spans normal and lognormal shapes, fitted with both the default log link and an identity link. And tellingly, you can find the iris data analyzed in most if not all of these ways by people who consider themselves expert enough to write stats tutorials. Below are the results (the coefficients for the two explanatory variables – I left out species intercepts for simplicity – and r2 and p-values where available, i.e. not for the GLM or mixed-model fits).

Results of a sturdy analysis of the iris data predicting Sepal Length

What can we make of this? Well, Sepal Width and Petal Length both covary pretty strongly and positively with Sepal Length, and together they make a pretty predictive (r2>0.8) and highly statistically significant model. That’s true in any of the 6 analyses. Log-transforming doesn’t change that story (the coefficients are a bit different, and remain so even after back-transforming, but that’s not surprising). Using Gamma-distributed residuals doesn’t really change the story either. This is a sturdy result! Really, the biggest instability we observe is that the relative strengths of Petal Length and Sepal Width change when species is or isn’t included (Petal Length appears more important with species, but Sepal Width is relatively more important without species*). So the relative importance of the two variables is conditional on whether species is included – a rather classic result in multiple regression. If we dig into this more deeply we can see that, in this dataset, two species (virginica and versicolor) largely overlap (shared slope and intercept, at least), while setosa has a higher intercept but a similar slope vs. Sepal Width; vs. Petal Length, however, the slope for setosa also differs substantially from the other two, so slope estimates will vary depending on whether you control for species (and maybe a variable-slope-and-intercept model should be used). So that one weak instability (non-sturdiness) is actually pointing a bright red sign at an interesting piece of biology that I might have ignored if I had only run one analysis (and at additional statistical directions I am not going to pursue in an already too-long blog post). This is simultaneously a sturdy result and a case where the sturdiness analysis made me dig a bit deeper into the data and learn something biologically interesting. Win all around!
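
Here is a minimal sketch of that follow-up (same iris data as the code at the bottom of the post; the interaction model is an illustration of the extra direction, not one of the six analyses reported above):

#Follow-up sketch: do the slopes differ by species?
data(iris)
mcommon<-lm(Sepal.Length~Petal.Length+Species,data=iris)  #shared slope
mvarying<-lm(Sepal.Length~Petal.Length*Species,data=iris) #species-specific slopes
anova(mcommon,mvarying) #does letting slopes vary by species improve the fit?
coef(mvarying)          #interaction terms = slope differences from the baseline species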

And in a peer review context, that exercise hopefully saves time (reviewers not needing to request additional analyses), is fully transparent on the analyses done (no buried p-hacking), and draws convincing conclusions that leave science in a better place than if I had just chosen one analysis and doubled down on insisting it was right.

Conclusions

TL;DR: Sometimes the answer to a question on a dataset is sturdy against various analysis approaches. Sometimes it’s not. We can’t know a priori which scenario we are in. The logical solution is to actually try different analyses and show that our result is “sturdy” – hence an inference approach I call “sturdy statistics”. To avoid this turning into p-hacking, it is important that we embrace sturdy statistics and encourage honest reporting of our explorations. But even if you don’t like sturdy statistics, we have to get over the notion of “one right way” to analyze the data and come up with a way of finding out whether multiple reasonable analysis paths lead to different results or not, and of deciding what to do if they do.

What do you think? Do you like sturdy statistics? Do you already practice sturdy statistics (secretly or in the open)? Do you think the risk of sturdy statistics leading to p-hacking is too great? Or is the risk of p-hacking already high and sturdy statistics is a way to reduce its frequency? What needs to change in peer review to support sturdy statistics? Is there an alternative to sturdy statistics to address the many, many reasonable paths through analysis of one data set?

*NB: to really do this just by looking at coefficients I would need standardized independent variables, but the mean and standard deviation of the two variables are close enough and the pattern is strong enough and I am only making relative claims, so I’m going to keep it simple here.
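
For completeness, here is a minimal sketch of the standardization mentioned in that footnote – refitting the main model with scaled predictors so the two coefficients are directly comparable:

#Standardize the two continuous predictors, then refit the main model
data(iris)
iris_std<-transform(iris,Sepal.Width=scale(Sepal.Width),Petal.Length=scale(Petal.Length))
coef(lm(Sepal.Length~Sepal.Width+Petal.Length+Species,data=iris_std))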

R Code

data(iris)
str(iris)

#simplest model
mols<-lm(Sepal.Length~Sepal.Width+Petal.Length+Species,data=iris)
# log transform sepal length?
mlogols<-lm(log10(Sepal.Length)~Sepal.Width+Petal.Length+Species,data=iris)
#role of species as a covariate?
mnosp<-lm(Sepal.Length~Sepal.Width+Petal.Length,data=iris)
#species as random instead of fixed (shouldn't really differ except d.f.)
library(lme4)
mrndsp<-lmer(Sepal.Length~Sepal.Width+Petal.Length+(1|Species),data=iris)
#Gamma residuals (a good proxy for lognormal)
# with the default log link
mgamlog<-glm(Sepal.Length~Sepal.Width+Petal.Length+Species,
             data=iris, family=Gamma(link="log"))
#Gamma residuals with an identity link (no log)
mgamident<-glm(Sepal.Length~Sepal.Width+Petal.Length+Species,
               data=iris, family=Gamma(link="identity"))
# Is Sepal.Length better log transformed or raw?
hist(iris$Sepal.Length)
hist(log(iris$Sepal.Length))
#hmm not so obvious either way
#do these choices matter?

#pick out the relevant pieces of the results from lm, glm, or lmer objects
#and return them as a one-row data frame
report1asdf <- function(mobj) {
  co=coef(mobj)
  if (!is.numeric(co)) {co=as.numeric(co$Species[1,]); co[1]=NA} #lmer: coef() returns per-species rows; keep the shared slopes, drop the species-specific intercept
  s=summary(mobj)
  #handle GLM and mixed-model fits, which don't report r2 or an F statistic
  if (is.null(s$r.squared)) s$r.squared=NA
  if (is.null(s$fstatistic)) s$fstatistic=c(NA,NA,NA)
  data.frame(
    #CoefInt=co[1],
    CoefSepW=co[2],
    CoefPetL=co[3],
    r2=s$r.squared,
    p=pf(s$fstatistic[1],s$fstatistic[2],s$fstatistic[3],lower.tail=FALSE)
  )
}

#assemble a table as a dataframe then print it out
res<-data.frame(CoefSepW=numeric(),CoefPetL=numeric(), r2=numeric(),p=numeric())
res<-rbind(res,report1asdf(mols))
res<-rbind(res,report1asdf(mlogols))
res<-rbind(res,report1asdf(mnosp))
res<-rbind(res,report1asdf(mrndsp))
res<-rbind(res,report1asdf(mgamlog))
res<-rbind(res,report1asdf(mgamident))
row.names(res)=c("OLS","OLS Log","OLS No Sp.","Rand Sp","GamLog","GamIdent")
print(res)

Scientific bets vs. scientific influence

Martin Rees and Steven Pinker have a new piece in New Statesman about their friendly bet from a number of years ago, on whether “bioterror or bioerror” would kill a million people in a single event within a 6-month period before 31 Dec. 2020. Rees bet yes, Pinker took the other side. The winner of the bet depends on whether the “lab leak” hypothesis for the origin of Covid-19 turns out to be true. (Here’s a brief overview of the evidence on that.)

I’m not so interested in Rees and Pinker’s piece; it’s mostly a generic popsci overview of tail risks in biomedical research. I presume the reason it’s in a fairly high-profile outlet like New Statesman is some combination of (i) Rees and Pinker are famous, and (ii) the betting angle.

I think the interplay of (i) and (ii) is kind of interesting. I’ve written a lot in the past about scientific bets, and their cousin, adversarial collaborations. The purported purpose of scientific bets, and adversarial collaborations, is to settle scientific disputes. The idea is that, by committing to one side of a claim, and to the evidence that would decide its truth, you tie your own hands. Making a public bet forces you to make a specific, unambiguous, checkable claim, and so forces you to admit if that claim is right or wrong. That’s the theory, anyway, but in practice it doesn’t work out that way. Based on the examples I know of, I don’t think scientific bets and adversarial collaborations actually settle disputes (case in point). I can’t think of a case in which the loser of a public scientific bet admitted to being in the wrong. People can always find some excuse to retroactively wriggle out of the bet, or out of the scientific implications of the bet.

But just because scientific bets don’t serve their purported purpose doesn’t mean they don’t serve some other purpose.

I think the main purpose of public scientific bets is to gain or maintain influence. The bettors might not care so much about winning or losing the bet. They mostly just want others to pay attention to them, and to whatever scientific issue they’re betting on. Which only works if you’ve already got some measure of influence, I think. I mean, if two scientists with no public profile had made the same bet as Rees and Pinker, would anyone else care?

But I wouldn’t criticize Rees and Pinker for making their bet, and then taking up an obvious opportunity to write about it in the New Statesman. Lots of scientists, from the most famous down to the most obscure, try to bring themselves and their ideas to the attention of others, in all sorts of ways. Different ways of doing that will be most effective for different people. As best I can tell, there’s not much agreement as to which ways of seeking attention and influence should be totally beyond the pale, if any.

So I think it’s fine for Rees and Pinker to make a public bet and then write about it. Just as I think it’s fine for me to write a quickie blog post about it. They’re doing what works for them, and I’m doing what works for me. 🙂

p.s. There is a meaty post from Brian in the queue for Monday! This post is like chips and dip, tiding you over until your next proper meal. 🙂

What conclusions should you draw from a dataset when different analysts reach different conclusions?

The latest many analysts, one dataset project is out as an unreviewed preprint, and this one has the most depressing conclusions of the ones I’ve seen. Breznau et al. gave 73 social science research teams the same dataset, asking them to estimate the effect of immigration on public support for social policies. The point estimates were centered on zero but varied massively. A substantial minority of research teams reported a statistically significant negative effect, and another, almost equally large minority reported a statistically significant positive effect. All this is in contrast to previous many analysts, one dataset projects I’ve seen, in which most or all analysts at least agreed about the sign of the effect of interest.

The variation among analysts in the Breznau et al. study arose because there were a lot of analytical choices that varied among research teams, no one of which had a big “main effect” on the outcome of the analysis. So it’s not that, say, omitting one specific influential observation always reduces the estimated effect of immigration by some massive amount, independent of your other analytical choices.

On the plus side, at least you couldn’t explain much of the among-analyst variation from knowledge of analysts’ prior beliefs about the topic. Because it would be really depressing to live in a world in which every analyst was consciously or subconsciously putting a thumb on the scale, to reach the conclusions they already “knew” to be true.

Nor could you explain much of the among-analyst variation from knowledge of analysts’ methodological expertise. I find that result both unsurprising and reassuring. But I’m curious if any of you are either surprised or bothered by it.

Question: how exactly would you phrase the take-home message here regarding effects of immigration on public support for social policies? Would you say that the dataset is “uninformative” about the effect of immigration? That there’s “no evidence” for an effect? “Mixed evidence”? “Conflicting evidence”? Strong evidence against an effect of immigration? Or what? I’m torn.

Another question: I wonder if at some point we’ll have enough of these many analysts, one dataset projects to do some comparative analyses of them? If we ever do, here’s my pet hypothesis as to what we’ll find: the among-analyst variability will be highest in cases when the mean estimate (averaging over all analysts) is close to zero. My other pet hypothesis is that among-analyst variance typically will be as large or larger than among-dataset variance. That is, if you gave the same analysts two different datasets addressing the same question, I bet there’d usually be more variation in results among analysts than among datasets. (The analyst x dataset interaction term might be big too.)
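
If such a collection of meta-studies ever exists, here is a rough, purely hypothetical sketch of how those variance components might be estimated. The data frame meta is made up: one row per analyst-by-dataset effect estimate, with columns estimate, analyst, and dataset:

#Hypothetical variance-components sketch for a meta-meta-analysis
library(lme4)
vcmod<-lmer(estimate~1+(1|analyst)+(1|dataset)+(1|analyst:dataset),data=meta)
VarCorr(vcmod) #compare among-analyst, among-dataset, and interaction variances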

Nobody writes papers complaining about the state of ecology any more (do they?)

Earlier this week, I was interested to read Liam Bright arguing that his own field (analytic philosophy) is a degenerating research program, and that ongoing attempts to salvage it by making it “relevant” or “practical” aren’t likely to succeed.

Even if you don’t think your own field is degenerating, I think it can be good mental exercise to consider the possibility. I’ve done so in the past in a somewhat tongue in cheek way, designing an imaginary graduate course on whether ecology is f*cked or awesome. That imaginary graduate course mostly wasn’t based on my own concerns or complaints about the state of ecology. It was based on other people’s concerns and complaints.

I bring this up because, reading Liam Bright’s piece, it struck me that I haven’t read any similar critique from an ecologist in almost a decade, with the exception of pieces critiquing ecologists’ statistical methods. I mean, click that last link–most of the readings I proposed in the imaginary graduate course are old! Heck, most of them were old back when I wrote the post in 2016, and now they’re even older. And the few that aren’t old are either (i) about statistics, (ii) pretty narrowly focused (e.g., critiques of NEON), or (iii) not really about ecologists’ research practices at all (e.g., concerns about climate change).

Now, maybe I’ve missed a bunch of stuff (please tell me if I have!) And maybe I’m just wrong–maybe such pieces have always been rare events and they aren’t any rarer nowadays. But if I’m right that such pieces are rarer these days, why is that?

Maybe it’s because the field of ecology has grown more sophisticated in its statistical methods, and more focused on global change and conservation? So worries about our statistical methods loom large in our minds. Whereas worrying that global change or conservation research wasn’t worth doing would risk crossing the line from “good mental exercise” to “trolling”.

What do you think? Am I just off base here? (Quite possible!) And if I’m not off base, what do you think has changed?

p.s. If something’s changed, I don’t know that it’s necessarily a bad thing, or a good thing. It’d depend on the reasons for the change.