Ecologists like robustness checks–especially if other people do them

Yesterday, I polled y’all on robustness checks–different ways of doing a statistical analysis that lead to the same broad conclusion, thereby indicating that the conclusion is robust. I asked whether, as an author, your papers usually include robustness checks, and if so, where (in the main text, in an appendix, etc.). And I asked, as a reviewer or reader, if you usually want authors to include robustness checks, and if so, where.

Here are the poll results so far! They’re pretty interesting, and surprising to Brian (I wasn’t that surprised). So you should totally read on.

As of this writing, we have 88 responses to the question about what you usually do as an author, and 78 responses to the question about what you usually prefer as a reader or reviewer. Presumably, a few respondents overlooked the second question. Thanks to everyone who responded! As usual, this isn’t a random sample of ecologists, or even of our readers. But it’s a bigger and likely more diverse sample than, say, your lab group or journal club. So it seems worth talking about.

Here’s a graph of the percentage of respondents choosing each of the options. Blue bars are respondents indicating what they usually do as authors, orange bars are respondents indicating what they usually prefer as reviewers or readers:

Here are the take-home points:

  • Ecologists like robustness checks. This is the result that surprised Brian. Only a small percentage of respondents to either question said they usually do or prefer no robustness checks. But I wonder how many authors only do them because reviewers ask for them, or because they think reviewers will ask for them, because…
  • Ecologists like robustness checks more when it’s other people doing them. Doing and reporting robustness checks is work. Ecologists are human; we like work best when it happens to other people. 🙂 Over 60% of respondents usually prefer robustness checks to be reported in an appendix when they’re acting as readers or reviewers, vs. only 49% when they’re acting as authors. In other words, there are some ecologists who want others to report their robustness checks in an appendix, but who don’t usually do so themselves. What do those authors do instead? Well, they mostly either do the robustness checks but tell the reader that the results are “not shown”, or else they don’t do the robustness checks at all. Those are the two options that were chosen by more authors than readers/reviewers. Conversely, putting robustness checks in the main text–the option that involves the most work–was the preferred option of more readers/reviewers than authors.

These results remind me of how everyone simultaneously complains that peer review takes so long, and also complains about being asked to do peer reviews quickly. Everybody wants to go to heaven, but nobody wants to die, as the saying goes.

But we shouldn’t be too cynical about this. One way to ensure that everybody does something that few people want to do is to create and enforce a norm that everyone will do the thing. We all hold each other to higher standards than we would hold ourselves to if we were all left to our own devices. That’s mostly a good thing.

12 thoughts on “Ecologists like robustness checks–especially if other people do them”

  1. This reminds me of an experience I had with a manuscript written by a student lead-author.

    In the first draft, we used multiple regression and came to the conclusion that the species in question was more likely to be found on sun-exposed slopes. A reviewer suggested we use model selection based on AIC instead, which we did and – again – found that the species was more likely to be found on sun-exposed slopes. We resubmitted and the manuscript was sent to a different reviewer who suggested we also account for detection probability. Again, we re-analysed the data as requested and showed that the species was more likely to be found on sun-exposed slopes. This wasn’t surprising because this pattern was clear from eye-balling the data.

    It has now been about 18 months since the initial submission and the paper is still not published (granted, Covid played a part in the delay too).

    I don’t mind too much because the reviewers were technically correct each time and the additional tests strengthened my confidence in the conclusion. But it was a rough experience for the student who is still trying to publish his first paper.

    It worries me that the student might learn from this experience that in order to get your work published, you have to start with the most complex statistical methods first. Maybe all this could’ve been avoided had we done robustness checks from the beginning?

      • I agree that editors should play a role in preventing runaway statistical complexity. But it is difficult for them to know whether the conclusions are indeed robust to the choice of statistical test without the authors showing this in subsequent revisions.

        My example only seems obvious in hindsight because the data were clear enough that the choice of statistics didn’t affect the conclusions. It would be much messier had the alternative statistical tests pointed towards different conclusions…

  2. In my experience robustness checks arise quite naturally in many projects, without explicit planning. For, say, 4-5 decisions, we discuss that the analysis could be done this way or that way (include this or that covariate, how to quantify a covariate, multiple regression or AIC as in Falko’s case), and unless one option seems clearly more appropriate, we often say “try it both ways and if it comes out the same, we’re all good”. However, if you do that five times, you have 2^5 = 32 possible analyses, so we might only check a choice when there is some reason to suspect it might matter. And the only times it does seem to matter is for effects that are weak to begin with, in which case we should already be phrasing conclusions with an appropriate level of uncertainty. I suspect that this is most problematic when p < 0.05 is taken too seriously (still often the case). If three analyses come out with p = 0.03, 0.05 or 0.07, then any conclusion about what is happening biologically should not be overstated, even if you only did one of the three (any one). If we’re prioritizing effort, I’d say robustness checks are most important when the answers to the *core* questions are ambiguous. In Falko’s case, it sounds like the conclusion was clearly not going to change, but the reviewers just had some pet thing they would have done differently (not a great justification for more effort).
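The combinatorial explosion the comment above describes is easy to make concrete. A hypothetical sketch (the decision names are invented for illustration, not taken from any real analysis):

```python
# Enumerate the "forking paths" created by a handful of two-way analysis
# decisions. Each combination of choices is a distinct possible analysis.
from itertools import product

decisions = {
    "covariate_included": [True, False],       # include the extra covariate?
    "covariate_scale": ["raw", "log"],         # how to quantify it
    "model_choice": ["regression", "AIC"],     # multiple regression vs. AIC selection
    "outlier_handling": ["keep", "drop"],
    "error_structure": ["gaussian", "poisson"],
}

variants = list(product(*decisions.values()))
print(len(variants))  # 2**5 = 32 possible analyses
```

Five binary choices already give 32 analyses, which is why "try it both ways" is usually reserved for the choices that seem most likely to matter.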

  3. I think that we sometimes learn that we have to perform a single analysis and use its conclusions; and if we change the analysis, we must use the conclusion of the new analysis. It makes a lot of sense to me to perform different analyses (all of them valid) to see whether the results change according to the analysis used. But I’m also concerned that this may look like p-hacking (performing many analyses to find a result – even though conceptually it is quite different) or like a lack of knowledge or of a prior choice on which analysis to use. Still, it is known that results may change according to the analysis, and often it’s not trivial to determine which analysis is the most appropriate for a given dataset and question. So… I think I never reported such checks because I never knew they were used and acceptable. Perhaps this shows something about how statistics are taught and used… Do you know of any references defending the use of robustness checks?

  4. Hi Jeremy, great post! I am one of those who voted for ‘none’ as a reader (not a strongly held opinion), and my basic reasons (and/or questions) are as follows.

    1) When deciding what statistical test to use, first we should decide what assumptions we are making (such as the distribution of data and so on), and we verify those assumptions. Then based on that, we determine the test which either implicitly or explicitly (or both) includes those assumptions. Now, given that a certain set of assumptions remains constant, I imagine that whatever tests we run after accounting for those shouldn’t give us different results. In fact, I would imagine the tests with the same assumptions run on the same data would give the same results – because they should be identical tests by definition.

    There may be assumptions we can’t easily verify, and so for those we could run the test with different assumptions, any of which may be true, and report the results of all tests – but then isn’t it better to invest time and effort in making our assumptions more certain rather than running the test with all of them one by one? Unless of course, there is no way to make them more certain, and in this – I believe rare – scenario we should do robustness checks.

    2) I think it’s possible that a lot of robustness checks that all show the same result, but which have been cherry-picked for that purpose, might give the reader a higher degree of confidence in the results than is warranted. Now, this isn’t an argument against robustness checks, because you can’t argue that one shouldn’t do X because, if done poorly, it leads to undesired consequences – nobody is arguing for X to be done poorly in the first place. However, it’s just a concern about what might happen if robustness checks become a bit of a bandwagon.

    3) Related to the above two points, in the few times that I’ve seen robustness checks, they have sometimes not been very satisfactorily justified, and sometimes they have been. One paper with examples of both, in my opinion, is Thibaut and Connolly (2019) in Ecology. After writing about “well recognised” issues with a certain analysis, they write that “nevertheless” they wanted to confirm their own analysis is not “sensitive to this choice” and do that analysis (fitting a Ricker-logistic model to time series data). In my opinion, that is not a well justified choice. At the same time, they write that they wanted to check whether a particular assumption of their model is true (that the previous year’s population abundance captures all the information about density dependent effects of past abundance), and then they explicitly test for this, which I think is great.


  5. Having a graph accompany an analysis is a form of robustness check. I’m not sure most venues formally require a graph, but it is uncommon to see original research without one. The graph confirms the analysis.

    • Hmm. The graph is there for other reasons too, of course. To help the reader understand the analysis, for instance. And to help the reader understand the data in a way that’s independent of the analysis.

  6. Seems like a good way to decide which methods to report is to consider the spectrum from most to least powerful statistical procedures. The more powerful ones detect more subtle patterns or effects and usually make more/stricter assumptions about the data being analyzed. The less powerful ones detect only the most obvious effects and make fewer assumptions about the data. If you just bracket those with two analyses, seems like you’d cover most of what’s of concern. I’ve always been a fan of the interocular cranial impact test (hits you between the eyes) that virtually all analyses see. 😉
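The bracketing idea in the comment above — one analysis with stronger assumptions, one with weaker — can be illustrated with a correlation: Pearson’s r (parametric, assumes roughly linear, well-behaved data) next to Spearman’s rank correlation (weaker assumptions, only monotonicity). A hypothetical sketch with invented, tie-free data:

```python
# Bracket a correlation analysis: parametric Pearson r vs. rank-based
# Spearman r on the same (made-up) data. Agreement in sign and rough
# magnitude suggests the conclusion survives the parametric assumptions.
import statistics

def pearson(x, y):
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (len(x) * statistics.pstdev(x) * statistics.pstdev(y))

def ranks(v):
    # Ranks 1..n (no tie handling; fine for tie-free illustrative data).
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0.0] * len(v)
    for rank, i in enumerate(order):
        r[i] = float(rank + 1)
    return r

def spearman(x, y):
    return pearson(ranks(x), ranks(y))

sun_exposure = [10, 20, 30, 40, 50, 60]           # invented predictor
occupancy = [0.2, 0.3, 0.5, 0.4, 0.8, 0.9]        # invented response

r_p = pearson(sun_exposure, occupancy)
r_s = spearman(sun_exposure, occupancy)
print(r_p, r_s)
```

Here both statistics come out strongly positive, so the “more powerful” and “less powerful” ends of the bracket agree.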

  7. Pingback: The death knell for the “one right way” approach to statistics? and the need for Sturdy Statistics | Dynamic Ecology
