How replicable is ecology? Take the poll!

The talk of the social science online-o-sphere this week is this long, meaty polemic from Alvaro de Menard. Alvaro was a participant in Replication Markets, a DARPA program to estimate the replicability of social science research. “Replication” here refers to getting a statistically significant effect of the same sign as the original, using either the same data collection process and analysis on a different sample, or the same analysis on a similar but independent dataset. Participants in the replication market were volunteers who wagered on the replicability of 3,000 studies from across the social sciences. A sample of those studies will actually be replicated, to see who was right. But in the meantime, previous prediction markets have been shown to predict replicability in the social sciences, and so in the linked post Alvaro treats the replication market odds as accurate estimates of replicability.

And he’s appalled by the implications, because the estimates are very low on average. The mean is just a 54% replication probability. The distribution of estimates is bimodal with one of the modes centered on 30%. And when you break the results down by field (Gordon et al. 2020), there are entire fields that do quite badly. Psychology, marketing, management, and criminology are the worst. (Economics does the best, with sociology not too far behind.)

The hypothesized reasons for this are pretty interesting (turns out you can learn a lot by reading a bunch of papers from a bunch of fields…). Alvaro argues that lack of replicability is mostly not down to lack of statistical power, except perhaps when it comes to interaction effects. Nor does he think the main problem is political hackery masquerading as real research, except in a few narrow subfields. And he has interesting discussions of the typical research practices in various fields. As sociologist Kieran Healy pointed out on Twitter, the replication market participants basically seem to have identified a methodological gradient across fields. The more your field relies on small-sample experiments on undergrads to test hypotheses that are pulled out of thin air, the less replicable your field’s work is estimated to be. Alvaro also has interesting discussions of variation within fields.

At the end, he has some proposals to address matters, some of them quite radical (e.g., earmarking 60% of US federal research funding for preregistered studies).

I’m curious whether all this applies to ecology. What do you think? How replicable are ecological studies in your view, and what do you think are the sources of non-replication? Take the short poll below! I’ll summarize the answers in a future post.

30 thoughts on “How replicable is ecology? Take the poll!”

  1. That’s the main question our community should try to answer now. In my opinion, most problems of replicability among ecological studies come from two main issues: (i) poorly explained methods and (ii) lack of transparency in data and analysis. How can one replicate a risotto, if the recipe lacks some steps and the description of ingredients is ambiguous? Therefore, in addition to other efforts, we should work towards improving things like writing style, and data and code sharing.

    • Hi Marco – I think that the recipe analogy is an interesting one, but not for the reason that you suggest. Two people can make the same dish using the same ingredients and it will taste different, perhaps because of very subtle variations in how it’s stirred, how the temperature is controlled, differences in air pressure affecting the boiling point of water, the type of pan they use, etc. Some people seem to be naturally skilled in the kitchen, others are not. It’s the same with field and lab experiments – what works for one person may not work for another even if they are following the same steps. You hear this a lot among people who use PCR, and I’ve seen it in the field with nectar extraction from flowers.

      • Sure, I totally agree with you, Jeff. That’s why it’s so important to precisely describe all steps of the recipe, in addition, of course, to being crystal clear in the list of ingredients. If those two components of a recipe are poorly explained, two people following the same recipe may end up one with a risotto and the other with a paella.

      • I agree that this is a plausible factor undermining replicability. However, I think it is worth noting that in psychology, as replication became more common during the past decade, a standard response from those whose work had failed to replicate was that the replicators lacked the skill to implement the study correctly. At least in that case, it was clear that this was typically just a knee-jerk defensive move. And a valid reply in many of these cases was along the lines of: ‘if this result is so hard to obtain, why should we even care about it?’ This critique won’t apply when real technical skill is required, but I do think we should be cautious when applying the ‘skill’ defense of studies that have not replicated.

      • Completely agree with you on this Tim re: failure to replicate being down to tiny methodological differences between the original study and the replication attempt. I’d add Andrew Gelman’s “time reversal heuristic”. Just because study A was conducted before study B doesn’t mean the methods and conditions used in study A are somehow better or preferable to the (slightly different) methods and conditions used in study B.

        On another note, I’ve been trying to think of cases in which it really was true that a result would replicate, but only if a very finicky method was implemented with great skill. The only one I can think of (and I hope I’m not misremembering it) is cloning via somatic cell nuclear transfer. IIRC, other labs besides the original Scottish lab had trouble getting the technique to work. The problem was solved when the postdoc from the Scottish lab went to another lab and did the nuclear transfers. The technique requires an *extremely* steady hand with a pipet, otherwise you damage the nucleus. This one postdoc really was exceptionally skilled at the technique.

        The other possible case I can think of is Leo Luckinbill’s early-70s experiment stabilizing a predator-prey cycle in protist microcosms by thickening the culture liquid with methyl cellulose. This slows the movements of the prey and predator and so reduces the rate at which they encounter one another, thereby reducing the predator’s per-capita attack rate. (See Gary Harrison’s amazing re-analysis of Luckinbill’s data from the mid-1990s in Ecology.) As a postdoc, I planned to build on Luckinbill’s work by introducing a second prey species and studying prey coexistence. Step one was to replicate Luckinbill’s results with a single prey species. I couldn’t do it. There seemed to be a very sharp threshold between “medium not thickened enough to make any difference” and “medium so thick the predators can barely move, so they all starve”. I knew there were various differences between my methods and Luckinbill’s (e.g., different culture medium), but I think they were pretty minor. So I wouldn’t say that Luckinbill’s result wasn’t real (and I certainly don’t think Luckinbill faked it!). I just think that his result was super-finicky. I’m pretty confident that if you could replicate his methods *exactly*, you’d have good odds of getting the same result he got (thickening the medium stabilizes the predator-prey cycle so it’s not so extinction-prone). Whereas I wouldn’t say that of a lot of these social psychology experiments that fail to replicate–I think those results are the products of some combination of p-hacking, sampling error, and publication bias, so that you wouldn’t expect them to replicate even if you could skillfully and exactly reproduce the original methods.

  2. I wonder if it is necessary to specify what kind of experiment. A “typical ecological result” could be almost anything. Crucially, I would expect replication rates to be pretty decent in microcosm and mesocosm experiments, where the conditions of the original experiment can be reproduced, and much lower in field experiments where so much of the context cannot be controlled.

  3. “The more your field relies on small-sample experiments on undergrads to test hypotheses that are pulled out of thin air, the less replicable your field’s work is estimated to be.”

    This is one of my wife’s bêtes noires. She’s a therapist and criticises the findings from psychological research as often being applicable only to students (who are by definition not a random group of individuals) under narrow conditions.

  4. One thing I have never quite understood about the notion of replicating studies is the assumption that something is wrong if the results are not the same between two studies. Suppose study A is underpowered and gives a positive result, and is then replicated by study B, which is also underpowered but this time gives a negative result. Study A would then be considered to have unreplicable results, even though we can’t tell whether study A gave the false positive or study B the false negative.
    Or am I missing something?
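    [A quick simulation can make this point concrete. This is an illustrative sketch of my own, not from the comment: the sample size, effect size, and test are all assumptions chosen to represent a “typical” underpowered study. Two identical, underpowered studies of a perfectly real effect will routinely disagree about whether the result is “significant”.]

    ```python
    # Sketch: two identical, underpowered studies of a REAL effect.
    # n, d, and the t-test are illustrative assumptions, not from the post.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    n, d, trials = 15, 0.4, 5000  # small samples, modest true effect (Cohen's d)

    def one_study():
        """Run one two-sample study; return True if it finds a
        'significant positive' result (p < 0.05, effect in the true direction)."""
        treat = rng.normal(d, 1, n)
        control = rng.normal(0, 1, n)
        t, p = stats.ttest_ind(treat, control)
        return bool(p < 0.05 and t > 0)

    pairs = [(one_study(), one_study()) for _ in range(trials)]
    a = np.array([x for x, _ in pairs])  # "study A" outcomes
    b = np.array([y for _, y in pairs])  # "study B" outcomes
    power = a.mean()          # fraction of studies detecting the real effect
    agree = (a == b).mean()   # fraction of A/B pairs reaching the same verdict
    print(f"per-study success rate ~ {power:.2f}; pairs agree ~ {agree:.0%} of the time")
    ```

    [With these assumed numbers, each study detects the real effect well under half the time, so a “failed replication” here tells you more about power than about whether the effect exists.]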

    • And I thought that was the reason why we are encouraged to do meta-analyses: to find a common truth among a lot of underpowered studies. Perhaps we should not ask whether a study can be repeated, but rather how variable the outcomes can be, given that the hypothesis is correct. Perhaps that tells us something more than whether a study outcome is significant according to the p < 0.05 convention.

      • Although if replicability is too low (say, because among-study heterogeneity is extremely high), a meta-analysis doesn’t tell you much, does it?

      • Of course it does. It tells you how predictable this pattern is. That is the very point. We don’t only aim to find mean tendencies with meta-analyses, but also variability due to heterogeneity.

  5. Thanks for calling attention to Alvaro de Menard’s post on replication forecasts in social science. I want to say a few things about that project and a few things about replicability in ecology.

    First, I urge anyone with an interest in the reliability of empirical science to read Alvaro’s post. It’s insightful and well written. I don’t agree with everything in that post, but I won’t get into a detailed critique here. I have been working on a parallel project (replicATS), funded through the same DARPA program, to generate forecasts for the same 3000 social science studies. Our approach was different – we spent more time with each paper (20-30 minutes rather than 2.5 minutes), and we refined our forecasts through a collaborative process (the IDEA protocol). My experience of working on approximately 100 of these forecasts leads me to conclusions that are broadly similar to Alvaro’s. As an ecologist, it was fascinating to dive into the social science literature with a critical eye, and to realize that I could actually generate replicability forecasts that often converged with those of domain experts. The lesson: understanding a few basic principles of study reliability can take you far in this process.

    As for replicability in ecology, I expect that it varies considerably among sub-fields. This is because sub-fields vary on two of the best predictors of replicability – sample size and the plausibility / prior probability of hypotheses. Small sample size is obviously an important cause of variability in effect size, and I doubt anyone reading this needs convincing of that, but if you want an empirical demonstration, check out Fanelli et al. 2017 PNAS. However, the relationship between prior probability and replicability may be less obvious. For instance, in a null hypothesis testing framework, the easiest way to get a high proportion of false positives (FPRP – false positive report probability) is to test unlikely hypotheses (this has been widely discussed, but a frequently cited explanation is in Ioannidis 2005 PLOS Medicine). Social psychology suffers from both small samples and an attraction to unlikely (counterintuitive, exciting) hypotheses, and this probably explains its poor replicability in the big multi-study replication projects. As an aside, one of the biggest obstacles I faced when estimating the replicability of social science studies came when I lacked sufficient information to generate a strong prior. I think one of the reasons we can estimate replicability reasonably well in social psychology is that our own lives give us a robust basis for assessing the plausibility of hypotheses. Anyway, back to ecology. As someone trained as a behavioral ecologist, I am willing to identify that sub-discipline as the one with the closest parallels to poorly replicable fields of social science – both due to small samples and frequent tests of unlikely hypotheses. Although the small sample size problem is widespread in ecology, I suspect that the unlikely hypothesis problem may be less so. I’d be curious to hear what others think.
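    [The FPRP arithmetic behind this point can be sketched in a few lines. The alpha and power values below are my own illustrative assumptions, not numbers from the comment; the formula is the standard one from the literature the comment cites: among “significant” results, the share that are false positives depends heavily on the prior probability of the hypotheses being tested.]

    ```python
    # Sketch of the FPRP arithmetic; alpha and power are illustrative assumptions.
    alpha, test_power = 0.05, 0.8

    def fprp(prior):
        """False positive report probability:
        P(hypothesis is false | result is 'significant')."""
        false_pos = alpha * (1 - prior)     # significant results from false hypotheses
        true_pos = test_power * prior       # significant results from true hypotheses
        return false_pos / (false_pos + true_pos)

    for prior in (0.5, 0.1, 0.01):
        print(f"prior probability {prior:>4}: FPRP = {fprp(prior):.2f}")
    # → 0.06, 0.36, 0.86
    ```

    [Even with a well-powered test, testing long-shot hypotheses (prior = 0.01) means the large majority of “significant” findings are false positives – which is the sense in which a field’s taste for unlikely-but-exciting hypotheses predicts poor replicability.]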
    I agree, however, that other obstacles to replicability in ecology would include biological heterogeneity (temporal / spatial / taxonomic) and methodological heterogeneity (which is why I’m a big fan of distributed experiments like NutNet and DRAGNet). One problem that plagues some social science realms, but that I think may be less of a problem in ecology in general, is the obsession with p = 0.05 as a threshold. I saw many social science papers with results hovering just below 0.05, and ‘statistical significance’ is poorly repeatable when your p-value is between 0.05 and 0.01. Also, when a study reports a suite of p-values falling in that range, that in itself is an unlikely event (even in the case of a real effect), and so is a red flag for p-hacking or selective reporting. Again, I don’t think I see as much of that in ecology broadly. However, I haven’t systematically searched.

    My last words for now – I suspect ecology has higher replicability on average than the social science disciplines with the poorest replicability, but I doubt we’re in wonderful shape. It will be hard to generate robust estimates of replicability across the discipline because so many ecology studies are so difficult to replicate. However, various people are working around the edges of this problem and I think will generate some useful insights soon. For instance, the collaborative many-analysts project I’m working on has had 171 analysts or analyst teams submit separate analyses of one or the other of two ecology/evolutionary bio data sets, and this should provide some insights into the degree to which among-analyst variation in statistical decisions can drive heterogeneity in results.

    • Tim, these are *super* interesting comments! I assume you wouldn’t mind if I hoisted them into a standalone blog post? Relatively few readers read the comments, unfortunately, but your remarks here really deserve more eyeballs.

      And yes, if you asked me to guess which subfield of ecology would be least replicable, I’d guess behavioral ecology too. At least, the bits of behavioral ecology that are most like social psychology–small-sample experiments testing hypotheses that seem implausible but that would be super-cool if they were true. I was actually thinking of saying that in the first draft of this post, but decided I didn’t want to nudge poll respondents by including any of my own views. Note that I wouldn’t hazard a guess as to how replicable behavioral ecology is in an absolute sense–I really have no idea. I just suspect it’s less replicable than other subfields of ecology. But of course, having not gone through the DARPA-style exercise myself, I wouldn’t have much confidence in my own suspicions here!

      • Hi Jeremy. Hoist away. Though, I would have put a bit more care into them if I had thought they would be the primary content of a post. Maybe I could edit a bit? I’ll send you an email.

  6. To p-hacking I would add question and hypothesis hacking – writing a paper around whichever results were significant (likely after a fishing expedition).

  7. A couple of points about replicability. First, the current concern seems to be primarily about finding effects that aren’t real. And I understand why – because the incentives probably push scientists more towards that kind of mistake than the mistake of missing effects that are real. But, an overweening concern for ensuring that scientists don’t publish research with Type I errors is going to lead to an increase in Type II errors. I’m not sure that’s a good thing for science.
    In fact, if we truly believe that science is (and should be) an iterative process – one where early research is intended to find good candidate explanations for phenomena of interest, and where, as the research matures, the explanations are narrowed and refined to those with increasing empirical support – then in those early days we should be casting the net wide and using small mesh, so that we capture every reasonable candidate. In those early stages, we should be more worried about Type II errors than Type I errors. If anybody buys this argument, the problem isn’t that there are all kinds of studies that aren’t replicable – it’s that we aren’t doing science in a way that routinely discovers and discards them. It’s not that we reach conclusions that end up being false, it’s that we treat science as if it’s a 40-meter dash rather than an ultra-marathon. If we had the long-run view, we wouldn’t be trying to reduce the number of studies that aren’t replicable – we would just be trying to identify them. There would be no shame in a study that couldn’t be replicated – the first steps are made with our eyes closed, so it’s not that surprising that they’re often in the wrong direction.
    The second point – and I will continue to beat the corpse of the barely recognizable horse that I rode in on – while replicability and predictive ability may not be the same thing, they are a lot closer than kissing cousins. When we demand that every model of how the world works make better predictions than a guess, we’ll avoid taking extended digressions in the wrong direction.

  8. Neither de Menard’s post, nor this one, nor its commenters (so far) mention the inherent difficulty of replicating results in the context of complex systems. It would require much effort to determine how broadly applicable that concern might be in any given research context, but “in the wild” (whether that be a college campus or a wilderness hillside) studies of complex systems likely won’t yield the same results twice, and even many replication attempts may not produce high confidence in a general tendency.

    • I think that kind of falls under the first option in the poll. It’s what meta-analysts would call “heterogeneity”, right?

      Though note as well that “replication” for purposes of this post and poll (and the de Menard post) just means “find an effect in the same direction as the original, that’s significant at the 0.05 level”. So, just a qualitative replication–same sign of effect, not same magnitude. That’s a pretty low bar, and one that should often make replication possible even in the face of a fair bit of complexity/heterogeneity/whatever-you-want-to-call-it.

    • I agree Matt. But if this is generally true in ecology then an explicit part of ecological research should be identifying at what spatial and temporal scales and in what contexts any particular finding would be expected to hold.

  9. Pingback: Don’t forget to take our short poll on the replicability of ecology! | Dynamic Ecology

  10. Pingback: Hoisted from the comments: Tim Parker on replicability in ecology vs. the social sciences | Dynamic Ecology

  11. Pingback: Poll results: how replicable do ecologists think ecology is, and why? | Dynamic Ecology
