A case study in how to choose one’s methods when there’s no agreed “best” method (or, how should I estimate heterogeneity in a random effects meta-analysis?)

This blog post is me thinking out loud about something I bet all of you have thought about at some point: what method should I use to do X when there’s no agreed-upon “standard” or “best” method? I’m thinking about this problem in the context of random effects meta-analysis, but I think what I have to say will resonate more broadly. Plus, there’s a little poll at the end where you can vote on what I should do. 🙂

Recently I wrote some blog posts exploring heterogeneity in ecological meta-analyses. “Heterogeneity” basically means “variation in effect size not attributable to sampling error.” In particular, I focused on heterogeneity in effect size within vs. among research papers, and was alarmed to see just how much heterogeneity there is within papers in most ecological meta-analyses. The “true” mean effect size often seems to vary a lot among effect sizes reported in the same paper (e.g., the same experiment conducted on each of two different species), even though effect sizes reported in the same paper should have a lot in common (same investigators, using exactly the same methods, etc.).

But those blog posts glossed over some tricky technical issues. I didn’t say how I estimated the random effects of variation among studies, and among effect sizes within studies. And I didn’t report confidence intervals for those estimated random effects. As an astute commenter noted, the graphs I presented give some strong hints that within- and among-paper heterogeneity often were estimated very imprecisely. Which wouldn’t be a surprise, honestly, given how few papers there are in most ecological meta-analyses, and the fact that many papers only report 1-2 effect sizes.
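For concreteness, one common way to write down a meta-analysis with both among-paper and within-paper heterogeneity is a three-level random effects model. What follows is a sketch of that standard formulation, not necessarily the exact model used in the earlier posts, and all the symbols are notation introduced here for illustration: $y_{ij}$ is the $j$-th effect size reported in paper $i$, and $v_{ij}$ is its sampling variance, treated as known.

```latex
y_{ij} = \mu + u_i + w_{ij} + e_{ij}, \qquad
u_i \sim N(0, \sigma^2_{\text{among}}), \quad
w_{ij} \sim N(0, \sigma^2_{\text{within}}), \quad
e_{ij} \sim N(0, v_{ij})
```

In this notation, the heterogeneity is the part of the variance that is not sampling error, $\sigma^2_{\text{among}} + \sigma^2_{\text{within}}$, and the estimation problem is about those two variance components and their confidence intervals.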

All of which was fine for purposes of an exploratory analysis presented in a blog post. But now I’m getting serious about actually doing the analyses right. Unfortunately, I immediately ran into a stumbling block: exactly how should one estimate heterogeneity, and its confidence interval, in a meta-analysis? The webpage for the metafor package (a popular R package for doing meta-analyses) helpfully points out that different ways of estimating heterogeneity give wildly different point estimates and wildly different confidence intervals when applied to the same data. But it doesn’t offer any advice as to which way should be preferred, or provide any information that would help a non-expert like me make an informed decision (e.g., which estimators are based on asymptotic theory).
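To make the "different estimators, different answers" point concrete, here is a minimal sketch in plain Python of two textbook closed-form estimators of the among-study variance (usually written tau²): the DerSimonian-Laird moment estimator and the unweighted Hedges estimator. It assumes a simple random effects model with one effect size per study, which is simpler than the within- and among-paper model I care about, and the effect sizes and sampling variances are made up for illustration. It is not a reimplementation of metafor.

```python
# Toy comparison of two closed-form heterogeneity (tau^2) estimators.
# Assumes one effect size per study and at least two studies.

def dersimonian_laird(effects, variances):
    """DerSimonian-Laird moment estimator of tau^2, truncated at zero."""
    k = len(effects)
    weights = [1.0 / v for v in variances]  # inverse-variance weights
    sum_w = sum(weights)
    fixed_mean = sum(w * y for w, y in zip(weights, effects)) / sum_w
    # Cochran's Q: weighted squared deviations from the fixed-effect mean
    q = sum(w * (y - fixed_mean) ** 2 for w, y in zip(weights, effects))
    c = sum_w - sum(w * w for w in weights) / sum_w
    return max(0.0, (q - (k - 1)) / c)


def hedges(effects, variances):
    """Hedges estimator: unweighted sample variance of the effects
    minus the mean sampling variance, truncated at zero."""
    k = len(effects)
    mean_y = sum(effects) / k
    sample_var = sum((y - mean_y) ** 2 for y in effects) / (k - 1)
    return max(0.0, sample_var - sum(variances) / k)


# Hypothetical data: six effect sizes and their sampling variances.
toy_effects = [0.10, 0.30, 0.35, 0.65, 0.45, 0.15]
toy_variances = [0.03, 0.03, 0.05, 0.01, 0.05, 0.02]

print("DL tau^2:    ", dersimonian_laird(toy_effects, toy_variances))
print("Hedges tau^2:", hedges(toy_effects, toy_variances))
```

The two functions return the same value when every study has the same sampling variance, and they generally diverge once the variances differ, because one weights studies by precision and the other does not.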

Perhaps the metafor package authors don’t give advice because there’s no agreed-upon advice to give? A quick search of the literature suggests to me that estimation of heterogeneity is a huge unsolved problem in meta-analysis (von Hippel 2015, Partlett & Riley 2016, Nagashima et al. 2018, Takkouche et al. 2013, Jackson & Bowden 2016, and probably a bunch of other papers my casual googling didn’t discover).

So, what should I do? I can see a few options:

  • Just pick an estimator of heterogeneity, and a way of estimating its 95% confidence interval, and go with it. For instance, just follow Senior et al. 2016 Ecology. This option is attractive to me because it is the least amount of work for me, and because I know it would probably be fine in the eyes of reviewers, editors, and most readers. It’s unattractive to me because I’m pretty sure I have an old blog post complaining about authors who “justify” their methods by saying (in so many words) “We have no idea if this method works or not, but other people have used it, so if it’s wrong you should blame those other people, not us.”
  • Do a massive simulation study to determine the best way to estimate heterogeneity and its 95% confidence interval. That in itself would be a useful contribution to the statistical literature. This is not actually a live option because I am totally unqualified to do it properly, but I mention it for completeness.
  • Repeat my study with every proposed estimator of heterogeneity, and its 95% confidence interval, to confirm that the results are robust (or at least to bracket the range of plausible results). This has the advantage of being rigorous, though of course it doesn’t rule out the (hopefully remote!) possibility that all heterogeneity estimators have some shared flaw that biases them all in the same direction. It has the disadvantage of being hilariously computationally intensive. I have 460 meta-analyses to deal with. If I have to estimate heterogeneity and its 95% c.i. in, say, 5 different ways, that’s 460×5=2300 meta-analyses, some of which will take hours to fit. Well, unless I include “bootstrapping” among my ways of estimating the 95% confidence intervals, in which case they’ll all take hours to fit. Worse, to address some of the questions I’m interested in, I need to do cumulative meta-analyses. That is, do a meta-analysis of just the first two published papers, then add in the third published paper and recalculate the meta-analysis, and so on. Cumulative meta-analysis is a way of studying how the evidence base on a topic changes over time as more and more papers are published. Each cumulative meta-analysis is effectively a bunch of meta-analyses (as many as several hundred, if the meta-analysis included several hundred primary research papers). Doing 460 cumulative meta-analyses, and then redoing them all with each of several different heterogeneity estimators, would take approximately forever. And if I did all 460 cumulative meta-analyses using bootstrapping to put confidence intervals on the random effects, that would take several forevers.
  • Abandon this project because it’s intractable. Revisit the project if/when statisticians agree on the best way to estimate heterogeneity and its 95% confidence interval. After all, it’s not as if I don’t have other projects I could pursue instead. I have an old blog post arguing in favor of this option–that ecologists should be more ready to switch rather than fight, when faced with an intractable technical obstacle to their research. But hypocrite that I am, I’m reluctant to go down this road unless someone else pushes me down it.
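The cumulative meta-analysis described in the third option, and the way the number of model fits balloons, can be sketched as follows. This is a simplified illustration under stated assumptions: one effect size per paper, the DerSimonian-Laird estimator standing in for whichever estimator ends up being used, and made-up data. The function names are hypothetical, not from any package.

```python
# Sketch of a cumulative random effects meta-analysis (one effect per paper).

def dl_tau2(effects, variances):
    """DerSimonian-Laird estimate of tau^2, truncated at zero."""
    k = len(effects)
    weights = [1.0 / v for v in variances]
    sum_w = sum(weights)
    fixed_mean = sum(w * y for w, y in zip(weights, effects)) / sum_w
    q = sum(w * (y - fixed_mean) ** 2 for w, y in zip(weights, effects))
    c = sum_w - sum(w * w for w in weights) / sum_w
    return max(0.0, (q - (k - 1)) / c)


def random_effects_fit(effects, variances):
    """Return (pooled mean, tau^2) for a simple random effects model."""
    tau2 = dl_tau2(effects, variances)
    weights = [1.0 / (v + tau2) for v in variances]
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    return pooled, tau2


def cumulative_meta_analysis(papers):
    """papers: list of (year, effect, variance) tuples.
    Refit the model on the first 2 papers, then the first 3, and so on."""
    ordered = sorted(papers, key=lambda p: p[0])  # oldest paper first
    results = []
    for n in range(2, len(ordered) + 1):
        subset = ordered[:n]
        pooled, tau2 = random_effects_fit(
            [p[1] for p in subset], [p[2] for p in subset]
        )
        results.append((n, pooled, tau2))
    return results


def total_fits(papers_per_meta_analysis, n_estimators):
    """Model fits needed: (k - 1) per cumulative analysis, per estimator."""
    return n_estimators * sum(max(k - 1, 0) for k in papers_per_meta_analysis)


# Hypothetical data: (publication year, effect size, sampling variance).
toy_papers = [(2003, 0.80, 0.10), (1998, 0.50, 0.04), (2010, 0.20, 0.02),
              (2007, 0.35, 0.05), (2015, 0.25, 0.01)]

for n, pooled, tau2 in cumulative_meta_analysis(toy_papers):
    print(f"first {n} papers: pooled = {pooled:.3f}, tau^2 = {tau2:.3f}")
```

With 460 meta-analyses and 5 estimators, `total_fits` gives 2300 only if every meta-analysis is fit once; a single cumulative meta-analysis of k papers already needs k − 1 fits per estimator, before any bootstrapping.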

So, what should I do? Take the poll!

7 thoughts on “A case study in how to choose one’s methods when there’s no agreed “best” method (or, how should I estimate heterogeneity in a random effects meta-analysis?)”

  1. I picked “choose one and go with it” because I see the payoff of using several different estimators as relatively small unless they all converge on a similar region (which sounds unlikely). Because if they differ, what do you do? Take the mean? Knowing that one of them is probably better and adding the others is just adding ‘noise’? Integrating them in any way (including averaging them) seems almost as arbitrary as just picking one.
    And we’re always dealing with errors in our estimates – of population size or species richness or whatever. And we rarely have a good idea of how much error there is. But those errors rarely prevent us from learning from our data. I would take the ‘easy’ way and let the chips fall…

    • I was originally leaning towards “report several estimators”. But as I read up on this more, I’m finding that the literature does offer *some* advice. So now I’m leaning towards just picking the one that seems best, based on the advice I’ve read.

  2. I picked “report several” (not necessarily all). Of course if there is substantive guidance towards one, per your comment above, that’s a different story. But otherwise, I think picking just one is not reporting the uncertainty (in this case in methods) that we know exists.

  3. Not many votes so far, but the two most popular options are “pick an estimator and go with it” and “repeat the study with several different estimators”.

    A few respondents are either trolling, or are weirdly enthusiastic about me doing a massive simulation study that I’m totally unqualified to do.

  4. Hello!
    I would personally say “reframe the meta-analytic analyses as a hierarchical model and use brms and Stan to sample from the posterior distribution”, which lets you abandon the “what is a good estimator?” question entirely.
    All the best,

    • That would require me to learn to use stan, which from my own lazy point of view is a drawback. But it’s a good suggestion.

      Having dug further into the literature since I wrote this post, I’m now more confident than I was that there’s a defensible “best” way for me to proceed. Not “best” as in “any other way would *definitely* be inferior”. But “best” as in “there are some decent theoretical and empirical arguments for my chosen way of proceeding working reasonably well in an absolute sense, and working better than the alternatives”.

      • Hello,

        I suggested using brms because it has a syntax very similar to lme4 🙂


        Anyway, I perfectly understand the trade-offs behind such technical choices. I have to admit I sometimes wish I had not dived into MCMC, when I wait for an hour without knowing whether there was an error in my Stan code… But then I am confronted with all these boring estimator questions, and with the limitations of maximum-likelihood packages, and I dive in again!
        All the best,
