This blog post is me thinking out loud about something I bet all of you have thought about at some point: what method should I use to do X when there’s no agreed-upon “standard” or “best” method? I’m thinking about this problem in the context of random effects meta-analysis, but I think what I have to say will resonate more broadly. Plus, there’s a little poll at the end where you can vote on what I should do. 🙂
Recently I wrote some blog posts exploring heterogeneity in ecological meta-analyses. “Heterogeneity” basically means “variation in effect size not attributable to sampling error.” In particular, I focused on heterogeneity in effect size within vs. among research papers, and was alarmed to see just how much heterogeneity there is within papers in most ecological meta-analyses. The “true” mean effect size often seems to vary a lot among effect sizes reported in the same paper (e.g., the same experiment conducted on each of two different species), even though effect sizes reported in the same paper should have a lot in common (same investigators, exactly the same methods, etc.).
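To make the within- vs. among-paper distinction concrete, here’s a minimal sketch of how that kind of nested heterogeneity can be modeled in metafor, using made-up data (the column names `yi`, `vi`, `paper`, and `es_id` are my own invention, not from my actual analyses):

```r
library(metafor)

# Made-up data: yi = effect size, vi = its sampling variance,
# paper = which paper it came from, es_id = effect size within that paper.
dat <- data.frame(
  yi    = c(0.3, 0.5, -0.2, 0.1, 0.6, 0.4, 0.0),
  vi    = c(0.02, 0.03, 0.05, 0.04, 0.02, 0.06, 0.03),
  paper = c(1, 1, 2, 2, 2, 3, 4),
  es_id = c(1, 2, 1, 2, 3, 1, 1)
)

# Nested random effects: effect sizes within papers.
fit <- rma.mv(yi, vi, random = ~ 1 | paper/es_id, data = dat)

# fit$sigma2 holds the two variance components:
# among-paper heterogeneity and within-paper heterogeneity.
fit$sigma2
```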
But those blog posts glossed over some tricky technical issues. I didn’t say how I estimated the random effects of variation among studies, and among effect sizes within studies. And I didn’t report confidence intervals for those estimated random effects. As an astute commenter noted, the graphs I presented give some strong hints that within- and among-paper heterogeneity was often estimated very imprecisely. Which wouldn’t be a surprise, honestly, given how few papers there are in most ecological meta-analyses, and the fact that many papers report only 1-2 effect sizes.
All of which was fine for purposes of an exploratory analysis presented in a blog post. But now I’m getting serious about actually doing the analyses right. Unfortunately, I immediately ran into a stumbling block: exactly how should one estimate heterogeneity, and its confidence interval, in a meta-analysis? The webpage for the metafor package (a popular R package for doing meta-analyses) helpfully points out that different ways of estimating heterogeneity give wildly different point estimates and wildly different confidence intervals when applied to the same data. But it doesn’t offer any advice as to which way should be preferred, or provide any information that would help a non-expert like me make an informed decision (e.g., which estimators are based on asymptotic theory).
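To see the problem the metafor page is describing, here’s a sketch: the same toy data fit with several of the τ² estimators that `rma()` offers, plus a confidence interval from `confint()`. The data are invented purely for illustration:

```r
library(metafor)

# Invented toy data: five effect sizes with their sampling variances.
dat <- data.frame(yi = c(0.2, 0.5, -0.1, 0.8, 0.3),
                  vi = c(0.04, 0.10, 0.05, 0.12, 0.06))

# The same random-effects model under different tau^2 estimators:
# DerSimonian-Laird, REML, Paule-Mandel, Sidik-Jonkman.
for (m in c("DL", "REML", "PM", "SJ")) {
  fit <- rma(yi, vi, data = dat, method = m)
  cat(sprintf("%-4s tau^2 = %.4f\n", m, fit$tau2))
}

# Confidence interval for tau^2 (Q-profile method).
confint(rma(yi, vi, data = dat, method = "REML"))
```

Run on real data, the point estimates of τ² from the different estimators, and the widths of the resulting intervals, can diverge substantially, which is exactly the problem.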
Perhaps the metafor package authors don’t give advice because there’s no agreed-upon advice to give? A quick search of the literature suggests to me that estimation of heterogeneity is a huge unsolved problem in meta-analysis (von Hippel 2015, Partlett & Riley 2016, Nagashima et al. 2018, Takkouche et al. 2013, Jackson & Bowden 2016, and probably a bunch of other papers my casual googling didn’t discover).
So, what should I do? I can see a few options:
- Just pick an estimator of heterogeneity, and a way of estimating its 95% confidence interval, and go with it. For instance, just follow Senior et al. 2016 Ecology. This option is attractive to me because it is the least amount of work for me, and because I know it would probably be fine in the eyes of reviewers, editors, and most readers. It’s unattractive to me because I’m pretty sure I have an old blog post complaining about authors who “justify” their methods by saying (in so many words) “We have no idea if this method works or not, but other people have used it, so if it’s wrong you should blame those other people, not us.”
- Do a massive simulation study to determine the best way to estimate heterogeneity and its 95% confidence interval. That in itself would be a useful contribution to the statistical literature. This is not actually a live option because I am totally unqualified to do it properly, but I mention it for completeness.
- Repeat my study with every proposed estimator of heterogeneity, and its 95% confidence interval, to confirm that the results are robust (or at least to bracket the range of plausible results). Has the advantage of being rigorous. Though of course it doesn’t rule out the (hopefully remote!) possibility that all heterogeneity estimators have some shared flaw that biases them all in the same direction. Has the disadvantage of being hilariously computationally intensive. I have 460 meta-analyses to deal with. If I have to estimate heterogeneity and its 95% confidence interval in, say, 5 different ways, that’s 460 × 5 = 2,300 meta-analyses. Some of which will take hours to fit. Well, unless I include “bootstrapping” among my ways of estimating the 95% confidence intervals, in which case they’ll all take hours to fit. Worse, to address some of the questions I’m interested in, I need to do cumulative meta-analyses. That is, do a meta-analysis of just the first two published papers, then add in the third published paper and recalculate the meta-analysis, and so on. Cumulative meta-analysis is a way of studying how the evidence base on a topic changes over time as more and more papers are published. Each cumulative meta-analysis is effectively a bunch of meta-analyses (as many as several hundred, if the meta-analysis included several hundred primary research papers). Doing 460 cumulative meta-analyses, and then redoing them all with each of several different heterogeneity estimators, would take approximately forever. And if I did all 460 cumulative meta-analyses using bootstrapping to put confidence intervals on the random effects, that would take several forevers.
- Abandon this project because it’s intractable. Revisit the project if/when statisticians agree on the best way to estimate heterogeneity and its 95% confidence interval. After all, it’s not as if I don’t have other projects I could pursue instead. I have an old blog post arguing in favor of this option: that ecologists should be more ready to switch rather than fight, when faced with an intractable technical obstacle to their research. But hypocrite that I am, I’m reluctant to go down this road unless someone else pushes me down it.
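For what it’s worth, the cumulative meta-analyses described in the third option don’t have to be coded by hand: metafor’s `cumul()` refits the model adding one study at a time. A sketch with invented data (the `year` column is hypothetical; real analyses would order by publication date):

```r
library(metafor)

# Invented data: effect sizes, variances, and publication years.
dat <- data.frame(yi   = c(0.4, 0.1, 0.6, -0.2, 0.3),
                  vi   = c(0.05, 0.04, 0.08, 0.06, 0.03),
                  year = c(1998, 2001, 2004, 2007, 2011))

res <- rma(yi, vi, data = dat, method = "DL")

# Refit the meta-analysis cumulatively, in order of publication:
# first 1 study, then 2, and so on. Each row is one re-fitted model.
cumul(res, order = order(dat$year))
```

Convenient as that is, it doesn’t change the arithmetic: a cumulative meta-analysis of k papers is still roughly k model fits, which is why redoing hundreds of them under multiple estimators balloons so quickly.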
So, what should I do? Take the poll!