I wish I could’ve titled this post “The ~~call~~ heterogeneity is coming from inside the ~~house~~ primary studies”, but WordPress doesn’t allow strikethrough text in post titles. 🙂

Ecology is full of variability, and most of it isn’t just sampling error that would go away if only we had large enough sample sizes. For instance, in a typical ecological meta-analysis, something like 85% of the variation in effect size is attributable not to sampling error, but rather to “heterogeneity”: real variation in the true mean effect size (Senior et al. 2016).

Variation cries out for explanation. If it’s not sampling error, then that means there must be *some* reason(s) for it. We’d like to know the reason(s)! That’s why ecological meta-analyses routinely include moderator variables–covariates that might explain some of the heterogeneity in effect size. Perhaps effect sizes vary because some primary studies are observational and others are experimental. Maybe some primary studies were conducted on birds and others were conducted on mammals. Maybe some primary studies were conducted on islands and others were conducted on continents. Maybe different primary studies were conducted at different latitudes, or using different methods, or etc.

But what if most of the variation isn’t *among* primary studies? Rather, what if most of the variation is *within* primary studies? After all, many primary studies (i.e. single research papers) report multiple effect sizes. The investigators conducted the same experiment on each of three related species, or in each of two different habitats, or etc. Now, it might seem far-fetched to worry that effect sizes from the same primary study will be all *that* heterogeneous. After all, effect sizes reported in the same primary study ordinarily have a lot in common. They’re based on data collected by the same investigators, using the same methods, usually at the same time, and usually at the same or nearby locations. That’s why effect sizes from the same primary study generally share the same values for most or even all of the moderator variables in a typical meta-analysis. How much within-study heterogeneity in effect size could there possibly be?

About as much as there is among studies, actually! Below is a graph from my fairly comprehensive compilation of over 450 ecological meta-analyses. For each meta-analysis, I used a hierarchical random effects model to partition the variation in effect size into variation among primary studies, within primary studies, and sampling error. The graph below plots the % of variation in effect size attributable to among-study variation vs. the % attributable to within-study variation. There’s one point for each meta-analysis.
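To make the variance partition concrete, here’s a minimal Python sketch of the idea behind a hierarchical random effects model (a toy simulation, not the actual analysis; the variance values, the balanced design, and the method-of-moments estimator are all illustrative assumptions). Each effect size is a grand mean plus an among-study deviation, a within-study deviation, and sampling error; the estimated components are then expressed as percentages of the total, as in the graphs:

```python
import numpy as np

rng = np.random.default_rng(42)

# Assumed "true" variance components, chosen for illustration only
tau2_among = 1.0    # among-study heterogeneity
tau2_within = 1.0   # within-study heterogeneity
sigma2 = 0.35       # sampling variance of each effect size (treated as known)

# Simulate a balanced meta-analysis: n_studies studies, k effect sizes each
n_studies, k = 500, 4
study_devs = rng.normal(0.0, np.sqrt(tau2_among), n_studies)
y = (study_devs[:, None]
     + rng.normal(0.0, np.sqrt(tau2_within), (n_studies, k))
     + rng.normal(0.0, np.sqrt(sigma2), (n_studies, k)))

# Method-of-moments partition for a balanced design:
#   E[mean within-study variance] = tau2_within + sigma2
#   E[variance of study means]    = tau2_among + (tau2_within + sigma2) / k
within_var = y.var(axis=1, ddof=1).mean()
tau2_w_hat = within_var - sigma2
tau2_a_hat = y.mean(axis=1).var(ddof=1) - within_var / k

total = tau2_a_hat + tau2_w_hat + sigma2
print(f"% among-study    : {100 * tau2_a_hat / total:.1f}")
print(f"% within-study   : {100 * tau2_w_hat / total:.1f}")
print(f"% sampling error : {100 * sigma2 / total:.1f}")
```

With these made-up inputs, among- and within-study heterogeneity each account for a large share of the total, and sampling error for a modest one, which is the qualitative pattern the first graph shows for real meta-analyses. (Real analyses use likelihood-based fits, e.g. REML, and effect-size-specific sampling variances; this balanced-design moment estimator is just the simplest version of the same decomposition.)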

The first thing you’ll notice is that most of the observations fall close to an imaginary boundary line with a slope of -1, running from the upper-left corner to the lower-right corner. That reflects the fact that, for most meta-analyses, most of the variation in effect size is due to heterogeneity (the sum of among-study + within-study heterogeneity), not sampling error. The boundary line marks all combinations of among-study heterogeneity and within-study heterogeneity that add up to 100% of the total variance in effect size.

But that’s not the important thing to notice for purposes of this post. For purposes of this post, the important thing to notice is that most points are not clustered in the upper-left corner. Rather, they’re spread out pretty uniformly from the upper-left to the lower-right, just below the boundary line. Which means that within-study heterogeneity is about as large, on average, as among-study heterogeneity. Effect sizes reported in the same primary study are just as different from one another, on average, as are effect sizes reported in different primary studies.

Now, there are *some* meta-analyses for which the heterogeneity is entirely among studies; within-study heterogeneity is estimated to be zero or close to zero. But don’t get too excited about that, for two reasons. First, there are also some meta-analyses for which the heterogeneity is entirely *within* studies! Second, most of the meta-analyses with zero within-study heterogeneity (or zero among-study heterogeneity) are small meta-analyses that only include a handful of studies. Here’s a graph of within-study heterogeneity, as a function of the number of primary studies in the meta-analysis:

Notice that most of the meta-analyses with 0% (or 100%) of variation in effect size attributable to within-study heterogeneity have <25 studies, and all but two have <75 studies. That strongly suggests that, if more studies of those topics were conducted, substantial within- and among-study heterogeneity would be revealed.

If you’re someone who wants to explain variation in effect size, I think these results should worry you. In a typical ecological meta-analysis, something like 50% of the variance in effect size is variance among effect sizes within studies. Sources of within-study variation are going to be difficult or impossible to identify! Many of the usual moderator variables aren’t going to help, because they don’t vary within studies.

These results make me wonder how much distributed experiments like NutNet cut down on heterogeneity. One reason to conduct a distributed experiment is to eliminate some sources of heterogeneity in effect size.* Investigators at many different locations all perform the same experiment, at the same time, using the same methods, on organisms that are sufficiently similar to one another in various ways (body size, behavior, etc.) that one can study them all using the same methods. But of course, “same experiment,” “same time”, “same methods”, and “sufficiently similar organisms” all apply to pretty much every primary study in ecology. Apparently, all that sameness within a given primary study still leaves considerable scope for heterogeneity among the effect sizes reported by that study. So I think it’d be really interesting to quantify how much heterogeneity there is among effect sizes in a single distributed experiment like NutNet, as compared to heterogeneity among effect sizes in a meta-analysis of a bunch of primary studies.

UPDATE: see the comments, where a commenter makes a very good point that in retrospect I probably should’ve made in the post: these estimates of within- and among-study heterogeneity are just that, estimates. They have error bars–quite possibly big ones. See the comments for discussion of this and its implications for the points made in the post. /end update

p.s. I’ve made the points in this post before. But I haven’t shown the graphs before, so I decided to give them a standalone post.

*There are of course other reasons to do distributed experiments. Follow that last link for a great interview with NutNet co-founder Elizabeth Borer, addressing this point and much more.

Interesting stuff! Question – how precise are the estimates of within- and between- study variation? You show them as dots on the graph, but they are estimates from a model, and so might have quite a bit of uncertainty. When I see a strong negative correlation between predictions like that (first graph), I worry it could simply be that the model didn’t have enough information to resolve the two variables. I.e. your hierarchical model seemed generally confident that the sampling error was low, but perhaps it was not confident in whether the remaining variation was due to within- or between study variation, which would be reflected in wide confidence intervals and correlated point estimates.

Good question. I have the same question myself. I wonder if what we’re seeing here is a precise estimate of total heterogeneity, combined with a very imprecise estimate of how to divide total heterogeneity into within-study and among-study components.

Just eyeballing the data, I can certainly believe that the division between within- vs. among-study heterogeneity often will be imprecisely estimated. For many of these meta-analyses, many or even most of the primary studies only had one effect size. So the estimate of within-study heterogeneity is driven by the few primary studies that report multiple effect sizes. Further, even for the meta-analyses for which most primary studies report multiple effect sizes, those primary studies usually just report 2 or 3 effect sizes. So we’re basically estimating variances from very small samples here, which could indeed lead to very imprecise estimates.
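That small-sample worry is easy to illustrate with a toy simulation in Python (made-up variance values; a simple method-of-moments estimator with the sampling variance assumed known, not the model actually fitted). With only 20 studies reporting 2 effect sizes each, the within-study heterogeneity estimate bounces around a lot from one hypothetical meta-analysis to the next:

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed true values, for illustration only
tau2_within, sigma2 = 1.0, 0.35

# 20 studies, each reporting just k = 2 effect sizes; repeat the whole
# estimation exercise many times to see the spread of the estimates
n_studies, k, n_reps = 20, 2, 2000

estimates = []
for _ in range(n_reps):
    y = (rng.normal(0.0, np.sqrt(tau2_within), (n_studies, k))
         + rng.normal(0.0, np.sqrt(sigma2), (n_studies, k)))
    # Within-study heterogeneity = mean within-study variance minus
    # the (known) sampling variance
    tau2_w_hat = y.var(axis=1, ddof=1).mean() - sigma2
    estimates.append(tau2_w_hat)

estimates = np.asarray(estimates)
print(f"true tau2_within  = {tau2_within}")
print(f"mean of estimates = {estimates.mean():.2f}")
print(f"5th-95th pctiles  = {np.percentile(estimates, [5, 95]).round(2)}")
```

The estimator is roughly unbiased on average, but the 5th-95th percentile range of the estimates is wider than the true value itself, which is the kind of imprecision I’d expect when variances are estimated from 2-3 effect sizes per study.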

p.s. even if we’re not confident in the precise division between within-study and among-study heterogeneity for most meta-analyses, that in itself is somewhat concerning. After all, it means we’re *not* confident that most of the heterogeneity is among studies. And unless we can be confident that most heterogeneity is among-study heterogeneity, I don’t think we can be confident in our ability to explain much of that heterogeneity.

p.p.s. My previous two comments, and the original post, would be wrong if there’s *bias* in the estimation of within-study vs. among-study heterogeneity. In particular, if for some reason my estimates of within-study heterogeneity are biased upwards by some substantial amount, that would invalidate the post. I don’t know of any reason to be concerned about that hypothetical possibility, but I’m not an expert on hierarchical random effects models. Are there circumstances in which one would expect the variance components estimates to be seriously biased?

Hi,

Very interesting. I would expect multiple effect sizes within a study to exist because the researchers had reason to believe varying some factor in the design would affect the effect size. If so perhaps not so surprising if the within-study heterogeneity estimate sometimes is large? (I don’t work in ecology, so perhaps this reasoning doesn’t apply here).

Good point, that might apply in some cases.

Great post, Jeremy! The 50% figure does not surprise me in the least, for three reasons:

a) Most investigators do not conduct pilot studies sufficient to estimate confidence intervals relative to sampling effort.

b) Those who do often discover that to get below a 50% confidence interval limit, the sampling regime is so intense that they haven’t the staff, time or money to get there.

c) Ergo, welcome to the mine field of meta-analyses concerning field data.

It is safe to assume a significant number of studies generate data wherein the confidence intervals are so large, you could drive a truck through them. Thus, I am actually pleasantly surprised that only 50% of the heterogeneity in ecological meta-analyses is within studies. I’d have expected it to be greater.

That said, while this kind of variation rules out most, if not all, quantitative assessments, one can certainly utilize these meta-analyses qualitatively, and for a rough indication of the real effect in nature.

Quibble: I actually disagree that pilot studies to estimate the sample size required for a given level of confidence are useful. I don’t think it’s useful to have very imprecise pilot estimates of population parameters. In large part because of your “b”–investigators already collect the most data they possibly can, given the time/staff/money available. And their choices of which questions to ask in the first place mostly aren’t dictated by considerations of sampling error. Rarely do ecologists decide “I’m going to work on question Y rather than question X, because I can obtain more precise estimates of population parameters if I study question Y.”

Good points, Jeremy. I agree that hypotheses often evolve & change over time. Perhaps I had not stated my case as clearly as I should have. I’ve made extensive use of the following text for the purpose of estimating sampling intensity relative to vegetation monitoring:

https://www.wiley.com/en-us/Monitoring+Plant+and+Animal+Populations%3A+A+Handbook+for+Field+Biologists-p-9780632044429

Elzinga et al. provide several equations to estimate sampling intensity relative to confidence intervals. I have found these equations to be of good use. In essence, they communicate the number of transects and number of points per transect needed to estimate the actual occurrence of species/guilds/communities within a given site- i.e., sample. Thus, I misspoke when I described this as a means to estimate the number of samples needed to estimate a population parameter.

In my most recent endeavor utilizing these equations, my study involved 40 sites- i.e., samples. However, subsequent analyses revealed I only needed 13 sites to estimate all population parameters of interest. Thus, I over-sampled. However, the estimates for sampling intensity remained valid, if that makes any sense.

Even though I had over-sampled in the first year of study, this approach enabled me to reduce from 40 to 13 the number of sites evaluated over future years of data acquisition. That in turn imparted significant savings of time, effort & money.