Scientists—and indeed scholars in any field—often have to choose how wide a net to cast when attempting to define a concept, estimate some quantity of interest, or evaluate some hypothesis. Is it useful to define “ecosystem engineering” broadly so as to include any and all effects of living organisms on their physical environments, or does that amount to comparing apples and oranges?* Should your meta-analysis of [ecological topic] include or exclude studies of human-impacted sites? Can microcosms and mesocosms be compared to natural systems (e.g., Smith et al. 2005), or are they too artificial? As a non-ecological example that I and probably many of you are worrying about these days, are there any good historical precedents for Donald Trump outside the US or in US history, or is he *sui generis*? In all these cases and others, there’s no clear-cut, obvious division between relevant information and irrelevant information, between things that should be lumped together and things that shouldn’t be. Rather, there’s a fuzzy line, or a continuum. What do you do about that? Are there any general rules of thumb?

I have some scattered thoughts on this, inspired by the concept of “shrinkage” estimates in statistics:

- If you’re not familiar with the concept of “shrinkage” or the intuitions behind it, I *highly* recommend reading Efron (1977; link fixed), which is one of the best explainers I’ve ever read on *anything*. You really should click through and then come back, but if you insist on reading on, here’s a summary: Imagine that you’re estimating a bunch of independent or nearly-independent means. Efron uses the example of estimating the true batting averages of each of a bunch of baseball players, based on their observed batting averages in a small sample of games. But you could also think of, say, estimating the true expression levels of a bunch of different genes. Your best estimates, in the sense of lowest total squared error, are provided not by the sample means themselves, but by “shrinking” all the sample means towards the grand mean. The optimal amount of shrinkage depends on (i) the precision with which the sample means are estimated (less precise estimates get shrunk towards the grand mean more), (ii) how close the sample means are to one another (you shrink them less if they’re more spread out around the grand mean), and (iii) how many means you’re estimating (more means = more shrinkage). The intuition is that, if you have a bunch of sample means, just by chance some of them will happen to be overestimates of the corresponding true means, while others will happen to be underestimates. The more means you have, the greater the chances that some of them will be extreme over- or under-estimates. So you can reduce your overall error by biasing all of your estimated means towards the grand mean. That slight increase in bias buys you a big reduction in variance. Another way to put it is to say that each of the means gives you some information about what the true values of the *other* means are. You’re throwing that information away if you just use the unshrunken sample means as your estimates of the true means. You’re casting too narrow an evidentiary net. This is the intuition behind empirical Bayes methods, and it’s closely related to the intuition behind Bayesian methods involving prior information.
- A similar intuition arises in the context of regression, something I also learned from a Brad Efron paper. You can think of regression as using “indirect” as well as “direct” information to estimate the mean of the dependent variable Y conditional on the value of the predictor variable X. Your best estimate of the conditional mean of Y for any given value of X isn’t the mean of the observations of Y at that value of X (the “direct” information provided by the data). Indeed, you might not even have any observations of Y at that value of X! Rather, your best estimate of the conditional mean of Y for any given value of X depends on the estimated parameters describing the regression of the mean of Y on X (estimated slope and intercept for linear regression). And those estimated parameters of course depend on *all* the data. In other words, observations of Y for any given value of X give you indirect information about the true mean of Y at *other* values of X.
- A similar intuition underpins hierarchical modeling, the statistical estimation of different, hierarchically-nested sources of variation. For instance, here’s a nice intuitive example from baseball, a sport in which players’ unknown true batting averages vary among players who play the same position, and among positions (e.g., pitchers tend to be terrible hitters, whereas first basemen tend to be excellent hitters). You can minimize your total error by shrinking your estimates of each player’s hitting skill towards the mean for their position, and shrinking your estimated means for each position towards the grand mean. To do otherwise is to discard relevant information. That may seem puzzling: why should the batting average of any one player affect your estimate of the average of some other player, especially one who plays a different position? The answer is that every player’s batting average provides some information about the overall “batting environment” all players experience. They all play under the same rules, they all face the same pitchers, and so on.
- However, it’s not always the case that estimating more means provides more information about all of them. That’s for at least a couple of reasons. First, estimating more means might well require estimating each one with less precision, for instance because total sampling effort is finite. Second, and less obviously, reducing the total error associated with estimating a bunch of means is not the same as reducing the error with which any *particular* mean is estimated. As Efron (1977) shows, if you’re estimating a bunch of means, and you expand your dataset to include one or more genuinely atypical means (for a certain precise technical sense of “atypical”), then you can actually increase the error with which you estimate both the atypical means and the typical ones.
- One can address the issue of “atypicality” by estimating how heterogeneous one’s collection of means is. Meta-analysts often do this. It’s how they address concerns about which studies to include or exclude from the meta-analysis. Correct me if I’m wrong, but I believe the usual advice for meta-analysts is to err on the side of including a wider range of studies rather than a narrower range, and then estimate whether the means of those studies are in fact heterogeneous (i.e., do they differ more than would be expected from sampling error alone). So for instance, if you’re not sure whether studies from human-impacted ecosystems should be included in your ecological meta-analysis, your best bet is to include them and then test whether they’re really different from studies of non-impacted systems. That ensures you’re not unwittingly throwing away information.
- What I’m struggling with is how far these intuitions can be pushed. Can they be pushed beyond their original, strictly-statistical context? Do the intuitions in the previous bullets provide a general argument that one should always cast a wide evidentiary net?
- For instance, do these intuitions provide a knock-down argument for the relevance of microcosms and mesocosms to ecological research? Even if you only care about estimating what’s going on in nature, is there a strong *prima facie* case that you’re Doing It Wrong if you ignore information from microcosms and mesocosms on the grounds that they’re “unrealistic” or “different”? As long as microcosms have *something* in common with nature, isn’t that a *prima facie* argument for their relevance?
- I don’t think these intuitions can be pushed completely beyond their original statistical context without losing all force. For instance, I don’t think they tell you much about whether a broad concept like “ecosystem engineering” is scientifically useful or not. I think that depends on much more than the statistical precision with which one can estimate the effects of ecosystem engineers in a meta-analytic context.
- All this is connected to my old post on when one should focus on the average vs. on the variation around the average.
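The shrinkage recipe in the first bullet can be sketched in a few lines of code. Here’s a minimal simulation (all the numbers are invented for illustration, not taken from Efron’s baseball data) of the positive-part James-Stein estimator; note how less spread among the observed means produces more shrinkage, exactly as in point (ii) above:

```python
import random
import statistics

def james_stein(observed, sigma2):
    """Positive-part James-Stein: shrink each observed mean toward the grand mean."""
    k = len(observed)
    grand = statistics.fmean(observed)
    spread = sum((x - grand) ** 2 for x in observed)
    # Less spread among the observed means => more shrinkage (never "anti-shrink")
    factor = max(0.0, 1.0 - (k - 3) * sigma2 / spread) if spread > 0 else 0.0
    return [grand + factor * (x - grand) for x in observed]

random.seed(1)
sigma2 = 1.0                                  # known sampling variance of each observed mean
true_means = [random.gauss(0.0, 1.0) for _ in range(10)]

raw_loss = js_loss = 0.0
for _ in range(500):                          # average squared-error loss over repeated samples
    obs = [random.gauss(m, sigma2 ** 0.5) for m in true_means]
    shrunk = james_stein(obs, sigma2)
    raw_loss += sum((o - m) ** 2 for o, m in zip(obs, true_means))
    js_loss += sum((s - m) ** 2 for s, m in zip(shrunk, true_means))

print(f"raw: {raw_loss / 500:.2f}  shrunk: {js_loss / 500:.2f}")
```

In simulations like this one, the shrunken estimates come out with noticeably lower average total squared error than the raw sample means, despite being biased.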
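The “indirect information” point about regression is easy to demonstrate too. In the made-up data below there is no observation at X = 2.5, yet the fitted line, which depends on every data point, still yields an estimate of the mean of Y there:

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    b = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / \
        sum((x - xbar) ** 2 for x in xs)
    return ybar - b * xbar, b

# Made-up data: no observation at x = 2.5, yet every point informs the estimate there
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]
a, b = fit_line(xs, ys)
print(f"estimated mean of Y at X = 2.5: {a + b * 2.5:.2f}")
```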
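The two-level baseball example can be sketched as precision-weighted partial pooling. All the numbers below (the batting averages, the within- and between-group variances `sigma2` and `tau2`, the at-bat count) are made up for illustration:

```python
def pool(unit_mean, group_mean, n, sigma2, tau2):
    """Precision-weighted compromise between a unit's own mean and its group's mean."""
    w_data = n / sigma2       # how much to trust the unit's own data
    w_group = 1.0 / tau2      # how much to trust the group-level mean
    return (w_data * unit_mean + w_group * group_mean) / (w_data + w_group)

# Hypothetical observed batting averages (small samples), grouped by position
positions = {
    "pitcher": [0.095, 0.110, 0.140],
    "first base": [0.280, 0.295, 0.310],
}
grand_mean = (sum(sum(v) for v in positions.values())
              / sum(len(v) for v in positions.values()))

sigma2, tau2, n = 0.002, 0.001, 45    # made-up variances and at-bats per player
for pos, avgs in positions.items():
    pos_mean = sum(avgs) / len(avgs)
    # Position means themselves get shrunk a little toward the grand mean...
    pos_est = pool(pos_mean, grand_mean, len(avgs) * n, sigma2, tau2)
    for a in avgs:
        # ...and each player gets shrunk toward his (shrunken) position mean
        player_est = pool(a, pos_est, n, sigma2, tau2)
        print(f"{pos}: raw {a:.3f} -> pooled {player_est:.3f}")
```

Each pooled estimate is a convex combination, so it always lands between the player’s own average and his position’s, which is the “discard no information” logic of the bullet above.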
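The heterogeneity check that meta-analysts use can also be sketched. Cochran’s Q compares the spread of the study means to what sampling error alone would produce, and I² expresses the excess as a proportion; the effect sizes and variances below are invented, with one “human-impacted outlier” thrown in:

```python
def heterogeneity(effects, variances):
    """Cochran's Q and the I^2 heterogeneity statistic for a set of study means."""
    weights = [1.0 / v for v in variances]              # inverse-variance weights
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1                               # expected Q under homogeneity
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    return pooled, q, i2

# Hypothetical: three similar studies plus one atypical one
effects = [0.20, 0.30, 0.25, 0.90]
variances = [0.010, 0.020, 0.015, 0.010]
pooled, q, i2 = heterogeneity(effects, variances)
print(f"pooled = {pooled:.3f}, Q = {q:.1f} on {len(effects) - 1} df, I^2 = {i2:.0%}")
```

A Q far above its degrees of freedom (and an I² near 1) signals that the studies differ by more than sampling error alone, i.e., the wide net caught something genuinely heterogeneous.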

*Footnote for smart alecks who think that it’s fine to lump together apples and oranges if one’s question is about fruit: for “apples and oranges” read “apples, oranges, bricks, and Major League Baseball”.

Thank you for this really interesting presentation. The Efron (1977) link does not open though and I am convinced I should read it!!

Thanks, link fixed.

I’d like to see if the apples are all apples or not. I.e., is it appropriate to compare batting averages with microcosms? Are these two kinds of apple, and thus amenable to all “apple methods”, or is one an apple and the other something different, with many more dimensions? I’m not sure, but would love to see the arguments both ways.

Touché. 🙂

Hi Jeremy, interesting questions! A couple of quick comments: 1) At least some of the additional error when including an ‘atypical’ mean in an empirical Bayes model comes from not marginalizing over hyperparameters. For example, in Efron’s example of including the fraction of foreign cars in a batch of 16 batting-average means, the hyperparameter for group-level variance would gain a fair amount of uncertainty (in fact, proportional to the ‘mismatch’ between the atypical and typical means). 2) I think in general we want to shrink towards a group-level regression rather than a single mean. What I mean is that we evidently have additional information about how we’re structuring items into batches when we can identify something as “atypical” versus “typical”. In an ecological setting, you might have a random effect of species (say), but then you actually know something about species origin or whatever, and that should be included in the model. But I agree there is a kind of fuzzy conceptual line here.

Aha! Cheers for this, I was hoping for comments from people who know this stuff.