One (broad) definition of a “null” model is a model that deliberately omits or “nullifies” something in order to reveal what the world would be like in the absence of that something. Comparing the null model’s predictions to data is one way to make inferences about whatever the model omitted. For instance, if the null model’s predictions match the data, it’s tempting to infer that whatever was omitted or “nullified” doesn’t matter.
The validity of that inference can be and has been debated. But this post isn’t about that; it’s about the prior step of constructing the null model in the first place. How do you decide what your “null” model should “nullify”? I raise this question because I struggle to understand why many prominent null models in ecology retain what they retain, and omit what they omit.
Here’s what I’m struggling with: it’s often the case in ecology that any given factor or cause can affect any given response variable via multiple causal pathways. For instance, a predator might affect prey population dynamics both by killing prey, and by causing prey to hide more and feed less, thereby reducing the prey birth rate. So if you want your null model to omit some underlying factor or cause, you need to nullify all of its effects, via all pathways, on your response variable of interest.
But that’s often not what null models in ecology do. Instead, they eliminate only some of the effects of some underlying cause, propagated via only some causal pathways. I struggle to understand why you’d want to do that. The only reason I can see for doing that is as a way of testing whether certain causal pathways matter. But that’s rarely the reason proffered for using a null model.
Randomized null models that purport to test for effects of interspecific competition on species x sites matrices are perhaps the most famous (infamous?) example. You have a data matrix that tells you which species are present at which sites, and you ask whether some feature of that matrix (say, the number of sites at which two species co-occur) changes if you randomize which sites harbor which species, but don’t change the number of species at each site and the number of sites at which each species occurs. In other words, the randomization removes any effect of competition or other species interactions, except those that affect the number of species at each site and the number of occurrences of each species. That is, the randomization almost certainly retains some really important effects of interspecific competition.* Colwell and Winkler (1984) called this unavoidable retention of certain effects of competition the “Narcissus effect”. It’s a problem because randomized null models of species x sites matrices were intended to eliminate all effects of interspecific competition, not just some.
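To make this concrete, here’s a minimal sketch (my own toy version, function name and defaults are mine) of the usual “fixed-fixed” randomization: repeated 2x2 “checkerboard” swaps that shuffle which species co-occur where, while leaving every row sum (occurrences per species) and column sum (richness per site) untouched.

```python
import numpy as np

def swap_randomize(matrix, n_swaps=10_000, rng=None):
    """Randomize a binary species x sites matrix by 2x2 'checkerboard'
    swaps, preserving every row sum (occurrences per species) and
    column sum (richness per site)."""
    rng = np.random.default_rng(rng)
    m = matrix.copy()
    n_rows, n_cols = m.shape
    for _ in range(n_swaps):
        r = rng.choice(n_rows, size=2, replace=False)
        c = rng.choice(n_cols, size=2, replace=False)
        sub = m[np.ix_(r, c)]
        # Only a checkerboard submatrix ([[1,0],[0,1]] or [[0,1],[1,0]])
        # can be flipped without changing any row or column sum.
        if sub[0, 0] == sub[1, 1] and sub[0, 1] == sub[1, 0] and sub[0, 0] != sub[0, 1]:
            m[np.ix_(r, c)] = 1 - sub
    return m
```

The row- and column-sum preservation is built right into the swap rule, which is exactly the point: whatever competition did to richness per site and occurrences per species, this null carries along unchanged.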
In my view, Robert Colwell later fell prey to the Narcissus effect himself when he proposed his “mid-domain effect” null model of species richness gradients. The goal of this null model was to try to eliminate all effects of environmental gradients on species’ geographic distributions, leaving only effects that arise from the fact that species are distributed within “hard” boundaries (e.g., the seashore is a hard boundary beyond which the geographic range of a terrestrial species cannot extend). Colwell’s null model randomized the positions of species’ observed geographic ranges within a bounded domain, and found that species richness peaked in the center of the domain, much as species richness peaks near the equator along the pole-to-pole latitudinal gradient. This “mid-domain effect” occurs because large geographic ranges have to be placed so as to overlap the middle of the “domain”, otherwise they’d overflow the boundaries. But the trouble is (and I’m far from the first or only person to point this out), the null model retains all effects of environmental gradients on species’ geographic range sizes. Environmental conditions don’t just affect where species’ geographic ranges are centered, they also affect where the boundaries of those ranges are located, thus affecting the sizes of species’ ranges. So what the mid-domain effect “null” model reveals is not, or not just, the effects of hard boundaries on species richness gradients. Rather, the “null” model reveals the effect of environmental conditions on species richness gradients, via their effects on geographic range sizes.
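For concreteness, here’s a rough sketch of the kind of randomization the mid-domain model performs (a toy version of mine, not Colwell’s actual code): keep the observed range sizes, randomize range positions within a bounded unit domain, and tally expected richness along the domain. Because large ranges are forced to overlap the middle, richness peaks at mid-domain.

```python
import numpy as np

def mid_domain_richness(range_sizes, n_points=101, n_reps=200, rng=None):
    """Randomly place ranges of the given sizes within the unit domain
    [0, 1] (each range must fit entirely inside the hard boundaries) and
    return expected species richness at each point along the domain."""
    rng = np.random.default_rng(rng)
    points = np.linspace(0, 1, n_points)
    richness = np.zeros(n_points)
    for _ in range(n_reps):
        for size in range_sizes:
            # The left edge is uniform on [0, 1 - size], so large ranges
            # can only be placed where they overlap the domain's middle.
            left = rng.uniform(0, 1 - size)
            richness += (points >= left) & (points <= left + size)
    return richness / n_reps
```

Notice what’s held fixed: `range_sizes` are taken straight from the data. If environmental gradients shaped those sizes, their effects ride along inside the “null” prediction.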
I worry about this issue in contexts besides null models based on data randomization. For instance, it’s increasingly popular to use “MaxEnt” (“maximum entropy”) as a null model in ecology. Roughly speaking, MaxEnt is a mathematical technique for choosing the “simplest” or “smoothest” statistical distribution of the data consistent with some specified constraint(s). For instance, FODE Ethan White recently used MaxEnt to predict the shapes of species-abundance distributions, taking as constraints the observed species richness and total abundance values (White et al. 2012). People who use MaxEnt seem to find those sorts of constraints quite innocuous, but I’m not so sure. Surely the total number of species at a site, the total abundance of those species, and the shape of the species-abundance distribution, all are effects of whatever underlying causes determine species’ birth, death, and movement rates, right? Similarly, there’s a large class of food web topology models which takes the numbers of species and predator-prey links as given, and then uses ecological assumptions to decide how those links are likely to be arranged (i.e. “who eats whom”). But surely the same underlying causal factors that determine who eats whom (e.g., predator foraging decisions) also determine how many species and feeding links there are. (And before you say it, yes, I know these food web models aren’t ordinarily considered “null” models, but the issue is the same.) Other examples could be given.
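To show what taking S and N “as given” looks like in practice, here’s a deliberately simplified sketch (not White et al.’s actual METE machinery, which predicts a logseries): the maximum-entropy distribution of a species’ abundance over 1..N, constrained only so that mean abundance equals N/S. The solution is a truncated geometric, P(n) ∝ exp(-λn), with λ found numerically by bisection.

```python
import numpy as np

def maxent_sad(S, N):
    """Maximum-entropy distribution of a species' abundance n = 1..N,
    constrained only so that mean abundance equals N/S. The solution is
    a truncated geometric, P(n) ~ exp(-lam * n); lam is found by
    bisection (larger lam -> smaller mean abundance)."""
    n = np.arange(1, N + 1)
    target = N / S

    def mean_abund(lam):
        x = -lam * n
        w = np.exp(x - x.max())  # subtract the max to avoid overflow
        return (n * w).sum() / w.sum()

    lo, hi = -1.0, 1.0
    while mean_abund(lo) < target:   # make lo negative enough (mean near N)
        lo *= 2
    while mean_abund(hi) > target:   # make hi positive enough (mean near 1)
        hi *= 2
    for _ in range(200):
        mid = (lo + hi) / 2
        if mean_abund(mid) > target:
            lo = mid
        else:
            hi = mid
    x = -mid * n
    w = np.exp(x - x.max())
    return w / w.sum()
```

The constraints S and N do all the work here; the “null” prediction is only as causally innocent as those constraints are.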
Note that not every “null” model in ecology suffers from this problem. For instance, neutral models in population genetics and community ecology really do omit all effects of selection, because the selection coefficients in these models are all set to zero. And writing recently in Nature, Storch et al. compare observed species-area and endemics-area curves to those predicted by four different “null” models, each of which retains different features of the observed data while omitting others. This seems like a nice way to figure out how the observed data could, or could not, have been generated.
Don’t misunderstand me: we always have to take something as given, as exogenous. I have no problem with that. But it seems kind of weird to me for that exogenous something to be, not some particular causal factor like selection, but only some of the causal pathways by which a given causal factor affects the response variable of interest. I think if you’re going to omit only certain causal pathways from your null model, you ought to say up front that that’s what you’re doing, and explain why you’re doing it. Why retain the pathways you retained, and omit the ones you omitted? Ideally, I think your answer to this question should be principled rather than pragmatic (e.g., “I couldn’t figure out how to omit certain causal pathways, so I retained them” isn’t a very compelling answer. Neither is “I retained certain pathways because that’s what everybody else does.”)
*If you don’t believe me, try this exercise: write down a spatial competition model, such as a spatial Lotka-Volterra model, in which the strength of interspecific competition is fully specified by some parameter or parameters that don’t affect any other features of the system. The model can have any other features you like–any sort of spatial variation, variation among species, etc. Use the model to simulate species’ dynamics across a bunch of sites, thereby generating a simulated species x sites matrix. See if you can create a situation in which, as you dial the interspecific competition parameter(s) down to zero, you change species’ co-occurrences without changing the species richness of any site or the number of sites at which any species occurs. I’ll bet you either can’t do it, or you can only do it by making very specific, and probably very strange, assumptions.
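If you want a starting point for that exercise, here’s a bare-bones toy version (all parameter values and the function name are arbitrary choices of mine): a spatial Lotka-Volterra competition model in which a single coefficient `alpha` fully specifies interspecific competition and affects nothing else, simulated to produce a species x sites presence/absence matrix.

```python
import numpy as np

def simulate_matrix(alpha, n_sites=20, n_species=5, t_max=200.0, dt=0.1, seed=0):
    """Toy spatial Lotka-Volterra competition. Each site has its own
    (randomly drawn) carrying capacities; the single parameter alpha sets
    the strength of interspecific competition and affects nothing else.
    Returns a binary species x sites presence/absence matrix."""
    rng = np.random.default_rng(seed)
    K = rng.uniform(0.5, 1.5, size=(n_species, n_sites))  # spatial variation
    N = np.full((n_species, n_sites), 0.1)                # initial abundances
    for _ in range(int(t_max / dt)):
        # Each species feels its own density plus alpha times everyone else's.
        comp = N + alpha * (N.sum(axis=0) - N)
        N = np.clip(N + dt * N * (1 - comp / K), 0, None)  # Euler step
    return (N > 0.01).astype(int)
```

Compare `simulate_matrix(alpha=0.9)` to `simulate_matrix(alpha=0.0)`: in my runs, dialing competition down to zero changes not just who co-occurs with whom, but also the row and column sums (occurrences per species, richness per site), which is the footnote’s point.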