In an old post I talked about how the falsehood of our models is often a feature, not a bug. One of the many potential uses of false models is as a baseline: you compare the observed data to data predicted by a baseline model that incorporates some factors, processes, or effects known or thought to be important. Any differences imply that there’s something going on in the observed data that isn’t included in your baseline model.* You can then set out to explain those differences.
Ecologists often recommend this approach to one another. For instance (and this is just the first example that occurred to me off the top of my head), one of the arguments for metabolic theory (Brown et al. 2004) is that it provides a baseline model of how metabolic rates and other key parameters scale with body size:
> The residual variation can then be measured as departures from these predictions, and the magnitude and direction of these deviations may provide clues to their causes.
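To make that concrete, here’s a minimal sketch (in Python, with simulated data, so the numbers are purely illustrative) of what measuring “departures from these predictions” amounts to: take the canonical 3/4-power scaling law as the baseline model and compute residuals of observed metabolic rates from it.

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated "observed" data: body masses and metabolic rates scattered
# around a 3/4-power law; b0 is an arbitrary normalization constant.
mass = 10 ** rng.uniform(0, 6, size=200)   # 1 g to 1,000 kg
b0 = 0.02
rate = b0 * mass ** 0.75 * rng.lognormal(0.0, 0.3, size=200)

# Baseline model: log metabolic rate predicted under exact 3/4 scaling.
predicted = np.log(b0) + 0.75 * np.log(mass)

# Residuals are the departures from the baseline. On the "explain the
# deviations" program, the size and direction of these residuals (which
# species sit above or below the line, and by how much) are the
# patterns that further work is supposed to explain.
residuals = np.log(rate) - predicted
print(f"mean residual: {residuals.mean():+.3f}")
print(f"sd of residuals: {residuals.std():.3f}")
```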
Other examples abound. One of the original arguments for mid-domain effect models was as a baseline model of the distribution of species richness within bounded domains: only patterns of species richness that differ from those predicted by a mid-domain effect “null” model require any ecological explanation in terms of environmental gradients, or so it was argued. The same argument has been made for neutral theory: we should use neutral theory predictions as a baseline and focus on explaining any observed deviations from those predictions. Same for MaxEnt. I’m sure many other examples could be given; please share yours in the comments!
This approach often gets proposed as a sophisticated improvement on treating baseline models like statistical null hypotheses that the data will either reject or fail to reject. Don’t just set out to reject the null hypothesis, it’s said. Instead, use the “null” model as a baseline and explain deviations of the observed data from that baseline.
Which sounds great in theory. But here’s my question: how often do ecologists actually do this in practice? Not merely document deviations of observed data from the predictions of some baseline model (many ecologists have done that), but then go on to explain them? Put another way, when have deviations of observed data from a baseline model ever served as a useful basis for further theoretical and empirical work in ecology? When have they ever given future theoreticians and empiricists a useful “target to shoot at”?
Off the top of my head, I can think of only a few examples. And tellingly, to my mind, in most (not all) of those examples the baseline models were very problem-specific. For instance, there’s Gary Harrison’s (1995) wonderful use of a nested series of baseline models to explain Leo Luckinbill’s classic predator-prey cycles dataset: the simplest baseline model explains certain features of the data, a second baseline model is then introduced to explain additional features, and so on (with additional validation steps along the way to avoid overfitting). Or think of the many “random draws” experiments on plant diversity and total plant biomass that use the Loreau & Hector (2001) null model as a baseline to subtract out sampling effects, going on to partition the deviations from the null model into effects of “complementarity” and “selection” (sketched below). And in the context of allometric scaling, I’m sure there’s work on why particular species or phylogenetic groups deviate as they do from baseline allometric relationships (e.g., higher vertebrates have larger brains for their body size than lower vertebrates).
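For the Loreau & Hector case, the partition is simple enough to sketch in a few lines. Here’s a minimal Python version with invented yields (and the usual assumption of equal expected relative yields): the net deviation from the sampling-effect baseline splits exactly into a complementarity term and a selection term.

```python
import numpy as np

M = np.array([120.0, 80.0, 50.0, 30.0])   # monoculture yields, one per species
Yo = np.array([45.0, 25.0, 20.0, 10.0])   # observed yields of each species in mixture
N = len(M)

RYe = np.full(N, 1.0 / N)   # expected relative yields (equal planting proportions)
RYo = Yo / M                # observed relative yields
dRY = RYo - RYe             # deviations in relative yield

net = Yo.sum() - (RYe * M).sum()                 # net biodiversity effect
complementarity = N * dRY.mean() * M.mean()      # average overyielding across species
selection = N * np.cov(dRY, M, bias=True)[0, 1]  # covariance of overyielding with yield

print(f"net effect:      {net:+.2f}")
print(f"complementarity: {complementarity:+.2f}")
print(f"selection:       {selection:+.2f}")
# The two components sum to the net effect (up to floating-point error).
assert np.isclose(net, complementarity + selection)
```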
But in most cases I can think of in ecology where someone’s proposed some “generic” null model like MaxEnt or neutral theory, or some null model based on constrained randomization of the observed data, it hasn’t turned out to be very productive to try to explain deviations of the observed data from the null model. All we usually end up with is a list of cases in which the data either do or don’t match the null model, with no obvious rhyme or reason to the occurrence, size, or direction of those deviations. See, e.g., Xiao et al. 2015 for MaxEnt models of species-abundance and species-size distributions. In general, deviations of observed data from the predictions of some generic “null” model do not seem to be a very good source of stylized facts.
Assuming for the sake of argument that I’m right about this, why is that? I honestly don’t know, but I can think of a few possibilities:
- Our baseline models don’t correctly capture all and only the effects of the processes or factors they purport to capture, so deviations of the observed data from them aren’t interpretable. I think that’s usually what’s going on in cases where the baseline model is some constrained randomization of the observed data (see the sketch after this list).
- Multicausality. It’s hard to build a baseline model that subtracts out all and only the effects of processes A & B if processes A & B are far from the only ones that matter. Indeed, insofar as many processes matter, we should expect any patterns in our data to take the form of “statistical attractors” that exist independently of the nature and details of those processes. More subtly, even if there are just one or two dominant processes that we can capture with a baseline model, the deviations of the observed data from the baseline model are going to be hard to interpret unless they too are dominated by one or two processes.
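To illustrate the first possibility, here’s a sketch of the kind of constrained randomization I have in mind: the classic “checkerboard swap” null model, which shuffles a site-by-species presence/absence matrix while holding every row and column total fixed, then compares an observed co-occurrence statistic (Stone & Roberts’ C-score) to its null distribution. The matrix is a toy example; real applications use many more sites and species.

```python
import numpy as np

rng = np.random.default_rng(0)

def swap_randomize(m, n_swaps=2000):
    """Randomize a 0/1 matrix while preserving all row and column sums."""
    m = m.copy()
    n_rows, n_cols = m.shape
    for _ in range(n_swaps):
        r = rng.choice(n_rows, size=2, replace=False)
        c = rng.choice(n_cols, size=2, replace=False)
        sub = m[np.ix_(r, c)]
        # A swap is only legal on a "checkerboard" 2x2 submatrix.
        if sub[0, 0] == sub[1, 1] and sub[0, 1] == sub[1, 0] and sub[0, 0] != sub[0, 1]:
            m[np.ix_(r, c)] = 1 - sub
    return m

def c_score(m):
    """Mean number of checkerboard units over all species pairs."""
    total, pairs = 0, 0
    for i in range(m.shape[1]):
        for j in range(i + 1, m.shape[1]):
            shared = (m[:, i] & m[:, j]).sum()
            total += (m[:, i].sum() - shared) * (m[:, j].sum() - shared)
            pairs += 1
    return total / pairs

# Toy site-by-species matrix (rows = sites, columns = species).
obs = np.array([[1, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 1]])

null = [c_score(swap_randomize(obs)) for _ in range(999)]
obs_stat = c_score(obs)
p = (1 + sum(s >= obs_stat for s in null)) / 1000
print(f"observed C-score: {obs_stat:.2f}, null p-value: {p:.3f}")
```

The point of the first bullet above is that even when this machinery runs fine, it’s often unclear what processes the preserved margins do and don’t “control for”, which is exactly why the resulting deviations are hard to interpret.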
If I’m right, then I think ecologists shouldn’t be so quick to recommend the approach of developing a baseline model and then explaining deviations of the data from it. And reviewers and readers should probably default to skepticism of this approach.
p.s. Nothing in this post is an argument against deliberately-simplified models that omit some processes, factors, or effects in order to focus on others. I’m just arguing that we should default to skepticism of one particular use of deliberately-simplified models. They have other uses (e.g.).
*Note that a lack of differences doesn’t imply that your baseline model is actually correct, or even a good approximation to the correct model. But leave that (common) scenario aside for purposes of this post.