The very smart Steven Frank has an unusual and thought-provoking new paper out on “generative models vs. underlying symmetries to explain biological pattern”. As with several of his recent papers, it’s really applied philosophy. Steven Frank has some very deep and quite abstract ideas about evolutionary biology, and science more generally. But he’s very good at applying those ideas to concrete scientific questions.
This particular paper interested me for several reasons. Papers in which scientists are explicit about their philosophy of science always interest me. This particular paper hits on an important issue–the links between process and pattern–that’s near and dear to many ecologists, including Brian and me (see here, here, here, here, here, and here). But the paper’s in an evolution journal, so it’s one ecologists might miss. The example Frank uses to make his point is bang up to date: it’s one of the latest results from Rich Lenski’s long-term evolution experiment. And it’s a rare case where I disagree with Frank (in some respects). For all those reasons, I thought I would use a post to “think out loud” about this paper and hopefully spark an interesting conversation. It seems like a good way to revisit some issues we haven’t touched on in a while, while bringing in a fresh perspective.*
First, by way of background: Wiser et al. (2013) reported that the mean fitness of E. coli populations adapting to glucose-limited media in the lab has been increasing as a power law function of time for 50,000 generations. That’s a very surprising result–you’d think that at some point in a constant environment, fitness would reach the maximum possible value for that organism in that environment and stop increasing. But you’d be wrong (a power law has no asymptote). Wiser et al. also built a theoretical model showing that a combination of known and plausible biological mechanisms (clonal interference plus diminishing returns epistasis) reproduces the observed power law increase in mean fitness over time. Rich Lenski summarizes Wiser et al. here.
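To make the “no asymptote” point concrete, here’s a minimal numerical sketch (my illustration, not Wiser et al.’s actual fitting code) contrasting a power-law fitness trajectory of the general form w(t) = (1 + bt)^a with a saturating (hyperbolic) alternative. The parameter values are purely illustrative, not estimates from the LTEE data.

```python
def power_law_fitness(t, a=0.1, b=0.005):
    """Power-law trajectory: keeps rising indefinitely (no asymptote).
    Parameter values here are made up for illustration."""
    return (1 + b * t) ** a

def hyperbolic_fitness(t, a=0.7, b=5000):
    """Saturating trajectory: approaches an asymptote of 1 + a."""
    return 1 + a * t / (t + b)

for t in (2_000, 50_000, 1_000_000):
    print(t, round(power_law_fitness(t), 3), round(hyperbolic_fitness(t), 3))
```

The hyperbolic curve flattens out near its asymptote of 1 + a, while the power law continues to creep upward at any time horizon–which is why distinguishing the two empirically requires very long time series.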
Frank doesn’t think that Wiser et al. are wrong, exactly. But he does think they’re missing the forest for the trees, and that their data don’t provide a severe test of their theoretical model:
To evaluate the match between an observed pattern and a hypothesized process, mathematical models have become the standard in biology. Typically, one puts together a set of plausible assumptions about process and then studies the resulting model for how well it generates the target outcome. A successful match implies a plausible generative model of process…But does a successful generative model, by itself, really provide much information about underlying process? Probably not. The more commonly a pattern is observed, the more important it is to understand the underlying process. At the same time, it is almost always true that the more common a pattern, the greater the number of underlying generative models that match the pattern. The simple law of nature is that the commonness of a pattern associates with the number of distinctive underlying processes that lead to that pattern (Jaynes, 2003). Put another way, it is overwhelmingly easy to make a generative model that matches a simple, common pattern, but the match provides little information about the true underlying process (Frank, 2009).
Frank goes on to argue that a pattern in one’s data only reveals information about the “symmetries” in the underlying processes. Any generative model capable of reproducing the observed pattern falls in the same symmetry class. Here, any biological model that generates a power law increase in fitness over time can reproduce the data of Wiser et al. 2013, so their data provide no reason to prefer their model over any of the many alternatives.
Further, if our goal is to explain the observed pattern, then the explanation must necessarily lie in the symmetries that are common to all generative models capable of reproducing the pattern. Differences among those generative models are just irrelevant details, because we can change them without changing the predicted pattern. Indeed, paying attention to such details leads to confusion and mistakes (e.g., to pointless arguments about which generative model “really” explains the pattern of interest, when the correct answer is “any generative model with the right symmetries”). Differences in detail among generative models matter only for explaining deviations from the main pattern. For instance, different generative models predicting a power law increase in mean E. coli fitness over time might differ with respect to their predictions about the variance in fitness among replicate lines.
Frank makes an analogy to the central limit theorem and Gaussian (normal) distributions. The central limit theorem applies to many, many different generative models, which is why Gaussian distributions are so common in nature. Gaussian distributions are common because they’re hard to avoid. So when we see a Gaussian distribution in our data, we aren’t ordinarily inspired to figure out the details of the underlying generative model, since those details aren’t ordinarily very important or interesting. Just summarize the distribution with a mean and a variance and be done with it. Frank’s argument is that we should take this same attitude in many other situations, since many other patterns–like the power law relationship between mean fitness and time in Wiser et al.–also can arise in many different ways, the details of which are of at best secondary interest.
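A toy simulation makes the analogy vivid (this is my illustration, not from Frank’s paper): sums of many independent draws look Gaussian whether the underlying “generative model” is a uniform distribution, an exponential, or a coin flip. The bell-curve pattern by itself tells you almost nothing about which process produced it.

```python
import random
import statistics

random.seed(1)

def sum_of_draws(draw, n_terms=200, n_reps=5000):
    """Distribution of sums of n_terms i.i.d. draws from `draw`."""
    return [sum(draw() for _ in range(n_terms)) for _ in range(n_reps)]

# Three very different "generative models":
generative_models = {
    "uniform": lambda: random.random(),
    "exponential": lambda: random.expovariate(1.0),
    "coin flip": lambda: random.choice((0, 1)),
}

results = {}
for name, draw in generative_models.items():
    sums = sum_of_draws(draw)
    mu, sd = statistics.mean(sums), statistics.stdev(sums)
    # Crude normality check: ~68% of a Gaussian lies within one SD of the mean.
    results[name] = sum(abs(s - mu) <= sd for s in sums) / len(sums)
    print(f"{name:>11}: fraction within 1 SD of the mean = {results[name]:.3f}")
```

All three fractions come out close to the Gaussian value of 0.68, despite the parent distributions having radically different shapes.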
Frank illustrates this general argument by using a branch of mathematics known as extreme value theory to derive a model that captures the symmetries in the Wiser et al. data and so defines the essential features shared by all generative models consistent with the data. In contrast to Wiser et al.’s model, which assumes clonal interference and diminishing returns epistasis, Frank’s model makes very minimal biological assumptions, instead relying on the statistical properties of any distribution of rare, “extreme” events (here, the appearance of beneficial mutations).
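Here’s a rough numerical illustration (my sketch, not Frank’s derivation) of why extreme value theory occupies a “privileged position” analogous to the central limit theorem: the sizes of rare, extreme events–here, the excess above a high threshold–settle into roughly the same near-exponential form for quite different parent distributions. For an exponential parent the excess distribution is exactly exponential (so its std/mean ratio is 1); other light-tailed parents approach that form as the threshold rises.

```python
import random
import statistics

random.seed(2)

def excess_cv(draw, n=100_000, quantile=0.99):
    """Std/mean ratio of excesses above the empirical `quantile` threshold.

    For an exponential distribution this ratio is exactly 1; ratios near 1
    signal an approximately exponential excess distribution.
    """
    xs = sorted(draw() for _ in range(n))
    threshold = xs[int(n * quantile)]
    excesses = [x - threshold for x in xs if x > threshold]
    return statistics.stdev(excesses) / statistics.mean(excesses)

# Three light-tailed parent distributions with quite different shapes:
parents = {
    "exponential": lambda: random.expovariate(1.0),
    "normal": lambda: random.gauss(0.0, 1.0),
    "gamma(k=2)": lambda: random.gammavariate(2.0, 1.0),
}

cvs = {name: excess_cv(draw) for name, draw in parents.items()}
for name, cv in cvs.items():
    print(f"{name:>12}: std/mean of excesses = {cv:.2f}")
```

The details of the parent distribution wash out in the tail–which is the sense in which only the “symmetries” shared by the generative models survive in the observable pattern.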
I don’t have a fully worked-out response to Frank’s very interesting paper, but here are my thoughts. Hopefully others who’ve read the paper (or who are inspired to do so by my post) can chime in with their own thoughts.
- There are obvious similarities between Frank’s views and those of MaxEnt advocates like John Harte. Also similarities to the views of macroecologists like John Lawton. But I’m not sure how deep the similarities go. In particular, unless I’ve misunderstood something, Frank’s argument in this paper isn’t quite a MaxEnt-type argument about how we should expect to observe the “macrostates” that correspond to the greatest number of equally-probable “microstates”. Because in MaxEnt, we imagine that the world is constantly changing from one microstate to another, with the macrostate remaining unchanged. In contrast, it seems odd (at least to me) to imagine the world constantly switching from one generative model to another. Or perhaps I’m slightly misreading Frank here and he intends an analogy to other arguments for a MaxEnt-type philosophy (there are other, subtly but importantly different arguments)?
- It’s interesting to read Frank’s paper alongside Bill Wimsatt’s wonderful essay on “false models as means to truer theories”. In particular, compare #9 on Wimsatt’s list of “productive uses of false models” to Frank’s paper.
- I agree with Frank’s general point that it’s really important for you to know what class of generative models is consistent with your data (see this old post). I also agree with his general point that, when all you know is that many different generative models are consistent with some bit of data, there’s no overriding reason to prefer one of those models over the others (e.g., on grounds of “simplicity” or because you’ve designated one of them as your “null” model). Ecologists haven’t always taken these points on board, unfortunately. For instance, see this old paper of Brian’s for discussion in the context of research on species abundance distributions.
- But I disagree that that’s all that can be said, or at least all that it’s important to say. Frank takes as given that the focus is on explaining a single pattern in the data. (UPDATE: I now think the previous two sentences reflect a slight misreading of Frank on my part. His focus for purposes of this paper is on explaining a single pattern in the data, but I shouldn’t have suggested that that’s all he thinks there is to linking models and data, processes and patterns. This one paper is by no means a full statement of Frank’s view on how to link models and data.) But that’s rarely the case in science, or at least it should rarely be the case. For instance, the power law increase in mean fitness over time is far from the only striking result from Rich Lenski’s long term evolution experiment. There are repeatable patterns in the evolution of mutation rates. There’s long-term coexistence of different competing clones via negative frequency-dependent selection. There’s evolution of evolvability. One of the lines evolved a novel function (ability to grow on citrate). Etc. And those various results are interconnected. For instance, to explain the evolution of fitness, you need to know something about mutation rates–but to explain the observed mutation rates (which themselves evolve), you need to know something about the evolution of fitness. As another example, epistasis can explain both the power law increase in mean fitness over time, and evolution of evolvability. So it’s true that many different generative models often will be consistent with any given pattern in the data–but those various generative models typically will make different predictions about other features of the data. So if you want to infer the model that generated your data from among a set of alternatives, you ideally should consider all the predictions (and assumptions) of those models, not just their predictions about one particular feature of your data. 
I don’t know that Frank would deny this, but it’s not something he talks about. And I think he should have. If, like Frank, you’re keen to keep people from over-interpreting a match between their favorite generative model and one particular pattern in the data, then I think the way to do that is to get people to broaden their focus to explaining all the features of their data. Frank actually does the opposite, at least implicitly–he encourages a focus on one pattern at a time, so as to identify the shared features of all generative models capable of producing that particular pattern. See this old paper of Brian’s for an ecological example illustrating my point here: research on species abundance distributions had been held back by researchers’ single-minded focus on explaining just the shape of the species-abundance distribution.
- But for the sake of argument, let’s take for granted that we do indeed only care about explaining a single, simple pattern in our data. I’m not sure, but I think I’d still deny that all we care about is identifying the class of generative models consistent with the data. I still want to identify the one generative model from that class that actually did generate the data. Even if the reason why the true generative model generated that pattern is “the true model has symmetries X and Y, which are shared by many other models”. And even if the best way to discover that reason is to discover that many other generative models also have symmetries X and Y. As a scientist, I want to know how the world really is. That other, hypothetical worlds would behave just like the real world is very useful for me to know, but only as a means to the end of helping me understand the real world. Again, I’m not sure if Frank would push back against this, but it kind of sounds like he might.
- Another reason why we want to know the true generative model, even if we’re only interested in a single, simple pattern in our data, is to be able to predict and explain changes in the pattern. For instance, Wiser et al.’s generative model doesn’t just reproduce the observed power law increase in mean fitness over time. It also provides a mechanistic explanation for why the observed power law has the parameter values it does. So it can explain why E. coli lines that evolved high mutation rates also exhibited more rapid (but still power law) increases in mean fitness over time. In general, one reason why we want mechanistic models rather than just statistical-phenomenological ones is to be able to predict and explain changes in the parameter values of statistical-phenomenological models.
- In passing, Frank makes an interesting claim that the more complicated and realistically-detailed the generative model, the more strongly it will display some simple pattern characteristic of all generative models in its symmetry class. The various realistic complications act like a bunch of random “perturbations” that all end up averaging away or cancelling one another out. Again, echoes here of arguments for MaxEnt, but I’m not sure the argument is exactly the same.
- I’m still not entirely sure what’s meant by a “generative model”. For instance, I’m not sure if Frank’s own extreme value theory model is supposed to be a generative model in its own right, or whether it’s just a way to reveal the symmetries characterizing any generative model consistent with the Wiser et al. data. On the one hand, Frank emphasizes the “privileged position” of extreme value theory, much like the central limit theorem. But on the other hand, his own model starts from non-trivial biological assumptions (e.g., adaptation via sequential fixation of beneficial mutations, constant mutation rate), and he notes that other applications of extreme value theory in evolutionary genetics have made different biological assumptions leading to different predictions.
- A thought/question: in this and other papers Frank emphasizes that simple, strong patterns arise–indeed, only arise–when many different generative models can produce the pattern. The Gaussian distribution is a canonical example, and Frank argues that power law relationships between variables should be regarded as another strong “statistical attractor”. So here’s my question: can many different generative models produce humped relationships between two variables? Put another way, is there an analogue of the central limit theorem or extreme value theory for humped relationships between two variables? I don’t know that there is, not even for simple parametric forms like a concave-down quadratic. I ask because ecologists often have claimed, on the basis of specific generative models, that humped relationships between variables are to be expected. Think of the intermediate disturbance hypothesis, or the expectation of humped diversity-productivity relationships. Off the top of my head, all such “humped” hypotheses have terrible empirical track records–the predicted “humped” pattern is more the exception than the rule, and even when it’s observed it’s usually really messy. Much messier than, say, power law body size allometries or other truly strong and general ecological patterns. Perhaps that’s because humped relationships between variables can only be generated by a small number of quite specific generative models, so that humped patterns are fragile and easily destroyed by even slight changes to model assumptions. The same argument could be made for multimodal frequency distributions, I think (e.g., it’s sometimes claimed that the frequency distribution of species’ body sizes is, or should be expected to be, “clumpy”, i.e. multimodal). I don’t know that there’s any equivalent of the central limit theorem for multimodal frequency distributions.
If this line of thought is right, it suggests to me that ecologists ought to quit paying so much attention to purported humped patterns (and multimodal patterns) and the generative models proposed to explain them. Instead, we should have a very strong “prior” that purported humped patterns (and multimodal patterns) aren’t going to be very clear-cut or general. We should also have a prior that the generative models proposed to explain humped patterns (and multimodal patterns) usually will have their predictions swamped by all sorts of other factors. Very curious to hear what folks think of this line of thought. Are there any really strong, general humped (or multimodal) patterns in ecology that I’ve forgotten about? And if not, maybe that’s a signal that we shouldn’t expect humped or multimodal patterns to exist, and should quit paying so much attention to theories predicting humped or multimodal patterns?**
*Don’t think of this as “post-publication review” of Frank’s paper. I think his paper is excellent, I’m really glad it was published, and I don’t think it needs any changes. That I don’t entirely agree with it doesn’t mean I think it’s flawed or that it needs to be corrected.
**Please don’t say that bivariate relationships commonly have “humped upper bounds” and that that’s really interesting and important. That’s just a trivial consequence of plotting one unimodal variable against another. Even if two unimodally distributed variables are independent of one another, a plot of one against the other will have a humped upper bound.
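A quick simulation of this footnote’s point (my illustration, not from the post): scatter two independent, unimodally distributed variables against each other and the upper bound of the cloud is humped anyway, simply because the middle x-bins contain far more points and therefore sample the upper tail of y more thoroughly.

```python
import random

random.seed(3)

n = 100_000
max_y, counts = {}, {}
for _ in range(n):
    x = random.gauss(0.0, 1.0)
    y = random.gauss(0.0, 1.0)  # independent of x by construction
    b = round(x)                # bin x to the nearest integer
    max_y[b] = max(max_y.get(b, float("-inf")), y)
    counts[b] = counts.get(b, 0) + 1

# The maximum y per bin peaks in the middle bins, despite zero association
# between x and y -- a humped upper bound with no process behind it.
for b in range(-3, 4):
    print(f"x ~ {b:+d}: max y = {max_y[b]:5.2f}  (n = {counts[b]})")
```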