As a complement to the previous post on “why do mathematical modeling”, I thought it would be fun to compile a list of all the reasons why one might conduct an experiment. But I am lazy* (though not as lazy as this man), and so rather than compiling my own list I’ll share the list from Wootton and Pfister 1998 (in Resetarits and Bernardo’s nice Experimental Ecology book).
To see what happens. At its simplest, an experiment is a way of answering questions of the form “What would happen if…?” Such experiments often are conducted simply out of curiosity. This sort of experiment teaches you something about how the system works that you couldn’t have learned through observation, it gives you a starting point for further investigation (e.g., you can develop a model and/or do follow-up experiments to explain what happened), and it can be of direct applied relevance (e.g., if you want to know what effect trampling has on a grassland you’re trying to conserve, go out and trample on randomly-selected bits of it).
There are limitations to such experiments, of course. Because they’re conducted without any hypothesis in mind, they’re typically difficult or impossible to interpret in light of existing hypotheses. And on their own, they don’t provide a good foundation for generalization (e.g., would the experiment come out the same way if you repeated it under different conditions, or in a different system?).
Interestingly, Wootton and Pfister suggest that experiments conducted just to see what happens are most usefully conducted in tractable model systems about which we already know a fair bit (analogous to developmental biologists focusing their experiments on C. elegans and a few other model species). They worry that curiosity-driven experiments, conducted haphazardly across numerous systems, leave us not only with a very incomplete understanding of any given system, but with no basis for cross-system comparative work. This illustrates how the decision as to what kind of experiment to conduct often is best made in the context of a larger research program, an issue to which I’ll return at the end of the post.
As a means of measurement. These experiments are conducted to measure the quantitative relationship between two variables. Feeding trials to measure the shape of a consumer’s functional response are a common example: you provide individual predators with different densities of prey, and then plot predator feeding rate as a function of prey density. These experiments are a good way of isolating the relationship between two variables. For instance, in nature a predator’s feeding rate will depend on lots of things besides prey density, including some things that are likely confounded with prey density, making it difficult or impossible to use observational data to reliably estimate the true shape of the predator’s functional response. Or, maybe prey density just doesn’t vary that much in nature, so in order to measure how predator feeding rate would vary if prey density were to vary (which of course it might in future), you need to experimentally create variation in prey density. This is an example of a general principle: in order to learn how natural systems work, we’re often forced to create unnatural conditions (i.e. conditions that don’t currently exist, and may never exist or have existed).
Of course, the challenge with these experiments is to make sure that the controls needed to isolate the relationship of interest don’t also distort the relationship of interest. For instance, feeding trials conducted in small arenas are infamous for overestimating predator feeding rates because prey have nowhere to hide, and because prey and predators behave differently in small arenas than they do in nature.
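(For the curious: here’s a minimal sketch of what the “measurement” step might look like once you have feeding-trial data in hand, assuming a Holling type II functional response. The prey densities, feeding rates, and fitted parameter names below are purely illustrative assumptions of mine, not data or code from Wootton and Pfister or from any real trial.)

```python
# Minimal sketch: fitting a Holling type II functional response,
# f(N) = a*N / (1 + a*h*N), to hypothetical feeding-trial data.
# a = attack rate, h = handling time; all numbers are made up for illustration.
import numpy as np
from scipy.optimize import curve_fit

def holling_type_II(N, a, h):
    """Predator feeding rate as a function of prey density N."""
    return a * N / (1 + a * h * N)

# Hypothetical feeding-trial results: prey density offered vs. prey eaten per unit time
prey_density = np.array([2, 4, 8, 16, 32, 64, 128])
feeding_rate = np.array([1.8, 3.1, 5.2, 7.6, 9.4, 10.8, 11.3])

# Fit the curve; p0 gives rough starting guesses for a and h
(a_hat, h_hat), cov = curve_fit(holling_type_II, prey_density, feeding_rate, p0=[0.5, 0.05])
print(f"attack rate a = {a_hat:.3f}, handling time h = {h_hat:.3f}")
```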
To test theoretical predictions. Probably the most common sort of experiment reported in leading ecology journals. Again, often most usefully performed in tractable model systems**.
But as Wootton and Pfister point out, these kinds of experiments, at least as commonly conducted and interpreted by ecologists, have serious limitations that aren’t widely recognized. For instance, testing the predictions of only a single ecological model, while ignoring the predictions of alternative models, prevents you from inferring much about the truth of your chosen model. If model 1 predicts that experiment A will produce outcome X, and you conduct experiment A and find outcome X, you can’t treat that as evidence for model 1 if alternative models 2, 3, and 4 also predict the same outcome. It’s for this reason that Platt (1964) developed his famous argument for “strong inference”, with its emphasis on lining up alternative hypotheses and conducting “crucial experiments” that distinguish between those hypotheses.
There’s another limitation of experiments conducted to test theoretical predictions, which Wootton and Pfister don’t recognize, but which is well-illustrated by one of their own examples. Wootton and Pfister’s first example of an experiment testing a theoretical prediction is the experiment of Sousa (1979) testing the intermediate disturbance hypothesis (IDH). Which, as readers of this blog know, is a really, really unfortunate example. Experiments to test predictions are only as good as the predictions they purport to test. So if those predictions derive from a logically-flawed model that doesn’t actually predict what you think it predicts (as is the case for several prominent versions of the IDH), then there’s no way to infer anything about the model from the experiment. The experiment is shooting at the wrong target. Or, if the prediction actually “derives” from a vague or incompletely specified model, then the experiment isn’t really shooting at a single target at all–it’s shooting at some vaguely- or incompletely-specified family of targets (alternative models), and so allows only weak or vague inferences about those targets (this is what I think was going on in the case of Sousa 1979).
One way to avoid such ill-aimed experiments is for experimenters to rely more on mathematical models and less on verbal models for hypothesis generation. But another way to avoid such ill-aimed experiments is to quit focusing so much on testing predictions and instead conduct an experiment…
To test theoretical assumptions. It is quite commonly the case in ecology that different alternative models will make many similar predictions. For instance, models with and without selection (non-neutral and neutral models) infamously make the same predictions about many features of ecological and evolutionary systems. This makes it difficult to distinguish models by testing their predictions. So why not test their assumptions instead, thereby revealing which alternative model makes the right prediction for the right reasons, and which alternative is merely getting lucky and making the right prediction for the wrong reasons? For instance, I’ve used time series analysis techniques to estimate the strength of selection in algal communities (Fox et al. 2010), thereby directly testing whether algal communities are neutral or not (they’re not). In this context, this is a much more direct and powerful approach than trying to distinguish neutral and non-neutral models by testing their predictions (e.g., Walker and Cyr 2007 Oikos) (UPDATEx2: The example of Fox et al. 2010 isn’t the greatest example here, because while it is an assumption-testing study, it’s not actually an experiment. Probably should’ve stuck with Wootton and Pfister’s first example of testing evolution by natural selection by conducting experiments to test for heritable variation in fitness-affecting traits, which are the conditions or assumptions required for evolution by natural selection to occur. And as pointed out in the comments, the Walker and Cyr example isn’t great either because they actually were able to reject the neutral model for many of the species-abundance distributions they checked, in contrast to many similar studies).
A virtue of focusing on assumptions as opposed to predictions is that it forces you to pay attention to model assumptions and their logical link to model predictions, rather than treating models as black boxes that just spit out testable predictions. Because heck, if all you want is predictions, without caring about where they come from, you might as well get them here.
Another virtue of tests of assumptions, especially when coupled with tests of predictions, is learning which assumptions are responsible for any predictive failures of the model(s) being tested. This is really useful to know, because it sets up a powerful iterative process of modifying your model(s) appropriately, and then testing the assumptions and predictions of the modified models.
Another reason to test assumptions rather than predictions is that it might be easier to do. Of course, in some situations it could be easier to test predictions than assumptions. And in any case, you all know what I think of doing science by simply pursuing the path of least resistance (not much).
Of course, testing assumptions has its own limitations. Since theoretical assumptions are rarely if ever perfectly met, we’re typically interested in whether a model’s predictions are robust to violations of its assumptions. Does the model “capture the essence”, the important factors that drive system behavior, and over what range of circumstances does it do so? So you run into the issue of how big a violation of model assumptions is worth worrying about. I don’t have any great insight to offer on how to deal with this; sometimes it’s a judgment call. Sometimes one person’s “capturing the essence” is another person’s “wrong”. For what it’s worth, we make similar judgment calls in other contexts (e.g., how big a violation of statistical assumptions of normality and homoscedasticity is worth worrying about?).
Wootton and Pfister conclude their chapter by discussing how to choose what kind of experiment to conduct. For instance, if you’re studying a system about which not much is known (and assuming you have a good reason for doing that!), you may have no choice but to conduct a “see what happens experiment” (“kick the system and see who yells”, as my undergrad adviser David Smith put it). You might want different experiments depending on whether you’re seeking a general, cross-system understanding of some particular phenomenon, vs. intensively studying a particular system. Or different experiments depending on whether you’re setting out to test mathematical theory, or to identify the likely consequences of, say, a dam or some other human intervention in the environment. Problems arise when you don’t think this through. For instance, conducting an experiment just to see what happens, and then retroactively trying to treat it as a test of some theoretical model, hardly ever works (but it’s often tempting, which is why people keep doing it).
So what do you think? Is this a complete list?
*Or, depending on your point of view, “resourceful”.
**Someday, I need to do a post on what makes for a good “model system”.
Could you clarify how much of this is your interpretation of the chapter vs. a recapitulation of it? Also exactly what you (or they) mean by “experiment”–the definition implied here seems to be “any type of intrusion on a system” which is certainly not my idea of what an experiment is.
At the most fundamental level, experiments are for disentangling the effects of variables that would otherwise be confounded to the point of not being distinguishable in their effects on the system of interest. This can sometimes only be done with a mathematical model, depending on various practical concerns. To do something just to “see what happens”–I don’t call that an experiment. To me, experiments come after you’ve already made lots of observations–you don’t need any more natural observations. You need either higher quality natural observations (as you stated in a recent post, specifically choosing observations that are least confounded in the variables of interest), or a manipulative experiment, where you accomplish what nature has not by creating a certain combination of treatment levels.
However, by far the most important statement in here, IMO, is: “They worry that curiosity-driven experiments, conducted haphazardly across numerous systems, leave us not only with a very incomplete understanding of any given system, but with no basis for cross-system comparative work.” This is a really important point and ecology has suffered enormously from this problem. This topic alone deserves a whole string of posts. Of course, you will piss a lot of people off if you broach it, because a lot of ecologists just want to study whatever they are personally interested in, not what is societally most important.
The post is mostly recapitulation; I meant to be clear about that, sorry if I wasn’t. Every item on the list is from Wootton and Pfister, as is much of the commentary. The bits that are mine:
-noting that “let’s see what happens” experiments are difficult to interpret in light of existing hypotheses
-the example of feeding trials as a “measurement” experiment
-failure to consider or distinguish between predictions of alternative models as a problem with “prediction testing” experiments, at least as commonly practiced by ecologists
-testing vague or logically-flawed predictions as a problem with “prediction testing” experiments
-the example from my own work of an “assumption testing” experiment
-the ideas that assumption testing forces you to pay attention to model assumptions, and that it allows you to identify which assumptions are the problematic ones
-the difficulty of distinguishing models that “capture the essence” from models that are right for the wrong reasons
Note also that the post does not summarize everything Wootton and Pfister have to say; in particular, I’ve left out most of their examples.
Thanks for clarifying Jeremy.
In terms of the definition of an experiment, Wootton and Pfister say:
“For our purposes, we will consider an experiment to involve manipulating one or more factors in a system, while the effects of the other factors are either minimized or unmanipulated.”
They go on to note that other factors may either be experimentally controlled (e.g., in a lab where you can impose a constant environment), or allowed to vary, with random assignment of treatments to experimental units being used to minimize the possibility of confounding.
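(As an aside, that randomization step is simple to implement; here’s a minimal sketch with hypothetical treatment names and plot labels, just to make the idea concrete. None of the names or numbers come from Wootton and Pfister.)

```python
# Minimal sketch: randomly assigning treatments to experimental units
# so that uncontrolled factors are unlikely to be confounded with treatment.
# Treatment names and plot IDs are hypothetical.
import random

treatments = ["control", "low disturbance", "high disturbance"]
plots = [f"plot_{i:02d}" for i in range(1, 13)]  # 12 experimental units

random.seed(42)          # fixed seed so the assignment is reproducible
random.shuffle(plots)    # randomize the order of the units

# Balanced design: each treatment gets an equal number of randomly chosen plots
assignment = {t: plots[i::len(treatments)] for i, t in enumerate(treatments)}
for treatment, units in assignment.items():
    print(treatment, "->", sorted(units))
```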
As their definition, and their many examples, make clear, Wootton and Pfister themselves agree with you that experiments are for disentangling effects of variables that would otherwise be confounded. I do think that even “let’s just see what happens” experiments do that. For instance, Wootton and Pfister point out that the classic work of Paine (1966) was such an experiment. Paine’s experiment gives information about the effects of Pisaster seastars on the prey community, independent of other variables, that could not have been obtained from merely observing variation in seastar predation and prey community composition over time or space.
OK, many thanks for elaborating and clarifying Jeremy.
I should add that the point about testing assumptions instead of outcomes is excellent. I would add that if your various possible outcomes are all predicted by the various hypotheses, then you need to re-design your research so that you are measuring things that are truly diagnostic.
Most of my favorite experiments are motivated by observation and hypothesis, not necessarily as mechanisms for testing theory. I suppose that the observation -> hypothesis -> experiment pathway, rooted in natural history and curiosity, falls somewhere between “to see what happens” and “as a means of measurement”. However, I think that experiments as a way to test causal relationships based on observations of the natural world are a distinct category.
If you’re testing a prediction about causality (e.g., “If A increases, then B will decrease, causing C to decrease”), whether derived from natural history observations or any other source, then I’d probably call that a prediction-testing experiment. You have an a priori prediction which dictates many features of your experiment, particularly what to manipulate and what to measure (e.g., “I’ll increase A, and measure the responses of B and C”). That’s a prediction-testing experiment, perhaps with elements of assumption-testing too (e.g., in my hypothetical example, the measurement of B can be viewed as a check of the assumption that the effect of A on C is transmitted via B).
First…love the blog. Completely agree that assumptions are often more important than predictions.
Just a quick defense of the Walker and Cyr paper. You are completely correct that assumption checking is usually more powerful than prediction checking. So what surprised us so much is just how badly the neutral model *predicted* phytoplankton diversity, despite the extreme weakness of our *predictive* test; in particular, the dominant species in phytoplankton communities are far more dominant than any neutral model could possibly predict. I agree that the problem here is that a predictive failure doesn’t tell you anything about why the prediction failed. But if you predict as badly as we did, then it’s unlikely that the model making the predictions will teach you much about the system anyways. Maybe I’m being a little too falsificationist by saying that…there are many instances where a predictively bad model can teach you lots about a system. But what I learned from our phytoplankton result is just that the neutral model does so badly at making even the most basic predictions that should be ‘easy’, that, for me anyways, the neutral model is completely off the table for phytoplankton communities. And (although this is not published) this result is robust across all of the expanding variety of neutral models.
Or maybe I don’t need to defend our work anyways. I mean, you were simply using our test as an example of a weak test and I agreed with you: when weak tests are passed, you get very little insight. But when weak tests fail, then you’ve learned that something is *very* wrong with your model. Am I missing a key philosophy of science thing here? I hope not.
Hi Steve,
Thanks for your comments, glad you like the blog.
I owe you some clarification, and a bit of a mea culpa. In offering Walker and Cyr as an example of a weak predictive test, I didn’t mean to pick on that paper. To be honest, I just arbitrarily chose it as one among many tests of neutral theory based on the species-abundance distribution. I only picked it because it was fresh in my mind (I use the data to introduce the notion of species-abundance distributions in the aquatic ecology class I teach), and because it was published in Oikos (I try to take the opportunity to highlight Oikos papers relevant to whatever issue I’m discussing). As you point out, your paper’s actually not a great example for my purposes, because you were able to reject the neutral model in many cases. Frankly, I probably should’ve taken 30 seconds to look up an example better suited to my purposes.
As for the philosophical point–we rejected our hypothesis using a weak test, so that means we can be *really* confident that it’s false–I think that’s often true but I don’t know if it’s always true. Deborah Mayo has written about this, and IIRC she discusses cases where this intuitively-appealing inference is problematic. But I’d need to go back and look up her writings on this, I may be misremembering them.
Thanks for the pointer to Mayo. I’ve read a bit of her stuff and always learn something when I do. Having lived by the ‘you can never read enough Gelman’ credo for the last year, it’s probably time I switch to another statistical philosophy guru. And thanks for the publicity and thoughtful response. It’s much appreciated.