Is requiring replication statistical machismo?

A recent post of mine about why Biosphere 2 was a success stirred mixed reactions. One of the most common negative reactions was that there was no replication in Biosphere 2, which of course EVERYBODY knows is a hallmark of good science. This spilled into a spirited discussion in the comments. So, do we need replication to do good science?

Anybody who has read some of my older posts (e.g. my one true route post or my statistical machismo post) will know that my answer is going to be no. I'm not going to tell a heliologist that they are doing bad science because they only have one sun (they do have the stars, but most of the phenomena they study, like sun spots, are not yet observable on other stars). Nor am I going to say that to people who have developed theories about why our inner solar system contains rocky planets and the outer solar system contains giant gaseous planets (although in the last 2-3 years we are actually getting to the point where we have data on other solar systems, these theories were all developed and accepted well before then). And Feynman's televised proof that a bad interaction between cold weather and a rubber O-ring led to the demise of the Space Shuttle Challenger definitely did not need and would not tolerate replication. Closer to home, I am not going to tell the people who have been measuring CO2 on top of Mauna Loa (aka the Keeling Curve, one of the most well known graphs in popular science today) that their science is bad because they only have one replicate. Nor am I going to tell people who study global carbon cycling to give up and go home because CO2 is a well mixed gas on only one planet (I mean, come on, N=1, why waste our time!?). In short, no, good science does not REQUIRE replication.

Let me just state up front that replication IS good. The more replication the better; it always makes our inferences stronger. We DO need replication when it is feasible. The only problem is that replication is not always possible (sometimes not even with infinite amounts of money, and sometimes only due to real world time and money constraints). So the question of this post is NOT "do we need replication?" It IS "do we HAVE to have replication?" and "what do we do in these trade-off or limitation situations?" Give up and go home – don't study those questions – seems to be some people's answer. It's not mine. Indeed, any philosophy of science position which leads to the idea that we should stop studying questions that inconveniently fail to fit a one-stop-shopping approach to science is not something I will endorse. This is the statistical machismo I have talked about before – when one has to make the statistics so beautiful AND difficult that few can achieve the standard you have set, and you can then reject others' work as WRONG, WRONG, WRONG. Careful thinking (and perusing the examples in the last paragraph) leads to a number of ways to do good, rigorous science without replication.

First let's step back and define what replication is and why it is important. Wikipedia has several entries on replication, which in itself is probably informative about the source of some of the confusion. When ecologists think about replication they are usually thinking about it in the context of statistics (Wikipedia entry on statistical replication) and pretty quickly think of Hurlbert's pseudoreplication (also see Meg's post on the paper). This is an important context, and it is pretty much the one that is being violated in the examples above. But this definition is only saying you need replication to have good statistics (which is not the same as good science). But Wikipedia has an alternative entry on "replication – scientific method" which redirects to "reproducibility". This definition is the sine qua non of good science, the difference between science and pseudoscience. Reproducibility means that if you report a result, somebody else can replicate your work and get the same thing. If somebody is doing science without reproducibility, call them out for bad science. But don't confuse it with replication for statistics. Ecologists confuse these two all the time. Thus to an ecologist replication means multiple experimental units well separated in space (not well separated = pseudoreplication, not multiple = no replication = degrees of freedom too small). As I said, those are both good goals (which I teach in my stats class and push students to achieve). But they are not the sine qua non of good science.

It is instructive to think about an example that came up in the comments on the Biosphere 2 post: the LHC (Large Hadron Collider) and the hunt for the Higgs boson. Pretty blatantly they did not have ecological replication. The LHC cost billions of dollars and there is only one (ditto for Biosphere 2). But the physicists actually had an extremely well worked out notion of rigorous reproducibility. Despite only having one experimental unit, they did have multiple measurements (observed particle collisions). Thus this is a repeated measures scenario, but notice that since there was only one "subject" there was no way to correct for the repeated measures. The physicists made the assumption that despite being done on one experimental unit, the measures were independent. But what I find fascinating is that the physicists had two teams working on the project that were "blinded" to each other's work (even forbidden to talk about work with each other) to tackle the "researcher degrees of freedom" problem that Jeremy has talked about. They also had a very rigorous a priori standard of 5σ (p<0.0000003) to announce a new particle (I seem to recall that at 3σ they could talk about results being "consistent with" but not "proof of", but I haven't found a good reference for this). So, in summary, the Higgs test had an interesting mix of statistical replication (5σ), reproducibility (two separate teams) and pseudoreplication (uncorrected repeated measures) from an ecologist's perspective.
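As a quick arithmetic check on those thresholds – my own sketch, not anything the physics teams published – the one-sided tail probability of a standard normal converts σ levels into the p-values quoted above:

```python
from scipy.stats import norm

# One-sided tail probability P(Z > z) of a standard normal
for z in (3, 5):
    print(f"{z} sigma -> one-sided p = {norm.sf(z):.2e}")

# 3 sigma -> one-sided p = 1.35e-03
# 5 sigma -> one-sided p = 2.87e-07   (the p < 0.0000003 quoted above)
```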

So what do we get out of statistical replication? The biggest thing is it allows us to estimate σ² (the amount of variance). We might want to do this because variance is innately interesting. For instance, rather than ask whether density dependence exists, I would rather ask what percent of the year-to-year variance is explained by density dependence (as I did in chapter 8 of this book and as I argued one should do in this post on measures of prediction). Or we might want to quantify σ² because it lets us calculate a p-value, but this is pretty slippery and even circular – our p-value gets better and better as we add replication (even though our effect size and variance explained don't change at all). This ever-improving p-value due to more replication is often treated as equally improving science, but that is poppycock. Although there are valid reasons to want a p-value (see the Higgs boson), pursuit of p-values quickly becomes a bad reason for replication. Thus for me, arguing for replication to estimate σ² is a decidedly mixed bag – sometimes a good thing, sometimes a bad thing depending on the goal.
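A minimal simulation sketch (toy numbers of my own, not from any real dataset) makes the circularity visible: as N grows, the estimated effect size sits still while the p-value plummets:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
true_effect, sigma = 0.5, 2.0  # one fixed, modest effect; one fixed spread

for n in (10, 100, 1000, 10000):
    x = rng.normal(true_effect, sigma, size=n)
    t, p = stats.ttest_1samp(x, popmean=0.0)
    d = x.mean() / x.std(ddof=1)  # Cohen's d: the effect-size estimate
    print(f"n={n:>5}  effect size d = {d:5.2f}  p = {p:.2g}")
# d stabilizes near 0.25 while p keeps shrinking with every added replicate
```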

However – and to me this is the biggest message in Hurlbert's paper, though often forgotten against the power of the word "pseudoreplication" – the #1 problem driving everything else in the paper is the issue of confoundment. If you only have one site (or two or three), you really have to worry about whether you got the effect you observed because of peculiarities of that site and any weird covariances between your variable of interest and hidden variables (Hurlbert's demonic intrusions). Did you get more yield because of pest removal, as you think, or because the plot is downhill and the soil is wetter? One way to kill the demon of confoundment is to have 100 totally independent, randomly chosen sites. But this is expensive. And it's just not true that it is the ONLY way to kill the demon. I don't think anybody would accuse the LHC of confoundment despite its having only one site. You could spin a story about how the 23rd magnet is wonky and imparts a mild side velocity (or spin, or – I don't know my particle physics well enough to be credible here …) that fools everybody into thinking they saw a Higgs boson. But I don't hear anybody making that argument. The collisions are treated as independent and unconfounded. The key here is that there is no way to measure that or statistically prove it. It is just an argument made between scientists that depends on good judgement, and so far the whole world seems to have accepted the argument. It turns out that is a perfectly good alternative to hundreds of spatial replicates.

Let me unpack all of these examples and be more explicit about the alternatives to replication as ecologists think about it – far-separated experimental units (again, these alternatives are only to be used when necessary because replication is too expensive or impossible, but that occurs more often in ecology than we admit):

  1. Replication in time – repeated measures on one or a few subjects do give lots of measurements and estimates of σ² – it's just that the estimate can be erroneously low (dividing by too many degrees of freedom) if the repeated measures are not independent. But what if they are independent? Then it's a perfectly valid estimate. And there is no way to prove independence (when you have only one experimental unit to begin with). This is a matter for mature scientists to discuss and use judgement on, as with the LHC – not a domain for unthinking slogans about "it's pseudoreplicated". Additionally, there are well-known experimental designs that deal with this, specifically BACI or before-after-control-impact (just Google BACI experimental design; see the first sketch after this list). Basically one makes repeated measures before a treatment to quantify innate variability, then repeated measures after the treatment to further quantify innate variability, and then compares the before/after difference in means against the innate variability. The Experimental Lakes Area eutrophication experiments are great examples of important BACI designs in ecology, and nobody has ever argued those were inconclusive.
  2. Attention to covariates – if you can only work at two sites (one treatment and one control) you can still do a lot of work to rule out confoundment. Specifically, you can measure the covariates that you think could be confounding – moisture, temperature, soils, etc. – and show that they're the same between sites or go in the opposite direction of the effect observed (and before that, you can pick two sites that are as identical as possible on these axes).
  3. Precise measurements of the dependent variable – what if σ² = 0? Then you don't really need a bunch of measurements. This is far from most of ecology, but it comes up sometimes in ecophysiology. For a specific individual animal under very specific conditions (resting, postprandial), metabolic rate can be measured fairly precisely and repeatably. And we know this already from dozens of replicated trials on other species. So do we need a lot of measurements the next time? A closely related case is when σ² > 0 but the amount of error is very well measured, and we can do an error analysis that ripples all the error bars through the calculations (see the second sketch after this list). Engineers use this approach a lot.
  4. We don't care about σ² – what if we're trying to estimate global NPP? We may have grossly inaccurate measurement methods and huge error bars, and since we have only one planet we can't do replication and estimate σ². But does that mean we should not try to estimate the mean? This is a really important number; should we give up? (Note – sometimes the error analyses mentioned in #3 can be used to put confidence intervals on such estimates, but they have a lot of limitations in ecology.) And note I'm not saying having no confidence intervals is good; I'm saying dropping entire important questions because we can't easily get confidence intervals is bad.
  5. Replication on a critical component – the space shuttle is a good example of this. One would not want to replicate whole space shuttles (even if human lives were taken out of the equation, cost alone is prohibitive). But individual components could be studied through some combination of replication and precise measurement (#3 above). The temperature properties of the O-ring were well known, and engineers tried desperately to cancel the launch. They didn't need replicated measurements at low temperatures on the whole shuttle. Sometimes components of a system can be worked on in isolation with replication and still generalize to the whole system, where replication is not possible.
  6. Replication over the community of scientists – what if you have a really important question at really big scales, so that you can only afford one control and one experimental unit, but if it pans out you think it could launch a whole line of research leading to confirmation by others in the future? Should you just skip it until you convince a granting agency to cough up 10x as much money with no pilot data? We all know that is not how the world works. This is essentially the question Jeff Ollerton asked in the comments section of the Biosphere 2 post.
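To make #1 concrete, here is a minimal BACI sketch in Python (all numbers are invented, for a hypothetical one-control, one-impact lake pair; the repeated measures are assumed independent, which is exactly the judgement call discussed above):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 24  # monthly measurements per period (hypothetical)

# Simulated readings: a true effect of +1.5 appears only at the
# impact lake, and only after the manipulation
control_before = rng.normal(10, 1, n)
control_after = rng.normal(10, 1, n)
impact_before = rng.normal(10, 1, n)
impact_after = rng.normal(11.5, 1, n)

# BACI contrast: (impact change) minus (control change)
baci = (impact_after.mean() - impact_before.mean()) - \
       (control_after.mean() - control_before.mean())

# Compare the impact-minus-control differences after vs. before treatment,
# treating each time point as an independent observation
t, p = stats.ttest_ind(impact_after - control_after,
                       impact_before - control_before)
print(f"BACI effect estimate = {baci:.2f}, p = {p:.3g}")
```

And to make the error analysis in #3 concrete, a sketch of first-order ("delta method") error propagation – the standard engineering recipe, again with invented numbers rather than anything from the studies mentioned:

```python
import numpy as np

# For y = f(a, b) with independent errors:
#   var(y) ~ (df/da)^2 * var(a) + (df/db)^2 * var(b)
# Hypothetical example: mass-specific metabolic rate r = rate / mass
rate, sd_rate = 12.0, 0.3   # measured O2 consumption and its error (invented)
mass, sd_mass = 150.0, 2.0  # measured body mass and its error (invented)

r = rate / mass
# dr/drate = 1/mass ;  dr/dmass = -rate/mass^2
sd_r = np.sqrt((sd_rate / mass) ** 2 + (rate * sd_mass / mass ** 2) ** 2)
print(f"r = {r:.4f} +/- {sd_r:.4f}")  # error bars rippled through the division
```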

So, in conclusion: ecologists have an overly narrow definition of what replication is and what its role in good science is. Large numbers of spatially separated experimental units are great when you can get them. But when you can't, there are lots of other things you can do to address the underlying reasons for replication (estimating σ² and ruling out confoundment). And these are not places for glib one-word ("pseudoreplication", sneeringly said) dismissals. They are places for complex, nuanced discussions about the costs of replication, how convincingly the package of alternatives (#1–#6) is deployed, and sometimes even how important the question is.

What do you think? Have you done work that you were told was unreplicated? How did you respond? Where do you think theory fits into the need for replication – do we need less replication when we have better theory? Just don't tell me you have to have replication because it's the only way to do science!


About Brian McGill

I am a macroecologist at the University of Maine. I study how human-caused global change (especially global warming and land cover change) affects communities, biodiversity and our global ecology.

18 thoughts on "Is requiring replication statistical machismo?"

  1. Thanks for the great post, Brian! I think this is central to how Ecology, as a field, informs yesterday's, today's, and tomorrow's environmental issues. This post is reminiscent of a decade-long push by Steve Carpenter, from the late '80s to the late '90s, to highlight the importance of large-scale, often poorly replicated experiments (e.g. Ecology 70: 453-463; Ecology 71: 2038-2043; and Ecosystems 1: 335-344). Your post presents this in a somewhat more balanced way than some of Steve's articles, but overall the message is the same. There are many times and places where the opportunity for replication is limited, but that does not mean the science is BAD. To the contrary, planned and unplanned large-scale experiments are often some of the most USEFUL for applied issues, innovation, and synthesis. Recent examples include environmental catastrophes like the Gulf of Mexico oil spill and the rise of new areas of ecology, such as social-ecological systems and macrosystems. Of course, these settings often require less-traditional statistical approaches, as you have described above. Some of my colleagues and I have wondered whether another discussion, like the one Carpenter took part in during the '90s, was needed to remind ecologists of the value of SOME poorly replicated research. Your post starts that important conversation again!

    • You put your finger on a really important point. Many conservation questions are inherently at scales (whole lakes, 10 km² patches, reserves, etc.) that are not amenable to replication. Saying we should toss these questions out due to difficulties with replication is tantamount to saying we should be irrelevant to conservation.

  2. I think the goal of inference is important here, with a distinction between sampling an entire finite population and sampling randomly from a larger one. Replication is much more important in the latter case, where the "population" of interest (e.g., individuals, sites, ecosystems, scenarios) is larger than your sample. Like you said, with a small sample you are limited in the number of factors that can be studied, particularly with observational data. So you have to be careful and realistic about what you can actually achieve with such a study given so much potential uncertainty.

    But for inferences made on large and important phenomena, whether it’s planet Earth or a natural/human disaster, the information content outweighs the uncertainty. And you are probably not attempting to make inferences to a larger population – if you are (e.g., future disasters), the burden of proof increases. I think most ecologists are trying to make inferences beyond their sample (whether observational or experimental units) and replication helps them to achieve this.

    I agree that good statistics is not the only way to do good science. But you had better be working on something profoundly interesting, or good luck handling the scrutiny!

  3. Nice, balanced discussion Brian, and of course you know from my previous comments what my view is! The study I referred to is now out for review in a good journal and, should any of the reviewers query our lack of replication, I'll refer them to your post 🙂

  4. Good post. Thanks for clarifying some slippery issues so neatly.
    I think that discussions about what is and isn’t pseudo-replicated (or appropriately sampled) often come down to the aims of the study and what inferences are required at what scale and scope.

    The problem is often the generalization being inferred, not necessarily the study itself. I work mainly in tropical forests, and it remains common to see the result from one study on one continent cited as if the conclusions are necessarily true in all forests and on all continents … it may be correct … but we need to acknowledge the assumptions.

    Replication of results in space and time is necessary to show (rather than assume) that the results are robust in space and time (ideally we know how the cases were selected from the larger universe of alternatives). Sometimes we can safely assume these results are robust, but I see no simple definition of good and bad there … except that the "match" between data, context and justifiable inference is key.

    • Good points. When I teach stats, I spend a whole day on "generalizability" – how far can you generalize results on a specific species in a specific place at a specific time? Students are always very interested – they know the key to publication is to claim generality, but they also know the dangers in this.

  5. I'd be interested to know how the commenters here feel about experimental versus observational studies, and whether, on some deep statistical or philosophical level, experiments require more replication than observations? That was the essence of the exchange I had with a journal editor (see comments on Brian's Biosphere 2 post for more details, if you're interested). I'd never really thought about it before, but it seems to me that there's no a priori reason why one should require more replication than the other, or am I missing something?

    • Interesting – I would actually say observational studies, if anything, have a stronger need for replication, because the chance of confoundment is partly broken in experimental studies just by randomly assigning treatments and doing the manipulation yourself. As I've argued, confoundment can still be dealt with in observational studies with few replicates, but the issue of confoundment looms larger there.

    • I agree that there is no a priori reason, as replication depends on the question pursued (maybe someone doesn't care to generalize) and the magnitudes and sources of error (stochastic, confounding, heterogeneity, etc.). If any of these sources of error is big relative to the effect, or if there is an interaction with the effect, then one needs to replicate, at least if one wants to rule out stochastic error or to generalize. These sources of error can be big or small in either observational or experimental designs.

  6. Great post, Brian! I think about this issue in paleoecology frequently, where replicates can be really difficult to get – sometimes because of sampling effort or cost, or because of serendipity (e.g., a Clovis kill site), or because of events that are discontinuous in space (e.g., a wildfire).

    In paleo, we often identify a phenomenon (say, a dramatic decline in a taxon at a site), and then immediately try to replicate it regionally. What we often find is that events that are coherent at small spatial scales break down at the regional scale (even though they might be apparent at, say, the continental scale). At that intermediate scale, there may be site-specific or even stochastic reasons for between-site variation, but the lack of coherence is sometimes seen as a failure to replicate the original event. But in paleoecology, our samples are very often correlated in time or space, so it's not like we can get true replication anyway!


  8. Interesting post! Sorry for dropping in late. I think it contains many valuable thoughts and it’s important to judge individual cases instead of invariably calling for replication. However, I think you are mixing different things in your examples that might be misused as a justification for poor science.

    First, the cases where one doesn’t intend to generalize are separate from other cases. It’s totally valid to study just one sun or one earth when you’re mostly interested in this one entity. Similarly, in applied ecology one might not need replicates (e.g. of cities or species) when conservation policies will just be applied to this one city or species.

    Second, there are cases where just one entity is studied, but conclusions are generalized / transferred to other entities. This is often problematic, but complex case studies may give valuable insights as long as this transfer is done cautiously. This might be the case for Biosphere 2. Actually, I don't see a fundamental difference from many studies that do replicate, but generalize beyond the statistical population the samples were taken from. Cautiousness in generalization is always called for, and the scale at which replicates can be considered independent is always a judgement call. For example, a study may be replicated in one region, but the conclusions are suggested to be general. As ecologists, we often have to judge for ourselves how the different case studies (in some way, all ecology papers are case studies) fit together on a global / general scale.

    Third, there is the case where one compares two entities and makes conclusions about the differences between them, attributing the differences to a single factor (or set of factors). This is also your example of a single control and a single treatment (the same goes for multiple unreplicated treatment levels). In my view, this is the most problematic case because it claims to show effects of a single factor while there can be endless reasons for a difference. It should only be used in exceptional circumstances, and extreme scrutiny has to be applied to make it publishable as a paper. By extreme scrutiny, I mean something like a before/after (BACI) approach with estimation of the underlying variability and (as you write) checking for confounding factors. And even then, it will rarely be good evidence without other supporting studies.

    In conclusion, I think it's a good thing to say that we should address large-scale or complex questions even if we need to sacrifice replication to get started, and that a judgement of what is good science should be based on considering the number and independence of replicates together with confounding factors, variability and the intended generality of conclusions. However, I think it is dangerous to use too many (unfair) examples from physics or astronomy to justify lack of replication in ecological studies. The standard that replicates are generally needed should not be eroded, and we should be extremely careful when we allow generalizations from unreplicated studies. Or else we go back to the old naturalists, who told enjoyable and thoughtful stories that are hard to put together into a quantitative picture.

    (NB: when I write ‘replication’, I mean statistical replication and not reproducibility, which is another difficult issue for ecology)
