The last experiment I did as a graduate student was one where I wanted to experimentally test the effect of predation on parasitism. To do this, I set up large (5,000 L) whole water column enclosures (more commonly called “bags”) in a local lake. These are really labor intensive, meaning I could only have about 10 experimental units. I decided to use a replicated regression design, with two replicates of each of five predation levels. These were going to be arranged in two spatial blocks (linear “rafts” of bags), each with one replicate of each predation level treatment.

Left: two experimental rafts; right: a close up of one of the rafts, showing the five different bag enclosures
As I got ready to set up the experiment, my advisor asked me how I was going to decide how to arrange the bags. I confidently replied that I was going to randomize them within each block. I mean, that’s obviously how you should assign treatments for an experiment, right? My advisor then asked what I would do if I ended up with the two lowest predation treatments at one end and the two highest predation treatments at the other end of the raft. I paused, and then said something like, “Um, I guess I’d re-randomize?”
This taught me an important experimental design lesson: interspersing treatments is more important than randomizing them. This is especially true when there are relatively few experimental units*, which is often the case for field experiments. With only a handful of units, random assignment is likely to cluster treatments spatially in ways that could be problematic.
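As a rough illustration (a hypothetical simulation, not data from the experiment), the snippet below shuffles the five treatment levels along a five-position raft many times and asks how often treatment level ends up strongly associated with position:

set.seed(1)
nreps <- 10000
layouts <- replicate(nreps, sample(1:5))          # each column is one random layout of the five levels
r <- apply(layouts, 2, function(a) cor(a, 1:5))   # how strongly treatment level tracks raft position
mean(abs(r) >= 0.8)                               # share of layouts that are nearly monotonic along the raft

With only five positions, on the order of one random layout in eight lines the levels up almost monotonically along the raft.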
Hurlbert’s classic paper gets a lot of attention for its focus on pseudoreplication, and of course the parts about demonic intrusion are pretty memorable, too. But he also strongly emphasizes the importance of interspersing treatments, noting:
As a safeguard against both [nondemonic intrusion] and preexisting gradients, interspersion of treatments is argued to be an obligatory feature of good design.

The most likely source of demonic intrusion in my experiment, which was done in the Kellogg Bird Sanctuary: Canada geese. (Image source: Kellogg Bird Sanctuary; http://birdsanctuary.kbs.msu.edu/wp-content/uploads/sites/2/2015/02/Happy-Mothers-Day.jpg)
I was thinking about all this recently based on some twitter discussions about randomization. (See here for the first tweet (by Linda Campbell) in the thread that inspired this post.) The main point of those discussions is that it’s common for people to say they randomized the arrangement of replicates within an experiment, but, unless you use a random number generator or some other process for ensuring true randomization, you haven’t really randomized your replicates. That is definitely true and is a topic for a whole other post (can I draft Brian to write it?), but it reminded me of this related point.
Part of why this is an issue is that true randomization often produces long runs of the same treatment. I’ve heard of statistics instructors who assign one group of students to flip a coin 100 times and record the results on the board, while another group just writes down what they imagine 100 coin flips would look like. The instructor waits out in the hallway while they do this, but can easily tell which group actually flipped the coin, just by looking for the list with long runs of heads or tails in it. If you have relatively few experimental units, those runs mean you can easily end up confounding space and treatment in your experiment.
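If you want to check the coin-flip intuition without recruiting a class, a few lines of R will do it (a minimal sketch):

set.seed(42)
longest_runs <- replicate(1000, {
  flips <- sample(c("H", "T"), 100, replace = TRUE)   # 100 genuinely random flips
  max(rle(flips)$lengths)                             # longest run of identical outcomes
})
table(longest_runs)                                   # runs of six or more in a row are typical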
So, if you are designing an experiment, and especially if you have a relatively small number of experimental units: focus on interspersing your treatments, not randomizing them.
Postscript from Brian:
(Note from Meghan: Brian, Jeremy, and I discussed this post idea a bit via email. After I drafted this post, Brian emailed the results of a simulation he did that really highlights how important it is to intersperse rather than randomize when you have few replicates. I think it’s a really valuable addition, so asked him to include it as a postscript)
Here’s an R code snippet that really hammers home the point of Meghan’s post above, especially the point about sample size:
# Simulate how often two unrelated variables end up correlated, as a function of sample size n
nreps <- 1000
par(mfrow = c(3, 3))
for (n in c(3, 5, 10, 20, 30, 50, 100, 200, 1000)) {
  r <- rep(NA, nreps)
  for (reps in 1:nreps) {
    v1 <- rnorm(n)            # two independent random variables...
    v2 <- rnorm(n)
    r[reps] <- cor(v1, v2)    # ...and their sample correlation
  }
  e <- ecdf(abs(r))           # empirical distribution of |r|
  plot(e, xlim = c(0, 1), xlab = "|r|", ylab = "% with lower correlation",
       main = paste("N=", n, " Odds |r|>0.5=", format(1 - e(0.5), digits = 4)))
}
If you run this, you get nine panels, each showing the cumulative probability (on the y-axis) of getting a Pearson correlation whose magnitude (|r|) is less than the value on the x-axis. So for N=10, only about 40% of the time do you get |r|<0.2 (equivalently, 60% of the time you get |r|>0.2). Each panel’s title also shows the odds of getting |r|>0.5:
So suppose you set up an experiment testing how pH drives chlorophyll a by manipulating pH on your experimental units. If you have N=5 replicates and randomly assign treatments, there is about a 38% chance that your pH treatment will be confounded (demonic intrusion) with some other random variable like temperature at a strength of |r|>=0.5 (~66.5% for N=3, and even an 11% chance for N=10). That makes it almost impossible to fully attribute your results to pH rather than temperature (or other uncontrolled variables). You really have to get up into the N=50-100 range before the odds of random confounding disappear. But even with N=50 you have about a 50% chance of |r|>0.1 or 0.15. How confident can you be in ruling out a confounding factor, especially if your own preferred factor has only weak effects?
If, in the above example, temperature is distributed randomly across your experimental units (i.e. across the pond), all you can do is increase sample size. But if it varies along a gradient (as it often does), then interspersion will break the link between the pH treatment and temperature much more strongly than randomization will, making interspersion more powerful than randomization up to about N=50.
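To see that last point in action, here is a small sketch (made-up layout, mirroring the five-levels-by-two-replicates design from the main post) comparing an interspersed arrangement against random assignment when the confounder runs along a linear gradient:

set.seed(1)
positions <- 1:10                          # stand-in for a linear gradient (e.g. temperature)
interspersed <- c(1:5, 5:1)                # levels run up the gradient, then back down
abs(cor(interspersed, positions))          # essentially zero: the mirrored layout balances the gradient

random_r <- replicate(10000, cor(sample(rep(1:5, 2)), positions))
median(abs(random_r))                      # typical treatment-gradient correlation under pure randomization
mean(abs(random_r) >= 0.5)                 # how often randomization leaves |r| >= 0.5

The mirrored layout balances a linear gradient exactly, while random layouts typically leave a modest treatment-gradient correlation and occasionally a severe one.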
Footnote from Meghan:
*This can also happen in large experiments, though. A friend I discussed this with at the time I was setting up my bag experiment told me about a large project she’d been involved in where individuals were randomly assigned to two treatments. No one realized until after they’d collected all the data that there was a pretty significant sex skew between the two treatments, which made it very hard for them to interpret their data.
Nice! Will use it in my stats classes 🙂
Question: what about restricted randomization schemes? For instance, I may have an algorithm in which I randomize treatments, but define that they must be equally spread along a gradient? Or stratified randomization, defining zones and randomizing within them?
Personally, I think both of your options, a priori rejection criteria for randomizations that come out poorly and stratified randomization (I prefer the second), are vastly underappreciated in ecological experimental design. Why settle for just using site as a blocking factor? With minimal extra work we can pull out temperature, soil moisture, standing biomass, or similar variables that characterize those sites and use them both up front in the experimental design, as you suggest, and in the analysis as control variables (as people who study humans, from doctors to sociologists, do).
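For concreteness, here is a minimal base R sketch of both options, using a made-up covariate (soil moisture measured at ten plots):

set.seed(1)
moisture <- runif(10)                                 # made-up covariate measured at ten plots
treatment <- rep(c("control", "treated"), 5)

# Option 1: restricted randomization - re-randomize until treatment is not
# badly confounded with the covariate
repeat {
  layout <- sample(treatment)
  if (abs(cor(as.numeric(layout == "treated"), moisture)) < 0.2) break
}

# Option 2: stratified randomization - pair plots by moisture, randomize within pairs
strata <- split(order(moisture), rep(1:5, each = 2))  # driest pair through wettest pair
layout2 <- character(10)
for (s in strata) layout2[s] <- sample(c("control", "treated"))

Either approach keeps the assignment random while ruling out the worst layouts, and the covariate can then also be used in the analysis as a control variable.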
Yes! Something I was thinking about, for landscape ecology studies, is an algorithm to randomly select sampling sites by simultaneously ensuring independence (e.g. a minimum distance among them) and maximizing variation in the explanatory variables of interest. I think this could be applied to ecological experiments in general, if information on explanatory and confounding variables is available. 🙂
Any suggestions for books or articles on good approaches to interspersion?
I actually think Hurlbert’s classic “Pseudoreplication and the design of ecological field experiments” paper in Ecological Monographs is a good introduction to many aspects of experimental design including interspersion vs. randomization.
The simulations at the end remind me of this recent paper in Ecology:
https://esajournals.onlinelibrary.wiley.com/doi/10.1002/ecy.1506
(shameless self promotion)
Slightly different point, but definitely related. The hazards of small sample sizes are not just “low power = high risk of a Type II error/non-significant result”. As both your paper and this post show, small sample sizes can also inflate the risk of Type I error. Good work can be done with small replication, but it takes careful thinking.
In simulation studies, I often use Latin hypercube sampling to make sure I have a sample of parameters that is representative of the possible variability but still “near random”. The idea has probably been around in field ecology much longer: when I told an entomologist how I sampled parameters, he said, “Oh, it’s just like a ‘Latin square’, where we make sure there is one of each treatment type in each row and column; we do our greenhouse experiments like that too.” This post reminded me of that conversation.
I think “old school” ecologists are very familiar with notions like Latin square designs (they come out of agriculture), but I don’t think they’re being taught very much these days. Which is unfortunate.
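For anyone who hasn’t run into one, here is a minimal base R sketch of a randomized Latin square (no packages; the treatment labels are arbitrary):

set.seed(1)
k <- 5
treatments <- LETTERS[1:k]
square <- outer(1:k, 1:k, function(i, j) treatments[(i + j - 2) %% k + 1])  # cyclic starting square
square <- square[sample(k), sample(k)]   # shuffling rows and columns keeps the Latin property
square                                   # each treatment appears exactly once per row and per column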
As someone who spent many years working in forest entomology, where manipulative plots are huge and the limits of what is humanly possible make small sample sizes inevitable, I totally agree. Regarding randomisation, I had a little rant about it here a few months ago 🙂 https://simonleather.wordpress.com/2018/12/19/at-random/
Ah, this is the post that the twitter thread was trying to get back to but we couldn’t find! Thanks!
I agree that interspersion of treatments is important, but the design illustrated above also has another (very common) problem. It’s a block design with no replication within blocks, which means it is impossible to tease apart spatial and treatment effects. If at least two replicates of each treatment were available per block, one could examine block (or site) effects. To tell the truth, I see no benefit whatsoever to arranging treatments in two blocks in the example provided.
Super cool post – thanks! I totally agree that interspersing treatments is a great idea, and probably not taught enough in intro stats classes (I might be misinterpreting, but I think this is what I was taught as a “regression experiment”). One potential addition: I’ve learned both from statisticians and from analyses gone wrong that it can be helpful to still have a few levels where you include replicated treatments (e.g. you might have treatments 1,1,1,2,3,4,5,5,5,6,7,8,9,9,9). This can be helpful for estimating the “residual variance” that you get with a fixed treatment (e.g. effects of spatial heterogeneity, unmeasured environmental variables, acts of God, etc.). There are ways to try to milk this information from totally interspersed data, but I find that, especially with low sample sizes, they tend not to work so well.
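A minimal sketch of that idea (with made-up response data and the design suggested above): with replicated levels you can compare a straight-line fit against a one-mean-per-level fit, which is the classical lack-of-fit test against pure error.

set.seed(1)
dose <- c(1, 1, 1, 2, 3, 4, 5, 5, 5, 6, 7, 8, 9, 9, 9)   # the design suggested above
response <- 2 + 0.5 * dose + rnorm(length(dose))          # simulated responses
linear_fit <- lm(response ~ dose)                         # treatment as a continuous regressor
means_fit <- lm(response ~ factor(dose))                  # a separate mean for each level
anova(linear_fit, means_fit)                              # lack-of-fit test; pure error comes from the replicated levels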