The most common way to fish for statistical significance in ecology

Based on my own not-inconsiderable but admittedly anecdotal experience, here’s the most common way to fish for statistical significance in ecology: analyze several different measures, indices, or indicators of the “same” thing.

The problem arises because many important concepts in ecology are only vaguely defined. “Diversity”, for instance–ecology is awash in diversity indices. But that’s far from the only example. Indeed, at some point in pretty much any ecology study the investigator will have a choice as to how to quantify something. In my own work, for instance, I have to decide how to measure spatial synchrony of population fluctuations. There are various ways one might do it. One could look at synchrony of abundances, or population growth rates, or per-capita growth rates. One could quantify synchrony with the cross-correlation coefficient, or some other measure of association. Etc. Often, different choices will lead to at least slightly and perhaps substantially different answers. And while in some cases there may be some mathematical, theoretical, or empirical reason to prefer one measure over others, those reasons often aren’t decisive. And in many other cases there’s no obvious reason to prefer one measure over another.
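To make the point about measurement choices concrete, here is a minimal Python sketch. The two abundance series are invented for illustration (they are not data from any study), and `pearson_r` and `log_growth_rates` are helper names made up here. It computes zero-lag synchrony two ways: as the correlation of abundances, and as the correlation of log growth rates, ln(N[t+1]/N[t]), which is one common way to express per-capita growth.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def log_growth_rates(abundances):
    """Log growth rate between consecutive censuses: ln(N[t+1] / N[t])."""
    return [math.log(later / earlier)
            for earlier, later in zip(abundances, abundances[1:])]


# Hypothetical abundances of two populations over eight censuses.
pop_a = [10, 12, 11, 14, 13, 16, 15, 18]
pop_b = [10, 9, 12, 11, 14, 13, 16, 15]

# Same data, two defensible definitions of "synchrony".
sync_abundance = pearson_r(pop_a, pop_b)
sync_growth = pearson_r(log_growth_rates(pop_a), log_growth_rates(pop_b))

print("synchrony of abundances:  ", round(sync_abundance, 2))
print("synchrony of growth rates:", round(sync_growth, 2))
```

With these made-up series the abundance-based correlation comes out positive (both populations trend upward) while the growth-rate-based correlation comes out negative (their census-to-census changes are out of phase). Neither number is "the" synchrony; which one matters depends on what you are trying to measure.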

In such cases, it seems reasonable to look at, and report, various different choices. Report results for several different diversity indices, for instance, or several different measures of synchrony, or whatever. This feels less arbitrary than just picking one possible measure out of many. And it probably feels like you’re doing reviewers a favor, or at least pre-empting them. After all, aren’t they just going to ask you to report a bunch of alternative measures anyway? And heck, there are basically no page limits anymore thanks to online appendices, so why not just report results for a bunch of different measures?

This seems reasonable–but on balance, it’s not a good idea. In practice, it’s mostly just a (presumably unintentional) way to fish for statistical significance, by disguising exploratory analyses as hypothesis-testing analyses. I can’t recall ever reading an ecology paper where someone learned something scientifically interesting by comparing and contrasting results for different measures or indices of the “same” thing. Instead, having multiple measures of the same thing just gives authors more chances to come up with a statistically-significant result on which they can then focus. Or at least more excuse to wave their arms about what might be going on.
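A rough simulation can show how much extra chance multiple measures buy you. The Python sketch below rests on assumptions that are mine, not the post's: 30 sites, a driver variable that truly has no effect, several noisy measures that all share one underlying signal (the `shared` weight of 0.7 is arbitrary), and a significance test based on the Fisher z approximation for a correlation coefficient. The function names are made up for illustration. It estimates how often at least one measure comes out "significant" at the 0.05 level.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def is_significant(r, n, z_crit=1.96):
    """Two-sided test of r = 0 at roughly the 0.05 level.

    Uses the Fisher z approximation: atanh(r) * sqrt(n - 3) is
    approximately standard normal when the true correlation is zero.
    """
    return abs(math.atanh(r)) * math.sqrt(n - 3) > z_crit


def fishing_rate(n_measures, n_sites=30, shared=0.7, n_sims=2000, seed=1):
    """Fraction of null datasets in which at least one measure is 'significant'."""
    rng = random.Random(seed)
    noise_weight = math.sqrt(1 - shared ** 2)
    hits = 0
    for _ in range(n_sims):
        # The driver is independent of everything: the null hypothesis is true.
        driver = [rng.gauss(0, 1) for _ in range(n_sites)]
        # All measures share one latent signal, so they are correlated
        # with one another, like alternative indices of the "same" thing.
        latent = [rng.gauss(0, 1) for _ in range(n_sites)]
        found = False
        for _ in range(n_measures):
            measure = [shared * value + noise_weight * rng.gauss(0, 1)
                       for value in latent]
            if is_significant(pearson_r(driver, measure), n_sites):
                found = True
        hits += found
    return hits / n_sims


for k in (1, 3, 8):
    print(k, "measure(s):", fishing_rate(k))
```

With one measure the rate should sit near the nominal 0.05. With more measures it climbs, though less steeply than the 1 - 0.95^k you would get from fully independent tests, because the measures are correlated with one another.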

There’s a deeper issue here as well, that I’m still mulling over. In the past, I’ve said that if different measures of the “same” thing give you different answers, then that shows that they’re not actually measures of the same thing after all, and you don’t really know what you’re trying to measure. And if different measures of the “same” thing give you the same answer, they’re redundant with one another and you shouldn’t report them all. I still think that’s mostly right, but now I worry that it’s a bit misleading. I now think you can have a measurement problem even if various different choices of measure give you the “same” results. So you shouldn’t just rely on your data to warn you when you have a measurement problem. The problem here is different and deeper than just fishing for statistical significance. Andrew Gelman is good on this deeper issue.

11 thoughts on “The most common way to fish for statistical significance in ecology”

  1. Hi Jeremy,

    In psychology we use multiple measures all the time. The reason is that a multi-measure approach can help to reduce measurement error. To do this properly without fishing for significance you (a) first check that the measures measure the same construct using correlations among measures and/or factor analysis, and (b) you create a composite measure of correlated measures and use the composite measure as the dependent variable. As there is now only one dependent variable you do not run into the multiple-comparison problem and temptation for cherry-picking.

    • Yes, ecologists rarely take this approach, although it’s not unheard of in ecology. It addresses the issue of fishing for statistical significance, but arguably not deeper issues regarding measurement. Besides Gelman’s comments, I’m thinking of stuff like debates over the reification of the g factor or “general intelligence” (http://bactra.org/weblog/523.html).
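The two-step composite approach described in this thread can be sketched in a few lines of Python. Everything specific here is an assumption for illustration: the site values are invented, the 0.5 correlation threshold is arbitrary, and the composite is a plain average of z-scores rather than factor scores from a factor analysis.

```python
import statistics


def zscores(values):
    """Standardize values to mean 0 and sample standard deviation 1."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


def pearson_r(xs, ys):
    """Pearson correlation computed from standardized values."""
    zx, zy = zscores(xs), zscores(ys)
    return sum(a * b for a, b in zip(zx, zy)) / (len(xs) - 1)


def composite(measures, min_r=0.5):
    """Average z-scored measures into one variable.

    Step (a): refuse to combine measures whose pairwise correlation
    falls below min_r (an arbitrary threshold chosen for this sketch).
    Step (b): average the standardized measures site by site.
    """
    names = list(measures)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            r = pearson_r(measures[first], measures[second])
            if r < min_r:
                raise ValueError(
                    f"{first} and {second} correlate at r={r:.2f}; "
                    "they may not measure the same construct"
                )
    standardized = [zscores(measures[name]) for name in names]
    return [statistics.mean(site_values) for site_values in zip(*standardized)]


# Hypothetical values of three diversity indices at five sites.
site_measures = {
    "richness": [5, 8, 12, 15, 20],
    "shannon": [1.1, 1.5, 1.9, 2.2, 2.6],
    "simpson": [0.55, 0.68, 0.77, 0.83, 0.90],
}

diversity_composite = composite(site_measures)
print([round(v, 2) for v in diversity_composite])
```

The resulting single variable would then serve as the one dependent variable in the analysis, which removes the multiple-comparison temptation. It does not, by itself, settle what the composite actually measures.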

  2. “I can’t recall ever reading an ecology paper where someone learned something scientifically interesting by comparing and contrasting results for different measures or indices of the “same” thing.”

    Actually, I published on this in 2013, although I was not looking for something “statistically significant,” but rather insignificant. Initially I had regressed Simpson’s diversity with a measure I developed to assess community structure, and more specifically, stratification. The last thing I expected, and did not predict, was that structure would be independent of biodiversity… but lo and behold, it was. What was even more compelling was that the coefficient of determination for this initial test was 0.00. Stunning, to me at least.

    So I decided to “go fish,” in an attempt to ferret out this most unusual result. All told, I regressed eleven indices (total diversity, richness, dominance & abundance) with structure. This fishing trip lasted quite a long while, and initially created more confusion than clarity. So I agree, these fishing trips can make matters worse rather than better. Eventually I had to apply several approaches to reveal the secret… which was that not only was the mean of diversity independent of structure, but it was also preserved across all structural categories (five in all). Juxtaposed with this was the range/variance of diversity, which was reduced from both tails of the distribution as structure became more organized.

    This fishing expedition worked out very nicely for me. However, I did not set out looking for significance (after the initial result), but insignificance. While it is far more difficult to construct null hypotheses that are interesting, I believe the trouble people run into relative to fishing and other matters is when they come up with “silly nulls,” or at least nulls that are unimaginative. So much bad science can be traced to this very bad habit. Construct an interesting null, and one that you are motivated to support, and you will avoid 95% of the pitfalls associated with statistics. Practicing this approach also means you will be doing what you are supposed to do in science… refute the alternative hypothesis.

    “… if different measures of the “same” thing give you different answers, then that shows that they’re not actually measures of the same thing after all, and you don’t really know what you’re trying to measure.”

    I also published on this in 2013 as it concerns biodiversity. I developed an approach (reflective analysis) concerning what I mention above, using minimum line distances & PCA scatter plots. Although I have yet to publish in-depth specifics on the nature of these reflective analyses, one can view these scatter plots and get a very good feel for whether or not measurements are equivalences of one another. This is a fascinating issue in and of itself, and because of the plethora of bio-indicators across biology, I believe we should at some point hash out the relatedness of one index to another. One thing that was readily apparent, for example, was that dominance was indeed the complement of abundance, as measured, because these scatter plots were virtual mirror images of one another.

    To do good science, we must always endeavor to support the null, not the alternative. Searching for significance is a fool’s errand.

  3. Hi Jeremy,

    Comparing among different measures of the ‘same’ thing, like diversity, can indeed yield very different insights, and there are valid reasons for doing so.

    For instance, I recently explored the relationship between species richness and Simpson diversity, which led me to discover that there were pretty significant differences in evenness among my study sites that I was not expecting, nor really looking for, but which turned out to be pretty exciting. Granted richness and Shannon and Simpson diversity do not claim to measure the *same* aspect of diversity, this may still fall under the general umbrella of throwing lots of ‘diversity’ indices at the same pile of data.

    Further, using entropy-based metrics of diversity, such as Rao’s Q, yields indices whose values are in the same units (provided you conduct a simple linear transformation). Comparing and contrasting values of Rao’s Q derived from simple taxonomic distances to those derived from functional or phylogenetic distances yields a measure of functional or evolutionary redundancy within the assemblage. So while the values are slight variants on one another and supposedly measure the same phenomenon, they provide some pretty significant additional insight when calculated and compared.

    Along these lines, Leinster and Cobbold proposed ‘diversity profiles’ which created a continuum of ‘diversity indices’ derived from the same general equation (see: http://www.esajournals.org/doi/abs/10.1890/10-2402.1), which allows the researcher to explore how community diversity ‘changes’ as one increasingly emphasizes abundant species. This actually combats the very problem you raise above by forcing the researcher to understand how diversity indices compare and the subtleties of what each is telling you.

    I do agree that we suffer from an overabundance of diversity indices, with more being proposed every day. But let’s not throw the baby out with the bathwater.

    Thanks for the thought-provoking post.

    Cheers, Jon

    • “For instance, I recently explored the relationship between species richness and Simpson diversity, which led me to discover that there were pretty significant differences in evenness among my study sites that I was not expecting, nor really looking for, but which turned out to be pretty exciting. Granted richness and Shannon and Simpson diversity do not claim to measure the *same* aspect of diversity”

      That’s part of my concern, actually. As noted in that old post of mine I linked to, different indices are different. So *of course* they’re going to behave at least somewhat differently. So while it certainly behooves the investigator to be aware that different indices of “diversity” (or anything else) are different, I don’t think one should think of oneself as “discovering” something when one calculates different indices for some dataset and finds that they behave at least somewhat differently.

      “Along these lines, Leinster and Cobbold proposed ‘diversity profiles’…This actually combats the very problem you raise above by forcing the researcher to understand how diversity indices compare and the subtleties of what each is telling you”

      You’re more optimistic than me. I predict that Leinster and Cobbold’s approach will not have the effect you’re hoping it will have. The problem of an overabundance of indices of a vaguely-defined concept cannot be solved by proposing more indices. Not even “meta-indices” that include various other indices as special cases. The source of the problem is that ecologists think it’s worth trying to measure a vaguely-defined concept–“diversity”–in a quantitative way.

      Look, if you have a theoretical mathematical model of some bit of the world that includes some variable, parameter, or other well-defined quantity that could be called “diversity” (or “functional redundancy”, or whatever), then people should by all means go out into nature and measure that quantity as part of their efforts to parameterize or test the model. For instance, the reason Dave Vasseur and I measure synchrony in terms of cross-correlation coefficients is because we have a theoretical predator-prey model that measures synchrony in terms of cross-correlation coefficients. To test the model’s predictions, we have to define and measure “synchrony” exactly the same way the model defines and measures it. But if all you have is some vague verbal model (and all verbal models are vague), or maybe no model at all because you’re doing descriptive work or something, then you don’t really have any decisive basis on which to choose among the indices of whatever vaguely-defined concept you’re studying. And you can’t fix that either by finding the “right” index, or by presenting results for a bunch of different indices.

      “Comparing and contrasting values of Rao’s Q derived from simple taxonomic distances to those derived from functional or phylogenetic distances yields a measure of functional or evolutionary redundancy within the assemblage.”

      That just raises more measurement problems. “Functional” or “evolutionary” “redundancy” is as vaguely defined as “diversity”.
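For readers who have not seen a "diversity profile" of the kind debated in this thread, here is a minimal Python sketch of the simplest case: Hill numbers, in which a single order parameter `q` controls how strongly abundant species are emphasized. This is the naive version that ignores species similarity, not Leinster and Cobbold's similarity-sensitive measure, and the community counts are invented for illustration.

```python
import math


def hill_number(abundances, q):
    """Hill number (effective number of species) of order q.

    q = 0 gives species richness, q = 1 gives exp(Shannon entropy),
    and q = 2 gives the inverse Simpson concentration.
    """
    total = sum(abundances)
    proportions = [a / total for a in abundances if a > 0]
    if q == 1:
        # The general formula is undefined at q = 1; use its limit.
        return math.exp(-sum(p * math.log(p) for p in proportions))
    return sum(p ** q for p in proportions) ** (1 / (1 - q))


# Hypothetical counts of five species in one community.
community = [50, 30, 10, 5, 5]

# The "profile" is just the same formula evaluated at several orders.
for q in (0, 0.5, 1, 2, 4):
    print("q =", q, "->", round(hill_number(community, q), 2))
```

The profile declines as `q` increases whenever abundances are uneven, and stays flat for a perfectly even community. Whether reading that curve counts as discovering something, or just as restating that different indices are different, is exactly the disagreement in the comments above.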

  4. Hi Jeremy,

    I have often wrestled with this same issue, particularly in the context of contrasting metrics for diversity. I wanted to share two thoughts I had.

    1) Many of the ideas in ecology such as diversity are conceptual constructs. They don’t necessarily have clear definitions, which is why we end up with so many different metrics. Different metrics usually measure different aspects of the conceptual construct. The Anderson et al. 2011 Ecology Letters paper on multiple measures of Beta Diversity is an excellent example of this. So exploring the response of multiple metrics to a driver can be really informative about which aspect of the fuzzy concept is actually responding. If you’re up front about your work flow, I think it’s totally reasonable to start with a general hypothesis, test it with multiple metrics, and then (in the case of conflicting results) modify your initial hypothesis to reflect the new information gained from those results.

    The pitfall is the researcher who doesn’t show their work. If you cherry pick only the metric that worked and write the paper as if you never tested the other metrics then I think you’re engaging in sloppy science. You’re omitting important information and also failing to move the field forward by modifying the initial hypothesis.

    2) I also wanted to echo Dr. R’s comment. Multiple metrics is a perfect opportunity to use some SEM modeling and build a latent response variable or composite predictor. I think people often equate SEM to complex path diagram type models but it could easily be used in place of a simple regression model. For a super simple example, suppose you hypothesized that disturbance has a hump shaped impact on local diversity (I know you’re not a huge fan of this, but bear with me!). You could calculate diversity with multiple metrics (Shannon, Simpson, etc). You could also estimate “disturbance” in multiple ways (frequency, intensity, duration). Each of these conceptual constructs could be modeled as latent variables and then modeled in an SEM using a polynomial (y = x + x^2 + e). I think this approach, if done correctly, is pretty robust.

    • “You could also estimate “disturbance” in multiple ways (frequency, intensity, duration).”

      I see the general point you’re making (see my reply to Dr. R above), but this specific example is a bad idea (but you knew I’d say that!). Rather than trying to test “the IDH” with latent variables representing “disturbance” and “diversity”, you should be trying to test some specific, fully-specified mathematical model of disturbance-diversity relationships. The test could take various forms–you don’t necessarily have to be testing a highly biologically-detailed model specifically tailored to your system, or parameterizing the model, or whatever. But at least if your starting point is a well-specified model, you know what you’re supposed to measure. E.g., if in the model “diversity” is measured as “species richness” and “disturbance” is “frequency of density-independent mortality events”, then that’s what you need to measure. Rather than trying (futilely) to make a vague conceptual construct precise with latent variables. Latent variables solve the problem of fishing for significance, but they don’t solve deeper problems of knowing exactly what you’re trying to measure.

  5. It would be interesting if experimental protocols/statistical models/tested hypotheses were published/uploaded before data were collected. This wouldn’t preclude follow up analyses or additional data mining, but it would be clear which was which.
