For better and for worse, big data has reached ecology and is thoroughly permeating it. Unlike many, I actually think this is a good thing (not to the degree it replaces other things, but to the degree to which it becomes another tool in our kit).
But I fear there is a persistent myth that will cripple this new tool – that more data is always better. This myth may exist because “big” is in the name of the technique. Or it may be an innate human trait (especially in America) to value the bigger house, car, etc. Or maybe in science we are always trying to find simple metrics to know where we rank in the pecking order (e.g. impact factors), and the only real metric we have to rank an experiment is its sample size.
And there is a certain logic to this. You will often hear that the point of big data is that “all the errors cancel each other out”. This goes back to statistics 101. The standard error (a description of the uncertainty in our estimate of the mean of a population) is σ/√n. Since n (sample size) is in the denominator, the “error” just gets smaller and smaller as n gets bigger. And p-values get correspondingly closer to zero, which is the real goal. Right?
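If you want to see that textbook intuition in action, here is a quick simulation (all the numbers are invented purely for illustration):

```python
# A minimal sketch of the "more data is better" intuition: with i.i.d.
# draws from a single population, the standard error of the mean shrinks
# like sigma / sqrt(n).
import numpy as np

rng = np.random.default_rng(42)
sigma = 2.0  # true standard deviation of the (single) population

for n in [10, 100, 1000, 10000]:
    sample = rng.normal(loc=5.0, scale=sigma, size=n)
    se = sample.std(ddof=1) / np.sqrt(n)
    print(f"n={n:6d}  estimated SE of mean = {se:.4f}")
# SE drops by ~sqrt(10) each row -- the textbook case this post questions.
```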
Well, not really. First, σ (the standard deviation of the noise) is in the numerator. If all data were created equal, σ shouldn’t change much as we add data. But in reality there is a lognormal-like aspect to data quality: a few very high quality datasets and many low quality datasets (I just made this law up, but I expect most of you will agree with it). And even if we’re not going from better to worse datasets, we are almost certainly going from more comparable data (e.g. same organisms, nearby locations) to less comparable data. The fact that noise in ecology is reddened (variance goes up without limit as temporal and spatial extent increase) is a law (and it almost certainly carries over to increasingly divergent taxa, although I don’t know of a study of this). So as we add data we’re actually adding lower quality and/or more divergent datasets with larger and larger σ. So σ/√n can easily go up as we add data.
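Here is a little simulation of that made-up law (the per-dataset noise levels are hypothetical, chosen just to make the point):

```python
# If each new dataset is noisier than the last (the hypothetical
# "lognormal data quality" law above), the standard error of the pooled
# mean can rise, not fall, as more data comes in.
import numpy as np

rng = np.random.default_rng(1)
true_mean = 5.0
# Hypothetical noise levels: a few good datasets, many bad ones.
sigmas = [0.5, 0.7, 1.0, 3.0, 8.0, 20.0, 50.0]
n_per = 100  # observations per dataset

pooled = np.array([])
for i, s in enumerate(sigmas, start=1):
    pooled = np.concatenate([pooled, rng.normal(true_mean, s, n_per)])
    se = pooled.std(ddof=1) / np.sqrt(len(pooled))
    print(f"{i} datasets, n={len(pooled):4d}: pooled SE = {se:.3f}")
# sigma grows faster than sqrt(n), so the SE climbs as "bad" data arrives.
```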
But that is the least of the problems. Estimating effect size (a difference in means or slopes) is often only one task. What if we care about r² or RMSE (my favorite measures of prediction)? These have σ in the denominator and the numerator respectively, so both metrics only get worse as variance increases.
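A quick sketch of what that means in practice (the true slope and noise levels are arbitrary choices, not from any real dataset):

```python
# Why r^2 and RMSE both degrade as noise grows: for a correctly specified
# linear model, roughly r^2 ~ signal/(signal + sigma^2) and RMSE ~ sigma.
import numpy as np

rng = np.random.default_rng(7)
n = 5000
x = rng.uniform(0, 10, n)
signal = 2.0 * x  # true relationship; slope chosen arbitrarily

for sigma in [1, 5, 20]:
    y = signal + rng.normal(0, sigma, n)
    slope, intercept = np.polyfit(x, y, 1)
    pred = slope * x + intercept
    rmse = np.sqrt(np.mean((y - pred) ** 2))
    r2 = 1 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)
    print(f"sigma={sigma:3d}: r^2 = {r2:.3f}, RMSE = {rmse:.2f}")
# r^2 slides toward 0 and RMSE tracks sigma -- more noise, worse metrics,
# no matter how big n is.
```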
And then there is the hardest-to-fix problem of all – what if adding bad datasets adds bias? It’s not too hard to imagine how this occurs. Observer effects are a big one.
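To see why bias is so much nastier than noise, here is a toy example where half the data carries a hypothetical +2 observer offset:

```python
# Noise averages out as n grows; a systematic observer effect does not.
import numpy as np

rng = np.random.default_rng(3)
true_mean = 10.0

for n in [100, 10000, 1000000]:
    unbiased = rng.normal(true_mean, 1.0, n // 2)
    biased = rng.normal(true_mean + 2.0, 1.0, n // 2)  # hypothetical observer bias
    est = np.concatenate([unbiased, biased]).mean()
    print(f"n={n:8d}: estimated mean = {est:.3f} (truth = {true_mean})")
# The estimate converges to 11, not 10 -- bigger n just makes you more
# confidently wrong.
```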
So more data definitely does NOT mean a better analysis. It means including datasets that are lower quality and more divergent, and hence noisier and probably more biased.
And this is all just within the framework of statistical sampling theory. There are plenty of other problems too. Denser data (in space or time) often means worse autocorrelation. And another problem: at a minimum, less observation effort produces smaller counts (of species, individuals, or whatever). Most people know to correct for this crudely by dividing by effort (e.g. CPUE is catch per unit effort). But what if the observation is a non-linear function of effort (e.g. an increasing, decelerating function, as it often is)? Then dividing observed counts by effort will inappropriately downweight all of those high effort datasets (see the sketch below). Another closely related issue that relates to non-linearity is scale. It is extremely common in meta-analyses and macroecology analyses to lump together studies at very different scales. Is this really wise given that we know patterns and processes often change with scale? Isn’t this likely to be a massive introduction of noise?
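Here is a toy illustration of that effort problem, using a made-up saturating count function:

```python
# Suppose the expected count saturates with effort, say
# count ~ a * (1 - exp(-b * effort)) -- an invented but typical
# decelerating form. Dividing by effort (the CPUE-style correction)
# then makes high-effort surveys look artificially poor.
import numpy as np

a, b = 100.0, 0.05  # hypothetical asymptote and saturation rate

for effort in [5, 20, 100, 500]:
    count = a * (1 - np.exp(-b * effort))  # expected count, no noise
    cpue = count / effort
    print(f"effort={effort:4d}: count={count:6.1f}, count/effort={cpue:6.2f}")
# Same underlying abundance everywhere, yet count/effort falls ~20-fold
# from the low-effort to the high-effort survey -- an artifact of dividing.
```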
And it goes beyond statistical framing to inferential framing – to what I think of as the depth of the data. What if we want to know about the distribution of a species? It seems pretty obvious that measuring the abundance of that species at many points across its range would be the most informative (since we know abundance varies by orders of magnitude across a range within a species). But that’s a lot of work. Instead, we have lots of datasets that only measure occupancy. But even that is quite a bit of work. Or we can just run a query on museum records and download hundreds of presence records in 15 minutes. But now we’re letting data quantity drive the question. If we really want to know where a species is and is not found, measuring both sides of what we’re interested in (presences and absences) is a far superior approach (and no amount of magic statistics will fix that). The same issues occur with species richness. If we’re really serious about comparing species richness (a good example of that aforementioned case where the response to effort is non-linear), we need abundances to rarefy. But boatloads of papers don’t report abundances, just richness. Should we really throw them all away in our analyses?
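And a quick sketch of why rarefaction needs those abundances (the abundance vectors below are invented for illustration):

```python
# Richness rises non-linearly with the number of individuals sampled, so
# comparing raw richness across different sampling efforts is apples to
# oranges. Rarefaction subsamples down to a common number of individuals.
import numpy as np

rng = np.random.default_rng(11)

def rarefied_richness(abundances, n_draw, reps=200):
    """Mean species count in random subsamples of n_draw individuals."""
    pool = np.repeat(np.arange(len(abundances)), abundances)
    return np.mean([len(np.unique(rng.choice(pool, n_draw, replace=False)))
                    for _ in range(reps)])

site_a = [50, 30, 10, 5, 3, 1, 1]                  # 7 species, 100 individuals
site_b = [400, 300, 200, 50, 30, 10, 5, 3, 1, 1]   # 10 species, 1000 individuals

print("raw richness:      A =", len(site_a), " B =", len(site_b))
print("rarefied to n=100: A =", round(rarefied_richness(site_a, 100), 1),
      " B =", round(rarefied_richness(site_b, 100), 1))
# At equal effort the gap shrinks -- raw richness partly measures effort,
# not diversity. Without abundances you simply cannot do this correction.
```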
As a side note, a recurring theme in this post and many previous ones is that complex, magic statistical methods will NOT fix all the shortcomings of the data. They cannot. Nothing can extract information that isn’t there or reduce noise that is built in.
So, returning to the question above: should I knowingly leave data on the table and out of the analysis? The trend has been to never say no to a dataset. To paraphrase Will Rogers, “I never met a dataset I didn’t like”. But is this the right trend? I am of course suggesting it is not. I think we would be better off if we only used high quality datasets that are directly relevant to our question and that support the necessary analytical techniques. Which datasets should we be omitting? I cannot tell you, of course. You have to think it through in the particulars. But things like sampling quality (e.g. amount of noise, quality control of observation protocols), making apples-to-apples comparisons, and the depth of the data (e.g. abundance vs occupancy vs presence/absence) may well place you in a realm where less is more!
What do you think? Have you had a situation where you turned away data?