Big data doesn’t mean biggest possible data

For better and for worse, big data has arrived in ecology and is thoroughly permeating it. Unlike many, I actually think this is a good thing (not to the degree it replaces other things, but to the degree it becomes another tool in our kit).

But I fear there is a persistent myth that will cripple this new tool – that more data is always better. This myth may exist because “big” is in the name of the technique. Or it may be an innate human trait (especially in America) to value a bigger house, a bigger car, etc. Or maybe in science we are always trying to find simple metrics to know where we rank in the pecking order (e.g. impact factors), and the only real metric we have to rank an experiment is its sample size.

And there is a certain logic to this. You will often hear that the point of big data is that “all the errors cancel each other out”. This goes back to Statistics 101. The standard error (a description of the uncertainty in our estimate of the mean of a population) is \sigma/\sqrt{n}. Since n (sample size) is in the denominator, the “error” just gets smaller and smaller as n gets bigger. And p-values get correspondingly closer to zero, which is the real goal. Right?

Well, not really. First, \sigma (the standard deviation of the noise) is in the numerator. If all data were created equal, \sigma shouldn’t change too much as we add data. But in reality there is a lognormal-like aspect to data quality: a few very high quality datasets and many low quality datasets (I just made this law up, but I expect most of you will agree with it). And even if we’re not going from better to worse datasets, we are almost certainly going from more comparable data (e.g. same organisms, nearby locations) to less comparable data. The fact that noise in ecology is reddened (variance goes up without limit as temporal and spatial extent increase) is a law (and it almost certainly carries over to increasingly divergent taxa, although I don’t know of a study of this). So as we add data we’re actually adding lower quality and/or more divergent datasets with larger and larger \sigma. So \sigma/\sqrt{n} can easily go up as we add data.
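
To make that concrete, here is a minimal simulation sketch (all the numbers are made up – the per-dataset sample size and the sequence of \sigma values are purely hypothetical) showing how the estimated \sigma/\sqrt{n} can climb as progressively noisier datasets are pooled in:

```python
# Minimal sketch: pooling progressively noisier datasets can raise the
# standard error of the combined mean even though n keeps growing.
import numpy as np

rng = np.random.default_rng(42)

true_mean = 10.0
samples_per_dataset = 30                  # hypothetical size of each added dataset
dataset_sigmas = [1, 1, 5, 10, 20, 40]    # hypothetical: quality degrades as we add data

pooled = np.array([])
for sigma in dataset_sigmas:
    new_data = rng.normal(true_mean, sigma, samples_per_dataset)
    pooled = np.concatenate([pooled, new_data])
    n = len(pooled)
    se = pooled.std(ddof=1) / np.sqrt(n)  # estimated sigma / sqrt(n)
    print(f"n = {n:3d}   estimated standard error = {se:.3f}")

# Typical output: the standard error shrinks while the early, clean datasets
# come in, then grows again once sigma rises faster than sqrt(n).
```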

But that is the least of the problems. First, estimating effect size (a difference in means or slopes) is often only one task. What if we care about r² or RMSE (my favorite measures of prediction)? These have \sigma in the denominator and numerator respectively, so those metrics only get worse as variance increases.
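
To spell out why (a rough sketch, assuming the simplest case of additive noise with standard deviation \sigma_{noise} laid over a signal with standard deviation \sigma_{signal}):

RMSE = \sqrt{\tfrac{1}{n}\sum_i (y_i - \hat{y}_i)^2} \approx \sigma_{noise}

r^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2} \approx \frac{\sigma_{signal}^2}{\sigma_{signal}^2 + \sigma_{noise}^2}

There is no \sqrt{n} anywhere to come to the rescue: as the added data push \sigma_{noise} up, RMSE rises and r² falls no matter how large n gets.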

And then there is the hardest to fix problem of all – what if adding bad datasets adds bias? It’s not too hard to imagine how this occurs. Observer effects are a big one.

So more data definitely does NOT mean a better analysis. It means including datasets that are lower quality and more divergent, and hence noisier and probably more biased.

And this is all just within the framework of statistical sampling theory. There are plenty of other problems too. Denser data (in space or time) often means worse autocorrelation. And another problem: at a minimum, less observation effort produces smaller counts (of species, individuals, or whatever). Most people know to correct for this crudely by dividing by effort (e.g. CPUE is catch per unit effort). But what if the observation is a non-linear function of effort (e.g. an increasing, decelerating function, as it often is)? Then dividing the observed count by effort will inappropriately downweight all of those high effort datasets (see the sketch below). Another closely related issue that relates to non-linearity is scale. It is extremely common in meta-analyses and macroecology analyses to lump together studies at very different scales. Is this really wise given that we know patterns and processes often change with scale? Isn’t this likely to be a massive introduction of noise?
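
Here is a minimal sketch of that effort problem (the saturating curve and the numbers are hypothetical, just to illustrate the shape of the issue):

```python
# Minimal sketch: if counts are an increasing but decelerating function of
# effort, the naive "divide by effort" correction (CPUE-style) penalizes the
# high-effort surveys instead of standardizing them.
import numpy as np

effort = np.array([1.0, 5.0, 10.0, 50.0, 100.0])  # e.g. hypothetical trap-nights
# Assumed saturating (Michaelis-Menten-like) response: counts level off near 80.
expected_count = 80.0 * effort / (effort + 10.0)

count_per_effort = expected_count / effort          # the naive correction
for e, c, cpe in zip(effort, expected_count, count_per_effort):
    print(f"effort = {e:6.1f}   count = {c:5.1f}   count/effort = {cpe:5.2f}")

# Nothing about the underlying density differs among these surveys, yet
# count/effort falls from ~7.3 to ~0.7 -- the hardest-working surveys end up
# looking like the poorest sites.
```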

And it goes beyond statistical framing to inferential framing, to what I think of as the depth of the data. What if we want to know about the distribution of a species? It seems pretty obvious that measuring the abundance of that species at many points across its range would be the most informative approach (since we know abundance varies by orders of magnitude across a range within a species). But that’s a lot of work. Instead, we have lots of datasets that only measure occupancy. But even that is quite a bit of work. We can just do a query on museum records and download often 100s of presence records in 15 minutes. But now we’re letting data quantity drive the question. If we really want to know where a species is and is not found, measuring both sides of what we’re interested in is a far superior approach (and no amount of magic statistics will fix that). The same issues occur with species richness. If we’re really serious about comparing species richness (a good example of that aforementioned case where the response to effort is non-linear), we need abundances to rarefy (see the sketch below). But boatloads of papers don’t report abundances, just richness. Should we really throw them all away in our analyses?
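
For anyone who has not run into it, here is a minimal rarefaction sketch (the two abundance vectors are invented purely for illustration): the expected richness in a standardized subsample of m individuals, which is what lets samples of very different sizes be compared on an equal footing:

```python
# Minimal rarefaction sketch: expected species richness in a random subsample
# of m individuals, using the classic hypergeometric rarefaction formula.
from math import comb

def rarefied_richness(abundances, m):
    """Expected number of species present in a random draw of m individuals."""
    n = sum(abundances)
    return sum(1 - comb(n - n_i, m) / comb(n, m) for n_i in abundances)

# Two hypothetical communities: site B was sampled five times harder than site A.
site_a = [50, 30, 10, 5, 5]                # 100 individuals, 5 species observed
site_b = [300, 150, 25, 10, 5, 5, 3, 2]    # 500 individuals, 8 species observed

m = 100   # rarefy both down to the smaller sample size
print(f"Site A, rarefied to {m}: {rarefied_richness(site_a, m):.2f} species")
print(f"Site B, rarefied to {m}: {rarefied_richness(site_b, m):.2f} species")

# Comparing the raw richness values (5 vs 8) confounds sampling effort with
# diversity; the rarefied values compare the sites at the same effort -- which
# is exactly what you cannot do when a paper reports only richness.
```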

As a side note, a recurring theme in this post and many previous ones is that complex, magic statistical methods will NOT fix all the shortcomings of the data. They cannot. Nothing can extract information that isn’t there or reduce noise that is built in.

So, returning to the question of two paragraphs ago, should I knowingly leave data on the table and out of the analysis? The trend has been to never say no to a dataset. To paraphrase Will Rogers, “I never met a dataset I didn’t like”. But is this the right trend? I am of course suggesting it is not. I think we would be better off if we only used high quality datasets that are directly relevant to our question and support the necessary analytical techniques. Which datasets should we be omitting? I cannot tell you, of course. You have to think it through in the particulars. But things like sampling quality (e.g. amount of noise, quality control of observation protocols), getting data that make apples-to-apples comparisons, and the depth of the data (e.g. abundance vs occupancy vs presence/absence) may well place you in a realm where less is more!

What do you think? Have you had a situation where you turned away data?



16 thoughts on “Big data doesn’t mean biggest possible data”

  1. Ahhh – excellent topic, & one near & dear to me. In 2011 I was invited to join a group of ecologists gearing up for meta-analyses. I was just 2 years removed from my career in medicine, so I had a lot to offer. I think it is fair to say meta-analysis was refined by medicine many moons ago, primarily as a result of evidence-based clinical practice and, more recently, very rigid guidelines for the testing of pharmaceuticals.

    In my view, ecology is now going through what medicine did long ago, but for different reasons. If you visit the medical literature, you will discover they usually have the luxury of selecting from thousands of studies for any meta-analysis, thus they can be very picky. Standardization of protocols allows for this, but so too does the tendency for any given set of disease sequelae to occur among specific age groups, genders and often geographies.

    In ecology, you might be limited to but a handful of papers to choose from, and at best maybe a few hundred. There are many approaches to meta-analysis, and some allow more wiggle room than others concerning the quality and continuity of data. But your point about noise is a good one. If noise swamps out significance, why bother? Things are changing for the better, though. For example, decades of remote sensing data, a century plus of hydrologic data for major rivers, long term biomonitoring programs, etc. will eventually allow ecology to catch up with medicine on this new frontier.

    Yes, I have rejected data for meta-analysis… boatloads of it. Currently I am conducting a meta-analysis of percent cover estimates of species from myriad communities, habitats, ecosystems & biomes. Where such studies introduce noise, via observer error, is in ocular estimates of percent cover. Often, observer error in and of itself exceeds 50% for these approaches (see Elzinga, Salzer & Willoughby 2001). Thus I have limited myself to studies applying the point/line intercept protocol, where observer error is vastly reduced. But the trade-off is that one never captures true richness because many rare species are not sampled. Fortunately that does not adversely impact my research questions.

    Meta-analysis is a very tricky business at best. You are absolutely correct that more is not always better. I have read many a paper in ecology where a meta-analysis left me feeling like I had just lost an hour of my life that I will never get back. I find this very frustrating, but my view is biased by the hundreds of quality papers I read in medicine on this topic. So I beg of everyone to exercise discretion. Rather than being a bull in a china shop, try instead to sample the very finest of wines.

  2. “And then there is the hardest to fix problem of all – what if adding bad datasets adds bias. It’s not too hard to imagine how this occurs. Observer effects are a big one.” I’d argue it’s not just bad datasets; it’s using datasets that are combined in a manner such that the sampling scheme does not match the question you are trying to answer. The recent spate of biodiversity change synthesis papers is an excellent example. Those datasets are amazing, but the spatial sampling scheme of the data is not designed to detect a global signal of biodiversity change. Indeed, many of the datasets – across papers – are from systems that were sampled for some unusual property, either because they were disturbed or unusually pristine. So trying to find definitive grand trends in such data is somewhat difficult given that the sampling design across the globe doesn’t match the question.

    More useful for this or other global change ecology papers would be either a) determining whether we can make conclusions about certain regions or b) asking if there is sufficient variation in predictors that we can begin to model the causes of global patterns. On a), I have a dataset I’m working with of population time series that does produce an average global trend, but the data is so biased towards four ecoregions around the planet that I have no confidence in reporting global trends. However, analyses of each ecoregion independently do show some *very* cool global variation that is incredibly informative when paired with local natural history. On b), means are often, eh, interesting, but it’s the variation that we are *really* concerned about when we want to know cause and effect (or even correlation!). The same issue comes up of course – does your analysis have sufficient variation in predictors that, even though the study wasn’t designed to assess a correlation, it has sufficient power to do so?

    This issue is one that goes well beyond just global questions. I’d argue it’s embedded in every grad student’s or postdoc’s desires when they come to work with the massive and wondrous datasets the LTER network or similar groups generate. Does post-hoc big data have the structure in it to answer questions that it was perhaps not originally designed to answer? Bigger isn’t Better if the design is Bad for the question you want to ask (it might be great for another question).

    • You raise a very good point that nearly always the data ecologists collect are spatially, temporally, taxonomically, and habitat-wise biased. We don’t have great methods to deal with this.

      • Indeed – often it’s “This is a site I like!” even. For completely non-random biased reasons. Heck, I’ll admit one of my long-term study sites is simply due to the fact that I was able to get access to it, and not some surrounding areas. Who knows how this biases my work!

        Which is why I think the ideal method is to really think in terms of drivers and covariates rather than aggregate trends that might be biased by site selections. The variation is what we often want to explain, after all. And with enough data, there is often enough variation in drivers to begin to discern a signal. Often, but not always. Any data set is buyer beware. It’s a matter of being a savvy buyer – particularly when the set is large.

      • It’s issues like this that make me think that, occasionally and in some areas, ecology would benefit from a more top-down approach where resources were allocated towards more careful, systematic sampling.

    • And your other point – about regional variation and, in general, ecology’s (science’s?) obsession with boiling everything down to a grand mean instead of reporting and understanding the variance – is something I also strongly agree with.

  3. Brian, you are making really good points. To my mind, it brings up the question of how to decide what is good enough. Your post makes me think about statistical tools to identify a threshold in the variance vs. dataset size relationship, coefficient of variation, etc. Do you know any good papers on this topic?

  4. I know I’m a bit late to the comment game, but I really enjoyed this post. Implicit in your post seems to be the notion that big data in ecology will come from combining small data sets of varying quality. While this is certainly one possible scenario, I think large data sets can come from a variety of sources. You’re certainly right that careful consideration needs to be given to how large data sets are assembled; however, I believe this varies with what a researcher’s goal is.

    If your primary goal is large scale inference, I think you have some salient points about variance and scale. It’s clearly important that features are well matched if you’re building statistical models to make inference about patterns. Predictive models are a slightly different animal though. Inferential models are designed with the goal of creating accurate estimators, effect sizes of parameters, etc.; therefore minimizing bias is important. Predictive models, on the other hand, are designed to minimize error (or MSE as it were), so an increase in bias could be worth it if there’s a greater decrease in variance. I think the short term gain from predictive machine learning models in ecology is the ability to combine disparate data sources and make decent predictions. Of course, if you’re doing your train/test split or CV on low quality data, the best you can hope for is to be good at predicting the wrong answer. However, if you can get good out-of-sample prediction on a validation set from models built with those 100s of museum records, are you really that concerned with data quality? Yes, maybe someday NEON will be at full scale or NutNet will be pumping out amazing data at huge scales, but until then, I think there’s still a lot of value in GBIF records.

    One last point: I think the persistent myth that “more data is better” is driven in part by an attitude I find prevalent when working with machine learning practitioners. In many fields data is now cheap and easy, and a sentiment I encounter is “I’ll just throw it in my random forest and if it’s predictive, great; if not, nbd”. This is obviously not the case in ecology. To begin with, common sources of “big data” are often collected all on the same scale, e.g. click stream data, Netflix movie preferences, sensor data, etc., so the problem of mixing different scales isn’t encountered that often. Ecology has a different set of challenges for generating more data. A short list of ways we can create “big data” sets off the top of my head: 1. increase temporal resolution of sampling, 2. increase spatial resolution of sampling, 3. increase sampling area. One challenge that arises from this is how all this extra data will be used. I know at NEON you can theoretically access temperature data at 1 sec resolution. How useful is having 1 sec temperature data, though, for most ecological inferences? As you rightly point out, we need to move beyond just aggregating data to larger scale collection. I think the next question is how.

    • You raise an important question of how we can productively generate big data. In my world of biodiversity, I would love to see more systematic long term monitoring data globally. Groups like GeoBON are aiming to do this, but we’re not there yet. In the meantime we’re left with a handful of datasets like BBS and USFIA. This kind of data also exists in Europe (and Australia?) but is fairly tightly held, so one solution would be more sharing. Beyond that, what we can do today is start adding lots of smaller datasets. And I guess you’ve flushed out the core concrete example in my mind behind this post – I kind of shudder at some of the recent meta-analyses that take the BBS and reduce it to one data point and then add in many other much smaller (often single location, few years), less controlled datasets as other, equal datapoints.
