Hoisted from the comments: what do ecologists have Big Data on, and what don’t they?

It’s often said that we’re in, or will soon enter, the era of Big Data. We’ll have all the data we could possibly want, and so we’ll no longer be data-limited. Instead, the rate of scientific progress will be limited by other factors, like our ability to think of good questions.

But as Jeremy Yoder and David Hembry asked in the comment thread on this old post: what sorts of Big Data do ecologists (and evolutionary biologists) actually have? We certainly don’t have Big Data on everything–whatever that might mean! Rather, we have Big Data on certain things on which technological advances have made it easy to collect data. Gene sequences, for instance. Records of where and when species have been observed, thanks to things like camera trap networks, citizen science projects and smartphone apps, and digitization of museum records. Information that can be remotely sensed, like land cover. Probably other sorts of data I’m forgetting.

What don’t we have Big Data on, even though we really wish we did? What data that we would really like to have has not gotten any easier to obtain thanks to smartphones, satellites, drones, cheap PCR, citizen science, etc.? I’d say demographic data is a big one. Data on the births and deaths (and for mobile organisms, movements) of lots of individuals. Ideally along with relevant environmental data sampled at the spatial and temporal grains and extents relevant to those individuals. And it’d sure be nice to have this information for many generations, but of course there’s no way for technology to speed that up.* And to have it for many different species, so that we could do community ecology and not just population ecology.**

Here’s another sort of Big Data we mostly don’t have: data from controlled, manipulative, randomized experiments. A lot of Big Data is observational data. Which is great. But no matter how much observational data you have, on whatever variables you have it on, inferring causality without experimental data is going to be difficult at best. The great thing about NutNet is that it’s Big Experimental Data. Not that technological advances are irrelevant for NutNet–the internet facilitates collaboration, for instance. But information technology doesn’t make it any easier to fence plots or add fertilizer or remove a species of interest or etc.

So, what do you think are the biggest and most difficult-to-close gaps in ecologists’ collective data collection efforts?

Hat tip to Peter Adler, who got me thinking about this.

*For this reason, I wonder if there will be a long-term trend for ecologists to focus more on spatial variation and less on temporal variation. Technological advances can improve the spatial extent of our sampling, but not the temporal extent.

**And as long as I’m dreaming, I’d like my free pony to be a palomino.

24 thoughts on “Hoisted from the comments: what do ecologists have Big Data on, and what don’t they?”

  1. I think you make a very good observation about it being hard to study time. Demographic info, as you note, is very hard (impossible?) to get in a systematic way with technology. Perhaps we’ll get better time data from pollen records, tree rings, etc. as these things become cheaper to do? It’s not demographic info as in birth and death rates, but might fill in some holes.

    Also a good thought about space becoming the focus rather than time. I very much think spatial ecology is a fast-growing subfield for the big-data reason. (And high performance computing that makes analyzing the data feasible.)

    I will make a little quibble with the term ‘Big Data’. At Snapshot Serengeti, we’ve got several terabytes of images and a gigabyte of raw data (text, numbers, etc.) describing those images. And that’s ‘Big Data’ from an ecology perspective. But the more general purpose term ‘Big Data’ actually refers to data that can’t be held or processed on a single machine — usually in the petabyte range. Some genomics datasets and some satellite remote sensing datasets may fit this definition, but there are no real ‘Big Data’ *ecology* datasets (that I’m aware of). I like to call our Snapshot Serengeti datasets ‘large data’.

    • Hah – our posts crossed on the wires. Leave it to two ecologists with computer science backgrounds to point out how puny our data really is!

      • @Brian and Margaret:

        Yes, I was deliberately vague in the post about what I meant by “Big Data”. I agree completely that ecology data is almost never “big”. I just wanted a punchy title for the post, figuring the context would make clear what I meant by ‘Big Data’. 🙂

  2. In my opinion, we don’t really have “big data” on anything. Our biggest projects are measured in millions of records, which is nothing compared to other fields (astronomy, meteorology, even particle physics, remote sensing).

    You do ask an interesting question about space vs time. I’m not so convinced. Our spatial coverage either has very small extent (e.g. BCI at 0.5 km2) or has very low coverage (e.g. the US Forest inventory, which has thousands of points across the US but only 0.04 ha plots). A big hole is spatially explicit data at the individual level at scales larger than 1 ha.

    Conversely, the paleo world has been using “big” data for a while and has rather extensive coverage of, say, the Quaternary (see Jack Williams’ work) or deeper time (e.g. the Paleo Database assembled by John Alroy).

    Aside from the aforementioned hole in spatial data at mesoscales, I think the biggest holes are taxonomic (long term monitoring of insects?) and geographic (an analog of the forest inventory for the neo-tropics?).

    • In my current postdoc, my group is trying to get a handle on the meso-scale (looking at phenology, in particular). It’s really hard. We have individual observations, site observations, and then we jump up to satellite data. I think in the future, as we get more and more satellites that have high-resolution images, we may cover the meso-scale that way (at least for plant phenology, perhaps being able to ID tree species, but not, say, count rabbits). And I think UAVs have a lot of potential as well (if they’re ever made legal).

      • What sort of data do you see UAVs as being able to collect, and could that data be processed by computers? I’m imagining using a UAV to, say, take a lot of pictures of the ground–but then needing humans to identify what’s in all those images. But this probably just shows my lack of knowledge and imagination about the uses to which UAVs could be put.

      • Sure, you might need people to process the images, but so what? That’s why we have Big Citizen Science. 🙂

        One of the things we want to be able to do is scale from the site level to a satellite pixel. But if you just measure at a point (or several) in the pixel, you inevitably have a lot of variability in how well it matches with the satellite pixel. With a UAV, we can actually get data over the entire satellite pixel, so we can understand how point measurements *really* scale up — how many ground observations do we need? How does that depend on how heterogeneous the pixel is? Etc. We’re working with trees right now. So with a UAV, you can take pictures of thousands of tree crowns across a wide area. We can then see how, for example, the color of each tree crown contributes to the color of the satellite pixel. You could always do this sort of thing with low-flying aircraft, but with UAVs, it’s much, much cheaper.
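
        The scaling question here can be sketched with a toy simulation (all numbers and names below are invented, not from any real campaign): treat the satellite pixel’s value as the mean over the tree crowns it contains, and compare estimates built from a few versus many ground observations.

```python
import random
import statistics

# Toy simulation (all numbers invented) of scaling ground points up to a
# satellite pixel: the pixel "sees" the mean greenness over all crowns,
# while ground sampling only sees a handful of them.
random.seed(42)

# 2,000 crowns inside one pixel, each with a simulated greenness value.
crowns = [random.gauss(0.55, 0.10) for _ in range(2000)]
pixel_value = statistics.mean(crowns)  # what the satellite "sees"

def ground_estimate(n_points):
    """Estimate the pixel value from n_points randomly sampled crowns."""
    return statistics.mean(random.sample(crowns, n_points))

few = ground_estimate(5)     # noisy estimate from 5 ground observations
many = ground_estimate(200)  # much tighter estimate from 200
```

        Repeating `ground_estimate` many times per sample size gives an empirical answer to “how many ground observations do we need?” for a pixel of a given heterogeneity.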

        Meanwhile, I was just contacted by a grad student who wants to work on the same topic I abandoned as a new grad student six years ago: tracking wildebeest across the Serengeti to understand their movement patterns and how that’s affected by environmental drivers. I gave it up because, while I figured it was technically possible to get the minimum of data I would need, it would cost upwards of $60K, and I didn’t feel like raising that as a grad student. Now with UAVs, it’s much more feasible for a grad student to take on this sort of project.

      • (Note: I arrived here by circuitous route. Ecology isn’t my field. I have experience with analysis of airborne imagery and spectrally-based materials ID.)

        Meso-scale measurements could be an interesting niche for UAVs. As you note, the overhead associated with getting them up in the air and collecting data is far lower than for satellite- and aircraft-based. They could enable high coverage rates (sq.km/day) at modest spatial resolution and modest cost. What’s your desired spatial resolution for meso-scale measurements? The limitation on what’s achievable with UAVs may be driven by the payload they can carry, i.e., how big a camera they can lift.

        On the theme of airborne remote sensing, are you familiar with NEON’s Airborne Observation Platform? (Link = http://w1pub.neoninc.org/science-design/collection-methods/airborne-remote-sensing) From the NEON website: “The NEON airborne observation platform (AOP) collects annual remote sensing data over NEON field sites using sensors mounted on an airplane. The AOP consists of a hyperspectral imaging spectrometer, a full waveform and discrete return LiDAR, and a high-resolution Red, Blue Green (RGB) camera. Data from the AOP build a robust time series of landscape-scale changes in numerous physical, biological and biochemical metrics, such as vegetation cover and density, canopy chemistry, land use and land cover, and topography, including elevation, slope and aspect.”

        Last thing, are you familiar with “object-based image analysis” as described by Thomas Blaschke et al (see, e.g., http://www.sciencedirect.com/science/article/pii/S0924271609000884)? I happened on his review article earlier this year and found it interesting. The approach looks useful for meso-scale analyses in general.

      • Thanks for your note, Chris. Yes, I’m familiar with NEON’s airborne measurements. NEON isn’t yet fully operational, so I’m not sure if they’ve done more than testing flights so far. And I am familiar with the techniques of object-based image analysis (I have a computer science degree and was particularly interested in graphics and vision applications in college), but I hadn’t thought about it in connection with UAV and other remote sensing imagery. Thanks for the paper link.

  3. PlantPopNet (http://plantpopnet.wordpress.com and previously mentioned on this blog) is taking a “Big” Experimental Data approach to demography. Still, I agree with Jeremy that there’s no clear way to automate this kind of data collection. If anything, the development of integral projection models — and concurrent improvement of computer processing power — enables demographers to use sparser datasets because these models use fewer parameters than matrix models. Is there any way my free pony could come with a device that records biomass simply by connecting the leads to the organism of interest?

    • ” Is there any way my free pony could come with a device that records biomass simply by connecting the leads to the organism of interest?”

      And we have a thread winner! +1000 Internet Points! 🙂

  4. I don’t think there is any doubt that 10-20 years from now we will be monitoring individual canopy trees using remote sensing, probably from UAVs (possibly satellites as well, although in the US NASA is prevented from collecting very high resolution imagery to make room for commercial satellites). They will use resolutions of 1 m or so; waveform LIDAR (which records every light bounce and gives a good measure of vegetation structure as well as height) and hyperspectral imaging (100+ channels of color) will undoubtedly allow us to identify individual trees, measure their size, and, in the temperate zone, ID them to species. This sensing technology exists today; some work on computer algorithms and ground-based data is still needed. These technologies can also measure other traits, like leaf nitrogen content, water content, etc.

  5. Sometimes I wonder to what extent dissecting an infinitely detailed dataset of demographic data over a large time series would actually improve our understanding of what is happening in ecological systems. I’m sure there would be lots of cool stuff to learn… But then take economists, for example, who have huge uninterrupted data series of commodity prices etc… and yet still struggle to make any consistent predictions of what will happen next. I think the take-home message, once we have the technology to produce such detailed demographic datasets, will be something we already know: in the end, the best predictor is always (t-1), and sooner or later something unpredictable catches up with us because we are dealing with huge and somewhat (very?) chaotic systems.

    Nonetheless, I will continue to salivate over the possibility of playing with: “Data on the births and deaths (and for mobile organisms, movements) of lots of individuals. Ideally along with relevant environmental data sampled at the spatial and temporal grains and extents relevant to those individuals.” While we may always have difficulty with predictions, I think there will be a whole lot more to learn from.

    • That’s a good point. Think of how the flood of gene sequence data has in many cases just confirmed (or at least failed to decisively overturn) what we already knew (e.g., that most variation in quantitative traits is underpinned by many loci of small effect).

      Re: economists having huge uninterrupted data series: well, they have them for some variables, such as stock and commodity prices. But they often don’t have good long-term data on the quantities they really wish they had data on. Not long-term enough, anyway (relative to the timescale of, say, the business cycle). And they infamously struggle to infer causality despite all that time series data and having led the way on developing lots of sophisticated statistical methods for inferring causality (there’s an entire field, econometrics, devoted to this). That’s why what excites me most are things like NutNet.

  6. Nice post Jeremy, but I would quibble that there’s a bit of a red herring here. No one in any field or industry has exactly the data they want. If you’re in marketing, you don’t really want to have to sort through petabytes of twitter feeds; you’d rather have controlled & replicated experiments on every customer attribute. But a global marketing team that chose to ignore the existence of the really crappy but readily available ‘big data’ from social media in favor of a purely controlled experiment approach might be less successful than one that did both. To me, it is the rapid emergence of non-traditional data, requiring different skills and tools to generate meaning, that has put the spotlight on this data.

    So to me, the question is not whether there’s data we’d like to have more of but don’t have (that’s true for everyone in every subject, and it always will be) but whether there is any non-traditional data in ecology that most ecologists are ignoring when in fact we might learn something from using it. I believe the answer is yes (whether it is from remote sensing, simulations, or simply sharing data), at least for some areas, but I think that’s far from obvious and would be curious to see how many ecologists feel they are or are not forced to ignore some non-traditional data sources (online databases, climate layers, other people’s data) because the barriers to working with them are too high. What do you think?

    p.s. I’m always a bit abashed when ecologists are apologetic (or boastful) about the number of bytes of data. Byte size is pretty meaningless (compressed/uncompressed/etc) and one of the most trivial ways in which data can be incompatible with current methods, which is really the point. A terabyte is big because we compare it to the disk or memory of our computer — not fitting in memory is the most trivial way an otherwise traditional data source requires some more clever engineering. Big data literature often refers to the “three V’s” (volume, velocity, variety) in which non-traditional data can break traditional approaches. The focus on breaking computer hardware (memory limits, bandwidth limits, etc) is just the lowest common denominator; ecology & evolution offer a much richer set of ways in which data can be available but require some clever engineering before it can be useful. To me, that’s the spirit of big data, whether it is kilobytes or petabytes.

    • “So to me, the question is not whether there’s data we’d like to have more of but don’t have (that’s true for everyone in every subject, and it always will be) but whether there is any non-traditional data in ecology that most ecologists are ignoring when in fact we might learn something from using it.”

      Good question. Hard to answer. It’s like various other questions we’ve asked on this blog over the years. Do ecologists pay too much attention to model systems, or too little? Is there too much negativity and criticism in how we evaluate science, or too little? Do we teach too much [insert name of subject here], or too little?

      My own feeling is that, in general, science never needs much of a push to move down paths of least resistance, or to keep moving in the direction it’s currently moving. I think that taking up new technologies, data sharing, non-traditional data, etc. is a path of least resistance, and that because that’s the way things are starting to go that’s the way they’ll keep going. So even though non-traditional data sources and technologies may not yet be something that most ecologists take advantage of, I think that’s going to change. And I don’t think there’s any risk that gentle pushback/concern trolling from the likes of me will prevent those changes from happening.

      For the same reason, I tend to worry more about zombie ideas and bandwagons in ecology than I worry about Buddy Holly ideas.

      As an aside, one thing I’m curious about is how people’s perceptions are shaped by the current state of the field, vs. the way the field is changing. The current value of the state variable vs. the first derivative. For instance, if few people in the field are currently doing X, but you’re one of those people and you’d like to see lots more of X, then you might well feel like X and the people who do it are undervalued. You might worry about the future of X, and about your own career prospects. But on the other hand, someone who focuses on how the number of people doing X is increasing and who sees other signs that X is the next big thing probably will feel that X and the people who do it are highly valued and that there’s no need to worry about the future of X or the career prospects of the people who do X.

      • Interesting! Yup, that’s a good way to look at it. I agree with your take that non-traditional data has positive invasion fitness whether or not it gets a boost (perhaps analogous to non-traditional statistics?).

        Perhaps the bigger concern is that there’s always more ways to go wrong when moving into new spaces, but I’m inclined to the optimistic view that mistakes are part of learning (unless they become entrenched as zombie ideas?).

        I’m not too worried about ‘Buddy Holly’ ideas per se, since I think it’s science culture and structure, more than the ideas themselves, that need attention (something you’ve written on rather eloquently with regard to the ‘let many flowers bloom’ approach to funding, etc.).

        Interesting question about perceptions vs participation. I think some of that does happen and it is desirable — I love the diversity of questions in ecology but also think we move forward most effectively (if only in finding that it is a dead end) when an idea really gains the participation and scrutiny of a big piece of the community. As far as what this means for careers, I think it goes both ways, some people benefiting disproportionately and some the opposite. But if everyone worked on their own subdivided fiefdoms all the time, things might be more stable but less interesting.

      • “I love the diversity of questions in ecology but also think we move forward most effectively (if only in finding that it is a dead end) when an idea really gains the participation and scrutiny of a big piece of the community. ”

        Interesting comment, you could well be right. Way back I linked to a related suggestion from Mike the Mad Biologist, that science needs to focus its efforts and funding in order to make serious progress and avoid chasing noise arising from lots of underpowered, small-sample-size studies.

        But of course, letting a thousand flowers bloom is in direct opposition to focusing our efforts and getting a critical mass of people working on the same thing. Well, ok, that’s not entirely true, because for instance you could have lots of people working on the same question but using a diversity of approaches. But there is a real tension there, I think.

        “Perhaps the bigger concern is that there’s always more ways to go wrong when moving into new spaces, but I’m inclined to the optimistic view that mistakes are part of learning (unless they become entrenched as zombie ideas?). ”

        The “mistakes are part of learning” bit is, I think, one of the strongest arguments for the importance of criticism in science. You can’t learn from your mistakes unless you realize you’re making them.

        I can feel a post or two coming on in the new year about the circumstances that favor optimism vs. pessimism, in science and in life. And whether there are any circumstances that favor either over an accurate assessment of the relevant probabilities and payoffs. And maybe a post looking at whether there are analogies between progress in science, and progress in technology and business (two other areas in which we often have to make bets about what to try, when to abandon a once-promising direction, etc.)

    • Carl – I agree with your point about unlocking data we already have.

      I can’t entirely agree about the big data. Big data IS defined by computational challenges dealing with quantity of data. And ecology doesn’t have it in either the Volume or the Velocity. I dealt with way more records in the business world 20 years ago (which is saying a lot given exponential rates of growth in computing power).

      Ecology does have challenges in Variety (data heterogeneity in spades). I find this an interesting challenge but BIG is a misnomer for it. I’ve always liked the ecoinformatics label (managing data to become information) as a description of the software integration and analysis challenges you and I both work on better than big data.

      To me, “Big data” is a bit of a label justifying more hardware, which is NOT what ecology needs but is what I see happen all too often.

      • Brian — I agree entirely, big data is a bad name. I think we’re just disagreeing about semantics here, but I would argue that Ecology does have a Volume and a Velocity problem, it just so happens that the rate limiting things aren’t immediately related to computer hardware. That’s true of a lot of so-called “Big Data” problems — as you know, lots of interesting things are being done in statistics research to find algorithms that scale better because these problems won’t just be solved by hardware and Moore’s law.

        I think the variety of ecological data rightly gets a lot of attention, but I think the idea of unlocking data we already have is fundamentally an issue of all three in ecology. Consider the limit where all data sets are small enough that everything can be done by hand easily — then variety isn’t such a big issue — it doesn’t matter if everyone uses different units and different measurement protocols etc. It is only when the volume is big enough and we need some level of automation that variety suddenly becomes a big issue, right?

        You don’t need petabytes of data before doing an ANOVA by hand seems like a bad idea, or before you can check all your species names by eye, or a million other things. Because the variety is so high, volume and velocity become a problem much sooner. Things like standardized units, data formats, standard software implementations of methods, all help reduce variety where it’s not particularly helpful and let us deal with larger volumes of data faster.
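
        As a concrete (and entirely hypothetical) illustration of how standardization tames variety: biomass records pooled from studies using different units can’t be combined until each record is converted, which is trivial per record but only practical at volume with automation.

```python
# Hypothetical sketch: pooling biomass records that arrive in mixed units.
# Converting one record is trivial; doing it reliably for millions of
# records is exactly where volume turns "variety" into a real problem.
UNIT_TO_GRAMS = {"g": 1.0, "kg": 1000.0, "mg": 0.001}

def to_grams(value, unit):
    """Convert a biomass measurement to grams, failing loudly on unknown units."""
    if unit not in UNIT_TO_GRAMS:
        raise ValueError(f"unknown unit: {unit!r}")
    return value * UNIT_TO_GRAMS[unit]

# Records pooled from three hypothetical studies, each with its own convention.
records = [(1.2, "kg"), (350.0, "g"), (900000.0, "mg")]
standardized = [to_grams(v, u) for v, u in records]
total_g = sum(standardized)  # about 1200 + 350 + 900 = 2450 grams
```

        Failing loudly on unknown units, rather than silently skipping records, is the design choice that keeps errors from hiding in large datasets.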

        I agree 100% that we don’t need more hardware, that hardware isn’t where our data analysis bottleneck is. I think industry focuses on the hardware limits primarily because what they are doing is already simple enough and scalable enough that the first place size creates a problem is in memory, or bandwidth, etc. But I do think there’s value in identifying and remedying the bottlenecks we have in ecology.

        Curious what you think of the Science piece by Fox & Hendler on this, http://www.sciencemag.org/content/331/6018/705.full They argue that visualization is one such bottleneck — that as our data have gotten more complicated, visualization has taken up an ever-increasing amount of the research effort to the point where it is often only the product rather than an exploratory tool. We don’t make enough exploratory but ultimately disposable visualizations because they take such significant effort, and could be greatly reduced by more clever and standardized ways of going about these things, instead of leaving it as an untaught black art. I’m not sure if visualization is as much of a hold-up in ecology, but I think this captures the general spirit that increasing volume of data can make a part of the analysis a bottleneck in ways that have very little to do with hardware.

      • Carl – I agree with pretty much everything. 1 million records is still a challenge for ANOVA or PCA (and still often impossible to run something like mixed models). And you touched on one of my pet themes – automated checks for and improvements to data quality – if we’re serious about getting information out of large datasets we have to be serious about that.
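
        The automated quality checks mentioned here can start as simply as validating species names against a reference list (a hypothetical sketch; all names and data below are made up):

```python
# Hypothetical sketch of an automated quality check: flag records whose
# species name isn't in a reference list, instead of checking rows by eye.
VALID_SPECIES = {"Quercus rubra", "Acer saccharum", "Pinus strobus"}

def check_records(records):
    """Split records into (clean, flagged) using the reference species list."""
    clean, flagged = [], []
    for row_id, species in records:
        (clean if species in VALID_SPECIES else flagged).append((row_id, species))
    return clean, flagged

records = [
    (1, "Quercus rubra"),
    (2, "Querqus rubra"),   # a misspelling easy to catch in 10 rows by eye,
    (3, "Acer saccharum"),  # impossible to catch in 10 million
]
clean, flagged = check_records(records)
```

        A real pipeline would fuzzy-match flagged names against a taxonomic backbone rather than just rejecting them, but the principle is the same: the checks have to scale with the data.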

        As for visualization – I’ve never been a huge fan of the really fancy visualizations (scrolling 3-D and the like) as adding much value, at least in my own experience doing data analysis. I also don’t like face plots and other such multivariate plotting techniques that some people love – it may just be the way my brain is wired. But I am increasingly running into cases where, say, a student will have 2,000 points and even a simple scatter plot can bog down on that, which means a lot less basic visual understanding of the data happens (plotting pairs of variables against each other, doing residual diagnostics, etc.). Which I think is your point and the point of the Fox and Hendler article. GIS is an interesting case study. ArcGIS easily visualizes data with millions of polygons or pixels (I’ve never studied the technology – presumably it’s some sort of sampling) but open source tools (which are what I use for GIS, by the way) don’t scale very well in visualization, in my experience.
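
        One plain-Python workaround for scatter plots that bog down (a toy sketch, not tied to any particular plotting library): bin the points into a coarse grid and plot counts per cell, a poor man’s hexbin.

```python
import random

# Toy sketch: reduce 2,000 raw points to a 10 x 10 grid of counts, which
# any plotting tool can draw instantly as a heatmap.
random.seed(1)
points = [(random.random(), random.random()) for _ in range(2000)]

def bin_points(pts, n_bins=10):
    """Count points per cell of an n_bins x n_bins grid over [0, 1) x [0, 1)."""
    counts = {}
    for x, y in pts:
        cell = (min(int(x * n_bins), n_bins - 1), min(int(y * n_bins), n_bins - 1))
        counts[cell] = counts.get(cell, 0) + 1
    return counts

counts = bin_points(points)  # at most 100 cells instead of 2,000 points
```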

        I think my biggest thing is that whether you call it big data or ecoinformatics, this is something that people with ecological knowledge need to get into and be a part of – it should not just be a front for computer scientists to buy ever bigger hardware and ignore all the integration, data quality and analysis issues that really advance the science. I’ve seen too many projects where things go that way.

        There are a ton of interesting and important challenges in integration, data quality and analysis in the data ecology has now that it didn’t have 20 years ago (10,000 to 100,000,000 records), whatever you want to call that sized data. The old ANOVA statistics won’t work on that data (at least not easily). But the really scaled big data stuff (Hadoop on large clusters) that you hear about isn’t really what we need in ecology either. We ultimately need to carve our own path (something which rOpenSci is doing a great job of, by the way!).

  7. “Data on the births and deaths (and for mobile organisms, movements) of lots of individuals. Ideally along with relevant environmental data sampled at the spatial and temporal grains and extents relevant to those individuals. And it’d sure be nice to have this information for many generations, but of course there’s no way for technology to speed that up.”

    This we do have at least one example of: PTAGIS in the Columbia River basin. Over 12 million salmonids have been tagged and released into the Columbia River since 1987. The outward and inward migration and survival parameters of these fish can be tracked at mainstem interrogation sites as well as tributary recapture sites; here is a map: http://www.ptagis.org/sites/map-of-interrogation-sites. Considering a generation of Coho salmon is generally only three years (individuals of this and other Pacific salmon species are plastic in the amount of time they spend in freshwater and in the ocean), that covers several generations.
