Citizen science for creating large, usable ecology datasets (guest post)

Note from Jeremy: This is a guest post from one of our most frequent commenters, Margaret Kosmala. She's a PhD student in Ecology, Evolution, and Behavior at the University of Minnesota, finishing this summer. Before that she did her undergraduate degree in computer science. I invited her to do a guest post on anything she wanted. She decided to do a post on citizen science in ecology, a topic about which she knows a lot (she runs a citizen science project in the Serengeti), and which I think will be of interest to many of you.

By the way, Margaret is currently looking for a postdoc, preferably quantitatively oriented. So if you know of any, she’d love to hear from you!

*************************************

“What would you like for your birthday?” my mother asked over the phone. I hesitated, imagining her eyes rolling five hundred miles away, then dived ahead anyway. “Well,” I said. “There’s this thing called American Gut. It’s a big science project that’s trying to understand all the microbes in people’s stomachs. They’re both raising money for their project and getting lots of samples by letting anyone participate. I think it would be fun to get a kit for my birthday.”

I find American Gut fascinating. It's the latest take on citizen science data gathering – this time pairing the data gathering with fundraising. It's been extremely successful on the fundraising side, and it looks poised to be successful on the data side, too.

Citizen science data collection has a long history in ecology and affiliated fields. The Audubon Society's Christmas Bird Count is perhaps the best known. More recent projects like Project BudBurst, Nature's Notebook, and eBird are, like American Gut, leveraging the Internet and mobile devices to make data recording fast and easy.

Automated data collection is likewise being facilitated by technology miniaturization and cost-reduction. But automated collecting on a large scale leads to one of the great challenges of Big Data – how do you make sense of it? One answer is to get all those citizen scientists involved, not in data collection this time, but in data analysis.

This past December, I embarked on a citizen science project called Snapshot Serengeti, along with several colleagues. To study carnivore dynamics in the Serengeti, Ali Swanson has set up 225 continuously running camera traps there, which take about one million pictures per year. That's a lot of images. And, as she and I quickly discovered, images do not equal data.

Our solution to the ‘images ≠ data’ problem was to launch a website called Snapshot Serengeti, which we did in December in partnership with the Zooniverse, a citizen science portal. On the site, we ask volunteers to record the species of animals in each of our images. The response to Snapshot Serengeti was astounding. Within three weeks, over 15,000 people had worked their way through our first year and a half of images.

Of course, while citizen science has a long history in ecology, so too do citizen science skeptics.

The main criticism of citizen science data is that it’s not reliable, that its quality and integrity are unknown. And this is a valid concern. Experts believe themselves to be better at what they do than non-experts. As eloquently stated in the review of our most recent NSF pre-proposal:

The use of citizen-scientists to provide meaningful and accurate data will depend on their training. It’s unclear how quality of data generated from the citizen-identified animals will be ensured.

But we've found that in Snapshot Serengeti, our volunteers are individually very good at identifying African mammals – even though we don't provide any formal training – and when we aggregate their answers, they're even better. To ensure good data quality, we show each image to at least ten different people. (And more, if those ten don't all agree.)
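To make the aggregation concrete, here is a minimal sketch of the plurality-vote idea in Python. (This is just an illustration, not the actual Zooniverse/Snapshot Serengeti code; the function, field, and image names are made up.)

```python
from collections import Counter

def aggregate(classifications):
    """classifications: (image_id, species) pairs, one per volunteer answer.
    Returns {image_id: (winning_species, votes_for_winner, total_votes)}."""
    votes = {}
    for image_id, species in classifications:
        votes.setdefault(image_id, Counter())[species] += 1
    results = {}
    for image_id, tally in votes.items():
        top_species, top_votes = tally.most_common(1)[0]
        results[image_id] = (top_species, top_votes, sum(tally.values()))
    return results

# Ten volunteers classify one (hypothetical) image; nine say zebra.
example = [("IMG_0001", "zebra")] * 9 + [("IMG_0001", "wildebeest")]
print(aggregate(example))  # {'IMG_0001': ('zebra', 9, 10)}
```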

[Figure: pie chart of volunteer agreement (out of 10) for Snapshot Serengeti images]

This pie chart shows all our images, divided up by how much agreement there was by the volunteers. If we imagine that our volunteers are voting on the correct species identification for each image, then the numbers on the pie chart indicate the tally for the highest-scoring species for each image. A ‘10’ means that ten out of ten people agreed on the species. A ‘9’ means that nine out of ten agreed. And so forth.

As you can see, a majority – 57% – of images are agreed on unanimously. For 87% of our images, we have at least 7 of 10 people agreeing. And we have majority agreement for 96% of the images. Remember that we've got over 45 species we're asking people to decide among, that sometimes people just click on the wrong thing by accident (experts included), and that (most of) our images are decidedly not National Geographic quality:

[Figure: a typical camera trap image (Thomson's gazelle)]

While agreement is nice, being right is better. I’ve spot-checked dozens of images with low agreement, and the vote-winner has always been the correct identification. To be more rigorous, I’m currently working on an analysis that compares volunteer identifications with several thousand expert identifications. That will be the real standard of whether the volunteer data is high quality, and I’m fairly confident it will be. There’s no reason that other large citizen science datasets can’t be similarly compared to expert data to derive a quantitative analysis of citizen science data quality.
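As a sketch of what that expert comparison looks like, assume (hypothetically) that the volunteer consensus and the expert identifications are each stored as a simple image-to-species mapping; then the headline accuracy number is just the fraction of expert-identified images where the consensus agrees:

```python
def consensus_accuracy(consensus_labels, expert_labels):
    """Fraction of expert-identified images where the volunteer consensus agrees."""
    shared = set(consensus_labels) & set(expert_labels)
    correct = sum(consensus_labels[i] == expert_labels[i] for i in shared)
    return correct / len(shared) if shared else float("nan")

# Made-up example data: two images, one agreement, one disagreement.
consensus = {"IMG_0001": "zebra", "IMG_0002": "thomsons_gazelle"}
expert = {"IMG_0001": "zebra", "IMG_0002": "grants_gazelle"}
print(consensus_accuracy(consensus, expert))  # 0.5
```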

Much of ecology fundamentally involves looking at things, and it turns out that there are a lot of non-scientists out there who want to help look at things for science. And while it might seem that such engagement would be limited to charismatic megafauna, a sister project called Seafloor Explorer is successfully using a similar approach for scallops, and two more projects for plankton and kelp are in the pipeline. (And astronomy buffs get excited just looking at line graphs…)

Just as with personalized medicine and 3D printing, we’re only just at the doorway to possibility in the citizen science arena. Combining automated data collection with citizen scientist analysis, as we’ve done with Snapshot Serengeti, is one way to affordably create large spatial datasets that can yield new ecological insights.

Likewise, asking people to contribute money to participate in a scientific study is a novel way to support building large and unique datasets on human health and diet. I just used my American Gut kit and mailed off my sample, excited to be a data point in someone else’s dataset.

23 thoughts on “Citizen science for creating large, usable ecology datasets (guest post)”

  1. Great post! Definitely a coming wave.

    I wonder if you have thoughts (or know literature) on the different goals of citizen science. To my mind there is a tradeoff between outreach/get people involved/education at one end and getting high quality data at the other end. The high quality data end typically involves a top down imposed sampling design with a limited subset of amateur but highly knowledgeable people doing it. The North American Breeding Bird Survey might be the ultimate on this front. At the other end are efforts like ebird which involve anybody reporting at any time. I've tried to analyze a regional dataset not unlike ebird and it was rather frustrating. But it was great at getting people engaged.

    Similarly, any thoughts (or literature) on which problems/organisms are most amenable to citizen science? Clearly you wouldn't have had as good results if citizens were IDing pictures of, say, flies. I was talking with a group the other day and it seemed to emerge that situations where the data is in some sense captured and handed over to the scientists (e.g. camera images or pitfall trap contents) rather than just reported (e.g. ebird) are better because they let the scientists go back and check doubtful data points, do repeatability analysis, etc. (as you've done). It also seems to me that the phenology projects, where they target a few well-known organisms and give careful definitions of the phenological transitions they're recording, are also a good fit.

    Anyway, a few thoughts mixed with a few questions. As you say it is early days. Maybe 10 years from now you'll write the definitive review paper on which projects are well suited to citizen science and which ones are poor fits! Great post.

    • “a tradeoff between outreach/get people involved/education at one end and getting high quality data at the other end”

      I want to argue that this may have been the case in the past, but I think that it's no longer the case. The reason is that it's relatively cheap and easy to teach people now, so that you can have many, many knowledgeable people on a project. However, sampling design is really key; having focused research questions and a good design leads to data that are easier to analyze. But this is the case whether you're doing citizen science or “regular” science. (We spent a *lot* of time and effort on the design of Snapshot Serengeti.) I don't think that people are necessarily more attracted to open-ended data gathering than data gathering that has specific rules and requirements, but it may be the case that people participate for different reasons. Based on surveys of volunteers, it seems that a driving reason for people to participate in Zooniverse projects (all of which are designed with research in mind) is “to contribute to science.” Contributors to eBird, for example, are likely motivated by other things.

      “which problems/organisms are most amenable to citizen science”

      For people interested in contributing to science, I think very clear and easy-to-grasp problems are the most amenable. With a very compelling mission statement, it doesn’t matter if your organisms are boring/invisible; American Gut is a great example. So too is Cell Slider (cellslider.net), which asks people to help with cancer research.

      If your questions are complex (say, “dynamics of predators and herbivores on a spatially and temporally heterogeneous landscape”), then it probably helps to have charismatic organisms. 🙂 But it may not be necessary.

      “you wouldn't have had as good results if citizens were IDing pictures of, say, flies”

      I’m not sure this statement is true; if you needed to sort flies based on morphology that could be seen in an image, I think you could still get plenty of people involved and get good data. Right now there’s a Zooniverse project called Notes from Nature (notesfromnature.org) that’s just asking people to transcribe museum specimen labels. That’s it: just type in what’s written on an image of a label. They launched last month and already have had ~140,000 transcriptions done. So I don’t think that “boring” projects can’t be done with citizen science, although in such cases publicity may be particularly important.

      One area where it seems citizen science doesn’t work super well is asking volunteers to provide information based on hearing. Zooniverse tried two projects — one on whales (whale.fm) and one on bats (batdetective.org) — that asked volunteers to listen to sounds and record information about those sounds. Those projects didn’t seem to engage and keep volunteers very well. It may be that listening is harder and/or slower than looking and that turns people off.

      “data is in some sense captured”

      Yes! I agree with this. Birds, of course, make a difficult case, because they're hard to capture. But in other cases, I think there's a strong case to be made for having vouchers of citizen science data — photos work great in many (but not all) cases.

      • I think there are definitely some organizations that still struggle with the goals trade off, and it really depends on who is designing the project. For my masters, I worked with a nonprofit that had about five years of citizen science data just piling up without analysis. When I began sifting through the data, it became clear that the volunteers struggled with plant identification and providing accurate descriptions of their geographical location (very few citizen scientists were hiking with GPS units). This was a project that had been conceived by a research department, but most of the design & implementation had been overseen by the education department. When scientists create citizen science projects the goals (and especially the follow-through) can be very different from projects created by schools, conservation organizations, or other stakeholders.

      • We definitely see a bias towards rare species in our data. People really want to believe they’re seeing something rare and unusual.

  2. This is really awesome!

    I was recently at a disease conference, and several of the talks were about big data. Citizens were involved in the projects, but instead of collecting or analyzing the data, they volunteered their data (e.g., questionnaires about household size, flu events per household, etc.).

    During the discussion session, someone asked the speakers what we should do with a hypothetical 40 million dollar program from NIH focused on big data projects. How could we use that money to get more big datasets and to better utilize big data? I can’t remember all of the answers, but there were things like developing new analysis tools, and maybe providing benefits to companies (e.g., cell phone companies) for sharing existing big datasets. I can’t remember any specific discussion of citizen science, but I thought about it later, since most of the projects required citizens to volunteer their information.

    So, if NIH or NSF had a new program supporting big data projects, would you suggest using some/all of that money on the citizen science component of big data? And how would you use it? Would you divide it up to support grants involving citizen science? Is there something national that we need in order to engage more citizens? More websites like Zooniverse? Funding for such websites so that people are compensated for their time, instead of just volunteering? Advertising big data projects to citizens? Training scientists to better develop citizen science projects?

    • Hmm, really good questions. So my impression is that Big Data research is currently being tackled mainly by those scientists who already have large hoses-worth of data streaming in. Genomicists, for example. Telecommunication scientists. Meteorologists. What do they all have in common? They have technology that fairly automatically senses the things they want to learn about and converts that information into data.

      If ecologists want a slice of the Big Data pie, they're going to have to figure out how to turn what are typically visual observations into data rapidly and reliably. (I was going to expand on this in my post, actually, but preferred to keep it short.) There are two ways of gathering data — by people or by machines — and there are two ways of processing that raw information into data — by people or by machines. Ideally, we'd have machines doing everything — say, taking pictures in the Serengeti and then turning a JPG of a zebra into the word “zebra” that then gets stored in a database. But computer algorithms aren't good enough yet to do all the visual pattern matching we, as ecologists, might want to do. (And that's a bit of an understatement.) So we still need people to process much ecological information.

      So if I were to recommend funding aims for Big Data as an ecologist, I would focus money on ways to (1) gather and (2) process ecological data using both (A) automated techniques and (B) large amounts of human power. Citizen science falls neatly into 1B and 2B.

      (It sounds like the NIH/disease researchers you described have a 1B problem. They need large numbers of people involved to get the data they need. Or else the data already exists and they just need access to it, which is another problem entirely.)

      I would invest in developing technologies that can automate data gathering (1A). For example, I'm excited about the use of large-scale camera trap networks, as well as video (e.g. the Woods Hole HabCam, which “flies” through the water and takes a running video of the seafloor: habcam.whoi.edu; or the up-and-coming unmanned aerial vehicles with video cameras attached).

      And I would invest in developing technologies that can automatically process gathered data (2A). For example, I don't want to put taxonomists out of a job or anything, but I'm itching for high-throughput environmental gene sequencing. Then instead of sampling an insect community once or twice per summer, say, and spending all winter figuring out what types of insects I've got, I could sample weekly and understand seasonal trends in insect communities better. One Zooniverse project did something clever: once citizen scientists had identified enough pictures (of outer space), they used those pictures to train computer algorithms to do just as well as the citizen scientists. Then they retired their citizen science project and now process their images on a computer!
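      (As a rough sketch of that bootstrapping idea, and emphatically not the Zooniverse pipeline: once volunteer consensus labels exist, they can serve as training data for an off-the-shelf classifier. Everything below is synthetic placeholder data, with image feature extraction waved away.)

      ```python
      import numpy as np
      from sklearn.ensemble import RandomForestClassifier
      from sklearn.model_selection import train_test_split

      rng = np.random.default_rng(0)
      X = rng.normal(size=(1000, 32))    # stand-in for per-image feature vectors
      y = rng.integers(0, 5, size=1000)  # stand-in for volunteer consensus labels (5 classes)

      # Train on volunteer-labeled images, hold some out to check the model.
      X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
      model = RandomForestClassifier(n_estimators=100, random_state=0)
      model.fit(X_train, y_train)
      print("held-out accuracy:", model.score(X_test, y_test))
      # With real image features and real labels, this is where you'd decide
      # whether the automated classifier is good enough to retire the volunteers.
      ```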

      As for citizen science itself, we could never have put together Snapshot Serengeti without the professional design and development team at Zooniverse; so yes, permanent funding for a (relatively small) group of designers and developers to help scientists who want to produce big data sets to put together successful citizen science projects is key. There’s also a great opportunity to identify similar “types” of data and build plug-in platforms. For example, the code for Snapshot Serengeti is open source, and in theory, other camera trap researchers could use it to build their own citizen science projects; but in reality, ecologists don’t have the technology background to easily do that. However, a platform could be built where ecologists would just specify certain metadata and then upload camera trap images and quickly have a functioning citizen science website.
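      (To illustrate what I mean by “just specify certain metadata”: a hypothetical project spec for such a platform might be little more than the following. None of these field names come from the Zooniverse; they're made up for the example.)

      ```python
      # Purely hypothetical: what an ecologist might hand to a plug-in
      # camera-trap platform to get a classification website generated.
      project_spec = {
          "project_name": "My Camera Trap Survey",
          "question": "Which species do you see in this image?",
          "species_list": ["zebra", "wildebeest", "thomsons gazelle", "lion", "nothing here"],
          "classifications_per_image": 10,   # minimum independent volunteers per image
          "retire_if_first_n_agree": 5,      # optional early-retirement rule
          "image_source": "uploads/my-camera-trap-images/",
      }
      ```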

      I would also recommend some modest funding for research into citizen science best practices, including studying the best way to interact with volunteers to produce high-quality data.

      At this point, I’m not concerned with volunteer saturation. Zooniverse finds that when they launch a new project, their other projects see an *increase* in traffic and not a decrease. I think there are plenty more people out there to be engaged, and I think more people are continually being engaged. So I don’t think there’s necessarily a big need for funding to “get more people involved.” A lot of people are just genuinely curious about their world and interested in being involved in something bigger than themselves. If anything, I’d recommend increased funding for the development of Big Data visualization tools so that it’s easier for _volunteers_ to explore the data they’re generating themselves.

  3. Thank you so much for the post! I really enjoyed it, especially learning about the steps you take to ensure reliable data. I wonder if we could come up with a citizen science approach to getting our zooplankton samples counted. 😉

  4. Really interesting Jeremy, thanks!

    Given your interest, I think that you (and the other readers here) would be really interested in some recent research that I have come across that theorizes about crowds and such similar phenomena.

    It’s called “The Theory of Crowd Capital” and you can download it here if you’re interested: http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2193115

    In my view it provides a powerful yet simple model that gets to the heart of the matter. Enjoy!

  5. I’m a huge fan of Zooniverse and have tried my hand at several of their projects. Being that I’m a science fan, first and foremost, I don’t see that plankton would be a problem. I will definitely be checking it out.

  6. Citizen science is pretty cool! I used a version of it for my undergraduate thesis while studying the goldenrod gall fly system, and with the help of different undergraduate classes, we can continue gathering data for the next few decades (hopefully)! 🙂

  7. Thanks for the great post. As a master’s student working with and studying citizen science I see a lot of interesting points here. BioDiverse Perspectives has been discussing citizen science lately too: http://www.biodiverseperspectives.com/?s=citizen+science&submit=Search

    One of the cool things about Snapshot Serengeti is that it harnesses internet addicts’ short attention spans via quick rewards: I know I’ve dedicated long stretches of time clicking toward the next photo, always eager to see what it will be and hoping to discover something “rare” or a unique/funny behavior. That model is poised to go viral.

    I think it's gaining increased acceptance within the scientific community. The tension between educational and scientific goals is definitely still present; programs that fall under the umbrella of citizen science seem to exist at both ends of the spectrum. There is probably room for both; the area for growth is more a matter of whether or not individual programs are intentional in design and methodology based on their goals. We are just at the cusp: the field is becoming more self-aware, new partnerships are being formed, and best management practices are being established to help programs meet both educational and scientific goals.

