Note from Jeremy: This is a guest post from one of our most frequent commenters, Margaret Kosmala. She’s a PhD student in Ecology, Evolution, and Behavior at the University of Minnesota, finishing this summer. Before that she did her undergraduate degree in computer science. I invited her to do a guest post on anything she wanted. She decided to do a post on citizen science in ecology, a topic about which she knows a lot (she runs a citizen science project in the Serengeti), and which I think will be of interest to many of you.
By the way, Margaret is currently looking for a postdoc, preferably quantitatively oriented. So if you know of any, she’d love to hear from you!
“What would you like for your birthday?” my mother asked over the phone. I hesitated, imagining her eyes rolling five hundred miles away, then dived ahead anyway. “Well,” I said. “There’s this thing called American Gut. It’s a big science project that’s trying to understand all the microbes in people’s stomachs. They’re both raising money for their project and getting lots of samples by letting anyone participate. I think it would be fun to get a kit for my birthday.”
I find American Gut fascinating. It’s the latest take on citizen science data gathering – this time pairing the data gathering with fundraising. It’s been extremely successful on the fundraising side, and it looks positioned to be successful on the data side, too.
Citizen science data collection has a long history in ecology and affiliated fields. Audubon’s Christmas Bird Count is perhaps the best known. More recent projects like Project BudBurst, Nature’s Notebook, and eBird are, like American Gut, leveraging the Internet and mobile devices to make data recording fast and easy.
Automated data collection is likewise being facilitated by technology miniaturization and cost-reduction. But automated collecting on a large scale leads to one of the great challenges of Big Data – how do you make sense of it? One answer is to get all those citizen scientists involved, not in data collection this time, but in data analysis.
This past December, I embarked on a citizen science project called Snapshot Serengeti, along with several colleagues. To study carnivore dynamics in the Serengeti, Ali Swanson has set up 225 continuously running camera traps there, which take about one million pictures per year. That’s a lot of images. And, as she and I quickly discovered, images do not equal data.
Our solution to the ‘images ≠ data’ problem was to launch the Snapshot Serengeti website in partnership with the Zooniverse, a citizen science portal. On the site, we ask volunteers to record the species of animals in each of our images. The response to Snapshot Serengeti was astounding. Within three weeks, over 15,000 people had worked their way through our first year and a half of images.
Of course, while citizen science has a long history in ecology, so too do citizen science skeptics.
The main criticism of citizen science data is that it’s not reliable, that its quality and integrity are unknown. And this is a valid concern. Experts believe themselves to be better at what they do than non-experts. As eloquently stated in the review of our most recent NSF pre-proposal:
The use of citizen-scientists to provide meaningful and accurate data will depend on their training. It’s unclear how quality of data generated from the citizen-identified animals will be ensured.
But we’ve found that in Snapshot Serengeti, our volunteers are individually very good at identifying African mammals – even though we don’t provide any formal training – and when we aggregate their answers, the results are even better. To ensure good data quality, we show each image to at least ten different people. (And more, if those ten don’t all agree.)
This pie chart shows all our images, divided up by how much the volunteers agreed. If we imagine that our volunteers are voting on the correct species identification for each image, then the numbers on the pie chart indicate the tally for the highest-scoring species for each image. A ‘10’ means that ten out of ten people agreed on the species. A ‘9’ means that nine out of ten agreed. And so forth.
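The voting scheme just described amounts to a plurality vote over each image’s classifications. As a minimal sketch (the function name and the example classifications here are hypothetical, not Snapshot Serengeti’s actual pipeline):

```python
from collections import Counter

def aggregate_votes(classifications):
    """Tally species votes for one image.

    Returns the winning species, its vote count, and the total
    number of classifications (e.g. 9 of 10 agreeing).
    """
    tally = Counter(classifications)
    species, votes = tally.most_common(1)[0]
    return species, votes, len(classifications)

# Hypothetical classifications for a single image:
votes = ["wildebeest"] * 9 + ["buffalo"]
species, agree, total = aggregate_votes(votes)
# species == "wildebeest", with agreement of 9 out of 10
```

In practice a project would also flag images where the winner’s share falls below some threshold, and route those to more volunteers or to an expert.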
As you can see, a majority of images (57%) are agreed on unanimously. For 87% of our images, at least 7 of 10 people agree, and we have majority agreement for 96% of the images. Keep in mind that we ask people to decide among more than 45 species, that anyone (experts included) sometimes clicks the wrong thing by accident, and that most of our images are decidedly not National Geographic quality:
While agreement is nice, being right is better. I’ve spot-checked dozens of images with low agreement, and the vote-winner has always been the correct identification. To be more rigorous, I’m currently working on an analysis that compares volunteer identifications with several thousand expert identifications. That will be the real standard of whether the volunteer data is high quality, and I’m fairly confident it will be. There’s no reason that other large citizen science datasets can’t be similarly compared to expert data to derive a quantitative analysis of citizen science data quality.
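Such a comparison against expert identifications is straightforward to compute: for each expert-labeled image, check whether the volunteer consensus matches. A minimal sketch, with entirely hypothetical image IDs and labels (the real analysis would run over thousands of expert-identified images):

```python
def volunteer_accuracy(consensus, expert):
    """Fraction of expert-labeled images where the volunteer
    consensus identification matches the expert's label."""
    matches = sum(1 for img in expert if consensus.get(img) == expert[img])
    return matches / len(expert)

# Hypothetical consensus and expert labels for three images:
consensus = {"IMG_001": "zebra", "IMG_002": "gazelle", "IMG_003": "lion"}
expert    = {"IMG_001": "zebra", "IMG_002": "gazelle", "IMG_003": "cheetah"}

accuracy = volunteer_accuracy(consensus, expert)
# 2 of the 3 expert-labeled images match the consensus
```

Beyond a single accuracy number, the same data supports a per-species confusion matrix, which would show which species volunteers tend to mix up.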
Much of ecology fundamentally involves looking at things, and it turns out that there are a lot of non-scientists out there who want to help look at things for science. And while it might seem that such engagement would be limited to charismatic megafauna, a sister project called Seafloor Explorer is successfully using a similar approach for scallops, and two more projects for plankton and kelp are in the pipeline. (And astronomy buffs get excited just looking at line graphs…)
Just as with personalized medicine and 3D printing, we’re only at the threshold of what’s possible with citizen science. Combining automated data collection with citizen scientist analysis, as we’ve done with Snapshot Serengeti, is one way to affordably create large spatial datasets that can yield new ecological insights.
Likewise, asking people to contribute money to participate in a scientific study is a novel way to support building large and unique datasets on human health and diet. I just used my American Gut kit and mailed off my sample, excited to be a data point in someone else’s dataset.