Existing data, and easy-to-collect data, cannot answer many of our questions (UPDATED)

In response to my recent post attacking the phylogenetic community ecology bandwagon, a colleague who works in this broad area wrote to say that he really liked the post, but that he had a question:

But the real problem is a lack of alternative approaches – say you want to know about community assembly, and you have a phylogenetic tree, occurrence data, and maybe some traits. What do you do? We need more methods!

I suspect many readers have the same question. On the one hand, we have all these data; on the other hand, we’ve got this scientific question. How do we bring them together?

Maybe you don’t.

Look, I’m all for learning as much as possible from existing data, and from easy-to-collect data. But I’m against trying to learn more from those data than is possible! After all, the reason we have increasing amounts of phylogenetic data, occurrence data, and certain sorts of trait data is that technological advances in gene sequencing, computer hardware and software, and online databases have made those particular sorts of data easy to collect and compile. Those technological advances have had little or no effect on the ease of collecting other sorts of data. Next-generation gene sequencers, multi-core processors, and R packages haven’t reduced the money, time, or physical effort required to conduct field experiments, for instance. So here’s the thing: why should we think we’ve been so lucky that technology just so happens to have advanced in such a way that the particular data we wanted have suddenly become easy to collect? Because that would be really lucky indeed! Isn’t it rather more likely that, now that certain sorts of data just so happen to have become easy to collect or compile, people are casting about for ways to use those particular data to answer all sorts of questions that those data aren’t really right for addressing?

For instance, it wasn’t until after phylogenetic data became widely available that lots of people started arguing that phylogenetic data could be the key to unlocking the mysteries of community assembly. Which makes me very suspicious that this is a case of having a hammer and looking for a nail. Or having a hammer and trying to figure out how to use it to drive screws. Or having a hammer and then, in order to find some uses for that hammer, trying to come up with arguments as to why things that weren’t previously thought to be nails in fact are nails.*

Further, it’s not as if phylogenetic methods are the only game in town for learning about community assembly or contemporary coexistence mechanisms. As noted in the comments on my recent post, there are lots of other ways to learn about both. And if those ways aren’t as easy as building phylogenetic trees or querying trait databases, well, nobody said ecology was easy.

Easy ways to address whatever question you’re interested in do not necessarily exist. When they don’t, it’s incumbent on you to recognize that fact, and do whatever it takes to address the question you’re trying to address. You couldn’t figure out the properties of the Higgs boson with existing data and instruments (though Fermilab workers tried to argue otherwise)–so the LHC was built, at great expense. You couldn’t really get at biodiversity-ecosystem function relationships without manipulative experiments–so John Lawton and colleagues went and got millions of dollars to build the Ecotron in order to conduct such experiments, Dave Tilman went and got the money needed to do the Biodiversity I and II experiments, the BIODEPTH team went and got money to conduct that experiment, and the rest is history. And don’t think I’m arguing that ecologists should all go get bazillion-dollar grants. The wonderful NutNet experiment was born out of frustration with the limitations of existing data on primary productivity, diversity, and herbivory, and is paid for by a single ordinary NSF grant. I could keep adding examples here, but you get the idea.

I’m absolutely not trying to pick on phylogenetic community ecologists here. They’re just unlucky enough to be the example we happen to have been talking about on this blog lately. An occupational hazard of using any approach, or any source of data, is that you try to make too much of it. “What else can I do with this approach, or with these data?” is a perfectly natural, and perfectly good, question to ask.

Just so long as you recognize that the answer might be “Nothing.”

UPDATE: I see I’m not the only one worried that we’re letting the data that just so happen to be available dictate what questions we ask and how we answer them. Writing in Evolution, Travisano and Shaw (in press; open access) rip the use of genomic data to search for the genetic basis of phenotypic variation as non-explanatory. They argue for a renewed emphasis on process-oriented research: selection, drift, mutation, migration, nonrandom mating, along with the density-dependent ecological processes that underpin many of those evolutionary processes.

*This analogy is deliberately silly and overstated. Nobody who does phylogenetic community ecology would be so foolish as to actually hammer on a screw, or argue that a screw is really a nail. The analogy is merely intended to illustrate and clarify the kind of mistake that I think underpins attempts to overextend any approach. It’s not meant to imply that those who do overextend their preferred approach are foolish or bad scientists or whatever. They’re not. In science, the challenge–and it can be a difficult one–often is to figure out exactly what sort of “tool” you have, and whether it matches the sort of items (“nails”, “screws”, or whatever) you’re trying to “drive”.

6 thoughts on “Existing data, and easy-to-collect data, cannot answer many of our questions (UPDATED)”

  1. NPR’s Fresh Air interviewed Nate Silver yesterday and he talked about how hard it is to identify the value in our data. We’ve been amassing piles of data, but the amount of knowledge has not increased proportionally because some of the data just isn’t useful.

    I do think there is something to be said for probing questions with existing information (low hanging fruits!) but I am increasingly agreeing with the idea that in many cases people haven’t answered X question because the right data have not been collected yet (as opposed to people haven’t answered X question because no one has been clever enough to figure out how to put the pieces together with what we’ve got).

  2. Amen. But I think this very much is a methods problem. We can and should develop meaningful methods which allow us to conclude “there is not enough information in the data to answer the question,” (which is natural in either Bayesian or Frequentist paradigms) and we should be wary of methods that are incapable of providing such answers (e.g. AIC by itself).

    Without such methods, collecting more data doesn’t address the problem. Obviously I’m biased here, having published such a paper regarding phylogenetic methods (doi:10.1111/j.1558-5646.2011.01574.x) and a different method regarding ecological warning signals (doi:10.1098/rsif.2012.0125). Now I’ve probably violated some terms of use by advertising😛

    • So, between this post and the other post today, I’m batting .500 in terms of getting you to agree with me. Or .500 in terms of annoying you. Either way, I’ll take that batting average.😉

      In seriousness, I don’t think it’s just, or even mainly, a statistical methods problem, though the approaches you suggest absolutely are helpful. There’s no set statistical method for addressing a scientific question like “What’s the relative importance of different classes of coexistence mechanism?” or “What factors are most important in controlling local community membership?” The problem with phylogenetic community ecology approaches is not lack of information in the phylogeny, at least not in any formal way that you could quantify statistically. On their own, phylogenies are indeed uninformative about contemporary coexistence mechanisms, but they’re uninformative in a way that’s hard to quantify. It has to do not with the amount or type of information phylogenies provide, but with the validity of the *background assumptions* one needs to make in order to go from the formal statistical results of a phylogenetic analysis to an interpretation in terms of contemporary coexistence mechanisms. The problem is with going from one “level of inquiry” to another, as Deborah Mayo would put it.

      Certainly, I would *love* to see ecologists do much more in terms of validating new methods with simulation studies, which is broadly what I take it you’re trying to get at here. But unfortunately the problem is that ecologists who do conduct such simulation studies often do them badly. They generate simulated data in a way that effectively *assumes* that the method they’re trying to validate *is* valid. Either that, or they “validate” the ability of the method to do something other than what we actually want it to do. For instance, if memory serves (and it may not) phylogenetic community ecology methods *were* “validated” (in, if memory serves, an award-winning Am Nat paper). But they were “validated” using simulated data that effectively assume that the Webb et al. 2002 world view is correct! Similarly, randomized null models for detecting patterns in species x site matrices have been “validated”–but only by testing whether they can detect specific non-random patterns in those matrices, not by testing whether they can detect *whatever patterns (or non-patterns!) are generated by the process of interspecific competition*.

      • I agree with your point 100%. I don’t believe in ‘validating’ methods and try to avoid that language.

        A method is only as good as the data. In Frequentist terms, we need to measure power more often, as a function of the data we are actually applying the method to (or the posteriors for the Bayesian).

        I think we hold onto the idea that a method can be “validated” by some theorist and then we can apply it to whatever data we have. Of course this couldn’t be more wrong. We need to check the power with the data we have, every time. This is what I’ve attempted to say in those papers, and to provide the tools to make it easy to do that.
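
        One way to picture "checking the power with the data we have" is a small Monte Carlo simulation. What follows is only a minimal sketch, not the method from the papers mentioned above. It assumes a two-group comparison with normally distributed noise, and it uses a normal approximation for the critical value of Welch's t statistic. The names `welch_t` and `estimate_power`, and all the parameter values, are hypothetical and chosen for illustration.

```python
import random
import statistics


def welch_t(a, b):
    """Welch's t statistic for two independent samples."""
    var_a = statistics.variance(a)
    var_b = statistics.variance(b)
    se = (var_a / len(a) + var_b / len(b)) ** 0.5
    return (statistics.fmean(a) - statistics.fmean(b)) / se


def estimate_power(effect, n_per_group, sd, n_sims=2000, alpha=0.05, seed=1):
    """Estimate power by simulation for a dataset of this size and noise level.

    Assumption (illustrative only): both groups are normal with the same sd,
    and the rejection threshold uses a normal approximation to the t
    distribution, which is slightly liberal for small samples.
    """
    rng = random.Random(seed)
    crit = statistics.NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(n_sims):
        # Simulate data shaped like the dataset we actually have.
        a = [rng.gauss(effect, sd) for _ in range(n_per_group)]
        b = [rng.gauss(0.0, sd) for _ in range(n_per_group)]
        if abs(welch_t(a, b)) > crit:
            rejections += 1
    return rejections / n_sims


if __name__ == "__main__":
    # Same hypothetical effect size, three different sample sizes.
    for n in (5, 20, 80):
        print(n, estimate_power(effect=0.5, n_per_group=n, sd=1.0))
```

        The point of the sketch is that the sample size and noise level passed in should be those of the dataset in hand, so the estimate describes what the method can detect with these data rather than in general.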

  3. Good post Jeremy.

    I think this touches on a big issue that I have, which I’ve voiced here before–the idea of optimal strategies in science, or the lack thereof. That is, in the absence of a coordinated and strategic research effort involving many people and/or organizations, you will get individuals and groups just out to publish whatever they can. If that means trying to eke out some papers in which data are used questionably (or questionable data used), then that’s exactly what will happen, instead of the collective human and material resources expended therein going toward making sure that the right kind of data are collected and analyzed to address whatever question is being addressed. Short version: people tend to do what they feel they can get away with, in the absence of an over-arching plan of attack of which they are a part.

    Having said that, there are, I think, certain types of data in which “we got what we got and we ain’t never gonna got nothin’ better”. By which I refer primarily to historical data I suppose. Such data were *almost always* collected for some purpose other than that to which we desire to apply them, and they were a one-time deal and we have to make the most of them. This necessarily means a lot of careful thought wrt analytical methods, and sometimes the development of some sophisticated methods to attempt to correct for various weaknesses in the data. No way around it. Given that such data represent the potential to understand long term system dynamics that we would not otherwise have (at least so directly or clearly), these efforts are warranted IMO.

  4. Pingback: The catch-22 of slide-22: Pedagogical troubles with conjugate priors « ecology & stats
