Do we need a culture of Data Science in academia? (guest post)

Note from Jeremy: This is a guest post by ecologist and regular commenter Carl Boettiger. Data handling and analytical practices are changing fast in many fields of science. Think for instance of widespread uptake of open source software like R, and the data sharing rules now in place at many journals. Here Carl tries to cut through the hype (and the skepticism) about “Big Data” and related trends, laying out what he sees as the key issues and suggesting ways to address them. 

**********************************

On Tuesday the White House Office of Science and Technology Policy announced the creation of a $37.8 million initiative to promote a “Data Science Culture” in academic institutions, funded by the Gordon and Betty Moore Foundation and the Alfred P. Sloan Foundation and hosted in centers at UC Berkeley, the University of Washington, and New York University. Sadly, these announcements give little description of just what such a center would do, beyond repeating the usual hype about “Big Data.”

Fernando Perez, a research scientist at UC Berkeley closely involved with the process, paints a rather more provocative picture in his own perspective on what this initiative might mean by a “Data Science Culture.” Rather than motivating the need for such a Center merely by expressing terabytes in scientific notation, Perez focuses on something not mentioned in the press releases. In his view, the objective of such a center stems from the observation that:

the incentive mechanisms of academic research are at sharp odds with the rising need for highly collaborative interdisciplinary research, where computation and data are first-class citizens

His list of problems to be tackled by this Data Science Initiative includes some particularly striking references to issues that have come up on Dynamic Ecology before:

  • people grab methods like shirts from a rack, to see if they work with the pants they are wearing that day
  • methodologists tend to only offer proof-of-concept, synthetic examples, staying largely shielded from real-world concerns

Well, that’s a different tune from the usual big data hype[^1]. While it is easy to find anecdotes that support each of these charges, it is more difficult to assess just how rare or pervasive they really are. These are not new complaints among ecologists, but the solutions (or at least antidotes) proposed under a Data Science Culture are given a rather different emphasis. At first glance, the Data Science Culture sounds like the more familiar call for an interdisciplinary culture, emphasizing that the world would be a better place if only domain scientists learned more mathematics, statistics and computer science. It is not.

the problem, part 1: statistical machismo?

As to whether ecologists choose methods to match their pants, we have at least some data beyond anecdote. A survey earlier this year (Joppa et al. 2013, Science) has indeed shown that most ecologists select statistical software guided primarily by concerns of fashion (in other words, whatever everybody else uses). The recent expansion of readily available statistical software has greatly increased the number of shirts on the rack. Titles in Ecology reflect the trend of rising complexity in ecological models, such as “Living dangerously with big fancy models” and “Are exercises like this a good use of anybody’s time?”. Because software enables researchers to make use of methods without the statistical knowledge of how to implement them from the ground up, many echo the position so memorably articulated by Jim Clark that we are “handing guns to children.” This belittling position usually leads to a call for improved education and training in mathematical and statistical underpinnings (see each of the 9 articles in another Ecology Forum on this topic), or the occasional wistful longing for a simpler time.

the solution, part 1: data publication?

What is most interesting to me in Perez’s perspective on the Data Science Initiative is its emphasis on changing incentives more than changing educational practices. Perez characterizes the fundamental objective of the initiative as a cultural shift in which

“The creation of usable, robust computational tools, and the work of data acquisition and analysis must be treated as equal partners to methodological advances or domain-specific results”

While this does not tackle the problem of misuse or misinterpretation of statistical methodology head-on, I believe it is a rather thought-provoking approach to mitigating the consequences of mistakes or limiting assumptions. By atomizing the traditional publication into its component parts (data, text, and software implementation), it becomes easier to recognize each for its own contribution. A brilliantly executed experimental manipulation need not live or die on some minor flaw in a routine statistical analysis when the data is a product in its own right.

Programmatic access to raw data and to computational libraries of statistical tools could make it easy to repeat or alter the methods chosen by the original authors, allowing the consequences of these mistakes to be both understood and corrected. In the current system, in which access to the raw data is rare, statistical mistakes can be difficult to detect and even harder to remedy. This in turn places a high premium on the selection of appropriate statistical methods, while putting little selective pressure on the details of data management or the implementation of those methods. Allowing the data to stand by itself places a higher premium on careful collection and annotation of data (e.g. the adoption of metadata standards). To the extent that misapplication of statistical and modeling approaches could place a substantial error rate on the literature (The Economist; Ioannidis 2005), independent data publication might be an intriguing antidote.
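To make that idea concrete, here is a minimal sketch in R of what “repeat or alter the methods” could look like once raw data are archived with programmatic access. The URL, variable names and models below are hypothetical placeholders, not a real dataset or the analysis of any particular paper:

```r
# Hypothetical sketch: the archive URL, column names and models are invented
# for illustration; they do not point to a real dataset or reproduce any paper.

# Pull the published raw data straight from its (hypothetical) stable archive URL
counts <- read.csv("https://data-archive.example.org/doi-10.xxxx/counts.csv")

# Re-run the original authors' (hypothetical) analysis...
original_fit <- glm(abundance ~ temperature, family = poisson, data = counts)

# ...then swap in an alternative model to see whether the conclusion
# hinges on that particular methodological choice
alternative_fit <- MASS::glm.nb(abundance ~ temperature, data = counts)

AIC(original_fit, alternative_fit)
```

The point is not these particular models, but that when the data are a first-class product, checking or correcting a questionable analysis becomes a few lines of code rather than an email exchange with the original authors.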

the problem, part 2: junk software

As Perez is careful to point out, those implementing and publishing methods aren’t helping either. Unreliable, inextensible and opaque computational implementations act as barriers to both adoption and validation. Trouble with scientific software has been well recognized in the literature (e.g. Merali 2010, Nature; Ince et al. 2012, Nature), in the news (Times Higher Education) and by funding agencies (National Science Foundation). While it is difficult to assess how often software bugs really alter results (though see Ince et al.), designs that make software challenging or impossible to maintain, scale to larger tasks, or extend as methods evolve are more readily apparent. Cultural challenges around software run as deep as they do around data. When Mozilla’s Science Lab undertook a review of code associated with scientific publications, they took some criticism from other advocates of publishing code. I encountered this firsthand in replies from authors, editors and reviewers on my own blog post suggesting we raise the bar on the review of methodological implementations. Despite disagreement about where that bar should be, I think we all felt the community could benefit from clearer guidance or consensus on how to review papers in which the software implementation is an essential part of the contribution.
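As a toy illustration of what a more reviewable implementation might include, here is a sketch of a small function shipped with an automated test (using the testthat package). The function, its name and the expected values are invented for this example, not taken from any real package or paper:

```r
# Hypothetical example of a small, testable piece of scientific code;
# shannon_diversity() is an invented function, not from a real package.
library(testthat)

shannon_diversity <- function(counts) {
  p <- counts[counts > 0] / sum(counts)  # drop zero counts, convert to proportions
  -sum(p * log(p))                       # Shannon entropy: H = -sum(p * ln p)
}

# A test a reviewer (or a continuous integration service) can run mechanically
test_that("shannon_diversity matches hand-computed values", {
  expect_equal(shannon_diversity(c(10, 10)), log(2))   # two equally common species
  expect_equal(shannon_diversity(c(5, 0, 5)), log(2))  # zero counts should not change the result
})
```

Code that arrives with even this much scaffolding gives a reviewer something concrete to run, which is part of what “raising the bar” might mean in practice.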

the solution, part 2: software publication?

As in the case of data, better education is the route usually suggested for improving programming practices, and no doubt it is important. Once again, though, it is interesting to consider how stronger incentives for such research products might also improve their quality, or at least make it easier to distill the good from the bad from the ugly. Yet in this case, I think there is a potential downside as well.

Or not?

While widespread recognition of software’s importance will no doubt help bring us faster software, fewer bugs and more user-friendly interfaces, it may also do more harm than good. Promotion of software as a product can lead to empire-building, for which ESRI’s ArcGIS might be a poster child. The scientific concepts become increasingly opaque, while training in a conceptually rich academic field gives way to more mindless training in the user interface of a single giant software tool. I believe that good scientific software should be modular: small code bases that are easy to understand, interoperable, and that each perform a single task well (the Unix model). This lets us build more robust computational infrastructure tailored to the problem at hand, just as individual Lego bricks may be assembled and reassembled. Unfortunately, I do not see how recognition for software products would promote small modules over vast software platforms, or interoperability with other software over an exclusive walled garden.
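As a purely illustrative sketch of that modular style, consider a pipeline built from small, single-purpose functions. Every function name below is a hypothetical stand-in, not part of any real package:

```r
# Toy sketch of the "small pieces, loosely joined" style described above.
# Each function here is a hypothetical stand-in for a small, single-purpose tool.

simulate_records <- function(n = 50) {            # stand-in for fetching archived data
  data.frame(year = 1:n, count = rpois(n, lambda = 20 + 0.5 * (1:n)))
}

clean_records <- function(df) {                   # one job: basic quality control
  df[!is.na(df$count), ]
}

fit_trend <- function(df) {                       # one job: the statistical model
  glm(count ~ year, family = poisson, data = df)
}

summarize_trend <- function(fit) {                # one job: report the estimate
  coef(summary(fit))["year", , drop = FALSE]
}

# Because each step takes and returns plain R objects, any single "brick"
# (a different data source, a different model) can be swapped out without
# rewriting the rest of the pipeline.
summarize_trend(fit_trend(clean_records(simulate_records())))
```

Whether recognition for software products would favour this kind of Lego-brick design, or instead reward ever-larger platforms, is exactly the question raised above.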

So, change incentives how?

If this provides some argument for why one might want to change incentives around data and software publication, I have said nothing to suggest how. After all, as ecologists we’re trained to reflect on the impact a policy would have, not to advocate for what should be done about it. If the decision-makers agree about the effects of a given set of incentives, then choosing what to reward should be the easier part.

[^1]: Probably for reasons discussed recently on Dynamic Ecology about politicians and dirty laundry.

17 thoughts on “Do we need a culture of Data Science in academia? (guest post)”

  1. I find it interesting and depressing that there is so much talk about data science and little to no talk about good versus mediocre science and data in ecology. It does not bode well for the future of ecology, or for what ecology can contribute to applied issues and conservation, that young ecologists are more comfortable at a computer with other people’s data than in the field solving problems with hypothesis-driven experiments. I think that this is largely driven by publication pressure rather than “progress”. What happens when the problem-identifying and problem-solving ecologists retire? Zen question of the day: how much bad data does it take to generate a novel, important and accurate synthesis paper in the field of ecology?

    • I’d like to think that here at Dynamic Ecology we talk a lot about good vs. mediocre science! Just to pick a random mix of recent and older posts off the top of my head:

      The one true route to good science is …

      The Importance of Diverse Approaches in Ecological Research

      Advice: good reasons for choosing a research project (plus some bad ones) (UPDATED)

      Advice: weak reasons for choosing a research project

      Zombie ideas in ecology

      Is using detection probabilities a case of statistical machismo?

      Can the phylogenetic community ecology bandwagon be stopped or steered? A case study of contrarian ecology

      Perhaps your depression comes in part from the fact that ecologists on Twitter, blogs, and other social media are a non-random sample of ecologists, and talk about data science more than a random sample of ecologists would? I suspect that if you compiled data systematically, you’d find that the large majority of papers in leading ecology journals report original data collected by the authors. In fact, if memory serves, Marc Cadotte at The EEB and Flow actually did this a year or two ago and that’s what he found. And I bet it would remain true if you restricted attention to papers by young authors.

    • Hi Mark, I’m never completely clear about why ‘other people’s data’ is somehow pejorative. And it seems that in the space of a sentence or two ‘other people’s data’ transforms into ‘bad data’. I have no doubt that I could go out and collect bad data, and the fact that I collected it doesn’t do anything to make it better. Data will stand or fall based on its appropriateness for the question being asked, and that appropriateness will be inferred from the metadata. Best, Jeff Houlahan.

    • Mark – I’m very confused by this comment. Can we stipulate that there are examples of good, important science involving meta-analysis (e.g. the Gurevitch et al. and Goldberg & Barton 1992 reviews of plant competition experiments)? And that there are important results from big-data/big-science (e.g. the analyses of the population trajectories of bird species in North America coming from the Breeding Bird Survey; it’s hard to imagine how a truly continental assessment could have been reached any other way)? And of course I think we can all agree that there are a handful of field experiments out there that have been “bad science” (poorly designed, unclear question, question inappropriate to the system?).

      If we can agree on all three points above, then there is no black-and-white “this method = bad science” rule. We are stuck with having discussions and exercising judgement to determine whether something is good or bad science, not just applying a quick-and-dirty “this paper used approach X, so it is good (or bad) science” test.

      I’ve argued this point in much more detail in the “The one true route to good science” post Jeremy linked to.

  2. Hi Mark,
    As a young scientist who asks and answers ecological questions on my computer using other people’s data, I would suggest that not all ecological questions can be answered by going out into the field and collecting one’s own data. I don’t have 50 years in which to collect data that will answer the kinds of questions that I am interested in, and I am incredibly grateful that there is data available for me to use (such as the BBS data Brian references). And while my research is focused on theoretical questions regarding the dynamics and behavior of complex adaptive systems, I very much hope that it will be pertinent to applied ecology and conservation issues. Your comments seem to suggest that the two are mutually exclusive (computer-driven analyses and sound problem solving).

    I enjoyed Carl’s post very much (as I enjoy all the posts on this site, so props). Perhaps this is a function of my own naivete, but I associate ‘big data’ with projects asking questions that cannot be answered with data collected over short time spans or small spatial scales. And statistical methods for these kinds of questions (big data questions) often don’t exist, or have not been thoroughly vetted in the statistical literature. I might be grabbing shirts off the rack, but it’s a very sparsely populated rack. While I am not exactly experienced in publishing, I have found it frustrating that journal space limitations have left me without room to discuss my data choices/manipulations or my methods choices in as much detail as I would like; I have had to make many choices and assumptions along the way and would prefer to be more transparent and detailed about them.

  3. I wonder if and when ecology (and other life-science) journals will ever require a classically trained statistician to be a co-author on manuscripts. I am not of the mindset that says ecologists do not need to be trained in statistics. However, I wonder if, by requiring this collaboration, ecologists would be able to devote more time to other aspects of their research. Of course, many discoveries can be made in the process of analysis. Still, with the guidance of a true statistician, maybe more robust questions and discoveries would be found faster? Thanks for the nice post :)

    • Hmm, sounds quite impractical, doesn’t it? Not everybody even knows a statistician or works someplace that employs a statistician. And statisticians also have lots of demands on their time and wouldn’t be under any obligation to collaborate with ecologists. Effectively, this requirement would amount to requiring every ecologist to pay for a statistical consultant. Plus, the truth is that, for all our talk of statistical machismo etc., if you look through ecology journals most papers still have pretty simple stats, not at all the kind of thing on which you’d need a professional statistician to consult.

      • “…if you look through ecology journals most papers still have pretty simple stats, not at all the kind of thing on which you’d need a professional statistician to consult.”

        Well, that is changing rapidly. And given that most ecology curricula (aside from some good graduate programs) are behind on statistical training that doesn’t involve ANOVAs, a consultation is often necessary in lieu of additional learning. There is a huge difference between knowing how to follow a recipe and being able to adapt a recipe to your own objectives, resources, constraints, etc.

        Unfortunately this difference can become apparent in peer review (vis-à-vis statistical machismo) when well-intentioned reviewers advocate/demand following a recipe without truly understanding the reasoning. I suppose having a statistician take part in the review process would be one solution but, as with mandating one as a co-author, it’s hard to imagine how this could actually be implemented.

  4. Thanks for the reply, Jeremy and Dan. I guess I have been in a university setting for so long, and have always had access to a statistician, so my opinion may be skewed. Just out of curiosity, from a graduate student’s perspective: what types of ecology research positions do not have access to a statistician?

    re “Not everybody even knows a statistician…”. Maybe I am speaking from an extravert’s standpoint, but I feel that if you have not made, or cannot make, a connection with a statistician who is interested in ecological problems during your time as a grad student, a professor, or a research scientist, that fault lies with the individual. I agree that time is always an issue; however, a problem that might stump an ecologist unaware of the latest and greatest stats tools may be resolved by a 30-minute conversation with a statistician…. or maybe not hahaha. Thanks again.

    • Requiring people to consult with a statistician is just a non-starter, I think. It would be a radical change in peer review, and scientific practice more generally. You’re evaluated on the science you report, not on who you worked with to do it, and I think that’s as it should be.

  5. Coming in a bit late. A few thoughts:

    * Mark: in my (slightly extreme) opinion, knowledge of natural history, calculus, sophisticated statistics, and ‘data science’ are all tools that make people better ecologists, but I don’t actually consider ANY of them to be necessary to be a good ecologist. (I do think respect for these tools, especially the ones you don’t have, is important.) I would happily listen to a talk by a muddy-boots ecologist who knows no calculus, fancy statistics, or computer stuff, OR a theoretician who wouldn’t know an amphipod from a shrimp, as long as they’re both creative, thoughtful scientists with an appreciation for the big scientific picture.

    * Kevin: I agree with Jeremy that there just aren’t enough statisticians to go around. My advice to ecologists who don’t like statistics (and quite reasonably don’t want to learn the “latest and greatest” statistical technology) is that they should stay away from parts of ecology where complex statistical analyses are required (e.g., large-scale/complex observational data) and focus on problems that allow a “strong inference” approach, i.e. systems where units can easily be independently randomized, replicated, and controlled to produce clear, simple experimental results (e.g. lab or small-scale field micro- and mesocosms). My only solution for the problem of macho reviewers who inappropriately demand latest-and-greatest statistics is to hope that the editors to whom they report will be knowledgeable and brave enough to overrule them.

    * blatant self-promotion: “other people’s data” reminds me of http://ms.mcmaster.ca/~bolker/bbpapers/Bolker2005.pdf (username=”bbpapers”, password=”research”)

    • Better late than never Ben! Always good to know wwbbd? 🙂

      I agree that no tool is necessary to be a good ecologist. Techniques aren’t powerful, scientists are.

      A bit of historical context: right now I’m reading The Silwood Story, about the cohort of top ecologists associated with Imperial College London’s Silwood Park campus from roughly the 1970s to the early oughts. Very interesting on many levels, including the discussion of how computers and theoreticians started to change ecology in a big way starting in the 1960s. They were having much the same methodological debates back then as we are now, except that the range of variation has shifted. For instance, pretty much all ecologists today are more statistically sophisticated and computer-savvy than almost all ecologists back in the 1960s or 70s. It kind of reminds me of how in every generation of adults there are many who are appalled by the behavior of the current generation of teenagers, and so worried for the future of civilized society. Which suggests that civilized society has been doomed for as long as it has existed! 🙂

      Embarrassingly, I wasn’t aware of that 2005 piece of yours, Ben. It’s lovely, I’m going to do a post on it tomorrow just to bring it “above the fold” and encourage readers to check it out (many readers don’t read the comments).

    • Hi Ben,

      I completely agree with your replies to Kevin and Mark.

      I would be very curious to hear your thoughts on my comments in this piece about the advantages and dangers of software implementations, particularly if/as they come to be seen as important research products.

      I’m afraid I don’t agree with your lament about ‘other people’s data’, as I comment on Jeremy’s post on the topic https://dynamicecology.wordpress.com/2013/12/03/hoisted-from-the-comments-ben-bolker-on-other-peoples-data/comment-page-1/#comment-21133

  6. I really appreciate the pushback, Jeremy and Ben. It has been really good as a young ecologist to hear and feel out the opinions of those who are currently working in the field.

  7. Pingback: Hoisted from the comments: Ben Bolker on “other people’s data” | Dynamic Ecology

  8. Pingback: Stats vs. scouts, polls vs. pundits, and ecology vs. natural history | Dynamic Ecology
