About Brian McGill

I am a macroecologist at the University of Maine. I study how human-caused global change (especially global warming and land cover change) affects communities, biodiversity and our global ecology.

What it takes to do policy-relevant science

I just spent last week at SESYNC, which some ecologists might still know better as “the new NCEAS”. It is, however, different from NCEAS in at least two ways. First, the first “S” in SESYNC is “Socio” – not that NCEAS didn’t do plenty of applied ecology with a human dimension, but it is right up front and part of every single project at SESYNC. The second is that the focus is on what SESYNC calls “actionable” science. As some of you know, I have a joint appointment in the Sustainability Solutions Initiative (hereafter SSI) at the University of Maine, which has a very similar orientation. Or if you want another hook, the NSF SEES granting program has a very similar orientation. Loyalty demands that I point out that University of Maine’s SSI was the first of these 3 programs!

I should warn you this is a long post. This represents 3+ years of active thinking. Feel free to skim or skip if this is not a topic of interest to you.

I don’t say you need to do the kind of science these three institutions/programs are aiming for. Nor do I say that all science should be of this kind. I still have a healthy component of basic research in my own work and think this needs to keep going. But many, many scientists, especially, I find, earlier-career scientists, want to deliver solutions. And I think I could (if I weren’t lazy) build a case using data from NSF that funding for this kind of science is on the upswing. There is an increasing discomfort with the “loading dock” model of science (do the science, put it on the loading dock, and wait for the adoring users to back up their trucks and do all the work to load it up and use it).

The rest of this post reflects my musings, hard-won experience, and education by my colleagues on what it takes to deliver policy-relevant science (or actionable science or solutions). A good part of what I am writing here was learned from my colleagues in SSI*.

So without further justification, let me present my version of the 7 P’s of policy-relevant science (yes, I am a sucker for alliterative memory gimmicks). Also, just to be clear, I am trained and experienced as an ecologist, so that is obviously my bias, but in this piece I am saying things I believe apply to science generically (broadly enough to include all empirical quantitative researchers and maybe even all academic researchers, certainly not just biophysical science, but remember I can only really claim expertise in ecology).

  1. Presentation – one common claim is that science would be used for policy decisions more if we only presented it better. This is the idea that scientists are terrible communicators. It is epitomized in Randy Olson’s movie Sizzle. I might even go so far as to say this has been the biggest effort by NSF to make work more useful – they have been conducting extensive media training under the title “Becoming the messenger”, with workshops all over the country. Personally, I think this one is a big cop-out. Are there some terribly poor science communicators? Yes. Would all scientists benefit from additional training in communication? Yes. Will the world change if all science were well communicated? NO! There are already plenty of great scientist communicators. There are also plenty of NGOs functioning on the boundary with professional communicators repackaging the work of scientists. Improving presentation is relatively cheap and requires little effort and change on the part of scientists – it would be nice if it were the solution. But it’s not. You can put me down for this being 1% of the problem.
  2. People – scientists should talk to people who are going to be affected by or care about the problems they are addressing. This more commonly goes under the label “stakeholder engagement” and has been a movement building since the social activism of the 1970s; it became a US government mandated part of land management under Clinton in the 1990s under the label Community Based Management or CBM. I 100% think policy-relevant science requires engagement with stakeholders (lots of engagement in most cases). However, I don’t think there is much training for scientists in how to do this. And even more importantly, I don’t think there has been much thought about how this process should occur in a way that leaves the scientist doing research instead of just turning into another activist voice at the table. Many social scientists have written research arguing that scientists are just another voice at the table. And of course many politicians are also framing scientists in this just-another-voice box to pursue their own agendas. Thus, while I think stakeholder engagement is critical, I think scientists need to take some real ownership and leadership in figuring out what this looks like and in training our peers. In my experience most stakeholders (and most politicians who aren’t pushing an anti-science agenda and many social scientists like Cash) also think science should not just be another voice at the table, but they don’t yet have a well-formed idea of what science-stakeholder engagement should look like either. In my opinion David Cash has written some of the most thoughtful work on this topic (e.g. this piece which coined the idea of “loading dock” research). So to repeat myself, scientists need to be thought leaders in what stakeholder-engaged research looks like. The next 5 P’s are some of my thoughts on what this should look like.
  3. Problem co-defined – this is probably the most radical but most important of the 7 P’s. The vast majority of scientific questions that get asked come from the scientists themselves. The proportion is a bit lower in natural resource departments, but even there a high proportion of the questions are scientist driven. If one steps back though, it is blindingly obvious that if one wants to deliver useful, policy-relevant science, one ought to ask potential stakeholder constituencies and policy makers what science might be useful to them! This is no different than a business finding out what their customers want. This is not to say that scientists are then obligated to do whatever the stakeholders ask. Often they ask impossible questions, questions more expensive than what they or society are willing to pay for, questions outside the expertise of the scientists talking to them, and yes, questions uninteresting to the scientists talking to them. Question definition is best done as a joint negotiation between scientists and stakeholders, and this is called “problem co-definition” in the literature. Despite the obviousness of the necessity for this approach, there is an equally obvious reason why it is rarely used – it involves loss of control for the scientists! It is frightening. It is a radical break from existing practice. The experience of SSI with over 20 separate projects is that every time you start a project thinking you know what the stakeholders need but then go ask them before starting, you are usually only somewhere between 0-50% right. Every single SSI project has changed their research questions in fundamental ways in response to stakeholder engagement. This approach fully embraced will radically change the kind of science that is done. That said, in SSI every scientist is still happy with the questions they co-defined and even happier with the fact they are sure the work will be useful.
This might be radical, but it might not be as painful as people think at first glance. It might even be good for science for us to be pushed to pursue some questions we avoid because they’re hard!
  4. Place-based – One common difference between what the researchers want and the stakeholders want is that the stakeholders want much more specific research that is useful to them. In contrast, most scientists are trained to generalize, generalize. It can seem a comedown to be asked “what is the population size of deer over the last 5 years in Orono, Maine, USA?” Asking what are the general drivers of deer populations is a more scientific seeming question. And again, natural resource department researchers are already much more used to this kind of research (for which biology/EEB departments sneer at their colleagues for not doing general research and for which the natural resource department folks sneer back at their colleagues for building castles in the sky). This is a real issue and research really does fall on a spectrum from highly general to place-based (and organism-based and time-specific). To some degree people wanting to do policy-relevant science might just need to sacrifice a bit and move past their comfort zone to do more place-based research. But I would like to argue this dichotomy is made out to be bigger than it really is. How often does a researcher aiming for general results really span more than a handful of sites in a small geographic area at a limited point in time? This is research that is place-based even if it might also be designed to be generalizable. Many scientists can do policy-relevant place-based research funded by policy making agencies (e.g. USDA, state wildlife departments) while still writing very general science papers from the data in addition to the detailed grey-literature place-based reports they deliver.
  5. Poly-disciplinary – OK – I made this word up to fit my P fetish. Interdisciplinary would be the more common word. Or the hip word would be transdisciplinary (so fully merged the individual disciplines aren’t recognizable). But I think most ecologists recognize that if you want to make the world more sustainable the biggest challenge is humanity. So you better study humanity. The NSF program Coupled-Natural Human (CNH) systems gets at this, although I think it is fair to say that not all policy-relevant science need use a CNH approach (often times the human system completely dominates the natural system and the idea of delicately balanced back-and-forth feedbacks isn’t too useful) nor do all CNH studies lead to policy-relevant science. Instead it is often important to understand the psychology of what motivates people to change, the economics and policy of creating appropriate incentives, human dynamics of population and consumption growth, technology change, etc.
  6. Post-research-engagement – There is some talk in the literature about co-definition of the problems (see #3 above) and some talk about improving mass-communication (see #1 above), but I think truly successful policy-relevant science requires more – namely stakeholder engagement during research to some degree, but especially after the research is done. As science transitions into policy, there is a very important period of interpreting research. Again scientists like to pretend we don’t do interpretation, but there is no denying it happens in a policy arena. Being actively engaged with stakeholders during this stage is critical. In part it is because stakeholders can help with the communication and help us strip out jargon and avoid obvious pitfalls (like communicating climate change in degrees Celsius instead of Fahrenheit to the American public). More subtle issues of presentation are often important too – for example there have been interesting studies of “boundary objects” – things that sit on the fence between science and policy like maps and what makes them successful or not. But things beyond communication like pulling out what is important to policy, the implications for policy etc also must occur. To my mind this aspect of stakeholder engagement in interpreting research after the research is done (or at least a round of research is done) is the most overlooked and ignored part of doing policy-relevant science. This is why I prefer the phrase co-production of science with its focus on engagement with stakeholders from start-to-end over the more common phrase co-definition of problems or the other common phrase of stakeholder-engagement that doesn’t give an explicit relationship of stakeholders to research (opening the door to the scientists-are-just-another-voice-at-the-table paradigm).
  7. Personal relationships – stakeholder-engagement has already been discussed above. But it is important to note that successful stake-holder engagement will depend directly on personal relationships built up over time and through informal contacts over meals and what not as well as formal meetings. In a boundary-spanning class I co-taught (more about this below) we brought in 20 people with experience in spanning boundaries between science and policy and literally all 20 people said that in the end personal relationships were the most important thing. This is pretty different from science with its efforts (although never fully successful) to focus on objective progress independent of personal links. But if a legislator is going to take a vote based on science, you can bet they want to know the person who did the science and trust them.

So to do policy relevant science, all you have to do is spend a lot of time with stakeholders, let them have 50% of the job of deciding what research to do, study complex social behaviors in addition to biophysical sciences, become more place-based, and open up the research interpretation process to other people and invest time in building relationships with them. No problem, knock it off with a few extra hours of work, right? Of course not! Policy-relevant science involves a fundamental change to the way science is done (again assuming policy relevance is your goal – not all science should have this goal). I have argued above that some aspects like the place-based research and the stakeholder co-defined research need not be as frightening as they seem. But in total, there is no denying doing policy-relevant science is a lot of work! And a big change for many scientists.

So for those scientists who are making this change, or for those educators and administrators trying to facilitate this change, how should the change happen? As always, I am ready with a silly mnemonic. The four T-s of transition.

  • Training – This should be obvious to professors, but in truth we have a blind spot – we have gotten so good at teaching ourselves things related to our discipline that we think we can teach ourselves anything. Or that science is the hard part, policy is easy to pick up by osmosis (this is obviously false). If we want scientists to do policy-relevant science we need to train them in skills related to the 7 P’s and this includes soft skills like facilitation. To my mind this can be best summarized as the task of boundary-spanning. A transformative event for me (and I think the 3 colleagues I co-taught the course with – David Hart, Laura Lindenfeld and Kathleen Bell) was teaching a 1 week intensive course in boundary spanning for graduate students. In addition to reading some of the literature and theory of boundary spanning (probably a whole other post in there), we brought in panels of people who successfully span boundaries and had them speak about what they saw as key success factors. But whatever you call it, there are a lot of skills that boundary-spanning scientists doing policy-relevant science need. Media training (P #1) is just the tip of the iceberg.
  • Teams – the biggest error in thinking is that one person can or should do all 7 P’s. It is a rather laughable idea actually when you really stop to think about it. This means that policy-relevant science almost always happens in teams. In these teams everybody needs to have the rudiments of the 7P’s but different individuals with different strengths and expertise can complement each other to actually fulfill the full range of the 7 P’s.
  • Tag teams – policy-relevant science can also happen by one person doing one piece and making it available in a public manner and then somebody else picking it up and doing another piece, not unlike children’s tag-teams where when you tag somebody you go out and they step in to do something (or WWF wrestling if you prefer). People can do this without ever thinking of themselves as a team or ever even meeting. There is obviously a continuum from strong teams through weak coalitions to tag-team processes. And tag-teams are dangerously close to the loading dock model that was derided above and in the piece linked to above by Cash. Thus I expect the majority of successful policy-making will occur closer to the team than the tag end of the spectrum. But some good policy-relevant science happens through tag if people do it in a thoughtful way. And no doubt this approach can be used to channel more basic research in useful ways. As an extreme example you don’t need stakeholder engagement to know that cold-fusion or understanding the causes of extinction are important research topics. Just don’t use this as a cop-out to avoid teams.
  • Thinking – I’m talking about big changes. Scientists need to apply their academic approach of discussion, analysis and data to the 7 P’s and success factors in doing policy-relevant science. Social scientists have been the ones studying and writing about this problem most to date. Many of them are doing a good job. A few of them in my opinion don’t get biophysical research at all. But it is kind of ironic that if the ideas which emerge are things like teams and interdisciplinarity, the study of how to do policy-relevant science is not itself team-based and interdisciplinary. A full understanding of how best to do policy-relevant science needs to include the biophysical scientist’s expertise and perspectives too. This has not been happening enough.

You can call everything I have talked about boundary spanning or co-production of science (as Cash, whom I cited above, does). I think both of these phrases are useful. But whatever you call it, it is a lot of work and the world needs more of it. And as a scientist trying to learn this material, I found very little published material (journal articles or otherwise) to help me learn. That needs to change.

I am very curious to hear what others think. Again I don’t want this to turn into a debate about whether policy-relevant science should happen and scientists should be part of it. They should, but not everybody should be forced to do so – basic science is important too. For those who would describe themselves as already engaged in policy relevant science, what did I get right or wrong? For those just starting out, was this helpful? What are your impressions? For those who have never done policy relevant science, does this scare you away or appeal to you?

* Including amongst many others David Hart, Tim Waring, Kathleen Bell, Mac Hunter, Aram Calhoun, Laura Lindenfeld and Shaleen Jain

A calm and balanced case for math in biology (UPDATED)

So as readers of this blog will be aware Jeremy Fox of Dynamic Ecology got major play on the social media for this recent piece on why EO Wilson’s folksy Wall Street Journal editorial on why/how you can be a great biologist without math is wrong. It broke pretty much every readership statistic this blog has.

As with all things internet, it appears the attention span has moved elsewhere. But I wanted to lay out my case for why math is an important skill for biologists in a calmer context, quite independent of and not in reaction to Wilson’s folksy rhetoric.

Before I do that, I want to make clear that my answer to whether empirical work or theory is more important is an emphatic “both”. Ecologists love to frame things in either or debates (density dependent or independent, competition or predation, …) and have wars, but both is a safe bet in most of these cases and it is here too. The theory vs field debate is just another flavor of this. As Terry noted this debate has been going on for a long time. Sharon Kingsland in her history of ecology uses this debate as her central organizing theme. (this debate is actually as old as science itself – see footnote*). But I just want to be clear I am not in an either/or frame and I will return to this at the end.

My central thesis here is that math is important in science (I choose the word “important” carefully – I don’t want to go so far as to say “necessary” but “useful” is rather weaker than what I want to claim). Although there are more reasons, I want to give two arguments for why math is important: modelling and variance.

Argument 1 – Math and modelling

So first let me give my view of what modelling is. Here is my definition of modelling:

An abstraction of the real world into a domain where logical inference (deduction) can be applied to transform explicit assumptions into new predictions

Which I demonstrate with the following figure:

[Figure: model]
First, I would like to make the case that modelling in this sense is central to the scientific method. I am not a big fan of the 4 step scientific method taught in grade school. But producing predictions and bringing them into reality and testing them is clearly part of that. I think this view of models also fits into more complex, realistic views of how science works as well. But as long as you buy that science has anything to do with predictions OR general principles, you either have to become a very pure sensu stricto empiricist (i.e. no thinking or processing involved – not the more inclusive branches of empiricism like pragmatism) or you have to bring this very sensu lato definition of modelling into having a role in science.

Note that this definition of modelling heavily emphasizes a domain of logic separate from reality but does NOT say what it is. And in fact there are many domains of logic including verbal, pictorial, mathematical, scaled (think of an architectural scale model, a map, or a scaled engineering prototype), etc.

OK, you say, enough philosophy, talk about math. Well I absolutely think modelling includes verbal models, and I would never say “you aren’t doing modelling” or “you aren’t doing science” if you have a verbal model. But I think ecology is replete with stories where verbal models are just a tad too fuzzy and get us into trouble. There are two main benefits to having a precise language in the right hand (abstract domain) portion of the figure. One is that it documents the assumptions (including how variables are measured) in a way about which there is not disagreement of what we’re talking about (which may inspire disagreement about what we should talk about). The second is that it allows us to apply rules of logic to deduce new statements (aka predictions or hypotheses). Something with more precision advances science.

One example is Elton’s hypothesis that increased diversity increases stability of a system. A nice intuitive idea argued verbally by Elton. Except then May gave a counter example. A decade of confusion ensued and eventually we ended up with Pimm’s paper pointing out that diversity and stability are both vague concepts (we have this habit in human languages of overloading a word with many meanings) that need precise quantitative definitions before you can even empirically measure and test the idea (an example of having precisely stated assumptions even if only to let us start debating what assumptions are right). Since that paper, the world has largely focused on CV (coefficient of variation) in total community biomass, and a productive conversation has achieved real progress (with more work to go).  May’s model also is an example of the second benefit of moving into the abstract world – using logic to deduce implications. May took a very specific definition of stability (based on a steady equilibrium in a quadratic differential equation system using eigenvalues) and showed that this stability depended not just on diversity (i.e. just number of species) but number of interactions (something that had not until then been conceived from empirical observation) and thereby broadened the debate to complexity instead of just diversity. A clear example of a precise logical deduction that was useful. (UPDATE: Jeremy has an old post reviewing the many different things ecologists have meant by “stability”, which emphasizes the importance of mathematically-precise definitions, independent of their empirical tractability)
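To make the point about precise definitions concrete, here is a minimal sketch of the CV-of-total-community-biomass measure mentioned above. The function name, the use of the sample standard deviation, and the biomass numbers are all my own illustrative assumptions, not data from any of the papers discussed.

```python
from statistics import mean, stdev


def community_biomass_cv(biomass_by_year):
    """Coefficient of variation (SD / mean) of total community biomass.

    biomass_by_year: one list per year, holding the biomass of each species.
    Uses the sample standard deviation; that choice is an assumption.
    """
    totals = [sum(year) for year in biomass_by_year]
    return stdev(totals) / mean(totals)


# Hypothetical two-species communities observed for three years.
# In both, each species swings between 4 and 6 units of biomass.
compensating = [[6, 4], [4, 6], [5, 5]]  # species fluctuate out of phase
synchronous = [[6, 6], [4, 4], [5, 5]]   # species fluctuate in phase

print(community_biomass_cv(compensating))
print(community_biomass_cv(synchronous))
```

Each species is equally variable in both made-up communities, yet the community-level CV differs, which is exactly why “stability” has to be pinned down before it can be tested.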

Jeremy’s favorite zombie idea (the intermediate disturbance hypothesis IDH) is another great example. The IDH theory was based on verbal intuition about competition not running to completion (assumed to be competitive exclusion) with frequent enough disturbance. However rigorous mathematical models show that “running to completion” is not really the important point for competitive exclusion but whether or not one species gains relative to another competitively over time is more important. Again, people could and did have these ideas verbal-intuitively, but the math just made things more clear.

So to recap – modelling is critical to science, and a precise language of abstraction is critical to modelling. You can see where this is heading … mathematics is a heck of a precise language of abstraction. It’s not the only one, but it is a heck of a good one. And here I define math rather broadly to include pictures, simulations etc.

Argument 2 – variance

Let me completely leave modelling for a minute and give another argument for math – variance.

Ecological data is full of variance. Indeed variability makes ecology exciting. But it also makes ecology challenging. The human mind is not well adapted to reasoning about variability. You can cite Kahneman on this or the fact that almost as many people died in car accidents because of an irrational avoidance of flying after 9/11 as died in the 9/11 terrorist attack. This is a well-documented phenomenon in risk science. Personally, I like to cite the birthday problem. It asks how many people you need to have in a room to have at least a 50% chance that two of them have the same birthday (day of year, not necessarily same year). I regularly pose this to my statistics class, and they are way off, as indeed is almost everybody who hasn’t heard this problem before. What is your guess? The correct answer is below**. Even when I show my students the probability argument they are disbelieving. Often they want to put it to an empirical test, so, even though my class is always a bit below the 50% chance, and it’s only a 50% chance, I’ve tried it and actually three times in a row it has worked – somebody in the room has shared birthdays! Bottom line point – people are terrible at reasoning about variability and probability.

More relevant to ecology, is the data in the picture below inconsistent with the idea that the x variable is causing the y-variable?

[Figure: noisyline, a plot of noisy x-y data]

It is noisy. But it is definitely not inconsistent with the x causes y hypothesis (indeed, this was my first and only attempt at generating random, noisy data assuming the hypothesis was true). Of course this data is also not inconsistent with a null hypothesis of a flat line, so it is not statistically significant (I didn’t ask that question because most of us have asked that question often enough to have trained our intuition). But I wager you had no real thought process to answer the question of whether the data was inconsistent with the hypothesis.
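For readers who want to try this themselves, here is a rough sketch of the exercise. The true slope, intercept, noise level, sample size and random seed are all assumptions of mine (the values behind the original figure are not given), and the interval uses a normal approximation rather than a t distribution to stay within the Python standard library.

```python
import random
from statistics import NormalDist, mean


def fit_line(x, y, level=0.95):
    """Ordinary least squares fit of y = intercept + slope * x.

    Returns (slope, intercept, (lo, hi)), where (lo, hi) is an approximate
    confidence interval for the slope based on a normal approximation.
    """
    n = len(x)
    xbar, ybar = mean(x), mean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    rss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    se = (rss / (n - 2) / sxx) ** 0.5  # standard error of the slope
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return slope, intercept, (slope - z * se, slope + z * se)


# Hypothetical settings: x really does cause y, but the noise is large.
random.seed(1)
TRUE_SLOPE, TRUE_INTERCEPT, NOISE_SD, N = 0.5, 1.0, 5.0, 30
xs = [random.uniform(0, 10) for _ in range(N)]
ys = [TRUE_INTERCEPT + TRUE_SLOPE * xi + random.gauss(0, NOISE_SD) for xi in xs]

slope, intercept, (lo, hi) = fit_line(xs, ys)
print(f"estimated slope = {slope:.2f}, approx. 95% interval = ({lo:.2f}, {hi:.2f})")
print("consistent with the true slope:", lo <= TRUE_SLOPE <= hi)
print("consistent with a flat line:   ", lo <= 0.0 <= hi)
```

Changing the seed or the noise level changes which of the two checks comes out true, which is rather the point: the eye alone is a poor judge of either question.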

So what is one to do in a field that is full of variability and stochasticity when the human mind is terrible at reasoning about probability? It will not surprise you that my answer is … duh, duh, duh … mathematics. This would be why the one required math course in many graduate ecology programs is statistics. And every graduate student I know wants to learn more statistics.

Implications for individual scientists

So, I have argued that mathematics is important for science in part because of its utility in modelling and its ability to counteract our poor reasoning skills around variance. I have not anywhere argued that it is not science without mathematics, that the essence of science is mathematical description or any such. I’ve just argued that mathematics is really, really useful to science to the point of being more than just useful but important to science. What does this mean for the individual and for the social structure of science?

[Figure: emp_math, a Venn diagram of empirical and mathematical work]
It seems rather obvious that one can use a Venn diagram to locate work – namely pure empirical, pure mathematical and an intersection or blend (you can use a continuum along an arrow if you want to be less binary). The first point I want to make is that this diagram is scale dependent. One paper that is purely mathematical but then followed by a test is blended at the scale of two papers (or perhaps one career).

But leaving aside the issue of scale for a minute, first I want to note that there is a lot of work outside the intersection, i.e. in just the empirical or just the mathematical. There are whole journals full of purely mathematical work that are just cloaking themselves in biology. I recall interviewing for graduate school with a famous evolutionist who was bragging to me about how they had solved an equation that even people in the math department thought was hard – but he never mentioned why it was biologically interesting. I didn’t go to graduate school there. Although I am in the minority given the impact it had, I would argue May’s aforementioned diversity-stability work is also in the purely mathematical box – he had no data and his definition of stability was mathematically convenient but biologically unrealistic. I am asked to review papers like this all the time. Lest I be perceived as the outsider criticizing, I won’t take the pure empirical to task, but it definitely exists. So a choice of where to do work in this Venn diagram does exist.

Given you have a choice, I want to make the point that by far the most influential bodies of work are in the center of the Venn diagram (check out the list of ESA’s Mercer award winning papers). And the most influential ecologists were in the middle (check out the list of ecologists honored in the National Academy of Sciences). There are plenty who lean more empirical (Gene Likens) or more theoretical (Simon Levin) but all have managed to avoid the extremes. And more particularly, all have had collaborations to move to the center. Or they publicized and translated their work well enough that the other side picked up their work from the literature. So in other words, they found ways that at larger scales their work was in the center. And I’m pretty sure they didn’t have successful collaborations by telling their partners that the partner’s half of the story was trivial and you could replace them easily! And a surprising number of members of the National Academy are in that rare but lucky category that are able to move fluently between theory and field by themselves.

So this is my bottom line. Both horizontal arrows in my first diagram involve linking math and the real world. The use of variability is also about linking math and data. Good science is about the fusion of the two! It is not about one. It is not about the other. It is about both! So this choose one/which is better debate is entirely off target.

What I conclude is that if you are innately more mathematical, then if you want to do great science, you will spend your whole career finding collaborations, graduate students/postdocs, inspiring papers and self-educating to add in the real world component so as to move to the center. If you are on the innately more empirical side, then, if you want to do great science, you will spend your whole career finding collaborations, graduate students/postdocs, inspiring papers, and self-educating to add in math so as to move to the center. To say that you have to be great at math to be a great scientist is wrong just as it is wrong to say you have to be great at field work to be a great scientist. But anybody who tells you that you can hang out at one extreme and ignore the other side or trivially fill in the other side is doing you a disservice (or giving you a formula for doing less than great science). If you are driven to do great science, you will likely spend your whole life working (and I do mean working) to get to the center***.


*It is worth noting this debate is as old as science itself. The Greeks favored a purely deductive (theory) approach based on pure logic rather than the dirty real world. Many don’t realize it but Euclid’s geometry based on postulates and proofs was seen as the best approach for all science, not just math. Plato’s cave visually captured the Greek world view that the real world was just an ugly distortion of the underlying perfect beauty (I mean that quite literally) accessible only through the logic of the mind. Then of course in the early 1600s people started waking up and realizing this was just a formula for experts to claim they were right. Bacon published Novum Organum arguing for empirical tests as the core of science and the Enlightenment was begun. If you think this is oversimplified (it is) go look at Newton’s 1687 Principia - it is completely framed as a Euclidean postulate-proof deductive (pure theory) work even though if you read a biography it is clear he was highly empirical in his approach (e.g. his pursuit of the inverse square distance law to explain Kepler’s empirically derived elliptical orbits). But since that time empiricism has had a clear ascendancy. No Nobel prize has been awarded for purely theoretical work (even Einstein’s great theory of general relativity was never recognized – his much more empirical work on the photoelectric effect is what he won a Nobel prize for). Big data is only exacerbating this. The days of pure theory being acceptable in mainstream science are gone, and they should be. So ecologists are not unique in this debate – it’s really the central scientific methodological debate.

** 23 or more people in a room mean that you have a >50% chance of some two people sharing a birthday.
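The arithmetic behind that footnote can be checked in a few lines. This is only a sketch under the usual textbook assumptions (365 equally likely birthdays, no leap years, independent people); the function name `prob_shared_birthday` is made up for illustration.

```python
def prob_shared_birthday(n: int, days: int = 365) -> float:
    """Probability that at least two of n people share a birthday.

    Assumes every day is equally likely and people are independent.
    """
    p_all_distinct = 1.0
    for i in range(n):
        # The (i+1)-th person must avoid the i birthdays already taken.
        p_all_distinct *= (days - i) / days
    return 1.0 - p_all_distinct


print(round(prob_shared_birthday(22), 4))  # just under one half
print(round(prob_shared_birthday(23), 4))  # just over one half
```

So 23 is the smallest group size at which the chance of a shared birthday crosses 50%.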

*** you might ask what about those individuals who can naturally do both well. First, in my experience they’re rare. Second, the appropriate center shifts from problem to problem. So even somebody like myself who is fairly mathematical needs to go to more mathematical people on occasion (and vice versa). It is always a constant tuning problem.

Surviving your comprehensive exams

Quite likely the comprehensive exam (aka qualifying exam) is the most feared moment in an academic’s life. It is on my mind right now because it is comprehensive exam season. I am sitting on two comprehensive exams (hereafter comps) this week, advised a 3rd student how to prepare for comps in June and did several more of both activities the last few weeks. I thought I’d share a few thoughts on surviving. I expect a majority of readers have passed their own, but this post should: a) be helpful to those who haven’t, b) provide a place for survivors to share their advice in one place on the web, and c) help those of us who have students and have to help them navigate it.

I’ve served on 3 faculties in the US and Canada, so I think I have a pretty good idea of the range of variation in North America, but to be honest, I have no idea of how things differ in other continents, so keep that caveat in mind. Comps usually have 3 parts (always in the same order):

  1. Presenting your dissertation research proposal to your committee, answering questions about it, defending and revising until your committee is happy. Some places this is a formal part of the comps and possibly public, some places it is more informal and just your committee. Many places you are supposed to write your research proposal like an NSF grant (15 pages); other places they say 10 pages, but students often go way over these limits. I don’t actually recommend exceeding the limits (but I confess mine was 32 pages).
  2. Written exam. Normally 3-5 of your committee members will come up with a question that you have to answer in writing (on a word processor, not handwritten). Two places I’ve been each question was given 3-7 days, was open book and you were expected to do literature searches, find new papers and synthesize them (thus spanning 3-4 weeks). My current place you have 4-6 hours on each question (3-5 consecutive days), often closed book, and more of a “what do you know” question. There are pros and cons either way. Mostly I am partial to the longer questions since it really tests the ability to read literature and synthesize. On the other hand, students who are facing closed book knowledge questions have been in my experience more motivated to study and often transition better to the oral exam.
  3. Oral exam – this is the part that really scares everybody. Just you facing 5 professors asking you questions. Most places I’ve been this is required to last at least 2 hours and no more than 3 hours. Some places have external examiners (outside your committee and department) to keep your committee from being too soft, some don’t.

Usually the written questions will center around topics related to your thesis. Technically in a qualifying exam the oral questions are also centered around your thesis, but in a comprehensive exam questions across your whole field (i.e. all of ecology and evolution) are fair game, hence the name. In practice most are strongly centered around your thesis topic with a few basic broader knowledge questions (if you’re an ecologist know at least the basics of evolution; also know the names and contributions of the 10-20 or so most famous people historically in your field).

There is no way around it – comps are one of the most fear-inducing experiences you will ever have. I get an adrenaline rush but enjoy job talks and interviews, I was cool as a cucumber on my wedding day, and I was well prepared and fully expecting to do well on my comps. But it was still numbing. I found myself walking into walls, culminating in slicing my finger while cooking and going to the emergency room. After this, I started keeping track and large numbers of people I knew had some major incident in the months leading up to comps (driving over their backpack containing their laptop, minor car accident, etc). It is truly a distracted, even out-of-body, time.

I don’t say this to scare you if you are in the miraculous 1% that isn’t stressed. But just to normalize it for the vast majority who may find this the most stressful thing in your life. So does everybody else. A great deal of this is self-induced. Take a type-A personality who has gotten good grades all their life and tell them they’re about to take an exam that could flunk them out of their life dream, and well, we put pressure on ourselves. This is not productive. And not even really rational. Your committee has already sunk time in you – they want you to succeed! Pretty much everywhere will give you a 2nd chance if you fail. And ask around to see how many people in your department even failed once (varies widely but usually in the 3%-10% range) and failed twice (i.e. flunked out – usually it is down in the 1%-3% range), and most of these people you could have predicted in advance by grades, prior negative feedback from the adviser, etc. It’s not particularly likely you’re going to fail!

There is a piece of this fear that many call hazing. And I won’t deny that there is some piece of this in some departments and some individuals. But for every professor who acts this way, there are three who do everything they can to quell it. We all had to pass comps too! And again it’s a waste of our time to fail somebody. Less appreciated is that comps are not just a way of “keeping up the standards” but are actually designed for the benefit of the student, believe it or not! I say this for two reasons:

  1. Comps are a chance where you can force a student to learn something. I can’t tell you how many times I’ve had a student where I told them you really should read such and such paper 3 times and they never do. Then comps come around and I tell them to read the paper, and they do. And they thank me for it afterwards. Students who really need to know genetics but stay away, can be forced to learn genetics. And etc. Some part of comps will be forcing you to fill in the holes you’ve avoided filling in.
  2. You will face these situations in the future. Job talks. Postdoc interview talks. Presenting reports if you work in the government, possibly to hostile audiences. Giving testimony to legislatures or even in courts.

The last is really the main point of comps, and indeed the standard by which comps are judged. If you present as somebody who will go on and do a credible job of sounding knowledgeable and defending your ideas in your defense and job talks, you will pass your comps. This is the real goal.

A brief word on format. Since the 2-2.5 hour goal is nearly universal, this means the schedule looks similar most places. The professors will set up a rotation, usually the furthest from your research first, your adviser last. The first professor is told they have about 20 minutes to ask you questions. When they finish you go to the next professor and on through all four or five people. Many but not all places then take a 5 minute break where you leave the room and they confer. Then usually the professors go around a second time for 5-10 minutes each. Five to eight questions would be typical for one professor in the first round and one to three in the second round.

So on to advice I give students (this is all about orals – writtens vary more in expectations from place to place and are not what most people want advice about anyway):

  1. Just do it – students always want to postpone their comps. I don’t let them. I insist on 4th or 5th semester. Once you’ve got the classes and reading under your belt, you have nothing to gain (indeed possibly more time to forget) by waiting. Comps hang over everything and make you less efficient in your research. GET THEM OUT OF THE WAY!
  2. Prepare. This may sound blindingly obvious. But increasingly I am seeing students who haven’t studied adequately. You should study 10-20 hours/week for 1-3 months. And this is after your two years of course work and independent reading beyond coursework (it’s rather late to start cramming in learning completely new things). I produced a 20 page cheat sheet of everything I thought it was important for me to know. I still refer to it today. If you’ve gotten this far you know how to study – just be sure you do it.
  3. Do the standard tricks. Namely ask each committee member for their recommended reading list. And hold a mock oral comp (and maybe a mock research proposal defense/discussion). Mock comps work best when they are like the real thing. Not 15 grad students shouting out questions. This honestly is a waste of time. Instead, get 5 students who have already passed and are close to graduating, tell each one which of your committee members they are representing (ideally their adviser). And emulate the format (i.e. 2 hours, 20 minutes each person, etc). You’d be surprised how many professors have a question they ask in every comp. Advanced graduate students know what these are and they know what style each professor has. If you let them have fun pretending to be their advisers grilling you, you will have fewer surprises.
  4. You will not know the answer to some questions. That’s OK. Although you are at this point more of an expert than you think, there are still five of us. And it is our job to find out the boundaries of your knowledge, which we can’t do if we don’t ask you a few things beyond your boundaries. I had questions that I didn’t know the answer to. Your adviser did. Every one of your committee members did. You will too. The real key here is how you handle that. Namely don’t panic yourself into a death spiral just because you didn’t know something. It is totally normal. A majority of failed comps I’ve seen occur when a student gets a question they don’t know early on and they lose confidence and start spiraling down. If you don’t know answers to half the questions and you’re two professors in, then panic. Otherwise, hang in there!
  5. It’s OK to say “I don’t know”. This is a corollary to #4. Part of the comp is making sure you know what you know and what you don’t. Sometimes saying “I don’t know” will be the end (usually for specific factual questions). Other times it will lead to a response of “OK, let’s see if we can think this through” followed by some leading questions to help you. Either way, you’re better off than umming and ahing and making up answers.
  6. Play to your strengths. Spend a little more time on questions you know a lot about. From the schedule described above, you can tell that most questions should have answers in the 30 seconds to 2-4 minutes range. A five minute uninterrupted monologue is getting too long for most questions and ten is way too long. Sounding meandering and unable to know when you’ve answered the question is bad. But be sure that if you get a question you know the answer to really well, don’t give a one word answer! Draw some connections. Expand! Comps may seem eternally long, but they are finite and you have some choice what you fill those two hours with. I am NOT saying you should watch your watch during your oral – your focus should be answering the question. But you might want to ask somebody to time you a bit if you do a mock comp. Some questions will be interactive. A professor will ask you to do a task (e.g. draw a particular diagram on the board), and then another (show how it changes in scenario x), and then so on that build on each other. If you sense you’re in this scenario (being asked to go to the board is a good clue) don’t drag things out because the real question is five steps in and the professor wants to make sure they have time to get there. In all I’ve given kind of mixed messages here – but be aware how long your answers are and think how well that is serving you and whether they are frustrating the person asking you.
  7. It’s about attitude not knowledge. Not totally – you definitely still need to study. But attitude is a big part of deciding the outcome. Confident is what you are aiming for. Arrogant is a rare problem (and it is usually a nervous tic or lack of preparation when it happens). Timid and unconfident is a much more common problem. As I’ve said, people want to see you comfortable putting your knowledge out there and standing up for your opinions. Don’t keep checking your questioner’s face to see if you’re getting it right or not. If you’re answering (and not going for #5 above), sound like you know you’re right. It is usually blindingly obvious to your committee when you’re losing your nerve and when you’re digging in and plowing ahead even if you’re on the ropes for a moment. You will have moments of both, but try to have more of the latter.
  8. Get some rest – as I hope I’ve emphasized, there is a significant psychological component to this. Cramming until 2:00AM the night before is the wrong thing. Get your exercise, take care of yourself, go out to a movie the night before and get a good night’s sleep. If possible schedule the comp in the morning or afternoon depending on when your body rhythms have you most awake (although increasingly, scheduling the comp with 5 professors is such a challenge you may have to let this one go).
  9. Have fun! – This might sound impossible. But the comps that go the best are ones where people go in thinking that they’re looking forward to having an intellectual discourse and treat it as a bit of a game. And two of the last three comps I sat in on the student actually did say it was fun afterwards. And not coincidentally, they both did great.

So if you haven’t had your comps yet, they’re not as bad as you think, have some fun, and good luck! If you have passed (or have been advising your students how to pass for 30 years), what advice would you add?

Ecologists need to do a better job of prediction – Part IV – quantifying prediction quality

I have been working on a series of posts on why ecologists need to take prediction more seriously as part of their mandate as scientists. In Part 1- I argued that an ANOVA/p-value mentality is killing us. In Part 2 – I argued that the rigorous discipline of putting out quantitative predictions and then checking to see if they are right is good for a discipline (with weather prediction as an example). In Part 3 – I argued that a pure reductionist view of where to look for mechanistic models to produce predictions was flawed. Throughout, I argued (and many commenters agreed) prediction is not just for the applied questions. Prediction, done right, brings a level of rigorous honesty that helps basic science advance more quickly too.

This is my last post on the topic (you can breathe a sigh of relief), and wasn’t originally planned. But so many commenters, especially on the first post, wanted more about the statistics of measuring predictions. In Part 1 – I made it pretty clear I thought the p-value was overrated and effect size and r2 needed more attention. But of course it is more complicated than that. Questions were raised about AIC and a bunch of other metrics (and no questions were raised about some of the metrics that I personally think are most important). So the following is a tutorial on measuring the quality of a prediction in a quantitative fashion.

Goodness of fit – continuous variables

Let’s take the simplest model y=f(x,θ)+ε. Let y (but not necessarily x) be a continuous (aka metric) variable like temperature or mass or distance, or to an approximation abundance if the abundances are large. If we define the model f and the parameters θ, then if we have a new value x (e.g. environmental conditions), then we can predict the corresponding new value \hat{y} (pronounced y-hat) and compare it to the observed value y. Of course in a perfect prediction \hat{y} would always equal y (i.e. error ε=0), but ecology is a long ways from that Nirvana! So we want to measure the error/deviance/inaccuracy/failure of prediction. How do we do it?
One of the world’s all-time geniuses, Carl Friedrich Gauss, came up with the first insight. If we define \epsilon_{i}=\hat{y_i}-y_i to be the prediction error for one data point x_i then we can look at sum-squared-error or SSE=\sum\epsilon_i^2. Minimizing this SSE was Gauss’ criterion for picking the best fit line, leading directly to modern Ordinary Least Squares (OLS) regression. However, there is a problem with SSE as a measure of goodness of fit – it is completely uninterpretable. Is SSE=40502 a good fit or a bad fit? The answer totally depends on the number of data points and the innate variance in the data (after all if two widely different y values are observed for the same x, no model can fit that). To solve this problem the idea of error partitioning was developed. If we define total-sum-of-squares (i.e. a proxy for total variance) as SST=\sum (y_i-\bar{y})^2 where \bar{y} is the average value of y, then we have R^2=1-\frac{SSE}{SST}.
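As a concrete illustration of the error partitioning just described, here is a minimal Python sketch that computes SSE, SST and R2 directly from their definitions. The function names and the four data points are made up purely for illustration.

```python
def sum_squared_error(observed, predicted):
    """SSE: sum of squared prediction errors (y_hat - y)."""
    return sum((p - o) ** 2 for o, p in zip(observed, predicted))


def total_sum_of_squares(observed):
    """SST: squared deviations of the observations from their own mean."""
    mean_y = sum(observed) / len(observed)
    return sum((o - mean_y) ** 2 for o in observed)


def r_squared(observed, predicted):
    """R2 by error partitioning: 1 - SSE/SST."""
    return 1.0 - sum_squared_error(observed, predicted) / total_sum_of_squares(observed)


# Made-up observations and predictions, for illustration only.
y = [2.0, 4.0, 6.0, 8.0]
y_hat = [2.5, 3.5, 6.5, 7.5]

print(sum_squared_error(y, y_hat))      # 1.0
print(total_sum_of_squares(y))          # 20.0
print(round(r_squared(y, y_hat), 4))    # 0.95
```

An SSE of 1.0 means nothing on its own; dividing by the SST of 20 is what turns it into the interpretable statement that 95% of the variation is accounted for.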

R2 is extremely useful. In particular it ranges between 0 and 1, making it easy to compare success of models across different amounts of data, different data, and even models of different things. The last point bears expanding - R2 even lets us say things like models of predicting metabolic rate by body size (R2=0.9 or so) are much better than models of abundance vs temperature (R2=0.2 or so). There are limits to this – and I certainly wouldn’t make a big deal of small differences, but there is no other metric that is even credible for this kind of cross-disciplinary comparison. R2 also is independent of units – I get the same R2 value whether my y variable is measured in grams or kg. If all I’ve accomplished is getting more ecologists to pay attention to and report R2, I would be very happy!

But there are of course complications! The first is it was noticed that in the simple linear model (y=ax+b+ε), R2=r2 where r is the Pearson correlation coefficient (which itself ranges from -1 to 1). That is kind of cool! But it now means we have two definitions and ways of calculating R2. And outside of the simple linear model (e.g. nonlinear models), they do NOT give the same answer. So at a minimum, you have to specify how you calculate it, which is often done by using R2 to report the error partitioning approach and r2 to report the correlation approach. If you want to look good, definitely use the r2 version. It will always be ≥0. If your model is worse than the null model (horizontal line of y=c), then R2 can be less than zero (and yes I have produced models with this outcome!). Even worse, the r2 approach can sweep systematic bias under the rug. Imagine a plot of predicted vs actual values (x-axis is predicted value \hat{y}, y-axis is observed value for y). In a good model the data should all be close to the 1-1 line (passing through the point 0,0 with a slope of 1, i.e. 45 degrees). R2 directly measures this. But now imagine you are constantly over predicting (see figure below). You correctly track that when x goes up a certain amount, y goes up a certain amount, but every prediction is greater than the observed values. This would cause a nice tight line to form in the plot that is off the 1-1 line. This could have an r2 of 1 even if no point is on the 1-1 line! For these reasons, serious tests of prediction should use R2 rather than r2 (with the added benefit that R2 generalizes to n-dimensional x-variables).

Assessing goodness of fit in a plot of predicted vs. observed (with 1-1 line in red). Blue dots are unbiased. Here R2=r2=0.9842. Green dots are biased (consistently over estimate), but otherwise identical. Here r2=0.9842 as before, but R2=0.830. RMSE for blue dots =1.28 but RMSE for green dots=4.20.
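To make the difference between the two definitions concrete, here is a small sketch (made-up numbers, hypothetical function names) in which every prediction is exactly 3 units too high. The correlation-based r2 is perfect while the error-partitioning R2 is strongly negative.

```python
import math


def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def r_squared(observed, predicted):
    """Error-partitioning R2 = 1 - SSE/SST, measured against the 1-1 line."""
    mean_y = sum(observed) / len(observed)
    sse = sum((p - o) ** 2 for o, p in zip(observed, predicted))
    sst = sum((o - mean_y) ** 2 for o in observed)
    return 1.0 - sse / sst


# Made-up data: the model tracks the trend but over-predicts by 3 everywhere.
observed = [1.0, 2.0, 3.0, 4.0, 5.0]
biased_prediction = [o + 3.0 for o in observed]

print(pearson_r(observed, biased_prediction) ** 2)  # 1.0: the bias is invisible
print(r_squared(observed, biased_prediction))       # -3.5: worse than predicting the mean
```

The numbers here are deliberately extreme; the figure's example is milder, but the mechanism is the same.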

So R2 does a lot of nice things, including the fact that it relativizes the error to the innate variance in the data. But sometimes we don’t want to relativize our error. Let’s take an example from my (and others’) research where R2 doesn’t tell the whole story. We’re trying to interpolate temperatures between weather stations. In many regions weather stations are 100s of km apart (with small weather-influencing details like mountains in between). So we are evaluating different methods. If I find that a particular method has an R2 of 70%, I am probably moderately happy – my method is explaining a majority of the variability. But if you are a user looking to use these layers, do you know what you want to know? Probably not. In fact the R2 statistic is pretty meaningless if you don’t have an innate sense of variability of the inherent error. What you probably want to know is something like the average error is 1 °C or the 95% confidence interval is ±2 °C. Confidence intervals (and bootstrapping methods that produce them) are well-known to ecologists, so I won’t dwell on them further here (except to say that they do require a sensible mode of bootstrapping to produce which doesn’t always exist).

I want to look at the other concept “average error”. Literally this would be Mean Squared Error or MSE=SSE/N where N is the # of observations. Except it still has the square in it (recall if we didn’t square before summing, then the errors would cancel each other out and sum to zero in a linear regression). This problem of squaring is fixed not surprisingly by introducing a square root – RMSE=Root Mean Squared Error or RMSE=\sqrt{SSE/N}. This is widely used by engineers and meteorologists. In fact, you also already know this in a different inferential context as, essentially, the standard error of the regression (the residual standard error). But in a prediction context it is called RMSE. Its main virtue is that it is in the same units as y. So I can tell you that the RMSE prediction at a point between weather stations is 1 °C. Now this is useful to you as an ecologist trying to use these layers. It is also an excellent measure of prediction accuracy. It has lost its relativism (RMSE of 1 °C is excellent – indeed unachievable – for my application but really poor for the accuracy on repeat measures of a $1,000 temperature probe). And it is now units dependent – predicting body mass in kg and getting an RMSE of 1 kg will turn into an RMSE of 1000 g if you measure body mass in g. But RMSE is really concrete. It also has the nice feature that it is more like R2 than r2. Indeed there is a formula directly relating RMSE to bias: RMSE=\sqrt{Variance+Bias^2}. Indeed RMSE is one measure that attempts to assess trade-offs between bias and variance (a model with low variance but very high bias may well be worse than a model with no bias and moderate variance). This came up recently deep into the comments on my post on detection probabilities. One wrinkle is that RMSE, depending on squaring errors, is very sensitive to outliers and fat tailed error distributions. If this is a problem in your data you can use Mean Absolute Error (MAE=\frac{1}{N}\sum |\epsilon_i|).
Note that OLS linear regression effectively minimizes RMSE (it technically minimizes SSE, but since RMSE is a monotonic transformation of SSE, the optimization chooses the same answer), while LAD (least absolute deviations) or median regression, a form of robust regression, chooses the line that minimizes MAE, thus both methods select “best prediction” lines. There are dozens of other measures you can use, but in my opinion if you look at R2 and RMSE (or MAE) you have really got 90% of the value of prediction metrics.
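A short sketch of these unit-carrying metrics follows. It takes MAE as the mean of the absolute errors (as the name implies), and it reads “Variance” and “Bias” in the decomposition as the variance and the mean of the prediction errors in the sample at hand; that reading, the function names and the temperatures are all assumptions made for illustration.

```python
import math


def prediction_errors(observed, predicted):
    """Errors defined as y_hat - y, matching the definition of epsilon above."""
    return [p - o for o, p in zip(observed, predicted)]


def rmse(errors):
    """Root mean squared error, in the same units as y."""
    return math.sqrt(sum(e ** 2 for e in errors) / len(errors))


def mae(errors):
    """Mean absolute error: less sensitive to outliers than RMSE."""
    return sum(abs(e) for e in errors) / len(errors)


def bias(errors):
    """Mean error: positive when the model over-predicts on average."""
    return sum(errors) / len(errors)


def error_variance(errors):
    """Spread of the errors around their own mean (divide by N, not N-1)."""
    b = bias(errors)
    return sum((e - b) ** 2 for e in errors) / len(errors)


# Made-up temperatures in degrees C.
observed = [10.0, 12.0, 14.0, 16.0]
predicted = [11.0, 14.0, 15.0, 18.0]
errs = prediction_errors(observed, predicted)

print(rmse(errs))   # sqrt(2.5), roughly 1.58 degrees C
print(mae(errs))    # 1.5 degrees C
print(math.sqrt(error_variance(errs) + bias(errs) ** 2))  # same value as rmse(errs)
```

In this example most of the RMSE comes from the bias term (1.5 squared is 2.25 of the 2.5 total), which is exactly the kind of information r2 would hide.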

Goodness of fit – binary and discrete variables

Now, what if y is a categorical variable rather than continuous? There are literally whole books written about categorical variables, and I’m not going to go too far into the topic here. Indeed I am going to limit myself only to a subset of categorical variables – binary variables (i.e. 0/1 or true/false or present/absent). These are extremely common dependent variables in ecology (any model that predicts survivorship or presence/absence is likely using a binary dependent variable). Your first instinct might be to keep using R2. If you do you will be told you are wrong to do so. And there is a problem. Since the observed y values can only be 0 or 1, and the predicted values are usually some intermediate value such as 0.732 the distances between observed and predicted are artificially large. Or put in other terms, the assumption of normal errors which underlies much of the justification for R2 (and even RMSE) is violated with binary variables. But in fact, as long as you accept (and know) the fact that R2 used on binary y variables is never going to be as high as R2 on continuous variables it actually works rather well. And it even has a name – the point-biserial correlation (as the name implies this is usually calculated as r2 rather than R2). This is my favorite metric of prediction on binary variables, but I am rather out of the mainstream.

Not surprisingly given the demonstrable proclivities of ecologists for statistical machismo, ecologists have embraced a much harder to calculate and harder to interpret statistic known as Area Under the Curve (AUC), which is based on an idea in signal detection theory known as the receiver operating characteristic (ROC) curve. It is not my goal to provide a full introduction to this metric here. But basically it assumes there is a threshold or cut-off that can be varied and then it measures how much of a trade-off there is between true positive rates vs false positive rates. Imagine a model predicting species presence or absence as a function of climate. This will produce predictions at each point in space that look like a probability of presence of say p=0.732. To get back to the binary present/absent, we have to choose a threshold. An obvious one is p_{threshold}=0.5. If predicted p>0.5 then we predict present, otherwise we predict absent. Obviously if p_{threshold} were set at p=0 we would predict present everywhere, giving us many true positives and no false negatives but lots of false positives (and the opposite for p=1.0). In a perfect (but impossible) world any value of p>0 and p<1 would work equally well. This gives an AUC (area under curve) of 1.0 (the curve goes from 0,0 to 1,1 and stays inside the 1×1 unit box so the maximum area under the curve is 1). Alternatively, as a null model, if we just take the original proportions of presences (\hat{p}) and for each point randomly flip a coin with probability \hat{p} of coming up present, then our false positive and negative rates would depend only on the thresholds and the area under the curve would be 0.5. Even if you skipped most of the last few sentences, this is the important point – the null model value of AUC is 0.5. An AUC of 0.5 is the same as an R2 of 0. And an AUC <0.5 is the same as an R2<0! So next time you see an AUC of 0.6 don’t be too impressed.
It is possible to rescale AUC to run from 0-1 to seem more like R2 (i.e. (AUC-0.5)/0.5) but the analogy is still misleading. There are dozens of other statistics commonly in use. Some even let you specify your preferences for errors of omission (false negatives) vs commission (false positives). But at that point, I’d rather just publish a 2×2 table of true and false positives vs true and false negatives.
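For readers who want to see the pieces, here is a minimal sketch of both the AUC and the plain 2×2 table. Rather than tracing the ROC curve threshold by threshold, it uses the equivalent rank formulation: AUC is the probability that a randomly chosen presence gets a higher predicted score than a randomly chosen absence, with ties counting half. The function names and the six sites are made up for illustration.

```python
def auc(observed, scores):
    """AUC via pairwise comparison of presences (1) against absences (0)."""
    presences = [s for o, s in zip(observed, scores) if o == 1]
    absences = [s for o, s in zip(observed, scores) if o == 0]
    wins = 0.0
    for p in presences:
        for a in absences:
            if p > a:
                wins += 1.0
            elif p == a:
                wins += 0.5  # a tie is a coin flip
    return wins / (len(presences) * len(absences))


def confusion_table(observed, scores, threshold=0.5):
    """2x2 table of true/false positives and negatives at one threshold."""
    table = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for o, s in zip(observed, scores):
        predicted_present = s > threshold
        if predicted_present and o == 1:
            table["TP"] += 1
        elif predicted_present and o == 0:
            table["FP"] += 1
        elif not predicted_present and o == 1:
            table["FN"] += 1
        else:
            table["TN"] += 1
    return table


# Made-up presences/absences and predicted probabilities of presence.
observed = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2]

print(auc(observed, scores))              # 8 of 9 pairs ranked correctly
print(auc(observed, [0.5] * 6))           # 0.5: an uninformative model
print(confusion_table(observed, scores))  # {'TP': 2, 'FP': 1, 'FN': 1, 'TN': 2}
```

A model that gives every site the same score lands at exactly 0.5, which is the null-model baseline to keep in mind when reading a reported AUC.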

A note on comparing models, AIC and other information criteria

R2 is in general a wondrously relativized number that can be used to compare prediction quality across datasets, types of models etc. But it is an inescapable fact (indeed a mathematically provable fact) that as the number of explanatory variables going into a regression goes up, R2 must also go up (or at least cannot go down). This means R2 is not always the best tool for comparing two models with different numbers of parameters (although the problems are often overstated – big differences in R2 are always meaningful in practice). To correct for this, ideas like adjusted R2 and Mallows’ Cp were invented. More recently, Akaike’s Information Criterion (or AIC) has emerged (where AIC=-2LL+2k and LL is log-likelihood and k is # of parameters). AIC can be used to compare two models with different numbers of parameters. It should be noted that AIC has a relationship to our starting point, SSE. Namely for normal errors, AIC=n ln(SSE/n)+2k (up to an additive constant). So AIC is in a certain fashion a measure of goodness of fit. But just like SSE it is not that useful by itself (is an AIC of 273 a good fit or bad?). And AIC is only good for comparing along one dimension – between models across the same data. It is a mistake to use AIC to compare one model on two datasets for example, and certainly it cannot be used as a general comparison of two totally different models (unlike R2). But if you really want to compare two models on the same dataset AIC is the way to go. Or maybe AICc or BIC or QIC or … That is the main problem with AIC – it is one choice out of an infinite list of possible weightings of goodness of fit vs. number of parameters. It has some logical justification if you think information theory explains the world. Otherwise, it is a bit arbitrary. Personally, I prefer reporting different R2 values and different numbers of parameters and letting the reader choose what is the best model.
The use of model comparison as an inferential approach also suffers from certain logical pitfalls (finding the best model out of a list of bad models doesn’t necessarily advance science). Thus, although I use AIC in some contexts (namely comparing different linear regression models), it doesn’t add much to assessing goodness of prediction in the sense I’ve talked about.
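The least-squares form of AIC quoted above is easy to compute by hand. The sketch below uses that form, which drops additive constants that cancel when both models are fit to the same data; conventions also differ on whether k counts the error variance, so k here is simply whatever count you apply consistently to both models. The SSE values and sample size are invented for illustration.

```python
import math


def aic_from_sse(sse, n, k):
    """AIC for normal errors, up to an additive constant: n*ln(SSE/n) + 2k."""
    return n * math.log(sse / n) + 2 * k


# Two hypothetical models fit to the SAME 50 observations.
n = 50
aic_simple = aic_from_sse(sse=120.0, n=n, k=2)   # e.g. slope + intercept
aic_complex = aic_from_sse(sse=118.0, n=n, k=5)  # three extra parameters

print(round(aic_simple, 2), round(aic_complex, 2))
print("simple model preferred" if aic_simple < aic_complex else "complex model preferred")
```

Here three extra parameters buy a drop in SSE from 120 to 118, which is not enough to pay the 2k penalty, so the simpler model has the lower AIC. Neither number says whether either model predicts well.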

The experimental design aspect – predicting on independent data

So far I’ve talked about a few small cheats in the world of prediction (like using r2 instead of R2), but there is one whopper – failing to distinguish calibration vs. validation. Calibration is the process of picking the best possible parameters to fit one set of data. A simple example is using OLS (minimizing SSE) to pick the slope and intercept of a linear regression. Validation is taking the parameters picked using one data set and testing them on a separate independent data set. The reason this is important is the issue of overfitting. Overfitting is when the model fits not just the signal but the noise/error in the data. This is easy to do when you have a really flexible model with lots of parameters. As Fermi quoted Von Neumann “with four parameters I can fit an elephant, and with five I can make him wiggle his trunk” (which believe it or not led to a scholarly paper showing he was right!). If you have enough parameters to draw an elephant, you will fit every bump and wiggle of the noise, and if you do, you’ll get a really awesome R2, but when you move to your next dataset (this is the validation step), your R2 will be terrible (by definition noise does not transfer repeatably).
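A toy version of this is sketched below. Four made-up calibration points come from a straight-line signal (y=x) plus noise. A four-parameter cubic, built here by Lagrange interpolation so that it passes through every calibration point, is compared with a two-parameter OLS line, first on the calibration data and then on separate made-up validation points from the same signal. The data and function names are assumptions for illustration only.

```python
def fit_line(xs, ys):
    """OLS slope and intercept for a simple linear regression."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x


def interpolating_polynomial(xs, ys, x_new):
    """Lagrange polynomial through every calibration point (one parameter per point)."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        weight = 1.0
        for j, xj in enumerate(xs):
            if j != i:
                weight *= (x_new - xj) / (xi - xj)
        total += yi * weight
    return total


def r_squared(observed, predicted):
    """Error-partitioning R2 = 1 - SSE/SST."""
    mean_y = sum(observed) / len(observed)
    sse = sum((p - o) ** 2 for o, p in zip(observed, predicted))
    sst = sum((o - mean_y) ** 2 for o in observed)
    return 1.0 - sse / sst


# Made-up data: true signal is y = x, plus different noise in each data set.
x_cal, y_cal = [0.0, 1.0, 2.0, 3.0], [1.0, 0.0, 3.0, 2.0]
x_val, y_val = [0.5, 1.5, 2.5], [0.0, 2.0, 2.0]

slope, intercept = fit_line(x_cal, y_cal)
line = lambda x: intercept + slope * x
cubic = lambda x: interpolating_polynomial(x_cal, y_cal, x)

for name, model in [("line", line), ("cubic", cubic)]:
    cal = r_squared(y_cal, [model(x) for x in x_cal])
    val = r_squared(y_val, [model(x) for x in x_val])
    print(name, "calibration R2 =", round(cal, 3), "validation R2 =", round(val, 3))
```

On the calibration data the cubic looks perfect (R2 of 1 against 0.36 for the line), but on the validation points the ranking flips: the line keeps an R2 of about 0.6 while the cubic falls slightly below zero.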

So to avoid this problem, the proper method is to choose parameters that give the best fit and measure the goodness of fit on independent datasets. The exact method of doing this varies. In regression trees, one builds a very complex tree that one knows is overfit. Then one applies the tree to new data and walks it back (one by one removes the lowest branches) until the goodness of fit (often  R2) is maximized on the new data. In classical regression, a separate validation is not necessary IF you make certain assumptions about the data and errors (normal, independent, constant variance) because we can actually model the transferability. The methods for validation are really as diverse as the methods for modelling.

This is why I like prediction so much – it clearly separates out the calibration vs. the validation steps. Calibrate on today. Predict then validate on tomorrow. Of course as commenters have pointed out, one can do hindcasting (calibrate on today, predict in the past or vice versa) or lateral casting (predict in North America, validate in Europe). These all work equally well. But the real key is the validation data must be independent of the calibration data! Any pretense of validation that fails to do this is at least as bad as (indeed is pretty much doing the same thing as) pseudoreplication.

One common approach to getting independent data that comes from the machine-learning world is to use a hold-out or cross-validation technique. Holdout is when one randomly selects say 70% of the data and calibrates on that and then validates on the remaining 30%. Cross-validation is a slightly fancier version where one calibrates on 90%, validates on the remaining 10%, then repeats this 10 times with each separate chunk of 10% held out once, and averages the results. These techniques work great when all the data points are truly independent. These techniques work poorly when there is spatial or temporal autocorrelation linking the points (the 30% that were held out are not truly separate from the 70% used to calibrate). Nearly every paper I have seen on niche models fails to realize this (or at least plows ahead anyway). In the study I just linked to this simple effect artificially inflated R2 from close to zero to a very respectable 60 or 70% – pretty misleading.
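
The difference between random and autocorrelation-aware hold-outs comes down to how points are assigned to folds. The sketch below is one possible way to do it, not a standard library routine: `random_folds`, `blocked_folds` and `cross_validated_sse` are hypothetical helper names, and the blocked version assumes the data are already ordered in time or along a transect.

```python
import numpy as np

def random_folds(n, k, rng):
    """Classic k-fold: assign each of n points to one of k equal-sized folds at random."""
    return rng.permutation(n) % k

def blocked_folds(n, k):
    """Assign contiguous runs of points to the same fold, so that neighbours
    in time or space are held out together rather than split across folds."""
    return (np.arange(n) * k) // n

def cross_validated_sse(x, y, folds, degree=1):
    """Calibrate on all folds but one, predict the held-out fold, and sum the squared errors."""
    sse = 0.0
    for fold in np.unique(folds):
        held_out = folds == fold
        coeffs = np.polyfit(x[~held_out], y[~held_out], degree)
        sse += float(np.sum((y[held_out] - np.polyval(coeffs, x[held_out])) ** 2))
    return sse

print(blocked_folds(10, 5))                             # [0 0 1 1 2 2 3 3 4 4]
print(random_folds(10, 5, np.random.default_rng(2)))    # same fold sizes, shuffled
```

With autocorrelated data, scoring the same model with both fold assignments is a quick check on how much of the apparent skill comes from held-out points sitting next to calibration points.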

Summary

So a long post! To summarize my recommendations:

  1. For continuous variables, report R2 (not r2) and RMSE (or MAE)
  2. For binary variables leave the prediction as probabilities and report the point-biserial correlation. If you HAVE to turn probabilities of presence (or whatever) into 0/1 binary variables, choose an appropriate threshold (non-trivial), then just report the 2×2 table. I am not a fan of AUC – it is hard to interpret and inherently misleading with a base of 0.5
  3. Probably even more important than the statistical metrics is the question of experimental design and inferential context. You must think through how to separate calibration from validation. If your validation is not statistically independent of the calibration data, you are committing pseudoreplication. And worse, without validation, you’re not predicting, you’re just curve fitting!
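
For recommendation 1, a toy example shows why r2 and R2 are not interchangeable. This is an illustrative sketch (the function name `prediction_metrics` and the five data points are invented): predictions that are perfectly correlated with the observations but consistently 2 units too high get r2 = 1 while R2, RMSE and MAE all expose the bias.

```python
import numpy as np

def prediction_metrics(y_obs, y_pred):
    """Return r2 (squared correlation), R2 (1 - SSE/SST), RMSE and MAE."""
    y_obs = np.asarray(y_obs, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_obs - y_pred
    r = np.corrcoef(y_obs, y_pred)[0, 1]
    return {
        "r2": r ** 2,
        "R2": 1.0 - np.sum(err ** 2) / np.sum((y_obs - y_obs.mean()) ** 2),
        "RMSE": np.sqrt(np.mean(err ** 2)),
        "MAE": np.mean(np.abs(err)),
    }

observed = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
biased_prediction = observed + 2.0   # right pattern, wrong level

print(prediction_metrics(observed, biased_prediction))
# r2 = 1.0, R2 = -1.0, RMSE = 2.0, MAE = 2.0
```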

Ecologists need to do a better job of prediction – Part III – mechanistic or phenomenological?

So I have been arguing that in order for ecology to progress as a science, we need to stick our neck out and make risky predictions that might actually be wrong (here and here). That’s all fine and good, but the obvious question is how to make such risky predictions.

In particular, many comments on previous posts have raised the issue of whether the predictions are mechanistic or phenomenological. The mainstream view in ecology is very reductionist – to explain communities we have to make our explanations in terms of populations – to explain populations we have to make our explanations in terms of individual behavior and physiology – to explain behavior and physiology we have to look at endocrine systems, proteins, etc. With evolution mixed in there somehow. This is almost a holy doctrine in ecology. And extended to prediction, it says we have to make predictions that build up from the little pieces with a thorough understanding of what is causing things. At the other extreme is the Rob Peters instrumentalist point of view. Peters said that we can never know mechanism (he told a colleague of mine at McGill University that we don’t know that inheritance works by genes and that genes are just a human construct). His solution is a bunch of regression – variable x is related to y. And if we know y then we can predict x. For both of the readers who have followed my work closely, it will come as no surprise that I take a somewhat out of the mainstream stance – namely that mechanism is a nice-to-have, prediction is a must-have. Or a more nuanced version is that mechanism is a lot more slippery and less black-and-white than we ecologists like to give it credit for.

Before arguing my case, I want to detour to an example enough outside of our field that we won’t get emotional about. I was put on this topic by a great post at the Mermaid’s Tale blog. They talk about the question of predicting which individual humans will contract a particular disease. Obviously something of high practical relevance but also something that really tests the progress of medical science. Based on some papers mentioned, I am going to abstract the problem a little bit to predict the height of an individual since this is something we know a great deal about. One can imagine several approaches to tackling this:

  1. Big data – collect a bunch of data about an individual’s geographic ancestry (different groups of people do have different average heights), per capita GDP in country of birth at time of birth (diet quality influences height), gender, etc. Build a regression model.
  2. Reductionist – use QTL mapping or more modern methods to identify which genes most strongly influence height, assess the presence or absence of these genes in an individual and predict height.
  3. Phenomenological – Use Galton’s regression approach of looking at mid-parent height and heritability.

All of these methods have been used to predict human height. First question- which of these models is most “mechanistic”? Second question, which of these models is most predictive?

Most mechanistic? – Most ecologists would say #2, the reductionist approach, is most mechanistic. This is because of our (trained) intuition that mechanism comes from smaller things, not things of the same size (our parents of #3) or larger (the environmental context of #1). But is it really? The chain of causality from gene presence/absence to adult height is incredibly complex (and inherently a limited part of the picture – diet really does matter). Does approach #3/phenomenology not really tell us the same story (genes and environment) but in a much more useful way (regression and variance around the line)? And is not #1 in some ways more comprehensive, covering both genes and environment as causal factors? I have argued along with Jeff Nekola that ecology is really causing itself grief by ignoring mechanisms right in front of our faces because of our reductionist biases.

Best prediction? I couldn’t find a paper that actually takes route #1 (although it’s easy to find tables for average height by ethnicity and gender, which take into account 2 of the 3 factors I mentioned), but there was a great paper that held a showdown between #2 and #3. #3 won walking away. #2 (despite an extraordinarily extensive effort) explained 4-6% of variance. #3 explained 40% of variance. A more recent paper using 100s of thousands of SNPs (yes that’s right 400,000 regions of DNA) was only able to predict 15-30% of height in the test data set. Galton’s Victorian era regression is still undisputed champion!

A similar result was found recently in the specific question of predicting future diseases in individuals. What they found is that for low-frequency, more specialized diseases like Crohn’s the genetic SNP approach worked better but that for common diseases like heart disease, family history worked better.

Before returning to ecology and prediction, I want to return to meteorology, which I cited previously as a model for prediction. As I explained, the 1-3 day predictions are highly reductionist models that use fluid flow equations and have improved due to better data input and smaller grid sizes. A clear victory for the mechanistic reductionist approach. But much of our improvement in longer term forecasts (e.g. monthly, yearly) has come from a completely different source – raw naked correlation! The key was the discovery of teleconnections, that is, cases where weather at one location influences weather at a far away location. The El Nino or ENSO was the oldest and best known. Then the Pacific Decadal Oscillation was discovered from the studies of salmon productivities on the Pacific coast of the US (it is a 20-30 year cycle). But the major breakthrough was the paper “Classification, seasonality and persistence of low-frequency atmospheric circulation patterns” by Barnston and Livezey in 1987. This paper was nothing more than a giant principal component analysis (across space and time and therefore called empirical orthogonal function analysis by meteorologists) of spatially gridded timeseries of atmospheric pressures. Out of it popped half a dozen major teleconnections with frequencies ranging from months to decades. Although some later mechanistic understanding of why these teleconnections occur has been provided, current models are poor at accurately reproducing many of these patterns. But understanding these spatiotemporal correlations lets us say things like the frequency of intense snow events in the NE US (bit of a personal interest in that right now) is strongly regulated by the PNA and NAO patterns. So monitoring and predicting these half dozen patterns has greatly improved our longer term (climatological) forecasting almost entirely because of empirical correlation (#3 above). A victory for the phenomenological/big data approaches.
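
For readers who have not met empirical orthogonal functions, the core computation is just a principal component analysis of a time-by-gridcell matrix. Here is a bare-bones sketch using a singular value decomposition. It is not the (rotated) analysis used in the 1987 paper, the function name `eof_analysis` is made up, and the "pressure field" is synthetic: one oscillating spatial pattern plus noise.

```python
import numpy as np

def eof_analysis(field, n_modes=3):
    """field has shape (n_times, n_gridcells).
    Returns spatial patterns (EOFs), their time series (PCs) and
    the fraction of total variance explained by each mode."""
    anomalies = field - field.mean(axis=0)        # remove the time mean of each grid cell
    u, s, vt = np.linalg.svd(anomalies, full_matrices=False)
    explained = s ** 2 / np.sum(s ** 2)
    eofs = vt[:n_modes]                           # (n_modes, n_gridcells)
    pcs = u[:, :n_modes] * s[:n_modes]            # (n_times, n_modes)
    return eofs, pcs, explained[:n_modes]

# Synthetic example (all values are assumptions of this sketch).
rng = np.random.default_rng(0)
t = np.arange(240)                                # e.g. 240 monthly time steps
pattern = np.sin(np.linspace(0, np.pi, 30))       # one spatial pattern over 30 grid cells
field = np.outer(np.sin(2 * np.pi * t / 40), pattern)
field += 0.1 * rng.normal(size=field.shape)       # measurement noise

eofs, pcs, explained = eof_analysis(field)
print(np.round(explained, 3))                     # the first mode should dominate
```

On real gridded pressure data the leading modes are the candidate teleconnection patterns, and their time series are what one would then monitor and try to predict.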

As an aside, I just want to note that physics has nothing like ecology’s expectation that mechanism be reductionist. We still have no reductionist mechanism for gravity (gravitons and other hypothetical particles have been proposed but not observed). Indeed all we really have is a phenomenon.

Now back to ecology.

I’m not sure what the exact analogies to #1-#3 are in ecology. But let’s try for one case – predicting species diversity around the globe:

  1. Big data – throw in NDVI (a satellite proxy for productivity), mean annual temperature, temperature seasonality, water balance and maybe a few other variables and develop a regression model
  2. Mechanism – use coexistence theory or other theories of species interactions to predict diversity from first principles
  3. Phenomenological – not sure exactly what this looks like – maybe predict bird diversity from tree diversity or insect diversity?

As the reader will probably know, all three of these have been done. In terms of accuracy, by and large #3>#1>>#2. Still think we need to be reductionist for prediction?

To my mind the hierarchy is simple:

  • accurate prediction>mechanism
  • knowing mechanism>ignorance about mechanism

If you adopt this view then the big data (#1) and certainly the phenomenological (#3) methods become viable and often the quickest routes to prediction. The main argument against #1 and #3 as predictive mechanisms is that because they are missing mechanism they cannot accurately extrapolate into new conditions (for example see Dunham, Arthur E., and Steven J. Beaupre. “Ecological experiments: scale, phenomenology, mechanism, and the illusion of generality.” Experimental ecology: issues and perspectives. Oxford University Press, New York, New York, USA (1998): 27-49. -I think they’re wrong but it is a provocative read I recommend to every grad student). I think this argument is given a lot more weight than it deserves. First, who says there is extrapolation – in the example of global patterns of diversity there was no extrapolation. Second, yes, in true extrapolation the regression approaches can fail – but so do the mechanistic ones often! Ecology is highly contingent and when you change contexts enough, regression relationships fall apart but so do basic assumptions about what the most important processes are.

So in summary, I would argue that there is more than one way to make a prediction. And they’re all viable routes. Mechanism is a nice-to-have but by no means a must-have for advancing science. Or as I prefer to think about it, the problem is not so much pursuit of mechanism but pursuit of reductionist mechanism (explaining everything by smaller things). #1 and #3 are arguably as mechanistic as #2, if not more so, once you let go of the reductionist paradigm. People will say #2 is (in either the height or diversity examples) more mechanistic because it gets closer to ultimate causes. But really genes and species interactions are both so “ultimate” that they lack much direct link to the topics at hand – the links in the regression are much more obvious.

I know this is a non-mainstream view and I’m expecting a lot of discussion (with Jeremy at the lead). Which is great. But please – intelligent comments. Don’t argue by religious fervor and just say “reductionist mechanistic predictions work better” (please specify by what measure, give specific examples) or just say “its not real science if it doesn’t have reductionist mechanism” (go tell that to the physicists and the climatologists and the epidemiologists).

Is using detection probabilities a case of statistical machismo?

Back in the fall I wrote a post on statistical machismo in ecology, arguing that ecology is prone to use increasingly complex statistics without necessarily stopping to weigh the costs and the benefits. I singled out four specific techniques: phylogenetic regression, spatial regression, Bayesian methods and detection probabilities. I at no point said these techniques were bad or should never be used. But I did say that we had in many cases reached a point where the techniques had become the sine qua non of publishing – reviewers wouldn’t let papers pass if these techniques weren’t applied, even if applying them was very costly and unlikely to change the results. Most of the comments were on the Bayesian methods (which I do NOT want to reignite here) and the two GLS (phylogenetic and spatial) regressions, which led to this follow-on post.

I got only one comment on detection probabilities. However a new paper published today in PLOS One called “Fitting and interpreting occupancy models” by Alan Welsh, David Lindenmayer and Christine Donnelly made me very excited and wanting to revisit detection probabilities.

Now, if you are based in a wildlife department you already know what detection probabilities are. Indeed, most of the committees of students I sit on in the wildlife department mention detection probabilities with a groan and a roll of their eyes but then go ahead and modify their design, at great cost – namely halving or more the amount of data they can collect – to address detection probabilities. You see, in wildlife journals detection probabilities have become a no publish line – you can’t publish without detection probabilities.

Although many in basic biology and EEB departments remain blissfully unaware of detection probabilities, the expectation is starting to creep into reviews on papers in basic research as well. As somebody who frequently publishes papers using the North American Breeding Bird Survey, I have now had three papers rejected a total of six times for the “sin” of not using detection probabilities (never mind that I couldn’t, didn’t need to, and it wouldn’t change the answer). So beware, this issue is coming to your population biology papers soon!

Detection probabilities are a statistical model/method designed to deal with one simple obvious fact. When you are censusing mobile organisms like birds or mammals or butterflies or … (really almost anything except plants and maybe snails), you miss organisms. You are not censusing the whole population. This has long been recognized by reporting such counts as an index of abundance rather than abundance per se (and if you need total abundance you have to use a method like mark-recapture). Detection probabilities got their start when people reported occupancy (presence/absence rates) instead of abundance. The idea was that a claim of absence was pretty shaky when you’re not counting all the individuals, yet the data was presented as a binary and hence large difference (presence or not). So far reasonable enough.

This paragraph has the heavy math – try to read it – but do keep going on to the paragraphs after and don’t just give up! The proposed solution was a two step model: let Oi be the true occupancy (1 if a species is present, 0 if absent at site i). We can assume the true underlying occupancy rate is Ψi (a simple way of saying let Oi be distributed Bernoulli(Ψi)). So far this is just a probability model of occupancy (in the simplest case where Ψ is constant across sites the occupancy rate is just Ψ). Now comes the fancy part. Let Di,j be what is actually observed at site i for observation repetition j (again D=1 if present, 0 if absent). Now Di,j can be different from Oi and we have added detection to our model (although not yet fully specified it). To fully specify we need a few things:

  1. Multiple observations of D giving the subscript j
  2. Assume Oi doesn’t change across the multiple observations j (i.e. a site doesn’t flip from really occupied to really unoccupied or vice versa between visits)
  3. Assume P(Di,j=1|Oi=0)=0 (we never mistakenly observe something at a site when it is not really there)
  4. Define pi,j=P(Di,j=1|Oi=1) (i.e. the probability it is detected if it is there) – aka the detection probability
  5. Assume pi,j is constant across observations (so we can drop the subscript j giving pi)

Now you don’t need me to tell you that assumptions #2, #3, and #5 are all whoppers. But let’s give the method a chance. Under these conditions it is not too hard to write down the likelihood and solve for the maximum likelihood estimates (MLE) of pi and Ψi, which are both unobservable, using the observations Di,j. It is also quite common to let pi and Ψi be functions of covariates like land cover type, elevation, etc (using logistic regression) – this more advanced model is also fairly directly solvable.
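
For the simplest case (constant Ψ and constant p, no covariates), the likelihood implied by the five conditions above can be written in a few lines. The sketch below is my own minimal illustration, not code from any occupancy package: the function names, the brute-force grid search and the simulated survey (500 sites, 4 visits, Ψ=0.6, p=0.5) are all assumptions chosen for clarity.

```python
import numpy as np

def occupancy_neg_log_lik(psi, p, detections):
    """Negative log-likelihood of the basic occupancy model.
    detections is a 0/1 array of shape (n_sites, n_visits)."""
    detections = np.asarray(detections)
    n_visits = detections.shape[1]
    d = detections.sum(axis=1)                      # number of detections per site
    # Probability of each site's detection history if the site is truly occupied
    lik_if_occupied = p ** d * (1 - p) ** (n_visits - d)
    # An all-zero history can also come from a truly unoccupied site
    lik = psi * lik_if_occupied + (1 - psi) * (d == 0)
    return -np.sum(np.log(lik))

def fit_by_grid(detections, grid=np.linspace(0.01, 0.99, 99)):
    """Crude MLE: evaluate the likelihood on a grid of (psi, p) and keep the best pair."""
    best = min((occupancy_neg_log_lik(psi, p, detections), psi, p)
               for psi in grid for p in grid)
    return best[1], best[2]

# Simulated survey (values invented for this sketch).
rng = np.random.default_rng(0)
n_sites, n_visits, true_psi, true_p = 500, 4, 0.6, 0.5
occupied = rng.random(n_sites) < true_psi
detections = (rng.random((n_sites, n_visits)) < true_p) & occupied[:, None]

psi_hat, p_hat = fit_by_grid(detections)
naive_occupancy = np.mean(detections.sum(axis=1) > 0)   # ignores imperfect detection
print(psi_hat, p_hat, naive_occupancy)
```

A real analysis would use a proper optimizer and let Ψ and p depend on covariates; the grid search is used here only because it makes it easy to see the whole likelihood surface.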

If you think about it, it should be obvious you cannot estimate a detection probability and an occupancy separately if you only have one observation of the site. Thus #1 (repeated observations of the same site) is critical. Right here is the nub of detection probabilities – you can only use them if you make REPEATED observations of the same data point. If you make three repeated observations, then you will in a fixed amount of time only be able to observe one third as many sites as you otherwise would have been able to. This is why wildlife ecologists hate having to address detection probabilities. It has a very real cost – it is not just more computations – it is more data collection (or equivalently fewer independent points for all ensuing analyses). But wildlife ecologists have buckled down and made the extra observations, accepting the loss of power/df inherent in detection probabilities, because they can’t get their paper published any other way. Now you understand the eye rolling. This also is a serious problem for people like me using historical datasets like the breeding bird survey that were never designed with detection probabilities in mind. They “only” have one observation per point in space and time. There is no way to go back and add repeated observations, thus demanding detection probabilities is tantamount to throwing away historical monitoring datasets – ouch!
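
The single-visit problem is easy to verify numerically. With one visit per site, the chance of recording a detection is just Ψ times p, so the likelihood cannot tell a common but cryptic species from a rare but conspicuous one. A short sketch (the function name and the counts, 24 detections at 100 sites, are invented for illustration):

```python
import numpy as np

def single_visit_log_lik(psi, p, n_detected, n_sites):
    """Log-likelihood when each site is visited once: only the product psi * p enters."""
    prob_detect = psi * p
    return (n_detected * np.log(prob_detect)
            + (n_sites - n_detected) * np.log(1 - prob_detect))

# Three very different biological stories, one identical fit to the data:
common_but_cryptic = single_visit_log_lik(0.8, 0.3, 24, 100)
rare_but_obvious = single_visit_log_lik(0.3, 0.8, 24, 100)
perfect_detection = single_visit_log_lik(0.24, 1.0, 24, 100)
print(common_but_cryptic, rare_but_obvious, perfect_detection)
```

Repeat visits break the tie because sites with mixed histories (seen on some visits, missed on others) carry direct information about p.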

Well, a couple of Aussies (the aforementioned Lindenmayer and Donnelly) were doing a nice study on the effect of monoculture pine plantations on bird communities (abundances). I don’t know the full story but judging by the paper I linked to above they must have gotten told by reviewers at least once that their paper was unpublishable and that they should: a) abandon abundances and only do occupancy so that they could b) use detection probability methods. The whole idea of throwing out abundance information and reducing it to occupancy just because “it’s more statistically proper” makes my stomach turn and apparently it did theirs too. They went out and got a clever statistician to work with them, Alan Welsh, resulting in the paper I am discussing.

It is quite a technical read, so let me boil down the main findings:

  1. In the version of detection probabilities where pi and Ψi have covariates modelled through logistic regression, the solving of the MLE equations is a lot harder than people have given it credit for. They found many cases where there were multiple solutions to the MLE (i.e. the answer depended on the initial guess you gave the solver) or where the solutions converged on the boundary (where pi and/or Ψi are either 0 or 1 which are theoretically impossible). The core issue is that there is a lot of freedom to move the solution back and forth between high pi and low Ψi vs low pi and high Ψi, both of which would give the same observed result. When you throw in logistic regression which can have its own convergence problems when the data is very noisy, you get a mess. To be more exact, you get a lot of real-world wrong answers spit out by the computer even to the point of estimating slopes of the wrong sign.
  2. This problem is compounded when the data is sparse – i.e. either pi or Ψi is low (by which I mean say 10% which is not at all uncommon to have an occupancy of 10%).
  3. It has been fashionable recently to notice that detection probability depends on abundance (gee really – a species is more likely to be noticed=detected when there are 30 of them than when there is one of them?). But this is a major violation of #5 above (detection probability constant across sites under the very likely scenario that abundance varies across sites). There are ways to try and deal with this, but as the Welsh et al paper shows, all of them have problems, leaving the detection model nearly always in violation of a core assumption of constant detection probability other than for modelled covariates.

So where does this leave us? Every model is imperfect and has assumptions violated. What are the consequences of #1, #2 and #3 for detection models? Welsh et al found (in an extremely rigorous paper where every point was supported by analysis of real-world data, analysis of simulated data and analytical results) that:

  1. Frequently the estimated relationship of detection and occupancy to covariates is very wrong. So for example in the original study which looked at how maturation of the pines influenced detection probability (old bigger forests should have lower detection probabilities) it was often estimated that detection *increased* with size/age of forest.
  2. The estimates of occupancy and detection are biased and have high variances. In fact they have the same amount of bias and high variance as if you just ignored the detection probabilities and went back to the old way of doing things!!! (and this was on simulation data where detection issues were built into the data).

Bottom line – ignoring detection issues often gives misleading/wrong answers. But at exactly the same rate as if you were modelling detection which also often gives misleading/wrong answers.  When you combine this with the real world fact that often times only half or one third of the data (by which I mean independent observations) is collected that would have been collected if we ignored detection probabilities, one really starts to question the appropriateness of demanding detection probabilities.

I claimed at the start of my post that I wasn’t saying any technique was inherently bad and should never be used. And I’m not saying that about detection probabilities either.

One of the most sensible thinkers on detection probabilities I know is Steve Buckland who has been a leader in the development of detection probabilities. In chapter 3 of the edited book by myself and Anne Magurran (sorry shameless self promotion), Buckland says “Ignoring detectability might not be a major problem if the bias is consistent across time or space.” But he then goes on to demonstrate quite clearly that results can be misleading if detection is ignored in other scenarios. He clearly is not black-and-white about the need to use detection probabilities. Buckland also developed a nice method where instead of repeated measures of the same site, one only needs to estimate the distance to observed individuals which can calibrate a detection decay curve. Estimating distances is not cost-free compared to just counting, but it is much less costly than repeated visits to sites and thus is a great benefit to wildlife ecologists who have to worry about detection probabilities. The distance-based detection method seems not to have made it over “the pond” to the US as well as it should have.
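
To give a flavour of how distances calibrate a detection decay curve, here is a textbook-style sketch of line-transect distance sampling with a half-normal detection function, g(x) = exp(-x²/2σ²). It is a simplified illustration under stated assumptions (certain detection on the line, untruncated perpendicular distances, a single transect), not a reproduction of Buckland's software, and the function name and simulated survey values are made up.

```python
import numpy as np

def half_normal_density(distances, transect_length):
    """Estimate density from perpendicular distances to detected individuals.
    Returns (sigma, effective strip half-width, density per unit area)."""
    x = np.asarray(distances, dtype=float)
    n = x.size
    sigma = np.sqrt(np.sum(x ** 2) / n)          # MLE of sigma for a half-normal curve
    esw = sigma * np.sqrt(np.pi / 2)             # area under g(x): effective half-width
    density = n / (2 * esw * transect_length)    # detections / effectively surveyed area
    return sigma, esw, density

# Simulated survey (all numbers are assumptions of this sketch; units are metres).
rng = np.random.default_rng(3)
true_density, true_sigma = 0.001, 30.0
length, half_width = 10_000.0, 200.0
n_animals = int(round(true_density * 2 * half_width * length))
x_all = rng.uniform(0, half_width, n_animals)                  # animals placed at random
detected = rng.random(n_animals) < np.exp(-x_all ** 2 / (2 * true_sigma ** 2))

sigma_hat, esw_hat, density_hat = half_normal_density(x_all[detected], length)
print(sigma_hat, esw_hat, density_hat)
```

The point of the design is that a single pass along the transect yields both the count and the information needed to correct it, with no repeat visits.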

Here are my recommendations.

  • In light of Welsh et al’s findings it is flat out wrong for reviewers to insist that detection analysis is a requirement for publication.
  • It is more important to address detection if you are actually studying occupancy (presence absence) and less important when you are studying other factors like community structure, abundance etc.
  • It is more important to address detection if detection probabilities are likely to vary across species (e.g. different detectabilities by species which is common enough) or space (e.g. varying amounts of brush) or survey points (e.g. varying effort levels) and that comparison (across species or sites) is what is important to you but it is less important to address detection when things are fairly constant across your axis of comparison – e.g. looking at just one species (so no issue of differing detectabilities between species) across space when there is not a reason to expect habitat to vary much (so no reason to expect varying detectabilities across space)
  • If you do have to address detection probabilities (because of your question and experimental context, hopefully, not because of reviewers), then: a) consider using Buckland et al’s distance methods, and b) consider getting serious and doing more than just two or three repetitions of each site – if you really are interested in occupancy and detection then you need real replication along that dimension just like for any other variable of interest.

I think the main theme of my post on statistical machismo is there is no such thing as a cookbook or one right way to do things in statistics. You have to know what you’re doing and think things out. Sometimes one way is appropriate. Sometimes an alternative way is appropriate. And these have to be weighed against real-world costs in data collection and loss to science of interesting studies. Detection probabilities are no exception. So if you’re a reviewer or editor, please stop telling poor authors you “have” to do detection probabilities because “it’s the only right way” or the “gold standard” for how to do it. It’s not – it very likely introduces as much error as it fixes and whether you should do it depends on the question and the data and requires thinking.

Ecologists need to do a better job of prediction – part II – partly cloudy and a 20% chance of extinction (or the 6 P’s of good prediction)

So before the holidays I started a series of posts on ecologists needing to do a better job of making predictions. I argued that we should predict more both for the benefit of applied usages but also for the better advancement of basic research. I also argued that ANOVA (at least as usually used) is a big blockage to a culture of prediction. Shortly after my post, guest commentator Peter Adler wrote a great post on prediction and the degree to which basic researchers are serious in making prediction vs using it as a front to get funding.

I have at least two more posts planned after this one (one on mechanistic vs phenomenological/statistical prediction and one returning to some questions raised in my first post about statistics).

But in this post, I want to look at a scientific field that I would argue has been the most successful at making predictions: meteorology. As Jeremy has noted in the past, one should worry when ecologists start reasoning by analogy to other fields of science instead of talking about ecology, but I have a specific goal here. I want to derive what I think are some good practices about prediction and think about the degree to which they do or don’t fit into ecology. Indeed to make it catchy and sound simple, I will boil it down to the 6 P’s of good prediction. And I will talk about how these apply to ecology.

OK – so first weather prediction. There are a number of good papers (like this and this) and even a book reviewing the history of weather prediction. Weather prediction is very different in one way – we know the laws. There are 7 equations that describe the behavior of air (see first review paper). The problem is that they are continuous in space over the whole globe and they are chaotic. Despite this, the bottom line is early predictions were worse than the obvious null models (tomorrow will be the same as today, tomorrow will be the same as the 30 year monthly average=climatology). Now they are way better than this for the 3-day-out prediction and also better for the 5-day-out prediction and even the 7 day prediction is slightly better than the null.

Reproduction of figure 4 in Simmons and Hollingsworth 2002. X-axis is year, y-axis is a correlation coefficient on air pressure deviations. D+3 is a prediction 3 days into the future.

Weather’s record of prediction is enviable both in absolute level (high correlation) and in the trend of constant improvement. If you read the histories, these improvements are a combination of three things:

  • Better computers leading to a finer resolution grid approximation to the continuous differential equations (the first models were a 3 degree x 3 degree grid and one vertical layer – modern global models are 1 degree x 1 degree with 5-7 vertical layers)
  • Some improved modelling tweaks
  • More data on initial conditions

I would argue that weather prediction has been such a success (Nate Silver’s new book on prediction also holds up weather prediction as a uniquely good success) because they follow the 6 P’s of good prediction (that I invented for a talk I gave a few months ago). These are:

  • Precise enough to be possibly wrong – Jeremy asked me in the first post what defines a prediction. And my answer was its not black and white, but a spectrum. Or as Lakatos said, a good prediction for testing a theory must be risky. The more risky the prediction (and also the more predictions) a theory makes, the better the test. Weather predictions are indubitably precise enough to know if they are right or wrong making them risky. They are maybe not the most risky predictions imaginable, but there are a lot of them (like 365 a year). Now compare this with ecological predictions: e.g. predation can, but not necessarily will, induce oscillations of some kind. Not very risky! (And not very many predictions from one theory). Who is really putting their neck on the chopping block with their predictions?
  • Probabilistic – weather forecasters do something almost no other predictors do. They put the percentages and error bars in their predictions (20% chance of snow with a high temperature between 25 and 30). You might think this is an escape from the first point of being risky, but only in the short term. If you predict a 20% chance of rain and then it rains, you seemingly have an out. But not if you have 10 years of data. Then you really ought to see rain 20% of the time on days you said a 20% chance of rain. In fact the weather service gets this right to within a percent or two. My main point is a good prediction includes an estimate of its uncertainty – it has error bars. Some branches of ecology do this well (e.g. PVAs provide ranges of extinction probability) but many branches of ecology don’t.
  • Prolific data – if you look back at the figure you see the Northern Hemisphere predictions have gradually gotten more accurate. It is very hard to tell how much this is due to better computing power vs more data. But the Southern Hemisphere predictions have gotten better at a much faster rate and have now converged to being almost as good. This is almost entirely attributable to having better/more input data to the models (it’s the same model and computer for both hemispheres). Weather forecasters have devoted enormous efforts to collecting data. They have more stations but also collect more kinds of data at these stations. It is impossible to get better at prediction without voluminous data! NEON in the US may be an attempt at this, but it is kind of sobering to realize that NEON wouldn’t even have a sensor in every one of the 3 degree x 3 degree cells in the oldest weather models and is nowhere close to covering modern grids of 1 degree x 1 degree (and NEON is focused on a subset of ecological data). The breeding bird surveys and forest inventories sample a little more densely, but are a very limited subset of measurements (it would be like trying to predict temperature by only measuring overnight low temperatures once a year to input into the model). We have to get *REALLY* serious about data if we care about prediction.
  • Proper scales – I find it fascinating that the early weather modelers had a very explicit sense that the most tractable problem was to focus on regional scale pressure variation (i.e. the high and low pressure systems and the fronts). Other things like precipitation depend to a much greater degree on micro-scale processes (e.g. local convection and evaporation). What is really fascinating is that even though precipitation was probably the ultimate goal, the weather modellers followed their noses and modelled the scales that were most tractable first and got that right and only later started trying to add in details specific to precipitation (and anybody who has lived in an arid landscape and seen how spotty rain can be knows how hard it would be to get this really right). I’m pretty sure we ecologists are not this scale-detached. We insist on modelling the scales we want answers to, not the ones that are amenable to modelling.
  • Place specific – Here is something that will be controversial. Weather forecasts explicitly reject the Robert May strategic modelling approach. They make forecasts that are specific to a time and a place and thus highly dependent on the initial conditions, parameter values and specificities. And the National Weather Service pays big bucks to have local experts who look at the computer outputs and “correct” them for local idiosyncrasies that these local modellers have come to understand relating to mountains, oceans, etc. Now it might seem in ecology we only need to make place-specific predictions for the applied side. But I would argue it is just as important for the basic research side. The main reason goes back to the predictions that are precise enough to be wrong. To take the ecological prediction that I picked on (some predator prey systems will cycle), this could be a very precise prediction if we said predator prey systems will cycle in boreal and tundra ecosystems but not elsewhere. And I’m not wedded to place – it could be condition dependent. “Predator prey systems will cycle when there is a 30 degree difference between summer and winter temperatures” is condition dependent rather than place dependent, but it serves the same purpose. So I think even for the good of basic science, we need place-specific (or condition-specific) predictions (and of course this will make applied scientists happy too).
  • Public even when worse than random – But more important than anything else, I think weather forecasters get big credit for and have received big benefit from the fact they don’t hide when their models are wrong. Going way back to the first real weather prediction, which was hand calculated by an ambulance corps volunteer during World War I – he published his result even though it was very wrong. This making of public predictions leads to a strong culture of figuring out what went wrong and making things better. This incremental improvement is exactly what you see in the figure above. Weather predictions started out worse than null models and now are much better than null models. And I think all sorts of factors contributed to this, but most of these factors got invoked because of the rigor of public, risky predictions on a repeated basis. This is a central theme of Nate Silver’s book. But really, if you think about it, it is the central theme of good science!
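
The calibration check described under the Probabilistic point (did it rain on about 20% of the days given a 20% chance?) is mechanical enough to sketch in a few lines. This is only an illustration: the function name `calibration_table` and the simulated forecaster are assumptions made up for the example, not anything the weather service actually runs.

```python
import random
from collections import defaultdict


def calibration_table(stated_probs, it_rained):
    """Group forecasts by stated probability and return, for each group,
    (number of forecasts, observed frequency of rain)."""
    counts = defaultdict(lambda: [0, 0])  # stated prob -> [n forecasts, n rainy days]
    for p, rained in zip(stated_probs, it_rained):
        key = round(p, 2)
        counts[key][0] += 1
        counts[key][1] += int(rained)
    return {p: (n, hits / n) for p, (n, hits) in sorted(counts.items())}


# Hypothetical data: ten years of daily forecasts from a simulated forecaster
# whose stated probabilities are, by construction, the true probabilities.
rng = random.Random(0)
stated = [rng.choice([0.1, 0.2, 0.5, 0.8]) for _ in range(3650)]
rained = [rng.random() < p for p in stated]

for p, (n, freq) in calibration_table(stated, rained).items():
    print(f"said {p:.0%}: {n} days, rained on {freq:.1%} of them")
```

A well-calibrated forecaster shows observed frequencies close to the stated ones in every row. The same table could in principle be built for any ecological forecast that comes with a stated probability, provided the forecasts are made repeatedly and the outcomes are recorded.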

OK – so I have somewhat presumptuously (and ponderously?) given 6 P’s of good prediction. I’ve commented a bit along the way on how ecology is doing, but I wanted to expand the application to ecology a bit. To really assess how ecology is doing on prediction, I think there are two cultures of prediction in ecology and their strengths and weaknesses are rather different and need to be broken out. The first culture is the one found in theoretical ecology, that finds May’s strategic modelling approach inspiring and makes predictions like “predator-prey systems can have cycles” – I’ll call this the strategic prediction culture. The second culture is centered in government agencies and NGOs although it certainly extends into universities. I stuck my foot in my mouth in comments to Peter’s post by not really recognizing this type of prediction culture (which is embarrassing because I’ve done some of it and certainly have colleagues down the hall doing it) but fortunately Eric Larson took me to task. I’ll call this the management prediction culture.

First the strategic prediction culture. I’ve been creating a little bit of a straw man by characterizing this culture as predicting “some predator-prey systems will cycle”. That is too simplistic. But by how much? This approach really falls down on the issue of public risky predictions. The P’s of precise, place-based and public are all weak here. The goals of this group are all basic research, so I won’t hold them to accountability for applied relevance, but even for basic research, are these stick-your-neck-out predictions? Are they specific enough to be falsifiable? Or is there room to wiggle and say “something else was going on” every time the predictions fail? I think these predictions have also failed on the probabilistic “P” – most of these models produce no sense of error bars or degree of confidence in the prediction. I would also argue that the proper scale “P” is largely ignored. There has been very little discussion of at which scales noise trumps signal or vice versa (and it is mostly raised by macroecologists who feel scoffed at for raising it). Probably a mixed bag on the prolific data P. Some of these modellers care immensely about testing their models with real-world data and are hungry for more data. But a good many are not. My thought is that a more rigorous prediction culture would cause this field to advance faster and there is a lot of room for improvement.

Now the management modelling culture. This group regularly makes predictions that are requested by and then used to inform management decisions about endangered species (listing and management), invasive species, climate change, acceptable harvesting levels, etc. How does this group do? They do make precise, public, place-specific (and species-specific) probabilistic predictions on a regular basis. This is to their great credit. They very often have no choice about the proper scale at which they are asked to model, but probably don’t have a healthy enough respect for the ensuing limits this entails. And I think you would have to give this group a mixed grade on prolific data. Much of the prolific data we have (breeding bird surveys, forest inventories, etc) come from management contexts. But management also has reams of place-specific monitoring data sitting in drawers and could probably do a better job of using their privileged position (policy makers want their predictions) to push the data agenda further. And I think one has to ask if they really accomplish the underlying goal of public, precise, place-based predictions, which is to have a critical culture of model evaluation and model improvement driven by clear model failures. This piece of the feedback loop is I think weaker than it should be (the slope of the line of improving prediction is rather flatter than the one for the weather forecasters in the figure above). So many predictions are for 20 years in the future and never really checked. And even the short-term predictions are last year’s work and not followed up in a detailed way (unless an embarrassingly bad prediction makes it into the news). The modelling of ocean fisheries is an interesting example. It is complex, and I am not an expert by any stretch so I would like to hear opinions of those that are.
But my impression is that while politics absolutely drove many of the decisions, one cannot escape the fact that the scientific predictions regularly underestimated the threat of overfishing and overestimated the rebound potential, thereby also playing a role in the current mess. And while one cannot use a broad brush to characterize a large population of scientists, and I know there is research on improving and fixing models, my understanding is that there is a real culture of inertia resisting change and improvement to the prediction models. My colleagues at Maine would suggest that a big part of the problem with current models is that they are at the wrong scale, but I cannot offer a strong opinion on that. So having picked on fisheries scientists for a minute, let me reverse course and reiterate that this group (and their colleagues doing similar things for deer populations, etc) is, in my opinion, closer to my 6 P’s of prediction than any other group in ecology.

So, a rather long post. Three main things I would love to hear comments on – do you agree that the 6 P’s of prediction are all important and good or am I missing anything big? How do you think the strategic modelling culture is doing with prediction and the 6 P’s? How do you think the management prediction culture is doing with prediction and the 6 P’s?

Ecologists need to do a better job of prediction – part I – the insidious evils of ANOVA (UPDATED)

When I teach a graduate statistics class, I spend some time emphasizing that most statistical analyses can produce a p-value, an effect size and an R2. Students are quick to get that p<0.05 with effect size of 0.1% and R2 of 3% is not that useful. This is not a particularly novel insight, but it is not something many students realize fresh out of a first-semester stats class, where the p-value is emphasized all the time. All 3 statistics have their roles. p-values are used in a hypothetico-deductive framework telling us the probability the signal could have been observed by chance (does nitrogen increase crop yield?). Effect size tells us the biological significance (how much does crop yield increase given a level of nitrogen addition?). This is the main focus in medicine. What is the increase in odds of survival for a new drug? And R2 tells us how much what we’re studying explains vs the other sources of variation (how much of the variation in crop yield is due to nitrogen). It tells us how close we are to being done understanding a system. I am not biased. I like all three summary statistics (and the different modes of scientific inference that they imply).
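
To see how the three statistics can tell very different stories about the same data, here is a minimal sketch on simulated (entirely hypothetical) nitrogen and yield values. The slope stands in for effect size, and the p-value comes from a permutation test, chosen here only because it needs nothing beyond the standard library; it is not the F-test or t-test a stats package would report. The function names are made up for the example.

```python
import random


def slope_and_r2(x, y):
    """Return (least-squares slope, R^2) for a straight-line fit of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx, sxy ** 2 / (sxx * syy)


def permutation_p_value(x, y, n_perm=500, seed=0):
    """Two-sided p-value for the slope: how often does shuffling y give a
    slope at least as large in magnitude as the observed one?"""
    rng = random.Random(seed)
    observed = abs(slope_and_r2(x, y)[0])
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(slope_and_r2(x, shuffled)[0]) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Hypothetical data: a weak true effect (slope 0.2) buried in noise (sd 1),
# but with a large sample size.
rng = random.Random(42)
nitrogen = [rng.gauss(0, 1) for _ in range(1000)]
crop_yield = [0.2 * n_i + rng.gauss(0, 1) for n_i in nitrogen]

slope, r2 = slope_and_r2(nitrogen, crop_yield)
p_value = permutation_p_value(nitrogen, crop_yield)
print(f"effect size (slope): {slope:.3f}")
print(f"R2:                  {r2:.3f}")
print(f"p-value:             {p_value:.4f}")
```

With a thousand samples the p-value comes out small while R2 stays in the low single digits of percent, which is the pattern the paragraph above describes.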

But if you ask me how they are used in ecology, then I think the answer is pretty clear that the p-value is way over-emphasized relative to the other two. You can find dozens of papers making this same claim (e.g. this and this). But the one that really pounds this home to me is this 2002 paper by Møller and Jennions. In this paper they conduct a formal meta-analysis of a random subset of papers published in ecology journals and look at the average R2 across these papers. Anybody want to guess what it is? We know it won’t be 80% or 90%. Ecology lives in a multi-causal world with many factors influencing a system simultaneously. Maybe 40% might be reasonable? Nope. 30%? Nope. At least 20%?! No. Depending on the exact method it is 2-5%! To me this is astonishing. Papers published in our top journals explain less than 5% of the variance (UPDATE: see footnote). In a less jaw-dropping result but very much in the same vein, Volker Bahn and I showed that we can predict the abundance of a species better by using spatial autocorrelation (basically copying the value measured at a site 200km away) than by advanced models incorporating climate, productivity, land cover, etc.

Whether you are motivated by basic or applied research goals, it seems clear that we ecologists need to do better than this! To be very specific we need to deliver on the predictive aspects of science that care about effect size and R2. From an applied side, policy makers and on-the-ground practitioners looking for recommendations for action care almost entirely about effect size and R2. Knowing that leaving retention patches in clear-cut forests increases species richness on average by 10 species (a large effect size) is the main topic of interest. As an additional nuance, if we further know that retention patches only explain 10% of the variation in species richness between cutting sites because it mostly depends on the history of the site and chance immigration events and weather in the first year of regrowth, then at a minimum we need to do more work and we might just pass on taking action. Knowing p is pretty immaterial once a certain credible minimum sample size threshold is passed. From a basic research side, there are also good arguments for focusing on effect size and R2. I don’t really believe science is about saying “I can show factor X influences measurable variable Y with less than a 5% chance of being wrong” and I don’t think most other people do either. I am a big fan of Lakatos, who rejects Popper’s emphasis on falsification and suggests that the true hallmark of science is producing “hitherto unknown novel facts” (elsewhere he raises the bar with words like stunning and bold and risky) and I would agree. Lakatos gives the example of Einstein’s general theory of relativity truly being accepted when the light of a star was observed to be bent by the gravity of the sun – a previously unimagined result. Ultimately, if all we’re doing is post hoc explanation, it is at best a deeply diminished form of science.
And even if one is rooted in the innate value of basic research, at some purely practical level there has to be a realization that if ecology as a whole (not individual researchers) doesn’t in some fashion step up and provide the basic tools to help us predict and navigate our way through this disastrous anthropogenic experiment known as global change, then society just might decide we deserve a funding level more like what the humanities receive.

So assuming you bought my above two arguments that:

  • Ecology is bad at prediction
  • Ecology needs to get better at prediction

then what do we do?

This question is going to be the topic of a series of posts (currently planned at three). In this post, beyond introducing the topic of prediction in ecology and arguing that we need to do a better job (i.e. the half of the post you just read), I want to examine one major roadblock to getting better at prediction in ecology, what I provocatively called in my title the “insidious evil of ANOVA”.

Now I want to be clear I have nothing against analysis of variance per se. It is a perfectly good technique for putting statistical significance on regressions. And indeed the basic ANOVA-derived F-ratio test on a univariate regression gives the same p-value as a t-test on the slope and has much in common with likelihood ratios (and hence with AIC*) for normally distributed errors (the –log likelihood of normally distributed errors is, up to an additive constant, nothing more than a constant times the sum of squares and hence a likelihood ratio is a ratio of sums of squares just like an F-test barring a few constants and degrees of freedom). So to take away ANOVA sensu stricto you would have to take away most of our inferential machinery. And variance partitioning (including R2 that I am promoting here) is a great thing. I have published papers that use ANOVA and will continue to do so.
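
For readers who want the algebra behind that parenthetical, here is a sketch for a univariate regression, assuming independent normal errors with constant variance. The notation is chosen here purely for illustration: SSE_0 is the residual sum of squares of the intercept-only model, SSE_1 that of the model with a slope, and Λ the ratio of the two maximized likelihoods.

```latex
% Model: y_i = a + b x_i + \varepsilon_i, \quad \varepsilon_i \sim N(0, \sigma^2), \quad i = 1, \dots, n
\begin{align*}
-\log L(a, b, \sigma^2)
  &= \frac{n}{2}\log(2\pi\sigma^2)
   + \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - a - b x_i)^2 \\
\hat{\sigma}^2 = \frac{\mathrm{SSE}}{n}
  \;&\Rightarrow\;
  -\log \hat{L} = \frac{n}{2}\log(\mathrm{SSE}) + \text{const} \\
\Lambda = \frac{\hat{L}_0}{\hat{L}_1}
  &= \left(\frac{\mathrm{SSE}_1}{\mathrm{SSE}_0}\right)^{n/2} \\
F = \frac{\mathrm{SSE}_0 - \mathrm{SSE}_1}{\mathrm{SSE}_1/(n-2)}
  &= (n-2)\left(\Lambda^{-2/n} - 1\right)
   = (n-2)\,\frac{R^2}{1 - R^2}
   = t^2
\end{align*}
```

Here t is the usual t statistic for the slope and R2 = 1 - SSE_1/SSE_0. The last line also shows why a tiny R2 can still come with a very small p-value: F grows with n for any fixed R2.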

What I object to is how ANOVA is typically applied and used (Click here for a humorous detour on ANOVA), starting from the experimental design and ending with how ANOVA is reported in journals. I’ll boil these down to two “evils” of traditional use of ANOVA:

  1. ANOVA hides effect sizes and R2 – Again, in a technical sense, ANOVA is just the use of an F-statistic to get a p-value on a regression. And you can get R2 and effect size off of a regression. But in a practical sense as commonly used (and in every first year stats class), ANOVA specifically means a study where you have a continuous dependent (Y) variable and one or more discrete, categorical independent/explanatory/X-variables. The classic agricultural example is yield (continuous) vs. fertilizer added or not added (discrete). Or abundance of target species in the presence and absence of competition. There is again nothing per se wrong with this. It’s just that most software packages out there (including R for the aov command) only report a p-value when you run this kind of ANOVA. They don’t report an effect size (difference in mean yields for with vs without fertilizer). And they don’t report an R2. You can get these values out with a little extra work but they’re not in the default reports. And then reviewers and editors let the authors write the ms reporting only a p-value without an R2. This is how we end up with a literature full of p<0.00001 but R2=0.04. I would argue that every single manuscript should be required to report its R2 and effect size. I hypothesize this requirement alone would cause the average R2 of our field to go up. At least some people would be embarrassed to publish a paper with an effect size of 2% or an R2 of 3%. It would be painful, because this is the state of our field today, but it would be really healthy in the long run to mandate always publishing an R2 and an effect size alongside the p-value.
  2. ANOVA experimental designs focus on the wrong question – A typical ANOVA setup asks the question does X have an effect on Y. In ecology it is not surprising that more often than not X does have an effect on Y (everything has an effect on Y when Y is something like abundance or birth rate or productivity). Indeed it would be shocking if X did not have an effect on Y. At that point it just becomes a game of chasing a large enough sample size to get p<0.05 and then voilà! It’s publishable. Instead we should be asking “how does Y vary with X”. This doesn’t require a drastic change in the experimental methods. Just a shift in thinking to response surfaces instead of ANOVA. A response surface measures Y for multiple values of X and then interpolates a smooth curve through the data. This is just as accessible as ANOVA – specifically it does not require an a priori quantitative model. However, it sure feeds into formulating new models or testing models somebody else developed. There was a nice recent review paper on functional responses by Denny and Benedetti-Cecchi that shows how powerful having a response surface is in ecology (although they focus on mechanism and scaling rather than statistics and experimental design). Having a tool and mindset that are more focused on prediction leads directly to questions of R2 (how big is the scatter around our line) and effect size (how much does the line deviate from a flat horizontal line) AND we get something that drives us immediately to models and we get a quantitative prediction (albeit phenomenological) even before we have a mechanistic model. This is a no-lose proposition.

A few technical details and an example will make response surfaces more clear. One reason response surfaces have been limited in use is that they traditionally required using NLS (nonlinear least squares regression) unless the response surface was a boring straight line, but in this day and age of GAM (basically spline regression) this excuse is gone! The figure below shows a contrast between traditional ANOVA approach (the boxplot in subfigure A) and response surfaces (subfigure B and then a 2-D version in subfigure C). Other than a shift in mindset (plot a spline through the data instead of a box plot), the only practical implication is that we ought to emphasize number of levels within a factor (e.g. four levels of nitrogen addition or competition instead of just the two levels of control and manipulation). This can be cost free because we can reduce the number of replicates at each level proportionately (i.e. 4 levels with 2 replicates instead of 2 levels with 4 replicates) (again compare subfigure A with subfigure B).
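
As a rough sketch of the “4 levels with 2 replicates” design, the code below fits a curve through eight made-up yield measurements and compares it with what a two-level comparison of the endpoints would show. Several things here are assumptions for illustration only: the numbers are invented (they are not the Paris 1992 data), nitrogen is in arbitrary units, and a plain quadratic fitted by least squares stands in for the GAM/spline mentioned above so that the example needs no libraries.

```python
def solve_linear_system(a, b):
    """Solve a small dense system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit_quadratic(xs, ys):
    """Least-squares fit of y = c0 + c1*x + c2*x**2 via the normal equations."""
    powers = [sum(x ** k for x in xs) for k in range(5)]
    a = [[powers[i + j] for j in range(3)] for i in range(3)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(3)]
    return solve_linear_system(a, b)


def r_squared(xs, ys, coefs):
    """Fraction of the variance in ys explained by the fitted polynomial."""
    mean_y = sum(ys) / len(ys)
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    ss_resid = sum(
        (y - sum(c * x ** k for k, c in enumerate(coefs))) ** 2
        for x, y in zip(xs, ys)
    )
    return 1 - ss_resid / ss_total


# Hypothetical design: 4 nitrogen levels x 2 replicates (arbitrary units).
nitrogen = [0.0, 0.0, 0.5, 0.5, 1.0, 1.0, 1.5, 1.5]
crop_yield = [2.0, 2.4, 4.1, 3.7, 4.0, 4.4, 2.3, 2.1]

c0, c1, c2 = fit_quadratic(nitrogen, crop_yield)
peak = -c1 / (2 * c2)  # nitrogen level where the fitted curve is highest
r2 = r_squared(nitrogen, crop_yield, (c0, c1, c2))

# What a control-vs-manipulation comparison of only the endpoints would see.
low = [y for x, y in zip(nitrogen, crop_yield) if x == 0.0]
high = [y for x, y in zip(nitrogen, crop_yield) if x == 1.5]
two_level_effect = sum(high) / len(high) - sum(low) / len(low)

print(f"fitted curve: {c0:.2f} + {c1:.2f}*N + {c2:.2f}*N^2, peak near N = {peak:.2f}")
print(f"R2 of the response surface: {r2:.2f}")
print(f"two-level effect (highest minus lowest N): {two_level_effect:.2f}")
```

In this invented data set the endpoint comparison shows essentially no effect, while the fitted curve shows a clear hump with an optimum at intermediate nitrogen. With real data one would more likely reach for a spline or GAM than a fixed quadratic.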

Just to expand briefly on the benefits of a response surface, it serves as a wonderful interface to modeling. If we have a model, we can test if it produces the response surface found empirically; if we don’t have a model it immediately suggests phenomenological models and may even suggest mechanistic models (or at least whether the positive or negative feedbacks are dominating and whether there is an optimum). Further we have a prediction immediately useful in an applied context. If we have a response surface of, say, juvenile survival vs. predator abundance, we can immediately have an answer for that conservation biologist who says, given that I have limited dollars, what kind of benefit will I see if I invest in eradication of the invasive predator? Being scientists, we will immediately want to caveat this analysis with warnings about context dependence, year-to-year variation etc. That’s fine, but I bet the conservation donor will still be a lot happier with you if you have this response surface than if you have p<0.05! So, there are both basic and applied reasons to use response surfaces.

Of course there is lots of fine work in ecology that addresses both of my concerns. But without doing a formal meta-analysis I am pretty sure that the version of ANOVA I am picking on is a plurality if not an outright majority of studies published in major ecology journals. Occasionally the independent variable truly is categorical and unordered so a response surface won’t work, but this is rare and even then can often be improved to a continuous variable with a little thinking (and it doesn’t excuse not reporting a measure of goodness of fit and effect size!).

In conclusion, I did not get into science to conclude “Factor X has an unspecified but significant effect on factor Y”. This is all the traditional use of ANOVA tells us. Two simple proposed changes let us go past this question: 1) require reporting the R2 and effect size to publish and 2) shift to a response surface mentality (specifically more levels with fewer replicates and fitting a surface through the data). Now we can ask “how does Y vary with X and how much of Y is explained by X?”. This is much more exciting to me and I hope to you! This subtle, labor cost neutral, but critical reframing lets mechanistic models in the door much more quickly and gives an ability to answer applied questions in the meantime. The opportunity cost of failing to do this is what I call “the insidious evil of ANOVA”. I am convinced that our field would progress much more quickly if we made this change. What do you think?

*don’t get me started on AIC – that is for another post, but note for now that focusing on AIC neatly dodges having to report an R2 and effect size

Footnote from Jeremy: Upon reflection, Brian and I have decided we ought to alert readers that Anders Møller has been subject to numerous very serious accusations of sloppiness and data falsification, one of which led to retraction of an Oikos paper. See this old post for links which cover this history. We leave it to readers to decide for themselves whether that history should affect their judgment of a paper (Møller and Jennions 2002) that has not, to our knowledge, been directly questioned. In any case, Brian and I believe that his post is robust: many R2 values reported in ecological papers are indeed low, few are extremely high, and the point of the post does not depend on the precise value of the average R2.

Yield as a response surface.

This figure shows classic agricultural yield data based on amount of fertilization. Data from Paris 1992 (The return of Von Liebig’s “Law of the Minimum”) with some random noise added by me. A) A traditional ANOVA approach showing yield vs. nitrogen addition with only two or a few levels. Here there is borderline non-significance (p=0.08). B) The much more revealing response surface (it turns out this data was from the 0 phosphorus addition set of the data and the plants actually fare poorly when too much nitrogen is added without phosphorus). C) The full fitted 2-D surface of yield vs nitrogen and phosphorus. The Paris paper actually fits different functional forms to test whether Liebig’s law of the minimum is true or not. This is a far cry from (A)!

Some well-known tricks for clear writing

Jeremy recently posted a question on who are the most stylish writers in ecology. Stylish is good – scientific writing can be beautiful. But, as I mentioned in the comments, my goal is more prosaic. I just want to be a clear writer. My PhD adviser, Mike Rosenzweig, was a leading inspiration for this. I tell every graduate student of mine to read the introduction to his book on species diversity where he expounds on why it is important to be a clear writer and gives a number of quick tips. I can’t quote the whole passage here (it is pages xv-xvi if you have access to a copy), but here are two quotations to start with:

On why to write clearly (and why most scientists don’t): “Here’s another warning. Clear writing brings a grave danger: People may begin to understand you! Then they will probably disagree with you.”

And on what it takes to write clearly: “But be warned. Writing more clearly takes hard work. The more effortless it seems, the more effort it took. It all depends on whether you have something to say. If you do, you’ll care to work hard to get it across.”

Writing a piece about how to write well is one of those really hubristic, setting-yourself-up-for-failure endeavors. So let me be clear. I don’t in any way hold myself up as a writing expert nor a great writer, although I would like to think I am (and based on feedback I probably am) at least a clearer than average writer. I expect half the readers are better writers than I, so I offer these thoughts in the spirit of use what you like, ignore the rest, and please add your own thoughts in the comments. (Plus I’m feeling guilty about how long it’s been since my last post, and Jeremy’s post brought this topic to mind so I’m going to go with it.)

Here are some of my favorite tips for clear scientific writing:

1)      Audience, audience, audience. You may be sole author, but writing is still a dialogue – between you and the reader. There is no such thing as writing that is most clear in an absolute sense. Writing is only more or less clear in the context of a particular audience. Think about your audience before you start writing. One audience’s dense jargon is another audience’s shorthand. Regularly check in on your writing and whether you think it is reaching that audience.

2)      Much of writing is convention. This follows immediately from point #1. Introduction/methods/results/discussion is not the only way nor even necessarily the best way to write a paper. But it is what a reader expects. So fulfill their expectation. They’ll spend less time trying to figure out the overall flow of the paper and more time trying to understand the details of what you’re saying.  Points #3, #4, #5 and #6 are at least in part about conventions. An example of a convention that tripped me up for my first five papers – the methods are in the past tense, the discussion is in the present. The introduction is mixed depending on whether describing previous work or what we do and do not currently know. Partly this is logical, but partly it is just a convention and I had to find it out from reviewers.

3)      Know how to emphasize. There are numerous rhetorical devices for emphasizing a point. Repetition is a big one – when writing papers it is tempting to think you should not be repetitious. Indeed it is a fatal flaw to repeat unimportant material. But repeating your points of emphasis is your only hope to get somebody to actually remember them. Position is also important – the beginning and end of anything are the most emphasized parts. This is true for a paper, for a section of a paper or for a sentence. For example, one should always have a paragraph in the discussion mentioning the limitations you know about your work (better you get them out than the reviewers think you are ignorant of them), but this paragraph should never be the first or last paragraph of the discussion section unless what you really want the reader to remember is the shortcomings of your work. Same thing holds true for a sentence. “While travelling by train, I saw a purple unicorn standing in a meadow chewing her cud.” Rather buries the main point of the sentence, doesn’t it? Purple unicorn is so novel it tries to swim to the top, but it is ultimately buried by the preferential treatment given to travelling and chewing. Breaking a grammatical rule also calls attention. One of my favorite rhetorical techniques (to the point of overuse by me), is to start a sentence with “But”. My 3rd grade teacher, Mrs Adlof, told me never to do this. But it really grabs your attention. In the quote above from Mike he starts a whole paragraph with “But”. This was honest signalling because it was his most important point on the whole page. All of this, of course, assumes you know what you want to emphasize. Emphasizing random points is like listening to a person who doesn’t modulate their voice to help you know when to pay attention. If you don’t know definitively what you want to emphasize, then you shouldn’t be writing yet.  
Your time is better spent thinking and talking to people about what your main points are.

4)      Be precise. Don’t say “productivity varied with temperature”. Say “as temperature goes up, productivity goes up”. Don’t say “we obtained samples”; say “we captured birds in a mist net and drew 1 ml of blood”. And – one of my Achilles heels in writing – don’t overuse pronouns. “It”, “they” and “this” add very little value and should be used only when the antecedent is unmistakably clear. “This” at least allows the opportunity (which should be used) for some clarification like “this calculation” or “this result”.

5)      Formatting is your friend. Modern writers have the benefit of easy access to many formatting conveniences. Journals require headings, but one should almost always use subheadings (and maybe even sub-sub-headings). Similarly, a good numbered or bulleted list is stylistically very inelegant but rates very high on the clarity scale. It is the ultimate tool for parallel construction. Bold and italics can be useful as long as they aren’t overused.

6)      The battle for good writing is won sentence by sentence. A good sentence is: short, has the subject and verb together, has an active verb, has the points of emphasis at the beginning and end, and moves the reader along from a familiar launch point at the start to the new information at the end.

For understanding how to build a good sentence, I find a website on scientific writing at Duke  to be very useful. The example below started from one of their examples. Compare the following 6 sentences:

a)      The data was analyzed using multivariate statistics.

b)      The data was analyzed using multivariate statistics by us.

c)      We performed a multivariate statistical analysis on the data.

d)     We analyzed the data using multivariate statistics.

e)      We calculated a principal component analysis [on all seven of our variables].

f)       We regressed productivity against temperature.

I hope you can see that there is a general progression towards better sentences. Specific comments follow.

a)      Sentence a has no subject. Who analyzed the data?! Inquiring minds want to know. Seriously, our brains are hardwired to ask who did what from every sentence. It is jarring not to answer.

b)      Sentence b has a subject but it is miles away from the verb. Sentence b (and a) are passive constructs. Experts tell you not to use passive constructs. (cf. with how I originally wrote this sentence – “You are often told not to use passive constructs in good writing”- it would have been ironic if it was intentional). This is not because passive voice is inherently evil (it has its places), but because the passive usually leads to other evils like separating subject and verb or using weak verbs. By the way, using the verb “to be” (is are were was am been) is not the same as a passive construction, but using “to be” as your main verb  is also a warning sign of a weak sentence.

c)      Sentence c pushes the action (analyzing) into a noun and uses a weak, vague verb (performed – this could mean anything from running a centrifuge to dancing on stage to in this case clicking on buttons in SAS – blech – terrible verb). Sentence c also has a multiple-noun/adjective collision: “multivariate statistical analysis”. Putting 3 nouns and adjectives together is barely legal if used infrequently. Using four or more in a row should get you arrested.

d)     Sentence d is getting there. It now puts the continuity (the “we” who has presumably been mentioned in the 5 preceding sentences) at the start of the sentence, while also putting the  novel information (multivariate stats) in the position of emphasis and novelty (the end of the sentence).

e)      Probably sentence e is even more impactful and informative (analyze is better than perform but analyze is still pretty darn weak, calculate is stronger). Sentence e is probably about the best you can do for a multivariate statistical analysis (you can’t turn principal component analysis into a verb). Depending on context the phrase in brackets might be informative or excessively wordy. Or the phrase in brackets could be moved into the middle of the sentence if it should be deemphasized (“We input seven variables into a principal component analysis”), or to the beginning of the sentence if it bridges to what went before (e.g. “Using all seven variables, we calculated a principal component analysis” when the previous 3 sentences were about the variables).

f)       Notice that in a slightly different context we can still do much better than sentence e. In sentence f, we have a very specific verb (regressed) and we replace the annoyingly vague phrase data (or variables) with our specific instances.

The one thing I haven’t talked about here in sentence construction is brevity. Like all rules it is not absolute. I’ve written some really good sentences that are 25-40 words long. But most of my really good sentences are in the 8-15 word range. This is another of my Achilles heels in writing. I love the good parenthetical phrase, the nuance, the sentence complexity that reflects the complexity of the real world. But I fight it on a regular basis. Subordinate clauses, adjectives and adverbs, parenthetical expressions and prepositional phrases should all be examined with severe scrutiny to see if they justify their existence. And if they do justify their existence, do they deserve a sentence of their own? There are many phrases that sound erudite and thus are often used in scientific writing but are really quite vacuous and wordy. Lesson 3 (on simplicity) in the aforementioned link to the Duke Science Writing website has excellent material on vacuous phrases.

It is a lot of work to rewrite a long sentence into a short one and then further work to rewrite with the issues highlighted in sentences a-f above in mind! Do you have to do this for every single sentence for every single paper? No. It does, however, mean that at points in your life when you are actively working on improving your writing, you need to go through your paper sentence by sentence, thinking about how you can improve it. It’s probably also a good idea to do this for the abstract and conclusion of every paper. I have a Word macro that goes through and highlights sentences in yellow for longish and red for really long sentences, and a shadow on sentences that use forms of the verb “to be”. This quickly helps draw my attention to potential problems. Now, I don’t slavishly try to get my writing so it gets no flags. Robotic writing is not good writing either. But I find it a very helpful way to call my attention to the sentences I need to think about. I did this a lot when I was a graduate student writing my first few papers. Hopefully some of the lessons learned sank in and carried forward, because I don’t have time to do this for every paper I write today (and certainly not every blog post or comment!). But I am going to start writing a book shortly, and I expect to spend time doing this again. Everybody needs to return to this level of attention once in a while.

My bottom line is this. The goal of writing a paper is to communicate. Communication involves things like brevity, following convention, good sentence structure, emphasizing important points, etc. In short it means saying what you’re trying to say as clearly as possible. Don’t be beguiled by the apparent erudition of complex, meandering phrases into being what is ultimately mealy-mouthed and unclear.

So, I have now pontificated on a subject on which I am not an expert and only a middling practitioner. On this blog alone, I am easily only the 4th best writer :-) . I can write this only because these ideas are not particularly novel. One can find them in most good books on writing. Aside from the Duke Science Writing website I linked to above, and pages xv-xvi of Rosenzweig’s book, the paper by Gopen and Swan (1990 -The Science of Scientific Writing – just google it you’ll find PDFs everywhere) is also good.

Note that I’ve stayed away from the larger structural issues (i.e. how to organize a paper, etc.). Maybe I will come back to that someday. For now, I look forward to hearing others’ thoughts on the nuts and bolts of what makes clear writing.


For those who are curious here is a link to the Word macro I mentioned. I give it out AS IS. It should work in anything from Word 1997 to Word 2010 and on Mac and PC. It is written in VBA which is not officially supported since Word 2003, but it worked fine in Word 2010 for me today. Unfortunately, I cannot provide technical support on how to get it working on all these different platforms, but it should basically look like:

1) Download the file (you will probably get warnings as Word macros often carry viruses – you’ll just have to accept my word this is clean and click OK).

2) Open the file (WritingMacros.dot). In Word 2010 it automatically disables macros unless you click on the yellow bar at the top to enable them.

3) Open the file (word document) you want to analyze and make it the front/active document.

4) Run the macro VerboseCheck (how to do this is the part that changes a lot between versions of Word). My macro will flag sentences longer than 20, 30 and 40 words by colors (green, yellow, red if I remember). You can (and I sometimes do) change the lower limit to 15 words. It will also put any sentence with forms of the verb “to be” in shadow font.

5) Edit your document, correcting hopefully most but not all flagged sentences.

6) Run the macro UndoVerboseCheck to remove all the weird formatting (note that I intentionally used odd formatting like text color and shadowing so it doesn’t overwrite all of your own formatting).

People who figure out how to run this macro on their particular operating system/word version are welcome to post more detailed instructions here. And if you’re really stuck, you can post a question and see if somebody other than I can help you.
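For readers without Word, the same checks can be sketched in a few lines of Python. This is not the VBA macro; it is a minimal, hypothetical re-implementation of the behaviour described in step 4. The function names, the naive sentence splitter and the exact list of “to be” forms are assumptions made for illustration, and the colour names simply follow the order given above.

```python
import re

# Forms of "to be". "be" and "being" are an addition here; the macro's own list is not known.
TO_BE = {"is", "are", "were", "was", "am", "been", "be", "being"}

# Word-count limits from step 4, checked from the longest down.
THRESHOLDS = [(40, "red"), (30, "yellow"), (20, "green")]


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def check_sentence(sentence):
    """Return (word_count, length_flag_or_None, uses_to_be) for one sentence."""
    words = re.findall(r"[A-Za-z'’]+", sentence.lower())
    count = len(words)
    flag = None
    for limit, colour in THRESHOLDS:
        if count > limit:
            flag = colour
            break
    uses_to_be = any(word in TO_BE for word in words)
    return count, flag, uses_to_be


def verbose_check(text):
    """Check every sentence in text; returns a list of (sentence, count, flag, uses_to_be)."""
    return [(s, *check_sentence(s)) for s in split_sentences(text)]


if __name__ == "__main__":
    sample = (
        "The data was analyzed using multivariate statistics. "
        "We regressed productivity against temperature."
    )
    for sentence, count, flag, uses_to_be in verbose_check(sample):
        print(count, flag, uses_to_be, sentence)
```

The splitter will stumble on abbreviations such as “e.g.”, so treat the flags as prompts to look at a sentence, not as a verdict on it.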

Scaling up is hard to do

In a recent post on schools of thought in ecology Jeremy and I exchanged several ideas on the importance of linking macro-scale patterns down to micro-scale (think population & community) processes. Jeremy correctly pointed out we need to bring this conversation back to ecology and not leave it at analogies about ideal gas laws and such.  As a macroecologist, I obviously think about this and get asked about this a lot. So here is my best thought to date on this topic.

When people discuss trying to derive macro-scale patterns from detailed processes at the micro-scale (i.e. population dynamics and species interactions), a series of obvious questions comes to mind.

  1. Can we do this scaling up?
  2. Should we do this scaling up?
  3. Must we scale up to call it good science?

I would suggest the consensus of opinion in ecology at large is somewhere between #2 and #3. However, this bypasses the more basic question – can we scale up?

I am about to argue that in most cases such mapping from micro-scale to macro-scale processes is in fact basically impossible to do for simple mathematical reasons. My argument is as follows.

Imagine two scales, the micro-scale and the macro-scale. For example a 1 ha plot for mature trees is a reasonable proxy for the micro (aka local) scale. Several thousand square kilometers might be a good guess at the macro (aka regional) scale. Imagine there is a variable of interest $x_{i,t}$ at the micro-scale, say the abundance (or biomass) of Red Oak on plot $i$ at time $t$. One can develop (and people have developed) detailed models for the dynamics of $x_{i,t}$ over time. Denote this model by the function $f$, with a parameter $\theta_{i,t}$ representing the exogenous variables (e.g. environment): i.e. (eq1)* $x_{i,t+1}=f(x_{i,t},\theta_{i,t})$. But what if we’re really interested in the abundance of Red Oak at the larger, macro-scale? Maybe this is because we have conservation/policy motives (hard to imagine this crowd is interested in answers about a single 1 ha plot). Or maybe we just have a basic science interest in the regional/macro-scale. What do we do?

You should now skip ahead to the recap if you don’t like equations!

One possibility is to simply model each 1 ha plot and aggregate (sum) up the results, i.e. to study (eq2) $X_t=\sum_i x_{i,t}$, where capital $X_t$ represents the same variable (abundance or biomass of Red Oak) at the macro/regional scale, and just continue to model the dynamics at the micro-scale by equation 1: $x_{i,t+1}=f(x_{i,t},\theta_{i,t})$. This is mathematically valid. However this approach requires considerable resources to obtain data (on both $x$ and $\theta$) for each and every 1 ha plot and considerable computational resources to calculate a complex non-linear model for every ha. This is in practice what weather forecasting models do, but it requires supercomputers and hundreds of millions of dollars invested in data collection**. Not so easy and often in practice impossible in ecology. What else can we do?

As a short-cut it is very tempting to take the detailed process-based model and study it on an average 1 ha plot, i.e. (eq3a)* $f(\overline{x_{i,t}},\overline{\theta_{i,t}})$ (where the over bar indicates the average value), since such average data is often readily available. This model is not only data-tractable but computationally tractable because we only need to iterate the dynamic equation #1 for one case. This is known in physics as the mean-field approach, and is a common modelling tactic. Many assume this will give the correct answer for the macro-scale problem ($X_t$) by summing up the average plot enough times, i.e. (eq3b) $X_t=n f(\overline{x_{i,t}},\overline{\theta_{i,t}})$ (where $n$ is the number of parcels; $n$ = 100,000 in our example). However, it requires that $n f(\overline{x_{i,t}},\overline{\theta_{i,t}})=\sum_i f(x_{i,t},\theta_{i,t})$, or equivalently that $f(\overline{x_{i,t}},\overline{\theta_{i,t}}) = \frac{1}{n}\sum_i f(x_{i,t},\theta_{i,t})=\overline{f(x_{i,t},\theta_{i,t})}$ (or in English, that the function of the average of x is the average of the function applied to each x).

However, it is well known from Jensen’s inequality that in general it is not true that $f(\overline{x_{i,t}},\overline{\theta_{i,t}})=\overline{f(x_{i,t},\theta_{i,t})}$. The equality holds if and only if $f(x)$ is a linear function or $\mathrm{Var}(x_i)=0$. Thus the mean-field approach fails when $f$ is non-linear and there is variance in $x_i$. And the failure can be quite large, not just a mathematical detail. Using a Taylor series one can approximate the inaccuracy (known as the delta method in economics): (eq4) $\overline{f(x_{i,t})}\approx f(\overline{x_{i,t}})+\frac{1}{2}f''(\overline{x_{i,t}})\,\mathrm{Var}(x_{i,t})$ (where $\theta$ is suppressed, i.e. held fixed across plots, and $f''$ is the 2nd derivative of $f$, i.e. a measure of its non-linearity). Thus in systems with high variance and high non-linearity the correction factor can be as large as or larger than the original term.
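A small numerical check makes the gap between the two averages concrete. The sketch below is only an illustration under stated assumptions: the model is one step of discrete logistic growth, and the growth rate `R`, carrying capacity `K` and the five plot abundances are made-up values, not data. Because this particular $f$ is quadratic, the second-order correction in equation 4 happens to be exact here; for other functions it is only an approximation.

```python
from statistics import fmean, pvariance

# Hypothetical parameters, chosen only for illustration.
R = 0.5    # growth rate
K = 100.0  # carrying capacity


def f(x):
    """One time step of discrete logistic growth (non-linear in x)."""
    return x + R * x * (1.0 - x / K)


def f_second_derivative(x):
    """f''(x) for the step above; constant because f is quadratic in x."""
    return -2.0 * R / K


# Made-up abundances on five plots with plenty of between-plot variance.
plots = [5.0, 20.0, 50.0, 80.0, 95.0]

x_bar = fmean(plots)
true_average = fmean([f(x) for x in plots])  # average of the function
mean_field = f(x_bar)                        # function of the average
# Equation 4: mean-field value plus half the curvature times the variance.
delta_method = mean_field + 0.5 * f_second_derivative(x_bar) * pvariance(plots)

if __name__ == "__main__":
    print("average of f(x):       ", true_average)
    print("mean-field f(mean x):  ", mean_field)
    print("delta-method estimate: ", delta_method)
```

Since this $f$ is concave, the mean-field value overshoots the true average, and the size of the overshoot grows in proportion to the variance among plots.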

A dynamical systems context (i.e. tracking X/x over time by $x_{i,t+1}=f(x_{i,t},\theta_{i,t})$) further exaggerates this effect because the error is compounded at each time step. And if $f$ is a chaotic map, then the deviation of the model from the true answer will grow exponentially fast due to sensitivity to initial conditions.
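The compounding can be seen in a toy simulation. The sketch below is an assumption-laden illustration, not a model of any real forest: every plot follows the logistic map with the same hypothetical parameter `r = 3.9` (a value in the chaotic regime), initial abundances are drawn at random on a 0 to 1 scale, and the function names are invented for this example. It tracks the regional total two ways, by summing all plots (equation 2) and by iterating one average plot and multiplying by n (equation 3b).

```python
import random


def logistic_map(x, r=3.9):
    """Logistic map on [0, 1]; r = 3.9 is an assumed value in the chaotic regime."""
    return r * x * (1.0 - x)


def simulate(n_plots=1000, steps=20, seed=1):
    """Return a list of (t, summed_total, mean_field_total) for t = 0..steps."""
    rng = random.Random(seed)
    # Hypothetical initial abundances, scaled to lie between 0 and 1.
    plots = [rng.uniform(0.1, 0.9) for _ in range(n_plots)]
    mean_plot = sum(plots) / n_plots
    history = []
    for t in range(steps + 1):
        history.append((t, sum(plots), n_plots * mean_plot))
        plots = [logistic_map(x) for x in plots]  # equation 1 on every plot
        mean_plot = logistic_map(mean_plot)       # equation 1 on the average plot only
    return history


if __name__ == "__main__":
    for t, summed, mean_field in simulate():
        print(f"t={t:2d}  summed={summed:8.2f}  mean-field={mean_field:8.2f}")
```

The two totals agree at t = 0 by construction. After that the mean-field trajectory is just one chaotic orbit scaled up by n, while the summed total averages over many different orbits, so there is no reason for the two to stay close.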

A quick recap: For those of you whose heads are hurting from the equations, let me summarize the action:

  1. We have a simple dynamical system modelling detailed processes over time at the micro (1 ha plot) scale given by equation 1. We do this all the time in ecology for some variable $x_i$.
  2. We want to study the aggregate value of this variable over some much larger macro scale (e.g. 1000 km²), call it $X$.
  3. We can figure out $X$ by just adding up the $x_i$ over all the plots as in equation 2, but this requires modelling the dynamics of each of the 100,000 separate plots, which requires detailed knowledge about each separate 1 ha plot and computational horsepower.
  4. If we aren’t the weather service and can’t do #3, we are tempted to use a mean-field approach (equation 3), modelling an average plot and then multiplying it by 100,000 instead.
  5. Unfortunately Jensen’s inequality tells us that the mean-field approach (equation 3) gives the same answer as the correct massive computation approach (equation 2) if and only if the model is linear or there is no variance in the $x_i$. That happens sometimes (e.g. the ideal gas law models a situation with no variance), but it sure doesn’t sound like ecology.
  6. We can quantify the approximate error of the mean-field approach by equation 4: it is half the product of the non-linearity (2nd derivative) of $f$ and the variance in the $x_i$. This can be HUGE in ecology.***
  7. Putting this argument into a dynamical systems context where the error propagates forward in time, especially in a chaotic system, just makes it worse.

Conclusion

So if you believe ecology is essentially linear and/or has no variance then scaling works easily. Otherwise we are in the realm of weather prediction where massive data gathering and computation give rather limited understanding.

When I declare that “scaling up is hard to do” the Frankie Valli/Four tops cover of the song “Breaking up is hard to do” always pops into my head. When they sing the phrase, there is high emotion – surprise, wistfulness and maybe a bit of hope and relief. This is how I feel about the idea that “scaling up is hard to do”. All my scientific training and instincts tell me that building detailed mapping between the micro- and macro-scale is the ultimate goal. It is the signal achievement of statistical mechanics in physics. Going from the quantum mechanics of the Bohr atom to macro-chemical properties (valences, types of bonds) is the essence of physical chemistry. The power of doing this bridging is undeniable. However, I am increasingly of the opinion that in many (most?) cases, this goal is unachievable no matter how hard we try in ecology.

This in turn leaves us with the problem of what to do with all the really interesting (and real-world useful) questions at the macroscale. I only see two possibilities:

  1. Declare macro-scale questions off limits because traditional methods can’t cover them
  2. Charge into macro-scale questions and muddle along trying to invent new approaches

Personally, I can’t accept the first approach and advocate the second.

What do you think? Do you see flaws in my argument that it is mathematically demonstrable that we will never scale micro-theory up to macro-theory in ecology? Can you give me a counter-example in ecology, something like the statistical mechanics of physics, where we can model from the micro-scale to the macro-scale informatively? Or if you agree with my argument, what do you think are the implications?

* I have put equation references in for the convenience of those who want to comment
** and despite all of that money spent, weather prediction is still rather limited and unable to project the system forward more than about five days.
*** this approximation approach could provide a way out but I’ve never seen it attempted in ecology