Are ecologists excessively macho when it comes to statistical methods? I use the word macho in a purely gender-neutral way, to mean “posturing to show how tough one is and to place oneself at the top of the hierarchy”.
In my experience ecologists have a long list of “must use” approaches to statistics that are more complicated than simpler methods but don’t necessarily change the outcome. To me this is a machismo attitude to statistics – “my paper is better because I used tougher statistics”. It has a Red-Queen dynamic – what starts as a signal of superiority eventually turns into something reviewers expect in every paper. But oftentimes, with a little thinking, there is really no reason the analysis is needed in a particular case (the reviewer requiring it is so far removed from the development of the approach that they have forgotten why it is really used). And even when the more complex approach might be relevant, it can be very costly to implement yet often has very little impact on the final results. Thus what started out as statistical machismo turns into wasted time required by reviewers. Here are some of my hobby horses:
- Bonferroni corrections – One should be careful about multiple comparisons and the possibility of increased Type I error. However, this is carried way overboard. First of all, one is usually told to use the Bonferroni method, which is known to err way on the other side and be excessively conservative. Secondly, it is used without thought about why it might be required. I recall a colleague who had measured about 35 floral traits in two populations. About 30 of them came back as significantly different. Reviewers told them to do a Bonferroni correction. To anybody who understands the biological question and the statistics, a Bonferroni correction will make no difference in the final answer (OK, only 26 out of 35 traits will be significant after correction, but are we now going to conclude the populations haven’t differentiated?). Now if only 2 or 4 out of 35 were significant at p<0.05, then some proper correction would absolutely be needed (though in that case the whole conclusion probably ought to change regardless of the Bonferroni outcome). But when 30 out of 35 are significant, are we really going to waste time on Bonferroni corrections?
- Phylogenetic corrections – Anytime your data points represent different species (i.e. comparative analysis), you are now expected to do some variation of PIC (phylogenetically independent contrasts) or GLS regression. I know there are a handful of classic stories that reverse when PICs are used. But now people are expected to go out and generate a phylogeny before they can publish any comparative analysis. Even when we don’t have good phylogenies for many groups. And when the methods assume the phylogenies are error-free, which they are not. And when the p-values are <0.0000001 and unlikely to change under realistic evolutionary patterns. I was once told I had to do a phylogenetic regression when my dependent variable was abundance of a species. Now of all traits that are not phylogenetically conserved, abundance is at the top of the list (there is published data supporting this), guaranteeing there could not be a phylogenetic signal in this regression. When I argued this, my interlocutor eventually fell back on “well, that’s how you do good science” to justify why I should still do it – there was no link back to the real issues.
- Spatial regression – Increasingly reviewers are demanding some form of spatial regression if your data (specifically the residuals) are spatially structured. It is true that treating points as independent when they are spatially autocorrelated can lead to Type I errors. But it usually doesn’t change your p-value by orders of magnitude, and in real-world cases many spatial regressions have hundreds of points and p-values with 5 or 6 leading zeros. They’re still going to be significant after doing spatial GLS. And, here is the important point – ignoring spatial autocorrelation does not bias your estimates of slope under normal circumstances (at worst it makes the estimate less efficient) – so ignoring autocorrelation will not introduce error into studying the parameters of the regression. You can also use simple methods to adjust the degrees of freedom, and hence the p-value, without performing spatial regression. Incidentally, I think the most interesting thing to do with spatial autocorrelation is to highlight it and study it as being informative, not to use statistical techniques that “correct it out” and let you ignore it – the same thing I would say about phylogenetic correlation. Note that all of these arguments apply to time series as well.
- Detection error – I am running into this increasingly with use of the Breeding Bird Survey. Anytime you estimate abundance of a moving organism, you will sometimes miss a few. This is a source of measurement error for abundance estimates known as detection error. There are techniques to estimate detection error, but – and here’s the kicker – they effectively require repeated measures of essentially the same data point (i.e. same time, location and observer), or distance-based sampling where the distance to each organism is recorded, or many covariates. This clearly cuts down the number of sites, species, times and other factors of interest you can sample and is thus very costly. And even if you’re willing to pay the cost, it’s not something you can do retroactively on a historical dataset like the Breeding Bird Survey. Detection error also requires unrealistic assumptions to estimate, such as the assumption that the population is closed (about like assuming a phylogeny has no errors). Now, if one wants to make strong claims about how an abundance has gone from low to zero, detection error is a real issue (see the debate on whether the Ivory-billed woodpecker is extinct). Detection error can also be critical if one wants to claim cryptic species X is rarer than loud, brilliantly colored species Y, since the differential detectability biases the result. And detection error alone indubitably biases estimates of site occupancy downward (you can only fail to count individuals in detection error), but this assumes detection error is the only or dominant source of measurement error (e.g. might mistaken double counting of individuals accidentally cancel out detection error?). But if one is looking at sweeping macroecological questions, primarily comparing within a species across space and/or time, it is hard to spin a scenario where detection error is more than a lot of noise.
- Bayesian methods – this one is a mixed bag (Jeremy has discussed his view on Bayesian approaches previously here and here). There have been real innovations in computational methods that are enabled by Bayesian approaches (e.g. hierarchical process models sensu Clark et al). Although even here, it is in most cases Markov Chain Monte Carlo (MCMC) as a computational tool for numerically solving complex likelihoods that is the real innovation – not Bayesian methods. (As an aside, to truly make something Bayesian sensu stricto in my mind you need to have informative priors, which ecologists rarely do, but I know others enjoy the philosophical differences between credible intervals vs. confidence intervals, etc.) Notwithstanding these benefits, I have reviewed papers where a Bayesian approach was used to do what was basically a two-sample t-test or a multivariate regression or even a linear hierarchical mixed model (OK, the last is complicated, but still not as complicated to most people as the Bayesian equivalent). Apparently I was supposed to be impressed at how much better the paper was because it was Bayesian. Nope. The best statistic is one that is as widely understood as possible and good enough for the question at hand (a minimal sketch of this point follows below).
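As a minimal illustration of that last point, here is a sketch of a “Bayesian” analysis of a simple two-group comparison next to plain lm() on made-up data. It assumes the rstanarm package, whose defaults are weakly informative priors, so it is MCMC machinery rather than Bayesian inference with real prior information; the estimates come out essentially the same either way.

```r
# Minimal sketch, not a recommendation: a Bayesian fit of a simple two-group
# comparison next to plain lm(). Assumes the rstanarm package; its default
# priors are weakly informative, so this is MCMC machinery, not Bayesian
# inference with real prior information.
library(rstanarm)

set.seed(1)
d <- data.frame(group = rep(c("A", "B"), each = 30),
                y = c(rnorm(30, mean = 10, sd = 2), rnorm(30, mean = 12, sd = 2)))

summary(lm(y ~ group, data = d))                  # the classical version

fit_bayes <- stan_glm(y ~ group, data = d, refresh = 0)
summary(fit_bayes)                                # same model, fit by MCMC
# For a question this simple, the point estimate and interval for the group
# effect are essentially the same as lm()'s; the extra machinery buys nothing.
```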
These techniques all have the following features in common:
a) They are vastly more complex to apply than a well-known simple alternative
b) They are understood by a much narrower circle of readers – in my book intentionally narrowing your audience is a cardinal sin of scientific communication when done unnecessarily (but I secretly suspect this is the main reason many people do it – the fewer people who understand you, the more you can get away with …)
c) They may require additional data that is impossible or expensive to obtain (phylogenies, repeated observations for detection). Sometimes the data (e.g. phylogenies) or assumptions (closed populations in detection analysis) are error-riddled themselves, but it is apparently okay to ignore this. They might also require new software and heavy computational power (e.g. Bayesian methods).
d) They reduce power in a statistical sense, inflating the p-value, which means on average we will need to collect a bit more data. In doing so they also falsely pay homage to p-values instead of important things like variance explained and effect size, and erroneously prioritize Type I over Type II error.
e) They have not in the grand sweep over many papers fundamentally changed our understanding of ecology in any field I can name (nor even changed the interpretation of most specific results in individual papers).
In short, our collective statistical machismo has caused us to require statistical approaches that are a drag on the field of ecology, allowing them to become firmly established (or to be quickly becoming so) as “must do” to publish, let alone to be considered high-quality science. I don’t object to having these tools around for when we really need them or when valid questions arise. But can we please stop reflexively and unthinkingly insisting that every paper that possibly could use these techniques use them? Especially, but not only, when we can tell in advance they will have no effect. They have real (sometimes insurmountable) costs to implement.
To make this constructive, here is what I would suggest:
- For the corrections that are obsessed with Type I error (Bonferroni, spatial, temporal and phylogenetic regressions), I would say: stop wasting our time when p=0.00001 – it ain’t going to become non-significant (or at a minimum the burden of proof is now on the reviewer to argue for some highly unusual pathology in the data that makes the correction matter way more than usual, or else that estimation bias is being introduced; a short simulation sketch after this list shows how little these corrections typically move a small p-value). If p is closer* to 0.05 then, well, have a rational conversation about whether hypothesis testing at p<0.05 is really the main point of the paper and how hard it is to get the data to do the additional test vs the importance of the science, and be open to arguments about why the test isn’t needed (e.g. knowing that there is no phylogenetic signal in the variable being studied).
- For detection error, apply some common sense about whether detection error is likely to change the result at hand or not. Some questions it is, some it isn’t. Don’t stop all science on datasets that don’t support estimation of detection error.
- For Bayesian approaches – out of simple respect for your audience, don’t use Bayesian methods when a simpler approach works. And if you are using a complex approach that requires Bayesian computation, be clear whether you are just using it as a calculation method on likelihoods or really invoking informative priors and the full Bayesian philosophy. And the burden is still on you to justify that you have answered an ecologically interesting question – including a Bayesian method doesn’t give you a free pass on this.
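Here is the promised simulation sketch of how little these Type I error corrections typically move things. It is only a minimal illustration, assuming the nlme package (corExp is one of several spatial correlation structures it offers): generate spatially autocorrelated residuals, then fit an ordinary regression and a spatial GLS to the same data.

```r
# Minimal sketch: how much does a spatial GLS actually change things?
# Assumes only the nlme package; data are simulated.
library(nlme)

set.seed(1)
n <- 200
d <- data.frame(x = rnorm(n), lon = runif(n, 0, 10), lat = runif(n, 0, 10))

# spatially autocorrelated errors from an exponential covariance (range = 2)
dist_mat <- as.matrix(dist(d[, c("lon", "lat")]))
Sigma    <- exp(-dist_mat / 2)
d$y      <- 1 + 0.5 * d$x + as.numeric(t(chol(Sigma)) %*% rnorm(n))

ols  <- lm(y ~ x, data = d)                            # ignores the autocorrelation
sgls <- gls(y ~ x, data = d,
            correlation = corExp(form = ~ lon + lat))  # models it explicitly

coef(ols); coef(sgls)        # slope estimates: typically very close (no bias either way)
summary(ols)$coefficients    # OLS p-value is somewhat optimistic...
summary(sgls)$tTable         # ...but a tiny p-value stays tiny after the correction
```

Across runs the two slope estimates track each other closely; what changes is the standard error, and a p-value with five or six leading zeros does not climb anywhere near 0.05.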
To those readers who object to this as a way of returning common sense to statistics in ecology, I challenge you to make a case that these kinds of techniques have fundamentally improved our ecological understanding. I know this is a provocative claim, so don’t hold back. But please don’t: 1) tell me you have to do the test “just because” or because “statisticians agree” (they don’t – most statisticians understand the strengths and weaknesses of these approaches way better than ecologists) or because it violates the assumptions (most statistics reported violate some assumption – it’s just a question of whether it violates the assumptions in an important way); 2) nor assume that I am a statistical idiot and don’t understand the implications for Type I error, etc.; and 3) please do address my core issue of the real cost of implementing them, and describe how they improve the state of ecological knowledge (not statistical assumption satisfying) enough to justify this cost. Otherwise, I claim you’re guilty of statistical machismo!
UPDATE – if you read the comments you’re in for a long read (109 and counting). If you want the quick version I posted a summary of the comments. At the moment it is the last comment (until somebody comments on the original post again). If it’s not the last comment or close to it, you can find it by using your browser to search for “100+”, which should take you straight to the summary.
*(I would propose an order of magnitude cut-off of only worrying about Type I error correction if p>0.005, and I think this is conservative based on how much I’ve seen these correction factors change p values)
Fancy methods definitely cannot replace thoughtful ideas. I was reading an article the other day where the authors developed a complex model for the dynamics of a population, but some of their basic assumptions just didn’t make sense. As I learn more about statistical methods, I really value stepping back and thinking simply about what I am trying to understand.
That said, I wonder if some of the machismo is because fancy statistics are still impressive for many ecologists. Maybe someday, when everyone understands the mechanics of these analyses, simple analyses will become vogue!
Hi Brian,
I understand where you’re coming from with your post and most of these points, but I have to disagree with you complaining about correcting for multiple comparisons. Bonferroni’s correction really is easy to do (putting aside questions about conservatism for the moment). It can be done by a 10 year old with a pencil and paper after all the experimental work and statistical analysis is done. Less conservative corrections (e.g., Dunn-Sidak) can be done by a 15 year old with pencil and paper. Or on a tablet (either stone or silicon versions). False discovery rates are an interesting alternative, which should probably be left to 18 year olds and older. Of course, a simple a priori power analysis will tell you how much data you need to collect in your experimental design, avoiding many of these arguments.
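For what it is worth, the mechanics really are that simple in R; a minimal sketch on a made-up vector of p-values (the standard corrections are one call to p.adjust, and the Dunn-Sidak adjusted alpha is a one-line formula):

```r
# Mechanics of the standard multiple-comparison corrections (base R only)
p <- c(0.001, 0.004, 0.012, 0.030, 0.049, 0.210)   # hypothetical raw p-values
m <- length(p)

p.adjust(p, method = "bonferroni")   # classic Bonferroni (the conservative one)
p.adjust(p, method = "holm")         # sequential Bonferroni, uniformly more powerful
p.adjust(p, method = "BH")           # Benjamini-Hochberg false discovery rate

1 - (1 - 0.05)^(1 / m)               # Dunn-Sidak adjusted alpha, done by hand
```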
Given that many ecologists remain fixated on the absolute value of ‘p’ (which does not actually say anything about biological significance) and many reported (published) p-values in ecology are above your threshold (p > 0.005), I think we should race our hobby horses to see who wins 😉
Oh – and can you clarify what you mean by detection error? Is this sometimes referred to as ‘observation error’, or is it based only on presence/absence type sampling? If the former, it is possible to overestimate abundance, by counting the same individual more than once, or (former and latter) assigning an individual from another species to your focal species. Observation error can have important consequences in ecological estimation. Exactly how important probably has to be deduced for individual cases.
I think it’s important not to conflate rigorous methodological approaches with machismo. But I agree that statistical machismo does exist in ecology.
Hi Mike – I’m not totally clear on which hobby horse you are racing against mine? If it is that ecologists are too obsessed with/don’t think clearly about p<0.05 I couldn't agree more!
I'm not an expert but as I understand it, detection error is one specific type of observation error. It occurs when counting abundance (although it shows up most in the difference between 0 and 1, or presence/absence) and it specifically involves missed observations of an individual that is present. The classic way of dealing with it is to use mark-recapture or distance observation, which lets you use a reasonably fancy (probably 18 years and over only by your scale!) statistical model to correct for these misses. It requires repeated measurements or measurements of distance to all individuals, and it requires unlikely assumptions, like that the population stays constant between the repeated measures.
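For anyone who has not met these models, here is a minimal sketch of the kind of machinery involved, assuming the unmarked package and simulated data; note that it only works because every site gets repeated visits.

```r
# Minimal sketch of a single-season occupancy model, which separates detection
# probability from true occupancy. Assumes the unmarked package; data are simulated.
library(unmarked)

set.seed(1)
n_sites <- 100; n_visits <- 3
psi <- 0.6    # true occupancy probability
p   <- 0.4    # true per-visit detection probability

z <- rbinom(n_sites, 1, psi)                                 # true presence/absence
y <- matrix(rbinom(n_sites * n_visits, 1, p), n_sites) * z   # detections (0 if absent)

umf <- unmarkedFrameOccu(y = y)
fit <- occu(~ 1 ~ 1, data = umf)     # first ~ is the detection model, second the occupancy model
backTransform(fit, type = "state")   # estimated occupancy, corrected for imperfect detection
backTransform(fit, type = "det")     # estimated detection probability
# The naive estimate, mean(rowSums(y) > 0), underestimates occupancy because
# occupied sites with no detections look empty.
```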
As Jeremy notes, one of the biggest problems I have with singling out detection error is precisely that it is singling out one type of error only.
For Mike & Jeremy – there are times when a Bonferroni correction is needed. And I agree it's not that hard to use. But there are times when it's not going to change the story at all. In these cases should the author just do it anyway? Maybe, although I'm on the side of no. But in these cases should a reviewer hold up a paper when it's absolutely not going to make a difference?
I know I’ve referenced it before, but, on p-values and the like, I cannot recommend enough this intellectual romp by Hurlbert and Lombardi [pdf]. See also the same duo’s comments on multiple comparisons and 1-tailed tests.
Brian – my hobby horse (we should be very careful about multiple comparisons) was being put up against yours (Bonferroni is annoying) in the 12.15 post-hoc stakes. It looks like we basically agree on this, though. In clear-cut cases, it’s probably not required, but in marginal cases (as are so common in ecology), we really should guard against false inference.
I think the history of the “Bonferroni is too conservative” argument has been abused by some researchers. From my recollection, this line developed out of discussions about how to deal with the massive amount of genetic data that became available when large-scale sequencing became readily accessible. Here, Bonferroni really can screw you over in terms of the heights you need to jump to achieve statistical significance, where you might have hundreds or thousands of comparisons to make. In the sort of population ecology I read, there are rarely (if ever) so many comparisons, so the original Bonferroni correction probably isn’t overly conservative.
I’ll have to go and read those articles Jarett posted to see whether montane unicorns suffer from detection bias or statistical conservatism.
Yeah – I’m not sure we’re too far apart on the sometimes you need it, sometimes you don’t (and the do case is common in ecology).
I may not have made it very obvious but there is a link in my post to a nice paper by Garcia in Oikos 2004 on a test that is almost as simple as Bonferroni and much more balanced in its treatment of Type I vs Type II error. If you have to do a treatment for multiple tests, Garcia’s method is my vote.
Nice article!
I agree with a lot of this Brian, though I’m with Mike on the specific case of Bonferroni correction being pretty straightforward. I think your comments on routinely and mindlessly “correcting” for spatial autocorrelation and phylogeny are spot on.
There’s an Ecology paper from a few years ago (sorry, can’t recall author, will try to find it later today) arguing that ecologists these days overcomplicate their analyses, trying to explicitly and precisely model lots of sources of error. It’s as if people don’t believe in the Central Limit Theorem anymore. It’s unfashionable to rely on the robustness of simple, classical analyses (and they are very robust, particularly to violations of the assumption of normality).
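That robustness is easy to check with a quick simulation. A minimal sketch in base R: the realized Type I error rate of a plain two-sample t-test stays very close to the nominal 5% even when the data are strongly skewed.

```r
# How fragile is a plain t-test to non-normality? Draw both groups from a
# strongly skewed (lognormal) distribution with identical parameters, so the
# null hypothesis is true, and see how often we falsely reject at alpha = 0.05.
set.seed(1)
false_pos <- replicate(10000, {
  a <- rlnorm(30); b <- rlnorm(30)
  t.test(a, b)$p.value < 0.05
})
mean(false_pos)   # realized Type I error rate; it comes out very close to 0.05
```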
And I say this as someone who’s done some modestly-fancy things (fitting ODEs and generalized additive models to time series data). But I’d like to think that I haul out the big guns only when it’s demanded by my scientific goals (e.g., if I want to parameterize a population dynamic model from time series data, I have no choice but to fit that ODE model to the data).
I’m plotting a post at some point on a related issue, asking whether the wide availability of R packages and WinBugs isn’t just allowing ecologists to make more and different mistakes than they otherwise would have made, and if so what can be done about it. We rightly criticize “cookbook statistics”, mindlessly following a “recipe” rather than exercising thought and judgment in one’s analyses. But one virtue of “cookbook statistics” is that it limits the kinds of mistakes one can make. Conversely, people using highly-sophisticated, “non-cookbook” methods they don’t fully understand aren’t really in a position to make all the judgment calls that are required to use those methods properly. If you’re not a professional chef, there’s something to be said for not trying to cook like one and just following simple recipes instead.
The ‘cookbook’ approach to modern R usage is intriguing, especially as it was developed (at least, in my mind) as a command line approach that forced users to think explicitly about the assumptions they made for each model/test, compared to e.g., Minitab/SPSS, which use the point and click/black box approach to statistical testing. That’s not to say people can’t do the wrong sort of stats with R, just that they have the opportunity to see explicitly what they’re doing.
I’ve only ever dabbled with R, and not very recently, so it’s interesting to hear how it’s evolving.
I found it tremendously helpful as a means to formally state what I was testing in my data. A big leap over the point and click software so many of us are trained on early. And (for me anyway) I found the syntax more intuitive than SAS. It seemed more like language than an incantation.
I can’t recommend enough Murtaugh’s Ecology paper (which I think Jeremy was referencing) Simplicity and Complexity in Ecological Data Analysis. I think it captures much of what you’re getting at here.
Personally, I’m against ‘cookbook statistics’, as I found in my own work and in interacting with colleagues that it leads to an ossification of one’s views about Statistics (which is, after all, an evolving science, just like Ecology) and, often without meaning to, to a halt in thinking about Biology. As part of a series of posts coming from my course, I’ve written a brief reflection on my own evolving views.
To wit, start with a question and your own model of how you think the world works. Collect data to test it. Think about the processes that generated that data. Then choose the simplest, clearest method to test your hypotheses. The machinery, as it were, is incredibly important, but should be chosen to yield the cleanest, clearest answers.
As a side note, one reason I love causal graphs of the sort used in SEM is that they are intuitive. Anyone, regardless of mathematical or statistical ability, can create one. I’ve found time after time that once someone draws one up, the task ahead of them in terms of analysis is far simpler than they think it will be. And then the jump into analysis is a hop rather than a leap, quite often.
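A minimal sketch of that point, assuming the lavaan package and made-up variable names: the model statement is little more than the drawn causal graph written out as arrows.

```r
# A drawn causal graph (x1 -> x2, x1 -> y, x2 -> y) written out as a lavaan
# model. Variable names and data are hypothetical.
library(lavaan)

set.seed(1)
d <- data.frame(x1 = rnorm(100))
d$x2 <- 0.5 * d$x1 + rnorm(100)
d$y  <- 0.3 * d$x1 + 0.6 * d$x2 + rnorm(100)

model <- '
  x2 ~ x1        # arrow from x1 to x2
  y  ~ x1 + x2   # arrows from x1 and x2 to y
'
fit <- sem(model, data = d)
summary(fit, standardized = TRUE)
```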
Yes, thank you, I was thinking of Murtaugh.
Interesting that you raise the example of SEM. There’s a post in the queue on that. I’m not a fan, or at least not nearly as much of one as you are. But you’ll just have to wait for the post to hear why.
Before you post it, read over Jim’s new Ecosphere piece (what? I like giving you homework!). It argues quite strongly to really think of SEM as a framework, not a specific technique. It, too, can be incredibly simple or dazzlingly complex. It all depends on one’s goals.
Way ahead of you.
C’mon. It’s ok to ignore detection probability? For the example you raise (historic datasets like the Breeding Bird Surveys), it’s fine to state your assumption (detection probability has not changed in a systematic way over time), and present your results with appropriate caveats. But that’s no excuse to ignore detection probability in new field studies. It doesn’t “cost” anything in new field studies to collect distance data, and it allows you to do a lot of things you can’t do with presence-only data, like compare relative (or absolute) abundance among species. I guess your caveats at the end cover you: if it’s about statistical noise, ok; if it’s about systematic bias (e.g., a blue whale is detectable at greater distance than a sea otter), you have to address it. Can’t wait to review a paper that cites this blog as a justification to ignore known sources of bias! 😉
Hmm – not sure I said some of the things you said I did. But on the whole I am most sympathetic with other commenters who object to a one-size-fits-all cookbook approach. I clearly said there are times when detection error is important and times when it isn’t (as we seem to agree). However, errors are made on both sides. People clearly ignore it when they shouldn’t. But (and I’m speaking from multiple personal experiences) reviewers ramrod it as an absolute requirement to publish when they shouldn’t (when it can’t and needn’t be done). And – here is my main point – in many scenarios errors of ignoring something like detection are more likely to be harmless than errors of preventing publication of good work. I’ve clearly acknowledged scenarios where detection error is vital. So I’m waiting for the reverse: please tell me stories of where detection error is vitally advancing our state of knowledge in some of the scenarios I mentioned where I said it could be ignored.
And, umm, no cost in additional time to measuring distance instead of counting individuals. I’d love to have field techs who possess this magical property! Presumably they can teleport to their field sites too? Seriously, even just eyeballing distance with an estimate and writing down an approximation takes time over just recording the species ID of an individual (50-100% more would be my rough guess). And if you’re just approximating, all you’ve done is replace one relatively obvious source of error with another source of error buried in a more complex process. And if you really want measurements with a tape out to the observed location of every individual, you’re talking say 500-1000% increase in time per point.
And in all seriousness, I do hope this blog gets used in a few push-back arguments against over-the-line reviewers (the ones applying a cookbook of their own without thinking) who demand more complicated statistics when they’re not really needed and when they come at a high cost.
If you disagree, please address the scenarios I’m talking about and make a case why detection error has to be used. Don’t just say “it has to be done”. That to me is a sign of not really thinking about the problem.
I’ve now spent 3 summers doing butterfly surveys all over Canada. We routinely find tens to hundreds of butterflies (individuals; usually comprising 10-40 species) per day, spread across 10+ sites. With this kind of work, if we were collecting (even haphazard guesses at) distance measurements per individual, I suspect it’d take us at least 5 times longer to complete our surveys.
Our project’s goal is to generate big data, and rely on the big numbers to take care of the issues with the inaccuracies of our field data collection. We’re running around in meadows all day, it’s near impossible to accurately derive abundance data, but having tons of historical & current presence/absence data for all of north america lets us ask some pretty interesting questions.
On another note, we should have a paper coming out indicating that our method (running around meadows) generates as good data (w/r/t species accumulation curves) as the traditional pollard transect.
It’s been good to read through this post / comment. I did my undergrad honours project last year – for about two months I tried using fancy MaxEnt models to estimate historical range shifts before my supervisor pointed out I’d violated my assumptions – we ended up using simple regression models. Two months – gone!
As already noted by Jeremy, the central limit theorem and big data tend to make a lot of these problems go away (little known fact – the central limit theorem still works even if the points are correlated unless there is pathological correlation between the points).
Good experimental design addressing ecological hypotheses (like you’ve done) is way more important than minor violations of statistical assumptions.
Good luck!
On the somewhat tangential issue of cookbook statistics (=mindless rule-following) vs. the need for good judgment, see this old article by Allan Stewart-Oaten: http://www.esajournals.org/doi/abs/10.2307/1940736. Very good piece. The message I take away from it is that good judgment is essential, but that it is based on a thorough understanding of the “rules” – not just what the “rules” are, but the reasons behind them. That is, good judgment isn’t an alternative to blind rule-following so much as it grows out of long experience following, and studying, the rules. This is why I’m a little uncomfortable with things like Ben Bolker’s piece in the Ecological Applications (?) special feature on hierarchical Bayesian models back in 2009 (?). Ben’s #1 piece of advice for non-experts looking to use this approach was “exercise good judgment”. But the problem is that, if you’re a non-expert, you don’t know enough about how the approach works to exercise good judgment.
The Stewart-Oaten piece is great – I actually assign it in my stats class (where I tell students you have to move beyond cookbook approaches and they groan). I’ve also assigned the Murtaugh piece in the past. The Stewart-Oaten is slightly dated in its examples, but it is spot on in explaining why judgement is needed.
Jeremy, you put your finger on the nub of the problem: how can we expect non-experts to apply expert judgement, yet how can you do something highly technical without expert judgement? I suspect it’s one reason you & I both favor simpler tried-and-true approaches.
But really the main point of my piece is I kind of expect people presenting as experts in statistical methods to also (indeed especially) apply a highly refined judgement. And in my experience they can be just as cook-book-like as the novice.
This discussion made me think of another link with theoretical studies — there, I think there’s more acceptance of a simple model being an elegant one (and favored by many people). With stats, I agree with Brian (and Jeremy, and others) that too many people reflexively think the more complicated or newer or harder to implement analysis must be better. I wonder how that difference arose. (And, maybe others will disagree with my characterization.)
Interesting post. First off: I hear you. I agree that many researchers should be using simple methods, and that using complex methods can hide the *real* issues behind particular problems. And I completely agree that reviewers should stop rejecting papers because simple methods are used (do people really do that?). And statistical ‘experts’ in ecology can act like ‘novices’ (myself included!). In general I hear you.
But…isn’t it best to have a diversity of researchers using a diversity of approaches? I was happy to see you agree that you have no problem with keeping complex methods around for when we really need them. But, as Jeremy alluded to, how are we to know when we really need them if we don’t have practice with them? ‘Tried and true’ methods have not always been tried and true. ANOVA is well-understood in ecology — sort of 😉 — because ecologists have tried it out, made mistakes, figured out how to improve practice, etc. Seems to me that there’s about a 20 year lag before new ideas in statistics become properly used in ecology. To me this is fine. How else will ecologists become proficient in these methods unless we try using them?
The point is, I think it’s a little unrealistic to ask ecologists to keep our big statistical guns away until we really need them. Don’t we need some target practice and feedback from our research community on how well we ‘shoot’? And no, I don’t think that this is just an issue of education. There is real research to be done on finding out how new (I avoided the pejorative word ‘fancy’) statistical methods react with both ecological data and the minds of ecologists.
One more point. You make a distinction between checking statistical assumptions and assessing ecological hypotheses. In my experience, checking assumptions often helps me to appreciate important aspects of my system that I wasn’t thinking about. I’ve got no problems with robustness arguments, but that doesn’t mean we shouldn’t also check our assumptions. For example, if a statistical method assumes two species’ abundances are uncorrelated, yet nevertheless gives correct results about some other hypothesis of interest, this is good — the method is robust. But, ecological knowledge was uncovered by actually checking the assumption anyway.
Thanks Steve,
I agree with most of what you say. Diversity of approach is good (in fact great). It’s the exact opposite (coalescing on a fixation with only complex approaches) that bothers me and that this post is about. Working through statistical assumptions can push our ecological science. But mandating that you can’t publish a paper that advances our science if it violates statistical assumptions (no matter how unlikely they are to matter, and with an ever increasing list of assumptions that must be met) doesn’t usually advance science. I guess maybe this is the center of where we differ. I am on the receiving end of this all the time. And I teach advanced graduate statistics, so I ought to have a license to exercise judgement, but many ecological reviewers want to revert to the one-way-to-do-it cookbook.
As far as whether we should “keep our big statistical guns hidden away”, I guess it’s a matter of grain size. Should we use new, more complicated approaches? Yes, absolutely. Should we mandate that everybody use them when they’re brand spanking new, no matter their level of statistical expertise? Probably not – better to let some experts find the sharp edges first.
Also, I don’t think something like phylogenetic correction (as distinct from actually testing and studying the degree of phylogenetic conservatism) is necessarily a big statistical gun. It does nothing new. It informs no new ecology. It just fixes up a violation of statistical assumptions related to statistical independence. And sometimes fixing this up is important and worth the trouble. And sometimes it isn’t. Sometimes it’s the proverbial “using an atom bomb to kill a fly” – not only unnecessary but detrimental.
So in short the point I most agree with is diversity of approaches is really good. And I (and most of my collaborators in macroecology) see a real lack of thoughtful acceptance of diversity from the supposedly expert end. This is bad, and I suspect you would agree.
Thanks
Yep…I agree. But when I look at the literature I see tons of diversity in statistical sophistication. Maybe we read a different literature? 😉 I suspect that our difference of opinion relates to a bit of an availability / confirmation bias problem. It would be cool to see data on this somehow…to tease out our psychological biases. But we may need a fancy statistical technique to get good estimates! 😉 This brings up another thing…simple techniques are great for big-picture questions with clear answers. But to get a precise quantitative picture of how a particular natural system behaves (i.e. to make good parameter estimates), we often need more statistical sophistication. Again, the diversity thing comes to mind, which we agree on.
Thanks for the thoughtful response.
I do teach things like mixed hierarchical linear models (and a bit of Bayesian modelling) to my students, so I definitely see a place for complex tools. But I usually tell them they might want to think about collaborating with somebody who knows what they’re doing. Complex data needs complex methods. If this has been interpreted as a post against complex methods under any scenario, then I have communicated poorly.
You are right of course that our only real difference is how much of a problem we perceive this forcing to complex methods through the review process to be. A rigorous study would be really cool!
At the moment all I can offer is anecdotal data from personal experience, but the sample sizes are starting to get up there. The last four papers I have submitted using the Breeding Bird Survey have been rejected once and often twice because I didn’t use detection methods – even though all of these papers fell in the domain where I’ve argued detection methods aren’t vital, and nobody on this post has yet disagreed with me on that point (i.e. comparing across space or time within a species using historical data). And in cases where I was given a chance at revision I made these arguments with detailed support for my position, and the papers still got rejected for my failure to use a statistic that I couldn’t possibly use, on a question for which nobody could possibly collect the data needed to address detection. This equates to saying we won’t study these questions because you didn’t/can’t use a particular statistical technique. Similarly, the few times I have snuck a paper through without a phylogenetic correction on a comparative study have shocked me (I have only managed this a couple of times). So yes, you have found me out, this is really a rant about my rejected papers!
You’re anything but a poor communicator Brian.
This is outstanding; in my opinion you hit the nail right on the head. I would add that dealing with the kind of people you describe is absolutely one of the most aggravating things in science, because not only are they wrong, they’re convinced *you’re* the one who is wrong, yet without any real defense as to exactly why, as you allude to.
Many types of problems are not improved upon by more sophisticated statistics, and indeed more sophisticated statistics only muddy the waters if it’s not really clear exactly what was done, or why, as you state; this practice is unfortunately so commonplace as to invite deep cynicism in those who see it. Many questions are only better addressed with better data, better models, or some combination of the two. But in a system that values “novelty” (“hey, look at my novel finding” – “hey great, look at your questionable or incorrect methods and conclusions”) over improved understanding, this is the kind of thing that results.
25 comments before you get one in Jim? You’re losing your touch. 😉
“Losing touch” would unfortunately be more correct Jeremy…
I hope to make a more substantive comment than my quasi-diatribe above. Sometimes I just get so frustrated with the mindless/lazy attitude I see, and we really need posts like this that point out existing problems.
One site suggestion: I personally favor leaving posts such as this one (and also many former ones), that cover broad and important topics, up at the top for an extended time, several days or so. Some of us can’t check in here frequently, but equally or more important, there is much to ruminate on, which typically increases as people add new comments. These types of discussions are really important IMO, and they can profitably go on for quite a while.
Hi Jim – your point is well taken. We have been talking about how to make sure the dialogue doesn’t stop when a really substantive post scrolls off the list due to newer posts. For now, we’re trying a “most recent comments” box on the right hand side of the home page. And in general, readers should know that it is OK and encouraged to continue to comment on posts they find interesting (thread resurrection is not bad manners here) and that others will see (and hopefully respond to) these comments because of the box.
Glad you like the post.
If you want a post to remain at the top, you can also date it a few days in the future (and then this becomes a sci-fi Ecology blog). Or make it ‘sticky’ for a few days.
Like Brian said, we considered making posts sticky. For now, we’re relying on the new recent comments and top posts sidebars to let readers know what others are reading and commenting on. We’re also scheduling posts so that really substantive ones remain on top for at least two days.
Not that this particular post needs much help attracting readers (something like 800 page views in the first 24 hours, not counting the many people who read it on the homepage!) 😉
Hello, not commented before, but a friend of mine put me on to your blog post and I felt motivated to comment on it as someone who’s had a reasonable amount of experience of phylogenetic comparative methods in ecology (having done my PhD on the consequences of phylogenetic inaccuracy on these analyses).
Broadly, I agree with the general point of your article, that needlessly complicated analyses do no one any favours, and actually might be used to cover up bad science (I’m sure many of us have reviewed papers that sound interesting but our lack of grip on the statistical approach used has made us a little uneasy that we might be having wool pulled over our eyes). I also remember being irritated by a straight bounce from Ecology on the grounds that the statistical analyses were ‘not sophisticated enough for our journal’, which bemused me because the analyses were as sophisticated as I felt they needed to be.
Having said this, I disagree with most* of your specific points about phylogenetic comparative analyses and spatial analyses (though I’m admittedly less knowledgeable about the latter). With modern phylogenetic comparative techniques (i.e. PGLS), what you are looking at is essentially a data transformation so that your data conform to statistical assumptions. Where do you stand on data transformations generally (I know there are some statisticians who say the concerns about assumptions of normality, for example, are greatly over-emphasised)? Your point about inaccuracy in the phylogenies is true – but misses the point that when you do a non-phylogenetically controlled analysis you are still, whether you like it or not, making an evolutionary assumption that all of your species are equally related to each other (essentially assuming a ‘bush’ phylogeny). This may only be rendered irrelevant if the data show no phylogenetic signal (which, of course, to find out you actually have to test using a phylogeny). I found, during my PhD, that even using quite inaccurate phylogenies produces better results than the assumption of no phylogenetic structure, when the data show some phylogenetic signal (published in 2002 in Systematic Biology – recommended if insomnia is a problem for you).
I also think your challenge that proponents should give an example of a major insight that has been offered by using these techniques misses the point. It is as odd as saying “give me one example where transforming your data to conform to normality has led to a major scientific insight”. The statistical methods themselves are not really that interesting, or the point. The question they are being used to test is the key thing.
Finally, although you made this point in relation to spatial analyses, I’ve also observed a recurrence recently of the argument that phylogenetically controlled analyses are somehow removing some potentially important information from the data. It’s not true. Or rather it is at least as true as saying log-transformation removes important signal from the data.
Interesting fuel for thought though, I hope that I get you as a reviewer for my next un-sophisticated analysis!
Matt Symonds
*the one I agree on was the point about the p value of <0.000001. Yes, that won't change regardless of the method you use (unless your system/data are very very weird). However, I don't accept that that specific case can be used as an argument against PCMs generally (it's been tried before by Ricklefs and Starck 1996 Oikos)
Hi Matthew – welcome to the blog! And thank you for taking the time to write a thoughtful and constructive comment.
I am quite familiar with PGLS (indeed have a paper in review on a modification of this technique as we speak), so I think we understand each other technically. Precisely what it is doing is fitting a model where there is correlation (i.e. non-independence) between the points. I wouldn’t exactly call this a transformation of the data as it works rather explicitly through changing the model of the error terms. You are right that ignoring phylogenetic relatedness is the same as assuming the tree is a giant polytomy which is clearly wrong, but ONLY FROM THE POINT OF VIEW OF INDEPENDENCE OF ERROR TERMS. It is quite conceivable that one (usually I am in this case) never really intended to have a phylogenetic hypothesis and thus would only even bother thinking about the phylogeny to address the phylogenetic non-independence of the errors.
Now what are the consequences of phylogenetic non-independence of the errors?
1) if the traits are phylogenetically highly labile (change rapidly all over the tree), well then the points are really independent. So we don’t need to worry.
2) Maybe there is some small amount of non-independence. What does this do? Well, effectively all it does is reduce the degrees of freedom. If I had 100 species to start, then I don’t really have 100 degrees of freedom. If there is only a little bit of phylogenetic conservatism, say between sister species only, then I might have effectively 90 degrees of freedom (when I had 100 data points). If there is quite a bit of phylogenetic conservatism, then I might have effectively 30 or 50 independent points. For a given effect size/signal, this means my p-value goes up as my number of effective points (aka degrees of freedom) goes down. If I have perfect phylogenetic conservatism, then I actually only have one point (and zero degrees of freedom, which is a no-no), but I also have no variance and I am highly unlikely to be doing a regression on the data in the first place.
That’s it. PGLS is really just a fancy way to account for reduced degrees of freedom due to partial non-independence of points. In spatial ecology you can use GLS (just as in PGLS) or you can use Dutilleul’s method (I gave a link in my original post), which literally downgrades the degrees of freedom in an OLS regression based on the amount of autocorrelation instead of fitting a GLS model. I am not aware that people do this on phylogenies, but they probably could. With phylogenies they tend to either use PGLS or PIC (there is a nice paper showing that PIC and PGLS are the same by Garland and Ives in AmNat 2000).
So, if I just do a regression between, say, body size and abundance ignoring the phylogeny, that is absolutely all that has happened. I overestimated my degrees of freedom by some amount, and thus my odds of making a Type I error (falsely rejecting the null hypothesis) are a bit higher than my p-value says. I haven’t really invoked a phylogenetic hypothesis at all (if, say, I think abundance is linked to body mass through the ecological process of resource consumption independent of evolution). I’ve just made my p-value look a bit better than it should. Importantly, I haven’t made the slope and intercept of the regression “biased” (a statistical term meaning pushed in one direction or the other), so my slope and intercept estimates are still good. I’ve just overstated how good my p-value is.
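To make that concrete, here is a minimal simulation sketch (assuming the ape and nlme packages): two Brownian-motion traits are simulated independently on a random tree, so the true slope is zero and the only phylogenetic structure is in the errors.

```r
# Minimal sketch: OLS vs. PGLS on traits simulated by Brownian motion on a
# random tree (assumes the ape and nlme packages). The traits are simulated
# independently, so the true slope is 0; only the error structure is phylogenetic.
library(ape); library(nlme)

set.seed(1)
phy <- rcoal(60)                               # a random 60-species tree
d <- data.frame(species = phy$tip.label,
                x = rTraitCont(phy),           # Brownian trait 1
                y = rTraitCont(phy))           # Brownian trait 2, independent of x

ols  <- lm(y ~ x, data = d)                    # ignores the phylogeny
pgls <- gls(y ~ x, data = d,
            correlation = corBrownian(1, phy, form = ~ species))

coef(ols); coef(pgls)        # across many such runs, both slope estimates center on 0
summary(ols)$coefficients    # but the OLS p-values run too small on average...
summary(pgls)$tTable         # ...because OLS pretends there are 60 independent points
```

Rerunning this with many seeds is the quickest way to convince yourself that the slope estimate is not biased and that only the Type I error rate is at stake.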
Does the over-stated p-value matter? Well not if any of the following are true:
a) I don’t care about p-values (most ecologists do but there are times and places it doesn’t matter)
b) my p-value is already, say, p=0.000001 – it would take a totally unrealistic amount of phylogenetic correlation to push this above p=0.05 (you noted this yourself from your real-world experience)
c) one of my regression variables doesn’t have any (or much) phylogenetic signal. In the above example, body mass is fairly highly conserved (especially if I have species spanning across families and orders), but abundance shows essentially zero phylogenetic conservatism.
So under any of these 3 scenarios, it really is 100% OK to ignore phylogenetic correlation if I don’t have a phylogenetic hypothesis to begin with. So why would I want to spend the time and energy making a phylogeny when it won’t really matter? This I guess is the one place we disagree. I don’t think a reviewer should be telling me I “have” to do a PGLS in any of these three scenarios if I’m not interested in it.
The only time I HAVE to worry about phylogenetic correlation and use PGLS is if a,b, and c are all false. Then, indeed, I should worry about it. Or maybe, even then I could ignore PGLS if I think the phylogeny is hard to obtain and my science is important enough that it ought to be published even if p is a bit >0.05 (I am very sympathetic to this view but many journal editors are not). And by all means if I already have a phylogeny at hand because I was interested in it, then sure, run the analysis with and without PGLS and report both results (more information is good if it doesn’t cost too much), but unless p was close to 0.05 without PGLS, the answer won’t differ in any important, repeatable way. Indeed all you have to do is look at all the papers published where both analyses were done and the results don’t change by much to see just how profoundly little difference it makes most times.
Now the other time I might worry about the phylogenetic correlation is if it is actually the topic of interest. How various variables change along a phylogeny and whether they are conserved or not is totally interesting, revealing of mechanism and evolutionary process, and it SHOULD be studied. But PGLS doesn’t really help me study this very much (I do get back an estimate of phylogenetic conservatism, but it’s not the best such estimate). In this case I should be using other tools.
To one of your points, I agree PGLS is not removing information. But it IS reducing effective degrees of freedom and thus statistical power, which is what I think people are saying (although it is not doing this any more than needs to be done, so I agree this argument is not a strong one against PGLS). And I’m glad to hear you have analyzed the effects of inaccuracy of phylogenies on PGLS. That always seemed a big oversight to me, so I’m glad it’s being addressed.
So, I guess I would have to conclude that if my scenario a or b or c is true, then I do think the burden of proof is on the person saying I have to do PGLS to argue why my science would be improved in any meaningful way by doing it.
And this story is identical for spatial autocorrelation and temporal autocorrelation.
Thank you again for your comment.
Wow, thanks for the long reply. I’ve been thinking on it overnight and have also read some of the comments posted since I posted. As you might expect, I’m going to have to disagree with you on a number of points here – two of which I think you are demonstrably wrong on. To get the latter out of the way first:
1) PGLS = PIC only is true in the case where there is strong phylogenetic signal in the data. Most modern packages that do PGLS explicitly calculate the strength of the phylogenetic signal and perform the appropriate data transformation of the variance-covariance matrix according to this signal (if the signal is none, incidentally, PGLS = OLS).
2) you say that PGLS gives identical slope and intercept values to OLS. This is really wrong as the literature about the estimation of the 0.75 scaling exponent in ecology (which you mention below) shows, for example. The df thing certainly applies to spatial autocorrelation controlling techniques and one or two phylogenetic comparative methods (like concentrated changes, and PIC [but for very different reasons]), but not to PGLS.
To get to your a, b, c of circumstances where PC analyses are not necessary – I would say (now moving into the realm of opinion)
a – (not needed if not interested in p values) – I can certainly think of examples where p-values are of no interest but a PC analysis is needed. What about model selection and model averaging procedures using AIC? There may not be a p-value in sight but it’s still important to control for phylogeny here (according to the question, I should add).
b – (if p 0.7) then most likely it will make no difference. But of course, there are many cases where results will be more borderline. We could get into the whole argument about p-values here generally – and there’s a point I think that could be made here about post-hoc decisions as to what is the best analysis being somewhat dodgy – but that’s for another day.
c – (not needed if no phylogenetic signal) – True, but how are you to make that judgement call without a phylogeny? In any event, as said above, most modern versions of PGLS take that into account anyway – if there is no signal then you will get an identical result to OLS.
Your point about ‘why should I do PGLS when it is not relevant to the question I am interested in’ has legs. I accept that there is value in analyses that are interested in elucidating ecological patterns without an implicit evolutionary causal mechanism behind it. But, to be provocative and to paraphrase Harvey et al.’s hilariously obnoxious responses to Westoby et al. in the 1990s, I doubt there are many biologists who would be satisfied with that level of explanation.
Just my own hilariously obnoxious two cents. All the best.
Matt
Hi Matt,
My thoughts on your numbering:
1) I didn’t mean to say the computer output was the same in PGLS and PIC – just that in a deep mathematical sense they are the same. Please read the Garland & Ives 2000 paper I referenced and tell me why it is wrong if you want to discuss this one further. To quote them “Both independent-contrasts and generalized least squares approaches start from the same statistical model (eq. [1]), use the same phylogenetic information, and give the same results.” Otherwise, not really something I care about enough to waste time defending.
2) It sure would be helpful to give specific references; otherwise it is just argument by claiming your side is right. My own quick google for “OLS biased in phylogenetic regression” brings up several papers. I only bothered to check the first two. Both confirm my claims. The p-value increased (in these cases the width of the confidence interval increases, but I didn’t say much about that one way or the other until now, though I acknowledge it is sometimes true) and, to quote the first paper, by Rohlf 2006 in Evolution, “Their means are not because [SIC] since estimates of regression coefficients are unbiased whether or not the correct phylogeny is taken into account” – exactly as I said – the estimate of slope is no different with or without phylogenetic correction. The 2nd paper listed in Google (O’Connor et al 2007 AmNat) agrees with this/me and is specifically about allometric regressions – the area where you say I’m wrong.
NB the 2nd paper clearly shows that for allometric regressions Type II regression (e.g. RMA or their LSVOR) is different and more accurate than Type I (e.g. OLS or GLS), but phylogenetic correction is a whole different issue which has no effect in their analysis. Please don’t confuse the two. You will note that I did NOT list Type II regression as a technique that I was complaining about – it’s definitely not on my list (it is simple, often important and changes answers; a tiny base-R illustration is at the end of this comment).
a) You can use AIC to compare models with or without phylogenetic correction (Pagel’s lambda does this and more), but AIC doesn’t have Type I error, and as I said, the issue with skipping phylogenetic correction is that the p-value (i.e. the estimated rate of Type I error) comes out too low. AIC is used to compare two models – there is no such thing as the one true AIC value that phylogenetic correction could help achieve. Not sure what you’re really getting at. I think you’re comparing apples to oranges.
b) As best I can tell, if p=0.000001 then no phylogenetic correction on realistic data is going to make p>0.05. That’s all I said in my point b. If you go back and look, I have repeatedly said in the original post and my replies that if p is closer to 0.05 that might be a case where one needs phylogenetic correction. But that is the case where my b is not true, which I talked about later in my reply to you, so you’re really just jumping around.
c) ummm – if you read back through my earlier replies, in one of them I specifically stated not that I think abundance has no phylogenetic signal but that others (and myself) have published studies showing no phylogenetic signal. It is possible to know these things by reading the literature.
Yes – Harvey might be the epicenter of “my way is the only way to do statistical analysis” kind of thinking I’m talking about. Quoting him back to me is just reconfirming my point that this kind of thinking exists as a problem in ecology.
I’m happy to continue this conversation, but if you’re going to contradict me please give me citations or equations, don’t just say I’m wrong. And please address what I actually said.
Hello again,
To respond one more time.
1) OK – your clarification has made your point clearer. I read your statement originally as saying the outputs of a PGLS and a PIC analysis are the same.
2) For a reference on the metabolic scaling and phylogeny issue showing differences in slope values see Capellini et al. 2010 Ecology 91: 2783-2793. Tables 2 and 5 nicely demonstrate it.
a) Of course AIC doesn’t automatically require PC. Just that you could be doing an AIC analysis where it would be appropriate.
b) Think we’ve covered this now
c) No, I didn’t miss what you said. The point here is that in order to get to the view that abundance has no phylogenetic signal, someone had to do that analysis, using a phylogeny, at some stage to show it. I would also argue that ‘other people have shown that it doesn’t apply for their data, therefore it won’t apply for my data’ is weaker than actually showing it doesn’t apply.
As for the Harvey quote – I actually don’t like his attitude either – but I think you missed the tone of playful devil’s advocate-y ness which I was intending. I very rarely make comments on blogs (or even visit them) – I remember now why.
Matt
Thanks
First – I appreciate you were bringing out Harvey tongue-in-cheek and had no problem with your attitude. It’s just that I have a BIG problem with Harvey’s attitude and went straight to commenting on it. Sorry if you felt I was tarring you with the same brush.
RE #2 – I looked at the reference. I understand what you’re saying. It is true that in any one study a PGLS and an OLS will return different estimates of the slope, but there is no reason to choose one over the other as being closer to the truth. A regression estimate of slope has error – all we really know is that it lies in the cloud around the true value. Both PGLS and OLS are unbiased (as per the two references I gave) – a statistical term with a precise meaning, but to an approximation let’s say that the clouds of estimates that could be generated are centered on the true answer for either PGLS or OLS, or that they are both on average correct, or that with infinite sample sizes they would both come up with the same value. Put another way, if you took two different samples and ran PGLS on each sample, the answers would differ, but with enough samples, or a large enough sample, they would converge on the right answer. But so would OLS (the small simulation at the end of this comment illustrates exactly this). This is a thoroughly documented statistical principle acknowledged almost everywhere, including the two citations I gave (I’m curious – did you look at them? What did you think of them?). I must just be doing a bad job of explaining it.
RE #c – I think maybe it is helpful to return to my original post. My original point is that there are scenarios where it is erroneous for a reviewer to insist that I do a phylogenetically corrected regression. If I already have a phylogeny it is easy enough to do PGLS, so I should just do it to avoid the debate. But if no phylogeny exists and it’s not what I am interested in, this is just flat out wrong in the cases I list. And yes, to argue I don’t need a phylogenetic correction requires phylogenetic thinking, and maybe even somebody else at some earlier point in some other group of organisms having done a phylogenetic analysis (but even here note that the analysis need be nothing like PGLS – most of the analyses on abundance I referenced were simple nested ANOVA variance-component analyses on genus/family/order). I never said all phylogenetic thinking was a waste of time. If you check out my reply further down to Carl I’m clear about this. But, and let me be very precise here, phylogenetically corrected PGLS itself is *sometimes* a waste of time and unneeded. This is what I claimed and still believe.
Hope I didn’t scare you off blogging. But if you’re going to say (and I’m quoting you here) “I think you are demonstrably wrong” please at least read and comment on the citations I provide and provide ones of your own. And don’t conflate “all phylogenetic thinking” with PGLS. Otherwise we’re just throwing opinions around and not having a dialogue which I think is the purpose of a blog. Thanks. I expect others have learned from our conversation and your forcing me to be clearer on points.
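To make the “both unbiased” point concrete, here is a rough simulation sketch – again simulated data on a random tree, assuming the ape and nlme packages – that generates many datasets with phylogenetically correlated residuals and compares the average OLS and PGLS slopes to the true value.

```r
library(ape)
library(nlme)

set.seed(2)
tree <- rtree(40)
true_slope <- 0.5

est <- replicate(200, {
  x <- rTraitCont(tree)
  y <- true_slope * x + rTraitCont(tree)        # phylogenetically correlated errors
  d <- data.frame(x = x, y = y, sp = names(x))
  c(ols  = unname(coef(lm(y ~ x, data = d))["x"]),
    pgls = unname(coef(gls(y ~ x, data = d,
                           correlation = corBrownian(1, phy = tree, form = ~sp)))["x"]))
})

rowMeans(est)       # both averages should sit near the true slope of 0.5
apply(est, 1, sd)   # where they differ is spread (efficiency), not bias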
While I haven’t published a Bayesian analysis, Bayesian analysis has one huge advantage over frequentist statistics: it makes sense! P(H|D) makes sense; P(D|H) (or worse, P(D or more extreme observation | H), as in p-values) is very rarely what we want to know and lends itself to all kinds of misinterpretations unless its meaning is memorized word for word. Remember that ESP paper fiasco from a couple of years ago?
I disagree 100% with everything you just said. Have a look at Deborah Mayo’s Error and the Growth of Experimental Knowledge. Also have a look at my recent post on Gelman and Shalizi’s argument that Bayesian statistics, done properly, actually is frequentist (or more precisely, “error statistical” sensu Mayo).
Jane: I knew Jeremy would jump all over you for that one, but I was surprised how instantaneously it happened! 😉
I use a WordPress plugin that automatically jumps on commenters if they claim that P(H|D) is what scientists “really” want to know. 😉
I probably don’t disagree as strongly as Jeremy. But I do disagree on several fronts:
If you are using informative priors then there is an interesting discussion to be had about whether science should proceed this way or not (Jeremy can correct me, but I think this is the discussion he is interested in). But since 95% (a made-up statistic, but probably in the ballpark) of ecologists don’t use informative priors, it is a bit of a moot point to have that discussion. Some Bayesians argue that Bayesian analysis is different because it models multiple sources of variation, to which I say: ever heard of a mixed model? My bottom line: Bayesian = likelihood + airy philosophy unless you have an informative prior.
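As a minimal illustration of that mixed-model point (simulated data with a hypothetical grouping structure; assumes the lme4 package), a plain REML fit already partitions the variation into among-group and within-group components with no prior anywhere in sight:

```r
library(lme4)

set.seed(5)
group <- factor(rep(1:10, each = 8))
group_effect <- rnorm(10, sd = 1)                 # among-group variation
y <- group_effect[group] + rnorm(80, sd = 0.5)    # plus within-group (residual) variation

fit <- lmer(y ~ 1 + (1 | group))   # plain REML fit, no priors anywhere
VarCorr(fit)                       # recovers the two variance components
```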
And no, the fact that an ESP paper with frequentist statistics got published is not evidence that frequentist statistics doesn’t make sense. Silly papers with Bayesian statistics get published. Silly papers with no statistics get published. And lots of non-silly papers with frequentist (or Bayesian, or no) statistics get published.
So, let’s say the hypothesis is “There is life on Mars”. Please tell me how P(H|D) makes sense here, but P(D|H) doesn’t. There’s either life on Mars or there’s not. The probability of the hypothesis is either 1 or 0. But the data we collect might well vary from sample to sample. And while we don’t know for sure whether P(H|D) is 1 or 0, we do know for sure it’s one or the other. Unless you’re prepared to define “probability” as “a measure of what I personally happen to believe about the balance of the evidence.” In which case I look forward to an explanation for why I, or anyone, should care about what you personally happen to believe. I think science is about the world, not about what any given scientist happens to personally believe about the world.
I realize I’m being pretty blunt here Jane. But those are really strong and highly non-obvious claims you seemed to toss off rather casually just now! I strongly suspect that we’ll have to agree to disagree because any discussion we had would simply repeat long-standing debates in the philosophical literature (and in other posts on this blog). But as Steve noted, I couldn’t let such strong claims pass without a reply; it’s just not in my nature. 😉
Very interesting!
You say “when I think about probabilities, I think about the actual frequency with which actual events in the world would occur.” Then, for the probability of life on Mars, you’ll define it as “the actual frequency with which life on Mars occurs”. You’d need an infinite number of Mars to define it, which doesn’t make any sense.
P.S.: I’m fairly new to these debates but crazy enough to pitch in (a good way to learn) 😛
Also have a look at Dennis 1996. And Taper and Lele’s edited volume on scientific evidence. And lots of other things Andrew Gelman (who considers himself a Bayesian) has written. And my old posts on frequentist vs. Bayesian statistics…
My point here is not proof by authority, which I don’t believe in. My point is that the frequentist stance is at least a serious and defensible one. You can’t just dismiss it.
Oh, and I’m pretty sure that the many people who appear to do frequentist stats correctly haven’t just memorized stock phrases word for word. Come on Jane! I suppose you can argue that frequentist statistics is hard to learn (and Bayesian hierarchical models aren’t??!!!). But plenty of people (like the people who write pretty much every paper I read) do frequentist stats properly and understand them perfectly well. You can’t seriously claim that all those people only appear to understand frequentist statistics because they’ve just memorized stock phrases word for word!
Three posts in a row by the same commenter?
I think now is one of those times to take a deep breath, Jeremy 😉
Yeah, should’ve just made them one comment. Sorry. Jane touched a nerve. In case that wasn’t obvious. 😉
OK, I won’t have time to give a proper answer to these comments until this evening, but let me give a brief answer right now.
The main theme emerging from this discussion is that the problem starts when people say “you don’t have to think or exercise judgement – there’s only one right way to do statistics”, which is what you’ve just done.
Brian, that’s one of my pet peeves as well, so please don’t read me as saying that there’s only one way things should be done. Judgment is absolutely crucial to good science; I’m 100% with you on that. I have a manuscript in prep where I used a frequentist analysis because (1) the paper also contained a null model-based analysis of other data and I wanted to be consistent; and (2) it would have been seen as overkill for such a simple analysis. Frankly, I would have liked to not do any inferential statistics on that particular data, as the picture told the whole story, but reviewers are likely to disagree.
Now, Jeremy, I’m not criticizing people for memorizing certain phrases word for word. They’re absolutely right to do so! But the fact that this is necessary is a bad sign. If very few people use a program correctly, the programmer should not blame the users; rather, it’s time to rethink the interface design. Just because one quantity can be computed from another doesn’t make them equivalent from a user interface point of view.
Yes, I want p(life on Mars | what we know about Mars) and it is NOT “0 or 1”. (What does that even mean? How can there be more than one probability?) The Bayesian definition of probability is closest to the one we use in daily life and p(H|D) is the degree of belief I should rationally give to H given what we know.
Also, everyone, please look over this article by John K. Kruschke. It highlights a rarely discussed problem with NHST. Also, he’s the author of a terrific book on Bayesian data analysis. After reading it, you will never be able to say, “But Bayesian is too hard”.
Hi Jane,
Thanks for your reply. It appears you may have misunderstood some of my comments.
When I say that the probability of life on Mars is either zero or one, what I mean is that there’s either life on Mars, or there isn’t. We don’t know for sure which it is, but we do know it’s one or the other. So if you assign some non-extreme probability to life on Mars, you’re defining “probability” in a way that I think is very problematic. When I think about probabilities, I think about the actual frequency with which actual events in the world would occur. When you think about probabilities, you apparently think of some sort of measure of how strongly you personally believe something (if that’s not what you mean by “probability”, I’m afraid I’m unclear what you do mean).
I’m glad to hear that you’re not criticizing people for memorizing certain phrases word for word. But when you say “they’re right to do so” and that it’s “necessary” to do so, you’ve misunderstood my comment, and you’re wrong. My point is that that’s NOT what people are doing. Again, feel free to argue that frequentist approaches don’t make sense, and that people who use them are doing statistics wrong. But it’s breathtakingly false to claim that most people doing frequentist statistics don’t even understand how frequentist statistics is purported to work, and so are just spouting memorized phrases in order to feign understanding! It’s true that everyone uses similar-sounding phrases to describe frequentist statistical results. But that just reflects the fact that most scientific papers are written in a very dry and formulaic way, not that nobody really understands frequentist statistics and we’ve all just memorized stock phrases!
You lost me with your comment on computing one from the other. I know what you mean by that, but it wasn’t a point I made and I’m not sure how it’s relevant to the points I made.
If in daily life we sound sort of Bayesian when we talk about probabilities, perhaps that just shows that good science requires more complex reasoning, and more precision in our use of words, than casual conversation in everyday life. Our everyday experience, common intuitions, and everyday ways of speaking are totally out of line with the conclusions of pretty much every area of advanced science. Imagine trying to work out quantum theory by taking as given our everyday notions of “causality”. So I’m sorry, but I don’t see the scientific or philosophical relevance of our everyday intuitions and casual use of words.
Your analogy between frequentist statistics and a badly-written computer program is a poor one; the analogy assumes what it purports to demonstrate. The analogy assumes that there are various philosophies of statistics we are free to choose more or less arbitrarily, based on whichever one works best, is easiest to use, or has some other desirable property. Analogous to how we are free to write a computer program however we want, in order to make it easy to use, quick to run, etc. But that’s precisely the philosophical issue–whether or not any given philosophy of statistics in fact has the properties we want it to have.
You neglected to provide the link to the Kruschke piece.
I don’t say that Bayesianism is too hard. What I deny is that it is inherently any easier to do well than frequentist stats.
I don’t say there’s only one right way to do statistics. But I do think there are wrong ways. I also think there are right and wrong (or better and worse) justifications for doing statistics in one way rather than another. If you say that we shouldn’t do Bayesian statistics because P(D|H) makes no sense, I think that’s a bad argument for doing Bayesian statistics. If you say that we should do Bayesian statistics because it’s easier to do properly than frequentist statistics, I think that’s a bad argument for doing Bayesian statistics. If you say that we should do Bayesian statistics because it’s consistent with our everyday intuitions and casual ways of speaking, I think that’s a bad argument for doing Bayesian statistics. If you say that we should do Bayesian statistics because “probability” means “personal degree of belief in a proposition”, I think that’s a bad argument for doing Bayesian statistics. But conversely if you argue (as Andrew Gelman and commenter Eric argue) that Bayesian statistics with weakly informative priors is just a technical trick for “smoothing” likelihood-based inferences, I think that’s a perfectly reasonable argument for doing Bayesian statistics. If you say, as Andrew Gelman does, that Bayesian statistics with an emphasis on posterior predictive checks is a good way to pursue an approach to science based on rooting out and eliminating errors (Deborah Mayo’s “error statistics”), I think that’s a perfectly reasonable argument for doing Bayesian statistics. If you say that Bayesian statistics with MCMC is just a calculation technique for fitting complicated, difficult-to-fit models but not actually a philosophy of inference (as someone like Subhash Lele might argue), I think that’s a perfectly reasonable argument for doing Bayesian statistics. If you say that it makes sense, in a particular case, to focus on P(H|D) because in this particular case the truth of the hypothesis really is a random variable in a frequentist sense, I think that’s a perfectly reasonable argument for doing Bayesian statistics. And choosing a better or worse justification MATTERS, because it affects the practical judgment calls that you actually make. For instance, if you justify your Bayesianism by an appeal to subjective probability (as many leading philosophers of Bayesian statistics do), you will not only have no reason to bother with Gelman-style posterior predictive checks, you will regard them as violations of the proper Bayesian approach.
Yes, doing statistics well absolutely involves judgment calls. But “judgment calls” doesn’t mean anything goes. Good judgments are based on good principles. Bad judgments are based on bad principles, or on no principles at all (ad hoc judgments).
Hi Jeremy,
Here’s another link to the Kruschke article. http://www.indiana.edu/~kruschke/articles/Kruschke2010WIRES.pdf I hope it works this time!
Please ask your students what a p-value is. If they get it right, they’ll almost certainly say a very specific phrase. Try to rephrase it and you’ll most likely get it wrong. See, for example, this post by a person who was really trying to get it right. https://plus.google.com/u/0/100576812671317518568/posts/he9T78cuMoV
I didn’t say that people who memorized the meaning of p-values didn’t understand them. (Many scientists don’t, but they’re the ones who haven’t memorized the definition.) I said that p-values are notoriously easy to misinterpret. I can see how you got one from the other, but they’re not the same.
More this evening.
Jane
Thank you for the link. I’ve skimmed the article. It’s full of the standard uninformed, laughable distortions of frequentist statistics that subjective Bayesians have been spouting for decades, and that Deborah Mayo has made a career out of demolishing. All that paper shows is that, if you assume that subjective Bayesianism is the right thing for scientists to do, frequentist statistics is the wrong thing to do. I didn’t see any arguments in that paper that don’t simply assume their conclusions.
If that Kruschke article is what guides your own judgment calls about what statistical methods to use, then with respect, I think you are seriously wrong. I encourage you to at least read some Andrew Gelman and try to separate the wheat from the chaff in terms of reasons to be Bayesian, and what sort of Bayesian to be.
I have asked my students what p-values mean. I teach introductory and advanced biostatistics – I’ve asked hundreds of undergraduate and graduate students what p-values are over the years. I’ve almost certainly asked many more students than you have, given what I teach and how long I’ve been teaching it. Indeed, in the classes I teach, I bang on at length about the interpretation of probability and the rationale for the entire frequentist approach. My students mostly get questions about these subjects right, and those questions are explicitly designed to test their understanding as opposed to their mere ability to memorize slogans. With respect, I don’t appreciate the implication that I don’t know how to teach my own students, or that I must not have met many students. You seem so certain that most students misinterpret p-values that anyone who claims otherwise must be uninformed about what their own students do or don’t understand. Which, frankly, seems like a rather arrogant stance. You seem to have a pretty strong “prior” here, so that the new data I’m providing isn’t really affecting your views on what students do or don’t know. Funny that.
And again, what students do or don’t know has nothing to do with the justification for one’s choice of statistical method. I don’t know if frequentist statistical methods are “notoriously” easy to misinterpret, although I’m sure they get misinterpreted. So do Bayesian methods. So what? I remain mystified why you keep bringing this up. Especially since you now grant that practicing scientists actually mostly do understand p-values! Are you seriously claiming that practicing scientists should base their choice of statistical method on what their students do or don’t understand?
I’m only willing to continue discussing these issues if I think the conversation is going somewhere, that we’re making progress coming to understand each other’s views, and that other readers will find the discussion valuable. I have the sense that our “rate of progress” is now very slow, and that we’re close to the point where further discussion would have no value, to us or to other readers. There are lots of good conversations going on this thread. I’m not going to clog up the thread further going over ground that, as another commenter points out, has been well-trodden over more than 80 years. Nor am I going to clog it up with repeated attempts to clarify what I’ve written, or to try to get you to clarify what you’ve written. My decision as to whether to approve, and reply to, further comments on the issues we’ve been discussing will depend on my own judgment as to whether those comments continue to advance a worthwhile conversation.
Hi Jeremy,
I will make one more attempt to explain my views. At least this conversation may prove educational to future readers.
As Brian guessed, my Bayesianism is based primarily on #3, with a good helping of #1. (No, it does not come from a single article or single author.) In particular, I like the Bayesian definition of probability as a degree of rational belief, for two reasons. First, the frequentist definition of probability requires one to have a long run of data, which we rarely do. I think it’s useful to be able to speak of the probability of a nuclear power plant having a meltdown this year; the frequentist definition does not allow us to do this. Second, I think the macroscopic world is basically deterministic and we use probability to deal with uncertainty. This is highly compatible with the Bayesian definition of probability but not the frequentist one.
Second, as you must know from all the reading you’ve done, NHST is based on a logical fallacy. If you are testing a hypothesis, the outcome of your test should be a statement about the hypothesis. A p-value is a statement about the data that secondarily mentions the hypothesis. In moving from p(D|H) to a conclusion about H (without using Bayes’ theorem), we end up making the same mistake as the following syllogism:
If a person is an American, then he is probably not President.
This person is President.
Therefore, he is probably not an American.
This has the same form as:
If the null hypothesis is true, the data is unlikely.
This data was observed.
Therefore, the null hypothesis is probably not true.
(Maybe we should call NHST Tea Party statistics!)
That should be enough, but let’s move on to the practical side. My remarks about student understanding were not meant to cast any aspersion on your teaching. If most of your students understand what p-values mean, wonderful! You must be a great teacher. Unfortunately, there is quite a bit of literature showing that both students and researchers routinely make fundamental errors about the meaning of significance tests. (BTW, I did not say that most scientists understand them.) See:
[linked PDF: haller.pdf]
http://tap.sagepub.com/content/5/1/75.full.pdf+html
http://www.tandfonline.com/doi/pdf/10.1080/00207590244000250
This literature is primarily about psychologists and psychology students. Ecologists tend to know more math and may do better. Maybe.
Yes, doing statistics well absolutely involves judgment calls. But “judgment calls” doesn’t mean anything goes.
Amen! Here, we agree. And I say exactly the same thing with regard to priors.
I’ve read some of Deborah Mayo’s work, but more on the philosophical side than the statistical side. I do find intriguing Aris Spanos’ demonstration that error statistics find Kepler’s, but not Ptolemy’s, model of the solar system satisfactory. I assume that Mayo isn’t calling for severe tests of null hypotheses, which are often known ahead of time to be false!
We’ll have to agree to disagree. I think you’re deeply confused, not merely mistaken in the way I believe all subjective Bayesians are mistaken, and so would discourage interested readers from taking your comments as a guide to either subjective Bayesianism or frequentism.
Re: your remark about rationality, Bayesianism of the sort you practice is about rational updating of beliefs. Put another way, it’s about “coherence” of your prior and posterior beliefs. As one prominent subjective Bayesian once wrote, Bayesianism of this sort is nothing but a “code of consistency” for the person applying it. Among other things, subjective Bayesians of this sort have no reason to care where priors come from initially, or whether those priors have anything to do with previous data or how the world actually is. Indeed, on this view there’s nothing wrong with lying about your prior beliefs, as long as they conform to the axioms of probability, and as long as you update them via Bayes Rule when new evidence comes in. This sort of Bayesian believes that it’s never ok to go back and change your priors, never ok to say things like “In retrospect, it was silly of me to have the beliefs I had at the time”. This rules out things like Andrew Gelman’s posterior predictive checks, which amount to changing one’s priors. It is of course highly debatable why scientists should care about, and only about, the “coherence” of their past and future beliefs, independent of the truth of those beliefs.
You don’t understand the logic of frequentist statistics. Indeed, you’re so much of a subjective Bayesian that not only don’t you understand it, you think others “must know” what you think you know. So if I “must know” that frequentist statistics is fallacious, why do I and the many others like me claim otherwise? You must hold very odd views about the psychology of your colleagues.
I let the comment about Tea Party statistics go through because I assume you mean it in jest. But that sort of comment runs a high risk of being misunderstood and offending your colleagues (and I say that as someone who takes similar risks from time to time). I assume that’s a risk you’re happy to run.
Why you think there are practical issues worth talking about when you also think frequentist statistics is logically fallacious is unclear to me. And how you in good conscience could once have published a frequentist analysis yourself, despite believing that it’s logically fallacious, is quite beyond me. That’s very far beyond the usual sorts of compromise that authors make with referees. If a referee ever tried to force me to commit to what I thought were logical fallacies, I’d withdraw my paper. And I’d have done so even as a grad student; it isn’t having tenure that gives me the very minimal courage required not to attach my name to logical fallacies. I’m quite surprised that someone who professes to believe in subjective Bayesianism, and thus to care only about the coherence of her past and future beliefs, would also be prepared to set her beliefs to one side whenever it was expedient to do so. I wonder if you realize just how cynical and calculating you sound.
I’m glad to hear that you’ve read a bit of Mayo and Spanos, though I’m unclear how you could simultaneously think that frequentist statistics is fallacious, and that Spanos’ work is “intriguing”. Or how you could fail to understand the logic of frequentist statistics after having read Mayo’s philosophical work. Or how you could find “intriguing” work that violates everything that you as a subjective Bayesian profess to believe about statistics, as Spanos’ work does. The error probabilities that he and Mayo use to quantify the “severity” of statistical tests (of any hypothesis, null or not) are frequentist probabilities of D|H, which you think make no sense. You’re very confused.
(EDIT: Having re-read this comment after posting it, I think I posted too quickly and said some things I shouldn’t have. That’s my bad; my sincere apologies. It wasn’t appropriate for me to stray from substantive issues into suggesting or implying that you’re cynical or calculating, or to imply that I know better than you when a comment might be seen as offensive by one’s colleagues. I do think that you’ve expressed contradictory views, but I should’ve stuck to pointing out those contradictions. I shouldn’t have suggested or implied personal criticisms of you for holding contradictory views, or talked about how difficult I find it to understand how anyone could hold the views you hold. You’ve taken my comments in good spirits and given as good as you’ve gotten in our discussion, and for that I thank you.)
I’m glad you agree that there’s no further value in our conversation, so there won’t be any need for me to block further comments on these issues.
I hope Jeremy doesn’t block this, as I think he asked one very important question that I want to answer. Why did I do an NHST analysis when I think NHST is based on a logical fallacy? Isn’t this too much to compromise on?
In this case, I think the answer is no. The analysis in question is a simple two-group comparison. I plotted the data in a stripchart and saw no difference worth remarking on between the two groups. That would have been enough for me, but editors and reviewers tend to want inferential statistics, so I did the test and came up with no significant difference. While I’d be happy to remove that analysis if an editor said it was OK, I think it is ethically acceptable to keep it because nothing hangs on that test. The plot speaks for itself.
Sorry to have mischaracterized your attitudes on applying judgement and allowing multiple approaches! I’ll let Jeremy handle most of the rest of your points (it clearly matters to him since I think he set a record for most consecutive unanswered posts 😉 )
As I noted in my response to Eric’s recent posts, one of the things I have a hard time with in the whole Bayesian debate (aside from the fact the rhetoric levels get very high on both sides) is that Bayesian really needs to be unpacked as a concept. Otherwise, you can have two conversations going on at the same time without ever really talking about the same thing.
To my mind there are 3 key concepts in Bayesian statistics sensu lato:
1) The idea of an informative prior – prior knowledge melded with current data
2) The idea of MCMC and maybe a smoothing effect of a weak prior as a way of making likelihood calculations work better
3) The core philosophical approach about what probability is, how a scientist really thinks about the match between their data and their model, etc. – what I called in my original post the (to me) subtle distinctions between a confidence interval and a credible interval (a tiny numerical example follows this list)
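A toy numerical contrast for point 3 (made-up data: 7 successes out of 20 trials): a frequentist confidence interval next to a Bayesian credible interval under a flat Beta(1, 1) prior. The numbers come out similar here; the difference is in what the intervals are claimed to mean.

```r
x <- 7; n <- 20
binom.test(x, n)$conf.int                  # 95% (Clopper-Pearson) confidence interval
qbeta(c(0.025, 0.975), x + 1, n - x + 1)   # 95% equal-tailed credible interval, flat prior
```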
It seems to me there are Bayesians out there who care about only one of these ideas, two of these ideas or all three.
To me as a statistical pragmatist, it makes it really hard to know what we’re talking about when somebody says Bayesian. I wish there were three separate words for these three separate issues.
To make sure I understand correctly, Jane, you are mostly interested in #3 in why you perceive Bayesian approaches as desirable? And Jeremy – I think you are most concerned with #1 with maybe a secondary concern in #3 – is that true?
Thanks
Brian: There are many more than three kinds of Bayesians, and Bayesians seem to know this already:
[linked PDF: kass.pdf]
Interesting. Thanks for the link.
I don’t think this is communicated to those outside the inner circle though. You seem to know a lot more about Bayesian approaches than I do. My impression is that Bayesians often tend to use this multifaceted nature in a bit of a cheating way – having their cake and eating it too – by sliding around in what they emphasize depending on the argument they are in. But maybe this is *way* too cynical and it is my failure in lumping distinct people/arguments into one big category. Any thoughts on this?
Attacking all of Bayesian statistics or all of frequentist statistics just seems weird at this point, no? Especially after ~80 years of debate amongst very smart statisticians, without any synthesized resolution.
I agree – it seems weird to me too. I believe my first sentence in my original post was something like “Bayesian is a bit of a mixed bag”. I will quickly add that I think frequentist is a bit of a mixed bag too (I hate p=0.05). But this doesn’t seem to stop a lot of people 😉
I think that is exactly why I was trying to unpack things a bit. Not that we all are going to finally resolve the debate here on this blog. But in the pedagogical aspect of the blog, I think it is useful to start separating out the menu so you can pick and choose what you like or don’t like. I’d love it if somebody wanted to unpack frequentist into a similar menu.
I didn’t intend my comments as a blanket attack on all Bayesians of all stripes, but that perhaps wasn’t as clear as it should’ve been from the context. My comments were directed at the specific sort of Bayesianism that I believe is implicit in Jane’s comments. This is the sort of subjective Bayesianism that predominates in much of the philosophy of statistics literature, but which is in my experience rather rare among practicing scientists. It’s the sort of Bayesianism that Deborah Mayo is most keen to shoot down on her blog and in other writings. As I hope my most recent reply to Jane makes clear, I think there are lots of perfectly good arguments for (certain flavors of) Bayesian statistical approaches. My strong objections are limited to certain other flavors of Bayesianism, and to bad arguments for Bayesianism. In statistics, as in many things, you want to do the right thing (or maybe better, do a reasonable, defensible thing), and you want to do it for the right (or at least good) reasons.
Gelman has at least one old post on “what makes you Bayesian”, which IIRC notes that different prominent Bayesians have given very different answers.
Much the same is perhaps true of frequentists. Deborah Mayo likes to emphasize how Pearson actually didn’t buy into what’s nowadays often called the “Neyman-Pearson” rationale for frequentist statistics. Rather better known are Fisher’s various philosophical differences with Neyman and Pearson (e.g., over the need for alternative hypotheses). And Mayo herself has a sophisticated and somewhat unconventional rationale for frequentist approaches, which motivates her focus on test “severity” (a novel concept related to, but not the same as, the traditional idea of statistical power).
Great comment Brian. Here’s how I see it (heavily influenced by Andrew Gelman…but without any guarantees that he’d agree with me) and would love to hear different perspectives.
I find that Bayes provides a more useful framework for constructing procedures for getting the best possible parameter estimates for a particular problem. For example, someone mentioned ‘perfect separation’ earlier I think? This is a phenomenon in logistic regression where all of the ‘1s’ are at one end of the predictor while all the ‘0s’ are at the other. In other words, the 1s and 0s are perfectly separated. The ‘frequentist’ estimate (more precisely the maximum likelihood estimate) is a coefficient of infinity (or negative infinity). I don’t care what aspect of ecology you’re studying…NO predictors have an infinite effect! Ecology’s just not that simple (e.g. the probability of presence or absence of a species under certain environmental conditions is NEVER zero or one). The Bayesian approach to this problem is to put a prior on the coefficient to keep it from being so ridiculous. The prior tells the estimator that logistic regression coefficients don’t get much bigger than +/- 2 say, in ecology (for standardized predictors). BTW, bayesglm in the arm package in R allows you to do this kind of thing in a way that is essentially just as easy as using regular glm. POINT: Bayes often gives better estimates (but see Larry Wasserman’s difficult but interesting theoretical work for exceptions in non-parametric cases).
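Here is a minimal sketch of that separation point (toy made-up data, assuming the arm package is installed): plain glm() chases an essentially infinite slope, while bayesglm(), with its default weakly informative Cauchy prior, returns a finite and more ecologically plausible coefficient.

```r
library(arm)   # provides bayesglm()

x <- c(-2, -1.5, -1, -0.5, 0.5, 1, 1.5, 2)
y <- c( 0,    0,  0,    0,   1, 1,   1, 1)   # 0s and 1s perfectly separated along x

coef(glm(y ~ x, family = binomial))        # slope runs off toward infinity, with warnings
coef(bayesglm(y ~ x, family = binomial))   # finite slope, shrunk toward plausible values
```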
The problem with classical (i.e. old-school-ivory-tower-I’m-better-than-you-because-I’m-a-Bayesian) Bayes – for me – is that we need more than Bayes’ theorem to make sure that our models are actually reasonable (i.e. that they don’t look ridiculous compared to the data). In other words, we need to cross-validate and calculate p-values (although I prefer confidence intervals) to check our Bayesian models; otherwise, we shouldn’t accept our Bayesian estimates as ‘better’ (as I suggested above). The other related problem with Bayes (as it is often practiced) arises when we try to compare posterior probabilities of various hypotheses (or models): we can end up thinking that we are more sure than we should be about how good a particular model is. For example, it’s common in this kind of thing to have one model with very high posterior probability (i.e. 0.9 or something). This might lead people to think that this model is pretty freakin’ awesome. BUT…before we can do that…we need to calculate p-values and do cross-validation to make sure that we didn’t just start with a bunch of crappy models and simply chose the best amongst the crap.
But after ~80 years of debate…one could go on and on and on discussing the relative strengths and weaknesses.
Apologies Jeremy. I agree with you here…didn’t mean to direct that comment your way in particular.
No apologies needed. If anything, I should apologize for not checking more closely who you were directing your comment towards. I’ve never had a thread anything like this; it’s becoming a time suck for both Brian and me to keep up. So please excuse us if in our haste we don’t reply right away to every comment, or reply to the wrong ones.
Our plan in the immediate future is to write some posts that suck, so as to ensure we don’t have to deal with any super-long, super-interesting, super-active threads again for a while. 😉 Just kidding, threads like these are rare and the resulting time suck for us is a nice problem to have.
UPDATE: And BTW, as I’ve said in a recent post, I’m on board with you and Gelman here, Steve. You’ve articulated that point of view very well.
Brian, when you asked which point I was most concerned with from your little three-item list, my biggest concern would be #3. I’m mostly only concerned with #1 (informative priors) in cases where I’m already concerned by #3. If you’re adopting a subjective, personalist interpretation of probability (#3), that’s bad enough–but to then build your subjective personal beliefs into your analysis via the prior (#1) is even worse!
As Deborah Mayo points out, frequentists actually have lots of ways they can draw on “prior information” in a broad sense. Your prior information can inform things like your study design, for instance. She also has a lot of incisive things to say about the claim (which fortunately no one on this thread has been silly enough to raise!) that subjective Bayesians and frequentists are both subjective, it’s just that subjective Bayesians openly admit it.
Jeremy:
“She also has a lot of incisive things to say about the claim (which fortunately no one on this thread has been silly enough to raise!) that subjective Bayesians and frequentists are both subjective, it’s just that subjective Bayesians openly admit it.”
Wait…that’s not true? I thought that no analysis (Bayesian or not) is perfectly objective (i.e. subjective to some degree), but that it is incumbent on the analyst to try ‘as best they can’ to reduce the degree of subjectivity ‘as much as possible’. No? What am I missing?
I think this is part of the reason why I find this statistics stuff so addictive…I’m still coming up against arguments I’m not familiar with after over 10 years of studying this stuff. Was Malcolm Gladwell wrong about the 10,000 hours thing? Maybe just in my case?
Hi Steve,
No, you’re not missing anything. It’s not that any statistical approach is or could be perfectly objective, whatever that might mean. But broadly speaking, frequentist statistics (and certain flavors of Bayesian statistics), done right, is a set of tools and techniques for making statistics as objective as it can be (which frequentists, and some flavors of Bayesians but not others, believe is desirable based on deeper considerations about the goals of science as a whole, of which statistics is only a part). This is basically what Deborah Mayo means by “error statistics”: a set of statistical tools that, used well, helps you localize, quantify, and eliminate errors. Basically, it keeps you from fooling yourself or making mistakes about what scientific conclusions you can or can’t reliably infer from the data.
Conversely, just stating one’s priors hardly amounts to being “open” about all the ways in which a Bayesian analysis, like a frequentist analysis, involves subjective judgment calls.
Agreed Jeremy. I think we’ve come to an understanding. Except the ‘as objective as it can be’ part sounds a little extreme…but that’s probably a conversation for another day. 😉
I realize I’m jumping in here a bit late, but I think there are a couple of points that haven’t been touched on yet. In order of your objections:
1. Bonferroni corrections: I heartily agree with this; I’d say Bonferroni corrections are in fact one of the worst ways to correct for multiple comparisons in most situations they’re used for, up to and including just running independent t-tests for each comparison. That’s because Bonferroni increases the difficulty of any given test passing, while totally ignoring that by doing multiple comparisons you often have multiple sources of information that could be pooled together. Any other method (FDR, multi-level modeling, etc.) at least tries to incorporate information between tests (see the little p.adjust() comparison after this list). The only time I’d generally recommend Bonferroni is if someone was doing forward or backward variable selection, and then I’d ask them why they were doing that in the first place. I think this is pretty much in line with what Garcia is saying (still reading that paper though).
2. With regard to phylogenetic corrections, and spatial and temporal data: I understand all the arguments about the general robustness of standard analyses and that if you have a very significant result, it won’t shift to insignificance. However, that’s only if you care about the point estimate for a parameter and not its confidence interval, whereas I’d say the point estimate of a parameter is the least important part of a confidence interval. Given the number of meta-analyses these days, I don’t think we can disregard the effect an overly tight confidence interval can have on future analyses including the author’s data. In cases where there isn’t a good phylogeny, or the temporal data is poor, etc., or when it’s known that a given factor is very unlikely to have dependence issues (such as your example with abundances in a phylogeny), I’d agree that you don’t need to automatically run those tests, but then there should be something in the paper to that effect, saying why you didn’t see the need to test for that. If we’re following an error-statistical philosophy of science, I think it’s crucial that we spell out in the text how strongly tested our estimated parameters are, and what potential errors may still be affecting the results.
3. Detection probabilities: nothing really to add here.
4. Bayesian statistics: This may seem odd, since I think of myself as a pretty dyed-in-the-wool frequentist, but I think you’re missing a middle ground between fully informative priors and maximum-likelihood methods where Bayesian methods are useful: for smoothing. That is, maximum likelihood often has problems in many cases (such as separation in logistic regression), which even weak smoothing can help prevent. In those cases, the prior is acting the same as a penalty term would in penalized maximum likelihood; that is, preventing breakdown of the analysis in edge cases where just using the raw likelihood would in fact lead to poor power to distinguish between alternatives. That being said, though, I’d often agree that a Bayesian approach can be overkill for an analysis.
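As a toy illustration of point 1 (made-up p-values, not anyone’s real data), base R’s p.adjust() makes the contrast easy to see: Bonferroni wipes out far more of the comparisons than a false-discovery-rate adjustment does.

```r
p <- c(0.001, 0.004, 0.008, 0.012, 0.020, 0.025, 0.040, 0.048, 0.20, 0.60)

round(data.frame(raw        = p,
                 bonferroni = p.adjust(p, method = "bonferroni"),
                 BH_fdr     = p.adjust(p, method = "BH")), 3)

sum(p.adjust(p, method = "bonferroni") < 0.05)   # 2 of 10 still "significant"
sum(p.adjust(p, method = "BH") < 0.05)           # 6 of 10 still "significant"
```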
Re: 4, your suggestion to think of weakly informative priors as a technical trick purely for the purpose of smoothing is, if I’m not mistaken, very much Andrew Gelman’s position as well.
Re: lateness, you’re not at all late by the usual standards of this blog. But this post is threatening to win the entire internet. 😉
Thanks Eric
A very thoughtful post (and the kind of discussion I was hoping to stimulate).
My thoughts
1) we agree – nothing to add
2) I am a bit confused by your point about confidence intervals. It seems as if you are saying confidence intervals are important to later meta-analyses, and thus having a confidence interval that is too tight due to ignoring autocorrelation has bad long-term consequences – have I got that right? Interesting point about meta-analyses, but when these methods are being used as a gate-keeper to whether something is published or not, I think that is having a bigger impact on the meta-analysis through selection bias. Also, my general sense (primarily from the spatial, not phylogenetic, literature) is that it is claimed that using GLS to model autocorrelation gives more efficient estimators with tighter confidence intervals, not the other way around. But overall this seems to me like something that is not well established (to my knowledge it has been studied only with simulations and seems to depend a lot on the configuration of the data). Also, macroecology with its famous 3/4 power law is often more concerned about the estimated value (understanding of course that some sophisticated handling of error in estimation is needed).
4) Very interesting point. I’ve got nothing against penalized likelihood (I’m using it heavily in a project with spline interpolation right now). But I keep getting hung up on the priors. Either you’re really using an informative prior, in which case this needs to be honestly stated and a whole debate needs to be had, or you’re not trying to bring any prior in, in which case it is likelihood. Wouldn’t your example (and, as Jeremy notes, Gelman’s argument) of smoothing be more correctly framed as a roughness penalty on the likelihood than as a prior if we’re not bringing in prior information (the tiny sketch just below shows the two framings give the same estimate)? It seems to me Bayesian supporters try to get away with the prior being everything from an innocuous detail to an earth-changing factor depending on how they want to play it.
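A tiny sketch of the equivalence being asked about (toy logistic-regression data I made up): maximizing the log-likelihood minus a quadratic roughness penalty on the slope gives the same estimate as the posterior mode under a Normal(0, 1/lambda) prior, because the two objective functions differ only by a constant.

```r
set.seed(4)
x <- rnorm(30)
y <- rbinom(30, 1, plogis(0.8 * x))

loglik <- function(b) sum(dbinom(y, 1, plogis(b[1] + b[2] * x), log = TRUE))
lambda <- 1   # penalty strength = 1 / prior variance

penalized <- optim(c(0, 0), function(b) -(loglik(b) - lambda / 2 * b[2]^2))
map_mode  <- optim(c(0, 0), function(b) -(loglik(b) +
                     dnorm(b[2], mean = 0, sd = sqrt(1 / lambda), log = TRUE)))

rbind(ridge_penalty = penalized$par, normal_prior_mode = map_mode$par)   # essentially identical
```

So whether we call the extra term a roughness penalty or a weak prior is largely a question of framing, which is the point at issue here.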
Thanks for the post!
Thanks for the response, Brian, and for hosting such a fantastic discussion on this!
For my point regarding confidence intervals*: I generally regard confidence intervals as more important than point estimates (that is, if I had to choose between seeing someone’s point estimate for a parameter and their estimated CI for the parameter, I’d choose the latter). That’s because, in general, CIs at least give a measure of which parameter values aren’t supported by the data, assuming the model is true. Either a too-tight or a too-loose CI can be problematic for someone using the paper. I gave meta-analyses as one example where having accurate CIs is more important than a point estimate, but I can easily think of others, such as trying to use demographic parameters from papers to estimate a population model (something I’m working on now, which was slowed down immensely by the authors of some key papers not publishing CIs at all). If a given method (not taking dependencies into account) tends to bias CIs larger or smaller, I would consider that a problem (the small simulated example at the end of this comment illustrates the too-tight case). I think a reasonable question a reviewer could ask an author is: “Would you feel comfortable with someone else using your CIs as an accurate assessment of the uncertainty of this value in future?” For me, that means an author should be able to demonstrate that they have done their utmost to ensure that they have tested for potential issues that may lead to their uncertainties being mis-estimated. That is, they have tried to rule out potential sources of error that may radically shift interpretation of their data.
All that being said, I think I still largely agree with your thoughts on this, for a couple reasons:
1. As (I think) you’ve said in previous comments, the measure of uncertainty you use should be heavily influenced by what you’re trying to say with a given analysis. If you’re just showing that there’s a trend in, say, body size in a group, and you don’t care what particular mechanism is causing it, then a PIC analysis isn’t needed. Similarly with a temporal trend: if all you’re trying to show is the average relationship between two variables over time, and you aren’t worried about the mechanism that generates the relationship (or out-of-sample predictive power), time-series techniques will be overkill.
2. I think this is related to the discussion you and Carl had: if there is a potential issue with a given dependence structure changing the interpretation of the data, I think the reviewers should be recommending tests for that given dependence issue, not specific models of that dependence. There are a great many ways that, say, time series can exhibit temporal dependence, and a linear first-order autoregressive process is only one of them (the same goes for spatial analysis, phylogenetic analysis, network analysis…). I get the impression that many people think the way to test for spatial dependence is to fit a spatial GLM to the data and see if it fits better than the simple analysis. This is related to things Andrew Gelman, Deborah Mayo and Aris Spanos have said: we should be fitting models and then testing for specific areas where the given model poorly fits the data; just because a model includes a spatial auto-covariance term doesn’t mean it’s a better model than a simple regression. It’s only a better model if the actual dependence structure of the data-generating process looks like a spatial auto-covariance model.
*When I say CIs, take this as shorthand for “measures of uncertainty”; quite often, standard errors or credible intervals are excellent for understanding uncertainty.
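To illustrate the too-tight-CI point, here is a rough sketch with simulated spatially autocorrelated data (my own construction; it assumes the nlme package): an ordinary regression and a GLS with an exponential spatial correlation structure give similar slopes, but the naive interval is typically narrower than it should be.

```r
library(nlme)

set.seed(3)
n <- 150
lon <- runif(n); lat <- runif(n)
L <- t(chol(exp(-as.matrix(dist(cbind(lon, lat))) / 0.2)))  # exponential spatial covariance factor
x <- drop(L %*% rnorm(n))                                   # spatially structured predictor
y <- 1 + 0.5 * x + drop(L %*% rnorm(n))                     # spatially structured errors
dat <- data.frame(x, y, lon, lat)

ols  <- gls(y ~ x, data = dat)                                         # ignores the autocorrelation
spat <- gls(y ~ x, data = dat, correlation = corExp(form = ~ lon + lat))

intervals(ols,  which = "coef")   # similar slope; interval typically too narrow
intervals(spat, which = "coef")   # similar slope; wider, more honest interval
```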
Not a whole lot to add because I agree with every thing you said.
I do want to emphasize that I agree that as a general rule confidence intervals (or even better probability distributions or likelihood surfaces) of estimates are much superior to point estimates. Given a choice I will always go for the variance in addition to the mean.
How I see this intersecting with PGLS/spatial regression etc. is that there is a bit of a difference between having/reporting a confidence interval and making it tighter. One should always report the confidence interval. Making a confidence interval tighter is a good thing worth fighting for. But narrower confidence intervals don’t come cheap. After all, the best way to narrow the confidence interval is to collect more data. And this is a cost-benefit trade-off. Getting a narrower confidence interval sometimes (but not always, in my reading) occurs IF you collect the data to develop a phylogeny, which is good, but it’s a trade-off that should be evaluated in a cost-benefit framework, not a “must do”.
Thanks for the great ideas and sharing!
Excellent post and discussion! It seems like a lot of the objections raised in the post basically come down to inappropriate use of statistical methods, so to me setting this up as a “complex vs simple” method thing is a bit of a red herring. Obviously there are (lots of) times when each is used in an inappropriate context, and it doesn’t seem any better justified to do so with a simple method than with a complex one. (Of course there are more ways to go wrong with more complicated methods, but I don’t think that was Brian’s point.)
The discussion raises the very important question of both “cookbook statistics” and custom, obscure approaches, and the role of R packages in providing fighter jets without a pilot’s license (or other terrible analogy). The genie isn’t going back into the bottle. I think the only way forward is for publications to share the code and data used to generate the results. The script-based nature of R (even without its more complicated tools like Sweave) makes this very easy to do. If I don’t like your analysis with/without Bonferroni corrections, I can change one line of code and see if it makes a difference. There is no better way to understand the merits and shortcomings of different approaches than to try them side-by-side, and sharing code and data (and a little more focus on common data structures) can make this easy for authors, reviewers, and readers.
Ok, getting off my soapbox now.
Perhaps the most damning charge made here is that none of these statistical advances have advanced our understanding of ecology. I’d be curious to read what others would suggest in a list of statistical methodology that has most changed our understanding of ecology, but someone has to push back on that. For sake of argument, I’d say you have included at least three excellent examples of statistical methods that have advanced our understanding of ecology on your list already!
Phylogenetic regression (sensu Felsenstein 1985) by itself deserves everything you say, but without it we would never have had the field of phylogenetic comparative methods or community phylogenetics. Reviewers insisting on spatial autocorrelation might be seen as a sign of success of the past three decades or so of theoretical ecology emphasizing the importance of space (sensu Durrett and Levin 1994). The continued absence of this factor in many environmental niche models is probably a fundamental flaw, leading us to believe environmental predictors rather than spatial autocorrelation are responsible for determining species’ spatial distributions (sensu Robert Hijmans 2012). Most of all, Bayesian methods, which can now probably be found in every area of ecology, have many fundamental contributions to their name: they underlie phylogenetic approaches that have fundamentally changed our view of the tree of life, are central to models of learning and behavior, and power our GPS satellites, weather predictions, and internet traffic (Bryson’s history of the Royal Society is perhaps my favorite exposition on Bayes).
Hi Carl – really appreciate your engaging with the ideas at a high level. I agree with most of your points.
Through this discussion, my thinking has evolved to my biggest objection being to a “my way or the highway” or “cook-book only one right way” approach. And in particular how this simplistic thinking often comes from people pushing very sophisticated methods who should know better. Having said that, I still think all else being equal we need to give a strong default bias to simpler methods as being more tested, error free, broadly understood etc. I guess another way to summarize my thinking is that the burden of proof should always be on the person arguing for the more complex approach.
I love your point about how R scripting makes many of these debates a simple empirical question (of course that assumes authors are sharing their data too which still happens too rarely).
And finally, somebody tackled my core question – are all these tools good for the advancement of ecology!
Totally agree that there are many excellent and exciting applications of phylogeny in ecology and evolution right now. Whether we had to go through PIC to get there I don’t know. The Maddisons (and others) were pushing some of the more interesting approaches for analyzing evolution almost as far back as PICs. In which case the uptake of PIC but not the tools to actually ask and answer questions about evolutionary patterns might support my original claim about statistical machismo. And in either case, it doesn’t explain why people are still being pushed to always do PIC/PGLS on simple comparative papers not focused on evolutionary processes.
I agree that spatial ecology has been one of the most important developments of the last 20-30 years. Again though, I’m not sure I agree this development means we had to go into spatial regression models. Where I HOPE we go is similar to phylogenies, where we increasingly directly study spatial autocorrelation – what creates it, what the patterns are, and what this tells us about process. And I very much agree that one of the places we most need to worry about spatial autocorrelation is niche models, and we’re not doing it there. I think if spatial autocorrelation were incorporated in niche models it would be devastating to the field to realize how much of our apparent predictive power is really just spatial autocorrelation.
I agree with you on the Bayesian stuff – plenty of good applications, plenty of bad. But you’d be hard pressed to find any new band wagon where you couldn’t say the same so we shouldn’t judge the whole movement by its bad applications.
Thanks!
Thanks for a great reply, I agree with all of it.
Many of our top journals have recently mandated depositing data in a public repository (http://datadryad.org/jdap) and I agree that such top-down requirements are probably the best way to change the reluctance to share data you mention. I think the next step is to encourage or require depositing the code that reproduces the statistical analyses performed in the paper. Replicating other elements of the research can be more complicated, but providing code that reproduces the statistical results of the paper should be a reasonable place to start.
Mandates aren’t the only way: the journal Biostatistics, which must surely face many of the concerns discussed here, encourages authors to submit data and code, and if they do, the “Reproducibility editor” will add a kitemark to papers for which he/she has been able to reproduce the statistical results (see Peng 2011 10.1126/science.1213847). So it needn’t be a pipe dream 😉
That Journal of Biostatistics policy is an interesting one, I didn’t know about that.
Re: requirements to deposit code as well as data, presumably with a requirement or expectation that referees will run it: how close do you think we are to that becoming routine in ecology? And will it make it more difficult to get referees? I mean, if, as part of refereeing a paper, I have to re-run some kind of massive hierarchical model fitting exercise that will tie up my computer for days or weeks, I’m going to hesitate to agree!
I’ll note in passing that having referees replicate all the stats is kind of old school. I believe that Sewall Wright used to completely redo all of the stats in the papers he reviewed. I’ve heard that he’d often reanalyze the data, not just check to see if he could reproduce the author’s numbers, and that his reanalyses were often incorporated into the final paper. And he was doing all this on those old hand-operated adding machines! Anyone know if these stories are true?
Ecology now actually has a policy to deposit code for any original analyses. I think it is meant for atypical analyses, but, I’ve seen it used more and more. I actually wasn’t aware of it until my last submission was held up until my R code with just linear models and some AIC analyses was uploaded as a supplement. Odd that the data wasn’t required, but it’s being posted anyway at the LTER website. Personally, I’m a big fan of this policy.
(and I do so hope that story is true, Jeremy).
Regarding referees redoing the stats: that makes me think of theoretical papers. In those, I don’t redo all the analyses (e.g., solving for equilibria), unless I think something seems off. (Maybe some people do redo them all?) I think I would do something similar with stats (that is, only go through it myself if something looked awry) — but it certainly would be nice to have the option of doing so in cases where I think something may be wrong.
In terms of depositing code: I usually include it as an appendix on my own, but, in one recent case where I forgot to on the initial submission (to Ecology) a reviewer did ask me to put it in. But, so far, I’ve only had the request for code come from a reviewer.
I agree. In particular, if the data are open from the get-go (or attached to the paper), a referee who is so inclined could fire up the same tool you used, re-run the analysis, and tweak it to see if the results differ….
In particular, I often get papers that inappropriately use normal statistics. But how bad is the violation? Will a simple tweak (that I would suggest) change their entire story? I don’t know unless I can do it myself. I often feel like I’m offering reviewer comments blindly then – and perhaps inappropriately. If, on the other hand, the data and the code were right there, sure, I could run their linear regression as a generalized linear model, maybe produce some different output, and then check whether the conclusions the researcher drew still hold under the corrected analysis. The review I write would be more useful, and the paper would be turned around even faster.
How great would that be?
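As a rough sketch of what that kind of check could look like in R once data and code are attached (the data below are simulated stand-ins, not anyone's real dataset, and the variable names are made up):
set.seed(1)
dat <- data.frame(treatment = gl(2, 30, labels = c("control", "addition")))   # hypothetical two-group design
dat$count <- rpois(60, lambda = ifelse(dat$treatment == "addition", 6, 4))    # count response
fit_lm  <- lm(count ~ treatment, data = dat)                                  # the analysis as submitted
fit_glm <- glm(count ~ treatment, data = dat, family = poisson)               # the reviewer's tweak for count data
summary(fit_lm)$coefficients
summary(fit_glm)$coefficients   # does the ecological story change once the error structure is fixed?
With the data and code in hand, that whole check takes minutes rather than a round of blind review comments.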
My favourite smack-down on Bonferroni corrections is by Thomas Perneger (http://www.bmj.com/content/316/7139/1236.full). Besides being straightforward, it is interesting to see something from outside our field (Ecology & Evolutionary Biology).
Nice! Completely agree with the paper, although I fear we’ll never see our field follow the recommendation and just put out the facts and leave it to our personal judgement to evaluate the results.
(ahem)
Thanks Mike – I see our hobby horses are still racing … 🙂 Glad we have a diversity of opinions on this topic and on this blog. That’s what my post was all about (pro diversity of opinions). Can we at least agree that multiple comparison correction is ultimately a philosophical question, and hence it’s not surprising that thinking people can disagree about it?
(PS – and I’m sure you’ll agree Mike – to anybody who thinks being thoughtful about statistics doesn’t involve diving into philosophy and methods of scientific inference, you couldn’t be more wrong)
Absolutely – it’s crucial to understand the ‘what?’, ‘why?’ and ‘how?’ of statistical methods, which, linking back to your original post, should not result in a ‘one-size-fits-all’, dogmatic approach.
Little has been said about experimental design (which includes analysis of historical data), but I strongly feel that many problems can be avoided with good a priori design coupled with clear a priori statements of hypotheses. These will directly inform the sort of stats that need to (or can) be done, before any experiments or analysis are carried out. I wonder if the root of many of the ‘philosophical’ disagreements touched upon here is the wrong stats being applied to poorly executed (or otherwise good) experiments? Perhaps that’s an oversimplification though.
It’s a great discussion, I’m learning a lot, and broadening my statistical horizons. It’s now getting tricky to follow the different conversations, but I’m chuckling at the interweb ire that is spewing forth from some commenters 🙂
Let’s not get too radical here. A priori hypotheses leading to careful a priori experimental design! I mean, really!
Anybody who knows stats well (probably most of the commenters on this blog) has been asked to come up with magical stats to save a hopeless situation from lack of forethought.
(Hope the sarcasm dripping off this post drips in the direction intended and not towards you Mike, because I am 100% agreeing with you)
I think that calls for a little Fisher – “To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of.” Substitute “consult a statistician” with “think about how to statistically analyze your experiment,” and you have an all too common occurrence.
Brian:
“(PS – and I’m sure you’ll agree Mike – to anybody who thinks being thoughtful about statistics doesn’t involve diving into philosophy and methods of scientific inference, you couldn’t be more wrong)”
I for one completely agree with this. Here’s another great one from the quotable Andrew Gelman on this topic:
“Philosophy matters to practitioners because they use philosophy to guide their practice; even those who believe themselves quite exempt from any philosophical influences are usually the slaves of some defunct methodologist.” http://www.stat.columbia.edu/%7Egelman/research/unpublished/philosophy.pdf
Sorry if this is a bit tangential and brings the discussion back towards a touchy subject, but after reading the comments, and in particular the Frequentist–Bayesian dogfight, I was curious to hear what readers, and both Brian and Jeremy, think about why these sorts of discussions inevitably become Frequentist vs. Bayesian, and in particular why the Likelihood paradigm seems to get short shrift.
It seems to me that very few debaters acknowledge or think about how the likelihood paradigm dominates the massive grey area between Frequentist and Bayesian statistical logics sensu stricto. In particular, although often embedded in the Fisherian significance testing approach (i.e. P-values), the logic behind likelihood (L(theta|data)) is awfully Bayesian, and the choice of distribution represents the incorporation of prior information into the analysis. ANY Frequentist who has implemented a Generalized Linear Model with a non-normal error distribution has bent both of these Frequentist rules. Heck, any frequentist who has used a Maximum Likelihood estimation algorithm to estimate a parameter has bent these rules (normal error distribution or not). On the other side, some Bayesians open Pandora’s Box, so to speak, by dividing the likelihood by the normalizing constant to get back to a density function (with all the pros and cons associated with that), but as Jeremy has mentioned in citing Gelman, without strong priors and with an emphasis on posterior predictive checks, this is awfully error-statistical… and if these Bayesians are mostly interested in the mean of the parameter density functions returned by their MCMC rather than the WHOLE density function, they are acting awfully Likelihood-ist.
Perhaps I don’t fully grasp the earth-shaking ramifications of adopting a Frequentist vs. Bayesian definition of probability… but the more I think and read about this stuff, the more I wonder why self-proclaimed ‘Frequentists’ and ‘Bayesians’, and particularly the vociferous debaters, won’t acknowledge that to varying degrees they are all Likelihoodists. Anyway, would be interested to hear thoughts from the round table.
Thanks so much for this comment.
Actually I 100% agree with you. I haven’t weighed in on the Bayesian vs Frequentist debate because it doesn’t interest me that much (my inclusion of Bayesian on my list was much more practical and related to choosing complex approaches when simple ones would do). I would call myself a likelihood person and as such see myself able to move towards frequentist or Bayesian as needed/useful/pragmatic. I can think of several really good ecological stats books written by practicing ecologists who also happen to be good at stats (Hilborn & Mangel’s Ecological Detective or Ben Bolker’s book) that would probably describe themselves as in this camp too. When I teach stats I say there are four camps: Bayesian, frequentist, likelihood and Monte Carlo, and that I am a likelihooder first, Monte Carlo second, and all four third.
While I see the Bayesian point that you can’t strictly describe probability as the % of time an event occurs if you have a one-time only event, and I can for sure see Jeremy’s point that probability is not just subjective, I just can’t worry about it that much. I know what probability means and I’d rather calculate it than argue about what it means. And likelihood is the generic tool that gets me there.
Thanks for a constructive and potentially consensus building post.
Let us likelihooders unite and tell the Bayesians and frequentists to quiet down. We’re trying to get some work done over here 😉
I’m sure Jeremy has a different point of view and will weigh in …
Very minor quibble: I’d put Monte Carlo in the frequentist camp. The rationale is basically the same as for bootstrapping and randomization tests, which are frequentist procedures.
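For what it's worth, here is a tiny sketch (made-up data) of why a randomization test is frequentist in spirit: the p-value is just the frequency of differences at least as extreme as the observed one under repeated reshuffling of the group labels.
set.seed(1)
x <- rnorm(20, mean = 0.5); y <- rnorm(20, mean = 0)   # two made-up samples
obs <- mean(x) - mean(y)                               # observed difference
pooled <- c(x, y)
perm <- replicate(9999, {
  idx <- sample(length(pooled), length(x))             # reshuffle group membership
  mean(pooled[idx]) - mean(pooled[-idx])
})
mean(abs(perm) >= abs(obs))                            # two-sided permutation p-value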
Yeah, I think you know I don’t buy “I know what probability means and I’d rather calculate it than argue about what it means”. “I can’t define it but I know it when I see it” may work for pornography, but I’m not so sure it works for probability. 😉 But having said that, in practice (yes, I do have a practical side, believe it or not), what ultimately matters is your scientific conclusions, not your statistical ones. I suspect that even Gelman’s quite principled hybrid approach requires him to sometimes fudge the issue of what he means by “probability” (I’m not sure about this). But in the end, if his procedures do help him identify and eliminate substantive errors about how the world is or isn’t, then I’m ok with it. What I’m not ok with is claims by thoroughgoing subjective Bayesians that there’s no such thing as a substantive error at all, all there is is coherent or incoherent updating of one’s personal beliefs. This is what Brian Dennis was complaining about in his 1996 paper when he said “Being [subjectivist] Bayesian means never having to say you’re wrong.”
I agree that at this point it’s probably best to draw a line under the Bayesian vs. frequentist debate on this thread. Interested readers looking for further reading on Bayesian vs. frequentist stats should consult the sources I compiled in this old post.
Yes, both Bayesians and frequentists care a lot about likelihoods. But they tend to care about them for somewhat different reasons. Bayesians of certain stripes will insist that the likelihood embodies all the information in the observed data that is relevant to any inference one wants to make. This is a very rough statement of what’s known as the “likelihood principle”. Frequentist statistics violates the likelihood principle, which frequentists see as a virtue and certain sorts of Bayesians see as a vice.
One standard example is in disagreements over whether using a “try and try again” sampling procedure (i.e. “I’m going to keep sampling until the data come out the way I want them to”) should affect one’s inferences. Because the likelihood of the hypothesis, given the observed data, doesn’t depend on the sampling procedure (remember, it’s the likelihood of the hypothesis, taking the data you happened to observe as given), Bayesians of a certain stripe don’t care about the sampling procedure. In contrast, it’s the essence of frequentist statistics to care about data you might have sampled, but didn’t. Did you see a real effect in the data, or are the apparent patterns really just down to dumb luck (e.g., random sampling error)? Because your sampling procedure affects what sort of data you might get (especially if your procedure is “sample until I happen to get the answer I want”), frequentists account for the sampling procedure in their inferences, even though this violates the likelihood principle.
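A quick simulation makes the optional-stopping point concrete. This is only a sketch, assuming normal data, a true null, and a rule of "check after every 10 new observations, stop as soon as p < 0.05, give up at n = 200":
set.seed(1)
try_try_again <- function(max_n = 200, step = 10) {
  x <- numeric(0)
  while (length(x) < max_n) {
    x <- c(x, rnorm(step))                    # the null really is true here
    if (t.test(x)$p.value < 0.05) return(TRUE)
  }
  FALSE
}
mean(replicate(1000, try_try_again()))        # "significant" far more often than the nominal 5%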
Note that there is a third, not very popular, school of statistics known as “likelihoodists”, who believe that likelihoods are all you need. No priors, and no error probabilities either. I believe there is a chapter in the Taper and Lele volume on likelihoodists.
So no, agreement that “likelihoods are important” isn’t enough to settle the philosophical differences between frequentists and Bayesians (and between both and likelihoodists). If those differences are to be bridged (and I think that’s happening, slowly), it’ll be via the need for new approaches to deal with new scientific problems (e.g., inferential challenges posed by “Big Data”), and by applied practitioners who can’t afford to merely be right as a matter of philosophical principle, but who need to be right in terms of actually helping people do the real-world things they want to do. Accurately and precisely predict future data, predict the outcome of interventions in the world (planned and unplanned), etc. That practical experience is very much where Andrew Gelman started from in building up his own (quite principled, not ad hoc) hybrid of Bayesian and frequentist approaches.
Thanks Brian, it’s good to know I’m not on my own, or completely missing the point on this one. I stubbornly try and worry about the philosophical side of things, mostly so I know what people are saying (or trying to say) and can offer sound reasoning for doing what I do.
Jeremy – I found your second paragraph VERY helpful for understanding this basic difference in thinking about sampling procedure and sampling error. I think I sometimes forget these implications of the two approaches. I find it hard to imagine making inference about real world events with a straight face without somehow taking my sampling procedure into account in my analysis… I guess this makes me a Frequentist… but hopefully a Frequentist with my tent firmly staked down in the grey world of likelihood, with all the bent rules that that implies.
Re: the Taper & Lele volume – I’m slowly making my way through… have to get through Mayo’s chapter before I get to the Likelihoodists.
Now, as a diligent likelihooder, back to quietly getting some work done…
Glad you found my very brief comments helpful, thanks!
To address your comments that I’m not sure Jeremy touched on yet:
The likelihood L(theta|data) is not awfully Bayesian (and I really loathe that way of writing it, since it makes it look like you’re just talking about the probability of theta given the data). It is the probability of observing the data given a model and a specific parameter value. This can easily be understood as a statement about frequencies of specific data sets, if we assume our model is true.
The other thing is: frequentists generally view Maximum Likelihood as one tool for estimating parameters, and one with known issues in many cases that lead to other estimators being used (penalized max likelihood is one of them). Choosing a likelihood is, yes, a use of prior information in terms of what the user thinks may be a good model for the data, but one of the key ideas of frequentism is that, once you’ve chosen a model, your prior beliefs shouldn’t affect your understanding of how well-tested that model is. If I use a Poisson regression, then test for over-dispersion, it still shouldn’t matter how strongly I believe it’s really Poisson distributed when I test for the probability of a given amount of overdispersion. Frequentist tests are based around the properties of the sampling distribution, not the likelihood.
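A minimal sketch of that attitude in practice (simulated counts that are secretly overdispersed, so the numbers are illustrative only): choose the Poisson model, then let the data say how well it holds up, however strongly you believed in it.
set.seed(1)
x <- rnorm(200)
y <- rnbinom(200, mu = exp(0.5 + 0.8 * x), size = 1.5)        # counts with more variance than Poisson allows
fit <- glm(y ~ x, family = poisson)
sum(residuals(fit, type = "pearson")^2) / df.residual(fit)     # dispersion statistic; values well above 1 flag overdispersion
pchisq(deviance(fit), df.residual(fit), lower.tail = FALSE)    # rough goodness-of-fit p-value for the Poisson assumption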
Given that Likelihoodists make statements to the effect that the only way evidence can enter into an analysis is through the likelihood, something that neither frequentists or several forms of Bayesian agree with, I’m failing to see how we’re ‘all likelihoodists’. Just because we all use likelihood as a tool doesn’t mean we’re all actually part of that camp.
As I see it, the practical problem with being a likelihoodist is that (in most practical problems) you need help from either Bayesian ideas or frequentist ones (or both). Likelihoodist stuff works well when there is a single parameter of interest and no nuisance parameters (i.e. parameters that you’re not interested in but are necessary to get your model to fit the data properly). In these simple cases, just make a likelihood interval and things work great (see the interesting work of Royall for many examples). But in multiple parameter settings, things get hard. The frequentist option is to forget about the likelihood principle and calculate a bunch of confidence intervals / p-values (as Jeremy discussed above) and the Bayesian option is to add a prior and integrate over the nuisance parameters to get marginal posterior summaries. If you don’t do one of these things then making practical summaries of your analysis is very difficult.
These practical problems have always suggested to me that the purely likelihoodist approach probably won’t catch on. But then again, I also don’t think that the purely Bayesian or purely frequentist approaches are the way out either (as I discussed in an earlier response on this post).
It seems everything has been said, just not by everyone, so in the hope that some people still have energy I would like to add/rephrase a few points that have been raised so far
In general, I agree – there is sometimes a tendency to use overcomplicated and allegedly sexy methods in ecology. However, I have my doubts about the generality of many of the examples in this post.
In particular, several of the examples seem to rest on the underlying notion that we don’t need corrections that affect significance or CIs if effect sizes are not changed, and if effects are clearly significant no matter what. Two comments on that:
* First of all, it is not so clear to me that effect sizes / point estimates are really untouched for some of the examples, in particular when you have unaccounted spatial or phylogenetic correlations in your data. The fact that some studies found no effect of such correlations on point estimates doesn’t make that a general rule, and there are counterexamples; see e.g. http://onlinelibrary.wiley.com/doi/10.1111/j.1365-2699.2012.02707.x/abstract for a number of references about RSA (residual spatial autocorrelation) affecting point estimates.
* Even if point estimates were unbiased, there are a number of occasions where the exact value of significance levels/CIs matters even below the 0.05 level. Meta-analysis was already mentioned, where estimates are basically weighted by their confidence; inclusion as priors in a Bayesian analysis is another occasion. Hence, if other people want to use your results, it may matter to them whether the effect is at p=0.01 or p=0.001, and at any rate, I think if you give me a number I should be able to rely on that number being correct. So, I think in cases where you don’t correct correlations that obviously change p-values or CIs, one should at least note this, or rather write p<0.05 instead of p=0.001, to avoid wrong values being added to meta-analyses and the like.
And I can't resist adding a few comments on Bayes, which actually relates more to the replies below this post:
– The fact that Bayes seems more complicated is imo only because it is currently less well supported by statistics software and you have to do a lot of things by hand (sampling, convergence check, etc.) that are of rather technical nature. Other than that, it is conceptually not more complicated than its alternatives (after all, it's what people came up with first, before NHST and MLE).
– Saying that P(M|D) ~ P(D|M) except for the prior is missing the most important point about Bayes, namely that P(M|D) is interpreted as a pdf, and credible intervals are constructed through integrals over this pdf, while MLE tests P(D|M) pointwise to arrive at confidence intervals. Fisher gives the arguments for the latter in his seminal 1922 paper. So, Bayesian CIs and MLE CIs are reporting fundamentally different things, and therefore it's not just about the fancier method, but also about which of those two things you are interested in.
– I don't agree that Bayes is more subjective because of the priors, or that priors are subjective. The discussion about subjective and objective Bayes seems a very academic one to me. Priors typically relate to existing prior information, e.g. from other measurements or previous studies (e.g. http://onlinelibrary.wiley.com/doi/10.1111/j.1365-2699.2012.02745.x/abstract). A famous example that is often given in introductory Bayes courses is an HIV test with a certain false positive rate. If you don’t include the prior information that most people don’t have HIV (which is, I hope you will agree, not subjective at all), you arrive at a false probability that someone who tests positive actually has the disease (a two-line version of this calculation is sketched after this list). So, for most practical cases I would say that if you call informative priors subjective, you’re in a way saying all previous scientific results are subjective, and only your current analysis isn’t. Uninformative priors are a different thing, but also here I don't see why other inferential modes would be more objective.
– Posterior predictive checks are mostly checking for model errors and not for prior errors. I don’t see why you would modify your priors unless you found out that you made a blunt error after a posterior predictive check.
In general, I don't see that the discussion about which approach (Bayes, NHST) is better has been very productive in the past, and I think it's really time to arrive at some "statistical pragmatism", as Kass calls it in http://projecteuclid.org/DPubS?service=UI&version=1.0&verb=Display&handle=euclid.ss/1307626554 , but as correct as possible of course.
Finally, I’m not sure about the detection models and repeated measurements. I think accounting for sampling/observation procedure can be very important, and it makes the inference much easier if you have some repeated measurements. You may be right that it's unnecessary in your specific case, but in general I think doing measurements that allow you to assess your measurement/detection error is good practice and will help analyzing your data.
Hi Florian,
Brian and I are kind of glad this thread is winding down, we need our lives back! So sorry, I’m not going to reply at length, despite the fact that your comment is a very good one. Don’t know if Brian will reply, but don’t be surprised if he doesn’t either. I think we and the other commenters have pretty much said all we have to say on the points you raise. So insofar as you disagree with other folks, we’ll probably all just have to agree to disagree. You need to get in one of the first 100 comments if you want a lengthy reply! 😉
One very minor quibble on a small point not addressed in previous comments: whether or not a prior is “informative”, and whether or not it is interpreted as a measure of someone’s subjective personal degree of belief, are two different things and they’re independent of one another. You can be a subjectivist Bayesian and use uninformative, or informative, or weakly informative, priors. And you can be some sort of non-subjectivist Bayesian and use uninformative, or informative, or weakly informative, priors.
Hi Jeremy,
fair enough, I figured you guys must be exhausted.
I know that you were referring to the subjective/objective debate, I just wanted to point out that this is a very philosophical issue with little bearing on practical Bayesian applications in Ecology and therefore a straw man for criticizing Bayes – rereading your comments, however, I realize that you said the same thing already, sorry about that.
Oh, don’t misread my comments–I think it’s a philosophical issue, absolutely, but I also think it has super-strong practical applications. If you’re a subjectivist like Jane, you’re going to do very different Bayesian analyses, and interpret them very differently, than even an “error statistical” objective Bayesian like Andrew, never mind a frequentist!
Hi Florian,
Thanks for your thoughts and especially for providing detailed support for your thoughts so we can discuss them.
I am not going to reignite the Bayesian debate. To me (as you & Jeremy decided in a separate thread), while interesting, it doesn’t really affect how I do my statistics in this case (unlike some other cases we’ve highlighted here like multiple tests).
On spatial regression (actually you’re the first one to address spatial regression, per se, so thanks for bringing it up) a couple things:
1) I read the paper by Kuhn & Dormann when it came out. It sure wants to leave you with the impression that coefficient estimates are biased but when you look at their actual evidence:
a) That they can’t find anywhere in Cressie that says GLS isn’t biased – this is a weird double negative – don’t you think Cressie would have said somewhere in his gigantic tome if it were biased, and that Kuhn & Dormann would be quoting that?
b) They have a quote from Dutilleul saying the estimates are biased, but if you actually read further into Dutilleul’s paper the *ONLY* bias Dutilleul talks about is – and I’m quoting the paper – “a bias due to the underestimation of the standard error of the product moment correlation after Fisher’s transformation, and that this bias leads to narrower confidence intervals than were in fact justified by the amount of information carried out by the observations. This is nothing but an intuitive predefinition of the concept of effective sample size.” This is exactly the p-value issue I have talked about (and which Hawkins, whom Kuhn & Dormann are attacking, acknowledged). Nowhere does it say that the estimate of the slope or intercept is biased in the statistical sense (i.e. expected value different from true value). Did they not bother to read the rest of the paper?
c) they cite the well-known Dormann et al 2007 paper as demonstrating bias, but if you look at their Figure 1/Table 2 it is clear that the non-spatial model is right in the middle of the performance of all the different spatial models. And in all cases the non-spatial (i.e. GLM) model is within 0.5 SE of the true value. There is no bias demonstrated, just noise in estimates.
d) citing Beale et al, which has a sentence that is highly conditional and in fact starts “If the regression coefficients are close to zero then”, and goes on to say that by a weird artefact the OLS regression coefficient will be slightly larger. A) This is not at all the same thing as saying that in general OLS is biased and GLS is unbiased. B) More importantly, if the regression coefficients are close to zero then I’m probably not doing a regression or drawing significant conclusions. This is in no way a general demonstration of bias.
e) I didn’t follow their Ansel citations because they’re all in econometrics books I can’t access easily, and their interpretation of the citations I did follow doesn’t make me think it is worth my time.
In the phylogeny example I cited papers that actually clearly state that the estimates of coefficients are unbiased, and people still told me they were biased (and since PGLS is just GLS with a slightly different structure to the covariance matrix, this generalizes to spatial autocorrelation). All that is clear to me is that people really WANT the uncorrected estimates to be biased. I am officially done refuting this point!
2) I am confused by this argument that confidence intervals (or p-values) are an innate property of the natural world that must be gotten “right”. Confidence intervals are a property of effect size AND sample size/design. If you care about future meta-analyses then you should report corrected p-values (same as I’ve said from the beginning). But there is no such thing as the one right confidence interval, and I can never get the “right” confidence interval. I can put more effort in and get a more informative (tighter) confidence interval. But a right confidence interval? Nope. And just as I’m willing to walk away from more sampling at some point, I am willing to walk away from more statistics at some point.
I repeat: probably 90% of all papers published in quality ecology journals violate at least one assumption of the statistical approaches they use. Just for a start, care to guess how many papers use normal-based statistics that would fail a test for normality? And properly so in most cases. This is NOT NOT NOT a problem that should be fixed. Doing good science, coming up with good ecological questions, testing those questions cleverly and interpreting them *ecologically* are all way harder and way more important than the statistics. And if somebody has done that well, we shouldn’t throw it away because of a minor violation of a statistical assumption that is inconsequential. All models are approximations to nature, and that includes statistical models. So we should treat statistical assumptions approximately, not absolutely. In all the cases we’re debating here, we know when we can bend the assumptions a little and when it’s a big problem, so don’t tell me I have to perfectly match the assumptions.
So, I return to the original point of my post. Can anybody demonstrate a case where our ECOLOGICAL understanding is better off because of autocorrelation correction techniques, or where our ECOLOGICAL understanding was really wrong because of a failure to use autocorrelation correction techniques*? This is all I really care about. I don’t believe it exists, and 100 comments into this thread nobody has even really tried. And that is all I really care about!
*NB – I am not talking about directly studying autocorrelation via variograms, variance partitioning among taxonomic levels, etc. – I am talking about techniques like GLS/PGLS or PIC whose primary function is to correct for the covariance in errors.
Hi Brian,
no disagreement about the fact that applying statistics is not about getting all assumptions right, but about making errors that are small enough to get reliable results.
It also seems to me that we agree on the fact that different types of correlation can lead to smaller p-values and narrower CIs than if corrected, effectively changing the number of data points you have. The question we are arguing about is thus only whether this matters.
Ultimately, this seems to boil down to a matter of taste and question – for some things, it really does, for others it doesn’t. For example, it matters for any application that pools your results with other results. I would go so far as saying that a Bayesian inference is incorrect if either your prior was generated with a method that didn’t correct RSA, or if your Bayesian analysis uses an informative prior together with a likelihood that exhibits uncorrected RSA, because then the effective number of data points affects the inferential weight of the likelihood in relation to the prior. In this case, parameter estimates and predictions could clearly change. It might also matter in a non-Bayesian analysis when you combine different data types of which some show residual correlations and others don’t. Moreover, it generally affects your estimates of predictive uncertainty, which I find also an important issue considering that SDMs are frequently used as predictive tools, and it might affect model selection methods.
Of course, you might say it doesn’t matter for what you are interested in. Fair enough. I wonder, though, if we are really in a situation where there are more people that unnecessarily correct RSA than those that inappropriately don’t.
As for your challenge to name some examples that show that these things matter for ecology: I wonder what you think about the examples given in Jennifer A. Hoeting (2009) The importance of accounting for spatial and temporal correlation in analyses of ecological data. Ecological Applications, 19, 574-577. http://www.esajournals.org/doi/abs/10.1890/08-0836.1 Most of those are not so much about methods to absorb RSA than to explore the reasons for RSA, but still, it shows that caring about it can make a difference.
Your comments about the question whether RSA likely leads to biased point estimates or not are interesting. I think I have to dig more into this issue to better form an opinion. I see no reason why it shouldn’t be possible to construct a situation where RSA leads to bias, but it might well be that this isn’t very likely.
Hi Florian
As you say we agree upon much. Thanks for the Hoeting reference. It is actually a nice clear and very precise summary of the issues. In particular it confirms that:
a) your sample size is really smaller than it looks*, and so things depending on sample size, i.e. estimates of uncertainty (namely p-values and confidence intervals), are bigger than calculated by traditional OLS. In practice I find that they’re usually only a tiny bit bigger. E.g. p-values most often go up 10-50%, maybe 100%; p=0.01 will not usually go over p=0.05, but p=0.03 often can. Certainly p=0.001 I can’t imagine going over 0.05. Thus this issue is really at the margin for p-values.
b) Hoeting says clearly “the parameter estimates will be unbiased” – my exact claim all along about this. This does not mean you cannot cook up an example where OLS looks bad. But you can cook up an example where GLS looks bad too. The statement is a statistical statement about what happens on average. It also means that, since your estimates are approximate unless you have an infinite sample size, OLS and GLS will give different answers in any given study. But there is no way to tell which one is better. On average OLS and GLS both get the right answer for a point estimate.
Now on point #a you seem more interested in confidence intervals than p-values, which I respect; I also think CIs are the more interesting of the two. But about this, I can only say that if my primary concern is to leave an accurate CI for future meta-analysis, then yes, I had better do GLS (or other spatial methods). More often, most people’s primary concern is to see if there is statistical significance in their p-values (either whole model or individual coefficient), in which case if their p-value is borderline they had better do GLS, but otherwise it won’t change the story. Now, if my primary concern is not to leave an accurate CI for future meta-analysis, should I do a GLS to help them out anyway? It would be nice. But it’s not mandatory. And it depends a lot on how much work it is for me. In the case of phylogenetic correction when there is no phylogeny and even no sequencing of the species at hand, it is really a lot of work. In spatial analysis, it is a modest amount of work. To download a spatial regression package in R and push a button to get a spatial regression is trivial. To pick the right type of spatial regression and calibrate it appropriately is a bit harder and requires some knowledge. Even in spatial GLS I have to decide if my spatial autocorrelation is exponential, spherical, or what. Is the error introduced by ignoring spatial autocorrelation bigger than the error introduced by having a novice try to measure autocorrelation? Hard to say.
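To make that last choice concrete, here is a minimal sketch of one pragmatic way to pick between candidate correlation structures: fit both and compare by AIC. The data are simulated on a small grid with an exponential spatial covariance (range and sample size are arbitrary choices), so this is an illustration, not a recipe.
library(MASS); library(nlme)
set.seed(1)
d <- expand.grid(x = 1:15, y = 1:15)                       # a small grid of locations
D <- as.matrix(dist(d[, c("x", "y")]))                     # pairwise distances
Sigma <- exp(-D / 3)                                       # exponential spatial covariance, range ~3
d$pred <- rnorm(nrow(d))
d$resp <- 0.5 * d$pred + mvrnorm(1, rep(0, nrow(d)), Sigma)
m_exp <- gls(resp ~ pred, data = d, correlation = corExp(form = ~ x + y, nugget = TRUE))
m_sph <- gls(resp ~ pred, data = d, correlation = corSpher(form = ~ x + y, nugget = TRUE))
AIC(m_exp, m_sph)                                          # lower AIC points to the better-fitting structure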
But having a conversation about leaving a better CI for the future vs. how much work I have to do is a really valid conversation, and I wouldn’t mind having it with a reviewer. But telling me I have to do spatial autocorrelation (or worse, phylogenetic correction, because it is so much more work) because “you’re supposed to and it’s really bad if you don’t” (something you haven’t said but many people have said to me) is stupid.
Thanks for the elevated conversation!
* This idea that the core effect of autocorrelation is to make the sample size smaller is rather intuitive. Imagine I have 10 points. If they are placed far enough apart from each other, then I truly have 10 points, because the errors at each point are uncorrelated with the errors at any other point. Now, I take two points and move them towards each other. As I start to get into the range where those two points are close, they start to become correlated and are worth a bit less than two points, so maybe I have 9.5 points now. When those two points are on top of each other, they are perfectly correlated or redundant, and I really now only have 9 points. More realistically, as I start to move all 10 points closer to each other so they start to become correlated with each other, I start to have less than 10 points’ worth of data, maybe really 7.3 points. Thus my degrees of freedom are overestimated, the p-value I should be reporting is bigger than the one I calculated, and the CI I calculated is a bit too narrow. This is not mysterious. And as I think maybe you can see intuitively, it doesn’t do anything that changes the slope of the regression. That is all that is going on.
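For anyone who wants to see both halves of that claim (unbiased slope, but overconfident p-values) in one quick simulation, here is a sketch using a time-series analogue for simplicity: two unrelated AR(1) series, so both the predictor and the errors are autocorrelated (rho = 0.8 is an arbitrary choice).
set.seed(42)
nsim <- 2000; n <- 100; rho <- 0.8
slopes <- pvals <- numeric(nsim)
for (i in 1:nsim) {
  x <- as.numeric(arima.sim(list(ar = rho), n))   # autocorrelated predictor
  y <- as.numeric(arima.sim(list(ar = rho), n))   # autocorrelated response, unrelated to x (true slope = 0)
  fit <- summary(lm(y ~ x))$coefficients
  slopes[i] <- fit[2, 1]
  pvals[i]  <- fit[2, 4]
}
mean(slopes)         # ~0: ignoring the autocorrelation does not bias the slope estimate
mean(pvals < 0.05)   # well above 0.05: the nominal degrees of freedom are too generous
# one classical rough correction: an effective sample size of about n * (1 - rho^2) / (1 + rho^2) here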
About the question of CIs: I would be perfectly happy as a reviewer/reader if we could agree on attaching a star or cross to “correlated” p-values or confidence intervals, together with the statement that those are uncorrected for correlation and therefore potentially overconfident about the uncertainty. Otherwise, I can’t get over my feeling that it isn’t right to write down a number that you know is in some sense “wrong”. Unsure, however, whether starting to do so is a survival trait for getting papers through review.
Just for the record: still, there are a number of occasions where CIs matter, basically any type of analysis where you compare things, including meta-analysis, Bayesian statistics (particularly when used with informative priors), and any type of AIC-based model selection. I’m sure a lot of people are doing e.g. model selection under RSA, so I’m a bit afraid that when broadcasting “RSA sometimes/often doesn’t matter”, the “sometimes/often” might not get across to the receiver.
I’m still struggling with the point estimate bias question … true, Hoeting states this as well, but I’m somehow reluctant to accept this as a general rule without having some sense of when and why this is the case. As you agreed, it seems pretty likely that one can construct examples where things go wrong, but the question is really how obscure these examples have to be and how relevant they are for ecological applications. Would like to read/see more on that.
OK, as this post winds down (although I am still happy to field *substantive* comments that address the main questions), I thought it would be useful to provide a summary for all those who don’t want to read 100+ comments. Here it is:
1) There seemed to be a broad consensus that a diversity of statistical approaches was good, a cook book approach was bad, and a cook book approach from people using sophisticated statistics was disappointing
2) There was a broad consensus that philosophy is an important part of being a good statistician
3) We once again proved that if you mention the word Bayesian in a public setting, there will be a flame war (a good bit more civil here than most that I’ve seen though)
3b) Although we got into a not-unexpected debate about the philosophy of Bayesian vs. Frequentist, we didn’t really get any agreement (not surprisingly). Some were happy with likelihood in the middle, but neither the Bayesian nor the Frequentist partisans were happy with this.
4) Although not a major theme, several people seemed to think tools that pulled out the spatial/temporal/phylogenetic correlation for study and analysis were more interesting than ones that corrected it out (PGLS, spatial regression, etc.).*
5) We all really liked Carl’s idea of having the R code and the data available (to reviewers and readers) so that people could quickly check out for themselves how much these different approaches mattered
To my four specific complaints:
6) Multiple comparisons – most people agree that they are necessary sometimes, most people agree that there are times they can be safely ignored. Many people (but not all) preferred methods other than the most commonly recommended Bonferroni. We agreed that this is a case in point that philosophy is innate to the discussion
7) Phylogenetic corrections – we had one very polite defender of PGLS (the most common modern approach to phylogenetically correcting non-independent data) but we didn’t seem able to get into a technical discussion, and I didn’t feel like my points were addressed (he probably didn’t either). This defender pretty much refused to acknowledge that there was ever a time when one didn’t need PGLS, which is what I really wanted to hear. Everybody else pretty much agreed there are times it is unnecessary.
8) Spatial regression – hardly discussed
7&8) On both spatial & phylogenetic regression, three people suggested that estimates are biased without correcting for non-independence of errors. I followed up all the papers given and there was nothing in any of them. My original statement remains true: the chance of Type I error goes up a bit (p-values should be a bit higher, or if you prefer, confidence intervals a bit wider, due to an effective sample size that is smaller than the actual sample size), BUT the coefficient estimates are unbiased in the statistical sense – the expected value of the estimate is the true value.**
9) Detection error – similar to phylogeny – we had one pop-in commenter who rather roundly asserted that the original post was unreasonable but never really explained why and never stayed to engage in a discussion (but seemed to imply that there were scenarios where detection error wouldn’t change your results)
10) Bayesian methods – most agreed that some Bayesian methods were overkill and simpler methods should be used but that there were many valuable tools in the Bayesian world (whatever one’s philosophical interpretation).
I’ll throw out one completely new idea. Taxonomists talk about lumpers vs splitters. I do actually think this is an innate axis of personality variability. I think we might have the same phenomenon here. To me, when I hear confidence interval, credible interval, likelihood interval, I hear just “interval” and only secondarily (and usually only when forced to) focus on the differences (which is not the same as saying I don’t know they exist). I lump these ideas. But we clearly have some splitters who care intensely about how these are different. Just like taxonomists, we need both types and a healthy tension.
Finally, I return to the original point of my post. Violation of statistical assumptions to me is secondary to advancement of ecology. Can anybody demonstrate a case where our ECOLOGICAL understanding is better off because of autocorrelation correction techniques, or where our ECOLOGICAL understanding was really wrong because of a failure to use autocorrelation correction techniques (or detection error in within-species macroecological analyses, or multiple-comparison correction in certain scenarios, or fancy Bayesian models when a simple alternative exists)? This is all I really care about. I don’t believe it exists and 100 comments into this thread nobody has even really tried.
I sincerely doubt this post has changed the dynamic of how these topics are treated in the review process. But hopefully it’s at least caused a few people (especially editors) to stop and think.
* I’m aware you do get some information on the structure of autocorrelation out of PGLS if you take the time to focus on that aspect of the output but most don’t
**Some examples have been given suggesting that GLS may be more efficient (it converges on the true answer at smaller sample sizes), but I know of no general demonstration of this, and in most simulations it appears to be a relatively small effect. And in any case a slightly more efficient estimator has the same effect as collecting more data – it is not a violation of assumptions or a must-do, especially if one already has a significant result, in which case more efficiency is overkill. Similarly, OLS & GLS estimates will be different in any specific analysis, but there is no reason to pick one over the other as the better estimate in any particular case.
One thing I’m mildly surprised hasn’t come up in this thread, given the savvy of the readers and the proportion of comments about R: the p-value/degrees-of-freedom issue in lme4/mixed models (and related AIC issues). I just thought I’d lob that one out there, as there is no other issue that makes me so constantly remember that Statistics is an evolving science, same as us.
Thanks for this interesting article – and for the summary of comments at the end.
I agree with some, but not all of what is said here. Firstly, as a co-author of Beale et al (by which I assume you do mean http://onlinelibrary.wiley.com/doi/10.1111/j.1461-0248.2009.01422.x/abstract ) I of course concur with the point about OLS regression being unbiased in the presence of spatial autocorrelation. What our paper says is that in the presence of strong autocorrelation, you might get the wrong inference with OLS because the variances of parameter estimates will be wrong. That is incontrovertible, whether or not the ecological importance of the analysis justifies the extra work. That extra work, to be fair, is becoming less onerous as statistical software improves.
There are good reasons for choosing against a Bayesian analysis, but deciding it is “fancy” or “macho” is not one of them. I can fit a Bayesian hierarchical model almost as simply as a frequentist version; I tend not to, without good reason, but there are benefits (e.g. dealing with missing data, or wanting to work with functions of parameters post-hoc).
I agree wholeheartedly with the prevailing view against “cook book” approaches. I get annoyed about being told to use AIC when it is not appropriate, or being told my statistical analysis is “out of date” because we dared to use a form of stepwise (a tool oft misused, but with valid uses).
I suspect statisticians and ecologists need to learn more from each other; what one group see as “fancy” the other may see as somewhat routine, so this shouldn’t detract either way from interesting science. I want to see papers using ecological examples in statistical journals which stress the importance of the ecology, and papers in ecological journals highlighting the importance of the statistics; in fairness, I suspect I see far more of the latter than of the former. [Some stinkers too, of course…]
Hi Mark, We agree on most things including what does (point estimate) and doesn’t (variance of estimate) work about using an OLS estimator in a GLS world. And I would agree that for spatial regression (or temporal) the amount of extra work required to do GLS is getting small (the amount of knowledge required to use this technique properly is a bit bigger and my main concern in this case but it is probably also disappearing over time). However in the phylogenetic case, it can be substantial work to go obtain a phylogeny when you don’t have one (indeed 10 years ago it was so much work it was considered worth a PhD thesis). I agree avoiding Bayesian just because it is fancy doesn’t make sense, except in that being fancy means you are losing much of your audience if you are publishing in an ecology journal. I see this as a real concern that in my opinion is one factor to consider on which method to use.
Thanks for stopping by! Nice to have a real statistician contributing.
Sorry to drag up a dead post! Can you provide a reference that makes the case that, in the face of autocorrelation, OLS will converge on the true parameter estimate, but not the variance? I have suspected this is true, but there is no reference given in this post, or anywhere I can find.
Google my post a couple of days later on “why OLS is an unbiased estimator of GLS” or the link is here https://dynamicecology.wordpress.com/2012/09/24/why-ols-estimator-is-an-unbiased-estimator-for-gls/
Pingback: Questions and resources about structural equation models | Dynamic Ecology
Pingback: Why I love a good argument (and what makes an argument good) | Dynamic Ecology
Pingback: Why OLS estimator is an unbiased estimator for GLS | Dynamic Ecology
Pingback: On the other hand » Archive » The grass is always greener
Nice post! I fully agree that many ecologists like to show off and use statistical analyses that are just too complicated for the predictions raised. A “must do” that really bothers me is model selection. Come on, some people, with just a half dozen variables, leave the work of building a model to the computer! Wouldn’t it make much more sense instead to use accumulated biological knowledge to build a thoughtful model a priori, test it against real-world data, and interpret the negatives and positives under the light of natural history?
Pingback: Ecology and statistical machismo | Vishwesha Guttal
Pingback: Dynamic Ecology group blogging update | Dynamic Ecology
Pingback: Biggest day ever for Dynamic Ecology! | Dynamic Ecology
So, with regard to spatial autocorrelation and phylogenetic effects, would it be fair to summarize: the problem is essentially one of overestimating the effective degrees of freedom in the data, and thus (irrespective of your estimation method or bias or otherwise of point estimates) your frequentist p-values will be unrealistically small and/or your Bayesian posteriors unrealistically narrow? And can your argument be summarized as “Often the effect is small, and can and should be ignored – it’s not worth the trouble of doing the stats in a way which corrects for it”?
I quite agree, in many cases (highly labile traits, very different spatial scales) the effect is easily small enough to ignore.
However, without fitting a model that accounts for correlation between observations (be it phylogenetic correlation or spatial autocorrelation, or any other form of pseudoreplication) how can you be /sure/ the effect is small enough to ignore?
#R-code
require(MASS)
#Here's a set of spatial observations that show considerable spatial autocorrelation:
set_1<-kde2d(runif(500,0,1),runif(500,0,1),n=51)$z[4:40,4:40]
image(set_1,axes=FALSE)
#Pretend they are (e.g.) soil moisture
#Here's a second, different set of spatial observations that also show considerable spatial autocorrelation:
set_2<-kde2d(runif(500,0,1),runif(500,0,1),n=51)$z[4:40,4:40]
#Pretend they are (e.g.) vegetation height
# In this case, set_1 and set_2 are random in that they are not in any way correlated with each other.
#Without looking, can I safely just do
cor.test(as.vector(set_1),as.vector(set_2))
#without worrying that they are autocorrelated?
plot(as.vector(set_1),as.vector(set_2),axes=FALSE,xlab="",ylab="")
m<-lm(as.vector(set_2)~as.vector(set_1))   #fit the OLS regression
abline(m,col="red")
ms<-summary(m)
p<-pf(ms$fstatistic[1L],ms$fstatistic[2L],ms$fstatistic[3L],lower.tail=FALSE)   #overall p-value of the regression
mtext(paste("p=",signif(p,2),sep=""),side=1,col="red",cex=1.2)
#in fact these uncorrelated datasets will give p<10^-20 more than 50% of the time, i.e. the p-value is wrong by /many/ orders of magnitude. How can you know this in advance, without fitting the spatial autocorrelation?
Thanks for the post. R code – that’s awesome. Totally raising the standard of the conversation here!
So, yes you understand my claim correctly. I spent a little time, analyzing what you did. And:
1) That is a really wacky way of generating spatially autocorrelated surfaces, apparently depending on flaws in the random number generator in R (see figure posted below). In principle though it shouldn’t matter for my post.
2) The normal spatial model is that the dependent variable is a deterministic function of the independent variable plus spatially autocorrelated noise, which you aren’t matching. That doesn’t mean it’s right, but you are clearly deviating from it.
3) I can’t quite match your result of p<10^-20 >50% of the time, but I agree it comes out as p<0.05 quite often.
4) Even if I take the 1367 degrees of freedom and reduce it down to 20 degrees of freedom (an extraordinarily severe downgrade, i.e. basically way overguessing on Dutilleul’s correction), and calculate a new F statistic on 1,20 df instead of 1,1367 df, I still get p-values almost as significant.
5) When I plot the data there is indubitably a relationship between the two variables (and not just at the macro scale but there is clearly some weird spirals and loops in the data that shouldn't be in truly independent data). See image posted below.
6) When I go on and do a spatial regression (GLS) it (as I claimed) doesn't change much; the p-value becomes less significant (and yes it may change from p<10^-17 to p<10^-15, but it is usually 1 or at most 2 orders of magnitude, which still makes me feel pretty comfortable ignoring the correction when I see p<0.0001). And I wouldn’t get too excited about “orders of magnitude” changes that close to zero – they probably have more to do with numerical instability of the calculations than something real. Most particularly, in my runs I never saw the correction change something from significant to not-significant. (I am sure if I ran it long enough one would pop up, but it is not that common.)
7) Even on this relatively small (37×37 dataset) it took over 30 minutes to run the GLS on a decent (64-bit R, 8GB memory) machine.
So my conclusions:
a) I'm not going to spend time delving into the details, but your method clearly is generating two datasets that actually are correlated even though they appear to be generated independently. I conclude this from #4, #5 and #6
b) My original argument still stands – running the spatial regression is costly (30 minutes vs <1 second on a small dataset and I believe this scales as n^2 so good luck on large datasets) and it does not change the outcome that much
c) To me the most interesting question is why the two seemingly independently generated datasets are really correlated with each other. I have two theories: a) it is a flaw in the R random number generator (suggested by the scatter plot with some clear not random interrelationships) or b) independently randomly generated autocorrelated variables have a high likelihood of still being correlated due to some sort of constraint. Based on my experiences in niche modelling where I have seen almost any layer have some explanatory power for almost any dependent variable, I have long suspected this was true. I can't yet explain why nor would I call this experiment decisive proof. But I wonder if any readers know why?
PS – I ran into a tiny error at the end of your code (the p-value needs to be computed from the lm fit first – I have filled that in above), and I have appended the complete code with that fix plus the GLS analysis:
#generate two random surfaces – method is to generate random x coords, y coords and then do
# a kernel smooth
require(MASS)
x<-runif(500,0,1);y<-runif(500,0,1);plot(x,y)
set_1<-kde2d(x,y,n=51)$z[4:40,4:40]
image(set_1,axes=FALSE)
x<-runif(500,0,1);y<-runif(500,0,1);plot(x,y)
set_2<-kde2d(x,y,n=51)$z[4:40,4:40]
image(set_2,axes=TRUE)
#are they correlated?
cor.test(as.vector(set_1),as.vector(set_2))
plot(as.vector(set_1),as.vector(set_2),axes=FALSE,xlab="",ylab="")
abline(m<-lm(as.vector(set_2)~as.vector(set_1)),col="red")
ms<-summary(m)
p<-pf(ms$fstatistic[1L], ms$fstatistic[2L],ms$fstatistic[3L], lower.tail = FALSE)
plow<-pf(ms$fstatistic[1L], ms$fstatistic[2L],20, lower.tail = FALSE)
#print p-value and p-value when df downgraded to only 20 df
mtext(paste("F=",signif(ms$fstatistic[1L],3),"; p=",signif(p,2),"; p20df=",signif(plow,2),sep=""),side=1,col="red",cex=1.2)
#now check in GLS spatial regression context
colno<-matrix(rep(seq(1,ncol(set_1)),each=nrow(set_1)),nrow(set_1),ncol(set_1))
rowno<-matrix(rep(seq(1,nrow(set_1)),ncol(set_1)),nrow(set_1),ncol(set_1))
#gls only works on dataframe
d<-data.frame(set1=as.vector(set_1),set2=as.vector(set_2),rowno=as.vector(rowno),colno=as.vector(colno))
require(nlme)
# w/o autocorrelation – should = OLS (lm)
mgls<-gls(set2~set1,data=d)
#measure autocorrelation
plot(Variogram(mgls,form=~colno+rowno))
#spatial regression
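# (starting values for corSpher with nugget=T: c(range, nugget) – here a range of
#  roughly 28 grid cells and a 0.2 nugget, which gls then refines during fitting)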
mglssp<-update(mgls,corr=corSpher(c(28,0.2),form=~colno+rowno,nugget=T))
summary(mglssp)
Figure: set 1 vs set 2 – note the clear true correlation, the clearly non-random patterns (looks like a fish!), and the fact that the p-value is still quite significant even with only 20 degrees of freedom.
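A quick extension of the appended code (a minimal sketch along the same lines, not part of the original comment): rerun the kernel-smoothing recipe many times and tally how often two independently generated smooth surfaces give a "significant" naive OLS slope – one way to put a number on conclusion (c).
require(MASS)
make_surface <- function() {
  # same recipe as above: kernel-smooth 500 uniform random points onto a 37x37 grid
  x <- runif(500); y <- runif(500)
  kde2d(x, y, n = 51)$z[4:40, 4:40]
}
set.seed(42)
pvals <- replicate(200, {
  s1 <- as.vector(make_surface())
  s2 <- as.vector(make_surface())
  summary(lm(s2 ~ s1))$coefficients[2, 4]   # naive OLS p-value for the slope
})
mean(pvals < 0.05)   # how often "independent" smooth surfaces look related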
Hey Brian, forgive me for coming to the party late – the term is over and I can finally find a chance to read and write a little. It seems to me that the original set of analyses can be broken into two groups. The first contains analyses that are never wrong but are occasionally, and perhaps often, not worth the effort; I would say phylogenetic analyses and spatial regression fall into this group. Even if phylogeny or spatial autocorrelation explains little of the variation in the dependent variable, it's rarely (and maybe never) conceptually wrong to use these approaches. If I understand your point, it's really a cost-benefit story, and a knee-jerk insistence on complex analyses that are going to add little to a story is just wrongheaded. I completely agree with that.
The other category includes analyses that can be and often are wrongly applied. Bonferroni corrections are a classic example. I would assert that any time somebody applies a Bonferroni correction without (1) explicitly stating the assumed relative costs of Type I and Type II errors and (2) estimating the probability of a Type II error for their test, they are making a mistake. And it is rare that I have seen authors make an explicit connection between their Bonferroni correction and the power of their test.
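A minimal sketch of point (2), with hypothetical numbers rather than anything from the comment: the Type II error a Bonferroni correction implies can be stated explicitly with a power calculation.
# Hypothetical numbers: 30 comparisons, n = 20 per group, a true effect of 0.75 SD
alpha_bonf <- 0.05 / 30                 # Bonferroni-corrected alpha
pw <- power.t.test(n = 20, delta = 0.75, sd = 1, sig.level = alpha_bonf)
pw$power       # power for each real effect at the corrected alpha
1 - pw$power   # implied Type II error rate the correction quietly accepts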
It's something that is routinely overlooked when discussing the multiple comparisons problem: multiple comparisons increase the probability of both Type I and Type II errors. For example, it has become routine in analyses of microarray data to use false discovery rates (FDRs) to control Type I error rates, because those studies often run 10,000 statistical tests at once. However, Type II error rates are enormously inflated as well, and this is rarely discussed. Multiple comparisons are a very real problem, but not a problem that can routinely be solved by some manipulation of the data at hand. Almost always the solution is among-study replication – a time-honored scientific tradition that seems to have fallen a little out of favour.
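And a minimal sketch of that trade-off under many simultaneous tests (simulated data, an illustration rather than the commenter's analysis): on the same set of p-values, a Bonferroni correction retains far fewer of the true effects than a Benjamini-Hochberg FDR adjustment.
set.seed(1)
n_tests <- 1000
is_real <- c(rep(TRUE, 200), rep(FALSE, 800))            # 200 genuine effects
z <- rnorm(n_tests, mean = ifelse(is_real, 2.5, 0))      # modest signal strength
p <- pnorm(z, lower.tail = FALSE)                        # one-sided p-values
sum(p.adjust(p, method = "bonferroni")[is_real] < 0.05)  # true effects kept by Bonferroni
sum(p.adjust(p, method = "BH")[is_real] < 0.05)          # true effects kept by FDR (BH)
sum(p.adjust(p, method = "BH")[!is_real] < 0.05)         # false positives let through by BH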
I have to confess I haven’t figured out where Bayesian analysis and detection probabilities fit into this story although it seems to me that Bayesian analysis with informative priors is often fraught with problems. Thanks for a great thread.
Hi Jeff – never too late to join the party at Dynamic Ecology. I'm just jealous of your end-of-term luxuriating – I still have two more weeks to go 😉 I certainly agree with the first category – it's a trade-off between the level of effort and what can be very small and unimportant improvements (and it is predictable in advance whether the improvements will be unimportant or not). I never thought of the Bonferroni issue as misuse driven by ignoring Type II error, but I can certainly see it that way. Thanks for the insight.
Dear Brian,
I have read your post with interest and agree with most of it. However, there is something you have not talked about that might (sometimes) justify the use of complex analytical tools: quite often complex statistical methods are needed simply because the experimental design is ill-defined. For example, there are potential sources of spatial autocorrelation that require the use of particular statistical techniques (e.g. linear mixed-effects models). As a peer reviewer for different scientific journals, I am more than happy to see that the authors have simply used a one-way or two-way ANOVA as the main (and only) statistical method in their paper, provided that the experimental design is well defined. The ideas behind the work and the hypotheses to be tested are far more important than the analytical tools. But even with good ideas, PhD students (and their mentors) are often unaware of the importance of an appropriate experimental design and end up with a great mess of data that requires complex statistical methods to make reliable inferences.
Thanks for such a stimulating post!
Luis
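A minimal sketch of the kind of fix Luis is pointing at (a hypothetical nested design, not his example): when plots are grouped within sites, a site-level random intercept absorbs much of the spatial structure that would otherwise invite a full spatial regression.
library(nlme)
set.seed(1)
# Hypothetical design: 8 plots in each of 10 sites, with a site-level
# random bump that mimics design-driven spatial autocorrelation
d <- data.frame(site = factor(rep(1:10, each = 8)), x = rnorm(80))
site_effect <- rnorm(10, sd = 1)
d$y <- 2 + 0.5 * d$x + site_effect[d$site] + rnorm(80, sd = 0.5)
m <- lme(y ~ x, random = ~ 1 | site, data = d)  # mixed model with site random intercept
summary(m)$tTable                               # fixed-effect estimates and tests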
Although I could say more, I will just make one brief comment on your discussion of Bayesian statistics. One of the strengths of Bayesian methods is the ability to incorporate error at all levels of an analysis. This can address one of the problems you mentioned with phylogenetically independent contrasts – the assumption of an error-free phylogeny. The uncertainty in the phylogeny (assuming our hypothetical scientist has an estimate of it) can be worked into the analysis, and the posterior distribution of whatever statistic is of interest will then reflect that uncertainty. (Technically this is a strength of MCMC as well, as you point out; however, MCMC is most commonly used in Bayesian analyses, and, regardless, if we are speaking about complexity of analysis, the difference between a Bayesian and an ML MCMC analysis is trivial.) There is real value to be gained from this kind of analysis. Will your point estimate of whatever statistic change much? Probably not. However, the distribution around that estimate might. I would argue that good estimates of error and variance are much more important than p-values (outside a strictly experimental setting). Is there a cost to this? Absolutely. Is it worth it? Perhaps not, some of the time, but it should always be considered. Lastly, your point about communication is well put. The onus is on the users of advanced statistical methods to explain them clearly. We are obviously still learning, as we are with frequentist statistics as well – how many times have you seen a confidence interval interpreted incorrectly?
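A minimal frequentist-flavoured sketch of the idea (simulated data and trees, an illustration rather than the commenter's Bayesian setup): refit a PGLS across a sample of candidate trees and look at the spread of the slope, so that tree uncertainty shows up in the answer. A full Bayesian analysis would instead sample the tree and the regression jointly.
library(ape)
library(nlme)
set.seed(1)
# Stand-in for a posterior sample of trees: one 25-tip tree with its branch
# lengths jittered 50 times (in practice you would read in your tree sample)
base_tree <- rcoal(25)
trees <- lapply(1:50, function(i) {
  t <- base_tree
  t$edge.length <- t$edge.length * rlnorm(length(t$edge.length), 0, 0.2)
  t
})
dat <- data.frame(spp = base_tree$tip.label, x = rnorm(25))
dat$y <- 0.5 * dat$x + rnorm(25)
# PGLS slope under one tree; needs a reasonably recent ape for the form = ~spp interface
fit_one <- function(tree) {
  coef(gls(y ~ x, data = dat,
           correlation = corBrownian(phy = tree, form = ~spp)))["x"]
}
slopes <- sapply(trees, fit_one)
quantile(slopes, c(0.025, 0.5, 0.975))  # slope estimate with tree uncertainty folded in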