Why saying you are a Bayesian is a low information statement

My impression is that the decades-long debate about the role of Bayesian statistics has generated a lot more smoke than light. As Larry Wasserman at the Normal Deviate blog notes in his blog birthday self-reflection post, one of his lessons learned is: “Put ‘Bayes’, ‘Frequentist’ or ‘p-value’ in the title of a blog post and you get zillions of hits. Put some combination of them and get even more”. But are we actually getting anywhere with all of this debate? Probably, yes, rational people are learning some things. But I would suggest there is a major semantic problem that we need to overcome if we want to start having more light than smoke.

Namely, telling me you are doing Bayesian statistics tells me very little. It is such a broad term that it doesn’t tell me a whole lot more than telling me you are doing statistics. So don’t drop the word “Bayesian” into a paper or conversation and expect me to have a big reaction one way or the other. I couldn’t possibly have an informed opinion about what you’re doing until you tell me more. So why do people keep saying “Bayesian” and then stopping and waiting for a reaction like they’ve just told me something provocative? Indeed, I am getting to the point where if you don’t know enough to know that saying you are doing Bayesian statistics is not saying much, I’m likely to think you don’t know very much and are in fact solely trying to be provocative!

In this post, I would like to argue that there are three main types of Bayesian statistics. So if you want to drop the word Bayesian in a conversation or paper, please at least drop one more qualifying word in front of it, selected from the three qualifying words below. It will save us both a lot of time! This is of course not a new idea. The aforementioned Normal Deviate has had several great posts on common myths about Bayesian vs. Frequentist. And I had a post on statistical machismo in this blog just under a year ago, which drew a record (for us) number of comments, where I first put forth my 3 types of Bayesian because I saw commenters talking past each other. And frequent commenter Steve Walker provided a link to a paper that said you were being unclear about what kind of a Bayesian you were unless you answered nine questions, implying there are 2^9-1=511 types of Bayesian! These kinds of distinctions might be important for statisticians but are too much for ecologists – I am sticking with just 3 types.

To make my comments clearer, let me recap Bayes Theorem. Bayes Theorem says P(A|B)P(B)=P(B|A)P(A). This is basically a statement of a symmetry of conditional probabilities, but it has to be adjusted by the raw probabilities of the events conditioned upon. It is more commonly written as:
$P(\theta|D)=\frac{P(D|\theta)P(\theta)}{P(D)}$

It’s hard to imagine one little equation (which is an undisputed law of probability) causing so much grief! Let’s dissect this a little bit. The P(θ|D) on the left hand side of the equation gives the probability of the model parameters (i.e. θ – for example the t-test effect size μ or regression slope β) taking on different values conditioned on the data observed (D). In the Bayesian world P(θ|D) is called the posterior. The P(D|θ) term on the top of the fraction is the likelihood (in the exact sense you have seen it in other statistical contexts). Note that likelihood does not sum to 1 over all values of θ – rather it sums to 1 over all values of D, so it is not a true probability distribution of θ. The P(D) in the denominator is a renormalization constant (the sum or integral of P(D|θ)P(θ) over all values of θ) that turns things back into a probability (i.e. something that sums to 1 over θ). This is the main difference between this version of Bayes Theorem and likelihood – likelihood doesn’t sum to 1 because it doesn’t have the normalization constant P(D). Being a true probability is really nice, but as we’ll see it’s often a lot of extra work to get P(D) incorporated. The other main difference from likelihood is the P(θ). This is the probability of the parameter θ not conditioned on the data – i.e. before taking the data into account. This is the real stumper – if you’re trying to estimate θ how could you know the probability distribution of θ!? Bayesians say that this term represents your prior beliefs about θ. Thus Bayes theorem in this context is a machine for combining observed data D with prior beliefs about θ (P(θ)) and a renormalization constant (P(D)) to give the posterior probability of θ. See the three figures below, which share the same likelihood P(D|θ) (i.e. the same data) and differ only in having progressively flatter (less informative) priors P(θ).

Top figure is the prior P(θ). Middle figure is the likelihood P(D|θ). Bottom figure is the posterior P(θ|D). The posterior gives a probability distribution for θ. The vertical bar shows the mean of each distribution. Note how the posterior is a form of very fancy weighted average of the prior and likelihood.

Note how as the prior becomes less peaked (i.e. more flat), the likelihood=P(D|θ) term comes to dominate the resulting posterior. And indeed in this example with a flat prior the posterior has the same shape as the likelihood (but is scaled differently on the y-axis to make the area under the curve be 1).
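The “fancy weighted average” can be made concrete in the one case where the algebra is simple: a normal prior combined with a normal likelihood. The sketch below is only an illustration under that assumption; the function name `normal_posterior` and all of the numbers are made up for the example and are not taken from the figures. The posterior mean is a precision-weighted (1/variance-weighted) average of the prior mean and the data mean, so flattening the prior hands the weight over to the likelihood.

```python
def normal_posterior(prior_mean, prior_var, data_mean, data_var):
    """Posterior mean and variance for a normal prior and a normal likelihood.

    data_var is the variance of the likelihood as a function of theta
    (for n observations with known sigma this would be sigma**2 / n).
    """
    prior_precision = 1.0 / prior_var
    data_precision = 1.0 / data_var
    post_var = 1.0 / (prior_precision + data_precision)
    # Precision-weighted average of the prior mean and the data mean.
    post_mean = post_var * (prior_precision * prior_mean + data_precision * data_mean)
    return post_mean, post_var


# Hypothetical numbers: likelihood centred on 5 (variance 1), prior centred on 0.
likelihood_mean, likelihood_var = 5.0, 1.0
results = {}
for prior_var in (0.25, 4.0, 1_000_000.0):  # progressively flatter priors
    results[prior_var] = normal_posterior(0.0, prior_var, likelihood_mean, likelihood_var)
    post_mean, post_var = results[prior_var]
    print(f"prior variance {prior_var:>9}: posterior mean {post_mean:.3f}, variance {post_var:.3f}")
```

With these made-up numbers the posterior mean moves from 1.0 (narrow prior) to 4.0 to essentially 5.0 (near-flat prior), at which point the posterior is just the likelihood rescaled.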

This formula and its corresponding labeling of terms is the one thing all Bayesian statisticians agree on. Beyond that, chaos reigns! To try and cut through the chaos, I will now present three main types of Bayesians and argue that each has their own interpretation of the terms and reasons for using Bayes theorem. Although there are many more subtle intergradations, I think a nice starting point would be for everybody to start out saying which of the three types of Bayes they are.

1. Subjective Bayesian – This is the notorious version. A Subjective Bayesian rejects the frequentist idea that probability is the long-run fraction of trials in which an event occurs, i.e. the number of times the event occurs divided by the number of trials observed (e.g. rolling a one on a six-sided die about 6 times out of 36, or 16.667%). Instead, a Subjective Bayesian says probability is inherently grounded in human belief. The probability of rolling a six is based on my beliefs about the outcome, ultimately residing in my mind. In my post on statistical machismo, a commenter brought up the example of the “probability that there is life on Mars” and asked how one could possibly have a frequentist interpretation of this while noting we all have some probability estimate of this in our minds (and to prove it you can bet on this with British bookmakers if you want). Of course a frequentist would say it’s an ill-posed question – the proper question would be something like “what are the odds of life on non-gaseous 4th planets out from G-type stars” – at which point a frequentist view clearly is valid again. From this subjective definition follow many other features of what people typically think of as Bayesian. Subjective Bayesians believe the prior, P(θ), represents my subjective beliefs before collecting the data (or in a slightly weaker version it could represent the beliefs of experts in the subject using some mix of prior data and their intuition). And this is why Subjective Bayesians make a big point about the posterior leading to things like Credible Intervals (an interval that one believes contains the parameter with a stated probability) rather than confidence intervals (which are inherently frequentist – the % of repeated experiments in which an interval constructed this way would contain the true value). In my experience, most ecologists are uncomfortable with Subjective Bayesian approaches and true Subjectivist Bayesians are more likely to be found in philosophy or math departments.
But there is no small number of ecologists who enjoy saying they are Bayesian because of the daring (=non-objective) overtones of Subjective Bayesians even if they don’t want to defend the subjective view when pushed.
2. Historical Bayesian – In another view of Bayesian approaches, the emphasis is NOT on the definition of probability, but on the role of the prior. Indeed more often than not, this approach uses a frequentist definition of probability. In this flavor of Bayes, the key point is that the prior represents, well, prior information, i.e. it is a summary of the expected distribution of θ based on all prior data collected. Thus Bayes theorem is a mathematically precise way of smashing previous work against a new dataset to make a new prediction. If previous data has been very consistent (leading to a very narrow, strongly peaked prior) and the new data is highly variable (leading to a very flat likelihood for θ), then the prior will dominate the new result, and vice versa (see figures above). If both the prior and the likelihood are normal, relatively simple formulas comparing the variance in the two distributions can be used to calculate the posterior. It is an interesting question why you would favor serial updating of the probability distribution using repeated applications of Bayes theorem for each new data collection rather than performing a comprehensive meta-analysis on all data collected. But this historical Bayesian approach is making no claims about the subjectivity of the prior – indeed it is explicitly invoking an empirical, more objective use of the prior based in historical data. Nate Silver, in his recent book on prediction, calls himself Bayesian (in a generic unqualified sense) but he appears to mean this in the historical Bayesian sense, iteratively improving and correcting. As Wasserman demonstrates, Silver’s definition of probability is about as frequentist and non-subjective as you can get*. In my experience teaching, the idea of a historical prior is something that makes a light bulb go off on why you might want to bother with the whole Bayesian machinery.
Wyckoff and Clark 2000 is a nice pedagogical paper showing how in a world with limited data, prior information can be useful through a historical Bayesian prior.
3. Calculation Bayesian (called Computational Bayesian later in this post) – A lot of ecologists are uncomfortable even with the idea of a historical prior (or prefer, as I do, a meta-analysis that puts all data on equal footing rather than distinguishing added data vs prior data). So the solution is to use what is known as an uninformative prior (a prior that is literally flat, representing a uniform distribution, or for practical computational reasons more commonly a normal distribution with a variance of 1,000,000 or some such that is effectively flat). This basically removes any influence of the prior. If you take the prior out, what do you have left? Well literally you have likelihood renormalized by P(D) to convert likelihood into a true probability distribution. As the 3rd figure shows, a flat prior causes the likelihood and posterior to have identical shapes with only a rescaling on the y-axis (therefore giving identical point estimates of parameters, confidence intervals around parameters, etc even if they have different names – yes that’s right the credible interval can equal the confidence interval – it’s just a question of interpretation). For this small difference, including P(D) means you have a really nasty computational problem. P(D) normally requires evaluating complex high-dimensional integrals that are too slow to calculate by traditional deterministic numerical methods (e.g. quadrature). This has led to the development of Markov chain Monte Carlo (MCMC) methods. There is not space to delve into MCMC here, but what it produces is a constrained random walk through parameter space (possible values of θ) that visits regions of parameter space in proportion to their posterior probability. Thus if you take a long-enough random walk and make a histogram of every point visited in the random walk, you have a good approximation to the posterior. And when you get right down to it, this is why the third group calls themselves Bayesian.
Once you’ve thrown out the prior, all you have left is the renormalization constant to distinguish yourself from likelihood. And while a renormalized true probability distribution is nice, it is not necessary. We have all sorts of tools for working with likelihood, including MLE (maximum likelihood estimates of parameters), likelihood surfaces (which can be converted into likelihood intervals), likelihood ratio tests, AIC, etc. What you really pick up by calling yourself a Calculation Bayesian is not so much the renormalization constant, but the fancy MCMC calculations created to deal with it. In many respects once you have a flat prior you are just doing a very hard version of likelihood calculations with a complex but very general and powerful tool.
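For readers who have never looked inside the MCMC box, the random walk described under the third type can be sketched in a few lines. This is a bare-bones random-walk Metropolis sampler, one of the simplest members of the MCMC family and not what WinBUGS or JAGS actually run. The data, step size, chain length and burn-in are all hypothetical choices for illustration. It assumes a normal mean with known standard deviation and a flat prior, so the target of the walk is just the likelihood up to a constant, and P(D) never has to be computed because it cancels in the acceptance ratio.

```python
import math
import random


def log_likelihood(theta, data, sigma=1.0):
    """Log-likelihood of a normal mean theta with known sigma (constants dropped)."""
    return -sum((y - theta) ** 2 for y in data) / (2.0 * sigma ** 2)


def metropolis(data, n_steps=20000, step_size=0.5, start=0.0, seed=1):
    """Random-walk Metropolis under a flat prior: target is proportional to the likelihood."""
    rng = random.Random(seed)
    theta = start
    current = log_likelihood(theta, data)
    samples = []
    for _ in range(n_steps):
        proposal = theta + rng.gauss(0.0, step_size)
        candidate = log_likelihood(proposal, data)
        # Accept with probability min(1, L(proposal) / L(theta)); P(D) cancels here.
        if rng.random() < math.exp(min(0.0, candidate - current)):
            theta, current = proposal, candidate
        samples.append(theta)
    return samples


data = [4.1, 5.3, 4.8, 5.9, 4.4, 5.2, 5.0, 4.7]  # hypothetical observations
samples = metropolis(data)[2000:]                 # drop the first steps as burn-in
mle = sum(data) / len(data)                       # maximum likelihood estimate
posterior_mean = sum(samples) / len(samples)
print(f"MLE: {mle:.3f}   posterior mean from the random walk: {posterior_mean:.3f}")
```

A histogram of `samples` approximates the posterior, and with the flat prior its centre should land very close to the maximum likelihood estimate, which is the point of the paragraph above.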

Note that these three types are effectively strictly nested. Historical Bayesians usually incorporate the Computational Bayesian methods. And Subjective Bayesians usually incorporate both. But not the other direction.

I also wanted to make a point about two qualifiers that SHOULD NOT get dragged into debates about Bayesian methods.

First, saying you are using “hierarchical” methods says nothing about whether you are Bayesian or not. Hierarchical is a type of model specification that allows for multiple sources of variability. In particular, it often involves not treating a population as being represented by the mean value but as having a distribution of values. This distribution of values is described by what are known as hyperparameters. Hierarchical modelling is an exciting innovation as it is clear ecology has LOTS of sources of variability and being able to model these more carefully is important. However, hierarchical models can be estimated and solved by either Bayesian methods or variations of maximum likelihood (REML, penalized likelihood and EM are names you may have heard in these contexts). Some simple hierarchical models are solvable by traditional linear mixed model methods (e.g. nlme in R), more complex ones are often solvable by Ben Bolker’s bbmle package in R, and if those fail you can use MCMC, simulated annealing and related algorithms or even Subhash Lele’s data cloning method (which to put it crudely and simplistically is bootstrapping meets MCMC) to solve for the likelihood. A nice pedagogical paper, Farrell and Ludwig 2008, that unfortunately for us draws its examples from psychology, gives a clear example of a hierarchical model and shows that although non-hierarchical models perform adequately, hierarchical models are clearly better, but there is no difference in Bayesian vs likelihood. It is true that MCMC is the most flexible and sometimes only solution technique that works on really complex models. But if that is the reason you are adopting MCMC and you are using a flat (uninformative) prior, you are basically doing MCMC likelihood with a renormalization constant.
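To show that “hierarchical” and “Bayesian” really are separate choices, here is about the simplest hierarchical model there is, fitted with no prior and no MCMC at all. It is a sketch only: it uses the classical ANOVA (method-of-moments) estimator for a balanced one-way random-effects model as a stand-in for the likelihood-based fits named above (it is not what nlme or bbmle do internally), and the function name `variance_components` and the site data are hypothetical.

```python
from statistics import mean


def variance_components(groups):
    """ANOVA (method-of-moments) estimates for a balanced one-way random-effects model.

    Model: y_ij = mu + a_i + e_ij, with a_i ~ N(0, var_between) and e_ij ~ N(0, var_within).
    var_between is the hyperparameter describing how much group means vary.
    """
    k = len(groups)
    n = len(groups[0])
    if k < 2 or n < 2 or any(len(g) != n for g in groups):
        raise ValueError("this sketch needs a balanced design with >= 2 groups of >= 2 values")
    group_means = [mean(g) for g in groups]
    grand_mean = mean(group_means)
    ms_between = n * sum((m - grand_mean) ** 2 for m in group_means) / (k - 1)
    ms_within = sum((y - m) ** 2 for g, m in zip(groups, group_means) for y in g) / (k * (n - 1))
    # Truncate at zero: the moment estimator can go negative when groups barely differ.
    var_between = max(0.0, (ms_between - ms_within) / n)
    return grand_mean, var_between, ms_within


# Hypothetical measurements of one quantity at three sites, three plots per site.
sites = [[10.0, 11.0, 12.0], [14.0, 15.0, 16.0], [18.0, 19.0, 20.0]]
grand_mean, var_between, var_within = variance_components(sites)
print(f"overall mean {grand_mean:.2f}, among-site variance {var_between:.2f}, "
      f"within-site variance {var_within:.2f}")
```

The two variance estimates are the “multiple sources of variability”: one for how sites differ from each other and one for how plots differ within a site.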

Second, p-values or significance tests are basically only done in the frequentist world, but it is important to note that being frequentist does not imply that you are doing p-values or significance testing. It is very common to conflate frequentist with p-values. As you might have gathered from my posts on prediction, I am just one example of somebody who is frequentist without being particularly enamored of p-values and significance testing. It is also worth noting that Bayesians have recreated things like p-values. Specifically Bayes factors reduce to likelihood ratios with flat priors and, yes, if you are recalling that likelihood ratios are an easy path to p-values you are correct. To be fair most Bayesians are not enamored with Bayes factors, but it proves the point that comparing something to a null model is an inferential mode that is independent of Bayesian vs non-Bayesian. And to be fair, a lot of frequentists are not enamored with p-values either.
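The “easy path” from a likelihood ratio to a p-value can be sketched as follows. The example assumes the simplest possible setting (a normal mean with known standard deviation, tested against a single null value), where twice the log likelihood ratio is compared to a chi-square distribution with 1 degree of freedom. The function names, the data and the null value are hypothetical, and the code illustrates only the likelihood-ratio step, not a Bayes factor calculation.

```python
import math


def log_likelihood(theta, data, sigma=1.0):
    """Full normal log-likelihood for mean theta with known sigma."""
    n = len(data)
    return (-0.5 * n * math.log(2.0 * math.pi * sigma ** 2)
            - sum((y - theta) ** 2 for y in data) / (2.0 * sigma ** 2))


def likelihood_ratio_test(data, theta_null, sigma=1.0):
    """Compare the best-fitting mean against a null value of the mean."""
    theta_hat = sum(data) / len(data)  # maximum likelihood estimate of the mean
    statistic = 2.0 * (log_likelihood(theta_hat, data, sigma)
                       - log_likelihood(theta_null, data, sigma))
    statistic = max(statistic, 0.0)    # guard against tiny negative rounding error
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


data = [4.1, 5.3, 4.8, 5.9, 4.4, 5.2, 5.0, 4.7]  # hypothetical observations
statistic, p_value = likelihood_ratio_test(data, theta_null=4.0)
print(f"likelihood ratio statistic {statistic:.3f}, p-value {p_value:.4f}")
```

Nothing in this calculation cares whether the person running it calls themselves Bayesian or frequentist; it is just a comparison against a null model.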

So conflating frequentist with p-value is just as bad as conflating Bayesian with hierarchical. They’re both wrong. That in the end is my main point. We need to separate out modes of inference (p-value, parameter distribution, model selection, etc) and model structure (hierarchical vs flat) from frequentist vs Bayesian perspectives, and within the Bayesian perspective we need to separate out different degrees of commitment to the Bayesian philosophy (my 3 categories). You really can mix and match among these factors at will. Thus you can do point estimates of parameters, estimates of parameter distributions, model selection (AIC and BIC), model support vs null models and more in either Bayesian or Frequentist modalities (and the answers are almost always the same if you have flat priors). Until a majority of people are making and understanding these distinctions, the debate will continue to generate more smoke than light.

So if I were the language police, I would fine anybody who says they are Bayesian or “using Bayesian statistics” who doesn’t add a qualifier word to say whether they are a Subjective Bayesian or Historical Bayesian or a Computational Bayesian. At that point I feel like two simple words would communicate quite a lot about what they are doing and how they are approaching things, while just using the one word Bayesian communicates almost nothing. I would also have even higher fines for people conflating hierarchical models with Bayesian philosophy and p-values with frequentist philosophy.

Personally, I would count myself as a likelihood-oriented statistician who favors hierarchical approaches in some contexts but uses non-Bayesian computational methods whenever possible and falls back on Computational Bayesian only when there are no other computational methods that work. But I have no problem with people doing Bayesian analysis as long as they are clear about what they are doing and why they choose that path. I do have a problem with people who say “Bayesian” over and over like it’s a mystical word with lots of secret powers but are not clear about why they are Bayesian and what part of Bayesian they are using. I also get irked when people tell me I have to buy the whole Bayesian package (up to subjectivist definitions of probability) and then just use flat (=no) priors and effectively are just using MCMC to calculate likelihoods. What about you? Are you a subjective, historical or computational Bayesian? Or none of the above?

*Interestingly, I am not aware that Silver even uses Bayes theorem in this sense (as best I understand his models he reruns them each time with all available data and a down-weight of older data that has more to do with temporal autocorrelation than Bayes theorem) – he uses historical Bayesian in a more metaphorical sense of iterative improvement without actual use of Bayes Theorem. But this is a semantic nit pick. It is a great book and you should read it, especially if you liked my posts on prediction.

This entry was posted in Instructional, New ideas by Brian McGill.

I am a macroecologist at the University of Maine. I study how human-caused global change (especially global warming and land cover change) affect communities, biodiversity and our global ecology.

35 thoughts on “Why saying you are a Bayesian is a low information statement”

1. Great post, with the key phrase (for me) being this:

“…and the answers are almost always the same if you have flat priors.”

In Marc Kéry’s excellent introduction to WinBUGS, he shows in chapter after chapter that frequentist and Bayesian approaches give the same answers when uniform priors are used. As such, the key difference for many users is in the coding. A basic linear regression in R is one line of code: summary(lm(y ~ x)) whereas programming the same model in WinBUGS via R can take ~30 lines of code. Not time well spent if you do not intend to use informative priors.

I rarely see priors used in the literature. I often wonder, however, if prior knowledge would lead you to use priors strong enough to influence your results, why are you asking the question in the first place?

I do have a question – I thought that model-selection or maybe model-averaging could be problematic in a Bayesian framework. Do you know of any papers or books that provide a nice summary on the subject?

• Oops! By “rarely see priors” I mean “rarely see informative priors”.

• Hi Andrew – thanks for the comments. Although I in general have yet to figure out for myself a compelling reason to be a historical Bayesian, the Wyckoff & Clark paper I linked to above (I’m pretty sure if you don’t have a subscription you can find a PDF if you google the title) gives one case that sort of makes sense. It looks at tree mortality which is a rare event (in a given year only about 1/100 trees die). Thus one needs to monitor a ginormous number of trees to get an accurate tight estimate. In the paper they combine a previous study nearby in a similar type of woods with the current study. That’s the best example I’ve seen.

You definitely can do Bayesian model selection and model averaging. This review paper by Wasserman covers both clearly, with examples, albeit at a fairly high level of statistical sophistication. If you google “Bayesian model selection” you will find everything from YouTube videos to tutorials. BIC (the most popular alternative to AIC) was originally justified in a Bayesian framework, hence the “B” for Bayesian.

All of which is to say you can do it. Most of the same issues and costs remain.

• I haven’t looked at the Wasserman review, but Bayesian model selection is tricky because there is no agreement on how you calculate the number of effective parameters in a model with random effects (all parameters are “random” in a Bayesian model). As I’m sure you know this is also a problem with random/mixed effects models calculated using maximum likelihood.

BIC is not actually used for Bayesian models, it is a maximum likelihood approximation to the Bayes factor (BF). And the Bayesian equivalent to AIC is DIC (deviance information criterion). So in terms of model selection criteria:

AIC (frequentist) ~ DIC (Bayesian)
BIC (frequentist) ~ BF (Bayesian)

These two groups of criteria differ in how they treat prior model weights, which is a contentious topic for another day by another person.

• “but Bayesian model selection is tricky because there is no agreement on how you calculate the number of effective parameters in a model with random effects (all parameters are “random” in a Bayesian model).”

Dan, can you give a reference for that? Bayes factors don’t need an effective number of parameters afaik. Might be you are referring to DIC?

• Yes Florian, I was talking about DIC. Gelman and Hill (2007) discuss the shortcomings of DIC and the instability in estimation of pD (effective number of parameters) in a multilevel model with random effects. Andrew Gelman is not a fan of formal model selection for a number of reasons.

Link and Barker (2009) provide a thorough discussion of model selection approaches with comparisons of the criteria I listed earlier. They address the problem of prior model weights and even talk about reversible jump MCMC.

Gelman, A., and J. Hill. 2007. Data analysis using regression and multilevel/hierarchical models. Cambridge University Press, New York, USA.
Link, W. A., and R. J. Barker. 2009. Bayesian inference: with ecological applications. Academic Press.

• OK, for DIC this is clear, I was just confused because you mentioned the Wasserman paper in the same sentence.

• “I rarely see (informative) priors used in the literature. ”

Here’s some interesting work on making use of informative priors from a more practical perspective:

” I often wonder, however, if prior knowledge would lead you to use priors strong enough to influence your results, why are you asking the question in the first place? ”

I think you are turning a grey area issue into a black and white issue. The idea is to use just enough prior information to keep your estimates from using too much of the noise in the data (something maximum likelihood can be really bad at), without subjectively overwhelming your data with ‘prior information’.

• Steve, it seems to me weakly informative priors as discussed by Gelman are not really good examples of informative priors – you typically use them in cases of no or very little information to avoid integrability / identifiability problems.

• Well…Gelman et al have used their weakly informative priors in multiple logistic regressions with complete separation and mixed models for which the MLEs of random effects variances are zero. So on one hand I see where you are coming from, because with complete separation we are essentially in a low information setting (i.e. we don’t have enough information to avoid boundary estimates). But on the other hand the idea that ‘we know (or strongly believe?) that the true coefficient isn’t on the boundary’ *is* prior information, just a very weak kind of information. My point was simply that pitting ‘no prior information at all’ against ‘prior information that completely dominates inferences’ is a false dichotomy. Gelman’s weakly informative prior approach is an interesting middle ground.

You could say that this is just regularization and not prior information. But I don’t think so. When you regularize, you’re just asserting that you don’t believe that the true parameters could be extremely large. An argument Gelman uses is something like: smoking has the largest health effect of any risk factor for many health indicators, the effect size of smoking is probably around X, so let’s be conservative and put a prior that puts very little probability on effects (say) twice (or more if you like) as large as X in public health studies. This is prior information of a sort, but largely lets the data speak for itself.

I’ve been thinking that it would be interesting to do this kind of analysis in ecology. What should our default priors be for various fields of ecology? A simple example could be: phosphorus has one of the largest effects on lake trophic status, the effect size of phosphorus is X, so use a prior that falls off for effects larger than 2X.

• Hi Steve – I was indeed thinking of weakly informative priors more as a regularization/parsimony tool, but your view also makes a lot of sense to me – I suppose you could even view parsimony itself as an “informative prior”, although there are of course also information-theoretical justifications as well. The point about the middle ground is a good one.

I think your idea about the priors for ecology is great – additionally that would also make a nice quantitative list of what can say for sure in ecology. “null priors” instead of “null models” 😉

• Florian,

…”that would also make a nice quantitative list of what can say for sure in ecology. ”

Exactly! Brilliant. I don’t think we spend enough time on this kind of list. There would be big challenges, but I think it would be rewarding.

“‘null priors’ instead of ‘null models'”

Love it.

2. Great distinctions, Brian. I wonder if the ‘mystical’ perception of Bayesian approaches comes from people not really knowing what goes on ‘under the hood’ of those analyses. For example, a program like WinBUGS doesn’t require the user to calculate and multiply any of the likelihoods. It is a black box, kind of like ‘lm()’ (although many of us may have worked through the OLS calculation in a stats class).

I will make a plug for the opportunity of using prior information because I think priors are underutilized in ecology. I have found that including priors on some parameters can constrain the options for other parameters that might be difficult to identify otherwise. This could make ‘scaling up’ less hard. We might know quite a bit about how a process works at one scale but be uncertain about how it works at a bigger scale – one could design a study to measure bigger scale observations and use prior information on smaller scale processes, which could reduce the amount of new data collection and allow for some nice propagation of uncertainty.

Also, prior information can be really useful in applied studies, where much could be known about the process from studies in other regions or with similar species but the study requires information for specific conditions and the objective is to make less-uncertain estimates.

• Thanks Aaron. As I hope I conveyed I am sure there are useful scenarios for historical priors. I am just not seeing them too often in the ecological literature. Your example makes sense to me.

The thing I haven’t gotten my head around is why you would use current data/prior data distinction inherent in the historical Bayesian approach instead of just throwing all data (current and old) into a larger pooled or meta-analysis approach. Any thoughts Aaron or anyone else?

• Thanks! One reason that I can think of for separating current and prior is if parameter information comes from a different population than the observations. Observations could be related to the parameters in another setting, but we might not feel comfortable saying that they are the same. That said, it seems like many ecologists are pretty lax about the assumption that observations are equal in different settings (like, if we said that environmental response parameters for a species from site X will work in site Y).

• I think there are a couple of very good reasons for not doing meta-analyses (especially when the number of studies synthesized is small).
First of all if one were to run a random effects meta-analysis the implicit normality of the individual study results may be too strong an assumption. Stated in other terms the normality assumption may be inadequate to capture study heterogeneity which could influence the final results. Although there are alternative approaches that model the distribution of study effects with Dirichlet Process Priors or Normal DP mixtures these are very difficult to run outside a Computational Bayesian perspective (MCMC) and usually require less diffuse priors on the hyperparameters than the N(0,1000000) one.
Secondly, a historical approach may be important as a way to track the accumulated evidence, especially when the studies are thought to convey information on the scale parameters of the problem as well. This would be equivalent to the conjugate priors that one finds in textbook descriptions of the Bayesian approach. These may be more relevant to medical trials (e.g. Sinha’s power prior) or physics experiments rather than ecological contexts though.

By the way you forgot to mention the objective Bayesians, or do you classify them with the Historical ones?

3. I would say that I use MCMC to calculate likelihoods in large part because I can easily specify any model structure I want in an easy-to-use language such as BUGS or JAGS. Obviously there is no need to estimate a simple linear model with MCMC but when there are multiple random effects and several hierarchies the likelihood specification in a program like BUGS is straightforward. And calculating derived parameters and variances using posterior distributions is extremely convenient (delta method, anyone?). The language is also intuitive because the code is a mirror image of the equations you would describe in a paper.

Now certainly BUGS is more of a black box than hard coding the likelihood in R or a more powerful programming language (e.g., C++), but it can be a useful tool for ecologists that are not interested in going that far. One caveat is that it can allow folks to specify nonsensical models but that can actually contribute to the learning process. The biggest downside is the computation time but if your model is complex enough then the calculation of multiple integrals would take nearly as long using a frequentist approach, as Brian pointed out.

I agree that using adjectives such as Bayesian or hierarchical is not that informative. In my experience with the ecological literature it seems most people employing MCMC are following the mode of a Calculation/Computation Bayesian. So in that sense the adjectives pop up just as general descriptors for the approach taken. At the end of the day if people are using vague priors that produce estimates similar to maximum likelihood then the distinction is mostly unimportant and hopefully the authors don’t overemphasize the perceived value of the Bayesian framework (e.g., by putting it in the title).

• “and hopefully the authors don’t overemphasize the perceived value of the Bayesian framework (e.g., by putting it in the title).”

+1!

You’ve boiled my main gripe into one sentence. I see this all the time in the ecological literature.

I agree that when you have many sources of variance the BUGS language can be the most intuitive description.

• I totally agree with Dan here. I flip back and forth between Bayesian and frequentist statistics (which I think Jeremy may consider philosophically inconsistent, and it probably is!) depending on what I am doing: the more complicated an analysis is, the more likely I am to use Computational Bayes (to use your terms, Brian). I’d like to think that makes me pragmatic, but it probably just means I learned how to do complicated analyses under a Bayesian framework and I don’t want to do the legwork to figure it out under a frequentist paradigm.

The main draw in my mind toward Computational Bayes is the ease with which one can estimate derived quantities of interest with associated uncertainties (also mentioned by Dan). If I know I need to do that, I go straight to JAGS.

• There is a group of people in ecology who use Bayesian methods very pragmatically, as a tool that is good for some jobs. You & Dan & Steve all fall in that category as best I can tell. And I think a majority of statisticians have ended up there as well. I don’t think it is a coincidence that ecologists with strong statistical training end up acting like statisticians and being highly pragmatic. I am all in favor of pragmatism (not sure where Jeremy would stand on that – he is in China right now, which shuts off access to WordPress, so he can’t defend himself). Useful tools are more important to me than philosophical consistency. And I hope nothing I said has been construed as being opposed to this approach. It seems a very rational approach to me.

What bothers me is the other usage of Bayesian in ecology which I would argue is by far the more common. It is anything but pragmatic. It treats Bayesian as fundamentally special and different and as I said “mystical”. Dan captured it well by the number of people who put “Bayesian” in their title as if it is the central novelty and importance of the paper. And it is more often than not done by people who don’t really understand what they’re doing well enough to be pragmatic. And these are the people who completely conflate and mix up all the pieces of Bayesian theory (and independent factors like p-values, hierarchical, etc) and can’t separate them out when pushed or justify why they’re doing what they’re doing. This is what bothers me. And there are also some popularizers of Bayesian methods in ecology who I think do know better but still invoke the special mystical version of Bayes for some reason.

4. Brian, I thought I should write a post with those keywords after reading Larry Wasserman’s comment about the hits, but you were faster 😉

On the topic: sure, a lot of people use JAGS or BUGS mainly because the samplers allow an efficient estimation (or an estimation at all) of models that would otherwise be difficult to fit. However, despite some philosophical fog, I actually do think “Bayesian” and “Frequentist” are two useful terms for characterizing a distinction in the goals and the norms applied to make inferential decisions.

The goal of frequentist statistics is to create estimators/decision criteria with favorable properties under repeated trials, independent of the unknown true value. The goal of a Bayesian is to optimally update a prior belief in light of new data.

I see this distinction as more fundamental than the subjective/objective Bayes distinction, and I feel it’s this distinction you are missing when you sort of suggest that Bayes with flat priors equals MLE. Even if they did, for some magic reason, always end up at the same value, it makes a difference how you got there. Plus, Bayesian CIs are NOT generally identical to MLE CIs for flat priors when you have little data in the likelihood, but this is perfectly fine, because you have defined your inferential machinery with a different goal. Because both approaches are more or less sensible, though, it usually doesn’t make a huge difference, so I’m all in for some statistical pragmatism instead of arguing about nonsensical differences http://www.stat.cmu.edu/~kass/papers/bigpic.pdf
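A small numerical illustration of the flat-prior point, as a sketch rather than anything from a real analysis: a binomial proportion with 1 success in 5 trials (made-up numbers). The function names below are my own, the credible interval is computed by normalizing the likelihood on a grid, and the frequentist interval is the textbook Wald interval around the MLE.

```python
import math

def flat_prior_interval(k, n, level=0.95, grid=100_000):
    """Equal-tailed credible interval for a binomial proportion under a flat
    prior, found by normalizing the likelihood on a grid of midpoints."""
    ps = [(i + 0.5) / grid for i in range(grid)]
    # Work on the log scale so large n does not underflow.
    logd = [k * math.log(p) + (n - k) * math.log(1 - p) for p in ps]
    peak = max(logd)
    dens = [math.exp(v - peak) for v in logd]
    total = sum(dens)
    tail = (1 - level) / 2
    lo = hi = None
    cum = 0.0
    for p, d in zip(ps, dens):
        cum += d / total
        if lo is None and cum >= tail:
            lo = p
        if cum >= 1 - tail:
            hi = p
            break
    return lo, hi

def wald_interval(k, n, z=1.96):
    """Wald interval: MLE plus or minus z standard errors."""
    p_hat = k / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z * se, p_hat + z * se

for k, n in [(1, 5), (200, 1000)]:
    b_lo, b_hi = flat_prior_interval(k, n)
    w_lo, w_hi = wald_interval(k, n)
    print(f"k={k}, n={n}: flat-prior Bayes ({b_lo:.3f}, {b_hi:.3f}), Wald ({w_lo:.3f}, {w_hi:.3f})")
```

With five observations the Wald interval dips below zero while the credible interval stays inside (0, 1); with a thousand observations the two are close, which is the asymptotic agreement one would expect.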

One more thing: I’m not sure whether the use of flat (not necessarily uninformative!) priors says much about the community’s state of mind or the usefulness of informative priors – I actually think nearly every ecological study would be able to specify informative priors, but most people will avoid this because a frequentist reviewer (nearly certain you get one) will likely force you to redo your analysis with flat priors, and also it looks much better if your uninformative analysis comes out with the environmental niche of 1000 species than if your informative Bayesian analysis just confirms what we already knew/suspected.

Bradley Efron also comments on the prevalence of non-informative priors in last week’s Science, but expects that we will use more informative priors as we increasingly have to synthesize large and heterogeneous data http://www.sciencemag.org/content/340/6137/1177.full

• Thanks Florian. Interesting points.

I am sure the market for Bayesian in blogs is not saturated! I would enjoy reading your take on this.

You are right that one can draw philosophical differences that are not just subjective vs. objective. I personally haven’t found these differences to inform how I do my work or change the outcome and conclusions of other people’s work. But it doesn’t mean they’re invalid. Like I say I would be curious to hear more about how this affects your work on a practical level.

Interesting point on whether predominance of flat priors in ecology is a preference or a frequency-dependent lock-in phenomenon. Hard to tell. I personally don’t know a lot of people who tell me in private that they wish they could use informative historical priors but avoid them because of reviewers. But maybe I hang out with the wrong people!

• Well, I usually try to stress in teaching that the center of the analysis is the model – once you have specified this, you can typically choose from a range of Bayesian or frequentist methods for confronting this model with your data. These methods report slightly different things, and sometimes there are computational differences. Personally, I couldn’t care less whether people have a subjective or an objective interpretation of a Bayesian CI, because the meaning of the Bayesian CI is clearly defined and doesn’t change depending on your philosophy, but what is clear is that the definition of a Bayesian CI differs mathematically from the definition of a frequentist CI, so this is a clear distinction for me.

About the practical differences: I think you will usually draw the same conclusion from Bayesian and frequentist methods asymptotically (it would be silly if they ended up with totally different insights once we throw in more and more data, wouldn’t it?), but for finite data (especially for very little) and/or strong prior information, results can certainly differ. Also, you work with different software and technical tools, you use different model selection methods, you typically report the distribution rather than the mode, you have a different way of forecasting uncertainty, which can actually have quite profound effects, etc … again, this will usually not turn a negative into a positive effect, but I feel there are certainly some practical differences.

• “…most people will avoid this because a frequentist reviewer will likely force you to redo your analysis with flat priors…”

That’s probably pretty accurate. The following PLOS ONE paper is an interesting example of using informative priors to understand the impacts of forest management on bird communities. I know the authors had trouble with reviewers on this one because folks were skeptical of the informative priors, so they had to report results with vague priors as well.

http://dx.plos.org/10.1371/journal.pone.0059900

5. I like your taxonomy of Bayesians.

-Larry

6. Interesting post. It is scary to think that people conflate “hierarchical” with “Bayesian,” or that all frequentists must compute “p” values.

I don’t follow why you suggest that putting prior data into the prior is somehow not giving it an equal footing. If I split my data in half and calculate a posterior on the first half: P(X1 | theta) P(theta)/P(X1), and then use the posterior as the new prior when I consider the second half, don’t I get the exact same expression as if I considered all the data at once? P(X1 | theta) P(X2 | theta) = P(X1, X2 | theta) under the usual exchangeability assumption…
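The split-data argument can be checked in a few lines. This is a minimal sketch with a conjugate Beta-Binomial model, a flat Beta(1, 1) starting prior and made-up 0/1 data, none of which comes from the post; the identity relies on the observations being conditionally independent given theta.

```python
def update(prior, data):
    """Beta-Binomial conjugate update: prior is (alpha, beta), data is 0/1 outcomes."""
    alpha, beta = prior
    successes = sum(data)
    return alpha + successes, beta + len(data) - successes

flat = (1, 1)                      # Beta(1, 1), i.e. uniform on theta
x1 = [1, 0, 0, 1, 1]               # hypothetical first half of the data
x2 = [0, 0, 1, 0, 1]               # hypothetical second half

sequential = update(update(flat, x1), x2)   # posterior from X1 used as prior for X2
batch = update(flat, x1 + x2)               # all data at once

print("sequential:", sequential)
print("batch:     ", batch)
```

Both routes give the same Beta posterior, so in this setting at least, putting earlier data "into the prior" treats it exactly like the later data.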

I am also afraid I don’t follow how a “uniform prior” magically makes the prior irrelevant. If I parameterize my model differently than you (what you call “a” is what I call 1/a), then a prior that looks uniform in your parameterization looks very informative in mine… (e.g. someone considers the parameter that has a uniform prior to be the “variance”, another the standard deviation, another the inverse variance). I believe this Bayesian philosophy chooses priors for mathematical convenience. I believe the notion of whether the prior is informative or not has more to do with whether we get much the same posterior with a different prior.
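A quick simulation of the reparameterization point, as a sketch with an arbitrary upper bound of 10 on the standard deviation (my choice, purely illustrative): draw the standard deviation from a uniform prior, square it, and look at what that implies for the variance.

```python
import random

random.seed(42)

n = 100_000
sigma_max = 10.0                   # assumed upper bound for the uniform prior on the sd
sigmas = [random.uniform(0.0, sigma_max) for _ in range(n)]
variances = [s * s for s in sigmas]

v_max = sigma_max ** 2
quarter = v_max / 4                # lowest quarter of the variance range

# Share of prior mass the uniform-on-sd prior puts on variance < v_max / 4 ...
share_implied = sum(v < quarter for v in variances) / n
# ... versus what a prior that is uniform on the variance itself would put there.
share_if_uniform = quarter / v_max

print(f"implied prior mass on variance < {quarter:.0f}: {share_implied:.3f}")
print(f"mass under a uniform prior on variance:  {share_if_uniform:.3f}")
```

The uniform prior on the standard deviation puts about half its mass on the lowest quarter of the variance range, so "flat" in one parameterization is clearly not flat in the other.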

Just for fun, I might propose an alternative three categories: those who use Bayesian concepts for philosophical reasons, those who use them for computational reasons, and those who are a bit confused about their reasons.

7. This is not at all your point here Brian, but the thing that nags me the most about all Bayesianism, regardless of how informative the prior(s), is that there’s an underlying assumption that one’s present observations were in fact obtained on a system of the same inherent structure and dynamics as the one from which any and all prior information was obtained. Now I can see that assumption being reasonable in some cases, for example in manipulative experiments over time in which you were the experimental designer, data collector etc., such that the data is consistently assessing the same system. At the other extreme would lie strictly observational data collected by anyone and their brother over time and space, presumed to be assessing the same basic system, but is it? I mean, isn’t an implicit, underlying assumption of Bayesianism that all of your data, present and prior, truly pertains to the same system, and that the variation in the estimates of the critical parameters of interest is determined entirely by stochastic elements, rather than by unrecognized differences in some critical aspect of the system between prior and present data collections? Kind of a big assumption it seems to me.

Again, not your point here, I realize.

8. I don’t really have anything substantive to add, I’m mostly just commenting to say I pretty much agree with everything Brian has to say.

I know some long time readers think of me as strongly anti-Bayesian, so maybe I’ll take the opportunity to clarify that what really bothers me is subjective Bayesianism. That’s the sort that mostly bothers most thoughtful frequentists, I think. Self-described frequentists like Deborah Mayo and Cosma Shalizi are at least mostly fine with the approach of self-described non-subjective Bayesian Andrew Gelman, for instance. And it’s subjective Bayesianism that prompted Brian Dennis to write his famous anti-Bayesian polemic in Ecological Applications. Brian’s post is right that such subjective Bayesians are rare in science, though not entirely unknown. We had one (a very confused one, in my view) pop up in the comment thread on Brian’s statistical machismo post.

What’s more common is for scientists who aren’t really subjective Bayesians to say that they are, or to say things that sort of imply that they might be, without fully realizing the implications. For instance, in his textbook Jim Clark just states flat out, without much elaboration, that Bayesians use subjective probability. I don’t think Jim Clark is actually a subjective Bayesian in the sense Brian identifies–but sometimes he sure sounds like one! And as Brian and Larry Wasserman note, Nate Silver (whose book I just read as well and will probably review for the blog) does indeed say a lot of things that make him sound like a frequentist. But he also says some things that make him sound like a subjectivist. Or when people say that using a flat prior gives you the “same” answer as with a frequentist approach, meaning that it spits out the same numbers, but without noting that those numbers mean totally different things to a frequentist vs. a subjective Bayesian. I think this sort of philosophical waffling about the interpretation of “probability” is actually fairly common among practicing scientists. As Carl notes in the previous comment, a fair number of scientists are probably rather confused about why they’re doing the statistics that they’re doing. And I think it’s unfortunate, because I think you can be quite “pragmatic” and use different statistical methods as appropriate to the problem at hand without waffling about what you mean by “probability”, or being vague about *exactly* what the stats mean and how that information supports the scientific inferences you’re trying to draw. Brad Efron is a great example of pragmatism without philosophical waffling or vagueness, I think.

9. Thanks for the great post, Brian. As someone who takes a pragmatic approach to data analysis, employing both frequentist and Bayesian methods in my own research, I think that the clarity you have provided here is quite useful. Also, given that I am the author of the blog bayesianbiologist.com, one might suggest I’ve chosen a title which has low information content. It is my hope that my readers find the information to be in the *content*, and not solely in the title.

10. Pingback: Bayes again | prior probability

11. Do I have to choose which Bayesian I am now? And when I make a model, do I have to choose which of those three types of Bayesian model it is?

The answer, by the way, is no, I do not. There is a lot of flexibility in how you formulate your priors and why, even within the same model. Sometimes you set a “noninformative” prior (btw, that’s often a misnomer) for computational convenience or because a flat prior actually is the right choice to describe complete ignorance in a particular case. Sometimes you set a weakly informative prior to force the posterior distributions of your parameters to be within some range that makes sense. Another reason is to deal with total or partial separation, or to regularize when you’re trying to combat multicollinearity and nonidentifiability, problems that creep up even when you have shit tons of data. Sometimes, you’re re-running a model every week and incorporating what you learned from the previous week into the prior distribution. Sometimes you want to be able to form finite population inferences over your parameters, in which case you should have a posterior predictive distribution to draw from. And there is no reason, none, why you can’t have some mixture of these strategies in the same model.

So when I say I am a Bayesian statistician, it means that I prefer to use fully Bayesian methods because I enjoy the flexible way in which I can incorporate prior beliefs, and also because those are simply the methods I am most well-versed in.

That’s not to say I won’t [mess] around with the randomForest package now and then, or glmnet.