My impression is that the decades-long debate about the role of Bayesian statistics has generated a lot more smoke than light. As Larry Wasserman at the Normal Deviate blog notes in his blog-birthday self-reflection post, one of his lessons learned is that “Put ‘Bayes’, ‘Frequentist’ or ‘p-value’ in the title of a blog post and you get zillions of hits. Put some combination of them and get even more”. But are we actually getting anywhere with all of this debate? Probably, yes – rational people are learning some things. But I would suggest there is a major semantic problem we need to overcome if we want to start having more light than smoke.
Namely, telling me you are doing Bayesian statistics tells me very little. The term is so broad that it doesn’t tell me much more than telling me you are doing statistics. So don’t drop the word “Bayesian” into a paper or conversation and expect me to have a big reaction one way or the other. I couldn’t possibly have an informed opinion about what you’re doing until you tell me more. So why do people keep saying “Bayesian” and then stopping and waiting for a reaction, as if they’ve just told me something provocative? Indeed, I am getting to the point where if you don’t know enough to know that saying you are doing Bayesian statistics is not saying much, I’m likely to think you don’t know very much and are in fact just trying to be provocative!
In this post, I would like to argue that there are three main types of Bayesian statistics. So if you want to drop the word Bayesian into a conversation or paper, please at least put one more qualifying word in front of it, selected from the three qualifiers below. It will save us both a lot of time! This is of course not a new idea. The aforementioned Normal Deviate has had several great posts on common myths about Bayesian vs. Frequentist. And just under a year ago I had a post on statistical machismo on this blog, which drew a record (for us) number of comments and where I first put forth my 3 types of Bayesian, because I saw commenters talking past each other. And frequent commenter Steve Walker provided a link to a paper arguing that you are being unclear about what kind of Bayesian you are unless you answer nine yes-or-no questions – implying there are 2^9 = 512 types of Bayesian! These kinds of distinctions might be important for statisticians but are too much for ecologists – I am sticking with just 3 types.
To make my comments clearer, let me recap Bayes Theorem. Bayes Theorem says P(A|B)P(B) = P(B|A)P(A). This is basically a statement of the symmetry of conditional probabilities, adjusted by the raw probabilities of the events being conditioned on. It is more commonly written as:

P(θ|D) = P(D|θ) × P(θ) / P(D)
It’s hard to imagine one little equation (which is an undisputed law of probability) causing so much grief! Let’s dissect it a bit. The P(θ|D) on the left-hand side of the equation gives the probability of the model parameters (i.e. θ – for example the t-test effect size μ or regression slope β) taking on different values, conditioned on the data observed (D). In the Bayesian world P(θ|D) is called the posterior. The P(D|θ) term on the top of the fraction is the likelihood (in exactly the sense you have seen it in other statistical contexts). Note that the likelihood does not sum to 1 over all values of θ – rather it sums to 1 over all values of D – so it is not a true probability distribution of θ. The P(D) in the denominator is a renormalization constant that turns things back into a probability (i.e. something that sums to 1); this is one main difference between this version of Bayes Theorem and plain likelihood, which doesn’t sum to 1 precisely because it lacks the normalization constant P(D). Being a true probability is really nice, but as we’ll see it’s often a lot of extra work to get P(D) incorporated. The other main difference from likelihood is the P(θ). This is the probability of the parameter θ not conditioned on the data – i.e. independent of the data. This is the real stumper – if you’re trying to estimate θ, how could you already know the probability distribution of θ!? Bayesians say that this term represents your prior beliefs about θ. Thus Bayes theorem in this context is a machine for combining observed data D with prior beliefs about θ (P(θ)) and a renormalization constant (P(D)) to give the posterior probability of θ. See the three figures below, which have the same likelihood P(D|θ) and the same P(D). They differ in having progressively flatter (less informative) priors P(θ).
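To make the moving parts concrete, here is a minimal numerical sketch of Bayes theorem on a grid. The data values and priors are invented purely for illustration, and I assume a normal likelihood with known unit variance:

```python
import numpy as np

theta = np.linspace(-5, 5, 1001)          # grid of candidate parameter values
data = np.array([0.8, 1.3, 0.9, 1.6])     # invented observed data D

# Likelihood P(D|theta): normal model with known unit variance
log_lik = -0.5 * ((data[:, None] - theta[None, :]) ** 2).sum(axis=0)
lik = np.exp(log_lik - log_lik.max())

def posterior(prior):
    """Bayes theorem on a grid: posterior = likelihood * prior / P(D)."""
    unnorm = lik * prior
    return unnorm / unnorm.sum()          # dividing by the total is the P(D) step

informative = np.exp(-0.5 * (theta / 0.5) ** 2)   # sharply peaked prior at 0
flat = np.ones_like(theta)                        # uninformative (flat) prior

# With a flat prior the posterior is just the renormalized likelihood
print(np.allclose(posterior(flat), lik / lik.sum()))   # True
# With the informative prior, the posterior peak is pulled toward the prior
print(theta[posterior(informative).argmax()],
      theta[posterior(flat).argmax()])
```

Running `posterior` with progressively flatter priors reproduces the pattern in the three figures: the flatter the prior, the closer the posterior gets to the (renormalized) likelihood.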
This formula and its corresponding labeling of terms is the one thing all Bayesian statistics agrees on. Beyond that, chaos reigns! To try to cut through the chaos, I will now present three main types of Bayesians and argue that each has its own interpretation of the terms and its own reasons for using Bayes theorem. Although there are many more subtle intergradations, I think a nice starting point would be for everybody to say which of the three types of Bayesian they are.
- Subjective Bayesian – This is the notorious version. A Subjective Bayesian rejects the frequentist idea that probability is the fraction of times an event occurs out of the number of times the outcome was observed (e.g. rolling a one on a six-sided die roughly 1 time in 6, or 16.7%, over many rolls). Instead, a Subjective Bayesian says probability is inherently grounded in human belief. The probability of rolling a one is based on my beliefs about the outcome, ultimately residing in my mind. In my post on statistical machismo, a commenter brought up the example of the “probability that there is life on Mars” and asked how one could possibly have a frequentist interpretation of this, while noting we all have some probability estimate of it in our minds (and to prove it, you can bet on it with British bookmakers if you want). Of course a frequentist would say it’s an ill-posed question – the proper question would be something like “what are the odds of life on non-gaseous 4th planets out from G-type stars” – at which point a frequentist view clearly is valid again. From this subjective definition follow many other features of what people typically think of as Bayesian. Subjective Bayesians believe the prior, P(θ), represents my subjective beliefs before collecting the data (or, in a slightly weaker version, the beliefs of experts in the subject using some mix of prior data and their intuition). And this is why Subjective Bayesians make a big point about the posterior leading to things like credible intervals (an interval that one believes is a plausible range of estimates for the parameter) rather than confidence intervals (which are inherently frequentist – the % of intervals constructed this way that would contain the true value under repeated sampling). In my experience, most ecologists are uncomfortable with Subjective Bayesian approaches, and true Subjectivist Bayesians are more likely to be found in philosophy or math departments.
But no small number of ecologists enjoy saying they are Bayesian because of the daring (= non-objective) overtones of Subjective Bayesianism, even if they don’t want to defend the subjective view when pushed.
- Historical Bayesian – In another view of Bayesian approaches, the emphasis is NOT on the definition of probability but on the role of the prior. Indeed, more often than not this approach uses a frequentist definition of probability. In this flavor of Bayes, the key point is that the prior represents, well, prior information. That is, it is a summary of the expected distribution of θ based on all previously collected data. Thus Bayes theorem is a mathematically precise way of smashing previous work against a new dataset to make a new prediction. If previous data have been very consistent (leading to a narrow, strongly peaked prior) and the new data are highly variable (leading to a very flat likelihood for θ), then the prior will dominate the new result, and vice versa (see figures above). If both the prior and the likelihood are normal, relatively simple formulas comparing the variances of the two distributions can be used to calculate the posterior. It is an interesting question why you would favor serial updating of the probability distribution using repeated applications of Bayes theorem for each new data collection rather than performing a comprehensive meta-analysis on all data collected. But this historical Bayesian approach makes no claims about the subjectivity of the prior – indeed it explicitly invokes an empirical, more objective use of the prior based in historical data. Nate Silver, in his recent book on prediction, calls himself Bayesian (in a generic, unqualified sense), but he appears to mean this in the historical Bayesian sense of iteratively improving and correcting. As Wasserman demonstrates, Silver’s definition of probability is about as frequentist and non-subjective as you can get*. In my experience teaching, the idea of a historical prior is what makes the light bulb go on about why you might want to bother with the whole Bayesian machinery.
Wyckoff and Clark 2000 is a nice pedagogical paper showing how, in a world with limited data, prior information can be useful through a historical Bayesian prior.
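As a concrete sketch of the normal-normal case mentioned above (all numbers invented): when both the prior and the likelihood are normal, the posterior is normal with a precision-weighted mean, so a tight prior combined with noisy new data is dominated by the prior, and a nearly flat prior is dominated by the data:

```python
def normal_update(prior_mean, prior_var, data_mean, data_var):
    """Conjugate normal-normal update: combine a normal prior with a
    normal likelihood summary via precision (= 1/variance) weighting."""
    w_prior = 1.0 / prior_var
    w_data = 1.0 / data_var
    post_var = 1.0 / (w_prior + w_data)
    post_mean = post_var * (w_prior * prior_mean + w_data * data_mean)
    return post_mean, post_var

# Narrow (historically consistent) prior vs noisy new data: prior dominates
print(normal_update(2.0, 0.1, 5.0, 10.0))   # ≈ (2.03, 0.099)

# Nearly flat prior vs the same data: data dominates
print(normal_update(2.0, 1e6, 5.0, 10.0))   # ≈ (5.0, 10.0)
```

Applying `normal_update` serially, with each posterior becoming the next study's prior, is exactly the iterative updating the historical Bayesian has in mind.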
- Calculation Bayesian – A lot of ecologists are uncomfortable even with the idea of a historical prior (or, as I do, prefer a meta-analysis that puts all data on an equal footing rather than distinguishing new data vs prior data). So the solution is to use what is known as an uninformative prior (a prior that is literally flat, representing a uniform distribution – or, for practical computational reasons, more commonly a normal distribution with a variance of 1,000,000 or some such that is effectively flat). This basically removes any influence of the prior. If you take the prior out, what do you have left? Well, literally, you have the likelihood renormalized by P(D) to convert it into a true probability distribution. As the 3rd figure shows, a flat prior causes the likelihood and posterior to have identical shapes with only a rescaling of the y-axis (therefore giving identical point estimates of parameters, identical intervals around parameters, etc, even if they have different names – yes, that’s right, the credible interval can equal the confidence interval – it’s just a question of interpretation). For this small difference, including P(D) means you have a really nasty computational problem. P(D) normally requires evaluating complex high-dimensional integrals that are too slow to calculate by traditional deterministic numerical methods (e.g. quadrature). This has led to the development of Markov Chain Monte Carlo (MCMC) methods. There is not space to delve into MCMC here, but what it produces is a constrained random walk through parameter space (possible values of θ) that visits regions of parameter space in proportion to their probability. Thus if you take a long-enough random walk and make a histogram of every point visited, you have a good approximation to the posterior. And when you get right down to it, this is why the third group calls themselves Bayesian.
Once you’ve thrown out the prior, all you have left to distinguish yourself from likelihood is the renormalization constant. And while a renormalized true probability distribution is nice, it is not necessary. We have all sorts of tools for working with likelihood, including MLE (maximum likelihood estimates of parameters), likelihood surfaces (which can be converted into likelihood intervals), likelihood ratio tests, AIC, and so on. What you really pick up by calling yourself a Calculation Bayesian is not so much the renormalization constant but the fancy MCMC calculations created to deal with it. In many respects, once you have a flat prior you are just doing a very hard version of likelihood calculations with a complex but very general and powerful tool.
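To give a flavor of what MCMC actually does, here is a minimal Metropolis sampler (one of the simplest MCMC algorithms), again with invented data, a normal likelihood with known unit variance, and a flat prior – so the random walk simply explores the renormalized likelihood, and P(D) never has to be computed:

```python
import math
import random

random.seed(1)
data = [0.8, 1.3, 0.9, 1.6]   # invented observations

def log_lik(theta):
    # log of P(D|theta) up to a constant; note P(D) is never needed
    return -0.5 * sum((x - theta) ** 2 for x in data)

theta, samples = 0.0, []
for _ in range(50000):
    prop = theta + random.gauss(0, 0.5)    # propose a nearby parameter value
    # Metropolis rule: always accept uphill moves, sometimes accept downhill
    if math.log(random.random()) < log_lik(prop) - log_lik(theta):
        theta = prop
    samples.append(theta)

# Discard burn-in; the histogram of visited values approximates the posterior.
# With a flat prior its mean should sit near the sample mean of the data (1.15).
print(sum(samples[5000:]) / len(samples[5000:]))
```

This is, of course, a toy: real problems have many parameters and need convergence diagnostics, but the walk-visits-in-proportion-to-probability idea is the same.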
Note that these three types are effectively nested. Historical Bayesians usually incorporate the Calculation Bayesian methods. And Subjective Bayesians usually incorporate both. But it doesn’t run the other direction.
I also wanted to make a point about two qualifiers that SHOULD NOT get dragged into debates about Bayesian methods.
First, saying you are using “hierarchical” methods says nothing about whether you are Bayesian or not. Hierarchical is a type of model specification that allows for multiple sources of variability. In particular, it often involves treating a population not as represented by a single mean value but as having a distribution of values. This distribution of values is described by what are known as hyperparameters. Hierarchical modelling is an exciting innovation: it is clear ecology has LOTS of sources of variability, and being able to model them more carefully is important. However, hierarchical models can be estimated and solved either by Bayesian methods or by variations of maximum likelihood (REML, penalized likelihood and EM are names you may have heard in these contexts). Some simple hierarchical models are solvable by traditional linear mixed model methods (e.g. nlme in R), more complex ones are often solvable by Ben Bolker’s bbmle package in R, and if those fail you can use MCMC, simulated annealing and related algorithms, or even Subhash Lele’s data cloning method (which, to put it crudely and simplistically, is bootstrapping meets MCMC) to solve for the likelihood. A nice pedagogical paper, Farrell and Ludwig 2008, which unfortunately for us draws its examples from psychology, gives a clear example of a hierarchical model and shows that although non-hierarchical models perform adequately, hierarchical models are clearly better – but there is no difference between the Bayesian and likelihood versions. It is true that MCMC is the most flexible, and sometimes the only, solution technique that works on really complex models. But if that is the reason you are adopting MCMC and you are using a flat (uninformative) prior, you are basically doing MCMC likelihood with a renormalization constant.
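For intuition only, here is a toy sketch (all numbers invented) of the two sources of variability in a hierarchical model, and of the partial pooling that hierarchical estimation performs – whether fit by REML or by Bayesian machinery. To keep it short I treat the variances as known, which real methods would estimate:

```python
import random

random.seed(2)

# Hyperparameters: population mean and sd of the site-level true means
mu, tau = 10.0, 2.0
sigma, n = 3.0, 5                 # within-site sd and sample size per site

# Level 1: each site's true mean is drawn from the population distribution
true_means = [random.gauss(mu, tau) for _ in range(8)]
# Level 2: observations within each site add a second source of variability
obs = [[random.gauss(m, sigma) for _ in range(n)] for m in true_means]

site_avgs = [sum(o) / n for o in obs]
grand = sum(site_avgs) / len(site_avgs)

# Partial pooling: shrink each noisy site average toward the grand mean by a
# factor set by the ratio of within-site to between-site variance
B = (sigma**2 / n) / (sigma**2 / n + tau**2)
shrunk = [B * grand + (1 - B) * a for a in site_avgs]
print(B)          # fraction of shrinkage toward the grand mean
```

Every shrunken estimate lands between its raw site average and the grand mean – that compromise between "every site is separate" and "all sites are one pool" is the payoff of the hierarchical specification, however you choose to fit it.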
Second, p-values and significance tests are basically only done in the frequentist world, but it is important to note that being frequentist does not imply that you are doing p-value or significance testing. It is very common to conflate frequentist with p-values. As you might have gathered from my posts on prediction, I am just one example of somebody who is frequentist without being particularly enamored of p-values and significance testing. It is also worth noting that Bayesians have recreated things like p-values. Specifically, Bayes factors reduce to likelihood ratios with flat priors – and, yes, if you are recalling that likelihood ratios are an easy path to p-values, you are correct. To be fair, most Bayesians are not enamored with Bayes factors, but it proves the point that comparing something to a null model is an inferential mode that is independent of Bayesian vs non-Bayesian. And to be fair, a lot of frequentists are not enamored with p-values either.
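A tiny sketch of that path from likelihood ratio to p-value, using the same kind of invented data and a normal likelihood with known unit variance (this is the quantity a flat-prior Bayes factor reduces to):

```python
import math

data = [0.8, 1.3, 0.9, 1.6]   # invented observations
n = len(data)

def log_lik(theta):
    # normal log-likelihood with known unit variance, up to a constant
    return -0.5 * sum((x - theta) ** 2 for x in data)

mle = sum(data) / n                            # maximum likelihood estimate
# Likelihood ratio statistic comparing the MLE against the null theta = 0
lr_stat = 2 * (log_lik(mle) - log_lik(0.0))

# For one parameter, lr_stat is asymptotically chi-square with 1 df,
# so the two-sided p-value follows from the normal tail of sqrt(lr_stat)
p = math.erfc(math.sqrt(lr_stat) / math.sqrt(2))
print(round(lr_stat, 2), round(p, 4))
```

The same ratio of likelihoods, stripped of the null-testing interpretation, is what a Bayes factor with flat priors reports – which is the sense in which the inferential move is shared across camps.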
So conflating frequentist with p-values is just as bad as conflating Bayesian with hierarchical. They’re both wrong. That, in the end, is my main point. We need to separate out modes of inference (p-value, parameter distribution, model selection, etc) and model structure (hierarchical vs flat) from frequentist vs Bayesian perspectives, and among the Bayesian perspectives we need to separate out different degrees of commitment to Bayesian philosophy (my 3 categories). You really can mix and match among these factors at will. Thus you can do point estimates of parameters, estimates of parameter distributions, model selection (AIC and BIC), model support vs null models and more in either Bayesian or frequentist modalities (and the answers are almost always the same if you have flat priors). Until a majority of people are making and understanding these distinctions, the debate will continue to generate more smoke than light.
So if I were the language police, I would fine anybody who says they are Bayesian or “using Bayesian statistics” without adding a qualifier to say whether they are a Subjective Bayesian, a Historical Bayesian or a Calculation Bayesian. At that point, two simple words would communicate quite a lot about what they are doing and how they are approaching things, while the one word Bayesian alone communicates almost nothing. I would levy even higher fines on people conflating hierarchical models with Bayesian philosophy, or p-values with frequentist philosophy.
Personally, I would count myself as a likelihood-oriented statistician who favors hierarchical approaches in some contexts but uses non-Bayesian computational methods whenever possible, falling back on Calculation Bayesian methods only when no other computational methods work. But I have no problem with people doing Bayesian analysis as long as they are clear about what they are doing and why they chose that path. I do have a problem with people who say “Bayesian” over and over like it’s a mystical word with secret powers, without being clear why they are Bayesian and which part of Bayesian they are using. I also get irked when people tell me I have to buy the whole Bayesian package (up to subjectivist definitions of probability) and then just use flat (= no) priors, effectively using MCMC to calculate likelihoods. What about you? Are you a Subjective, Historical or Calculation Bayesian? Or none of the above?
*Interestingly, I am not aware that Silver even uses Bayes theorem in this sense (as best I understand his models, he reruns them each time with all available data and a down-weighting of older data that has more to do with temporal autocorrelation than Bayes theorem) – he uses historical Bayesianism in a more metaphorical sense of iterative improvement without actual use of Bayes Theorem. But this is a semantic nitpick. It is a great book and you should read it, especially if you liked my posts on prediction.