Bayesian Inference: Priors, Posteriors & Conjugacy
Bayesian inference treats parameters as random variables and updates beliefs using Bayes' theorem; conjugate priors make this analytically tractable for many model families. Where frequentist statistics asks "what is the probability of this data given the parameter?", Bayesian inference inverts the question: "what is the probability of the parameter given this data?"
Concepts
Bayesian updating: prior Beta(α₀,β₀) updates to posterior Beta(α₀+k, β₀+n-k) after k successes in n trials.
Beta(1,1) = Uniform prior. As n→∞, posterior concentrates: prior is washed out by likelihood.
Before seeing any data, you believe a coin is approximately fair — but not certain. After 7 heads in 10 flips, your belief shifts toward the observed frequency of 0.7, tempered by the small sample size. Bayesian inference is the precise machinery for this update: the posterior combines the prior (pre-data belief) and the likelihood (what the data say), yielding a full probability distribution over the parameter — not a single estimate, but a complete description of remaining uncertainty.
Bayes' Theorem and the Posterior
Given data $D$ and a prior $p(\theta)$ encoding beliefs before observing data:

$$p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)}$$

Posterior $\propto$ likelihood $\times$ prior — the proportionality is exact, not an approximation. The evidence $p(D)$ is a constant with respect to $\theta$: the same for every parameter value, serving only to normalize the posterior to integrate to 1. All information about which $\theta$ values are plausible lives in the numerator; computational methods (MCMC, variational inference) routinely skip computing $p(D)$ entirely and work directly with the unnormalized product $p(D \mid \theta)\, p(\theta)$.
The four quantities:
- Prior $p(\theta)$: beliefs about $\theta$ before data
- Likelihood $p(D \mid \theta)$: data-generating model
- Evidence $p(D) = \int p(D \mid \theta)\, p(\theta)\, d\theta$: marginal likelihood, normalizing constant
- Posterior $p(\theta \mid D)$: updated beliefs

The posterior is proportional to likelihood times prior: $p(\theta \mid D) \propto p(D \mid \theta)\, p(\theta)$.
Point estimates from the posterior:
- MAP (maximum a posteriori): $\hat{\theta}_{\text{MAP}} = \arg\max_\theta p(\theta \mid D)$
- Posterior mean: $\mathbb{E}[\theta \mid D]$ — minimizes posterior expected squared loss
- Posterior median: minimizes posterior expected absolute loss

MAP = MLE when the prior is uniform. MAP = MLE + regularization when the prior is non-uniform: $\hat{\theta}_{\text{MAP}} = \arg\max_\theta \left[ \log p(D \mid \theta) + \log p(\theta) \right]$.
Conjugate Priors
A prior is conjugate to a likelihood if the posterior is in the same distributional family as the prior. This yields closed-form posteriors.
| Model | Prior | Posterior | Posterior parameters |
|---|---|---|---|
| Bernoulli($\theta$) | Beta($\alpha, \beta$) | Beta($\alpha + k,\ \beta + n - k$) | $k$ successes in $n$ trials |
| Binomial($n, \theta$) | Beta($\alpha, \beta$) | Beta($\alpha + k,\ \beta + n - k$) | same |
| Poisson($\lambda$) | Gamma($\alpha, \beta$) | Gamma($\alpha + \sum_i x_i,\ \beta + n$) | rate + count |
| Normal($\mu$, $\sigma^2$ known) | Normal($\mu_0, \tau_0^2$) | Normal($\mu_n, \tau_n^2$) | precision-weighted average |
| Normal($\mu, \sigma^2$) | Normal-Inverse-Gamma | Normal-Inverse-Gamma | updated hyperparameters |
| Multinomial($\theta$) | Dirichlet($\alpha_1, \dots, \alpha_K$) | Dirichlet($\alpha_1 + c_1, \dots, \alpha_K + c_K$) | $c_k$ = counts per class |
Gaussian-Gaussian conjugacy (known variance $\sigma^2$): prior $\mu \sim \mathcal{N}(\mu_0, \tau_0^2)$, observations $x_1, \dots, x_n \sim \mathcal{N}(\mu, \sigma^2)$. Posterior:

$$\mu \mid x_{1:n} \sim \mathcal{N}(\mu_n, \tau_n^2), \qquad \frac{1}{\tau_n^2} = \frac{1}{\tau_0^2} + \frac{n}{\sigma^2}, \qquad \mu_n = \tau_n^2 \left( \frac{\mu_0}{\tau_0^2} + \frac{n \bar{x}}{\sigma^2} \right)$$

The posterior precision $1/\tau_n^2$ = prior precision $1/\tau_0^2$ + data precision $n/\sigma^2$. As $n \to \infty$, the posterior mean converges to $\bar{x}$ (the MLE) regardless of the prior.
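The precision-weighted update above can be checked numerically; a minimal stdlib-only sketch (the prior and data values below are illustrative, not from the text):

```python
def gaussian_posterior(mu0, tau0_sq, sigma_sq, xs):
    """Posterior over the mean mu, given prior N(mu0, tau0_sq) and
    observations xs drawn from N(mu, sigma_sq) with sigma_sq known."""
    n = len(xs)
    xbar = sum(xs) / n
    post_prec = 1.0 / tau0_sq + n / sigma_sq          # precisions add
    post_var = 1.0 / post_prec
    post_mean = post_var * (mu0 / tau0_sq + n * xbar / sigma_sq)
    return post_mean, post_var

# Prior N(0, 1); 100 observations with sample mean 2.0 from N(mu, 1)
mean, var = gaussian_posterior(0.0, 1.0, 1.0, [2.0] * 100)
print(mean, var)  # mean 200/101 ~ 1.98, pulled slightly toward the prior
```

With 100 observations the data precision (100) dwarfs the prior precision (1), so the posterior mean sits at $200/101 \approx 1.98$, almost at the MLE of 2.0.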
Posterior Predictive Distribution
The posterior predictive distribution for a new observation $x_{\text{new}}$ marginalizes out the unknown $\theta$:

$$p(x_{\text{new}} \mid D) = \int p(x_{\text{new}} \mid \theta)\, p(\theta \mid D)\, d\theta$$
This automatically accounts for parameter uncertainty — it is wider than predicting at the MLE alone. For a Binomial likelihood with a Beta posterior, this integral gives the Beta-Binomial distribution, which is overdispersed compared to a Binomial.
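The overdispersion can be computed exactly: the Beta-Binomial pmf is a ratio of Beta functions, available through `math.lgamma`. A stdlib-only sketch using the Beta(8, 4) posterior from the coin example (the helper names are mine):

```python
import math

def log_beta(a, b):
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def beta_binomial_pmf(y, m, a, b):
    """Posterior predictive probability of y successes in m future trials,
    with a Beta(a, b) posterior over theta: integral of Binomial x Beta."""
    log_choose = math.lgamma(m + 1) - math.lgamma(y + 1) - math.lgamma(m - y + 1)
    return math.exp(log_choose + log_beta(a + y, b + m - y) - log_beta(a, b))

m, a, b = 10, 8, 4                      # Beta(8, 4): 7 heads in 10 flips, Beta(1,1) prior
p = a / (a + b)                         # posterior mean of theta
mean = sum(y * beta_binomial_pmf(y, m, a, b) for y in range(m + 1))
var = sum((y - mean) ** 2 * beta_binomial_pmf(y, m, a, b) for y in range(m + 1))
print(mean, var, m * p * (1 - p))       # predictive variance exceeds Binomial m*p*(1-p)
```

The predictive mean is $m \cdot a/(a+b)$, but the variance is strictly larger than the Binomial variance at the posterior mean — the extra spread is the remaining uncertainty about $\theta$.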
Credible Intervals vs Confidence Intervals
A Bayesian credible interval $C$ with credibility $1 - \alpha$ satisfies $P(\theta \in C \mid D) = 1 - \alpha$ — a direct probability statement about $\theta$ given data.
A frequentist confidence interval at level $1 - \alpha$ means: if we repeat the experiment many times and compute a CI each time, a fraction $1 - \alpha$ of the CIs will contain the true $\theta$. A specific realized CI either contains $\theta$ or it does not — there is no probability statement for the specific interval.
The Bayesian interpretation is more natural but requires a prior. In the limit of a flat (diffuse) prior, credible and confidence intervals often coincide numerically.
Bayesian Model Comparison
To compare models $M_1$ vs $M_2$, compute the Bayes factor:

$$BF_{12} = \frac{p(D \mid M_1)}{p(D \mid M_2)}, \qquad p(D \mid M_i) = \int p(D \mid \theta_i, M_i)\, p(\theta_i \mid M_i)\, d\theta_i$$
The Bayes factor automatically penalizes model complexity (Occam's razor): complex models spread prior mass over a larger parameter space, which reduces the marginal likelihood if the data do not need that complexity. No separate penalty term is needed — it is implicit in the integration.
The posterior model odds = prior odds × Bayes factor: $\dfrac{p(M_1 \mid D)}{p(M_2 \mid D)} = \dfrac{p(M_1)}{p(M_2)} \times BF_{12}$.
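For the coin data (7 heads in 10 flips), both marginal likelihoods are available in closed form, so the Bayes factor for a fair-coin model against a free-$\theta$ model with uniform prior can be computed exactly; a stdlib-only sketch:

```python
import math

def log_choose(n, k):
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def log_marglik_uniform(k, n):
    """log p(D | M_free): Binomial likelihood, theta ~ Uniform = Beta(1,1).
    The integral gives C(n,k) * B(1+k, 1+n-k)."""
    log_B = math.lgamma(1 + k) + math.lgamma(1 + n - k) - math.lgamma(2 + n)
    return log_choose(n, k) + log_B

def log_marglik_fair(k, n):
    """log p(D | M_fair): point mass at theta = 0.5, no integration needed."""
    return log_choose(n, k) + n * math.log(0.5)

k, n = 7, 10
bf = math.exp(log_marglik_fair(k, n) - log_marglik_uniform(k, n))
print(bf)  # Bayes factor for fair coin vs free theta
```

The result is slightly above 1: even though the MLE is 0.7, the data weakly prefer the simpler fair-coin model, because the free-$\theta$ model spreads its prior mass over values the data rule out — Occam's razor via integration, as described above.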
Bayesian Information Criterion (BIC): approximates the log marginal likelihood via a Laplace approximation: $\log p(D \mid M) \approx \log p(D \mid \hat{\theta}) - \frac{k}{2} \log n$ for $k$ parameters and $n$ observations (this is $-\tfrac{1}{2}\text{BIC}$). The $\frac{k}{2} \log n$ penalty approximates the prior-complexity penalty in the Bayes factor.
Variational Inference
For complex posteriors where exact computation is intractable, variational inference approximates $p(\theta \mid D)$ by minimizing the KL divergence over a tractable family $\mathcal{Q}$:

$$q^* = \arg\min_{q \in \mathcal{Q}} \mathrm{KL}\big(q(\theta)\,\big\|\,p(\theta \mid D)\big)$$

Equivalently, maximize the evidence lower bound (ELBO):

$$\mathrm{ELBO}(q) = \mathbb{E}_q[\log p(D, \theta)] - \mathbb{E}_q[\log q(\theta)]$$

Since $\log p(D) = \mathrm{ELBO}(q) + \mathrm{KL}\big(q \,\|\, p(\cdot \mid D)\big)$ and the KL term is nonnegative, the ELBO lower-bounds $\log p(D)$; maximizing the ELBO = minimizing the KL.
Mean-field approximation: $q(\theta) = \prod_j q_j(\theta_j)$. Under this factorization, the optimal $q_j$ satisfies:

$$\log q_j^*(\theta_j) = \mathbb{E}_{q_{-j}}[\log p(D, \theta)] + \text{const}$$
This yields coordinate ascent updates (CAVI) that iterate until convergence.
MCMC (Markov Chain Monte Carlo): the alternative to variational inference — generate samples from $p(\theta \mid D)$ without a parametric approximation. Metropolis-Hastings proposes $\theta' \sim q(\theta' \mid \theta)$ and accepts with probability $\min\!\left(1, \dfrac{\tilde{p}(\theta')\, q(\theta \mid \theta')}{\tilde{p}(\theta)\, q(\theta' \mid \theta)}\right)$, where $\tilde{p}$ is the unnormalized posterior. Gibbs sampling cycles through the full conditionals $p(\theta_j \mid \theta_{-j}, D)$.
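A minimal random-walk Metropolis sampler for the unnormalized coin-flip posterior, stdlib-only (the proposal scale, step count, and seed are arbitrary choices; the symmetric Gaussian proposal makes the proposal ratio cancel in the acceptance probability):

```python
import math, random

def metropolis(log_unnorm, x0, steps, scale=0.1, seed=0):
    """Random-walk Metropolis targeting exp(log_unnorm); the normalizing
    constant is never needed because it cancels in the acceptance ratio."""
    rng = random.Random(seed)
    x, lp, samples = x0, log_unnorm(x0), []
    for _ in range(steps):
        prop = x + rng.gauss(0.0, scale)        # symmetric proposal
        lp_prop = log_unnorm(prop)
        if math.log(rng.random()) < lp_prop - lp:   # accept w.p. min(1, ratio)
            x, lp = prop, lp_prop
        samples.append(x)
    return samples

def log_post(theta):
    """Unnormalized log posterior for 7 heads in 10 flips, Beta(1,1) prior:
    theta^7 * (1 - theta)^3 on (0, 1)."""
    if not 0.0 < theta < 1.0:
        return -math.inf
    return 7 * math.log(theta) + 3 * math.log(1 - theta)

samples = metropolis(log_post, 0.5, 50000)
est = sum(samples[10000:]) / len(samples[10000:])   # discard burn-in
print(est)   # should approach the exact posterior mean 8/12 ~ 0.667
```

Because the target is a known Beta(8, 4), the sample mean can be checked against the exact answer — a useful sanity test before pointing the same sampler at a posterior with no closed form.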
Worked Example
Example 1: Beta-Binomial Coin Flip
Prior: Beta(1, 1) (uniform). Observe $k = 7$ heads in $n = 10$ flips.
Posterior: Beta$(1 + 7,\ 1 + 3) = $ Beta(8, 4).
Posterior mean: $8/12 \approx 0.667$. MAP: $(8 - 1)/(8 + 4 - 2) = 0.7$ (coincides with the MLE).
Posterior predictive: $P(\text{heads next} \mid D) = 8/12 \approx 0.667$ (the posterior mean, not 0.7).
95% credible interval for $\theta$: approximately $[0.39, 0.89]$ from the Beta(8, 4) distribution. This says: given the prior and data, we believe $\theta \in [0.39, 0.89]$ with 95% probability.
Compare with the frequentist 95% CI: $0.7 \pm 1.96\sqrt{0.7 \cdot 0.3 / 10} \approx [0.42, 0.98]$. The Bayesian interval is pulled toward the prior mean of 0.5 and is slightly narrower.
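These quantities can be reproduced with the standard library alone, using the identity between the integer-parameter Beta CDF and a Binomial tail probability (the bisection-based `beta_quantile` helper is mine, not a library function):

```python
import math

def beta_cdf_int(x, a, b):
    """CDF of Beta(a, b) for positive integers a, b, via the identity
    I_x(a, b) = P(Binomial(a + b - 1, x) >= a)."""
    n = a + b - 1
    return sum(math.comb(n, j) * x**j * (1 - x) ** (n - j) for j in range(a, n + 1))

def beta_quantile(q, a, b, tol=1e-10):
    """Invert the monotone CDF by bisection."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if beta_cdf_int(mid, a, b) < q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

a, b = 8, 4                                  # posterior after 7 heads in 10 flips
print(a / (a + b))                           # posterior mean ~ 0.667
print((a - 1) / (a + b - 2))                 # MAP = 0.7, equals the MLE here
ci = (beta_quantile(0.025, a, b), beta_quantile(0.975, a, b))
print(ci)                                    # equal-tailed 95% interval ~ [0.39, 0.89]
```

With SciPy available, `scipy.stats.beta(8, 4).interval(0.95)` gives the same interval in one call; the point of the manual version is that nothing beyond the Beta CDF is needed.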
Example 2: MAP as Regularized MLE
Consider linear regression with $y = X\beta + \epsilon$, $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$.
Prior: $\beta \sim \mathcal{N}(0, \tau^2 I)$.
Log posterior:

$$\log p(\beta \mid y) = -\frac{1}{2\sigma^2} \|y - X\beta\|^2 - \frac{1}{2\tau^2} \|\beta\|^2 + \text{const}$$

MAP: $\hat{\beta}_{\text{MAP}} = \arg\min_\beta \|y - X\beta\|^2 + \lambda \|\beta\|^2$ with $\lambda = \sigma^2 / \tau^2$ — exactly ridge regression. A Gaussian prior corresponds to $L_2$ regularization; a Laplace prior corresponds to $L_1$ / Lasso.
This is the key insight: regularization = prior. The choice of regularizer encodes the prior belief about the parameter distribution.
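A scalar sketch of this correspondence, with made-up data and no intercept term (`ridge_1d` is a hypothetical helper, not a library function):

```python
# MAP for 1-D linear regression with a Gaussian prior on the single weight:
# minimizing ||y - x*beta||^2 + lam*beta^2 gives the scalar ridge solution
# beta = sum(x_i*y_i) / (sum(x_i^2) + lam), the 1-D case of (X'X + lam*I)^-1 X'y.
def ridge_1d(xs, ys, lam):
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]                # roughly y = 2x, illustrative data
print(ridge_1d(xs, ys, 0.0))             # lam = 0: ordinary least squares / MLE, ~ 2.0
print(ridge_1d(xs, ys, 10.0))            # larger lam = tighter prior: beta shrinks toward 0
```

Increasing $\lambda$ corresponds to shrinking the prior variance $\tau^2$: the stronger the belief that $\beta$ is near zero, the more the estimate is pulled away from the MLE.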
Example 3: Variational Inference for a Gaussian Mixture
For a Gaussian mixture with $K$ components and means $\mu_1, \dots, \mu_K$, the true posterior couples all assignments $z_{1:n}$ and means — intractable.
Mean-field: $q(z_{1:n}, \mu_{1:K}) = \prod_n q(z_n) \prod_k q(\mu_k)$. CAVI updates:

$$q(z_n = k) \propto \exp\big(\mathbb{E}_{q(\mu_k)}[\log p(x_n \mid \mu_k)]\big), \qquad \log q(\mu_k) = \log p(\mu_k) + \sum_n r_{nk} \log p(x_n \mid \mu_k) + \text{const}$$

where $r_{nk} = q(z_n = k)$ are the responsibilities.
Each coordinate update is closed-form; the algorithm alternates between these, converging to a local ELBO maximum. This is the foundation of variational autoencoders (VAEs): $q_\phi(z \mid x)$ is the encoder, $p_\theta(x \mid z)$ is the decoder, and the objective is the ELBO.
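A compact CAVI loop for a two-component, equal-weight, unit-variance 1-D mixture (all modeling choices and data here are illustrative assumptions, not from the text):

```python
import math, random

def cavi_gmm(xs, prior_var=100.0, iters=50, seed=0):
    """Mean-field CAVI for a 2-component, equal-weight, unit-variance
    1-D Gaussian mixture with prior mu_k ~ N(0, prior_var).
    q(mu_k) = N(m_k, v_k); q(z_n) = responsibilities r_nk."""
    rng = random.Random(seed)
    m = [min(xs) + rng.random(), max(xs) - rng.random()]   # break symmetry
    v = [1.0, 1.0]
    for _ in range(iters):
        # Update q(z): E_q[log N(x | mu_k, 1)] needs E[mu_k] = m_k
        # and E[mu_k^2] = m_k^2 + v_k; k-constant terms drop out.
        r = []
        for x in xs:
            logits = [x * m[k] - 0.5 * (m[k] ** 2 + v[k]) for k in range(2)]
            mx = max(logits)
            w = [math.exp(l - mx) for l in logits]
            s = sum(w)
            r.append([wk / s for wk in w])
        # Update q(mu_k): Gaussian with precision 1/prior_var + sum_n r_nk
        for k in range(2):
            prec = 1.0 / prior_var + sum(rn[k] for rn in r)
            v[k] = 1.0 / prec
            m[k] = sum(rn[k] * x for rn, x in zip(r, xs)) / prec
    return sorted(m)

# Two well-separated clusters around -3 and +3
data = [-3.2, -2.9, -3.1, -2.8, 3.0, 3.3, 2.7, 3.1]
print(cavi_gmm(data))   # variational means land near the cluster centers
```

Each iteration performs exactly the two closed-form coordinate updates from the text: soft assignments given the current beliefs about the means, then Gaussian updates for the means given the soft assignments.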
Connections
Conjugate priors are the natural statistics of exponential families. For any exponential family likelihood $p(x \mid \theta) = h(x) \exp\big(\theta^\top T(x) - A(\theta)\big)$ with sufficient statistic $T(x)$, the conjugate prior has the form $p(\theta \mid \tau, \nu) \propto \exp\big(\theta^\top \tau - \nu A(\theta)\big)$ — it is also an exponential family. The posterior just increments the hyperparameters: $\tau \to \tau + \sum_i T(x_i)$, $\nu \to \nu + n$, so "prior data" of $\nu$ pseudo-observations with average sufficient statistic $\tau / \nu$ is updated to $\nu + n$ observations. The hyperparameters are literally interpretable as pseudo-counts from imaginary prior observations.
The ELBO decomposition reveals the VI objective. $\mathrm{ELBO}(q) = \mathbb{E}_q[\log p(D \mid \theta)] - \mathrm{KL}\big(q(\theta) \,\|\, p(\theta)\big)$: the first term is the expected log-likelihood (fit to data), the second term penalizes the approximate posterior for deviating from the prior (regularization). In VAEs, this is exactly reconstruction loss minus KL divergence — learning to encode data well while keeping the latent space close to a Gaussian prior.
Where Your Intuition Breaks
"With enough data, the prior washes out" — this is true for well-specified, finite-dimensional models, where the posterior concentrates around the true parameter as $n \to \infty$. The dangerous extension is applying it to high-dimensional models. For Bayesian neural networks with millions of parameters, the likelihood only identifies a low-dimensional manifold of parameter space; the rest of the prior is barely updated even with enormous datasets. Similarly, improper priors — like a uniform prior over all of $\mathbb{R}$ — can yield improper posteriors whose marginal likelihood does not exist, making Bayesian model comparison via Bayes factors undefined. The prior is not a nuisance to be washed away but a structural assumption that must be chosen with the same care as the likelihood. Hierarchical priors, where hyperparameters are themselves estimated from data (empirical Bayes), provide a middle ground.