Neural-Path/Notes

Bayesian Inference: Priors, Posteriors & Conjugacy

Bayesian inference treats parameters as random variables and updates beliefs using Bayes' theorem; conjugate priors make this analytically tractable for many model families. Where frequentist statistics asks "what is the probability of this data given the parameter?", Bayesian inference inverts the question: "what is the probability of the parameter given this data?"

Concepts

Bayesian updating: prior Beta(α₀,β₀) updates to posterior Beta(α₀+k, β₀+n-k) after k successes in n trials.

[Interactive figure: Beta prior and posterior densities over θ ∈ [0, 1], starting from a Beta(2, 2) prior, with a data slider (k successes / n trials) and readouts for the prior mean, posterior mean, and posterior MAP.]

Beta(1,1) = Uniform prior. As n→∞, posterior concentrates: prior is washed out by likelihood.

Before seeing any data, you believe a coin is approximately fair — but not certain. After 7 heads in 10 flips, your belief shifts toward $p = 0.7$, tempered by the small sample size. Bayesian inference is the precise machinery for this update: the posterior combines the prior (pre-data belief) and the likelihood (what the data say), yielding a full probability distribution over the parameter — not a single estimate, but a complete description of remaining uncertainty.
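This coin example can be sketched directly as a conjugate update; the Beta(2, 2) prior here is an assumed stand-in for "roughly fair, but not certain":

```python
def beta_binomial_update(alpha0, beta0, k, n):
    """Conjugate update: Beta(alpha0, beta0) prior plus k successes in n trials."""
    return alpha0 + k, beta0 + (n - k)

# "Roughly fair, but not certain" prior; observe 7 heads in 10 flips.
a, b = beta_binomial_update(2, 2, 7, 10)   # Beta(9, 5)
prior_mean = 2 / (2 + 2)                   # 0.5
posterior_mean = a / (a + b)               # 9/14 ≈ 0.643
print(a, b, posterior_mean)
```

The posterior mean lands between the prior mean (0.5) and the MLE (0.7), tempered by the small sample.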

Bayes' Theorem and the Posterior

Given data $X = x$ and a prior $\pi(\theta)$ encoding beliefs before observing data:

$$\pi(\theta \mid x) = \frac{p(x \mid \theta)\,\pi(\theta)}{p(x)}, \qquad p(x) = \int p(x \mid \theta)\,\pi(\theta)\,d\theta.$$

Posterior $\propto$ likelihood $\times$ prior — the proportionality is exact, not an approximation. The evidence $p(x)$ is a constant with respect to $\theta$: the same for every parameter value, serving only to normalize the posterior to integrate to 1. All information about which $\theta$ values are plausible lives in the numerator; computational methods (MCMC, variational inference) routinely skip computing $p(x)$ entirely and work directly with the unnormalized product $p(x \mid \theta)\,\pi(\theta)$.

The four quantities:

  • Prior $\pi(\theta)$: beliefs about $\theta$ before data
  • Likelihood $p(x \mid \theta)$: data-generating model
  • Evidence $p(x)$: marginal likelihood, normalizing constant
  • Posterior $\pi(\theta \mid x)$: updated beliefs

The posterior is proportional to likelihood times prior: $\pi(\theta \mid x) \propto p(x \mid \theta)\,\pi(\theta)$.

Point estimates from the posterior:

  • MAP (maximum a posteriori): $\hat\theta_{\text{MAP}} = \arg\max_\theta \pi(\theta \mid x)$
  • Posterior mean: $\hat\theta_{\text{Bayes}} = \mathbb{E}[\theta \mid x]$ — minimizes posterior expected squared loss
  • Posterior median: minimizes posterior expected absolute loss

MAP = MLE when the prior is uniform. MAP = MLE + regularization when the prior is non-uniform: $\log \pi(\theta \mid x) = \log p(x \mid \theta) + \log \pi(\theta) - \text{const}$.
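To see "MAP = MLE + regularization" concretely, one can maximize the log-posterior on a grid and compare with the conjugate closed form. A sketch, assuming the coin data (7 heads in 10 flips) with a Beta(2, 2) prior:

```python
from math import log

K, N, ALPHA, BETA = 7, 10, 2.0, 2.0   # data, plus an assumed Beta(2,2) prior

def log_posterior(theta):
    # log-likelihood + log-prior, dropping theta-independent constants
    return (K + ALPHA - 1) * log(theta) + (N - K + BETA - 1) * log(1 - theta)

grid = [i / 10000 for i in range(1, 10000)]
theta_map = max(grid, key=log_posterior)

# Conjugate closed form: mode of Beta(alpha+k, beta+n-k) is (alpha+k-1)/(alpha+beta+n-2).
theta_map_exact = (ALPHA + K - 1) / (ALPHA + BETA + N - 2)   # 8/12
print(theta_map, theta_map_exact)
```

The grid maximizer agrees with the closed-form mode to the grid resolution; with a uniform prior (α = β = 1) the same code returns the MLE 0.7.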

Conjugate Priors

A prior $\pi(\theta)$ is conjugate to a likelihood $p(x \mid \theta)$ if the posterior $\pi(\theta \mid x)$ is in the same distributional family as the prior. This yields closed-form posteriors.

| Model | Prior | Posterior | Posterior parameters |
|---|---|---|---|
| Bernoulli($p$) | Beta($\alpha, \beta$) | Beta($\alpha + k,\ \beta + n - k$) | $k$ successes in $n$ trials |
| Binomial($n, p$) | Beta($\alpha, \beta$) | Beta($\alpha + k,\ \beta + n - k$) | same |
| Poisson($\lambda$) | Gamma($a, b$) | Gamma($a + \sum x_i,\ b + n$) | rate + count |
| Normal($\mu$; $\sigma^2$ known) | Normal($\mu_0, \tau^2$) | Normal($\mu_n, \tau_n^2$) | precision-weighted average |
| Normal($\mu, \sigma^2$) | Normal-Inverse-Gamma | Normal-Inverse-Gamma | updated hyperparams |
| Multinomial($\mathbf{p}$) | Dirichlet($\boldsymbol\alpha$) | Dirichlet($\boldsymbol\alpha + \mathbf{n}$) | $n_k$ = counts per class |

Gaussian-Gaussian conjugacy (known variance $\sigma^2$): prior $\mu \sim \mathcal{N}(\mu_0, \tau^2)$. Posterior:

$$\mu \mid x_1,\ldots,x_n \sim \mathcal{N}\!\left(\mu_n, \tau_n^2\right), \qquad \tau_n^2 = \frac{1}{\frac{1}{\tau^2} + \frac{n}{\sigma^2}}, \qquad \mu_n = \tau_n^2\!\left(\frac{\mu_0}{\tau^2} + \frac{n\bar x}{\sigma^2}\right).$$

The posterior precision $1/\tau_n^2$ is the prior precision $1/\tau^2$ plus the data precision $n/\sigma^2$. As $n \to \infty$, the posterior mean converges to $\bar x$ (the MLE) regardless of the prior.
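The precision-weighted update translates line for line into code; a small sketch with made-up data:

```python
def gaussian_posterior(mu0, tau2, sigma2, xs):
    """Posterior over the mean, given a N(mu0, tau2) prior and known variance sigma2."""
    n, xbar = len(xs), sum(xs) / len(xs)
    post_precision = 1 / tau2 + n / sigma2            # precisions add
    tau_n2 = 1 / post_precision
    mu_n = tau_n2 * (mu0 / tau2 + n * xbar / sigma2)  # precision-weighted average
    return mu_n, tau_n2

xs = [4.8, 5.1, 5.3, 4.9, 5.4]                           # sample mean 5.1
mu_tight, _ = gaussian_posterior(0.0, 1.0, 1.0, xs)      # informative prior at 0
mu_diffuse, _ = gaussian_posterior(0.0, 100.0, 1.0, xs)  # nearly flat prior
print(mu_tight, mu_diffuse)
```

With the diffuse prior the posterior mean is essentially $\bar x = 5.1$; the tight prior at 0 pulls it well below.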

Posterior Predictive Distribution

The posterior predictive distribution for a new observation $X_{\text{new}}$ marginalizes out the unknown $\theta$:

$$p(x_{\text{new}} \mid x_1,\ldots,x_n) = \int p(x_{\text{new}} \mid \theta)\,\pi(\theta \mid x_1,\ldots,x_n)\,d\theta.$$

This automatically accounts for parameter uncertainty — it is wider than predicting at the MLE alone. For Beta-Binomial, this gives the Beta-Binomial distribution, which is overdispersed compared to a Binomial.
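The overdispersion claim can be checked against the closed-form variances (a sketch; the Beta-Binomial variance formula used below is the standard one, with matched mean $p = a/(a+b)$):

```python
def betabinom_var(m, a, b):
    # Variance of a Beta-Binomial(m, a, b): m*p*(1-p)*(a+b+m)/(a+b+1), p = a/(a+b)
    p = a / (a + b)
    return m * p * (1 - p) * (a + b + m) / (a + b + 1)

def binom_var(m, p):
    return m * p * (1 - p)

a, b, m = 8, 4, 20        # Beta(8,4) posterior, predicting 20 future flips
v_bb = betabinom_var(m, a, b)
v_bin = binom_var(m, a / (a + b))
print(v_bb, v_bin)        # the predictive is wider: parameter uncertainty adds variance
```

The extra spread is exactly the variance contributed by not knowing $p$; it grows with $m$ relative to the Binomial.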

Credible Intervals vs Confidence Intervals

A Bayesian credible interval $[a, b]$ with 95% credibility satisfies $P(\theta \in [a, b] \mid x) = 0.95$ — a direct probability statement about $\theta$ given the data.

A frequentist confidence interval at 95% means: if we repeat the experiment many times and compute a CI each time, 95% of the CIs will contain the true $\theta$. A specific realized CI either contains $\theta$ or it does not — there is no probability statement for the specific interval.

The Bayesian interpretation is more natural but requires a prior. In the limit of a flat (diffuse) prior, credible and confidence intervals often coincide numerically.

Bayesian Model Comparison

To compare models $M_1$ vs $M_2$, compute the Bayes factor:

$$\text{BF}_{12} = \frac{p(x \mid M_1)}{p(x \mid M_2)} = \frac{\int p(x \mid \theta_1, M_1)\,\pi(\theta_1 \mid M_1)\,d\theta_1}{\int p(x \mid \theta_2, M_2)\,\pi(\theta_2 \mid M_2)\,d\theta_2}.$$

The Bayes factor automatically penalizes model complexity (Occam's razor): complex models spread prior mass over a larger parameter space, which reduces the marginal likelihood if the data do not need that complexity. No separate penalty term is needed — it is implicit in the integration.

The posterior model odds = prior odds × Bayes factor.
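For Beta-Binomial models the marginal likelihoods are closed-form, so the Bayes factor can be computed exactly. A sketch comparing a point null $p = 0.5$ against a model with a uniform prior on $p$, for 7 heads in 10 flips:

```python
from math import comb, exp, lgamma

def log_beta_fn(a, b):
    # log of the Beta function B(a, b) via log-gamma
    return lgamma(a) + lgamma(b) - lgamma(a + b)

k, n = 7, 10
m1 = comb(n, k) * 0.5 ** n                            # M1: p fixed at 0.5
m2 = comb(n, k) * exp(log_beta_fn(1 + k, 1 + n - k))  # M2: p ~ Uniform(0, 1)
bf_12 = m1 / m2
print(m1, m2, bf_12)
```

The result is BF ≈ 1.29: the data mildly favor the simpler point null, because the uniform-prior model spread its mass over many $p$ values the data do not support — Occam's razor via integration.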

Bayesian Information Criterion (BIC): approximates $-2\log p(x \mid M)$ via a Laplace approximation: $\text{BIC} = -2\log p(x \mid \hat\theta) + k\log n$, where $k$ is the number of parameters. The $k\log n$ penalty approximates the prior-complexity penalty in the Bayes factor.

Variational Inference

For complex posteriors where exact computation is intractable, variational inference approximates $\pi(\theta \mid x)$ by minimizing the KL divergence over a family $\mathcal{Q}$:

$$q^* = \arg\min_{q \in \mathcal{Q}} \text{KL}\big(q(\theta)\,\|\,\pi(\theta \mid x)\big).$$

Equivalently, maximize the evidence lower bound (ELBO):

$$\text{ELBO}(q) = \mathbb{E}_q[\log p(x, \theta)] - \mathbb{E}_q[\log q(\theta)] = \log p(x) - \text{KL}\big(q \,\|\, \pi(\cdot \mid x)\big).$$

Since $\text{KL} \geq 0$, the ELBO lower-bounds $\log p(x)$; maximizing the ELBO is equivalent to minimizing the KL.

Mean-field approximation: $q(\theta) = \prod_j q_j(\theta_j)$. Under this factorization, the optimal $q_j^*$ satisfies:

$$\log q_j^*(\theta_j) = \mathbb{E}_{q_{-j}}[\log p(x, \theta)] + \text{const}.$$

This yields coordinate ascent updates (CAVI) that iterate until convergence.

MCMC (Markov chain Monte Carlo): the sampling alternative to variational inference — generate samples from $\pi(\theta \mid x)$ without a parametric approximation. Metropolis-Hastings proposes $\theta' \sim q(\theta' \mid \theta)$ and accepts with probability $\min\!\left(1, \dfrac{\pi(\theta' \mid x)\,q(\theta \mid \theta')}{\pi(\theta \mid x)\,q(\theta' \mid \theta)}\right)$. Gibbs sampling cycles through the full conditionals $\pi(\theta_j \mid \theta_{-j}, x)$.
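A random-walk Metropolis sketch targeting the coin-flip posterior (uniform prior, 7 heads in 10 flips, i.e. Beta(8, 4)); the step size and chain length here are arbitrary choices:

```python
import math
import random

K, N = 7, 10

def log_unnorm_posterior(theta):
    # Uniform prior times binomial likelihood, up to a normalizing constant.
    if not 0.0 < theta < 1.0:
        return float("-inf")
    return K * math.log(theta) + (N - K) * math.log(1.0 - theta)

random.seed(0)
theta, samples = 0.5, []
for _ in range(20000):
    proposal = theta + random.gauss(0.0, 0.2)   # symmetric proposal: q terms cancel
    if math.log(random.random()) < log_unnorm_posterior(proposal) - log_unnorm_posterior(theta):
        theta = proposal
    samples.append(theta)

samples = samples[2000:]                        # discard burn-in
print(sum(samples) / len(samples))              # close to the exact posterior mean 8/12
```

Note the algorithm only ever evaluates the unnormalized log-posterior — the evidence $p(x)$ is never computed.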

Worked Example

Example 1: Beta-Binomial Coin Flip

Prior: $p \sim \text{Beta}(1, 1)$ (uniform). Observe 7 heads in 10 flips.

Posterior: $p \mid x \sim \text{Beta}(1+7,\ 1+3) = \text{Beta}(8, 4)$.

Posterior mean: $8/(8+4) = 0.667$. MAP: $(8-1)/(8+4-2) = 7/10 = 0.7$ (coincides with the MLE).

Posterior predictive: $P(\text{next flip is heads}) = \mathbb{E}[p \mid x] = 0.667$ (the posterior mean, not 0.7).

95% credible interval for $p$: approximately $[0.39, 0.89]$ from the Beta$(8,4)$ quantiles. This says: given the prior and data, we believe $p \in [0.39, 0.89]$ with 95% probability.

Compare with the frequentist 95% Wald CI: $\hat p \pm 1.96\sqrt{\hat p(1-\hat p)/n} = 0.7 \pm 0.28 = [0.42, 0.98]$. The Bayesian interval is pulled toward the prior mean of 0.5 and is narrower.
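The equal-tailed 95% credible interval for Beta(8, 4) can be checked by Monte Carlo with the standard library's `random.betavariate` (a sketch; 100,000 draws is an arbitrary choice):

```python
import random

random.seed(1)
draws = sorted(random.betavariate(8, 4) for _ in range(100_000))
lo, hi = draws[2_500], draws[97_500]   # empirical 2.5% and 97.5% quantiles
print(lo, hi)                          # equal-tailed 95% credible interval
```

The empirical quantiles land near 0.39 and 0.89, bracketing the posterior mean 0.667.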

Example 2: MAP as Regularized MLE

Consider linear regression $y = X\beta + \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, \sigma^2 I)$.

Prior: $\beta \sim \mathcal{N}(0, \lambda^{-1} I)$.

Log posterior:

$$\log \pi(\beta \mid y) = -\frac{1}{2\sigma^2}\|y - X\beta\|^2 - \frac{\lambda}{2}\|\beta\|^2 + \text{const}.$$

MAP: $\hat\beta_{\text{MAP}} = (X^T X + \lambda\sigma^2 I)^{-1} X^T y$ — exactly ridge regression with penalty $\lambda\sigma^2$. A Gaussian prior corresponds to $L_2$ regularization; a Laplace prior $\pi(\beta_j) \propto e^{-\lambda|\beta_j|}$ corresponds to $L_1$ / Lasso.

This is the key insight: regularization = prior. The choice of regularizer encodes the prior belief about the parameter distribution.
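The MAP-ridge equivalence can be checked numerically in the simplest one-coefficient case (a sketch with made-up data; `lam` stands for the prior precision $\lambda$):

```python
# Toy 1-D regression y ≈ beta * x with known noise variance.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.1, 2.2, 3.9, 6.1, 7.8]
sigma2, lam = 1.0, 0.5

# Ridge / MAP closed form: scalar case of (X^T X + lam*sigma2*I)^-1 X^T y.
beta_map = sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam * sigma2)

# The same answer from maximizing the log-posterior directly on a grid.
def log_posterior(b):
    sse = sum((y - b * x) ** 2 for x, y in zip(xs, ys))
    return -sse / (2 * sigma2) - lam * b * b / 2

beta_grid = max((i / 10000 for i in range(40000)), key=log_posterior)
print(beta_map, beta_grid)
```

Both routes give the same shrunken coefficient; setting `lam = 0` recovers ordinary least squares.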

Example 3: Variational Inference for a Gaussian Mixture

For a Gaussian mixture with component assignments $z_i \in \{1,\ldots,K\}$ and means $\mu_k$, the true posterior $\pi(z, \mu \mid x)$ couples all assignments and means — intractable.

Mean-field: $q(z, \mu) = q(z)\,q(\mu) = \prod_i q(z_i) \prod_k q(\mu_k)$. CAVI updates:

$$\log q^*(z_i = k) \propto \mathbb{E}[\log \pi_k] + \mathbb{E}_{\mu_k}[\log p(x_i \mid \mu_k)],$$

$$\log q^*(\mu_k) = -\frac{1}{2\tau^2}\mu_k^2 + \frac{1}{\sigma^2}\sum_i \mathbb{E}[z_{ik}]\left(x_i \mu_k - \tfrac{1}{2}\mu_k^2\right).$$

Each coordinate update is closed-form; the algorithm alternates between them, converging to a local ELBO maximum. This is the foundation of variational autoencoders (VAEs): $q(\mu \mid x)$ is the encoder, $p(x \mid \mu)$ is the decoder, and the objective is the ELBO.
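The two CAVI updates can be sketched for a one-dimensional, two-component mixture with known variance, uniform weights, and a diffuse $\mathcal{N}(0, \tau^2)$ prior on each mean (all toy choices for illustration):

```python
import math

xs = [-3.2, -2.8, -3.1, -2.9, 3.1, 2.9, 3.2, 2.8]   # two obvious clusters
K, sigma2, tau2 = 2, 1.0, 100.0
m = [-1.0, 1.0]        # variational means (asymmetric init breaks the symmetry)
s2 = [1.0, 1.0]        # variational variances

for _ in range(50):
    # q(z_i) update: responsibilities use E[mu_k] = m_k and E[mu_k^2] = m_k^2 + s2_k.
    # (Uniform weights make E[log pi_k] a constant, so it is dropped.)
    phi = []
    for x in xs:
        logits = [(m[k] * x - 0.5 * (m[k] ** 2 + s2[k])) / sigma2 for k in range(K)]
        top = max(logits)
        w = [math.exp(l - top) for l in logits]
        phi.append([wk / sum(w) for wk in w])
    # q(mu_k) update: Gaussian with precision 1/tau2 + N_k/sigma2.
    for k in range(K):
        Nk = sum(p[k] for p in phi)
        precision = 1 / tau2 + Nk / sigma2
        s2[k] = 1 / precision
        m[k] = sum(p[k] * x for p, x in zip(xs and phi, xs)) / (sigma2 * precision)

print(sorted(m))       # variational means settle near the cluster centers ±3
```

After a few iterations the responsibilities harden and the variational means converge near the empirical cluster centers, mirroring the closed-form alternation described above.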

Connections

Where Your Intuition Breaks

"With enough data, the prior washes out" — this is true for well-specified, finite-dimensional models, where the posterior concentrates around the true parameter as $n \to \infty$. The dangerous extension is applying it to high-dimensional models. For Bayesian neural networks with millions of parameters, the likelihood only identifies a low-dimensional manifold of parameter space; the rest of the prior is barely updated even with enormous datasets. Similarly, improper priors — like a uniform prior over all of $\mathbb{R}^p$ — can yield improper posteriors whose marginal likelihood $p(x)$ does not exist, making Bayesian model comparison via Bayes factors undefined. The prior is not a nuisance to be washed away but a structural assumption that must be chosen with the same care as the likelihood.

💡Intuition

Conjugate priors are the natural statistics of exponential families. For any exponential family with sufficient statistic $T$, the conjugate prior has the form $\pi(\theta) \propto \exp(\chi \cdot \eta(\theta) - n_0 A(\eta(\theta)))$ — it is also an exponential family. The posterior just increments the hyperparameters: "prior data" of $n_0$ observations with mean $\chi/n_0$ is updated to $n_0 + n$ observations with mean $(\chi + T(x))/(n_0 + n)$. The hyperparameters are literally interpretable as "pseudo-counts" from imaginary prior observations.

💡Intuition

The ELBO decomposition reveals the VI objective: $\text{ELBO} = \mathbb{E}_q[\log p(x \mid \theta)] - \text{KL}(q \,\|\, \pi)$. The first term is the expected log-likelihood (fit to data); the second term penalizes the approximate posterior for deviating from the prior (regularization). In VAEs, this is exactly reconstruction loss minus KL divergence — learning to encode data well while keeping the latent space close to a Gaussian prior.

⚠️Warning

Prior choice matters more than it appears. The oft-stated claim that "with enough data the prior washes out" is true for well-specified, finite-dimensional models. But for high-dimensional parameters (Bayesian neural networks with millions of weights), the prior never washes out — the posterior depends critically on the prior even with large datasets. Additionally, improper priors (like the uniform prior on $\mathbb{R}$) can lead to improper posteriors, and the marginal likelihood $p(x)$ may not exist. Hierarchical priors, where hyperparameters are estimated from data (empirical Bayes), provide a middle ground.
