Neural-Path/Notes

Bayesian Inference: Priors, Posteriors & Conjugacy

Bayesian inference treats parameters as random variables and updates beliefs using Bayes' theorem; conjugate priors make this analytically tractable for many model families. Where frequentist statistics asks "what is the probability of this data given the parameter?", Bayesian inference inverts the question: "what is the probability of the parameter given this data?"

Concepts

Bayesian updating: prior Beta(α₀,β₀) updates to posterior Beta(α₀+k, β₀+n-k) after k successes in n trials.

[Interactive figure: Beta prior and posterior densities over θ ∈ [0, 1], starting from a Beta(2, 2) prior, with a data slider (k successes / n trials) and readouts for the prior mean, posterior mean, and posterior MAP.]

Beta(1,1) = Uniform prior. As n→∞, posterior concentrates: prior is washed out by likelihood.

Before seeing any data, you believe a coin is approximately fair — but not certain. After 7 heads in 10 flips, your belief shifts toward $p = 0.7$, tempered by the small sample size. Bayesian inference is the precise machinery for this update: the posterior combines the prior (pre-data belief) and the likelihood (what the data say), yielding a full probability distribution over the parameter — not a single estimate, but a complete description of remaining uncertainty.
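This coin example can be sketched directly as a conjugate update; the Beta(2, 2) prior here is an assumed stand-in for "roughly fair, but not certain":

```python
def beta_binomial_update(alpha0, beta0, k, n):
    """Conjugate update: Beta(alpha0, beta0) prior plus k successes in n trials."""
    return alpha0 + k, beta0 + (n - k)

# "Roughly fair, but not certain" prior; observe 7 heads in 10 flips.
a, b = beta_binomial_update(2, 2, 7, 10)   # Beta(9, 5)
prior_mean = 2 / (2 + 2)                   # 0.5
posterior_mean = a / (a + b)               # 9/14 ≈ 0.643
print(a, b, posterior_mean)
```

The posterior mean lands between the prior mean (0.5) and the MLE (0.7), tempered by the small sample.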

Bayes' Theorem and the Posterior

Given data $X = x$ and a prior $\pi(\theta)$ encoding beliefs before observing data:

$$\pi(\theta \mid x) = \frac{p(x \mid \theta)\,\pi(\theta)}{p(x)}, \qquad p(x) = \int p(x \mid \theta)\,\pi(\theta)\,d\theta.$$

Posterior $\propto$ likelihood $\times$ prior — the proportionality is exact, not an approximation. The evidence $p(x)$ is a constant with respect to $\theta$: the same for every parameter value, serving only to normalize the posterior to integrate to 1. All information about which $\theta$ values are plausible lives in the numerator; computational methods (MCMC, variational inference) routinely skip computing $p(x)$ entirely and work directly with the unnormalized product $p(x \mid \theta)\,\pi(\theta)$.

The four quantities:

  • Prior $\pi(\theta)$: beliefs about $\theta$ before data
  • Likelihood $p(x \mid \theta)$: data-generating model
  • Evidence $p(x)$: marginal likelihood, normalizing constant
  • Posterior $\pi(\theta \mid x)$: updated beliefs

The posterior is proportional to likelihood times prior: $\pi(\theta \mid x) \propto p(x \mid \theta)\,\pi(\theta)$.

Point estimates from the posterior:

  • MAP (maximum a posteriori): $\hat\theta_{\text{MAP}} = \arg\max_\theta \pi(\theta \mid x)$
  • Posterior mean: $\hat\theta_{\text{Bayes}} = \mathbb{E}[\theta \mid x]$ — minimizes posterior expected squared loss
  • Posterior median: minimizes posterior expected absolute loss

MAP = MLE when the prior is uniform. MAP = MLE + regularization when the prior is non-uniform: $\log \pi(\theta \mid x) = \log p(x \mid \theta) + \log \pi(\theta) - \text{const}$.
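To see "MAP = MLE + regularization" concretely, one can maximize the log-posterior on a grid and compare with the conjugate closed form. A sketch, assuming the coin data (7 heads in 10 flips) with a Beta(2, 2) prior:

```python
from math import log

K, N, ALPHA, BETA = 7, 10, 2.0, 2.0   # data, plus an assumed Beta(2,2) prior

def log_posterior(theta):
    # log-likelihood + log-prior, dropping theta-independent constants
    return (K + ALPHA - 1) * log(theta) + (N - K + BETA - 1) * log(1 - theta)

grid = [i / 10000 for i in range(1, 10000)]
theta_map = max(grid, key=log_posterior)

# Conjugate closed form: mode of Beta(alpha+k, beta+n-k) is (alpha+k-1)/(alpha+beta+n-2).
theta_map_exact = (ALPHA + K - 1) / (ALPHA + BETA + N - 2)   # 8/12
print(theta_map, theta_map_exact)
```

The grid maximizer agrees with the closed-form mode to the grid resolution; with a uniform prior (α = β = 1) the same code returns the MLE 0.7.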

Conjugate Priors

A prior $\pi(\theta)$ is conjugate to a likelihood $p(x \mid \theta)$ if the posterior $\pi(\theta \mid x)$ is in the same distributional family as the prior. This yields closed-form posteriors.

| Model | Prior | Posterior | Posterior parameters |
|---|---|---|---|
| Bernoulli($p$) | Beta($\alpha, \beta$) | Beta($\alpha + k,\ \beta + n - k$) | $k$ successes in $n$ trials |
| Binomial($n, p$) | Beta($\alpha, \beta$) | Beta($\alpha + k,\ \beta + n - k$) | same |
| Poisson($\lambda$) | Gamma($a, b$) | Gamma($a + \sum x_i,\ b + n$) | rate + count |
| Normal($\mu$; $\sigma^2$ known) | Normal($\mu_0, \tau^2$) | Normal($\mu_n, \tau_n^2$) | precision-weighted average |
| Normal($\mu, \sigma^2$) | Normal-Inverse-Gamma | Normal-Inverse-Gamma | updated hyperparams |
| Multinomial($\mathbf{p}$) | Dirichlet($\boldsymbol\alpha$) | Dirichlet($\boldsymbol\alpha + \mathbf{n}$) | $n_k$ = counts per class |

Gaussian-Gaussian conjugacy (known variance $\sigma^2$): prior $\mu \sim \mathcal{N}(\mu_0, \tau^2)$. Posterior:

$$\mu \mid x_1,\ldots,x_n \sim \mathcal{N}\!\left(\mu_n, \tau_n^2\right), \qquad \tau_n^2 = \frac{1}{\frac{1}{\tau^2} + \frac{n}{\sigma^2}}, \qquad \mu_n = \tau_n^2\!\left(\frac{\mu_0}{\tau^2} + \frac{n\bar x}{\sigma^2}\right).$$

The posterior precision $1/\tau_n^2$ is the prior precision $1/\tau^2$ plus the data precision $n/\sigma^2$. As $n \to \infty$, the posterior mean converges to $\bar x$ (the MLE) regardless of the prior.
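The precision-weighted update translates line for line into code; a small sketch with made-up data:

```python
def gaussian_posterior(mu0, tau2, sigma2, xs):
    """Posterior over the mean, given a N(mu0, tau2) prior and known variance sigma2."""
    n, xbar = len(xs), sum(xs) / len(xs)
    post_precision = 1 / tau2 + n / sigma2            # precisions add
    tau_n2 = 1 / post_precision
    mu_n = tau_n2 * (mu0 / tau2 + n * xbar / sigma2)  # precision-weighted average
    return mu_n, tau_n2

xs = [4.8, 5.1, 5.3, 4.9, 5.4]                           # sample mean 5.1
mu_tight, _ = gaussian_posterior(0.0, 1.0, 1.0, xs)      # informative prior at 0
mu_diffuse, _ = gaussian_posterior(0.0, 100.0, 1.0, xs)  # nearly flat prior
print(mu_tight, mu_diffuse)
```

With the diffuse prior the posterior mean is essentially $\bar x = 5.1$; the tight prior at 0 pulls it well below.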

Posterior Predictive Distribution

The posterior predictive distribution for a new observation $X_{\text{new}}$ marginalizes out the unknown $\theta$:

$$p(x_{\text{new}} \mid x_1,\ldots,x_n) = \int p(x_{\text{new}} \mid \theta)\,\pi(\theta \mid x_1,\ldots,x_n)\,d\theta.$$

This automatically accounts for parameter uncertainty — it is wider than predicting at the MLE alone. For Beta-Binomial, this gives the Beta-Binomial distribution, which is overdispersed compared to a Binomial.
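The overdispersion claim can be checked against the closed-form variances (a sketch; the Beta-Binomial variance formula used below is the standard one, with matched mean $p = a/(a+b)$):

```python
def betabinom_var(m, a, b):
    # Variance of a Beta-Binomial(m, a, b): m*p*(1-p)*(a+b+m)/(a+b+1), p = a/(a+b)
    p = a / (a + b)
    return m * p * (1 - p) * (a + b + m) / (a + b + 1)

def binom_var(m, p):
    return m * p * (1 - p)

a, b, m = 8, 4, 20        # Beta(8,4) posterior, predicting 20 future flips
v_bb = betabinom_var(m, a, b)
v_bin = binom_var(m, a / (a + b))
print(v_bb, v_bin)        # the predictive is wider: parameter uncertainty adds variance
```

The extra spread is exactly the variance contributed by not knowing $p$; it grows with $m$ relative to the Binomial.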

Credible Intervals vs Confidence Intervals

A Bayesian credible interval $[a, b]$ with 95% credibility satisfies $P(\theta \in [a, b] \mid x) = 0.95$ — a direct probability statement about $\theta$ given the data.

A frequentist confidence interval at 95% means: if we repeat the experiment many times and compute a CI each time, 95% of the CIs will contain the true $\theta$. A specific realized CI either contains $\theta$ or it does not — there is no probability statement for the specific interval.

The Bayesian interpretation is more natural but requires a prior. In the limit of a flat (diffuse) prior, credible and confidence intervals often coincide numerically.

Bayesian Model Comparison

To compare models $M_1$ vs $M_2$, compute the Bayes factor:

$$\text{BF}_{12} = \frac{p(x \mid M_1)}{p(x \mid M_2)} = \frac{\int p(x \mid \theta_1, M_1)\,\pi(\theta_1 \mid M_1)\,d\theta_1}{\int p(x \mid \theta_2, M_2)\,\pi(\theta_2 \mid M_2)\,d\theta_2}.$$

The Bayes factor automatically penalizes model complexity (Occam's razor): complex models spread prior mass over a larger parameter space, which reduces the marginal likelihood if the data do not need that complexity. No separate penalty term is needed — it is implicit in the integration.

The posterior model odds = prior odds × Bayes factor.
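For Beta-Binomial models the marginal likelihoods are closed-form, so the Bayes factor can be computed exactly. A sketch comparing a point null $p = 0.5$ against a model with a uniform prior on $p$, for 7 heads in 10 flips:

```python
from math import comb, exp, lgamma

def log_beta_fn(a, b):
    # log of the Beta function B(a, b) via log-gamma
    return lgamma(a) + lgamma(b) - lgamma(a + b)

k, n = 7, 10
m1 = comb(n, k) * 0.5 ** n                            # M1: p fixed at 0.5
m2 = comb(n, k) * exp(log_beta_fn(1 + k, 1 + n - k))  # M2: p ~ Uniform(0, 1)
bf_12 = m1 / m2
print(m1, m2, bf_12)
```

The result is BF ≈ 1.29: the data mildly favor the simpler point null, because the uniform-prior model spread its mass over many $p$ values the data do not support — Occam's razor via integration.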

Bayesian Information Criterion (BIC): approximates $-2\log p(x \mid M)$ via a Laplace approximation: $\text{BIC} = -2\log p(x \mid \hat\theta) + k\log n$, where $k$ is the number of parameters. The $k\log n$ penalty approximates the prior-complexity penalty in the Bayes factor.

Variational Inference

For complex posteriors where exact computation is intractable, variational inference approximates $\pi(\theta \mid x)$ by minimizing the KL divergence over a family $\mathcal{Q}$:

$$q^* = \arg\min_{q \in \mathcal{Q}} \text{KL}\big(q(\theta)\,\|\,\pi(\theta \mid x)\big).$$

Equivalently, maximize the evidence lower bound (ELBO):

$$\text{ELBO}(q) = \mathbb{E}_q[\log p(x, \theta)] - \mathbb{E}_q[\log q(\theta)] = \log p(x) - \text{KL}\big(q \,\|\, \pi(\cdot \mid x)\big).$$

Since $\text{KL} \geq 0$, the ELBO lower-bounds $\log p(x)$; maximizing the ELBO is equivalent to minimizing the KL.

Mean-field approximation: $q(\theta) = \prod_j q_j(\theta_j)$. Under this factorization, the optimal $q_j^*$ satisfies:

$$\log q_j^*(\theta_j) = \mathbb{E}_{q_{-j}}[\log p(x, \theta)] + \text{const}.$$

This yields coordinate ascent updates (CAVI) that iterate until convergence.

MCMC (Markov chain Monte Carlo): the sampling alternative to variational inference — generate samples from $\pi(\theta \mid x)$ without a parametric approximation. Metropolis-Hastings proposes $\theta' \sim q(\theta' \mid \theta)$ and accepts with probability $\min\!\left(1, \dfrac{\pi(\theta' \mid x)\,q(\theta \mid \theta')}{\pi(\theta \mid x)\,q(\theta' \mid \theta)}\right)$. Gibbs sampling cycles through the full conditionals $\pi(\theta_j \mid \theta_{-j}, x)$.
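A random-walk Metropolis sketch targeting the coin-flip posterior (uniform prior, 7 heads in 10 flips, i.e. Beta(8, 4)); the step size and chain length here are arbitrary choices:

```python
import math
import random

K, N = 7, 10

def log_unnorm_posterior(theta):
    # Uniform prior times binomial likelihood, up to a normalizing constant.
    if not 0.0 < theta < 1.0:
        return float("-inf")
    return K * math.log(theta) + (N - K) * math.log(1.0 - theta)

random.seed(0)
theta, samples = 0.5, []
for _ in range(20000):
    proposal = theta + random.gauss(0.0, 0.2)   # symmetric proposal: q terms cancel
    if math.log(random.random()) < log_unnorm_posterior(proposal) - log_unnorm_posterior(theta):
        theta = proposal
    samples.append(theta)

samples = samples[2000:]                        # discard burn-in
print(sum(samples) / len(samples))              # close to the exact posterior mean 8/12
```

Note the algorithm only ever evaluates the unnormalized log-posterior — the evidence $p(x)$ is never computed.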

Worked Example

Example 1: Beta-Binomial Coin Flip

Prior: $p \sim \text{Beta}(1, 1)$ (uniform). Observe 7 heads in 10 flips.

Posterior: $p \mid x \sim \text{Beta}(1+7,\ 1+3) = \text{Beta}(8, 4)$.

Posterior mean: $8/(8+4) = 0.667$. MAP: $(8-1)/(8+4-2) = 7/10 = 0.7$ (coincides with the MLE).

Posterior predictive: $P(\text{next flip is heads}) = \mathbb{E}[p \mid x] = 0.667$ (the posterior mean, not 0.7).

95% credible interval for $p$: approximately $[0.39, 0.89]$ from the Beta$(8,4)$ quantiles. This says: given the prior and data, we believe $p \in [0.39, 0.89]$ with 95% probability.

Compare with the frequentist 95% Wald CI: $\hat p \pm 1.96\sqrt{\hat p(1-\hat p)/n} = 0.7 \pm 0.28 = [0.42, 0.98]$. The Bayesian interval is pulled toward the prior mean of 0.5 and is narrower.
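The equal-tailed 95% credible interval for Beta(8, 4) can be checked by Monte Carlo with the standard library's `random.betavariate` (a sketch; 100,000 draws is an arbitrary choice):

```python
import random

random.seed(1)
draws = sorted(random.betavariate(8, 4) for _ in range(100_000))
lo, hi = draws[2_500], draws[97_500]   # empirical 2.5% and 97.5% quantiles
print(lo, hi)                          # equal-tailed 95% credible interval
```

The empirical quantiles land near 0.39 and 0.89, bracketing the posterior mean 0.667.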

Example 2: MAP as Regularized MLE

Consider linear regression $y = X\beta + \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, \sigma^2 I)$.

Prior: $\beta \sim \mathcal{N}(0, \lambda^{-1} I)$.

Log posterior:

$$\log \pi(\beta \mid y) = -\frac{1}{2\sigma^2}\|y - X\beta\|^2 - \frac{\lambda}{2}\|\beta\|^2 + \text{const}.$$

MAP: $\hat\beta_{\text{MAP}} = (X^T X + \lambda\sigma^2 I)^{-1} X^T y$ — exactly ridge regression with penalty $\lambda\sigma^2$. A Gaussian prior corresponds to $L_2$ regularization; a Laplace prior $\pi(\beta_j) \propto e^{-\lambda|\beta_j|}$ corresponds to $L_1$ / Lasso.

This is the key insight: regularization = prior. The choice of regularizer encodes the prior belief about the parameter distribution.
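The MAP-ridge equivalence can be checked numerically in the simplest one-coefficient case (a sketch with made-up data; `lam` stands for the prior precision $\lambda$):

```python
# Toy 1-D regression y ≈ beta * x with known noise variance.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.1, 2.2, 3.9, 6.1, 7.8]
sigma2, lam = 1.0, 0.5

# Ridge / MAP closed form: scalar case of (X^T X + lam*sigma2*I)^-1 X^T y.
beta_map = sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam * sigma2)

# The same answer from maximizing the log-posterior directly on a grid.
def log_posterior(b):
    sse = sum((y - b * x) ** 2 for x, y in zip(xs, ys))
    return -sse / (2 * sigma2) - lam * b * b / 2

beta_grid = max((i / 10000 for i in range(40000)), key=log_posterior)
print(beta_map, beta_grid)
```

Both routes give the same shrunken coefficient; setting `lam = 0` recovers ordinary least squares.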

Example 3: Variational Inference for a Gaussian Mixture

For a Gaussian mixture with component assignments $z_i \in \{1,\ldots,K\}$ and means $\mu_k$, the true posterior $\pi(z, \mu \mid x)$ couples all assignments and means — intractable.

Mean-field: $q(z, \mu) = q(z)\,q(\mu) = \prod_i q(z_i) \prod_k q(\mu_k)$. CAVI updates:

$$\log q^*(z_i = k) \propto \mathbb{E}[\log \pi_k] + \mathbb{E}_{\mu_k}[\log p(x_i \mid \mu_k)],$$

$$\log q^*(\mu_k) = -\frac{1}{2\tau^2}\mu_k^2 + \frac{1}{\sigma^2}\sum_i \mathbb{E}[z_{ik}]\left(x_i \mu_k - \tfrac{1}{2}\mu_k^2\right).$$

Each coordinate update is closed-form; the algorithm alternates between them, converging to a local ELBO maximum. This is the foundation of variational autoencoders (VAEs): $q(\mu \mid x)$ is the encoder, $p(x \mid \mu)$ is the decoder, and the objective is the ELBO.
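The two CAVI updates can be sketched for a one-dimensional, two-component mixture with known variance, uniform weights, and a diffuse $\mathcal{N}(0, \tau^2)$ prior on each mean (all toy choices for illustration):

```python
import math

xs = [-3.2, -2.8, -3.1, -2.9, 3.1, 2.9, 3.2, 2.8]   # two obvious clusters
K, sigma2, tau2 = 2, 1.0, 100.0
m = [-1.0, 1.0]        # variational means (asymmetric init breaks the symmetry)
s2 = [1.0, 1.0]        # variational variances

for _ in range(50):
    # q(z_i) update: responsibilities use E[mu_k] = m_k and E[mu_k^2] = m_k^2 + s2_k.
    # (Uniform weights make E[log pi_k] a constant, so it is dropped.)
    phi = []
    for x in xs:
        logits = [(m[k] * x - 0.5 * (m[k] ** 2 + s2[k])) / sigma2 for k in range(K)]
        top = max(logits)
        w = [math.exp(l - top) for l in logits]
        phi.append([wk / sum(w) for wk in w])
    # q(mu_k) update: Gaussian with precision 1/tau2 + N_k/sigma2.
    for k in range(K):
        Nk = sum(p[k] for p in phi)
        precision = 1 / tau2 + Nk / sigma2
        s2[k] = 1 / precision
        m[k] = sum(p[k] * x for p, x in zip(xs and phi, xs)) / (sigma2 * precision)

print(sorted(m))       # variational means settle near the cluster centers ±3
```

After a few iterations the responsibilities harden and the variational means converge near the empirical cluster centers, mirroring the closed-form alternation described above.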

Connections

Where Your Intuition Breaks

"With enough data, the prior washes out" — this is true for well-specified, finite-dimensional models, where the posterior concentrates around the true parameter as $n \to \infty$. The dangerous extension is applying it to high-dimensional models. For Bayesian neural networks with millions of parameters, the likelihood only identifies a low-dimensional manifold of parameter space; the rest of the prior is barely updated even with enormous datasets. Similarly, improper priors — like a uniform prior over all of $\mathbb{R}^p$ — can yield improper posteriors whose marginal likelihood $p(x)$ does not exist, making Bayesian model comparison via Bayes factors undefined. The prior is not a nuisance to be washed away but a structural assumption that must be chosen with the same care as the likelihood.

💡Intuition

Conjugate priors are the natural statistics of exponential families. For any exponential family with sufficient statistic $T$, the conjugate prior has the form $\pi(\theta) \propto \exp(\chi \cdot \eta(\theta) - n_0 A(\eta(\theta)))$ — it is also an exponential family. The posterior just increments the hyperparameters: "prior data" of $n_0$ observations with mean $\chi/n_0$ is updated to $n_0 + n$ observations with mean $(\chi + T(x))/(n_0 + n)$. The hyperparameters are literally interpretable as "pseudo-counts" from imaginary prior observations.

💡Intuition

The ELBO decomposition reveals the VI objective: $\text{ELBO} = \mathbb{E}_q[\log p(x \mid \theta)] - \text{KL}(q \,\|\, \pi)$. The first term is the expected log-likelihood (fit to data); the second term penalizes the approximate posterior for deviating from the prior (regularization). In VAEs, this is exactly reconstruction loss minus KL divergence — learning to encode data well while keeping the latent space close to a Gaussian prior.

⚠️Warning

Prior choice matters more than it appears. The oft-stated claim that "with enough data the prior washes out" is true for well-specified, finite-dimensional models. But for high-dimensional parameters (Bayesian neural networks with millions of weights), the prior never washes out — the posterior depends critically on the prior even with large datasets. Additionally, improper priors (like the uniform prior on $\mathbb{R}$) can lead to improper posteriors, and the marginal likelihood $p(x)$ may not exist. Hierarchical priors, where hyperparameters are estimated from data (empirical Bayes), provide a middle ground.
