
Probability Spaces, Random Variables & Distributions

A probability space $(\Omega, \mathcal{F}, P)$ is the triple that gives every probabilistic concept a rigorous home. Random variables are measurable functions from the sample space to $\mathbb{R}$; distributions are the push-forward measures they induce. This lesson establishes the standard distributions appearing throughout ML and the key structural properties — independence, conditioning, Bayes' theorem — that govern reasoning under uncertainty.

Concepts

When you evaluate a neural network on a test example and get a probability distribution over classes, you're using the language of this lesson: the model is producing a number between 0 and 1 for each class, and those numbers should sum to 1 and satisfy all the rules of a probability measure. The $(\Omega, \mathcal{F}, P)$ triple is the formal foundation that ensures those rules are consistent — that conditional probabilities, marginals, and expectations all behave correctly. Without this structure, Bayes' theorem and maximum likelihood estimation wouldn't have clean definitions.

Probability Spaces

A probability space is a triple $(\Omega, \mathcal{F}, P)$ where:

  • $\Omega$ is the sample space (set of all possible outcomes)
  • $\mathcal{F}$ is a $\sigma$-algebra on $\Omega$ (collection of observable events)
  • $P : \mathcal{F} \to [0,1]$ is a probability measure: $P(\Omega) = 1$, $\sigma$-additive

The $\sigma$-algebra $\mathcal{F}$ is the structure that says which subsets of $\Omega$ are "observable" events. Without it, you could assign probability to sets that lead to contradictions: the Banach-Tarski paradox decomposes a solid sphere into finitely many pieces and reassembles them into two spheres of the same size — possible only because the pieces are non-measurable sets. The $\sigma$-algebra requirement — closed under countable unions and complements — restricts probability to measurable sets, which is exactly what prevents these pathologies and guarantees that countable additivity is consistent.

Examples of sample spaces:

  • Coin flip: $\Omega = \{H, T\}$, $\mathcal{F} = 2^\Omega$, $P(H) = p$
  • Continuous outcome: $\Omega = \mathbb{R}$, $\mathcal{F} = \mathcal{B}(\mathbb{R})$, $P$ given by a CDF
  • Infinite sequences: $\Omega = \{0,1\}^\infty$ (for modeling iid binary sequences) with the product $\sigma$-algebra
  • Path space: $\Omega = C([0,1], \mathbb{R})$ (continuous functions) for Brownian motion

Random Variables and Their Distributions

A random variable is a measurable function $X : (\Omega, \mathcal{F}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$.

The distribution (or law) of $X$ is the push-forward measure $P_X = P \circ X^{-1}$:

$$P_X(B) = P(X \in B) = P(\{\omega : X(\omega) \in B\}) \quad \forall B \in \mathcal{B}(\mathbb{R}).$$

Cumulative distribution function (CDF): $F_X(x) = P(X \leq x) = P_X((-\infty, x])$.

Properties: nondecreasing, right-continuous, $F_X(-\infty) = 0$, $F_X(+\infty) = 1$.

Probability mass function (PMF): for discrete $X$ taking values $\{x_1, x_2, \ldots\}$: $p_X(x_k) = P(X = x_k)$.

Probability density function (PDF): for absolutely continuous $X$: $f_X(x) = F_X'(x)$ a.e., with $P_X(B) = \int_B f_X(x)\,dx$.
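The PDF/CDF relationship and the normalization of a PMF are easy to check numerically. A minimal sketch assuming `numpy` and `scipy` are available; the standard normal and Binomial(10, 0.3) are arbitrary illustrative choices:

```python
import numpy as np
from scipy import stats

# CDF recovered from the PDF by numerical integration: F(x) ≈ ∫_{-8}^{x} f(t) dt
x = np.linspace(-8, 2, 4001)
pdf = stats.norm.pdf(x, loc=0, scale=1)
cdf_numeric = np.cumsum(pdf) * (x[1] - x[0])    # crude Riemann sum
cdf_exact = stats.norm.cdf(x)
print(np.max(np.abs(cdf_numeric - cdf_exact)))  # small discretization error

# A PMF sums to 1 over the support
k = np.arange(0, 11)
pmf = stats.binom.pmf(k, n=10, p=0.3)
print(pmf.sum())
```

The lower integration limit $-8$ stands in for $-\infty$; the Gaussian tail below it is negligible.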

Standard Distributions

Discrete distributions:

| Distribution | PMF | Mean | Variance | ML role |
|---|---|---|---|---|
| Bernoulli($p$) | $p^k(1-p)^{1-k}$, $k\in\{0,1\}$ | $p$ | $p(1-p)$ | Binary labels |
| Binomial($n,p$) | $\binom{n}{k}p^k(1-p)^{n-k}$ | $np$ | $np(1-p)$ | Count of successes |
| Poisson($\lambda$) | $e^{-\lambda}\lambda^k/k!$ | $\lambda$ | $\lambda$ | Event counts |
| Geometric($p$) | $(1-p)^{k-1}p$ | $1/p$ | $(1-p)/p^2$ | First success time |
| Categorical($\boldsymbol\pi$) | $\pi_k$ for $k=1,\ldots,K$ | — | — | Multiclass labels |

Continuous distributions:

| Distribution | PDF | Mean | Variance | ML role |
|---|---|---|---|---|
| Gaussian $\mathcal{N}(\mu,\sigma^2)$ | $(2\pi\sigma^2)^{-1/2}\exp(-(x-\mu)^2/2\sigma^2)$ | $\mu$ | $\sigma^2$ | Priors, noise |
| Exponential($\lambda$) | $\lambda e^{-\lambda x}$, $x\geq 0$ | $1/\lambda$ | $1/\lambda^2$ | Waiting times |
| Gamma($\alpha,\beta$) | $x^{\alpha-1}e^{-\beta x}\beta^\alpha/\Gamma(\alpha)$ | $\alpha/\beta$ | $\alpha/\beta^2$ | Conjugate to Poisson |
| Beta($\alpha,\beta$) | $x^{\alpha-1}(1-x)^{\beta-1}/B(\alpha,\beta)$ | $\alpha/(\alpha+\beta)$ | $\alpha\beta/((\alpha+\beta)^2(\alpha+\beta+1))$ | Conjugate to Bernoulli |
| Student-$t$($\nu$) | $C(1+x^2/\nu)^{-(\nu+1)/2}$ | $0$ ($\nu>1$) | $\nu/(\nu-2)$ ($\nu>2$) | Heavy tails |
| Uniform($a,b$) | $1/(b-a)$ | $(a+b)/2$ | $(b-a)^2/12$ | Non-informative prior |
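The mean and variance columns can be sanity-checked by Monte Carlo. A sketch assuming `numpy`, with arbitrary illustrative parameters; note that NumPy parameterizes `gamma` by shape and scale $1/\beta$, and `exponential` by scale $1/\lambda$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# (samples, theoretical mean, theoretical variance) per the table above
checks = {
    "Poisson(3)":     (rng.poisson(3.0, n),         3.0,     3.0),
    "Exponential(2)": (rng.exponential(1 / 2.0, n), 1 / 2.0, 1 / 4.0),
    "Gamma(2,3)":     (rng.gamma(2.0, 1 / 3.0, n),  2 / 3.0, 2 / 9.0),
    "Uniform(1,5)":   (rng.uniform(1.0, 5.0, n),    3.0,     16.0 / 12.0),
}
for name, (s, mean, var) in checks.items():
    print(f"{name}: mean {s.mean():.3f} (theory {mean:.3f}), "
          f"var {s.var():.3f} (theory {var:.3f})")
```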

Multivariate Gaussian. The most important distribution in ML:

$$\mathbf{x} \sim \mathcal{N}(\boldsymbol\mu, \Sigma): \quad p(\mathbf{x}) = \frac{1}{(2\pi)^{n/2}\det(\Sigma)^{1/2}}\exp\!\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol\mu)^T\Sigma^{-1}(\mathbf{x}-\boldsymbol\mu)\right).$$

Properties of multivariate Gaussian:

  • Affine closure: if $\mathbf{x} \sim \mathcal{N}(\mu,\Sigma)$, then $A\mathbf{x}+b \sim \mathcal{N}(A\mu+b, A\Sigma A^T)$.
  • Marginals: $x_i \sim \mathcal{N}(\mu_i, \Sigma_{ii})$ — marginals of Gaussians are Gaussian.
  • Conditionals: $(x_A \mid x_B = v) \sim \mathcal{N}(\mu_{A|B}, \Sigma_{A|B})$ where $\mu_{A|B} = \mu_A + \Sigma_{AB}\Sigma_{BB}^{-1}(v-\mu_B)$ — conditionals of Gaussians are Gaussian.
  • Product of Gaussians: $\mathcal{N}(x;\mu_1,\sigma_1^2)\cdot\mathcal{N}(x;\mu_2,\sigma_2^2) \propto \mathcal{N}(x;\mu_*,\sigma_*^2)$ — the unnormalized product is Gaussian (key for Bayesian updates).
  • Entropy: $H(\mathcal{N}(\mu,\Sigma)) = \frac{1}{2}\log\det(2\pi e\Sigma)$ — the Gaussian maximizes entropy for given covariance.
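Affine closure and the entropy formula can be verified numerically. A sketch assuming `numpy`; the matrix $A$, offset $b$, and parameters are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
A = np.array([[1.0, 2.0],
              [0.0, 3.0]])
b = np.array([0.5, -1.0])

x = rng.multivariate_normal(mu, Sigma, size=500_000)
y = x @ A.T + b                                   # affine transform of each sample

# Affine closure: y ~ N(A mu + b, A Sigma A^T)
print(y.mean(axis=0), A @ mu + b)                 # empirical vs. theoretical mean
print(np.cov(y.T))                                # empirical covariance
print(A @ Sigma @ A.T)                            # theoretical covariance

# Entropy: H = 0.5 * log det(2 pi e Sigma)
H = 0.5 * np.log(np.linalg.det(2 * np.pi * np.e * Sigma))
print(H)
```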

Dirichlet distribution. Generalizes the Beta to the simplex: $\mathbf{p} = (p_1,\ldots,p_K) \sim \text{Dir}(\alpha_1,\ldots,\alpha_K)$, writing $\alpha_0 = \sum_k \alpha_k$.

$$p(\mathbf{p}) = \frac{\Gamma(\alpha_0)}{\prod_k\Gamma(\alpha_k)}\prod_k p_k^{\alpha_k-1}, \quad \mathbf{p} \in \Delta^{K-1}.$$

Conjugate prior for the Categorical distribution. $\mathbb{E}[p_k] = \alpha_k/\alpha_0$. Used as the prior over topic proportions in LDA.
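Draws from a Dirichlet lie on the simplex, and their empirical mean matches $\alpha_k/\alpha_0$. A sketch assuming `numpy`, with hypothetical concentration parameters:

```python
import numpy as np

rng = np.random.default_rng(2)
alpha = np.array([2.0, 3.0, 5.0])     # hypothetical concentration parameters
p = rng.dirichlet(alpha, size=100_000)

print(p[0], p[0].sum())               # each draw is a point on the simplex (sums to 1)
print(p.mean(axis=0))                 # ≈ alpha / alpha_0 = [0.2, 0.3, 0.5]
```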

Independence and Conditional Probability

Independence of events: $A \perp B$ iff $P(A \cap B) = P(A)P(B)$.

Independence of random variables: $X \perp Y$ iff $P(X \in A, Y \in B) = P(X \in A)P(Y \in B)$ for all Borel $A, B$ — equivalently, the joint distribution factors: $P_{X,Y} = P_X \otimes P_Y$.

Conditional probability: $P(A \mid B) = P(A \cap B) / P(B)$ for $P(B) > 0$.

Conditional distribution: $P(X \in A \mid Y = y)$ — for continuous $Y$, defined via regular conditional probability, a measure-theoretic subtlety. The conditional PDF is $f_{X|Y}(x|y) = f_{X,Y}(x,y)/f_Y(y)$.

Bayes' theorem:

$$P(H \mid E) = \frac{P(E \mid H)P(H)}{P(E)} = \frac{P(E \mid H)P(H)}{\sum_{H'} P(E \mid H')P(H')}.$$

Total probability: $P(E) = \sum_k P(E \mid H_k)P(H_k)$ for a partition $\{H_k\}$ of $\Omega$.
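Bayes' theorem plus total probability is a two-line computation. A sketch with hypothetical numbers (1% prevalence, 95% sensitivity, 10% false-positive rate — purely illustrative):

```python
# Bayes' theorem with the total-probability expansion in the denominator
p_H = 0.01                # P(H): prior
p_E_given_H = 0.95        # P(E | H): likelihood
p_E_given_not_H = 0.10    # P(E | not H): false-positive rate

# Total probability: P(E) = P(E|H) P(H) + P(E|not H) P(not H)
p_E = p_E_given_H * p_H + p_E_given_not_H * (1 - p_H)
posterior = p_E_given_H * p_H / p_E
print(posterior)          # ≈ 0.0876: a positive result leaves P(H|E) below 9%
```

The counterintuitively small posterior comes from the prior: false positives from the 99% of negatives swamp the true positives.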

Law of total expectation: $\mathbb{E}[X] = \mathbb{E}[\mathbb{E}[X \mid Y]]$.

Law of total variance: $\text{Var}(X) = \mathbb{E}[\text{Var}(X \mid Y)] + \text{Var}(\mathbb{E}[X \mid Y])$.
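The total-variance decomposition is easy to see on a two-component Gaussian mixture. A sketch assuming `numpy`; the mixture weights and component parameters are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1_000_000

# Mixture: Y ~ Bernoulli(0.3); X|Y=0 ~ N(0, 1), X|Y=1 ~ N(4, 4)
y = rng.random(n) < 0.3
x = np.where(y, rng.normal(4.0, 2.0, n), rng.normal(0.0, 1.0, n))

# Var(X) = E[Var(X|Y)] + Var(E[X|Y])
e_cond_var = 0.7 * 1.0 + 0.3 * 4.0            # E[Var(X|Y)] = 1.9
var_cond_mean = 0.3 * 0.7 * (4.0 - 0.0)**2    # Var(E[X|Y]) = 3.36 (two-point mean)
print(x.var(), e_cond_var + var_cond_mean)    # both ≈ 5.26
```

The "within-group" term and the "between-group" term together account for all the variance.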

Worked Example

Example 1: Gaussian Conditioning (Bayesian Update)

Let the prior be $\theta \sim \mathcal{N}(\mu_0, \sigma_0^2)$ and the likelihood $x \mid \theta \sim \mathcal{N}(\theta, \sigma^2)$ (single observation). The posterior:

$$p(\theta \mid x) \propto p(x \mid \theta)\,p(\theta) = \mathcal{N}(x;\theta,\sigma^2)\cdot\mathcal{N}(\theta;\mu_0,\sigma_0^2).$$

Both factors are Gaussian in $\theta$; their product is Gaussian:

$$\theta \mid x \sim \mathcal{N}(\mu_n, \sigma_n^2), \quad \frac{1}{\sigma_n^2} = \frac{1}{\sigma_0^2} + \frac{1}{\sigma^2}, \quad \mu_n = \sigma_n^2\!\left(\frac{\mu_0}{\sigma_0^2} + \frac{x}{\sigma^2}\right).$$

The posterior mean is a precision-weighted average of prior and observation. With $n$ iid observations: $1/\sigma_n^2 = 1/\sigma_0^2 + n/\sigma^2$ and $\mu_n = \sigma_n^2(\mu_0/\sigma_0^2 + n\bar{x}/\sigma^2)$. As $n\to\infty$: $\mu_n \to \bar{x}$ (data dominates the prior) and $\sigma_n^2 \to \sigma^2/n$ (the posterior concentrates). This is Bayesian inference for a Gaussian model.
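The $n$-observation update above is a few lines of code. A sketch assuming `numpy`; the prior, noise level, and true parameter are illustrative choices:

```python
import numpy as np

def gaussian_posterior(mu0, var0, xs, noise_var):
    """Posterior N(mu_n, var_n) for theta, given iid xs ~ N(theta, noise_var)."""
    n = len(xs)
    var_n = 1.0 / (1.0 / var0 + n / noise_var)          # precisions add
    mu_n = var_n * (mu0 / var0 + np.sum(xs) / noise_var)
    return mu_n, var_n

rng = np.random.default_rng(4)
xs = rng.normal(2.0, 1.0, size=1000)                    # true theta = 2

mu_n, var_n = gaussian_posterior(mu0=0.0, var0=10.0, xs=xs, noise_var=1.0)
print(mu_n, var_n)   # mean near the sample mean; variance near sigma^2/n = 0.001
```

With 1000 observations the diffuse prior is almost irrelevant: the posterior mean essentially equals $\bar{x}$.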

Example 2: Multivariate Gaussian — Marginal and Conditional

Let $\mathbf{x} = (x_A, x_B) \sim \mathcal{N}(\boldsymbol\mu, \Sigma)$ with block structure:

$$\boldsymbol\mu = \begin{pmatrix}\mu_A \\ \mu_B\end{pmatrix}, \quad \Sigma = \begin{pmatrix}\Sigma_{AA} & \Sigma_{AB} \\ \Sigma_{BA} & \Sigma_{BB}\end{pmatrix}.$$

Marginal: $x_A \sim \mathcal{N}(\mu_A, \Sigma_{AA})$.

Conditional: $x_A \mid x_B = v \sim \mathcal{N}(\mu_{A|B}, \Sigma_{A|B})$ where:

$$\mu_{A|B} = \mu_A + \Sigma_{AB}\Sigma_{BB}^{-1}(v - \mu_B), \qquad \Sigma_{A|B} = \Sigma_{AA} - \Sigma_{AB}\Sigma_{BB}^{-1}\Sigma_{BA}.$$

The term $\Sigma_{AB}\Sigma_{BB}^{-1}$ is the regression coefficient — the optimal linear predictor of $x_A$ from $x_B$. Gaussian process regression is exactly this formula applied to function values at unobserved locations.
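The conditioning formula translates directly into code. A sketch assuming `numpy`; `mvn_condition` and the 3-dimensional example are illustrative (a production version would use a Cholesky solve rather than an explicit inverse):

```python
import numpy as np

def mvn_condition(mu, Sigma, idx_A, idx_B, v):
    """Mean and covariance of x_A | x_B = v for x ~ N(mu, Sigma)."""
    S_AA = Sigma[np.ix_(idx_A, idx_A)]
    S_AB = Sigma[np.ix_(idx_A, idx_B)]
    S_BB = Sigma[np.ix_(idx_B, idx_B)]
    K = S_AB @ np.linalg.inv(S_BB)       # regression coefficient Sigma_AB Sigma_BB^{-1}
    mu_cond = mu[idx_A] + K @ (v - mu[idx_B])
    Sigma_cond = S_AA - K @ S_AB.T       # Schur complement
    return mu_cond, Sigma_cond

mu = np.array([0.0, 1.0, -1.0])
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])
mu_c, Sig_c = mvn_condition(mu, Sigma, [0], [1, 2], np.array([2.0, 0.0]))
print(mu_c, Sig_c)   # conditioning shrinks the variance: Sig_c[0,0] < Sigma[0,0]
```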

Example 3: Conjugacy and the Dirichlet-Categorical Model

Prior: $\boldsymbol\pi \sim \text{Dir}(\boldsymbol\alpha)$ over $K$-class probabilities. Data: $x_1,\ldots,x_n \sim \text{Cat}(\boldsymbol\pi)$ iid. The posterior is:

$$\boldsymbol\pi \mid x_{1:n} \sim \text{Dir}(\alpha_1 + n_1, \ldots, \alpha_K + n_K),$$

where $n_k = \sum_i \mathbf{1}[x_i = k]$ counts occurrences of class $k$. Conjugacy means the posterior is in the same family as the prior — just with updated hyperparameters. This closed-form posterior update is why Bayesian inference is tractable for exponential family models: the sufficient statistics of the data simply add to the hyperparameters.
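The whole conjugate update is "add the counts". A sketch assuming `numpy`; `dirichlet_posterior` and the toy data are illustrative:

```python
import numpy as np

def dirichlet_posterior(alpha, data, K):
    """Dir(alpha + counts): conjugate update for Categorical data in {0,...,K-1}."""
    counts = np.bincount(data, minlength=K)
    return alpha + counts

alpha = np.array([1.0, 1.0, 1.0])          # uniform prior on the simplex
data = np.array([0, 2, 2, 1, 2, 0, 2])     # toy observations: n_0=2, n_1=1, n_2=4
alpha_post = dirichlet_posterior(alpha, data, K=3)
print(alpha_post)                          # -> [3. 2. 5.]
print(alpha_post / alpha_post.sum())       # posterior mean E[p_k | data] = [0.3, 0.2, 0.5]
```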

Connections

Where Your Intuition Breaks

Bayes' theorem looks simple: $P(A \mid B) = P(B \mid A)P(A)/P(B)$. The dangerous assumption hidden in the denominator: $P(B) > 0$. Conditional probability is undefined when the conditioning event has probability zero. This is not a pathological corner case — in continuous distributions, any specific value has probability zero, so $P(X=x)=0$ for every $x$. Conditioning on a continuous observation (as in a Kalman filter or a continuous latent variable model) requires a more careful construction: regular conditional distributions, which exist under mild measurability conditions but do not follow from the elementary formula. This is why likelihood functions for continuous observations are densities (evaluated pointwise but not themselves probabilities), and why "conditioning on a zero-probability event" in variational inference needs to be handled via the density rather than the probability mass.

💡Intuition

The multivariate Gaussian is defined by its first two moments. A remarkable property: knowing $\boldsymbol\mu$ and $\Sigma$ completely specifies the entire distribution. All higher moments are determined by these two. This is why Gaussian assumptions are so prevalent — the Gaussian is the maximum entropy distribution subject to a known mean and covariance, and it is closed under all affine operations, marginalization, and conditioning. In practice, assuming Gaussian noise or Gaussian priors is not just convenient — it is the least informative (most conservative) assumption given second-order statistics.

💡Intuition

Bayesian updating is sequential: each posterior becomes the next prior. In online learning, the update formula $P(\theta \mid x_{1:n}) \propto P(x_n \mid \theta)\,P(\theta \mid x_{1:n-1})$ shows that new data updates beliefs incrementally. For conjugate priors, this is a simple hyperparameter increment. For non-conjugate models, variational inference or MCMC approximate the posterior. The Kalman filter is the Gaussian case of this sequential Bayesian updating — making it the optimal linear filter for Gaussian state-space models.

⚠️Warning

Independence implies zero correlation (whenever the covariance exists), but zero correlation does not imply independence. Two variables can be uncorrelated ($\text{Cov}(X,Y)=0$) yet strongly dependent (e.g., $X \sim \mathcal{N}(0,1)$, $Y = X^2$: $\text{Cov}(X,Y) = \mathbb{E}[X^3] = 0$ but $Y$ is a deterministic function of $X$). The exception: for jointly Gaussian variables, uncorrelated implies independent. Confusing correlation with dependence leads to bugs in feature selection (correlated features are not necessarily redundant) and in independence assumptions in generative models.
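The $X$, $Y = X^2$ counterexample is easy to reproduce. A sketch assuming `numpy`:

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(0.0, 1.0, 1_000_000)
y = x**2                                   # deterministic function of x

# Uncorrelated: Cov(X, X^2) = E[X^3] = 0 by symmetry
print(np.cov(x, y)[0, 1])                  # ≈ 0

# ...yet fully dependent: |x| > 1 forces y > 1
print((y > 1.0).mean())                    # ≈ P(|X| > 1) ≈ 0.317
print((y[np.abs(x) > 1.0] > 1.0).mean())   # exactly 1.0
```

A covariance of (numerically) zero coexists with perfect predictability of $Y$ from $X$ — exactly the trap described above.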
