
Probability Spaces, Random Variables & Distributions

A probability space $(\Omega, \mathcal{F}, P)$ is the triple that gives every probabilistic concept a rigorous home. Random variables are measurable functions from the sample space to $\mathbb{R}$; distributions are the push-forward measures they induce. This lesson establishes the standard distributions appearing throughout ML and the key structural properties — independence, conditioning, Bayes' theorem — that govern reasoning under uncertainty.

Concepts

When you evaluate a neural network on a test example and get a probability distribution over classes, you're using the language of this lesson: the model is producing a number between 0 and 1 for each class, and those numbers should sum to 1 and satisfy all the rules of a probability measure. The $(\Omega, \mathcal{F}, P)$ triple is the formal foundation that ensures those rules are consistent — that conditional probabilities, marginals, and expectations all behave correctly. Without this structure, Bayes' theorem and maximum likelihood estimation wouldn't have clean definitions.

Probability Spaces

A probability space is a triple $(\Omega, \mathcal{F}, P)$ where:

  • $\Omega$ is the sample space (set of all possible outcomes)
  • $\mathcal{F}$ is a $\sigma$-algebra on $\Omega$ (collection of observable events)
  • $P : \mathcal{F} \to [0,1]$ is a probability measure: $P(\Omega) = 1$, $\sigma$-additive

The $\sigma$-algebra $\mathcal{F}$ is the structure that says which subsets of $\Omega$ are "observable" events. Without it, you could assign probability to sets that lead to contradictions: the Banach-Tarski paradox decomposes a solid sphere into finitely many pieces and reassembles them into two spheres of the same size — possible only because the pieces are non-measurable sets. The $\sigma$-algebra requirement — closed under countable unions and complements — restricts probability to measurable sets, which is exactly what prevents these pathologies and guarantees that countable additivity is consistent.

Examples of sample spaces:

  • Coin flip: $\Omega = \{H, T\}$, $\mathcal{F} = 2^\Omega$, $P(H) = p$
  • Continuous outcome: $\Omega = \mathbb{R}$, $\mathcal{F} = \mathcal{B}(\mathbb{R})$, $P$ given by a CDF
  • Infinite sequences: $\Omega = \{0,1\}^\infty$ (for modeling iid binary sequences) with the product $\sigma$-algebra
  • Path space: $\Omega = C([0,1], \mathbb{R})$ (continuous functions) for Brownian motion

Random Variables and Their Distributions

A random variable is a measurable function $X : (\Omega, \mathcal{F}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$.

The distribution (or law) of $X$ is the push-forward measure $P_X = P \circ X^{-1}$:

$$P_X(B) = P(X \in B) = P(\{\omega : X(\omega) \in B\}) \quad \forall B \in \mathcal{B}(\mathbb{R}).$$

Cumulative distribution function (CDF): $F_X(x) = P(X \leq x) = P_X((-\infty, x])$.

Properties: nondecreasing, right-continuous, $F_X(-\infty) = 0$, $F_X(+\infty) = 1$.

Probability mass function (PMF): for discrete $X$ taking values $\{x_1, x_2, \ldots\}$: $p_X(x_k) = P(X = x_k)$.

Probability density function (PDF): for absolutely continuous $X$: $f_X(x) = F_X'(x)$ a.e., with $P_X(B) = \int_B f_X(x)\,dx$.
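The PDF/CDF relationship and the normalization of a PMF are easy to check numerically. A minimal sketch assuming `numpy` and `scipy` are available; the standard normal and Binomial(10, 0.3) are arbitrary illustrative choices:

```python
import numpy as np
from scipy import stats

# CDF recovered from the PDF by numerical integration: F(x) ≈ ∫_{-8}^{x} f(t) dt
x = np.linspace(-8, 2, 4001)
pdf = stats.norm.pdf(x, loc=0, scale=1)
cdf_numeric = np.cumsum(pdf) * (x[1] - x[0])    # crude Riemann sum
cdf_exact = stats.norm.cdf(x)
print(np.max(np.abs(cdf_numeric - cdf_exact)))  # small discretization error

# A PMF sums to 1 over the support
k = np.arange(0, 11)
pmf = stats.binom.pmf(k, n=10, p=0.3)
print(pmf.sum())
```

The lower integration limit $-8$ stands in for $-\infty$; the Gaussian tail below it is negligible.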

Standard Distributions

Discrete distributions:

| Distribution | PMF | Mean | Variance | ML role |
|---|---|---|---|---|
| Bernoulli($p$) | $p^k(1-p)^{1-k}$, $k\in\{0,1\}$ | $p$ | $p(1-p)$ | Binary labels |
| Binomial($n,p$) | $\binom{n}{k}p^k(1-p)^{n-k}$ | $np$ | $np(1-p)$ | Count of successes |
| Poisson($\lambda$) | $e^{-\lambda}\lambda^k/k!$ | $\lambda$ | $\lambda$ | Event counts |
| Geometric($p$) | $(1-p)^{k-1}p$ | $1/p$ | $(1-p)/p^2$ | First success time |
| Categorical($\boldsymbol\pi$) | $\pi_k$ for $k=1,\ldots,K$ | — | — | Multiclass labels |

Continuous distributions:

| Distribution | PDF | Mean | Variance | ML role |
|---|---|---|---|---|
| Gaussian $\mathcal{N}(\mu,\sigma^2)$ | $(2\pi\sigma^2)^{-1/2}\exp(-(x-\mu)^2/2\sigma^2)$ | $\mu$ | $\sigma^2$ | Priors, noise |
| Exponential($\lambda$) | $\lambda e^{-\lambda x}$, $x\geq 0$ | $1/\lambda$ | $1/\lambda^2$ | Waiting times |
| Gamma($\alpha,\beta$) | $x^{\alpha-1}e^{-\beta x}\beta^\alpha/\Gamma(\alpha)$ | $\alpha/\beta$ | $\alpha/\beta^2$ | Conjugate to Poisson |
| Beta($\alpha,\beta$) | $x^{\alpha-1}(1-x)^{\beta-1}/B(\alpha,\beta)$ | $\alpha/(\alpha+\beta)$ | $\alpha\beta/((\alpha+\beta)^2(\alpha+\beta+1))$ | Conjugate to Bernoulli |
| Student-$t$($\nu$) | $C(1+x^2/\nu)^{-(\nu+1)/2}$ | $0$ ($\nu>1$) | $\nu/(\nu-2)$ ($\nu>2$) | Heavy tails |
| Uniform($a,b$) | $1/(b-a)$ | $(a+b)/2$ | $(b-a)^2/12$ | Non-informative prior |
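The mean and variance columns can be sanity-checked by Monte Carlo. A sketch assuming `numpy`, with arbitrary illustrative parameters; note that NumPy parameterizes `gamma` by shape and scale $1/\beta$, and `exponential` by scale $1/\lambda$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# (samples, theoretical mean, theoretical variance) per the table above
checks = {
    "Poisson(3)":     (rng.poisson(3.0, n),         3.0,     3.0),
    "Exponential(2)": (rng.exponential(1 / 2.0, n), 1 / 2.0, 1 / 4.0),
    "Gamma(2,3)":     (rng.gamma(2.0, 1 / 3.0, n),  2 / 3.0, 2 / 9.0),
    "Uniform(1,5)":   (rng.uniform(1.0, 5.0, n),    3.0,     16.0 / 12.0),
}
for name, (s, mean, var) in checks.items():
    print(f"{name}: mean {s.mean():.3f} (theory {mean:.3f}), "
          f"var {s.var():.3f} (theory {var:.3f})")
```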

Multivariate Gaussian. The most important distribution in ML:

$$\mathbf{x} \sim \mathcal{N}(\boldsymbol\mu, \Sigma): \quad p(\mathbf{x}) = \frac{1}{(2\pi)^{n/2}\det(\Sigma)^{1/2}}\exp\!\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol\mu)^T\Sigma^{-1}(\mathbf{x}-\boldsymbol\mu)\right).$$

Properties of multivariate Gaussian:

  • Affine closure: if $\mathbf{x} \sim \mathcal{N}(\mu,\Sigma)$, then $A\mathbf{x}+b \sim \mathcal{N}(A\mu+b, A\Sigma A^T)$.
  • Marginals: $x_i \sim \mathcal{N}(\mu_i, \Sigma_{ii})$ — marginals of Gaussians are Gaussian.
  • Conditionals: $(x_A \mid x_B = v) \sim \mathcal{N}(\mu_{A|B}, \Sigma_{A|B})$ where $\mu_{A|B} = \mu_A + \Sigma_{AB}\Sigma_{BB}^{-1}(v-\mu_B)$ — conditionals of Gaussians are Gaussian.
  • Product of Gaussians: $\mathcal{N}(x;\mu_1,\sigma_1^2)\cdot\mathcal{N}(x;\mu_2,\sigma_2^2) \propto \mathcal{N}(x;\mu_*,\sigma_*^2)$ — the unnormalized product is Gaussian (key for Bayesian updates).
  • Entropy: $H(\mathcal{N}(\mu,\Sigma)) = \frac{1}{2}\log\det(2\pi e\Sigma)$ — the Gaussian maximizes entropy for given covariance.
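Affine closure and the entropy formula can be verified numerically. A sketch assuming `numpy`; the matrix $A$, offset $b$, and parameters are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
A = np.array([[1.0, 2.0],
              [0.0, 3.0]])
b = np.array([0.5, -1.0])

x = rng.multivariate_normal(mu, Sigma, size=500_000)
y = x @ A.T + b                                   # affine transform of each sample

# Affine closure: y ~ N(A mu + b, A Sigma A^T)
print(y.mean(axis=0), A @ mu + b)                 # empirical vs. theoretical mean
print(np.cov(y.T))                                # empirical covariance
print(A @ Sigma @ A.T)                            # theoretical covariance

# Entropy: H = 0.5 * log det(2 pi e Sigma)
H = 0.5 * np.log(np.linalg.det(2 * np.pi * np.e * Sigma))
print(H)
```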

Dirichlet distribution. Generalizes the Beta to the simplex: $\mathbf{p} = (p_1,\ldots,p_K) \sim \text{Dir}(\alpha_1,\ldots,\alpha_K)$, writing $\alpha_0 = \sum_k \alpha_k$.

$$p(\mathbf{p}) = \frac{\Gamma(\alpha_0)}{\prod_k\Gamma(\alpha_k)}\prod_k p_k^{\alpha_k-1}, \quad \mathbf{p} \in \Delta^{K-1}.$$

Conjugate prior for the Categorical distribution. $\mathbb{E}[p_k] = \alpha_k/\alpha_0$. Used as the prior over topic proportions in LDA.
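Draws from a Dirichlet lie on the simplex, and their empirical mean matches $\alpha_k/\alpha_0$. A sketch assuming `numpy`, with hypothetical concentration parameters:

```python
import numpy as np

rng = np.random.default_rng(2)
alpha = np.array([2.0, 3.0, 5.0])     # hypothetical concentration parameters
p = rng.dirichlet(alpha, size=100_000)

print(p[0], p[0].sum())               # each draw is a point on the simplex (sums to 1)
print(p.mean(axis=0))                 # ≈ alpha / alpha_0 = [0.2, 0.3, 0.5]
```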

Independence and Conditional Probability

Independence of events: $A \perp B$ iff $P(A \cap B) = P(A)P(B)$.

Independence of random variables: $X \perp Y$ iff $P(X \in A, Y \in B) = P(X \in A)P(Y \in B)$ for all Borel $A, B$ — equivalently, the joint distribution factors: $P_{X,Y} = P_X \otimes P_Y$.

Conditional probability: $P(A \mid B) = P(A \cap B) / P(B)$ for $P(B) > 0$.

Conditional distribution: $P(X \in A \mid Y = y)$ — for continuous $Y$, defined via regular conditional probability, a measure-theoretic subtlety. The conditional PDF is $f_{X|Y}(x|y) = f_{X,Y}(x,y)/f_Y(y)$.

Bayes' theorem:

$$P(H \mid E) = \frac{P(E \mid H)P(H)}{P(E)} = \frac{P(E \mid H)P(H)}{\sum_{H'} P(E \mid H')P(H')}.$$

Total probability: $P(E) = \sum_k P(E \mid H_k)P(H_k)$ for a partition $\{H_k\}$ of $\Omega$.
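Bayes' theorem plus total probability is a two-line computation. A sketch with hypothetical numbers (1% prevalence, 95% sensitivity, 10% false-positive rate — purely illustrative):

```python
# Bayes' theorem with the total-probability expansion in the denominator
p_H = 0.01                # P(H): prior
p_E_given_H = 0.95        # P(E | H): likelihood
p_E_given_not_H = 0.10    # P(E | not H): false-positive rate

# Total probability: P(E) = P(E|H) P(H) + P(E|not H) P(not H)
p_E = p_E_given_H * p_H + p_E_given_not_H * (1 - p_H)
posterior = p_E_given_H * p_H / p_E
print(posterior)          # ≈ 0.0876: a positive result leaves P(H|E) below 9%
```

The counterintuitively small posterior comes from the prior: false positives from the 99% of negatives swamp the true positives.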

Law of total expectation: $\mathbb{E}[X] = \mathbb{E}[\mathbb{E}[X \mid Y]]$.

Law of total variance: $\text{Var}(X) = \mathbb{E}[\text{Var}(X \mid Y)] + \text{Var}(\mathbb{E}[X \mid Y])$.
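The total-variance decomposition is easy to see on a two-component Gaussian mixture. A sketch assuming `numpy`; the mixture weights and component parameters are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1_000_000

# Mixture: Y ~ Bernoulli(0.3); X|Y=0 ~ N(0, 1), X|Y=1 ~ N(4, 4)
y = rng.random(n) < 0.3
x = np.where(y, rng.normal(4.0, 2.0, n), rng.normal(0.0, 1.0, n))

# Var(X) = E[Var(X|Y)] + Var(E[X|Y])
e_cond_var = 0.7 * 1.0 + 0.3 * 4.0            # E[Var(X|Y)] = 1.9
var_cond_mean = 0.3 * 0.7 * (4.0 - 0.0)**2    # Var(E[X|Y]) = 3.36 (two-point mean)
print(x.var(), e_cond_var + var_cond_mean)    # both ≈ 5.26
```

The "within-group" term and the "between-group" term together account for all the variance.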

Worked Example

Example 1: Gaussian Conditioning (Bayesian Update)

Let the prior be $\theta \sim \mathcal{N}(\mu_0, \sigma_0^2)$ and the likelihood $x \mid \theta \sim \mathcal{N}(\theta, \sigma^2)$ (single observation). The posterior:

$$p(\theta \mid x) \propto p(x \mid \theta)\,p(\theta) = \mathcal{N}(x;\theta,\sigma^2)\cdot\mathcal{N}(\theta;\mu_0,\sigma_0^2).$$

Both factors are Gaussian in $\theta$; their product is Gaussian:

$$\theta \mid x \sim \mathcal{N}(\mu_n, \sigma_n^2), \quad \frac{1}{\sigma_n^2} = \frac{1}{\sigma_0^2} + \frac{1}{\sigma^2}, \quad \mu_n = \sigma_n^2\!\left(\frac{\mu_0}{\sigma_0^2} + \frac{x}{\sigma^2}\right).$$

The posterior mean is a precision-weighted average of prior and observation. With $n$ iid observations: $1/\sigma_n^2 = 1/\sigma_0^2 + n/\sigma^2$ and $\mu_n = \sigma_n^2(\mu_0/\sigma_0^2 + n\bar{x}/\sigma^2)$. As $n\to\infty$: $\mu_n \to \bar{x}$ (data dominates the prior) and $\sigma_n^2 \to \sigma^2/n$ (the posterior concentrates). This is Bayesian inference for a Gaussian model.
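The $n$-observation update above is a few lines of code. A sketch assuming `numpy`; the prior, noise level, and true parameter are illustrative choices:

```python
import numpy as np

def gaussian_posterior(mu0, var0, xs, noise_var):
    """Posterior N(mu_n, var_n) for theta, given iid xs ~ N(theta, noise_var)."""
    n = len(xs)
    var_n = 1.0 / (1.0 / var0 + n / noise_var)          # precisions add
    mu_n = var_n * (mu0 / var0 + np.sum(xs) / noise_var)
    return mu_n, var_n

rng = np.random.default_rng(4)
xs = rng.normal(2.0, 1.0, size=1000)                    # true theta = 2

mu_n, var_n = gaussian_posterior(mu0=0.0, var0=10.0, xs=xs, noise_var=1.0)
print(mu_n, var_n)   # mean near the sample mean; variance near sigma^2/n = 0.001
```

With 1000 observations the diffuse prior is almost irrelevant: the posterior mean essentially equals $\bar{x}$.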

Example 2: Multivariate Gaussian — Marginal and Conditional

Let $\mathbf{x} = (x_A, x_B) \sim \mathcal{N}(\boldsymbol\mu, \Sigma)$ with block structure:

$$\boldsymbol\mu = \begin{pmatrix}\mu_A \\ \mu_B\end{pmatrix}, \quad \Sigma = \begin{pmatrix}\Sigma_{AA} & \Sigma_{AB} \\ \Sigma_{BA} & \Sigma_{BB}\end{pmatrix}.$$

Marginal: $x_A \sim \mathcal{N}(\mu_A, \Sigma_{AA})$.

Conditional: $x_A \mid x_B = v \sim \mathcal{N}(\mu_{A|B}, \Sigma_{A|B})$ where:

$$\mu_{A|B} = \mu_A + \Sigma_{AB}\Sigma_{BB}^{-1}(v - \mu_B), \qquad \Sigma_{A|B} = \Sigma_{AA} - \Sigma_{AB}\Sigma_{BB}^{-1}\Sigma_{BA}.$$

The term $\Sigma_{AB}\Sigma_{BB}^{-1}$ is the regression coefficient — the optimal linear predictor of $x_A$ from $x_B$. Gaussian process regression is exactly this formula applied to function values at unobserved locations.
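The conditioning formula translates directly into code. A sketch assuming `numpy`; `mvn_condition` and the 3-dimensional example are illustrative (a production version would use a Cholesky solve rather than an explicit inverse):

```python
import numpy as np

def mvn_condition(mu, Sigma, idx_A, idx_B, v):
    """Mean and covariance of x_A | x_B = v for x ~ N(mu, Sigma)."""
    S_AA = Sigma[np.ix_(idx_A, idx_A)]
    S_AB = Sigma[np.ix_(idx_A, idx_B)]
    S_BB = Sigma[np.ix_(idx_B, idx_B)]
    K = S_AB @ np.linalg.inv(S_BB)       # regression coefficient Sigma_AB Sigma_BB^{-1}
    mu_cond = mu[idx_A] + K @ (v - mu[idx_B])
    Sigma_cond = S_AA - K @ S_AB.T       # Schur complement
    return mu_cond, Sigma_cond

mu = np.array([0.0, 1.0, -1.0])
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])
mu_c, Sig_c = mvn_condition(mu, Sigma, [0], [1, 2], np.array([2.0, 0.0]))
print(mu_c, Sig_c)   # conditioning shrinks the variance: Sig_c[0,0] < Sigma[0,0]
```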

Example 3: Conjugacy and the Dirichlet-Categorical Model

Prior: $\boldsymbol\pi \sim \text{Dir}(\boldsymbol\alpha)$ over $K$-class probabilities. Data: $x_1,\ldots,x_n \sim \text{Cat}(\boldsymbol\pi)$ iid. The posterior is:

$$\boldsymbol\pi \mid x_{1:n} \sim \text{Dir}(\alpha_1 + n_1, \ldots, \alpha_K + n_K),$$

where $n_k = \sum_i \mathbf{1}[x_i = k]$ counts occurrences of class $k$. Conjugacy means the posterior is in the same family as the prior — just with updated hyperparameters. This closed-form posterior update is why Bayesian inference is tractable for exponential family models: the sufficient statistics of the data simply add to the hyperparameters.
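The whole conjugate update is "add the counts". A sketch assuming `numpy`; `dirichlet_posterior` and the toy data are illustrative:

```python
import numpy as np

def dirichlet_posterior(alpha, data, K):
    """Dir(alpha + counts): conjugate update for Categorical data in {0,...,K-1}."""
    counts = np.bincount(data, minlength=K)
    return alpha + counts

alpha = np.array([1.0, 1.0, 1.0])          # uniform prior on the simplex
data = np.array([0, 2, 2, 1, 2, 0, 2])     # toy observations: n_0=2, n_1=1, n_2=4
alpha_post = dirichlet_posterior(alpha, data, K=3)
print(alpha_post)                          # -> [3. 2. 5.]
print(alpha_post / alpha_post.sum())       # posterior mean E[p_k | data] = [0.3, 0.2, 0.5]
```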

Connections

Where Your Intuition Breaks

Bayes' theorem looks simple: $P(A \mid B) = P(B \mid A)P(A)/P(B)$. The dangerous assumption hidden in the denominator: $P(B) > 0$. Conditional probability is undefined when the conditioning event has probability zero. This is not a pathological corner case — in continuous distributions, any specific value has probability zero, so $P(X=x)=0$ for every $x$. Conditioning on a continuous observation (as in a Kalman filter or a continuous latent variable model) requires a more careful construction: regular conditional distributions, which exist under mild measurability conditions but do not follow from the elementary formula. This is why likelihood functions for continuous observations are densities (evaluated pointwise but not themselves probabilities), and why "conditioning on a zero-probability event" in variational inference needs to be handled via the density rather than the probability mass.

💡Intuition

The multivariate Gaussian is defined by its first two moments. A remarkable property: knowing $\boldsymbol\mu$ and $\Sigma$ completely specifies the entire distribution. All higher moments are determined by these two. This is why Gaussian assumptions are so prevalent — the Gaussian is the maximum entropy distribution subject to a known mean and covariance, and it is closed under all affine operations, marginalization, and conditioning. In practice, assuming Gaussian noise or Gaussian priors is not just convenient — it is the least informative (most conservative) assumption given second-order statistics.

💡Intuition

Bayesian updating is sequential: each posterior becomes the next prior. In online learning, the update formula $P(\theta \mid x_{1:n}) \propto P(x_n \mid \theta)\,P(\theta \mid x_{1:n-1})$ shows that new data updates beliefs incrementally. For conjugate priors, this is a simple hyperparameter increment. For non-conjugate models, variational inference or MCMC approximate the posterior. The Kalman filter is the Gaussian case of this sequential Bayesian updating — making it the optimal linear filter for Gaussian state-space models.

⚠️Warning

Independence implies zero correlation (whenever the covariance exists), but zero correlation does not imply independence. Two variables can be uncorrelated ($\text{Cov}(X,Y)=0$) yet strongly dependent (e.g., $X \sim \mathcal{N}(0,1)$, $Y = X^2$: $\text{Cov}(X,Y) = \mathbb{E}[X^3] = 0$ but $Y$ is a deterministic function of $X$). The exception: for jointly Gaussian variables, uncorrelated implies independent. Confusing correlation with dependence leads to bugs in feature selection (correlated features are not necessarily redundant) and in independence assumptions in generative models.
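The $X$, $Y = X^2$ counterexample is easy to reproduce. A sketch assuming `numpy`:

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(0.0, 1.0, 1_000_000)
y = x**2                                   # deterministic function of x

# Uncorrelated: Cov(X, X^2) = E[X^3] = 0 by symmetry
print(np.cov(x, y)[0, 1])                  # ≈ 0

# ...yet fully dependent: |x| > 1 forces y > 1
print((y > 1.0).mean())                    # ≈ P(|X| > 1) ≈ 0.317
print((y[np.abs(x) > 1.0] > 1.0).mean())   # exactly 1.0
```

A covariance of (numerically) zero coexists with perfect predictability of $Y$ from $X$ — exactly the trap described above.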
