Probability Spaces, Random Variables & Distributions
A probability space is the triple $(\Omega, \mathcal{F}, P)$ that gives every probabilistic concept a rigorous home. Random variables are measurable functions from the sample space to $\mathbb{R}$; distributions are the push-forward measures they induce. This lesson establishes the standard distributions appearing throughout ML and the key structural properties — independence, conditioning, Bayes' theorem — that govern reasoning under uncertainty.
Concepts
When you evaluate a neural network on a test example and get a probability distribution over classes, you're using the language of this lesson: the model is producing a number between 0 and 1 for each class, and those numbers should sum to 1 and satisfy all the rules of a probability measure. The triple $(\Omega, \mathcal{F}, P)$ is the formal foundation that ensures those rules are consistent — that conditional probabilities, marginals, and expectations all behave correctly. Without this structure, Bayes' theorem and maximum likelihood estimation wouldn't have clean definitions.
Probability Spaces
A probability space is a triple $(\Omega, \mathcal{F}, P)$ where:
- $\Omega$ is the sample space (set of all possible outcomes)
- $\mathcal{F}$ is a $\sigma$-algebra on $\Omega$ (collection of observable events)
- $P: \mathcal{F} \to [0, 1]$ is a probability measure: $P(\Omega) = 1$, $\sigma$-additive
The $\sigma$-algebra is the structure that says which subsets of $\Omega$ are "observable" events. Without it, you could assign probability to sets that lead to contradictions (Banach-Tarski paradox: a solid sphere can be decomposed into pieces and reassembled into two spheres of the same size — a measure-theoretic impossibility if you allow non-measurable sets). The $\sigma$-algebra requirement — closed under countable unions and complements — is exactly what prevents these pathologies and guarantees that countable additivity is consistent.
Examples of sample spaces:
- Coin flip: $\Omega = \{H, T\}$, $\mathcal{F} = 2^\Omega$, $P(\{H\}) = p$, $P(\{T\}) = 1 - p$
- Continuous outcome: $\Omega = \mathbb{R}$, $\mathcal{F} = \mathcal{B}(\mathbb{R})$ (the Borel sets), $P$ given by a CDF
- Infinite sequences: $\Omega = \{0, 1\}^{\mathbb{N}}$ (for modeling iid binary sequences) with the product $\sigma$-algebra
- Path space: $\Omega = C([0, \infty), \mathbb{R})$ (continuous functions) for Brownian motion
Random Variables and Their Distributions
A random variable is a measurable function $X: \Omega \to \mathbb{R}$.
The distribution (or law) of $X$ is the push-forward measure $P_X = P \circ X^{-1}$: $P_X(B) = P(X \in B)$ for Borel sets $B$.
Cumulative distribution function (CDF): $F_X(x) = P(X \le x)$.
Properties: nondecreasing, right-continuous, $\lim_{x \to -\infty} F_X(x) = 0$, $\lim_{x \to +\infty} F_X(x) = 1$.
Probability mass function (PMF): for discrete $X$ taking values $x_1, x_2, \dots$: $p_X(x_i) = P(X = x_i)$.
Probability density function (PDF): for absolutely continuous $X$: $f_X = F_X'$ a.e., with $F_X(x) = \int_{-\infty}^{x} f_X(t)\,dt$ and $\int_{-\infty}^{\infty} f_X(t)\,dt = 1$.
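As a quick numerical illustration (a sketch, not part of the original lesson, assuming NumPy and SciPy are available), integrating a PDF recovers the CDF:

```python
# Sketch: for an absolutely continuous X, F_X(x) is the integral of f_X up to x.
# Here X ~ N(0, 1); we approximate F_X(2) by numerically integrating the PDF.
import numpy as np
from scipy.stats import norm

grid = np.linspace(-8.0, 2.0, 20_000)      # -8 stands in for -infinity
pdf_vals = norm.pdf(grid)                   # f_X evaluated on the grid

cdf_numeric = np.trapz(pdf_vals, grid)      # approx. integral of f_X from -inf to 2
cdf_exact = norm.cdf(2.0)                   # F_X(2)

print(cdf_numeric, cdf_exact)               # both approximately 0.9772
```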
Standard Distributions
Discrete distributions:
| Distribution | PMF | Mean | Variance | ML role |
|---|---|---|---|---|
| Bernoulli($p$) | $P(X=1) = p$, $P(X=0) = 1-p$ | $p$ | $p(1-p)$ | Binary labels |
| Binomial($n, p$) | $\binom{n}{k} p^k (1-p)^{n-k}$ | $np$ | $np(1-p)$ | Count of successes |
| Poisson($\lambda$) | $e^{-\lambda} \lambda^k / k!$ | $\lambda$ | $\lambda$ | Event counts |
| Geometric($p$) | $(1-p)^{k-1} p$ | $1/p$ | $(1-p)/p^2$ | First success time |
| Categorical($p_1, \dots, p_K$) | $P(X = k) = p_k$ for $k = 1, \dots, K$ | — | — | Multiclass labels |
Continuous distributions:
| Distribution | PDF | Mean | Variance | ML role |
|---|---|---|---|---|
| Gaussian $\mathcal{N}(\mu, \sigma^2)$ | $\frac{1}{\sqrt{2\pi\sigma^2}} e^{-(x-\mu)^2 / (2\sigma^2)}$ | $\mu$ | $\sigma^2$ | Priors, noise |
| Exponential($\lambda$) | $\lambda e^{-\lambda x}$, $x \ge 0$ | $1/\lambda$ | $1/\lambda^2$ | Waiting times |
| Gamma($\alpha, \beta$) | $\frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha-1} e^{-\beta x}$ | $\alpha/\beta$ | $\alpha/\beta^2$ | Conjugate to Poisson |
| Beta($\alpha, \beta$) | $\frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha, \beta)}$ | $\frac{\alpha}{\alpha+\beta}$ | $\frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}$ | Conjugate to Bernoulli |
| Student-$t$($\nu$) | $\propto \left(1 + x^2/\nu\right)^{-(\nu+1)/2}$ | 0 ($\nu > 1$) | $\frac{\nu}{\nu-2}$ ($\nu > 2$) | Heavy tails |
| Uniform($a, b$) | $\frac{1}{b-a}$ on $[a, b]$ | $\frac{a+b}{2}$ | $\frac{(b-a)^2}{12}$ | Non-informative prior |
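The means and variances in the table can be cross-checked with `scipy.stats` (an illustrative sketch, not from the lesson; note that SciPy parameterizes Gamma and Exponential by a scale $1/\beta$ and $1/\lambda$ rather than a rate):

```python
# Sanity-check a few rows of the table against scipy.stats.
# SciPy uses scale parameters: Gamma(alpha, beta) is gamma(a=alpha, scale=1/beta),
# Exponential(lam) is expon(scale=1/lam), Uniform(a, b) is uniform(loc=a, scale=b-a).
from scipy.stats import gamma, beta, expon, uniform

alpha_, beta_ = 3.0, 2.0
print(gamma(a=alpha_, scale=1/beta_).mean(), alpha_ / beta_)      # 1.5, 1.5
print(gamma(a=alpha_, scale=1/beta_).var(), alpha_ / beta_**2)    # 0.75, 0.75

a, b = 2.0, 5.0
print(beta(a, b).mean(), a / (a + b))                             # 2/7, 2/7
print(expon(scale=1/0.5).var(), 1 / 0.5**2)                       # 4.0, 4.0
print(uniform(loc=1.0, scale=3.0).var(), 3.0**2 / 12)             # Uniform(1, 4): 0.75
```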
Multivariate Gaussian. The most important distribution in ML:

$$\mathcal{N}(x; \mu, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\!\left(-\tfrac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu)\right), \qquad x, \mu \in \mathbb{R}^d,\ \Sigma \succ 0.$$
Properties of multivariate Gaussian:
- Affine closure: if $X \sim \mathcal{N}(\mu, \Sigma)$, then $AX + b \sim \mathcal{N}(A\mu + b, A \Sigma A^\top)$ (see the numerical check after this list).
- Marginals: if $X = (X_1, X_2)$, then $X_1 \sim \mathcal{N}(\mu_1, \Sigma_{11})$ — marginals of Gaussians are Gaussian.
- Conditionals: $X_1 \mid X_2 = x_2 \sim \mathcal{N}(\mu_{1|2}, \Sigma_{1|2})$ where $\mu_{1|2} = \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2)$ and $\Sigma_{1|2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$ — conditionals of Gaussians are Gaussian.
- Product of Gaussians: $\mathcal{N}(x; \mu_1, \Sigma_1)\,\mathcal{N}(x; \mu_2, \Sigma_2) \propto \mathcal{N}(x; \mu_*, \Sigma_*)$ with $\Sigma_* = (\Sigma_1^{-1} + \Sigma_2^{-1})^{-1}$ and $\mu_* = \Sigma_*(\Sigma_1^{-1}\mu_1 + \Sigma_2^{-1}\mu_2)$ — the unnormalized product is Gaussian (key for Bayesian updates).
- Entropy: $H(X) = \frac{1}{2}\log\det(2\pi e \Sigma)$ — Gaussian maximizes entropy for given covariance.
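A quick Monte Carlo check of the affine-closure property (a sketch with made-up numbers, assuming NumPy): the empirical mean and covariance of $AX + b$ match $A\mu + b$ and $A\Sigma A^\top$.

```python
# Illustrative check: if X ~ N(mu, Sigma), then AX + b has
# mean A mu + b and covariance A Sigma A^T.
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
A = np.array([[1.0, 2.0],
              [0.0, 3.0]])
b = np.array([0.5, -1.0])

X = rng.multivariate_normal(mu, Sigma, size=200_000)  # samples of X
Y = X @ A.T + b                                       # samples of AX + b

print(Y.mean(axis=0), A @ mu + b)                     # empirical vs. theoretical mean
print(np.cov(Y, rowvar=False), A @ Sigma @ A.T)       # empirical vs. theoretical covariance
```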
Dirichlet distribution. Generalizes Beta to the simplex: $p(\theta; \alpha) = \frac{1}{B(\alpha)} \prod_{k=1}^{K} \theta_k^{\alpha_k - 1}$ where $\theta \in \Delta^{K-1}$ (i.e. $\theta_k \ge 0$, $\sum_k \theta_k = 1$) and $\alpha_k > 0$.
Conjugate prior for the Categorical distribution: if $\theta \sim \mathrm{Dir}(\alpha)$ and the observations are Categorical($\theta$), the posterior is $\mathrm{Dir}(\alpha + n)$ with $n_k$ the count of class $k$. Used as prior over topic proportions in LDA.
Independence and Conditional Probability
Independence of events: $A \perp B$ iff $P(A \cap B) = P(A)\,P(B)$.
Independence of random variables: $X \perp Y$ iff $P(X \in A, Y \in B) = P(X \in A)\,P(Y \in B)$ for all Borel $A, B$ — equivalently, the joint distribution factors: $P_{(X,Y)} = P_X \otimes P_Y$.
Conditional probability: $P(A \mid B) = \frac{P(A \cap B)}{P(B)}$ for $P(B) > 0$.
Conditional distribution: $P(X \in A \mid Y = y)$ — for continuous $Y$, defined via regular conditional probability, a measure-theoretic subtlety. The conditional PDF is $f_{X \mid Y}(x \mid y) = \frac{f_{X,Y}(x, y)}{f_Y(y)}$.
Bayes' theorem: $P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$.
Total probability: $P(B) = \sum_i P(B \mid A_i)\,P(A_i)$ for a partition $\{A_i\}$ of $\Omega$.
Law of total expectation: $\mathbb{E}[X] = \mathbb{E}\big[\mathbb{E}[X \mid Y]\big]$.
Law of total variance: $\mathrm{Var}(X) = \mathbb{E}[\mathrm{Var}(X \mid Y)] + \mathrm{Var}(\mathbb{E}[X \mid Y])$.
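A Monte Carlo sanity check of the last two laws (an illustrative sketch with assumed parameters): draw from a two-component Gaussian mixture and compare both sides of each identity.

```python
# Y ~ Bernoulli(0.3), X | Y=0 ~ N(0, 1), X | Y=1 ~ N(5, 4).
import numpy as np

rng = np.random.default_rng(0)
n, p = 1_000_000, 0.3
means, variances = np.array([0.0, 5.0]), np.array([1.0, 4.0])

Y = rng.binomial(1, p, size=n)
X = rng.normal(means[Y], np.sqrt(variances[Y]))

# Law of total expectation: E[X] = E[E[X|Y]]
print(X.mean(), (1 - p) * means[0] + p * means[1])

# Law of total variance: Var(X) = E[Var(X|Y)] + Var(E[X|Y])
expected_cond_var = (1 - p) * variances[0] + p * variances[1]
mean_X = (1 - p) * means[0] + p * means[1]
var_of_cond_mean = (1 - p) * means[0] ** 2 + p * means[1] ** 2 - mean_X ** 2
print(X.var(), expected_cond_var + var_of_cond_mean)
```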
Worked Examples
Example 1: Gaussian Conditioning (Bayesian Update)
Let prior $\mu \sim \mathcal{N}(\mu_0, \sigma_0^2)$ and likelihood $x \mid \mu \sim \mathcal{N}(\mu, \sigma^2)$ (single observation). The posterior:

$$p(\mu \mid x) \propto p(x \mid \mu)\, p(\mu).$$

Both are Gaussian in $\mu$; their product is Gaussian:

$$\mu \mid x \sim \mathcal{N}(\mu_{\text{post}}, \sigma_{\text{post}}^2), \qquad \frac{1}{\sigma_{\text{post}}^2} = \frac{1}{\sigma_0^2} + \frac{1}{\sigma^2}, \qquad \mu_{\text{post}} = \sigma_{\text{post}}^2 \left(\frac{\mu_0}{\sigma_0^2} + \frac{x}{\sigma^2}\right).$$

The posterior mean is a precision-weighted average of prior and observation. With $n$ iid observations: $\frac{1}{\sigma_{\text{post}}^2} = \frac{1}{\sigma_0^2} + \frac{n}{\sigma^2}$ and $\mu_{\text{post}} = \sigma_{\text{post}}^2\left(\frac{\mu_0}{\sigma_0^2} + \frac{n\bar{x}}{\sigma^2}\right)$. As $n \to \infty$: $\mu_{\text{post}} \to \bar{x}$ (data dominates prior) and $\sigma_{\text{post}}^2 \to 0$ (posterior concentrates). This is Bayesian inference for a Gaussian model.
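A minimal numeric sketch of this update (hypothetical prior and observations, assuming NumPy): precisions add, and the posterior mean is a precision-weighted average.

```python
# Conjugate Gaussian update with known noise variance.
import numpy as np

mu0, sigma0_sq = 0.0, 4.0                 # prior N(mu0, sigma0^2)
sigma_sq = 1.0                            # known observation noise variance
x = np.array([2.1, 1.7, 2.4, 1.9])        # hypothetical iid observations
n = x.size

post_precision = 1.0 / sigma0_sq + n / sigma_sq
post_var = 1.0 / post_precision
post_mean = post_var * (mu0 / sigma0_sq + n * x.mean() / sigma_sq)

print(post_mean, post_var)                # pulled toward the sample mean as n grows
```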
Example 2: Multivariate Gaussian — Marginal and Conditional
Let $X = (X_1, X_2) \sim \mathcal{N}(\mu, \Sigma)$ with block structure:

$$\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}.$$

Marginal: $X_1 \sim \mathcal{N}(\mu_1, \Sigma_{11})$.

Conditional: $X_1 \mid X_2 = x_2 \sim \mathcal{N}(\mu_{1|2}, \Sigma_{1|2})$ where:

$$\mu_{1|2} = \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2), \qquad \Sigma_{1|2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}.$$

The term $\Sigma_{12}\Sigma_{22}^{-1}$ is the regression coefficient — the optimal linear predictor of $X_1$ from $X_2$. Gaussian process regression is exactly this formula applied to function values at unobserved locations.
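A numeric sketch of the conditioning formula for a bivariate case (made-up mean and covariance, assuming NumPy):

```python
# Condition a bivariate Gaussian on an observed X2.
# mu_{1|2} = mu1 + Sigma12 Sigma22^{-1} (x2 - mu2)
# Sigma_{1|2} = Sigma11 - Sigma12 Sigma22^{-1} Sigma21
import numpy as np

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
x2 = 2.5                                  # observed value of X2

Sigma11, Sigma12, Sigma22 = Sigma[0, 0], Sigma[0, 1], Sigma[1, 1]
coef = Sigma12 / Sigma22                  # regression coefficient Sigma12 Sigma22^{-1}

cond_mean = Sigma11 * 0 + mu[0] + coef * (x2 - mu[1])
cond_var = Sigma11 - coef * Sigma12

print(cond_mean, cond_var)                # X1 | X2 = 2.5 ~ N(1.2, 1.36)
```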
Example 3: Conjugacy and the Dirichlet-Categorical Model
Prior: $\theta \sim \mathrm{Dir}(\alpha_1, \dots, \alpha_K)$ over $K$-class probabilities. Data: $x_1, \dots, x_n \mid \theta \sim \mathrm{Cat}(\theta)$ iid. The posterior is:

$$\theta \mid x_{1:n} \sim \mathrm{Dir}(\alpha_1 + n_1, \dots, \alpha_K + n_K),$$

where $n_k = \#\{i : x_i = k\}$ counts the observations in class $k$. Conjugacy means the posterior is in the same family as the prior — just with updated hyperparameters. This closed-form posterior update is why Bayesian inference is tractable for exponential family models: the sufficient statistics of the data just add to the hyperparameters.
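The same update in code (a sketch with hypothetical labels, assuming NumPy): the posterior hyperparameters are the prior hyperparameters plus the class counts.

```python
# Dirichlet-Categorical conjugate update.
import numpy as np

alpha = np.array([1.0, 1.0, 1.0])         # Dir(1, 1, 1) prior over K = 3 classes
x = np.array([0, 2, 2, 1, 2, 0, 2])       # observed class labels in {0, 1, 2}

counts = np.bincount(x, minlength=alpha.size)
alpha_post = alpha + counts               # posterior Dir(alpha + n)

posterior_mean = alpha_post / alpha_post.sum()   # E[theta_k | data]
print(alpha_post, posterior_mean)
```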
Connections
Where Your Intuition Breaks
Bayes' theorem looks simple: $P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$. The dangerous assumption hidden in the denominator: $P(B) > 0$. Conditional probability is undefined when the conditioning event has probability zero. This is not a pathological corner case — in continuous distributions, any specific value has probability zero, so $P(Y = y) = 0$ for every $y$. Conditioning on a continuous observation (as in a Kalman filter or a continuous latent variable model) requires a more careful construction: regular conditional distributions, which exist under mild measurability conditions but do not follow from the elementary formula. This is why likelihood functions for continuous observations are densities (evaluated pointwise but not themselves probabilities), and why "conditioning on a zero-probability event" in variational inference needs to be handled via the density rather than the probability mass.
The multivariate Gaussian is defined by its first two moments. A remarkable property: knowing $\mu$ and $\Sigma$ completely specifies the entire distribution. All higher moments are determined by these two. This is why Gaussian assumptions are so prevalent — they are the maximum entropy distribution subject to known mean and covariance (the entropy property listed above), and they are closed under all affine operations, marginalization, and conditioning. In practice, assuming Gaussian noise or Gaussian priors is not just convenient — it is the least informative (most conservative) assumption given second-order statistics.
Bayesian updating is sequential: each posterior becomes the next prior. In online learning, the Bayesian update formula shows that new data updates beliefs incrementally. For conjugate priors, this is a simple hyperparameter increment. For non-conjugate models, variational inference or MCMC approximate the posterior. The Kalman filter is the Gaussian case of this sequential Bayesian updating — making it the optimal linear filter for Gaussian state-space models.
Independence implies zero correlation (when second moments exist), but zero correlation does not imply independence. Two variables can be uncorrelated ($\mathrm{Cov}(X, Y) = 0$) yet strongly dependent (e.g., $X \sim \mathcal{N}(0, 1)$, $Y = X^2$: $\mathrm{Cov}(X, Y) = \mathbb{E}[X^3] = 0$ but $Y$ is a deterministic function of $X$). The exception: for jointly Gaussian variables, uncorrelated implies independent. Confusing correlation with dependence leads to bugs in feature selection (correlated features are not necessarily redundant) and independence assumptions in generative models.
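A quick simulation of the $X$, $Y = X^2$ example (illustrative, assuming NumPy):

```python
# Uncorrelated but fully dependent: Y is a deterministic function of X.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal(1_000_000)
Y = X ** 2                               # deterministic in X, hence dependent

print(np.corrcoef(X, Y)[0, 1])           # approximately 0: Cov(X, X^2) = E[X^3] = 0
```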