Expectation is linear integration against a probability measure — it summarizes a distribution by its average behavior. Moments encode the shape of a distribution (mean, variance, skewness, kurtosis), while the moment generating function and characteristic function are dual representations that uniquely identify distributions and make the Central Limit Theorem provable. This lesson develops these tools and their ML applications.
Concepts
Every loss function in ML is an expectation: $L(\theta) = \mathbb{E}_{(x,y)\sim p_{\text{data}}}[\ell(f_\theta(x), y)]$. Every gradient is an expectation of a per-sample gradient. The Central Limit Theorem explains why averaging over minibatches gives useful gradient estimates. Variance bounds explain when those estimates are too noisy. The moment generating function is how the CLT is actually proved. This lesson is the mathematical machinery behind everything you do with "average over training examples."
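The minibatch-gradient claim can be checked numerically. A minimal sketch with a made-up least-squares problem (the dataset, weights, and all sizes below are illustrative, not from the lesson): averaging many random minibatch gradients recovers the full-dataset gradient, because each minibatch gradient is an unbiased estimate of it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: least-squares loss L(w) = E[(x . w - y)^2] / 2
# over a finite "dataset" of 1000 points in R^5.
X = rng.normal(size=(1000, 5))
y = X @ np.ones(5) + 0.1 * rng.normal(size=1000)
w = rng.normal(size=5)

def batch_grad(Xb, yb, w):
    # Average of per-sample gradients over the batch.
    return Xb.T @ (Xb @ w - yb) / len(yb)

full_grad = batch_grad(X, y, w)

# By linearity of expectation, the minibatch gradient is unbiased:
# its average over many random batches approaches the full gradient.
batches = [rng.choice(1000, size=32) for _ in range(4000)]
avg_minibatch = np.mean([batch_grad(X[i], y[i], w) for i in batches], axis=0)
```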
Expectation
For a random variable $X$ on $(\Omega, \mathcal{F}, P)$, the expectation is:

$$\mathbb{E}[X] = \int_\Omega X \, dP = \int_{-\infty}^{\infty} x \, dF_X(x).$$

For discrete $X$: $\mathbb{E}[X] = \sum_k x_k \, P(X = x_k)$. For continuous $X$ with PDF $f_X$: $\mathbb{E}[X] = \int x f_X(x) \, dx$.

Law of the unconscious statistician (LOTUS): for a measurable $g$:

$$\mathbb{E}[g(X)] = \int g(x) \, dP_X(x).$$
Key properties:

- Linearity: $\mathbb{E}[aX + bY] = a\,\mathbb{E}[X] + b\,\mathbb{E}[Y]$ (always, even without independence)
- Monotonicity: $X \le Y$ a.s. $\Rightarrow \mathbb{E}[X] \le \mathbb{E}[Y]$
- Independence: if $X \perp Y$: $\mathbb{E}[XY] = \mathbb{E}[X]\,\mathbb{E}[Y]$
- Jensen's inequality: for convex $\phi$: $\phi(\mathbb{E}[X]) \le \mathbb{E}[\phi(X)]$
Linearity holds always — without independence, without identical distribution, without any structural assumptions — because expectation is defined as integration, and integration is a linear operation. This is why the gradient of a sum of losses equals the sum of the gradients, why SGD is an unbiased estimator even with non-independent mini-batches, and why the bias-variance decomposition works. The linearity of expectation is not a theorem with conditions; it is a definition with consequences.
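Both linearity-without-independence and Jensen's inequality are easy to see in a Monte Carlo sketch. The choice $Y = X^2$ below is deliberately dependent on $X$ (all names and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# X and Y are strongly dependent: Y = X^2. No independence assumption holds.
X = rng.normal(size=100_000)
Y = X**2

# Linearity: E[2X + 3Y] = 2 E[X] + 3 E[Y], dependence notwithstanding.
lhs = np.mean(2 * X + 3 * Y)
rhs = 2 * np.mean(X) + 3 * np.mean(Y)

# Jensen: for the convex map phi(x) = x^2, phi(E[X]) <= E[phi(X)].
jensen_lhs = np.mean(X) ** 2
jensen_rhs = np.mean(X**2)
```

Linearity holds exactly (it is an identity of sums, not an approximation), while the Jensen gap here is essentially $\operatorname{Var}(X)$.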
Conditional expectation $\mathbb{E}[X \mid \mathcal{G}]$ for a sub-$\sigma$-algebra $\mathcal{G} \subseteq \mathcal{F}$ is the (almost-surely unique) $\mathcal{G}$-measurable random variable satisfying:

$$\int_G \mathbb{E}[X \mid \mathcal{G}] \, dP = \int_G X \, dP \quad \text{for all } G \in \mathcal{G}.$$

This is a projection: for square-integrable $X$, $\mathbb{E}[X \mid \mathcal{G}]$ minimizes $\mathbb{E}[(X - Z)^2]$ over all $\mathcal{G}$-measurable $Z$.
Moments and Cumulants
The $k$-th moment of $X$: $\mu_k = \mathbb{E}[X^k]$.
The $k$-th central moment: $\mathbb{E}[(X - \mu)^k]$, where $\mu = \mathbb{E}[X]$.
| Moment | Formula | Interpretation |
| --- | --- | --- |
| Mean $\mu$ | $\mathbb{E}[X]$ | Location/center |
| Variance $\sigma^2$ | $\mathbb{E}[(X-\mu)^2] = \mathbb{E}[X^2] - \mu^2$ | Spread |
| Skewness | $\mathbb{E}[(X-\mu)^3]/\sigma^3$ | Asymmetry |
| Excess kurtosis | $\mathbb{E}[(X-\mu)^4]/\sigma^4 - 3$ | Heavy-tailedness |
Variance decomposition: $\operatorname{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2$. For a sum of independent variables: $\operatorname{Var}\left(\sum_i X_i\right) = \sum_i \operatorname{Var}(X_i)$.
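Both identities are quick to verify by simulation. A minimal sketch (the particular exponential/Gaussian choices are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.exponential(2.0, size=200_000)          # Var(X) = 4
Y = rng.normal(1.0, 3.0, size=200_000)          # Var(Y) = 9, independent of X

# Var(X) = E[X^2] - (E[X])^2
var_decomp = np.mean(X**2) - np.mean(X) ** 2

# For independent X, Y: Var(X + Y) = Var(X) + Var(Y)  (approx. in samples)
var_sum = np.var(X + Y)
```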
Covariance and correlation:

$$\operatorname{Cov}(X,Y) = \mathbb{E}[(X - \mu_X)(Y - \mu_Y)] = \mathbb{E}[XY] - \mu_X \mu_Y,$$

$$\rho(X,Y) = \frac{\operatorname{Cov}(X,Y)}{\sqrt{\operatorname{Var}(X)\operatorname{Var}(Y)}} \in [-1, 1].$$
Covariance matrix. For a random vector $X \in \mathbb{R}^n$:

$$\Sigma = \operatorname{Cov}(X) = \mathbb{E}[(X - \mu)(X - \mu)^T], \qquad \Sigma_{ij} = \operatorname{Cov}(X_i, X_j).$$

$\Sigma$ is always PSD. For $Y = AX$: $\operatorname{Cov}(Y) = A \Sigma A^T$.
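Both facts (PSD-ness and the transformation rule) can be checked with numpy. A sketch with an arbitrary made-up covariance (all matrices below are illustrative); note that $\operatorname{Cov}(Y) = A\,\operatorname{Cov}(X)\,A^T$ holds exactly even for sample covariances:

```python
import numpy as np

rng = np.random.default_rng(3)

# A made-up PSD covariance: any A_true @ A_true.T is PSD.
A_true = rng.normal(size=(3, 3))
Sigma = A_true @ A_true.T
X = rng.multivariate_normal(np.zeros(3), Sigma, size=200_000)

S = np.cov(X, rowvar=False)          # sample covariance, approximates Sigma
eigs = np.linalg.eigvalsh(S)         # PSD: all eigenvalues >= 0

# Linear map Y = A X  =>  Cov(Y) = A Sigma A^T (exact for sample covariance too)
A = rng.normal(size=(2, 3))
Y = X @ A.T
S_Y = np.cov(Y, rowvar=False)
```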
Cumulants $\kappa_n$ are defined via the cumulant generating function (CGF) $K(t) = \log M_X(t)$: $\kappa_n = K^{(n)}(0)$.
| $n$ | Cumulant $\kappa_n$ |
| --- | --- |
| 1 | $\mu$ (mean) |
| 2 | $\sigma^2$ (variance) |
| 3 | $\mathbb{E}[(X-\mu)^3]$ (3rd central moment $=$ skewness $\times\,\sigma^3$) |
| $\ge 3$ | zero for a Gaussian |
The Gaussian is uniquely characterized by having all cumulants of order ≥3 equal to zero — this is the deepest characterization of the Gaussian distribution.
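This characterization is visible in data: estimated third and fourth cumulants of Gaussian samples hover near zero. A sketch using scipy's k-statistics, which are unbiased estimators of the first four cumulants (the particular mean and scale below are arbitrary):

```python
import numpy as np
from scipy.stats import kstat  # k-statistic: unbiased estimator of kappa_n

rng = np.random.default_rng(4)
z = rng.normal(loc=2.0, scale=1.5, size=200_000)

# First four cumulants of N(2, 1.5^2) are 2, 2.25, 0, 0.
k = [kstat(z, n) for n in (1, 2, 3, 4)]
```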
Moment Generating Function and Characteristic Function
Moment generating function (MGF):

$$M_X(t) = \mathbb{E}[e^{tX}] = \sum_{k=0}^{\infty} \frac{t^k}{k!}\, \mathbb{E}[X^k], \qquad t \in (-\delta, \delta).$$
The MGF exists when $\mathbb{E}[e^{tX}]$ is finite in a neighborhood of 0 (which fails for heavy-tailed distributions like the Pareto). When it exists:

Consequences. $M_X'(0) = \mathbb{E}[X] = \mu$ and $M_X''(0) = \mathbb{E}[X^2] = \mu^2 + \sigma^2$. For the Gaussian $\mathcal{N}(\mu, \sigma^2)$, the CGF is $K(t) = \log M_X(t) = \mu t + \sigma^2 t^2/2$, so the cumulants are $\kappa_1 = \mu$, $\kappa_2 = \sigma^2$, and $\kappa_k = 0$ for $k \ge 3$: the Gaussian has no higher cumulants.

CF: $\varphi_X(t) = e^{i\mu t - \sigma^2 t^2/2}$. The Gaussian CF is itself a Gaussian (in $t$), giving the Gaussian its self-dual Fourier transform property.
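The Gaussian MGF identities can be verified symbolically. A sketch using SymPy (symbol names are arbitrary): differentiate $M(t) = e^{\mu t + \sigma^2 t^2/2}$ at 0 for the moments, and the CGF for the cumulants.

```python
import sympy as sp

t, mu = sp.symbols('t mu', real=True)
sigma = sp.symbols('sigma', positive=True)

# Gaussian MGF: M(t) = exp(mu t + sigma^2 t^2 / 2)
M = sp.exp(mu * t + sigma**2 * t**2 / 2)

m1 = sp.diff(M, t).subs(t, 0)        # E[X] = mu
m2 = sp.diff(M, t, 2).subs(t, 0)     # E[X^2] = mu^2 + sigma^2

K = sp.log(M)                        # CGF: mu t + sigma^2 t^2 / 2
k3 = sp.simplify(sp.diff(K, t, 3)).subs(t, 0)   # third cumulant: 0
```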
Example 2: Characteristic Function Proof of CLT (Sketch)
Let $X_1, \dots, X_n$ be iid with mean 0 and variance 1. Let $S_n = (X_1 + \dots + X_n)/\sqrt{n}$. Then:

$$\varphi_{S_n}(t) = \left(\varphi_{X_1}(t/\sqrt{n})\right)^n.$$

Taylor expand: $\varphi_{X_1}(s) = 1 + i\,\mathbb{E}[X]\,s - \mathbb{E}[X^2]\,s^2/2 + O(s^3)$, so with $s = t/\sqrt{n}$: $\varphi_{X_1}(t/\sqrt{n}) = 1 - t^2/(2n) + O(n^{-3/2})$.

Hence:

$$\varphi_{S_n}(t) = \left(1 - \frac{t^2}{2n} + O(n^{-3/2})\right)^n \to e^{-t^2/2} \quad \text{as } n \to \infty.$$

Since $e^{-t^2/2}$ is the CF of $\mathcal{N}(0,1)$, by Lévy's continuity theorem: $S_n \xrightarrow{d} \mathcal{N}(0,1)$.
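The convergence of the CF can be seen empirically: compute the empirical CF of $S_n$ and compare to $e^{-t^2/2}$. A sketch using standardized uniform summands (the choice of uniforms and all sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
reps, n = 20_000, 256

# iid mean-0, variance-1 summands: uniforms on [-sqrt(3), sqrt(3)].
U = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=(reps, n))
S = U.sum(axis=1) / np.sqrt(n)            # S_n = (X_1 + ... + X_n) / sqrt(n)

t = np.linspace(-3.0, 3.0, 13)
emp_cf = np.exp(1j * np.outer(t, S)).mean(axis=1)   # empirical E[exp(i t S_n)]
gauss_cf = np.exp(-t**2 / 2)                        # CF of N(0, 1)
err = np.max(np.abs(emp_cf - gauss_cf))
```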
Example 3: Conditional Expectation as Projection
In $L^2(\Omega, \mathcal{F}, P)$ with inner product $\langle X, Y \rangle = \mathbb{E}[XY]$, the $\mathcal{G}$-measurable square-integrable random variables form a closed subspace $L^2(\Omega, \mathcal{G}, P)$.

The conditional expectation $\mathbb{E}[X \mid \mathcal{G}]$ is the orthogonal projection of $X$ onto this subspace: it satisfies $\langle X - \mathbb{E}[X \mid \mathcal{G}], Z \rangle = 0$ for all $\mathcal{G}$-measurable $Z \in L^2$. This is exactly the optimality condition that $\mathbb{E}[X \mid \mathcal{G}]$ minimizes the mean-squared prediction error over all $\mathcal{G}$-measurable predictors.
In ML: predicting Y from X using a linear function is projecting onto the linear subspace. Using any σ(X)-measurable function is projecting onto the full conditional expectation E[Y∣X] — the best possible predictor under squared loss.
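The nesting of these projections is easy to demonstrate. A sketch with a made-up quadratic relationship (so the true $\mathbb{E}[Y \mid X] = X^2$ is known in closed form; all names and sizes are illustrative): the best linear predictor leaves strictly more squared error than the conditional expectation.

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.uniform(-2.0, 2.0, size=200_000)
y = x**2 + rng.normal(0.0, 0.5, size=x.size)   # so E[Y | X = x] = x^2

# Projection onto the linear subspace {a + b x}: best linear predictor.
slope, intercept = np.polyfit(x, y, 1)
mse_linear = np.mean((y - (intercept + slope * x)) ** 2)

# Projection onto all sigma(X)-measurable functions: E[Y|X] = x^2.
mse_cond = np.mean((y - x**2) ** 2)
```

Here `mse_cond` is close to the irreducible noise variance $0.25$, while the linear predictor also pays the full variance of the unexplained quadratic component.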
Connections
Where Your Intuition Breaks
Variance is finite for most distributions you use in practice (Gaussian, Bernoulli, Poisson). But heavy-tailed distributions (Pareto, Cauchy, power laws) can have infinite variance or even an undefined mean. The sample average of Cauchy draws does not converge: each new sample can move the average by an arbitrarily large amount. This matters in ML: gradient estimates for objectives over heavy-tailed data (financial returns, word-frequency distributions) can have infinite variance, making SGD arbitrarily noisy. Gradient clipping is a practical fix, but it turns the unbiased average into a biased estimator. The theoretical justification for gradient clipping requires understanding why this Cauchy-style pathology makes unbiased estimation infeasible.
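The Cauchy pathology takes one simulation to see: plot (or here, compare) the running mean of Cauchy draws against that of Gaussian draws. A sketch (sample sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
N = 1_000_000
steps = np.arange(1, N + 1)

# Running mean of Cauchy samples: never settles. The mean is undefined,
# and the running mean at every n is itself Cauchy-distributed.
cauchy_mean = np.cumsum(rng.standard_cauchy(N)) / steps

# Contrast: the Gaussian running mean converges to 0 (law of large numbers).
gauss_mean = np.cumsum(rng.normal(size=N)) / steps
```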
💡Intuition
The characteristic function always exists and uniquely determines the distribution. The MGF can fail to exist (heavy-tailed distributions have $\mathbb{E}[e^{tX}] = \infty$ for every $t > 0$), but the CF exists for every distribution because $|e^{itx}| = 1$. The inversion formula recovers the distribution (the PDF, when one exists) from the CF, showing the bijection between distributions and their CFs. This makes the CF the correct tool for proving convergence in distribution (via Lévy's continuity theorem), while the MGF is more convenient when it exists because it avoids complex numbers.
💡Intuition
Cumulants add under convolution. For independent $X, Y$: $\kappa_n(X+Y) = \kappa_n(X) + \kappa_n(Y)$. This is why the variance of a sum of independent variables is the sum of variances: variance is the second cumulant. It also implies: if $X_1, \dots, X_n$ are iid with cumulants $\kappa_k$ (take them centered, so $\kappa_1 = 0$), then $S_n = \sum_i X_i$ has cumulants $n\kappa_k$. The normalized sum $S_n/\sqrt{n}$ has cumulants $n\kappa_k/n^{k/2} = \kappa_k/n^{k/2 - 1}$, which go to zero for $k \ge 3$ as $n \to \infty$, leaving only the first two cumulants (the Gaussian ones). The CLT is the statement that only the first two cumulants survive normalization.
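Cumulant additivity can be checked with the same k-statistics as above. A sketch with Exp(1) draws, whose cumulants are known in closed form ($\kappa_n = (n-1)!$, so $\kappa_3 = 2$; sample sizes are arbitrary):

```python
import numpy as np
from scipy.stats import kstat  # unbiased estimator of the n-th cumulant

rng = np.random.default_rng(8)
x = rng.exponential(size=300_000)
y = rng.exponential(size=300_000)   # independent of x

# kappa_3 of Exp(1) is 2; for the independent sum it should be 2 + 2 = 4.
k3_sum = kstat(x + y, 3)
k3_parts = kstat(x, 3) + kstat(y, 3)
```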
⚠️Warning
Chebyshev's bound is often very loose. Chebyshev gives $P(|X - \mu| \ge k\sigma) \le 1/k^2$; for $k = 3$ this is $1/9 \approx 11\%$, while the true Gaussian tail probability is $0.27\%$. The bound is tight for the distribution that places mass $1/(2k^2)$ at each of $\pm k\sigma$ and $1 - 1/k^2$ at 0, but typical ML random variables are much better behaved. Concentration inequalities (Hoeffding, Bernstein, sub-Gaussian; covered in the bridge lesson) give exponentially tighter bounds for bounded or light-tailed random variables.
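The gap between Chebyshev and the exact Gaussian tail is a two-liner to compute (the chosen $k$ values are arbitrary):

```python
import numpy as np
from scipy.stats import norm

ks = np.array([2.0, 3.0, 4.0])
chebyshev = 1.0 / ks**2        # P(|X - mu| >= k sigma) <= 1/k^2, any distribution
gaussian = 2.0 * norm.sf(ks)   # exact two-sided tail for a Gaussian
```

At $k = 3$ the exact tail is roughly $0.0027$ against Chebyshev's $0.111$, a factor of about forty.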