Neural-Path/Notes

Expectation, Moments, Characteristic Functions & Generating Functions

Expectation is integration against a probability measure — it summarizes a distribution by its average behavior. Moments encode the shape of a distribution (mean, variance, skewness, kurtosis), while the moment generating function and characteristic function are dual representations that uniquely identify distributions and make the Central Limit Theorem provable. This lesson develops these tools and their ML applications.

Concepts

Every loss function in ML is an expectation: $\mathcal{L}(\theta) = \mathbb{E}_{(x,y)\sim p_{\text{data}}}[\ell(f_\theta(x), y)]$. Every gradient is an expectation of a per-sample gradient. The Central Limit Theorem explains why averaging over minibatches gives useful gradient estimates. Variance bounds explain when those estimates are too noisy. The moment generating function is how the CLT is actually proved. This lesson is the mathematical machinery behind everything you do with "average over training examples."

Expectation

For a random variable $X$ on $(\Omega, \mathcal{F}, P)$, the expectation is:

$$\mathbb{E}[X] = \int_\Omega X \, dP = \int_{-\infty}^\infty x \, dF_X(x).$$

For discrete $X$: $\mathbb{E}[X] = \sum_k x_k P(X = x_k)$. For continuous $X$ with PDF $f_X$: $\mathbb{E}[X] = \int x f_X(x)\,dx$.

Law of the unconscious statistician (LOTUS): for a measurable $g$:

$$\mathbb{E}[g(X)] = \int g(x) \, dP_X(x).$$

Key properties:

  • Linearity: $\mathbb{E}[aX + bY] = a\mathbb{E}[X] + b\mathbb{E}[Y]$ (always, even without independence)
  • Monotonicity: $X \leq Y$ a.s. $\Rightarrow \mathbb{E}[X] \leq \mathbb{E}[Y]$
  • Independence: if $X \perp Y$: $\mathbb{E}[XY] = \mathbb{E}[X]\mathbb{E}[Y]$
  • Jensen's inequality: for convex $\phi$: $\phi(\mathbb{E}[X]) \leq \mathbb{E}[\phi(X)]$

Linearity holds always — without independence, without identical distribution, without any structural assumptions — because expectation is defined as integration, and integration is a linear operation. This is why the gradient of a sum of losses equals the sum of the gradients, why SGD is an unbiased estimator even with non-independent mini-batches, and why the bias-variance decomposition works. The linearity of expectation is not a theorem with conditions; it is a definition with consequences.
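Linearity is easy to verify numerically even for strongly dependent variables. A minimal NumPy sketch (the distributions and constants here are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# X and Y are deliberately dependent: Y is a deterministic function of X.
x = rng.normal(loc=1.0, scale=2.0, size=n)
y = x**2                       # E[Y] = Var(X) + E[X]^2 = 4 + 1 = 5

a, b = 3.0, -2.0
lhs = np.mean(a * x + b * y)   # estimate of E[aX + bY]
rhs = a * np.mean(x) + b * np.mean(y)

# Linearity holds for the sample mean exactly (up to float rounding),
# with no independence assumption anywhere.
assert abs(lhs - rhs) < 1e-9
```

Note that the identity holds sample-by-sample, not just in expectation: the sample mean is itself a linear operation.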

Conditional expectation $\mathbb{E}[X \mid \mathcal{G}]$ for a sub-$\sigma$-algebra $\mathcal{G} \subseteq \mathcal{F}$ is the (a.s. unique) $\mathcal{G}$-measurable random variable satisfying:

$$\int_G \mathbb{E}[X|\mathcal{G}] \, dP = \int_G X \, dP \quad \forall G \in \mathcal{G}.$$

This is a projection: $\mathbb{E}[X|\mathcal{G}]$ minimizes $\mathbb{E}[(X - Z)^2]$ over all $\mathcal{G}$-measurable $Z$.

Moments and Cumulants

The $k$-th moment of $X$: $\mu_k = \mathbb{E}[X^k]$.

The $k$-th central moment: $\mathbb{E}[(X-\mu)^k]$, where $\mu = \mathbb{E}[X]$.

| Moment | Formula | Interpretation |
| --- | --- | --- |
| Mean $\mu$ | $\mathbb{E}[X]$ | Location/center |
| Variance $\sigma^2$ | $\mathbb{E}[(X-\mu)^2] = \mathbb{E}[X^2] - \mu^2$ | Spread |
| Skewness | $\mathbb{E}[(X-\mu)^3]/\sigma^3$ | Asymmetry |
| Excess kurtosis | $\mathbb{E}[(X-\mu)^4]/\sigma^4 - 3$ | Heavy-tailedness |

Variance decomposition: $\text{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2$. For a sum of independent variables: $\text{Var}(\sum_i X_i) = \sum_i \text{Var}(X_i)$.
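A quick Monte Carlo check of the independent-sum rule (the two distributions are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500_000

# Independent draws with known variances 4 and 9.
x = rng.normal(0.0, 2.0, size=n)
y = rng.normal(0.0, 3.0, size=n)

var_sum = np.var(x + y)   # should be close to Var(X) + Var(Y) = 13
cross = np.mean(x * y)    # E[XY] = E[X]E[Y] = 0 under independence
```

With dependent variables the cross term $2\,\text{Cov}(X,Y)$ would appear and `var_sum` would deviate from 13.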

Covariance and correlation:

$$\text{Cov}(X,Y) = \mathbb{E}[(X-\mu_X)(Y-\mu_Y)] = \mathbb{E}[XY] - \mu_X\mu_Y,$$

$$\rho(X,Y) = \frac{\text{Cov}(X,Y)}{\sqrt{\text{Var}(X)\text{Var}(Y)}} \in [-1,1].$$

Covariance matrix. For a random vector $\mathbf{X} \in \mathbb{R}^n$:

$$\Sigma = \text{Cov}(\mathbf{X}) = \mathbb{E}[(\mathbf{X}-\boldsymbol\mu)(\mathbf{X}-\boldsymbol\mu)^T], \quad \Sigma_{ij} = \text{Cov}(X_i, X_j).$$

$\Sigma$ is always positive semidefinite (PSD). For $Y = A\mathbf{X}$: $\text{Cov}(Y) = A\Sigma A^T$.
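Both the transformation rule $\text{Cov}(A\mathbf{X}) = A\Sigma A^T$ and the PSD property can be checked by simulation. A sketch with an arbitrary $\Sigma$ constructed as $A_0 A_0^T$ (the specific matrices are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 400_000

# Build a known covariance Sigma = A0 @ A0.T and sample X with that covariance.
A0 = np.array([[2.0, 0.0, 0.0],
               [1.0, 1.0, 0.0],
               [0.5, 0.5, 1.0]])
Sigma = A0 @ A0.T
X = rng.standard_normal(size=(n, 3)) @ A0.T   # rows have covariance Sigma

# Cov(AX) = A Sigma A^T for any fixed linear map A.
A = np.array([[1.0, -1.0, 0.0],
              [0.0,  2.0, 1.0]])
Y = X @ A.T
emp = np.cov(Y, rowvar=False)     # empirical 2x2 covariance of the rows of Y
theory = A @ Sigma @ A.T
```

The empirical covariance `emp` matches `theory` up to sampling noise, and its eigenvalues are nonnegative.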

Cumulants $\kappa_n$ are defined via the cumulant generating function (CGF) $K(t) = \log M_X(t)$: $\kappa_n = K^{(n)}(0)$.

| $n$ | Cumulant $\kappa_n$ |
| --- | --- |
| 1 | $\mu$ (mean) |
| 2 | $\sigma^2$ (variance) |
| 3 | $\mathbb{E}[(X-\mu)^3]$ (3rd central moment $=$ skewness $\times \sigma^3$) |
| $\geq 3$ | Zero for Gaussian |

The Gaussian is uniquely characterized by having all cumulants of order $\geq 3$ equal to zero. By Marcinkiewicz's theorem, no other distribution has a cumulant generating function that is a polynomial — this is one of the deepest characterizations of the Gaussian.
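A numerical illustration, estimating the third and fourth cumulants from samples via the standard identities $\kappa_3 = \mathbb{E}[(X-\mu)^3]$ and $\kappa_4 = \mathbb{E}[(X-\mu)^4] - 3\sigma^4$. The Exp(1) comparison (which has $\kappa_n = (n-1)!$) is my choice of a non-Gaussian example:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500_000

def central_moment(x, k):
    """Empirical k-th central moment of a sample."""
    return np.mean((x - x.mean())**k)

gauss = rng.normal(1.0, 2.0, size=n)
expo = rng.exponential(1.0, size=n)    # Exp(1): kappa_n = (n-1)!

# kappa_3 = third central moment.
k3_gauss = central_moment(gauss, 3)    # ~ 0 for any Gaussian
k3_expo = central_moment(expo, 3)      # ~ 2! = 2

# kappa_4 = fourth central moment - 3 * variance^2.
k4_gauss = central_moment(gauss, 4) - 3 * np.var(gauss)**2   # ~ 0
k4_expo = central_moment(expo, 4) - 3 * np.var(expo)**2      # ~ 3! = 6
```

The Gaussian estimates hover near zero; the exponential's do not.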

Moment Generating Function and Characteristic Function

Moment generating function (MGF):

$$M_X(t) = \mathbb{E}[e^{tX}] = \sum_{k=0}^\infty \frac{t^k}{k!}\mathbb{E}[X^k], \quad t \in (-\delta, \delta).$$

The MGF exists when $\mathbb{E}[e^{tX}]$ is finite in a neighborhood of $0$ (which fails for heavy-tailed distributions like the Pareto). When it exists:

$$\mathbb{E}[X^k] = M_X^{(k)}(0), \qquad M_{aX+b}(t) = e^{bt}M_X(at), \qquad M_{X+Y}(t) = M_X(t)M_Y(t) \text{ (indep.)}.$$

Characteristic function (CF):

$$\varphi_X(t) = \mathbb{E}[e^{itX}] = \int e^{itx}\,dP_X(x), \quad t \in \mathbb{R}.$$

The CF always exists (it is the Fourier transform of the distribution) and uniquely determines the distribution.

Key properties:

  • $|\varphi_X(t)| \leq 1$, $\varphi_X(0) = 1$
  • $\varphi_{aX+b}(t) = e^{ibt}\varphi_X(at)$
  • $\varphi_{X+Y}(t) = \varphi_X(t)\varphi_Y(t)$ for independent $X, Y$
  • Inversion formula: $f_X(x) = \frac{1}{2\pi}\int_{-\infty}^\infty e^{-itx}\varphi_X(t)\,dt$ (valid when $\varphi_X$ is integrable; then $f_X$ exists)

Lévy's continuity theorem. $X_n \xrightarrow{d} X$ iff $\varphi_{X_n}(t) \to \varphi_X(t)$ for all $t \in \mathbb{R}$. This is the key tool for proving the CLT via characteristic functions.

Moment Inequalities

Markov's inequality. For $X \geq 0$ and $t > 0$:

$$P(X \geq t) \leq \frac{\mathbb{E}[X]}{t}.$$

Proof. $\mathbb{E}[X] \geq \mathbb{E}[X \cdot \mathbf{1}_{X\geq t}] \geq t\cdot P(X \geq t)$.

Chebyshev's inequality. For any $X$ with finite variance and $t > 0$:

$$P(|X - \mu| \geq t) \leq \frac{\text{Var}(X)}{t^2}.$$

Proof. Apply Markov to $(X-\mu)^2$ and threshold $t^2$.
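Both inequalities hold exactly for the empirical distribution of any sample, not just in the limit, because their proofs use only the pointwise bound $x \geq t\,\mathbf{1}_{x \geq t}$. A sketch with an exponential sample (parameters arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.exponential(2.0, size=1_000_000)   # X >= 0, E[X] = 2, Var(X) = 4

t = 5.0
markov_bound = x.mean() / t                # E[X]/t
tail = np.mean(x >= t)                     # P(X >= t); true value e^{-2.5} ~ 0.082

cheb_bound = np.var(x) / t**2              # Var(X)/t^2
dev_tail = np.mean(np.abs(x - x.mean()) >= t)

# Both bounds hold exactly on the sample, by the same one-line proofs.
assert tail <= markov_bound
assert dev_tail <= cheb_bound
```

Note how loose Markov is here: the bound is $2/5 = 0.4$ against a true tail of about $0.082$.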

Cauchy-Schwarz. $|\mathbb{E}[XY]|^2 \leq \mathbb{E}[X^2]\mathbb{E}[Y^2]$. This gives $|\rho(X,Y)| \leq 1$.

Lyapunov's inequality. For $1 \leq r \leq s$: $\mathbb{E}[|X|^r]^{1/r} \leq \mathbb{E}[|X|^s]^{1/s}$; that is, $L^p$ norms increase with $p$.

Worked Example

Example 1: MGF of the Gaussian

For $X \sim \mathcal{N}(\mu, \sigma^2)$:

$$M_X(t) = \mathbb{E}[e^{tX}] = e^{\mu t + \sigma^2 t^2/2}.$$

Proof. Complete the square in the exponent:

$$\int e^{tx} \frac{e^{-(x-\mu)^2/(2\sigma^2)}}{\sqrt{2\pi}\sigma}\,dx = e^{\mu t + \sigma^2 t^2/2}\int \frac{e^{-(x-(\mu+\sigma^2 t))^2/(2\sigma^2)}}{\sqrt{2\pi}\sigma}\,dx = e^{\mu t + \sigma^2 t^2/2}.$$

Consequences. $M_X'(0) = \mu = \mathbb{E}[X]$. $M_X''(0) = \mu^2 + \sigma^2 = \mathbb{E}[X^2]$. CGF: $K(t) = \log M_X(t) = \mu t + \sigma^2 t^2/2$, so the cumulants are $\kappa_1 = \mu$, $\kappa_2 = \sigma^2$, and $\kappa_k = 0$ for $k \geq 3$. The Gaussian has no higher cumulants.

CF: $\varphi_X(t) = e^{i\mu t - \sigma^2 t^2/2}$. The Gaussian CF is again a Gaussian in $t$, reflecting the self-dual Fourier transform property of the Gaussian.
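Both closed forms are easy to sanity-check by Monte Carlo (the parameters $\mu$, $\sigma$, and the evaluation point $t$ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
mu, sigma = 0.5, 1.5
x = rng.normal(mu, sigma, size=1_000_000)

t = 0.7
mgf_mc = np.mean(np.exp(t * x))                    # Monte Carlo E[e^{tX}]
mgf_closed = np.exp(mu * t + sigma**2 * t**2 / 2)  # closed-form MGF

cf_mc = np.mean(np.exp(1j * t * x))                # Monte Carlo E[e^{itX}]
cf_closed = np.exp(1j * mu * t - sigma**2 * t**2 / 2)
```

The Monte Carlo estimates agree with the closed forms up to sampling noise; the CF estimate is noticeably more stable because $|e^{itx}| = 1$ bounds each summand.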

Example 2: Characteristic Function Proof of CLT (Sketch)

Let $X_1, \ldots, X_n$ be iid with mean 0, variance 1. Let $S_n = (X_1 + \cdots + X_n)/\sqrt{n}$. Then:

$$\varphi_{S_n}(t) = \left(\varphi_{X_1}(t/\sqrt{n})\right)^n.$$

Taylor expand: $\varphi_{X_1}(s) = 1 + i\mathbb{E}[X_1]s - \mathbb{E}[X_1^2]s^2/2 + o(s^2)$, so with $s = t/\sqrt{n}$: $\varphi_{X_1}(t/\sqrt{n}) = 1 - t^2/(2n) + o(1/n)$.

Hence: $\varphi_{S_n}(t) = \left(1 - \frac{t^2}{2n} + o(1/n)\right)^n \to e^{-t^2/2}$ as $n\to\infty$.

Since $e^{-t^2/2}$ is the CF of $\mathcal{N}(0,1)$, by Lévy's continuity theorem: $S_n \xrightarrow{d} \mathcal{N}(0,1)$.
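The convergence of $\varphi_{S_n}$ to $e^{-t^2/2}$ can be watched empirically. A sketch using centered uniforms, which have mean 0 and variance 1 on $(-\sqrt{3}, \sqrt{3})$ (the choice of distribution, $n$, and $t$ is illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
reps, n = 20_000, 200

# iid Uniform(-sqrt(3), sqrt(3)): mean 0, variance 1.
x = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=(reps, n))
s = x.sum(axis=1) / np.sqrt(n)         # reps independent draws of S_n

t = 1.3
cf_emp = np.mean(np.exp(1j * t * s))   # empirical CF of S_n at t
cf_gauss = np.exp(-t**2 / 2)           # CF of N(0,1) at t
```

For $n = 200$ the empirical CF already sits within sampling noise of the Gaussian CF, exactly as the product expansion predicts.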

Example 3: Conditional Expectation as Projection

In $L^2(\Omega, \mathcal{F}, P)$ with inner product $\langle X, Y\rangle = \mathbb{E}[XY]$, the $\mathcal{G}$-measurable square-integrable random variables form a closed subspace $L^2(\Omega, \mathcal{G}, P)$.

The conditional expectation $\mathbb{E}[X|\mathcal{G}]$ is the orthogonal projection of $X$ onto this subspace: it satisfies $\langle X - \mathbb{E}[X|\mathcal{G}], Z\rangle = 0$ for all $\mathcal{G}$-measurable $Z \in L^2$. This is exactly the optimality condition that $\mathbb{E}[X|\mathcal{G}]$ minimizes the mean-squared prediction error over all $\mathcal{G}$-measurable predictors.

In ML: predicting $Y$ from $X$ using a linear function is projecting onto the linear subspace. Using any $\sigma(X)$-measurable function is projecting onto the full conditional expectation $\mathbb{E}[Y|X]$, the best possible predictor under squared loss.
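A small simulation of this gap: when the true relationship is nonlinear, the linear projection is strictly worse than $\mathbb{E}[Y|X]$ under squared loss (the model $Y = X^2 + \varepsilon$ and the noise level are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200_000

x = rng.normal(size=n)
y = x**2 + 0.1 * rng.normal(size=n)   # E[Y|X] = X^2; noise variance 0.01

# Best linear predictor a + b*x by least squares.
b, a = np.polyfit(x, y, 1)
mse_linear = np.mean((y - (a + b * x))**2)

# The conditional expectation E[Y|X] = X^2 as the predictor.
mse_cond = np.mean((y - x**2)**2)     # ~ 0.01, the irreducible noise
```

Here $\text{Cov}(X, X^2) = 0$, so the best linear predictor is essentially the constant $\mathbb{E}[Y]$ and its MSE is close to $\text{Var}(Y) \approx 2.01$, while the conditional expectation achieves only the irreducible noise.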

Connections

Where Your Intuition Breaks

Variance is finite for most distributions you use in practice — Gaussian, Bernoulli, Poisson. But heavy-tailed distributions (Pareto, Cauchy, power-law) can have infinite variance or even an undefined mean. When you compute a sample average of Cauchy-distributed draws, the result does not converge: each new sample can move the average by an arbitrarily large amount. This matters in ML: gradient estimates for objectives with heavy-tailed data (financial returns, text frequency distributions) can have infinite variance, making SGD arbitrarily noisy. Gradient clipping is a practical fix, but it changes the estimator from an unbiased average to a biased one. The theoretical justification for gradient clipping requires understanding why the Cauchy-style pathology makes unbiased estimation infeasible.
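This pathology is visible in simulation: averaging more Gaussian samples shrinks the spread of the sample mean like $1/\sqrt{n}$, while averaging Cauchy samples does not shrink it at all, because the mean of $n$ standard Cauchy draws is again standard Cauchy. A sketch comparing interquartile ranges of sample means:

```python
import numpy as np

rng = np.random.default_rng(8)
reps = 1000

def iqr_of_means(sampler, n):
    """Interquartile range of `reps` independent sample means of size n."""
    means = sampler(size=(reps, n)).mean(axis=1)
    q75, q25 = np.percentile(means, [75, 25])
    return q75 - q25

# Gaussian: spread of the sample mean shrinks like 1/sqrt(n).
iqr_g_100 = iqr_of_means(rng.standard_normal, 100)
iqr_g_10k = iqr_of_means(rng.standard_normal, 10_000)

# Cauchy: the mean of n draws is again Cauchy(0, 1) -- no shrinkage.
iqr_c_100 = iqr_of_means(rng.standard_cauchy, 100)
iqr_c_10k = iqr_of_means(rng.standard_cauchy, 10_000)
```

The Gaussian IQR drops by roughly a factor of 10 going from $n = 100$ to $n = 10{,}000$; the Cauchy IQR stays near 2 (the IQR of a standard Cauchy) regardless of $n$.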

💡Intuition

The characteristic function always exists and uniquely determines the distribution. The MGF can fail to exist (heavy-tailed distributions have $\mathbb{E}[e^{tX}] = \infty$ for any $t > 0$), but the CF exists for every distribution because $|e^{itx}| = 1$. The inversion formula recovers the PDF from the CF — showing the bijection between distributions and their CFs. This makes the CF the correct tool for proving convergence in distribution (via Lévy's continuity theorem), while the MGF is more convenient when it exists because it avoids complex numbers.

💡Intuition

Cumulants add under convolution. For independent $X, Y$: $\kappa_n(X+Y) = \kappa_n(X) + \kappa_n(Y)$. This is why the variance of a sum of independent variables is the sum of variances — variance is the second cumulant. It also implies: if $X_1,\ldots,X_n$ are iid with cumulants $\kappa_k$, then $S_n = \sum_i X_i$ has cumulants $n\kappa_k$. The normalized sum $S_n/\sqrt{n}$ has cumulants $n\kappa_k/n^{k/2} = \kappa_k/n^{k/2-1}$, which go to zero for $k \geq 3$ as $n\to\infty$, leaving only $\kappa_1$ and $\kappa_2$ (Gaussian). The CLT is the statement that only the first two cumulants survive normalization.

⚠️Warning

Chebyshev's bound is often very loose. Chebyshev gives $P(|X-\mu| \geq k\sigma) \leq 1/k^2$; for $k=3$ this is $1/9 \approx 11\%$, while the true Gaussian tail probability is $0.27\%$. The bound is tight for the distribution that places mass $1/(2k^2)$ at each of $\mu \pm k\sigma$ and mass $1-1/k^2$ at $\mu$, but typical ML random variables are much better behaved. Concentration inequalities (Hoeffding, Bernstein, sub-Gaussian; covered in the bridge lesson) give exponentially tighter bounds for bounded or light-tailed random variables.
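The numbers quoted above can be reproduced with the standard library alone; the exact two-sided Gaussian tail is $P(|Z| \geq k) = \text{erfc}(k/\sqrt{2})$:

```python
import math

k = 3.0
cheb = 1 / k**2                           # Chebyshev bound: 1/9 ~ 11.1%
gauss_tail = math.erfc(k / math.sqrt(2))  # P(|Z| >= 3), Z ~ N(0,1): ~ 0.27%

ratio = cheb / gauss_tail                 # Chebyshev is ~40x too pessimistic here
```

For sub-Gaussian variables the gap only widens as $k$ grows, since the true tail decays like $e^{-k^2/2}$ against Chebyshev's polynomial $1/k^2$.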
