Expectation is linear integration against a probability measure — it summarizes a distribution by its average behavior. Moments encode the shape of a distribution (mean, variance, skewness, kurtosis), while the moment generating function and characteristic function are dual representations that uniquely identify distributions and make the Central Limit Theorem provable. This lesson develops these tools and their ML applications.
Concepts
Every loss function in ML is an expectation: $L(\theta) = \mathbb{E}_{(x,y)\sim p_{\text{data}}}[\ell(f_\theta(x), y)]$. Every gradient is an expectation of a per-sample gradient. The Central Limit Theorem explains why averaging over minibatches gives useful gradient estimates. Variance bounds explain when those estimates are too noisy. The moment generating function is how the CLT is actually proved. This lesson is the mathematical machinery behind everything you do with "average over training examples."
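The minibatch-gradient claim can be checked numerically. A minimal sketch with a made-up least-squares problem (the dataset, weights, and all sizes below are illustrative, not from the lesson): averaging many random minibatch gradients recovers the full-dataset gradient, because each minibatch gradient is an unbiased estimate of it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: least-squares loss L(w) = E[(x . w - y)^2] / 2
# over a finite "dataset" of 1000 points in R^5.
X = rng.normal(size=(1000, 5))
y = X @ np.ones(5) + 0.1 * rng.normal(size=1000)
w = rng.normal(size=5)

def batch_grad(Xb, yb, w):
    # Average of per-sample gradients over the batch.
    return Xb.T @ (Xb @ w - yb) / len(yb)

full_grad = batch_grad(X, y, w)

# By linearity of expectation, the minibatch gradient is unbiased:
# its average over many random batches approaches the full gradient.
batches = [rng.choice(1000, size=32) for _ in range(4000)]
avg_minibatch = np.mean([batch_grad(X[i], y[i], w) for i in batches], axis=0)
```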
Expectation
For a random variable $X$ on $(\Omega, \mathcal{F}, P)$, the expectation is:

$$\mathbb{E}[X] = \int_\Omega X \, dP = \int_{-\infty}^{\infty} x \, dF_X(x).$$

For discrete $X$: $\mathbb{E}[X] = \sum_k x_k \, P(X = x_k)$. For continuous $X$ with PDF $f_X$: $\mathbb{E}[X] = \int x f_X(x) \, dx$.

Law of the unconscious statistician (LOTUS): for a measurable $g$:

$$\mathbb{E}[g(X)] = \int g(x) \, dP_X(x).$$
Key properties:

- Linearity: $\mathbb{E}[aX + bY] = a\,\mathbb{E}[X] + b\,\mathbb{E}[Y]$ (always, even without independence)
- Monotonicity: $X \le Y$ a.s. $\Rightarrow \mathbb{E}[X] \le \mathbb{E}[Y]$
- Independence: if $X \perp Y$: $\mathbb{E}[XY] = \mathbb{E}[X]\,\mathbb{E}[Y]$
- Jensen's inequality: for convex $\phi$: $\phi(\mathbb{E}[X]) \le \mathbb{E}[\phi(X)]$
Linearity holds always — without independence, without identical distribution, without any structural assumptions — because expectation is defined as integration, and integration is a linear operation. This is why the gradient of a sum of losses equals the sum of the gradients, why SGD is an unbiased estimator even with non-independent mini-batches, and why the bias-variance decomposition works. The linearity of expectation is not a theorem with conditions; it is a definition with consequences.
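Both linearity-without-independence and Jensen's inequality are easy to see in a Monte Carlo sketch. The choice $Y = X^2$ below is deliberately dependent on $X$ (all names and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# X and Y are strongly dependent: Y = X^2. No independence assumption holds.
X = rng.normal(size=100_000)
Y = X**2

# Linearity: E[2X + 3Y] = 2 E[X] + 3 E[Y], dependence notwithstanding.
lhs = np.mean(2 * X + 3 * Y)
rhs = 2 * np.mean(X) + 3 * np.mean(Y)

# Jensen: for the convex map phi(x) = x^2, phi(E[X]) <= E[phi(X)].
jensen_lhs = np.mean(X) ** 2
jensen_rhs = np.mean(X**2)
```

Linearity holds exactly (it is an identity of sums, not an approximation), while the Jensen gap here is essentially $\operatorname{Var}(X)$.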
Conditional expectation $\mathbb{E}[X \mid \mathcal{G}]$ for a sub-$\sigma$-algebra $\mathcal{G} \subseteq \mathcal{F}$ is the (almost-surely unique) $\mathcal{G}$-measurable random variable satisfying:

$$\int_G \mathbb{E}[X \mid \mathcal{G}] \, dP = \int_G X \, dP \quad \text{for all } G \in \mathcal{G}.$$

This is a projection: for square-integrable $X$, $\mathbb{E}[X \mid \mathcal{G}]$ minimizes $\mathbb{E}[(X - Z)^2]$ over all $\mathcal{G}$-measurable $Z$.
Moments and Cumulants
The $k$-th moment of $X$: $\mu_k = \mathbb{E}[X^k]$.
The $k$-th central moment: $\mathbb{E}[(X - \mu)^k]$, where $\mu = \mathbb{E}[X]$.
| Moment | Formula | Interpretation |
| --- | --- | --- |
| Mean $\mu$ | $\mathbb{E}[X]$ | Location/center |
| Variance $\sigma^2$ | $\mathbb{E}[(X-\mu)^2] = \mathbb{E}[X^2] - \mu^2$ | Spread |
| Skewness | $\mathbb{E}[(X-\mu)^3]/\sigma^3$ | Asymmetry |
| Excess kurtosis | $\mathbb{E}[(X-\mu)^4]/\sigma^4 - 3$ | Heavy-tailedness |
Variance decomposition: $\operatorname{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2$. For a sum of independent variables: $\operatorname{Var}\left(\sum_i X_i\right) = \sum_i \operatorname{Var}(X_i)$.
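Both identities are quick to verify by simulation. A minimal sketch (the particular exponential/Gaussian choices are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.exponential(2.0, size=200_000)          # Var(X) = 4
Y = rng.normal(1.0, 3.0, size=200_000)          # Var(Y) = 9, independent of X

# Var(X) = E[X^2] - (E[X])^2
var_decomp = np.mean(X**2) - np.mean(X) ** 2

# For independent X, Y: Var(X + Y) = Var(X) + Var(Y)  (approx. in samples)
var_sum = np.var(X + Y)
```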
Covariance and correlation:

$$\operatorname{Cov}(X,Y) = \mathbb{E}[(X - \mu_X)(Y - \mu_Y)] = \mathbb{E}[XY] - \mu_X \mu_Y,$$

$$\rho(X,Y) = \frac{\operatorname{Cov}(X,Y)}{\sqrt{\operatorname{Var}(X)\operatorname{Var}(Y)}} \in [-1, 1].$$
Covariance matrix. For a random vector $X \in \mathbb{R}^n$:

$$\Sigma = \operatorname{Cov}(X) = \mathbb{E}[(X - \mu)(X - \mu)^T], \qquad \Sigma_{ij} = \operatorname{Cov}(X_i, X_j).$$

$\Sigma$ is always PSD. For $Y = AX$: $\operatorname{Cov}(Y) = A \Sigma A^T$.
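Both facts (PSD-ness and the transformation rule) can be checked with numpy. A sketch with an arbitrary made-up covariance (all matrices below are illustrative); note that $\operatorname{Cov}(Y) = A\,\operatorname{Cov}(X)\,A^T$ holds exactly even for sample covariances:

```python
import numpy as np

rng = np.random.default_rng(3)

# A made-up PSD covariance: any A_true @ A_true.T is PSD.
A_true = rng.normal(size=(3, 3))
Sigma = A_true @ A_true.T
X = rng.multivariate_normal(np.zeros(3), Sigma, size=200_000)

S = np.cov(X, rowvar=False)          # sample covariance, approximates Sigma
eigs = np.linalg.eigvalsh(S)         # PSD: all eigenvalues >= 0

# Linear map Y = A X  =>  Cov(Y) = A Sigma A^T (exact for sample covariance too)
A = rng.normal(size=(2, 3))
Y = X @ A.T
S_Y = np.cov(Y, rowvar=False)
```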
Cumulants $\kappa_n$ are defined via the cumulant generating function (CGF) $K(t) = \log M_X(t)$: $\kappa_n = K^{(n)}(0)$.
| $n$ | Cumulant $\kappa_n$ |
| --- | --- |
| 1 | $\mu$ (mean) |
| 2 | $\sigma^2$ (variance) |
| 3 | $\mathbb{E}[(X-\mu)^3]$ (3rd central moment $=$ skewness $\times\,\sigma^3$) |
| $\ge 3$ | zero for a Gaussian |
The Gaussian is uniquely characterized by having all cumulants of order ≥3 equal to zero — this is the deepest characterization of the Gaussian distribution.
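This characterization is visible in data: estimated third and fourth cumulants of Gaussian samples hover near zero. A sketch using scipy's k-statistics, which are unbiased estimators of the first four cumulants (the particular mean and scale below are arbitrary):

```python
import numpy as np
from scipy.stats import kstat  # k-statistic: unbiased estimator of kappa_n

rng = np.random.default_rng(4)
z = rng.normal(loc=2.0, scale=1.5, size=200_000)

# First four cumulants of N(2, 1.5^2) are 2, 2.25, 0, 0.
k = [kstat(z, n) for n in (1, 2, 3, 4)]
```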
Moment Generating Function and Characteristic Function
Moment generating function (MGF):

$$M_X(t) = \mathbb{E}[e^{tX}] = \sum_{k=0}^{\infty} \frac{t^k}{k!}\, \mathbb{E}[X^k], \qquad t \in (-\delta, \delta).$$
The MGF exists when $\mathbb{E}[e^{tX}]$ is finite in a neighborhood of 0 (which fails for heavy-tailed distributions like the Pareto). When it exists:

Consequences. $M_X'(0) = \mathbb{E}[X] = \mu$ and $M_X''(0) = \mathbb{E}[X^2] = \mu^2 + \sigma^2$. For the Gaussian $\mathcal{N}(\mu, \sigma^2)$, the CGF is $K(t) = \log M_X(t) = \mu t + \sigma^2 t^2/2$, so the cumulants are $\kappa_1 = \mu$, $\kappa_2 = \sigma^2$, and $\kappa_k = 0$ for $k \ge 3$: the Gaussian has no higher cumulants.

CF: $\varphi_X(t) = e^{i\mu t - \sigma^2 t^2/2}$. The Gaussian CF is itself a Gaussian (in $t$), giving the Gaussian its self-dual Fourier transform property.
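The Gaussian MGF identities can be verified symbolically. A sketch using SymPy (symbol names are arbitrary): differentiate $M(t) = e^{\mu t + \sigma^2 t^2/2}$ at 0 for the moments, and the CGF for the cumulants.

```python
import sympy as sp

t, mu = sp.symbols('t mu', real=True)
sigma = sp.symbols('sigma', positive=True)

# Gaussian MGF: M(t) = exp(mu t + sigma^2 t^2 / 2)
M = sp.exp(mu * t + sigma**2 * t**2 / 2)

m1 = sp.diff(M, t).subs(t, 0)        # E[X] = mu
m2 = sp.diff(M, t, 2).subs(t, 0)     # E[X^2] = mu^2 + sigma^2

K = sp.log(M)                        # CGF: mu t + sigma^2 t^2 / 2
k3 = sp.simplify(sp.diff(K, t, 3)).subs(t, 0)   # third cumulant: 0
```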
Example 2: Characteristic Function Proof of CLT (Sketch)
Let $X_1, \dots, X_n$ be iid with mean 0 and variance 1. Let $S_n = (X_1 + \dots + X_n)/\sqrt{n}$. Then:

$$\varphi_{S_n}(t) = \left(\varphi_{X_1}(t/\sqrt{n})\right)^n.$$

Taylor expand: $\varphi_{X_1}(s) = 1 + i\,\mathbb{E}[X]\,s - \mathbb{E}[X^2]\,s^2/2 + O(s^3)$, so with $s = t/\sqrt{n}$: $\varphi_{X_1}(t/\sqrt{n}) = 1 - t^2/(2n) + O(n^{-3/2})$.

Hence:

$$\varphi_{S_n}(t) = \left(1 - \frac{t^2}{2n} + O(n^{-3/2})\right)^n \to e^{-t^2/2} \quad \text{as } n \to \infty.$$

Since $e^{-t^2/2}$ is the CF of $\mathcal{N}(0,1)$, by Lévy's continuity theorem: $S_n \xrightarrow{d} \mathcal{N}(0,1)$.
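The convergence of the CF can be seen empirically: compute the empirical CF of $S_n$ and compare to $e^{-t^2/2}$. A sketch using standardized uniform summands (the choice of uniforms and all sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
reps, n = 20_000, 256

# iid mean-0, variance-1 summands: uniforms on [-sqrt(3), sqrt(3)].
U = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=(reps, n))
S = U.sum(axis=1) / np.sqrt(n)            # S_n = (X_1 + ... + X_n) / sqrt(n)

t = np.linspace(-3.0, 3.0, 13)
emp_cf = np.exp(1j * np.outer(t, S)).mean(axis=1)   # empirical E[exp(i t S_n)]
gauss_cf = np.exp(-t**2 / 2)                        # CF of N(0, 1)
err = np.max(np.abs(emp_cf - gauss_cf))
```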
Example 3: Conditional Expectation as Projection
In $L^2(\Omega, \mathcal{F}, P)$ with inner product $\langle X, Y \rangle = \mathbb{E}[XY]$, the $\mathcal{G}$-measurable square-integrable random variables form a closed subspace $L^2(\Omega, \mathcal{G}, P)$.

The conditional expectation $\mathbb{E}[X \mid \mathcal{G}]$ is the orthogonal projection of $X$ onto this subspace: it satisfies $\langle X - \mathbb{E}[X \mid \mathcal{G}], Z \rangle = 0$ for all $\mathcal{G}$-measurable $Z \in L^2$. This is exactly the optimality condition that $\mathbb{E}[X \mid \mathcal{G}]$ minimizes the mean-squared prediction error over all $\mathcal{G}$-measurable predictors.
In ML: predicting Y from X using a linear function is projecting onto the linear subspace. Using any σ(X)-measurable function is projecting onto the full conditional expectation E[Y∣X] — the best possible predictor under squared loss.
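The nesting of these projections is easy to demonstrate. A sketch with a made-up quadratic relationship (so the true $\mathbb{E}[Y \mid X] = X^2$ is known in closed form; all names and sizes are illustrative): the best linear predictor leaves strictly more squared error than the conditional expectation.

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.uniform(-2.0, 2.0, size=200_000)
y = x**2 + rng.normal(0.0, 0.5, size=x.size)   # so E[Y | X = x] = x^2

# Projection onto the linear subspace {a + b x}: best linear predictor.
slope, intercept = np.polyfit(x, y, 1)
mse_linear = np.mean((y - (intercept + slope * x)) ** 2)

# Projection onto all sigma(X)-measurable functions: E[Y|X] = x^2.
mse_cond = np.mean((y - x**2) ** 2)
```

Here `mse_cond` is close to the irreducible noise variance $0.25$, while the linear predictor also pays the full variance of the unexplained quadratic component.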
Connections
Where Your Intuition Breaks
Variance is finite for most distributions you use in practice (Gaussian, Bernoulli, Poisson). But heavy-tailed distributions (Pareto, Cauchy, power laws) can have infinite variance or even an undefined mean. The sample average of Cauchy draws does not converge: each new sample can move the average by an arbitrarily large amount. This matters in ML: gradient estimates for objectives over heavy-tailed data (financial returns, word-frequency distributions) can have infinite variance, making SGD arbitrarily noisy. Gradient clipping is a practical fix, but it turns the unbiased average into a biased estimator. The theoretical justification for gradient clipping requires understanding why this Cauchy-style pathology makes unbiased estimation infeasible.
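The Cauchy pathology takes one simulation to see: plot (or here, compare) the running mean of Cauchy draws against that of Gaussian draws. A sketch (sample sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
N = 1_000_000
steps = np.arange(1, N + 1)

# Running mean of Cauchy samples: never settles. The mean is undefined,
# and the running mean at every n is itself Cauchy-distributed.
cauchy_mean = np.cumsum(rng.standard_cauchy(N)) / steps

# Contrast: the Gaussian running mean converges to 0 (law of large numbers).
gauss_mean = np.cumsum(rng.normal(size=N)) / steps
```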
💡Intuition
The characteristic function always exists and uniquely determines the distribution. The MGF can fail to exist (heavy-tailed distributions have $\mathbb{E}[e^{tX}] = \infty$ for every $t > 0$), but the CF exists for every distribution because $|e^{itx}| = 1$. The inversion formula recovers the distribution (the PDF, when one exists) from the CF, showing the bijection between distributions and their CFs. This makes the CF the correct tool for proving convergence in distribution (via Lévy's continuity theorem), while the MGF is more convenient when it exists because it avoids complex numbers.
💡Intuition
Cumulants add under convolution. For independent $X, Y$: $\kappa_n(X+Y) = \kappa_n(X) + \kappa_n(Y)$. This is why the variance of a sum of independent variables is the sum of variances: variance is the second cumulant. It also implies: if $X_1, \dots, X_n$ are iid with cumulants $\kappa_k$ (take them centered, so $\kappa_1 = 0$), then $S_n = \sum_i X_i$ has cumulants $n\kappa_k$. The normalized sum $S_n/\sqrt{n}$ has cumulants $n\kappa_k/n^{k/2} = \kappa_k/n^{k/2 - 1}$, which go to zero for $k \ge 3$ as $n \to \infty$, leaving only the first two cumulants (the Gaussian ones). The CLT is the statement that only the first two cumulants survive normalization.
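Cumulant additivity can be checked with the same k-statistics as above. A sketch with Exp(1) draws, whose cumulants are known in closed form ($\kappa_n = (n-1)!$, so $\kappa_3 = 2$; sample sizes are arbitrary):

```python
import numpy as np
from scipy.stats import kstat  # unbiased estimator of the n-th cumulant

rng = np.random.default_rng(8)
x = rng.exponential(size=300_000)
y = rng.exponential(size=300_000)   # independent of x

# kappa_3 of Exp(1) is 2; for the independent sum it should be 2 + 2 = 4.
k3_sum = kstat(x + y, 3)
k3_parts = kstat(x, 3) + kstat(y, 3)
```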
⚠️Warning
Chebyshev's bound is often very loose. Chebyshev gives $P(|X - \mu| \ge k\sigma) \le 1/k^2$; for $k = 3$ this is $1/9 \approx 11\%$, while the true Gaussian tail probability is $0.27\%$. The bound is tight for the distribution that places mass $1/(2k^2)$ at each of $\pm k\sigma$ and $1 - 1/k^2$ at 0, but typical ML random variables are much better behaved. Concentration inequalities (Hoeffding, Bernstein, sub-Gaussian; covered in the bridge lesson) give exponentially tighter bounds for bounded or light-tailed random variables.
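The gap between Chebyshev and the exact Gaussian tail is a two-liner to compute (the chosen $k$ values are arbitrary):

```python
import numpy as np
from scipy.stats import norm

ks = np.array([2.0, 3.0, 4.0])
chebyshev = 1.0 / ks**2        # P(|X - mu| >= k sigma) <= 1/k^2, any distribution
gaussian = 2.0 * norm.sf(ks)   # exact two-sided tail for a Gaussian
```

At $k = 3$ the exact tail is roughly $0.0027$ against Chebyshev's $0.111$, a factor of about forty.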