Matrix calculus is the language in which backpropagation, gradient-based optimization, and the derivation of learning rules are written. Extending scalar differentiation to vector and matrix arguments requires careful attention to layout conventions, but the payoff is enormous: every gradient descent update, every second-order Newton step, and every derivation of a loss function's gradient reduces to a handful of matrix calculus identities. This lesson develops the full toolkit — gradients, Jacobians, Hessians, and the chain rule in matrix form — and applies it to derive OLS, PCA, and backpropagation as special cases.
Concepts
[Interactive demo: Gradient Field — f(x) = xᵀAx with A = I, showing x, f(x), ∇f = 2Ax, ‖∇f‖, and H(f) = 2A as the point is dragged.]

For A = I, f = x₁² + x₂² and the gradient 2x points straight out from the origin; all directions have equal curvature. The yellow arrow is ∇f(x) = 2Ax, always perpendicular to the level curve through x. Gradient descent follows −∇f.
Every time you run backpropagation, the chain rule is computing a matrix product of Jacobians. When you write loss.backward() in PyTorch, the framework is mechanically applying a handful of matrix calculus identities — the gradient of a quadratic, the chain rule, the Jacobian of a matrix-vector product. Learning these identities by hand once makes all of optimization, backpropagation, and second-order methods transparent rather than magical.
Layout Conventions
Two conventions exist for arranging partial derivatives, and confusing them causes sign errors, transposed answers, and impossible-to-debug gradients.
Numerator layout (Jacobian layout): for f: Rⁿ → Rᵐ, the Jacobian is ∂f/∂x ∈ R^{m×n} with (∂f/∂x)ᵢⱼ = ∂fᵢ/∂xⱼ.

Denominator layout: the Jacobian is ∂f/∂x ∈ R^{n×m} — the transpose of numerator layout.
This course uses denominator layout (most common in ML textbooks, consistent with the convention that ∇_x f is a column vector of the same shape as x). Key consequence: the gradient of f: Rⁿ → R is a column vector ∇_x f ∈ Rⁿ.
Gradient of a Scalar Function
For f: Rⁿ → R, the gradient is:

∇_x f = ∂f/∂x = (∂f/∂x₁, …, ∂f/∂xₙ)ᵀ ∈ Rⁿ.
The gradient must have the same shape as x because it specifies a step direction in the same space: updating x ← x − η∇_x f requires both sides to be vectors of the same dimension. The denominator-layout convention (column gradient) is the one consistent with this update rule — switching to numerator layout produces a row vector, requiring explicit transposes everywhere gradient descent appears.
Fundamental identities (denominator layout):
| Function f(x) | Gradient ∇_x f | Hessian ∇²_x f |
|---|---|---|
| aᵀx | a | 0 |
| xᵀAx | (A + Aᵀ)x | A + Aᵀ |
| xᵀAx (A symmetric) | 2Ax | 2A |
| ‖x‖² = xᵀx | 2x | 2I |
| ‖x − a‖² | 2(x − a) | 2I |
| xᵀAb | Ab | 0 |
| ‖x‖₁ | sign(x) | undefined at 0 |
Derivation of ∇_x(xᵀAx). Write f = Σ_{i,j} aᵢⱼ xᵢ xⱼ. The k-th component:

∂f/∂xₖ = Σⱼ aₖⱼ xⱼ + Σᵢ aᵢₖ xᵢ = (Ax)ₖ + (Aᵀx)ₖ.

Hence ∇_x(xᵀAx) = (A + Aᵀ)x, which equals 2Ax when A is symmetric.
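The identity can be verified numerically. A minimal NumPy sketch (the matrix and the evaluation point are arbitrary test values) checks (A + Aᵀ)x against central finite differences for a deliberately non-symmetric A:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))   # deliberately non-symmetric
x = rng.standard_normal(4)

f = lambda x: x @ A @ x           # f(x) = x^T A x

# Analytical gradient from the derivation: (A + A^T) x
grad_analytic = (A + A.T) @ x

# Central finite differences, one coordinate at a time
h = 1e-5
grad_numeric = np.array([
    (f(x + h * e) - f(x - h * e)) / (2 * h)
    for e in np.eye(4)
])

assert np.allclose(grad_analytic, grad_numeric, atol=1e-6)
```

Note that 2Ax would fail this check here, since A is not symmetric.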
Jacobian of a Vector Function
For f: Rⁿ → Rᵐ, the Jacobian (denominator layout) is:

J = ∂f/∂x ∈ R^{n×m}, with Jᵢⱼ = ∂fⱼ/∂xᵢ.

The Jacobian is the linear map that best approximates f at a point: f(x + Δx) ≈ f(x) + JᵀΔx (note the transpose — a consequence of denominator layout).
Key examples:
| f(x) | Jacobian ∂f/∂x |
|---|---|
| Ax (linear) | Aᵀ |
| xᵀA | A |
| σ(x) (elementwise sigmoid) | diag(σ(x) ⊙ (1 − σ(x))) |
| softmax(x) | diag(p) − ppᵀ, where p = softmax(x) |
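The softmax row is the one most worth checking by hand. A short NumPy sketch (the input vector is an arbitrary test value) compares diag(p) − ppᵀ against a finite-difference Jacobian; since this matrix is symmetric, the layout question doesn't change the answer here:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())      # shift by max for numerical stability
    return e / e.sum()

x = np.array([0.5, -1.0, 2.0])
p = softmax(x)

# Table identity: J = diag(p) - p p^T
J_analytic = np.diag(p) - np.outer(p, p)

# Finite-difference Jacobian, one input coordinate at a time
h = 1e-6
J_numeric = np.column_stack([
    (softmax(x + h * e) - softmax(x - h * e)) / (2 * h)
    for e in np.eye(3)
])

assert np.allclose(J_analytic, J_numeric, atol=1e-6)
```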
The Chain Rule
For composed functions f(x) = g(h(x)) with x ∈ Rⁿ, h: Rⁿ → Rᵏ, g: Rᵏ → R:

∇_x f = J_h · ∇_u g |_{u = h(x)},

where J_h = ∂h/∂x ∈ R^{n×k} is the Jacobian of h (denominator layout).
For the full vector-to-vector composition f = g ∘ h:

∂f/∂x = (∂h/∂x) · (∂f/∂h).

The shapes confirm the ordering: (n×k)·(k×m) = (n×m), the denominator-layout Jacobian of f.
This is precisely the rule that backpropagation implements: accumulate Jacobians layer by layer from output to input.
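The composition rule can be checked numerically. In this NumPy sketch (the matrix B and the quadratic g are arbitrary test choices), the gradient of f = g ∘ h with h(x) = Bx is assembled as J_h · ∇g and compared with finite differences:

```python
import numpy as np

B = np.array([[1.0, 2.0, 0.0],
              [-1.0, 0.5, 3.0]])        # h(x) = Bx,  h: R^3 -> R^2
h = lambda x: B @ x
g = lambda u: u @ u                      # g(u) = u^T u,  grad_u g = 2u
f = lambda x: g(h(x))

x = np.array([0.3, -0.7, 1.1])

# Chain rule in denominator layout: grad f = J_h . grad g, with J_h = B^T
grad_analytic = B.T @ (2 * h(x))

hstep = 1e-6
grad_numeric = np.array([
    (f(x + hstep * e) - f(x - hstep * e)) / (2 * hstep)
    for e in np.eye(3)
])

assert np.allclose(grad_analytic, grad_numeric, atol=1e-6)
```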
Gradients of Matrix Functions
For functions of matrices A ∈ R^{m×n}, define ∂f/∂A ∈ R^{m×n} with (∂f/∂A)ᵢⱼ = ∂f/∂Aᵢⱼ.
Key identities:
| Function f(A) | Gradient ∂f/∂A |
|---|---|
| tr(AᵀB) | B |
| tr(AB) | Bᵀ |
| tr(ABAᵀ) | A(B + Bᵀ) |
| tr(AᵀA) | 2A |
| log det(A) (A symmetric PD) | A⁻¹ |
| det(A) | det(A) · A⁻ᵀ |
| ‖A‖²_F | 2A |
Derivation of ∂ log det(A)/∂A. Use the Jacobi identity: d(det A) = det(A) tr(A⁻¹ dA). Then d log det(A) = d(det A)/det(A) = tr(A⁻¹ dA), and matching this against the general form df = tr((∂f/∂A)ᵀ dA) gives ∂f/∂A = A⁻ᵀ, which equals A⁻¹ when A is symmetric.
Newton's method uses the Hessian to build a local quadratic model f(x + Δx) ≈ f(x) + ∇fᵀΔx + ½ΔxᵀHΔx; minimizing the quadratic approximation gives the update Δx* = −H⁻¹∇f. This requires solving an n×n linear system (or approximating the Hessian) at every step.
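For a quadratic objective, the one-step property is easy to demonstrate. A NumPy sketch (H, c, the step size, and the iteration count are illustrative choices) contrasts a single Newton step with plain gradient descent on the same problem:

```python
import numpy as np

# Quadratic f(x) = 1/2 x^T H x - c^T x with H symmetric positive definite
H = np.array([[10.0, 2.0], [2.0, 1.0]])   # poorly conditioned curvature
c = np.array([1.0, -1.0])

grad = lambda x: H @ x - c
x_star = np.linalg.solve(H, c)             # true minimizer

# Newton: solve H dx = -grad(x); one step lands on x* for a quadratic
x0 = np.array([5.0, 5.0])
x_newton = x0 - np.linalg.solve(H, grad(x0))
assert np.allclose(x_newton, x_star)

# Gradient descent needs many steps; eta must be below 2/lambda_max(H)
x_gd = x0.copy()
eta = 0.18
for _ in range(200):
    x_gd = x_gd - eta * grad(x_gd)
assert np.allclose(x_gd, x_star, atol=1e-6)
```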
Automatic Differentiation: Forward and Reverse Mode
Modern deep learning frameworks (PyTorch, JAX, TensorFlow) compute gradients via automatic differentiation (AD), not symbolic calculus or finite differences.
Forward mode AD. Propagate derivative information forward through the computation graph, accumulating the Jacobian one input at a time. Cost: O(n) forward passes for a function Rⁿ → Rᵐ — efficient when n ≪ m.

Reverse mode AD (backpropagation). Propagate gradient information backward through the graph, accumulating the gradient one output at a time. Cost: O(m) backward passes — efficient when m ≪ n. For neural network training, m = 1 (scalar loss) and n = the number of parameters (millions to billions), so reverse mode is overwhelmingly preferred.

Computational complexity. For f: Rⁿ → R, reverse mode computes the full gradient ∇f ∈ Rⁿ in O(cost of computing f) — essentially for free relative to the forward pass. This is why gradient descent scales to billions of parameters.
Jacobian-vector products (JVPs) and vector-Jacobian products (VJPs). Forward mode computes JVPs (Jv for a vector v); reverse mode computes VJPs (uᵀJ for a covector u). These are the primitives that AD engines expose and compose.
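Reverse mode can be sketched in a few dozen lines. The `Var` class below is a hypothetical toy, not any real framework's API: each operation records its local derivatives, and `backward()` sweeps the graph in reverse topological order, accumulating gradients via the chain rule (a chain of VJPs on scalars):

```python
class Var:
    """Toy scalar for reverse-mode AD; records the graph as it computes."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents    # pairs (parent, local derivative)
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])

    def backward(self):
        # Topologically order the graph, then sweep it in reverse
        order, seen = [], set()
        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent, _ in node.parents:
                    visit(parent)
                order.append(node)
        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            for parent, local in node.parents:
                parent.grad += local * node.grad   # chain rule accumulation

x, y = Var(3.0), Var(4.0)
f = x * y + x            # f = xy + x:  df/dx = y + 1 = 5,  df/dy = x = 3
f.backward()
assert (x.grad, y.grad) == (5.0, 3.0)
```

One backward sweep fills in the gradient with respect to every input, which is exactly the O(cost of f) claim above in miniature.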
Worked Examples
Example 1: Deriving the OLS Gradient
Loss L(w) = ‖Aw − b‖² = wᵀAᵀAw − 2bᵀAw + bᵀb.

Using the table identities:

- ∇_w(wᵀAᵀAw) = 2AᵀAw (since AᵀA is symmetric)
- ∇_w(−2bᵀAw) = −2Aᵀb
- ∇_w(bᵀb) = 0

Setting ∇_w L = 2AᵀAw − 2Aᵀb = 0 recovers the normal equations AᵀAw = Aᵀb.
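The derivation can be confirmed numerically. A NumPy sketch (random A and b as test data) solves the normal equations, checks agreement with the library least-squares solver, and verifies that the gradient vanishes at the solution:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((20, 3))
b = rng.standard_normal(20)

# Solve the normal equations A^T A w = A^T b derived above
w_normal = np.linalg.solve(A.T @ A, A.T @ b)

# Compare against the library least-squares solver
w_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
assert np.allclose(w_normal, w_lstsq)

# The gradient 2 A^T A w - 2 A^T b vanishes at the minimizer
grad = 2 * A.T @ A @ w_normal - 2 * A.T @ b
assert np.allclose(grad, 0, atol=1e-8)
```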
Example 2: Gaussian MLE for μ and Σ
Log-likelihood for N iid observations from N(μ, Σ):

ℓ(μ, Σ) = −(N/2) log det Σ − (1/2) Σᵢ (xᵢ − μ)ᵀΣ⁻¹(xᵢ − μ) + const.

Gradient w.r.t. Σ (using ∇_Σ log det Σ = Σ⁻¹ and ∇_Σ tr(Σ⁻¹B) = −Σ⁻¹BΣ⁻¹):

∇_Σ ℓ = 0 ⟹ Σ̂ = (1/N) Σᵢ (xᵢ − μ̂)(xᵢ − μ̂)ᵀ.
The sample covariance is the MLE for Σ. Both results follow purely from matrix calculus identities.
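The maximizer can be sanity-checked numerically: at Σ̂, every nearby positive-definite perturbation should lower the log-likelihood. A NumPy sketch (synthetic data and an illustrative perturbation scale):

```python
import numpy as np

rng = np.random.default_rng(2)
# Synthetic Gaussian-ish data with non-trivial covariance
X = rng.standard_normal((200, 2)) @ np.array([[2.0, 0.3], [0.0, 1.0]])
N = X.shape[0]

mu_hat = X.mean(axis=0)
D = X - mu_hat
Sigma_hat = D.T @ D / N            # MLE: (1/N) sum (x_i - mu)(x_i - mu)^T

def loglik(Sigma):
    # l(mu_hat, Sigma) up to an additive constant
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * N * logdet - 0.5 * np.trace(D @ np.linalg.inv(Sigma) @ D.T)

# Small symmetric perturbations that keep Sigma PD never beat Sigma_hat
best = loglik(Sigma_hat)
for _ in range(20):
    P = rng.standard_normal((2, 2)) * 0.05
    assert loglik(Sigma_hat + (P + P.T) / 2) <= best + 1e-9
```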
Example 3: Backpropagation as Chain Rule
A single fully-connected layer: y = σ(Wx + b) with W ∈ R^{m×n}.

Let δ = ∂L/∂y ∈ Rᵐ be the upstream gradient.

Chain rule:

∂L/∂z = δ ⊙ σ′(z), where z = Wx + b.

∂L/∂W = (∂L/∂z) xᵀ ∈ R^{m×n},  ∂L/∂b = ∂L/∂z,  ∂L/∂x = Wᵀ(∂L/∂z).
These three lines are exactly what a layer's backward pass computes. The Wᵀ in the final line (gradient flows backward through the transpose) is the signature of the chain rule applied to the linear map z = Wx, whose Jacobian (in denominator layout) is Wᵀ.
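The three backward lines translate directly into code. A NumPy sketch of one layer's backward pass (random shapes and values as test data), checked against finite differences of the surrogate loss L(x) = δᵀσ(Wx + b):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(3)
m, n = 3, 4
W = rng.standard_normal((m, n))
b = rng.standard_normal(m)
x = rng.standard_normal(n)
delta = rng.standard_normal(m)          # upstream gradient dL/dy

# Forward pass, then the three backward lines from the text
z = W @ x + b
y = sigmoid(z)
dz = delta * y * (1 - y)                # dL/dz = delta . sigma'(z)
dW = np.outer(dz, x)                    # dL/dW = (dL/dz) x^T
db = dz                                 # dL/db = dL/dz
dx = W.T @ dz                           # dL/dx = W^T (dL/dz)

# Check dL/dx against finite differences of L(x) = delta^T sigmoid(Wx + b)
L = lambda x: delta @ sigmoid(W @ x + b)
h = 1e-6
dx_num = np.array([(L(x + h * e) - L(x - h * e)) / (2 * h) for e in np.eye(n)])
assert np.allclose(dx, dx_num, atol=1e-6)
```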
Connections
Where Your Intuition Breaks
The gradient of a matrix-valued function f(W) is a matrix of the same shape as W. This feels natural, but the layout convention determines the exact form of every identity. The most common error in manual backpropagation is a transposition mistake — the gradient of f(W) = uᵀWv is ∇_W f = uvᵀ (outer product), not vuᵀ. The transpose arises from the layout convention (denominator layout puts input dimensions first), not from any deep mathematical reason. When deriving gradients by hand, the safest check is always the finite-difference approximation: (∇_W f)ᵢⱼ ≈ (f(W + εEᵢⱼ) − f(W − εEᵢⱼ)) / (2ε) for each standard basis matrix Eᵢⱼ.
Numerically Checking Gradients
Before trusting an analytically derived gradient, verify it numerically. The finite-difference approximation:

∂f/∂xᵢ ≈ (f(x + h eᵢ) − f(x − h eᵢ)) / (2h), with h ≈ 10⁻⁵.

Gradient check: compute the relative difference between analytical and numerical gradients, ‖g_analytic − g_numeric‖ / (‖g_analytic‖ + ‖g_numeric‖); values below roughly 10⁻⁶ typically indicate agreement.
Layout convention bugs are the most common matrix calculus error. If your gradient has the wrong shape (should be (n×1) but you get (1×n)), you have a layout mismatch. Establish a single convention at the start of every derivation and use it consistently. PyTorch uses "denominator layout" for .grad attributes — gradients have the same shape as parameters.
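The relative-difference check is a few lines of NumPy. This sketch (the test function and the deliberately buggy gradient are illustrative) flags the classic mistake of using 2Ax when A is not symmetric:

```python
import numpy as np

def grad_check(f, grad_f, x, h=1e-5):
    """Relative error between analytical and central-difference gradients."""
    g_a = grad_f(x)
    g_n = np.array([(f(x + h * e) - f(x - h * e)) / (2 * h)
                    for e in np.eye(x.size)])
    return np.linalg.norm(g_a - g_n) / (
        np.linalg.norm(g_a) + np.linalg.norm(g_n) + 1e-12)

A = np.array([[2.0, 1.0], [0.0, 3.0]])    # non-symmetric on purpose
f = lambda x: x @ A @ x
correct = lambda x: (A + A.T) @ x          # right identity
buggy = lambda x: 2 * A @ x                # wrong when A is not symmetric

x = np.array([1.0, -2.0])
assert grad_check(f, correct, x) < 1e-7    # passes the check
assert grad_check(f, buggy, x) > 1e-3      # clearly flagged as wrong
```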
💡Intuition
The gradient points perpendicular to level curves. For f(x) = xᵀAx, the level curve through x₀ is {x : f(x) = f(x₀)}. The gradient ∇f(x₀) = 2Ax₀ is perpendicular to this curve. This is because moving along the level curve doesn't change f, so the directional derivative in the tangent direction is zero — and the gradient is the unique direction with this property (up to scale). Gradient descent follows −∇f, which is the direction of steepest descent.
💡Intuition
Second-order methods: Newton's method vs gradient descent. Gradient descent step: x ← x − η∇f. Newton step: x ← x − H⁻¹∇f. Newton uses the Hessian to adapt the step to the local curvature — it converges in one step for quadratics, and quadratically near the optimum for smooth functions. The cost: forming and inverting an n×n Hessian is O(n³). For n = 10⁸ parameters, this is completely infeasible. Quasi-Newton methods such as L-BFGS approximate H⁻¹ cheaply, and adaptive methods like Adagrad and Adam use diagonal curvature estimates, capturing some curvature information at O(n) cost.
- Hessian of f = xᵀAx is 2A — eigenvectors are principal curvature directions
- SVD (Lesson 5): σ₁ = max_{‖x‖=1} ‖Ax‖ — the Lagrangian of ‖Ax‖² gives the left/right singular vector equations
- PSD (Lesson 6): A ⪰ 0 ⟺ f(x) = xᵀAx is convex ⟺ H(f) ⪰ 0
- Least squares (Lesson 7): normal equations = gradient of ‖Ax − b‖² set to zero
Bridge to Multivariate Calculus
Module 04 (Multivariate Calculus & Differential Geometry) extends these ideas to nonlinear functions: the gradient becomes a covector on a manifold, the Hessian becomes the second fundamental form, and the chain rule becomes the pullback of differential forms. For machine learning, the key extension is to Riemannian optimization — gradient descent on curved parameter spaces (e.g., optimization over orthogonal matrices or probability simplices), where the Euclidean gradient must be corrected by the metric tensor.