
Matrix Calculus & Differentiation

Matrix calculus is the language in which backpropagation, gradient-based optimization, and the derivation of learning rules are written. Extending scalar differentiation to vector and matrix arguments requires careful attention to layout conventions, but the payoff is enormous: every gradient descent update, every second-order Newton step, and every derivation of a loss function's gradient reduces to a handful of matrix calculus identities. This lesson develops the full toolkit — gradients, Jacobians, Hessians, and the chain rule in matrix form — and applies it to derive OLS, PCA, and backpropagation as special cases.

Concepts

[Interactive demo] Gradient Field — $f(\mathbf{x}) = \mathbf{x}^T A \mathbf{x}$, $\nabla f = 2A\mathbf{x}$ (drag the point). With $A = I$, $f = x_1^2 + x_2^2$: the gradient $2\mathbf{x}$ points straight out from the origin, and the curvature is equal in all directions. The arrow shows $\nabla f(\mathbf{x}) = 2A\mathbf{x}$, always perpendicular to the level curve through $\mathbf{x}$; gradient descent follows $-\nabla f$.

Every time you run backpropagation, the chain rule is computing a matrix product of Jacobians. When you write loss.backward() in PyTorch, the framework is mechanically applying a handful of matrix calculus identities — the gradient of a quadratic, the chain rule, the Jacobian of a matrix-vector product. Learning these identities by hand once makes all of optimization, backpropagation, and second-order methods transparent rather than magical.

Layout Conventions

Two conventions exist for arranging partial derivatives, and confusing them causes sign errors, transposed answers, and impossible-to-debug gradients.

Numerator layout (Jacobian layout): For $\mathbf{f} : \mathbb{R}^n \to \mathbb{R}^m$, the Jacobian is $\frac{\partial \mathbf{f}}{\partial \mathbf{x}} \in \mathbb{R}^{m \times n}$ with $\left(\frac{\partial \mathbf{f}}{\partial \mathbf{x}}\right)_{ij} = \frac{\partial f_i}{\partial x_j}$.

Denominator layout: the Jacobian is $\frac{\partial \mathbf{f}}{\partial \mathbf{x}} \in \mathbb{R}^{n \times m}$ — the transpose of numerator layout.

This course uses denominator layout (the more common choice in ML textbooks, consistent with the convention that $\nabla_{\mathbf{x}} f$ is a column vector of the same shape as $\mathbf{x}$). Key consequence: the gradient of $f : \mathbb{R}^n \to \mathbb{R}$ is a column vector $\nabla_{\mathbf{x}} f \in \mathbb{R}^n$.

Gradient of a Scalar Function

For $f : \mathbb{R}^n \to \mathbb{R}$, the gradient is:

$$\nabla_{\mathbf{x}} f = \frac{\partial f}{\partial \mathbf{x}} = \begin{pmatrix}\partial f/\partial x_1 \\ \vdots \\ \partial f/\partial x_n\end{pmatrix} \in \mathbb{R}^n.$$

The gradient must have the same shape as $\mathbf{x}$ because it specifies a step direction in the same space: the update $\mathbf{x} \leftarrow \mathbf{x} - \eta \nabla_{\mathbf{x}} f$ requires both sides to be vectors of the same dimension. The denominator-layout convention (column gradient) is the one consistent with this update rule — numerator layout produces a row vector, requiring explicit transposes everywhere gradient descent appears.

Fundamental identities (denominator layout):

| Function $f(\mathbf{x})$ | Gradient $\nabla_\mathbf{x} f$ | Hessian $\nabla^2_\mathbf{x} f$ |
|---|---|---|
| $\mathbf{a}^T\mathbf{x}$ | $\mathbf{a}$ | $0$ |
| $\mathbf{x}^T A \mathbf{x}$ | $(A + A^T)\mathbf{x}$ | $A + A^T$ |
| $\mathbf{x}^T A \mathbf{x}$ ($A$ symmetric) | $2A\mathbf{x}$ | $2A$ |
| $\|\mathbf{x}\|^2 = \mathbf{x}^T\mathbf{x}$ | $2\mathbf{x}$ | $2I$ |
| $\|\mathbf{x} - \mathbf{a}\|^2$ | $2(\mathbf{x} - \mathbf{a})$ | $2I$ |
| $\mathbf{x}^T A \mathbf{b}$ | $A\mathbf{b}$ | $0$ |
| $\|\mathbf{x}\|_1$ | $\operatorname{sign}(\mathbf{x})$ (undefined where any $x_i = 0$) | $0$ where defined |

Derivation: $\nabla_\mathbf{x}(\mathbf{x}^T A \mathbf{x})$. Write $f = \sum_{i,j} a_{ij} x_i x_j$. The $k$-th component:

$$\frac{\partial f}{\partial x_k} = \sum_j a_{kj} x_j + \sum_i a_{ik} x_i = (A\mathbf{x})_k + (A^T\mathbf{x})_k.$$

Hence $\nabla_\mathbf{x}(\mathbf{x}^TA\mathbf{x}) = (A + A^T)\mathbf{x}$, which equals $2A\mathbf{x}$ when $A$ is symmetric.
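This identity is easy to sanity-check numerically. A minimal NumPy sketch (central differences; the matrix and test point here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))   # general (non-symmetric) matrix
x = rng.standard_normal(4)

f = lambda x: x @ A @ x

# Analytic gradient from the derivation: (A + A^T) x
grad_analytic = (A + A.T) @ x

# Central finite differences, one coordinate at a time
h = 1e-6
grad_numeric = np.array([
    (f(x + h * e) - f(x - h * e)) / (2 * h)
    for e in np.eye(4)
])

# For a quadratic, central differences are exact up to roundoff
print(np.max(np.abs(grad_analytic - grad_numeric)))
```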

Jacobian of a Vector Function

For $\mathbf{f} : \mathbb{R}^n \to \mathbb{R}^m$, the Jacobian (denominator layout) is:

$$J = \frac{\partial \mathbf{f}}{\partial \mathbf{x}} \in \mathbb{R}^{n \times m}, \qquad J_{ij} = \frac{\partial f_j}{\partial x_i}.$$

The Jacobian encodes the linear map that best approximates $\mathbf{f}$ at a point: $\mathbf{f}(\mathbf{x} + \Delta\mathbf{x}) \approx \mathbf{f}(\mathbf{x}) + J^T \Delta\mathbf{x}$ (note the transpose — a consequence of denominator layout).

Key examples:

| $\mathbf{f}(\mathbf{x})$ | Jacobian $\frac{\partial \mathbf{f}}{\partial \mathbf{x}}$ |
|---|---|
| $A\mathbf{x}$ (linear) | $A^T$ |
| $\mathbf{x}^T A$ | $A$ |
| $\sigma(\mathbf{x})$ (elementwise sigmoid) | $\operatorname{diag}(\sigma(\mathbf{x}) \odot (1 - \sigma(\mathbf{x})))$ |
| $\operatorname{softmax}(\mathbf{x})$ | $\operatorname{diag}(\mathbf{p}) - \mathbf{p}\mathbf{p}^T$, where $\mathbf{p} = \operatorname{softmax}(\mathbf{x})$ |
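The softmax row is worth verifying by hand at least once. A small NumPy check of $\operatorname{diag}(\mathbf{p}) - \mathbf{p}\mathbf{p}^T$ against central differences (the softmax Jacobian is symmetric, so the layout convention does not change the answer here):

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())   # shift for numerical stability
    return z / z.sum()

x = np.array([0.5, -1.0, 2.0])
p = softmax(x)

# Closed form from the table: diag(p) - p p^T
J_analytic = np.diag(p) - np.outer(p, p)

# Finite-difference Jacobian, one input coordinate at a time
h = 1e-6
J_numeric = np.column_stack([
    (softmax(x + h * e) - softmax(x - h * e)) / (2 * h)
    for e in np.eye(3)
])

print(np.max(np.abs(J_analytic - J_numeric)))
```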

The Chain Rule

For composed functions $f(\mathbf{x}) = g(h(\mathbf{x}))$ where $\mathbf{x} \in \mathbb{R}^n$, $h : \mathbb{R}^n \to \mathbb{R}^k$, $g : \mathbb{R}^k \to \mathbb{R}$:

$$\nabla_\mathbf{x} f = J_h \cdot \nabla_\mathbf{u} g \,\big|_{\mathbf{u} = h(\mathbf{x})},$$

where $J_h = \frac{\partial h}{\partial \mathbf{x}} \in \mathbb{R}^{n \times k}$ is the Jacobian of $h$ (denominator layout).

For the full vector-to-vector composition $\mathbf{f} = \mathbf{g} \circ h$:

$$\frac{\partial \mathbf{f}}{\partial \mathbf{x}} = \frac{\partial h}{\partial \mathbf{x}} \cdot \frac{\partial \mathbf{f}}{\partial h}.$$

This is precisely the rule that backpropagation implements: accumulate Jacobians layer by layer from output to input.
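As a concrete check of the composition rule, take $h(\mathbf{x}) = A\mathbf{x}$ and $g(\mathbf{u}) = \|\mathbf{u}\|^2$, so the chain rule predicts $\nabla_\mathbf{x} f = J_h \nabla g = A^T \cdot 2A\mathbf{x}$ — a minimal NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 4))   # h(x) = A x,  h: R^4 -> R^3
x = rng.standard_normal(4)

# f(x) = g(h(x)) with g(u) = ||u||^2
f = lambda x: np.sum((A @ x) ** 2)

# Chain rule in denominator layout: J_h = A^T, grad g(u) = 2u
grad_chain = A.T @ (2 * (A @ x))

# Finite-difference gradient for comparison
h = 1e-6
grad_numeric = np.array([(f(x + h*e) - f(x - h*e)) / (2*h) for e in np.eye(4)])

print(np.max(np.abs(grad_chain - grad_numeric)))
```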

Gradients of Matrix Functions

For functions of matrices $A \in \mathbb{R}^{m \times n}$, define $\frac{\partial f}{\partial A} \in \mathbb{R}^{m \times n}$ with $\left(\frac{\partial f}{\partial A}\right)_{ij} = \frac{\partial f}{\partial A_{ij}}$.

Key identities:

| Function $f(A)$ | Gradient $\frac{\partial f}{\partial A}$ |
|---|---|
| $\operatorname{tr}(A^TB)$ | $B$ |
| $\operatorname{tr}(AB)$ | $B^T$ |
| $\operatorname{tr}(ABA^T)$ | $A(B + B^T)$ |
| $\operatorname{tr}(A^TA)$ | $2A$ |
| $\log\det(A)$ ($A$ symmetric PD) | $A^{-1}$ |
| $\det(A)$ | $\det(A) \cdot A^{-T}$ |
| $\|A\|_F^2$ | $2A$ |

Derivation: $\frac{\partial}{\partial A}\log\det(A)$. Use Jacobi's formula: $d(\det A) = \det(A)\operatorname{tr}(A^{-1}\,dA)$. Then:

$$d(\log\det A) = \operatorname{tr}(A^{-1}dA) \implies \frac{\partial \log\det A}{\partial A_{ij}} = (A^{-1})_{ji} \implies \nabla_A \log\det(A) = A^{-T}.$$

For symmetric $A$: $\nabla_A \log\det(A) = A^{-1}$. This gradient appears in the MLE for a Gaussian model's covariance matrix.
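A quick numerical confirmation of the $\log\det$ gradient on a symmetric positive definite matrix (entrywise central differences over the standard basis matrices $E_{ij}$):

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.standard_normal((3, 3))
A = M @ M.T + 3 * np.eye(3)        # symmetric positive definite

f = lambda A: np.log(np.linalg.det(A))

grad_analytic = np.linalg.inv(A)   # A^{-1} = A^{-T} since A is symmetric

# Perturb each entry independently and difference
h = 1e-6
grad_numeric = np.zeros_like(A)
for i in range(3):
    for j in range(3):
        E = np.zeros((3, 3)); E[i, j] = 1.0
        grad_numeric[i, j] = (f(A + h*E) - f(A - h*E)) / (2*h)

print(np.max(np.abs(grad_analytic - grad_numeric)))
```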

The Hessian

The Hessian of $f : \mathbb{R}^n \to \mathbb{R}$ is the matrix of second partial derivatives:

$$H_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}, \qquad H = \nabla^2 f \in \mathbb{R}^{n \times n}.$$

By Clairaut's theorem (for twice continuously differentiable $f$), $H_{ij} = H_{ji}$ — the Hessian is symmetric.

Taylor expansion. Near $\mathbf{x}_0$:

$$f(\mathbf{x}_0 + \Delta\mathbf{x}) = f(\mathbf{x}_0) + \nabla f(\mathbf{x}_0)^T\Delta\mathbf{x} + \tfrac{1}{2}\Delta\mathbf{x}^T H(\mathbf{x}_0)\,\Delta\mathbf{x} + O(\|\Delta\mathbf{x}\|^3).$$

Newton's method uses this: minimizing the quadratic approximation gives the update $\Delta\mathbf{x}^* = -H^{-1}\nabla f$. This requires solving an $n \times n$ linear system (or approximating the Hessian) at every step.
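On a quadratic the second-order model is exact, so a single Newton step lands on the minimizer. A minimal NumPy sketch (note it solves the linear system rather than forming $H^{-1}$ explicitly):

```python
import numpy as np

rng = np.random.default_rng(3)
M = rng.standard_normal((5, 5))
A = M @ M.T + 5 * np.eye(5)        # SPD Hessian
b = rng.standard_normal(5)

# f(x) = 0.5 x^T A x - b^T x;  grad f = A x - b;  Hessian = A
grad = lambda x: A @ x - b

x = rng.standard_normal(5)                 # arbitrary starting point
x_new = x - np.linalg.solve(A, grad(x))    # Newton step: solve H dx = grad

# One step reaches the stationary point A x = b
print(np.max(np.abs(grad(x_new))))
```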

Automatic Differentiation: Forward and Reverse Mode

Modern deep learning frameworks (PyTorch, JAX, TensorFlow) compute gradients via automatic differentiation (AD), not symbolic calculus or finite differences.

Forward mode AD. Propagate derivative information forward through the computation graph, accumulating the Jacobian one input at a time. Cost: $O(n)$ forward passes for a function $\mathbb{R}^n \to \mathbb{R}^m$ — efficient when $n \ll m$.

Reverse mode AD (backpropagation). Propagate gradient information backward through the graph, accumulating the gradient one output at a time. Cost: $O(m)$ backward passes — efficient when $m \ll n$. For neural network training, $m = 1$ (scalar loss) and $n$ is the number of parameters (millions to billions), so reverse mode is overwhelmingly preferred.

Computational complexity. For $f : \mathbb{R}^n \to \mathbb{R}$, reverse mode computes the full gradient $\nabla f \in \mathbb{R}^n$ in $O(\text{cost of computing } f)$ — essentially for free relative to the forward pass. This is why gradient descent scales to billions of parameters.

Jacobian–vector products (JVPs) and vector–Jacobian products (VJPs). Forward mode computes JVPs ($J\mathbf{v}$ for a vector $\mathbf{v}$), reverse mode computes VJPs ($\mathbf{u}^T J$ for a covector $\mathbf{u}$). These are the primitives that AD engines expose and compose.
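For the linear map $\mathbf{f}(\mathbf{x}) = A\mathbf{x}$ the two primitives are simply $J\mathbf{v} = A\mathbf{v}$ and $\mathbf{u}^TJ = (A^T\mathbf{u})^T$. A small NumPy sketch verifying both against finite differences:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((3, 5))
f = lambda x: A @ x                # f: R^5 -> R^3, Jacobian (numerator layout) = A

x = rng.standard_normal(5)
v = rng.standard_normal(5)         # input-space direction (for the JVP)
u = rng.standard_normal(3)         # output-space covector (for the VJP)

# JVP via a single directional finite difference -- one "forward pass"
h = 1e-6
jvp_numeric = (f(x + h*v) - f(x - h*v)) / (2*h)
print(np.allclose(jvp_numeric, A @ v, atol=1e-6))    # Jv = Av

# VJP is the transpose action: u^T J = (A^T u)^T.  Equivalently, it is the
# gradient of the scalar function g(x) = u . f(x).
vjp = A.T @ u
g = lambda x: u @ f(x)
grad_numeric = np.array([(g(x + h*e) - g(x - h*e)) / (2*h) for e in np.eye(5)])
print(np.allclose(grad_numeric, vjp, atol=1e-6))
```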

Worked Examples

Example 1: Deriving the OLS Gradient

Loss $\mathcal{L}(\mathbf{w}) = \|A\mathbf{w} - \mathbf{b}\|^2 = \mathbf{w}^TA^TA\mathbf{w} - 2\mathbf{b}^TA\mathbf{w} + \mathbf{b}^T\mathbf{b}$.

Using the table identities:

  • $\nabla_\mathbf{w}(\mathbf{w}^TA^TA\mathbf{w}) = 2A^TA\mathbf{w}$ (since $A^TA$ is symmetric)
  • $\nabla_\mathbf{w}(-2\mathbf{b}^TA\mathbf{w}) = -2A^T\mathbf{b}$
  • $\nabla_\mathbf{w}(\mathbf{b}^T\mathbf{b}) = \mathbf{0}$

Setting $\nabla_\mathbf{w}\mathcal{L} = 2A^TA\mathbf{w} - 2A^T\mathbf{b} = \mathbf{0}$ recovers the normal equations $A^TA\mathbf{w} = A^T\mathbf{b}$.
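A quick NumPy check that solving the normal equations matches the library least-squares solver (forming $A^TA$ squares the condition number, so np.linalg.lstsq is preferred in practice):

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((50, 3))    # overdetermined system
b = rng.standard_normal(50)

# Solve the normal equations A^T A w = A^T b derived above
w_normal = np.linalg.solve(A.T @ A, A.T @ b)

# Cross-check against the library least-squares solver
w_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.max(np.abs(w_normal - w_lstsq)))

# At the solution, the gradient 2 A^T (A w - b) vanishes
print(np.max(np.abs(A.T @ (A @ w_normal - b))))
```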

Example 2: Gaussian MLE for $\boldsymbol{\mu}$ and $\Sigma$

Log-likelihood for $N$ iid observations from $\mathcal{N}(\boldsymbol{\mu}, \Sigma)$:

$$\ell(\boldsymbol{\mu}, \Sigma) = -\frac{N}{2}\log\det(\Sigma) - \frac{1}{2}\sum_{i=1}^N (\mathbf{x}_i - \boldsymbol{\mu})^T\Sigma^{-1}(\mathbf{x}_i - \boldsymbol{\mu}) + \text{const.}$$

Gradient w.r.t. $\boldsymbol{\mu}$:

$$\nabla_{\boldsymbol{\mu}} \ell = \Sigma^{-1}\sum_{i=1}^N (\mathbf{x}_i - \boldsymbol{\mu}) = \mathbf{0} \implies \hat{\boldsymbol{\mu}} = \frac{1}{N}\sum_{i=1}^N \mathbf{x}_i.$$

Gradient w.r.t. $\Sigma$ (using $\nabla_\Sigma \log\det\Sigma = \Sigma^{-1}$ and $\nabla_\Sigma \operatorname{tr}(\Sigma^{-1}B) = -\Sigma^{-1}B\Sigma^{-1}$):

$$\nabla_\Sigma \ell = \mathbf{0} \implies \hat{\Sigma} = \frac{1}{N}\sum_{i=1}^N (\mathbf{x}_i - \hat{\boldsymbol{\mu}})(\mathbf{x}_i - \hat{\boldsymbol{\mu}})^T.$$

The sample covariance is the MLE for $\Sigma$. Both results follow purely from matrix calculus identities.
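Both results can be confirmed numerically: at the MLE, the finite-difference gradient of the log-likelihood with respect to $\boldsymbol{\mu}$ vanishes, and $\hat\Sigma$ matches the biased sample covariance. A sketch on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.standard_normal((200, 2)) @ np.array([[2.0, 0.3], [0.3, 1.0]])

N = len(X)
mu_hat = X.mean(axis=0)
Sigma_hat = (X - mu_hat).T @ (X - mu_hat) / N    # MLE divides by N, not N-1

def loglik(mu, Sigma):
    d = X - mu
    Sinv = np.linalg.inv(Sigma)
    return (-N/2 * np.log(np.linalg.det(Sigma))
            - 0.5 * np.sum((d @ Sinv) * d))

# Finite-difference gradient w.r.t. mu at the MLE should vanish
h = 1e-6
g = np.array([
    (loglik(mu_hat + h*e, Sigma_hat) - loglik(mu_hat - h*e, Sigma_hat)) / (2*h)
    for e in np.eye(2)
])
print(np.max(np.abs(g)))

# The MLE covariance equals the biased sample covariance
print(np.max(np.abs(Sigma_hat - np.cov(X.T, bias=True))))
```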

Example 3: Backpropagation as Chain Rule

A single fully connected layer: $\mathbf{y} = \sigma(W\mathbf{x} + \mathbf{b})$, where $W \in \mathbb{R}^{m \times n}$.

Let $\boldsymbol{\delta} = \frac{\partial \mathcal{L}}{\partial \mathbf{y}} \in \mathbb{R}^m$ be the upstream gradient.

Chain rule:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{z}} = \boldsymbol{\delta} \odot \sigma'(\mathbf{z}), \qquad \mathbf{z} = W\mathbf{x} + \mathbf{b}.$$

$$\frac{\partial \mathcal{L}}{\partial W} = \frac{\partial \mathcal{L}}{\partial \mathbf{z}}\,\mathbf{x}^T \in \mathbb{R}^{m \times n}, \qquad \frac{\partial \mathcal{L}}{\partial \mathbf{b}} = \frac{\partial \mathcal{L}}{\partial \mathbf{z}}, \qquad \frac{\partial \mathcal{L}}{\partial \mathbf{x}} = W^T \frac{\partial \mathcal{L}}{\partial \mathbf{z}}.$$

These three expressions are exactly what a layer's backward pass computes. The $W^T$ in the final one (the gradient flows backward through the transpose) is the signature of the chain rule applied to the linear map $\mathbf{z} = W\mathbf{x}$, whose Jacobian (in denominator layout) is $W^T$.
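The three formulas translate directly into a layer's backward pass. A minimal NumPy sketch with a toy scalar loss (the sum of the layer's outputs, so $\boldsymbol{\delta} = \mathbf{1}$), checked against finite differences on $W$:

```python
import numpy as np

rng = np.random.default_rng(7)
m, n = 3, 4
W = rng.standard_normal((m, n))
b = rng.standard_normal(m)
x = rng.standard_normal(n)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def loss(W, x, b):
    # Toy scalar loss: sum of the layer's outputs
    return np.sum(sigmoid(W @ x + b))

# Backward pass: exactly the three formulas above
z = W @ x + b
y = sigmoid(z)
delta = np.ones(m)            # dL/dy for L = sum(y)
dz = delta * y * (1 - y)      # dL/dz = delta * sigma'(z), sigma'(z) = y(1-y)
dW = np.outer(dz, x)          # dL/dW = dz x^T
db = dz                       # dL/db = dz
dx = W.T @ dz                 # dL/dx = W^T dz

# Finite-difference check on dW
h = 1e-6
dW_num = np.zeros_like(W)
for i in range(m):
    for j in range(n):
        E = np.zeros_like(W); E[i, j] = 1.0
        dW_num[i, j] = (loss(W + h*E, x, b) - loss(W - h*E, x, b)) / (2*h)

print(np.max(np.abs(dW - dW_num)))
```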

Connections

Where Your Intuition Breaks

The gradient of a scalar function of a matrix, $f(W)$, is a matrix of the same shape as $W$. This feels natural, but the layout convention determines the exact form of every identity. The most common error in manual backpropagation is a transposition mistake — the gradient of $f(W) = \mathbf{u}^T W \mathbf{v}$ is $\nabla_W f = \mathbf{u}\mathbf{v}^T$ (an outer product), not $\mathbf{v}\mathbf{u}^T$. The arrangement follows from the convention that the gradient has the shape of $W$, not from any deep mathematical reason. When deriving gradients by hand, the safest check is always the finite-difference approximation: $(\nabla_W f)_{ij} \approx (f(W + \epsilon E_{ij}) - f(W - \epsilon E_{ij})) / (2\epsilon)$ for each standard basis matrix $E_{ij}$.

Numerically Checking Gradients

Before trusting an analytically derived gradient, verify it numerically. The finite difference approximation:

$$\frac{\partial f}{\partial x_i} \approx \frac{f(\mathbf{x} + h\mathbf{e}_i) - f(\mathbf{x} - h\mathbf{e}_i)}{2h}, \qquad h \approx 10^{-5}.$$

Gradient check: compute the relative difference between analytical and numerical gradients:

$$\text{relative error} = \frac{\|\nabla_{\text{analytic}} - \nabla_{\text{numeric}}\|}{\|\nabla_{\text{analytic}}\| + \|\nabla_{\text{numeric}}\|}.$$

Acceptable: $< 10^{-5}$. Red flag: $> 10^{-3}$.
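This recipe fits in a few lines. A sketch of a reusable checker (grad_check is a hypothetical helper, not a library function), tried on an identity from the gradient table:

```python
import numpy as np

def grad_check(f, grad_f, x, h=1e-5):
    """Relative error between analytic and central-difference gradients."""
    g_analytic = grad_f(x)
    g_numeric = np.array([
        (f(x + h*e) - f(x - h*e)) / (2*h) for e in np.eye(len(x))
    ])
    num = np.linalg.norm(g_analytic - g_numeric)
    den = np.linalg.norm(g_analytic) + np.linalg.norm(g_numeric)
    return num / den

# Example: f(x) = ||x - a||^2 has gradient 2(x - a) per the identities table
a = np.array([1.0, -2.0, 0.5])
f = lambda x: np.sum((x - a) ** 2)
grad_f = lambda x: 2 * (x - a)

err = grad_check(f, grad_f, np.array([0.3, 0.7, -1.1]))
print(err < 1e-5)   # within the acceptance threshold
```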

⚠️Warning

Layout-convention bugs are the most common matrix calculus error. If your gradient has the wrong shape (it should be $n \times 1$ but you get $1 \times n$), you have a layout mismatch. Establish a single convention at the start of every derivation and use it consistently. PyTorch follows the denominator-layout convention for .grad attributes — gradients have the same shape as their parameters.

💡Intuition

The gradient points perpendicular to level curves. For $f(\mathbf{x}) = \mathbf{x}^TA\mathbf{x}$, the level curve through $\mathbf{x}_0$ is $\{\mathbf{x} : f(\mathbf{x}) = f(\mathbf{x}_0)\}$. The gradient $\nabla f(\mathbf{x}_0) = 2A\mathbf{x}_0$ (for symmetric $A$) is perpendicular to this curve: moving along the level curve doesn't change $f$, so the directional derivative in any tangent direction is zero, and the gradient is the unique direction orthogonal to all of them (up to scale). Gradient descent follows $-\nabla f$, the direction of steepest descent.

💡Intuition

Second-order methods: Newton's method vs gradient descent. Gradient descent step: $\mathbf{x} \leftarrow \mathbf{x} - \eta \nabla f$. Newton step: $\mathbf{x} \leftarrow \mathbf{x} - H^{-1}\nabla f$. Newton's method uses the Hessian to adapt the step to the local curvature — it converges in one step on quadratics, and quadratically near the optimum for smooth functions. The cost: forming and inverting an $n \times n$ Hessian is $O(n^3)$. For $n = 10^8$ parameters this is completely infeasible. Quasi-Newton and adaptive methods (L-BFGS, Adagrad, Adam) approximate curvature information cheaply, capturing some of its benefit at roughly $O(n)$ cost per step.

Connecting Matrix Calculus to the Module

This lesson closes the linear algebra arc:

| Topic | Matrix calculus connection |
|---|---|
| Eigenvalues (Lesson 3) | $\lambda_{\max} = \max_{\|\mathbf{x}\|=1} \mathbf{x}^TA\mathbf{x}$ — setting the Lagrangian's gradient to zero gives $A\mathbf{x} = \lambda\mathbf{x}$ |
| Spectral Theorem (Lesson 4) | Hessian of $f=\mathbf{x}^TA\mathbf{x}$ is $2A$ — its eigenvectors are the principal curvature directions |
| SVD (Lesson 5) | $\sigma_1 = \max_{\|\mathbf{x}\|=1}\|A\mathbf{x}\|$ — the Lagrangian of $\|A\mathbf{x}\|^2$ gives the left/right singular vector equations |
| PSD (Lesson 6) | $A \succeq 0 \iff f(\mathbf{x})=\mathbf{x}^TA\mathbf{x}$ is convex $\iff H(f) \succeq 0$ |
| Least squares (Lesson 7) | Normal equations = gradient of $\|A\mathbf{x}-\mathbf{b}\|^2$ set to zero |

Bridge to Multivariate Calculus

Module 04 (Multivariate Calculus & Differential Geometry) extends these ideas to nonlinear functions: the gradient becomes a covector on a manifold, the Hessian becomes the second fundamental form, and the chain rule becomes the pullback of differential forms. For machine learning, the key extension is to Riemannian optimization — gradient descent on curved parameter spaces (e.g., optimization over orthogonal matrices or probability simplices), where the Euclidean gradient must be corrected by the metric tensor.
