
Matrix Calculus & Differentiation

Matrix calculus is the language in which backpropagation, gradient-based optimization, and the derivation of learning rules are written. Extending scalar differentiation to vector and matrix arguments requires careful attention to layout conventions, but the payoff is enormous: every gradient descent update, every second-order Newton step, and every derivation of a loss function's gradient reduces to a handful of matrix calculus identities. This lesson develops the full toolkit — gradients, Jacobians, Hessians, and the chain rule in matrix form — and applies it to derive OLS, PCA, and backpropagation as special cases.

Concepts

[Interactive demo] Gradient Field — $f(\mathbf{x}) = \mathbf{x}^T A \mathbf{x}$, $\nabla f = 2A\mathbf{x}$ (drag the point). With $A = I$, $f = x_1^2 + x_2^2$: the gradient $2\mathbf{x}$ points straight out from the origin, and the curvature is equal in all directions. The arrow shows $\nabla f(\mathbf{x}) = 2A\mathbf{x}$, always perpendicular to the level curve through $\mathbf{x}$; gradient descent follows $-\nabla f$.

Every time you run backpropagation, the chain rule is computing a matrix product of Jacobians. When you write loss.backward() in PyTorch, the framework is mechanically applying a handful of matrix calculus identities — the gradient of a quadratic, the chain rule, the Jacobian of a matrix-vector product. Learning these identities by hand once makes all of optimization, backpropagation, and second-order methods transparent rather than magical.

Layout Conventions

Two conventions exist for arranging partial derivatives, and confusing them causes sign errors, transposed answers, and impossible-to-debug gradients.

Numerator layout (Jacobian layout): For $\mathbf{f} : \mathbb{R}^n \to \mathbb{R}^m$, the Jacobian is $\frac{\partial \mathbf{f}}{\partial \mathbf{x}} \in \mathbb{R}^{m \times n}$ with $\left(\frac{\partial \mathbf{f}}{\partial \mathbf{x}}\right)_{ij} = \frac{\partial f_i}{\partial x_j}$.

Denominator layout: the Jacobian is $\frac{\partial \mathbf{f}}{\partial \mathbf{x}} \in \mathbb{R}^{n \times m}$ — the transpose of numerator layout.

This course uses denominator layout (the more common choice in ML textbooks, consistent with the convention that $\nabla_{\mathbf{x}} f$ is a column vector of the same shape as $\mathbf{x}$). Key consequence: the gradient of $f : \mathbb{R}^n \to \mathbb{R}$ is a column vector $\nabla_{\mathbf{x}} f \in \mathbb{R}^n$.

Gradient of a Scalar Function

For $f : \mathbb{R}^n \to \mathbb{R}$, the gradient is:

$$\nabla_{\mathbf{x}} f = \frac{\partial f}{\partial \mathbf{x}} = \begin{pmatrix}\partial f/\partial x_1 \\ \vdots \\ \partial f/\partial x_n\end{pmatrix} \in \mathbb{R}^n.$$

The gradient must have the same shape as $\mathbf{x}$ because it specifies a step direction in the same space: the update $\mathbf{x} \leftarrow \mathbf{x} - \eta \nabla_{\mathbf{x}} f$ requires both sides to be vectors of the same dimension. The denominator-layout convention (column gradient) is the one consistent with this update rule — numerator layout produces a row vector, requiring explicit transposes everywhere gradient descent appears.

Fundamental identities (denominator layout):

| Function $f(\mathbf{x})$ | Gradient $\nabla_\mathbf{x} f$ | Hessian $\nabla^2_\mathbf{x} f$ |
|---|---|---|
| $\mathbf{a}^T\mathbf{x}$ | $\mathbf{a}$ | $0$ |
| $\mathbf{x}^T A \mathbf{x}$ | $(A + A^T)\mathbf{x}$ | $A + A^T$ |
| $\mathbf{x}^T A \mathbf{x}$ ($A$ symmetric) | $2A\mathbf{x}$ | $2A$ |
| $\|\mathbf{x}\|^2 = \mathbf{x}^T\mathbf{x}$ | $2\mathbf{x}$ | $2I$ |
| $\|\mathbf{x} - \mathbf{a}\|^2$ | $2(\mathbf{x} - \mathbf{a})$ | $2I$ |
| $\mathbf{x}^T A \mathbf{b}$ | $A\mathbf{b}$ | $0$ |
| $\|\mathbf{x}\|_1$ | $\operatorname{sign}(\mathbf{x})$ (undefined where any $x_i = 0$) | $0$ where defined |

Derivation: $\nabla_\mathbf{x}(\mathbf{x}^T A \mathbf{x})$. Write $f = \sum_{i,j} a_{ij} x_i x_j$. The $k$-th component:

$$\frac{\partial f}{\partial x_k} = \sum_j a_{kj} x_j + \sum_i a_{ik} x_i = (A\mathbf{x})_k + (A^T\mathbf{x})_k.$$

Hence $\nabla_\mathbf{x}(\mathbf{x}^TA\mathbf{x}) = (A + A^T)\mathbf{x}$, which equals $2A\mathbf{x}$ when $A$ is symmetric.
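This identity is easy to sanity-check numerically. A minimal NumPy sketch (central differences; the matrix and test point here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))   # general (non-symmetric) matrix
x = rng.standard_normal(4)

f = lambda x: x @ A @ x

# Analytic gradient from the derivation: (A + A^T) x
grad_analytic = (A + A.T) @ x

# Central finite differences, one coordinate at a time
h = 1e-6
grad_numeric = np.array([
    (f(x + h * e) - f(x - h * e)) / (2 * h)
    for e in np.eye(4)
])

# For a quadratic, central differences are exact up to roundoff
print(np.max(np.abs(grad_analytic - grad_numeric)))
```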

Jacobian of a Vector Function

For $\mathbf{f} : \mathbb{R}^n \to \mathbb{R}^m$, the Jacobian (denominator layout) is:

$$J = \frac{\partial \mathbf{f}}{\partial \mathbf{x}} \in \mathbb{R}^{n \times m}, \qquad J_{ij} = \frac{\partial f_j}{\partial x_i}.$$

The Jacobian encodes the linear map that best approximates $\mathbf{f}$ at a point: $\mathbf{f}(\mathbf{x} + \Delta\mathbf{x}) \approx \mathbf{f}(\mathbf{x}) + J^T \Delta\mathbf{x}$ (note the transpose — a consequence of denominator layout).

Key examples:

| $\mathbf{f}(\mathbf{x})$ | Jacobian $\frac{\partial \mathbf{f}}{\partial \mathbf{x}}$ |
|---|---|
| $A\mathbf{x}$ (linear) | $A^T$ |
| $\mathbf{x}^T A$ | $A$ |
| $\sigma(\mathbf{x})$ (elementwise sigmoid) | $\operatorname{diag}(\sigma(\mathbf{x}) \odot (1 - \sigma(\mathbf{x})))$ |
| $\operatorname{softmax}(\mathbf{x})$ | $\operatorname{diag}(\mathbf{p}) - \mathbf{p}\mathbf{p}^T$, where $\mathbf{p} = \operatorname{softmax}(\mathbf{x})$ |
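The softmax row is worth verifying by hand at least once. A small NumPy check of $\operatorname{diag}(\mathbf{p}) - \mathbf{p}\mathbf{p}^T$ against central differences (the softmax Jacobian is symmetric, so the layout convention does not change the answer here):

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())   # shift for numerical stability
    return z / z.sum()

x = np.array([0.5, -1.0, 2.0])
p = softmax(x)

# Closed form from the table: diag(p) - p p^T
J_analytic = np.diag(p) - np.outer(p, p)

# Finite-difference Jacobian, one input coordinate at a time
h = 1e-6
J_numeric = np.column_stack([
    (softmax(x + h * e) - softmax(x - h * e)) / (2 * h)
    for e in np.eye(3)
])

print(np.max(np.abs(J_analytic - J_numeric)))
```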

The Chain Rule

For composed functions $f(\mathbf{x}) = g(h(\mathbf{x}))$ where $\mathbf{x} \in \mathbb{R}^n$, $h : \mathbb{R}^n \to \mathbb{R}^k$, $g : \mathbb{R}^k \to \mathbb{R}$:

$$\nabla_\mathbf{x} f = J_h \cdot \nabla_\mathbf{u} g \,\big|_{\mathbf{u} = h(\mathbf{x})},$$

where $J_h = \frac{\partial h}{\partial \mathbf{x}} \in \mathbb{R}^{n \times k}$ is the Jacobian of $h$ (denominator layout).

For the full vector-to-vector composition $\mathbf{f} = \mathbf{g} \circ h$:

$$\frac{\partial \mathbf{f}}{\partial \mathbf{x}} = \frac{\partial h}{\partial \mathbf{x}} \cdot \frac{\partial \mathbf{f}}{\partial h}.$$

This is precisely the rule that backpropagation implements: accumulate Jacobians layer by layer from output to input.
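As a concrete check of the composition rule, take $h(\mathbf{x}) = A\mathbf{x}$ and $g(\mathbf{u}) = \|\mathbf{u}\|^2$, so the chain rule predicts $\nabla_\mathbf{x} f = J_h \nabla g = A^T \cdot 2A\mathbf{x}$ — a minimal NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 4))   # h(x) = A x,  h: R^4 -> R^3
x = rng.standard_normal(4)

# f(x) = g(h(x)) with g(u) = ||u||^2
f = lambda x: np.sum((A @ x) ** 2)

# Chain rule in denominator layout: J_h = A^T, grad g(u) = 2u
grad_chain = A.T @ (2 * (A @ x))

# Finite-difference gradient for comparison
h = 1e-6
grad_numeric = np.array([(f(x + h*e) - f(x - h*e)) / (2*h) for e in np.eye(4)])

print(np.max(np.abs(grad_chain - grad_numeric)))
```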

Gradients of Matrix Functions

For functions of matrices $A \in \mathbb{R}^{m \times n}$, define $\frac{\partial f}{\partial A} \in \mathbb{R}^{m \times n}$ with $\left(\frac{\partial f}{\partial A}\right)_{ij} = \frac{\partial f}{\partial A_{ij}}$.

Key identities:

| Function $f(A)$ | Gradient $\frac{\partial f}{\partial A}$ |
|---|---|
| $\operatorname{tr}(A^TB)$ | $B$ |
| $\operatorname{tr}(AB)$ | $B^T$ |
| $\operatorname{tr}(ABA^T)$ | $A(B + B^T)$ |
| $\operatorname{tr}(A^TA)$ | $2A$ |
| $\log\det(A)$ ($A$ symmetric PD) | $A^{-1}$ |
| $\det(A)$ | $\det(A) \cdot A^{-T}$ |
| $\|A\|_F^2$ | $2A$ |

Derivation: $\frac{\partial}{\partial A}\log\det(A)$. Use Jacobi's formula: $d(\det A) = \det(A)\operatorname{tr}(A^{-1}\,dA)$. Then:

$$d(\log\det A) = \operatorname{tr}(A^{-1}dA) \implies \frac{\partial \log\det A}{\partial A_{ij}} = (A^{-1})_{ji} \implies \nabla_A \log\det(A) = A^{-T}.$$

For symmetric $A$: $\nabla_A \log\det(A) = A^{-1}$. This gradient appears in the MLE for a Gaussian model's covariance matrix.
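A quick numerical confirmation of the $\log\det$ gradient on a symmetric positive definite matrix (entrywise central differences over the standard basis matrices $E_{ij}$):

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.standard_normal((3, 3))
A = M @ M.T + 3 * np.eye(3)        # symmetric positive definite

f = lambda A: np.log(np.linalg.det(A))

grad_analytic = np.linalg.inv(A)   # A^{-1} = A^{-T} since A is symmetric

# Perturb each entry independently and difference
h = 1e-6
grad_numeric = np.zeros_like(A)
for i in range(3):
    for j in range(3):
        E = np.zeros((3, 3)); E[i, j] = 1.0
        grad_numeric[i, j] = (f(A + h*E) - f(A - h*E)) / (2*h)

print(np.max(np.abs(grad_analytic - grad_numeric)))
```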

The Hessian

The Hessian of $f : \mathbb{R}^n \to \mathbb{R}$ is the matrix of second partial derivatives:

$$H_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}, \qquad H = \nabla^2 f \in \mathbb{R}^{n \times n}.$$

By Clairaut's theorem (for twice continuously differentiable $f$), $H_{ij} = H_{ji}$ — the Hessian is symmetric.

Taylor expansion. Near $\mathbf{x}_0$:

$$f(\mathbf{x}_0 + \Delta\mathbf{x}) = f(\mathbf{x}_0) + \nabla f(\mathbf{x}_0)^T\Delta\mathbf{x} + \tfrac{1}{2}\Delta\mathbf{x}^T H(\mathbf{x}_0)\,\Delta\mathbf{x} + O(\|\Delta\mathbf{x}\|^3).$$

Newton's method uses this: minimizing the quadratic approximation gives the update $\Delta\mathbf{x}^* = -H^{-1}\nabla f$. This requires solving an $n \times n$ linear system (or approximating the Hessian) at every step.
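On a quadratic the second-order model is exact, so a single Newton step lands on the minimizer. A minimal NumPy sketch (note it solves the linear system rather than forming $H^{-1}$ explicitly):

```python
import numpy as np

rng = np.random.default_rng(3)
M = rng.standard_normal((5, 5))
A = M @ M.T + 5 * np.eye(5)        # SPD Hessian
b = rng.standard_normal(5)

# f(x) = 0.5 x^T A x - b^T x;  grad f = A x - b;  Hessian = A
grad = lambda x: A @ x - b

x = rng.standard_normal(5)                 # arbitrary starting point
x_new = x - np.linalg.solve(A, grad(x))    # Newton step: solve H dx = grad

# One step reaches the stationary point A x = b
print(np.max(np.abs(grad(x_new))))
```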

Automatic Differentiation: Forward and Reverse Mode

Modern deep learning frameworks (PyTorch, JAX, TensorFlow) compute gradients via automatic differentiation (AD), not symbolic calculus or finite differences.

Forward mode AD. Propagate derivative information forward through the computation graph, accumulating the Jacobian one input at a time. Cost: $O(n)$ forward passes for a function $\mathbb{R}^n \to \mathbb{R}^m$ — efficient when $n \ll m$.

Reverse mode AD (backpropagation). Propagate gradient information backward through the graph, accumulating the gradient one output at a time. Cost: $O(m)$ backward passes — efficient when $m \ll n$. For neural network training, $m = 1$ (scalar loss) and $n$ is the number of parameters (millions to billions), so reverse mode is overwhelmingly preferred.

Computational complexity. For $f : \mathbb{R}^n \to \mathbb{R}$, reverse mode computes the full gradient $\nabla f \in \mathbb{R}^n$ in $O(\text{cost of computing } f)$ — essentially for free relative to the forward pass. This is why gradient descent scales to billions of parameters.

Jacobian–vector products (JVPs) and vector–Jacobian products (VJPs). Forward mode computes JVPs ($J\mathbf{v}$ for a vector $\mathbf{v}$), reverse mode computes VJPs ($\mathbf{u}^T J$ for a covector $\mathbf{u}$). These are the primitives that AD engines expose and compose.
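For the linear map $\mathbf{f}(\mathbf{x}) = A\mathbf{x}$ the two primitives are simply $J\mathbf{v} = A\mathbf{v}$ and $\mathbf{u}^TJ = (A^T\mathbf{u})^T$. A small NumPy sketch verifying both against finite differences:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((3, 5))
f = lambda x: A @ x                # f: R^5 -> R^3, Jacobian (numerator layout) = A

x = rng.standard_normal(5)
v = rng.standard_normal(5)         # input-space direction (for the JVP)
u = rng.standard_normal(3)         # output-space covector (for the VJP)

# JVP via a single directional finite difference -- one "forward pass"
h = 1e-6
jvp_numeric = (f(x + h*v) - f(x - h*v)) / (2*h)
print(np.allclose(jvp_numeric, A @ v, atol=1e-6))    # Jv = Av

# VJP is the transpose action: u^T J = (A^T u)^T.  Equivalently, it is the
# gradient of the scalar function g(x) = u . f(x).
vjp = A.T @ u
g = lambda x: u @ f(x)
grad_numeric = np.array([(g(x + h*e) - g(x - h*e)) / (2*h) for e in np.eye(5)])
print(np.allclose(grad_numeric, vjp, atol=1e-6))
```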

Worked Examples

Example 1: Deriving the OLS Gradient

Loss $\mathcal{L}(\mathbf{w}) = \|A\mathbf{w} - \mathbf{b}\|^2 = \mathbf{w}^TA^TA\mathbf{w} - 2\mathbf{b}^TA\mathbf{w} + \mathbf{b}^T\mathbf{b}$.

Using the table identities:

  • $\nabla_\mathbf{w}(\mathbf{w}^TA^TA\mathbf{w}) = 2A^TA\mathbf{w}$ (since $A^TA$ is symmetric)
  • $\nabla_\mathbf{w}(-2\mathbf{b}^TA\mathbf{w}) = -2A^T\mathbf{b}$
  • $\nabla_\mathbf{w}(\mathbf{b}^T\mathbf{b}) = \mathbf{0}$

Setting $\nabla_\mathbf{w}\mathcal{L} = 2A^TA\mathbf{w} - 2A^T\mathbf{b} = \mathbf{0}$ recovers the normal equations $A^TA\mathbf{w} = A^T\mathbf{b}$.
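A quick NumPy check that solving the normal equations matches the library least-squares solver (forming $A^TA$ squares the condition number, so np.linalg.lstsq is preferred in practice):

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((50, 3))    # overdetermined system
b = rng.standard_normal(50)

# Solve the normal equations A^T A w = A^T b derived above
w_normal = np.linalg.solve(A.T @ A, A.T @ b)

# Cross-check against the library least-squares solver
w_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.max(np.abs(w_normal - w_lstsq)))

# At the solution, the gradient 2 A^T (A w - b) vanishes
print(np.max(np.abs(A.T @ (A @ w_normal - b))))
```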

Example 2: Gaussian MLE for $\boldsymbol{\mu}$ and $\Sigma$

Log-likelihood for $N$ iid observations from $\mathcal{N}(\boldsymbol{\mu}, \Sigma)$:

$$\ell(\boldsymbol{\mu}, \Sigma) = -\frac{N}{2}\log\det(\Sigma) - \frac{1}{2}\sum_{i=1}^N (\mathbf{x}_i - \boldsymbol{\mu})^T\Sigma^{-1}(\mathbf{x}_i - \boldsymbol{\mu}) + \text{const.}$$

Gradient w.r.t. $\boldsymbol{\mu}$:

$$\nabla_{\boldsymbol{\mu}} \ell = \Sigma^{-1}\sum_{i=1}^N (\mathbf{x}_i - \boldsymbol{\mu}) = \mathbf{0} \implies \hat{\boldsymbol{\mu}} = \frac{1}{N}\sum_{i=1}^N \mathbf{x}_i.$$

Gradient w.r.t. $\Sigma$ (using $\nabla_\Sigma \log\det\Sigma = \Sigma^{-1}$ and $\nabla_\Sigma \operatorname{tr}(\Sigma^{-1}B) = -\Sigma^{-1}B\Sigma^{-1}$):

$$\nabla_\Sigma \ell = \mathbf{0} \implies \hat{\Sigma} = \frac{1}{N}\sum_{i=1}^N (\mathbf{x}_i - \hat{\boldsymbol{\mu}})(\mathbf{x}_i - \hat{\boldsymbol{\mu}})^T.$$

The sample covariance is the MLE for $\Sigma$. Both results follow purely from matrix calculus identities.
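Both results can be confirmed numerically: at the MLE, the finite-difference gradient of the log-likelihood with respect to $\boldsymbol{\mu}$ vanishes, and $\hat\Sigma$ matches the biased sample covariance. A sketch on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.standard_normal((200, 2)) @ np.array([[2.0, 0.3], [0.3, 1.0]])

N = len(X)
mu_hat = X.mean(axis=0)
Sigma_hat = (X - mu_hat).T @ (X - mu_hat) / N    # MLE divides by N, not N-1

def loglik(mu, Sigma):
    d = X - mu
    Sinv = np.linalg.inv(Sigma)
    return (-N/2 * np.log(np.linalg.det(Sigma))
            - 0.5 * np.sum((d @ Sinv) * d))

# Finite-difference gradient w.r.t. mu at the MLE should vanish
h = 1e-6
g = np.array([
    (loglik(mu_hat + h*e, Sigma_hat) - loglik(mu_hat - h*e, Sigma_hat)) / (2*h)
    for e in np.eye(2)
])
print(np.max(np.abs(g)))

# The MLE covariance equals the biased sample covariance
print(np.max(np.abs(Sigma_hat - np.cov(X.T, bias=True))))
```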

Example 3: Backpropagation as Chain Rule

A single fully connected layer: $\mathbf{y} = \sigma(W\mathbf{x} + \mathbf{b})$, where $W \in \mathbb{R}^{m \times n}$.

Let $\boldsymbol{\delta} = \frac{\partial \mathcal{L}}{\partial \mathbf{y}} \in \mathbb{R}^m$ be the upstream gradient.

Chain rule:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{z}} = \boldsymbol{\delta} \odot \sigma'(\mathbf{z}), \qquad \mathbf{z} = W\mathbf{x} + \mathbf{b}.$$

$$\frac{\partial \mathcal{L}}{\partial W} = \frac{\partial \mathcal{L}}{\partial \mathbf{z}}\,\mathbf{x}^T \in \mathbb{R}^{m \times n}, \qquad \frac{\partial \mathcal{L}}{\partial \mathbf{b}} = \frac{\partial \mathcal{L}}{\partial \mathbf{z}}, \qquad \frac{\partial \mathcal{L}}{\partial \mathbf{x}} = W^T \frac{\partial \mathcal{L}}{\partial \mathbf{z}}.$$

These three expressions are exactly what a layer's backward pass computes. The $W^T$ in the final one (the gradient flows backward through the transpose) is the signature of the chain rule applied to the linear map $\mathbf{z} = W\mathbf{x}$, whose Jacobian (in denominator layout) is $W^T$.
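The three formulas translate directly into a layer's backward pass. A minimal NumPy sketch with a toy scalar loss (the sum of the layer's outputs, so $\boldsymbol{\delta} = \mathbf{1}$), checked against finite differences on $W$:

```python
import numpy as np

rng = np.random.default_rng(7)
m, n = 3, 4
W = rng.standard_normal((m, n))
b = rng.standard_normal(m)
x = rng.standard_normal(n)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def loss(W, x, b):
    # Toy scalar loss: sum of the layer's outputs
    return np.sum(sigmoid(W @ x + b))

# Backward pass: exactly the three formulas above
z = W @ x + b
y = sigmoid(z)
delta = np.ones(m)            # dL/dy for L = sum(y)
dz = delta * y * (1 - y)      # dL/dz = delta * sigma'(z), sigma'(z) = y(1-y)
dW = np.outer(dz, x)          # dL/dW = dz x^T
db = dz                       # dL/db = dz
dx = W.T @ dz                 # dL/dx = W^T dz

# Finite-difference check on dW
h = 1e-6
dW_num = np.zeros_like(W)
for i in range(m):
    for j in range(n):
        E = np.zeros_like(W); E[i, j] = 1.0
        dW_num[i, j] = (loss(W + h*E, x, b) - loss(W - h*E, x, b)) / (2*h)

print(np.max(np.abs(dW - dW_num)))
```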

Connections

Where Your Intuition Breaks

The gradient of a scalar function of a matrix, $f(W)$, is a matrix of the same shape as $W$. This feels natural, but the layout convention determines the exact form of every identity. The most common error in manual backpropagation is a transposition mistake — the gradient of $f(W) = \mathbf{u}^T W \mathbf{v}$ is $\nabla_W f = \mathbf{u}\mathbf{v}^T$ (an outer product), not $\mathbf{v}\mathbf{u}^T$. The arrangement follows from the convention that the gradient has the shape of $W$, not from any deep mathematical reason. When deriving gradients by hand, the safest check is always the finite-difference approximation: $(\nabla_W f)_{ij} \approx (f(W + \epsilon E_{ij}) - f(W - \epsilon E_{ij})) / (2\epsilon)$ for each standard basis matrix $E_{ij}$.

Numerically Checking Gradients

Before trusting an analytically derived gradient, verify it numerically. The finite difference approximation:

$$\frac{\partial f}{\partial x_i} \approx \frac{f(\mathbf{x} + h\mathbf{e}_i) - f(\mathbf{x} - h\mathbf{e}_i)}{2h}, \qquad h \approx 10^{-5}.$$

Gradient check: compute the relative difference between analytical and numerical gradients:

$$\text{relative error} = \frac{\|\nabla_{\text{analytic}} - \nabla_{\text{numeric}}\|}{\|\nabla_{\text{analytic}}\| + \|\nabla_{\text{numeric}}\|}.$$

Acceptable: $< 10^{-5}$. Red flag: $> 10^{-3}$.
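This recipe fits in a few lines. A sketch of a reusable checker (grad_check is a hypothetical helper, not a library function), tried on an identity from the gradient table:

```python
import numpy as np

def grad_check(f, grad_f, x, h=1e-5):
    """Relative error between analytic and central-difference gradients."""
    g_analytic = grad_f(x)
    g_numeric = np.array([
        (f(x + h*e) - f(x - h*e)) / (2*h) for e in np.eye(len(x))
    ])
    num = np.linalg.norm(g_analytic - g_numeric)
    den = np.linalg.norm(g_analytic) + np.linalg.norm(g_numeric)
    return num / den

# Example: f(x) = ||x - a||^2 has gradient 2(x - a) per the identities table
a = np.array([1.0, -2.0, 0.5])
f = lambda x: np.sum((x - a) ** 2)
grad_f = lambda x: 2 * (x - a)

err = grad_check(f, grad_f, np.array([0.3, 0.7, -1.1]))
print(err < 1e-5)   # within the acceptance threshold
```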

⚠️Warning

Layout-convention bugs are the most common matrix calculus error. If your gradient has the wrong shape (it should be $n \times 1$ but you get $1 \times n$), you have a layout mismatch. Establish a single convention at the start of every derivation and use it consistently. PyTorch follows the denominator-layout convention for .grad attributes — gradients have the same shape as their parameters.

💡Intuition

The gradient points perpendicular to level curves. For $f(\mathbf{x}) = \mathbf{x}^TA\mathbf{x}$, the level curve through $\mathbf{x}_0$ is $\{\mathbf{x} : f(\mathbf{x}) = f(\mathbf{x}_0)\}$. The gradient $\nabla f(\mathbf{x}_0) = 2A\mathbf{x}_0$ (for symmetric $A$) is perpendicular to this curve: moving along the level curve doesn't change $f$, so the directional derivative in any tangent direction is zero, and the gradient is the unique direction orthogonal to all of them (up to scale). Gradient descent follows $-\nabla f$, the direction of steepest descent.

💡Intuition

Second-order methods: Newton's method vs gradient descent. Gradient descent step: $\mathbf{x} \leftarrow \mathbf{x} - \eta \nabla f$. Newton step: $\mathbf{x} \leftarrow \mathbf{x} - H^{-1}\nabla f$. Newton's method uses the Hessian to adapt the step to the local curvature — it converges in one step on quadratics, and quadratically near the optimum for smooth functions. The cost: forming and inverting an $n \times n$ Hessian is $O(n^3)$. For $n = 10^8$ parameters this is completely infeasible. Quasi-Newton and adaptive methods (L-BFGS, Adagrad, Adam) approximate curvature information cheaply, capturing some of its benefit at roughly $O(n)$ cost per step.

Connecting Matrix Calculus to the Module

This lesson closes the linear algebra arc:

| Topic | Matrix calculus connection |
|---|---|
| Eigenvalues (Lesson 3) | $\lambda_{\max} = \max_{\|\mathbf{x}\|=1} \mathbf{x}^TA\mathbf{x}$ — setting the Lagrangian's gradient to zero gives $A\mathbf{x} = \lambda\mathbf{x}$ |
| Spectral Theorem (Lesson 4) | Hessian of $f=\mathbf{x}^TA\mathbf{x}$ is $2A$ — its eigenvectors are the principal curvature directions |
| SVD (Lesson 5) | $\sigma_1 = \max_{\|\mathbf{x}\|=1}\|A\mathbf{x}\|$ — the Lagrangian of $\|A\mathbf{x}\|^2$ gives the left/right singular vector equations |
| PSD (Lesson 6) | $A \succeq 0 \iff f(\mathbf{x})=\mathbf{x}^TA\mathbf{x}$ is convex $\iff H(f) \succeq 0$ |
| Least squares (Lesson 7) | Normal equations = gradient of $\|A\mathbf{x}-\mathbf{b}\|^2$ set to zero |

Bridge to Multivariate Calculus

Module 04 (Multivariate Calculus & Differential Geometry) extends these ideas to nonlinear functions: the gradient becomes a covector on a manifold, the Hessian becomes the second fundamental form, and the chain rule becomes the pullback of differential forms. For machine learning, the key extension is to Riemannian optimization — gradient descent on curved parameter spaces (e.g., optimization over orthogonal matrices or probability simplices), where the Euclidean gradient must be corrected by the metric tensor.
