
Differentiation in Rⁿ: Jacobians, Hessians & the Chain Rule

Differentiation in $\mathbb{R}^n$ is not just about partial derivatives — it is about finding the best linear approximation to a smooth map at each point. The Jacobian captures this approximation in matrix form, the Hessian captures second-order curvature, and the chain rule tells how these approximations compose. Together they are the machinery that makes backpropagation, implicit differentiation, and second-order optimization tractable.

Concepts

Interactive figure: Jacobian approximation of a polar-like map, with a movable point in the domain (left) and its image under $f$ (right). Purple = nonlinear image; green dashed = Jacobian linearization at the query point. Near any point the nonlinear wrap linearizes to a rotation plus scaling, and as you zoom in the two grids become indistinguishable — this is what differentiability means.

When you zoom in on any smooth curve in calculus, it starts to look like a straight line — that is the core idea of the derivative. The same thing happens in $\mathbb{R}^n$: zoom in on a smooth map and it looks like a linear map, represented by a matrix. That matrix is the Jacobian, and it is the fundamental object that makes backpropagation work. Every time PyTorch propagates gradients through a layer, it is multiplying Jacobians in the reverse direction — the chain rule in matrix form.

The Total Derivative

The fundamental concept generalizing "derivative" to maps $f : \mathbb{R}^n \to \mathbb{R}^m$ is the total derivative (or Fréchet derivative).

Definition. $f$ is differentiable at $\mathbf{x}_0$ if there exists a linear map $Df(\mathbf{x}_0) : \mathbb{R}^n \to \mathbb{R}^m$ such that

$$\lim_{\mathbf{h} \to \mathbf{0}} \frac{\|f(\mathbf{x}_0 + \mathbf{h}) - f(\mathbf{x}_0) - Df(\mathbf{x}_0)\mathbf{h}\|}{\|\mathbf{h}\|} = 0.$$

In words: $f(\mathbf{x}_0 + \mathbf{h}) \approx f(\mathbf{x}_0) + Df(\mathbf{x}_0)\mathbf{h}$ to first order in $\mathbf{h}$. The matrix representing $Df(\mathbf{x}_0)$ in the standard basis is the Jacobian:

$$J_f(\mathbf{x}_0) = Df(\mathbf{x}_0) \in \mathbb{R}^{m \times n}, \qquad (J_f)_{ij} = \frac{\partial f_i}{\partial x_j}\bigg|_{\mathbf{x}_0}.$$

The limit condition says: the error in the linear approximation must vanish faster than $\|\mathbf{h}\|$ as $\mathbf{h} \to \mathbf{0}$. This is stronger than merely requiring partial derivatives to exist — the approximation must work simultaneously in all directions, not just along coordinate axes. That is why partial derivatives alone are not enough: a function can have every directional derivative at a point yet still fail to be locally linear there.

Existence. If all partial derivatives $\partial f_i / \partial x_j$ exist and are continuous near $\mathbf{x}_0$, then $f$ is differentiable at $\mathbf{x}_0$ (the converse is false: a differentiable function need not have continuous partials). Mere existence of the partials, without continuity, does not guarantee differentiability.

Non-example. $f(x,y) = xy/(x^2+y^2)$ for $(x,y) \neq (0,0)$, $f(0,0)=0$. Both partial derivatives at the origin are $0$, but the function is not continuous at the origin (approaching along $y=x$ gives $1/2$). No linear map can approximate it.
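The defining limit can be checked numerically. A minimal sketch (the polar-style map and the base point are illustrative choices, not from the text): compute the analytic Jacobian and watch the relative error of the linear approximation shrink as $\|\mathbf{h}\| \to 0$.

```python
import numpy as np

def f(p):
    """A smooth R^2 -> R^2 map (illustrative choice)."""
    x, y = p
    return np.array([x * np.cos(y), x * np.sin(y)])

def jacobian(p):
    """Analytic Jacobian of f at p."""
    x, y = p
    return np.array([[np.cos(y), -x * np.sin(y)],
                     [np.sin(y),  x * np.cos(y)]])

x0 = np.array([1.2, 0.7])
J = jacobian(x0)

# The defining limit: ||f(x0+h) - f(x0) - J h|| / ||h|| -> 0 as h -> 0.
rng = np.random.default_rng(0)
direction = rng.normal(size=2)
direction /= np.linalg.norm(direction)

errs = []
for t in [1e-1, 1e-2, 1e-3]:
    h = t * direction
    err = np.linalg.norm(f(x0 + h) - f(x0) - J @ h) / np.linalg.norm(h)
    errs.append(err)
    print(f"||h|| = {t:.0e}  relative error = {err:.2e}")
```

The error decays roughly linearly in $\|\mathbf{h}\|$, because the leftover term is second order.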

Directional Derivatives

The directional derivative of $f : \mathbb{R}^n \to \mathbb{R}$ at $\mathbf{x}$ in the direction of a unit vector $\mathbf{v}$:

$$D_{\mathbf{v}} f(\mathbf{x}) = \lim_{t \to 0} \frac{f(\mathbf{x} + t\mathbf{v}) - f(\mathbf{x})}{t} = \nabla f(\mathbf{x}) \cdot \mathbf{v}.$$

Steepest ascent direction: $\arg\max_{\|\mathbf{v}\|=1} D_\mathbf{v} f = \nabla f / \|\nabla f\|$. The gradient direction maximizes the rate of increase — this is why gradient descent moves in the negative gradient direction.

Cauchy–Schwarz bound: $|D_\mathbf{v} f| \leq \|\nabla f\| \cdot \|\mathbf{v}\| = \|\nabla f\|$, with equality when $\mathbf{v} \parallel \nabla f$.
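A quick numerical check of $D_\mathbf{v} f = \nabla f \cdot \mathbf{v}$, using a hypothetical quadratic $f(x_1, x_2) = x_1^2 + 3x_1x_2$ (not from the text):

```python
import numpy as np

def f(x):
    return x[0]**2 + 3*x[0]*x[1]

def grad_f(x):
    # Hand-computed gradient of f.
    return np.array([2*x[0] + 3*x[1], 3*x[0]])

x = np.array([1.0, -0.5])
v = np.array([3.0, 4.0]) / 5.0          # unit vector

t = 1e-6
numeric = (f(x + t*v) - f(x - t*v)) / (2*t)   # central difference
analytic = grad_f(x) @ v
print(numeric, analytic)

# Cauchy-Schwarz: the maximal directional derivative is ||grad f||,
# attained in the direction grad f / ||grad f||.
g = grad_f(x)
print(np.linalg.norm(g))
```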

The Chain Rule in Multiple Dimensions

For composable smooth maps $g : \mathbb{R}^k \to \mathbb{R}^m$ and $f : \mathbb{R}^n \to \mathbb{R}^k$:

$$D(g \circ f)(\mathbf{x}) = Dg(f(\mathbf{x})) \circ Df(\mathbf{x}),$$

or in matrix form (numerator layout):

$$J_{g \circ f}(\mathbf{x}) = J_g(f(\mathbf{x})) \cdot J_f(\mathbf{x}) \in \mathbb{R}^{m \times n}.$$

This is matrix multiplication of Jacobians. Backpropagation is exactly this, applied to a deep composition $f = f_L \circ f_{L-1} \circ \cdots \circ f_1$:

$$J_{f}(\mathbf{x}) = J_{f_L} \cdot J_{f_{L-1}} \cdots J_{f_1}.$$

Reverse-mode AD evaluates $\mathbf{u}^T J_f = ((\mathbf{u}^T J_{f_L}) J_{f_{L-1}}) \cdots J_{f_1}$, propagating a row vector backward through the layers using only vector–matrix products — never forming a full Jacobian matrix.
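A minimal sketch of this factored product, with linear layers so each Jacobian is just a matrix (the shapes are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
J1 = rng.normal(size=(4, 3))   # f1: R^3 -> R^4
J2 = rng.normal(size=(5, 4))   # f2: R^4 -> R^5
J3 = rng.normal(size=(2, 5))   # f3: R^5 -> R^2

# Full Jacobian of f3 ∘ f2 ∘ f1 (2x3): product of the factors in reverse order.
J = J3 @ J2 @ J1

# Reverse-mode: seed a row vector u^T at the output and push it backward.
# Only three vector-matrix products; the 2x3 Jacobian is never materialized.
u = rng.normal(size=2)
vjp = ((u @ J3) @ J2) @ J1
print(vjp)
```

The same associativity argument is why reverse mode is cheap when the output dimension is small (one loss scalar) and forward mode is cheap when the input dimension is small.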

Higher-Order Derivatives: the Hessian

For $f : \mathbb{R}^n \to \mathbb{R}$, the Hessian is $H = \nabla^2 f \in \mathbb{R}^{n \times n}$ with $H_{ij} = \partial^2 f / \partial x_i \partial x_j$.

Second-order Taylor expansion around $\mathbf{x}_0$:

$$f(\mathbf{x}_0 + \mathbf{h}) = f(\mathbf{x}_0) + \nabla f(\mathbf{x}_0)^T\mathbf{h} + \frac{1}{2}\mathbf{h}^T H(\mathbf{x}_0) \mathbf{h} + O(\|\mathbf{h}\|^3).$$

Symmetry. By the Clairaut–Schwarz theorem, if all second partials are continuous then $H_{ij} = H_{ji}$ — the Hessian is symmetric.

Critical point classification:

  • $\nabla f(\mathbf{x}^*) = \mathbf{0}$ and $H(\mathbf{x}^*) \succ 0$: strict local minimum
  • $\nabla f(\mathbf{x}^*) = \mathbf{0}$ and $H(\mathbf{x}^*) \prec 0$: strict local maximum
  • $\nabla f(\mathbf{x}^*) = \mathbf{0}$ and $H(\mathbf{x}^*)$ indefinite: saddle point
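The classification above can be sketched as an eigenvalue test. The example function $f(x,y) = x^2 - y^2$ is a standard saddle (my choice of example, not from the text):

```python
import numpy as np

# Hessian of f(x, y) = x^2 - y^2 (constant, since f is quadratic).
# The critical point is the origin, where grad f = 0.
H = np.array([[2.0, 0.0],
              [0.0, -2.0]])

eigvals = np.linalg.eigvalsh(H)   # eigvalsh: H is symmetric
if np.all(eigvals > 0):
    kind = "strict local minimum"
elif np.all(eigvals < 0):
    kind = "strict local maximum"
elif np.any(eigvals > 0) and np.any(eigvals < 0):
    kind = "saddle point"
else:
    kind = "inconclusive (semidefinite)"
print(kind)   # saddle point
```

Note the fourth branch: when some eigenvalues are exactly zero the second-order test is inconclusive and higher-order terms decide.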

The Jacobian Determinant and Change of Variables

For $f : \mathbb{R}^n \to \mathbb{R}^n$ (same dimension domain and range), the Jacobian determinant $\det J_f(\mathbf{x})$ measures the factor by which $f$ scales $n$-dimensional volume near $\mathbf{x}$:

$$\text{volume}(f(S)) \approx |\det J_f(\mathbf{x})| \cdot \text{volume}(S)$$

for small regions $S$ containing $\mathbf{x}$.

Change of variables formula. For an integral and a smooth bijection $\mathbf{y} = f(\mathbf{x})$:

$$\int_{f(U)} g(\mathbf{y}) \, d\mathbf{y} = \int_U g(f(\mathbf{x})) \, |\det J_f(\mathbf{x})| \, d\mathbf{x}.$$

This is the multivariable substitution rule. The Jacobian determinant is the "stretching factor" that corrects for the change of coordinates.
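The classic instance is polar coordinates, $f(r,\theta) = (r\cos\theta, r\sin\theta)$ with $\det J_f = r$. A numerical sketch: integrate $g(x,y) = e^{-x^2-y^2}$ over the disk of radius $R$ by pulling back to $(r,\theta)$; the exact answer is $\pi(1 - e^{-R^2})$.

```python
import numpy as np

R = 2.0
n = 1000
dr = R / n
r_mid = (np.arange(n) + 0.5) * dr       # midpoint rule in r

# Pullback integrand g(f(r, θ)) * |det J_f| = exp(-r^2) * r.
# It is independent of θ here, so the θ-integral contributes a factor 2π.
numeric = 2*np.pi * np.sum(np.exp(-r_mid**2) * r_mid) * dr
exact = np.pi * (1 - np.exp(-R**2))
print(numeric, exact)
```

Without the $|\det J_f| = r$ factor the quadrature converges to the wrong value; the determinant is exactly what makes the substitution area-preserving.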

Normalizing flows in generative modeling use this formula explicitly: they learn a probability density $p(\mathbf{x})$ by pushing a simple base density $p_z(\mathbf{z})$ through an invertible map $\mathbf{x} = f(\mathbf{z})$:

$$\log p(\mathbf{x}) = \log p_z(f^{-1}(\mathbf{x})) - \log |\det J_f(f^{-1}(\mathbf{x}))|.$$

The log-determinant of the Jacobian is the key term; architectures like RealNVP and Glow are designed so that $\det J_f$ is cheap to compute.

Inverse and Implicit Function Theorems

Inverse Function Theorem. If $f : \mathbb{R}^n \to \mathbb{R}^n$ is $C^1$ near $\mathbf{x}_0$ and $\det J_f(\mathbf{x}_0) \neq 0$, then $f$ is locally invertible near $\mathbf{x}_0$, and the Jacobian of the local inverse is $J_{f^{-1}}(f(\mathbf{x}_0)) = J_f(\mathbf{x}_0)^{-1}$.

Implicit Function Theorem. If $F(\mathbf{x}, \mathbf{y}) = \mathbf{0}$ defines $\mathbf{y}$ implicitly as a function of $\mathbf{x}$ near $(\mathbf{x}_0, \mathbf{y}_0)$, and $\det \partial F / \partial \mathbf{y} \neq 0$ at $(\mathbf{x}_0, \mathbf{y}_0)$, then locally $\mathbf{y} = g(\mathbf{x})$ for a smooth $g$, with:

$$\frac{\partial g}{\partial \mathbf{x}} = -\left(\frac{\partial F}{\partial \mathbf{y}}\right)^{-1} \frac{\partial F}{\partial \mathbf{x}}.$$

Implicit differentiation through optimization. If $\hat{\mathbf{y}}(\mathbf{x}) = \arg\min_\mathbf{y} L(\mathbf{x}, \mathbf{y})$ and the optimality condition $\nabla_\mathbf{y} L = \mathbf{0}$ defines $\hat{\mathbf{y}}$ implicitly, then applying the theorem with $F = \nabla_\mathbf{y} L$:

$$\frac{d\hat{\mathbf{y}}}{d\mathbf{x}} = -\left(\nabla^2_{\mathbf{y}\mathbf{y}} L\right)^{-1} \nabla^2_{\mathbf{y}\mathbf{x}} L,$$

where $\nabla^2_{\mathbf{y}\mathbf{x}} L$ denotes the Jacobian of $\nabla_\mathbf{y} L$ with respect to $\mathbf{x}$.

This is implicit differentiation through a solver — the foundation of bilevel optimization, hyperparameter optimization via implicit gradients, and meta-learning.
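A checkable sketch with a quadratic inner problem, where everything has a closed form (the setup $L(\mathbf{x},\mathbf{y}) = \frac{1}{2}\mathbf{y}^T A \mathbf{y} + \mathbf{x}^T B \mathbf{y}$ is a hypothetical example, not from the text):

```python
import numpy as np

# L(x, y) = 1/2 y^T A y + x^T B y,  A symmetric positive definite.
# Optimality: grad_y L = A y + B^T x = 0  =>  y_hat(x) = -A^{-1} B^T x,
# so dy_hat/dx = -A^{-1} B^T = -(H_yy)^{-1} H_yx with H_yy = A, H_yx = B^T.
rng = np.random.default_rng(0)
M = rng.normal(size=(3, 3))
A = M @ M.T + 3*np.eye(3)          # SPD Hessian of the inner problem
B = rng.normal(size=(2, 3))        # cross term; H_yx = B^T has shape (3, 2)

def y_hat(x):
    """Inner argmin, solved exactly via the optimality condition."""
    return np.linalg.solve(A, -B.T @ x)

# Implicit-gradient formula.
dy_dx = -np.linalg.solve(A, B.T)   # shape (3, 2)

# Finite-difference check of the first column.
x = rng.normal(size=2)
t = 1e-6
e0 = np.array([1.0, 0.0])
fd_col0 = (y_hat(x + t*e0) - y_hat(x - t*e0)) / (2*t)
print(fd_col0, dy_dx[:, 0])
```

In real bilevel problems the inner argmin comes from an iterative solver, but the same formula applies: one linear solve against the inner Hessian replaces differentiating through the solver's iterations.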

Worked Example

Example 1: Jacobian of Softmax

For $\boldsymbol{\pi} = \operatorname{softmax}(\mathbf{z})$ with $\pi_i = e^{z_i} / \sum_j e^{z_j}$:

$$\frac{\partial \pi_i}{\partial z_j} = \pi_i(\delta_{ij} - \pi_j).$$

In matrix form: $J_{\operatorname{softmax}} = \operatorname{diag}(\boldsymbol{\pi}) - \boldsymbol{\pi}\boldsymbol{\pi}^T \in \mathbb{R}^{n \times n}$.

This matrix has rank $n-1$ when all $\pi_i > 0$: since $\boldsymbol{\pi}^T \mathbf{1} = 1$, we get $J_{\operatorname{softmax}} \mathbf{1} = \boldsymbol{\pi} - \boldsymbol{\pi} = \mathbf{0}$ — the gradient is zero in the all-ones direction, reflecting the normalization constraint.

Cross-entropy gradient. For loss $L = -\log \pi_y$ (true class $y$):

$$\frac{\partial L}{\partial \mathbf{z}} = \boldsymbol{\pi} - \mathbf{e}_y,$$

where $\mathbf{e}_y$ is the one-hot vector. The gradient is simply prediction minus truth — a clean formula that emerges from the chain rule through the softmax Jacobian.
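Both formulas are easy to verify against finite differences. A minimal sketch (the logit values are arbitrary):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())            # shift for numerical stability
    return e / e.sum()

z = np.array([1.0, -0.5, 2.0, 0.3])
p = softmax(z)

# Analytic Jacobian: diag(π) - π π^T.
J = np.diag(p) - np.outer(p, p)

# Finite-difference Jacobian, column by column.
t = 1e-6
J_fd = np.column_stack([
    (softmax(z + t*e) - softmax(z - t*e)) / (2*t)
    for e in np.eye(4)
])

# Normalization constraint: J @ 1 = 0, so rank is at most n-1.
null_check = J @ np.ones(4)

# Cross-entropy gradient through the softmax: π - e_y.
y = 2
grad = p - np.eye(4)[y]
print(grad)
```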

Example 2: Newton's Method and the Hessian

For $f(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T A\mathbf{x} - \mathbf{b}^T\mathbf{x}$ (quadratic, $A \succ 0$):

  • $\nabla f = A\mathbf{x} - \mathbf{b}$, $H = A$ (constant Hessian).
  • Newton step from $\mathbf{x}_k$: $\mathbf{x}_{k+1} = \mathbf{x}_k - A^{-1}(A\mathbf{x}_k - \mathbf{b}) = A^{-1}\mathbf{b}$ — exact solution in one step.

For general smooth $f$, Newton's method converges quadratically near the optimum: if $\|\mathbf{x}_k - \mathbf{x}^*\| \leq \varepsilon$, then $\|\mathbf{x}_{k+1} - \mathbf{x}^*\| = O(\varepsilon^2)$. The per-step cost is $O(n^3)$ to factor the Hessian — prohibitive for large $n$, motivating quasi-Newton methods (L-BFGS) and diagonal approximations (Adagrad, Adam).
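The one-step claim for the quadratic case can be sketched directly (the random SPD matrix is an arbitrary test instance):

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(5, 5))
A = M @ M.T + np.eye(5)                # A ≻ 0, so f is strictly convex
b = rng.normal(size=5)

x = rng.normal(size=5)                 # arbitrary starting point
grad = A @ x - b                       # ∇f = A x - b; Hessian H = A
x_new = x - np.linalg.solve(A, grad)   # Newton step: x - H^{-1} ∇f

# One step lands exactly on the minimizer A^{-1} b.
x_star = np.linalg.solve(A, b)
print(np.linalg.norm(x_new - x_star))
```

Note the solve against $A$ rather than an explicit inverse; that is the factorization whose $O(n^3)$ cost the text refers to.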

Example 3: The Jacobian in Normalizing Flows

RealNVP coupling layer: given input $\mathbf{x} = (\mathbf{x}_1, \mathbf{x}_2)$:

$$\mathbf{y}_1 = \mathbf{x}_1, \qquad \mathbf{y}_2 = \mathbf{x}_2 \odot \exp(s(\mathbf{x}_1)) + t(\mathbf{x}_1).$$

The Jacobian is lower-triangular (because $\mathbf{y}_1$ doesn't depend on $\mathbf{x}_2$), so:

$$\det J = \prod_i \exp(s(\mathbf{x}_1)_i) = \exp\!\left(\sum_i s(\mathbf{x}_1)_i\right).$$

The log-determinant is just $\sum_i s_i$ — computable in $O(d)$ instead of $O(d^3)$. This architectural choice (triangular Jacobian) makes density evaluation tractable and is the core insight behind the affine coupling layer.
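A sketch of the coupling layer with toy $s$ and $t$ networks (fixed affine maps standing in for learned networks — my simplification, not RealNVP's actual architecture), checking the cheap log-det against a brute-force Jacobian determinant:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                  # each half of the 2d-dim input
Ws, Wt = rng.normal(size=(d, d)), rng.normal(size=(d, d))
s = lambda x1: np.tanh(Ws @ x1)        # bounded log-scales (toy stand-in)
t = lambda x1: Wt @ x1                 # translation (toy stand-in)

def coupling(x):
    x1, x2 = x[:d], x[d:]
    y2 = x2 * np.exp(s(x1)) + t(x1)    # y1 = x1 passes through unchanged
    return np.concatenate([x1, y2])

x = rng.normal(size=2*d)
log_det = np.sum(s(x[:d]))             # log|det J| = Σ_i s(x1)_i, O(d) work

# Brute force for comparison: full finite-difference Jacobian, O(d^3) det.
eps = 1e-6
J = np.column_stack([
    (coupling(x + eps*e) - coupling(x - eps*e)) / (2*eps)
    for e in np.eye(2*d)
])
print(np.log(abs(np.linalg.det(J))), log_det)
```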

Connections

Where Your Intuition Breaks

The most dangerous assumption: that the gradient direction is always the right direction for optimization. The gradient gives the direction of steepest ascent — but "steepest" is measured in the Euclidean metric on parameter space. For neural networks with very different scales across layers (early layers receive small gradients through the chain rule, later layers large ones), the Euclidean gradient direction can be far from the steepest-descent direction in a more natural geometry. This is precisely why Adam and RMSProp rescale gradients per coordinate: they implicitly approximate a better metric on parameter space. Information geometry formalizes this — the natural gradient (the gradient preconditioned by the inverse Fisher information matrix) is the steepest-descent direction in the geometry of the model's output distributions rather than of its raw parameters.

Differentiability vs Partial Derivatives

| Property | Condition | Implication |
| --- | --- | --- |
| All partials exist | $\partial f_i/\partial x_j$ exist at $\mathbf{x}_0$ | Does NOT imply continuity or differentiability |
| All partials continuous | $\partial f_i/\partial x_j$ continuous near $\mathbf{x}_0$ | Implies differentiability (hence continuity) |
| Differentiable | Total derivative exists at $\mathbf{x}_0$ | Implies continuity and existence of all directional derivatives |
| $C^1$ (continuously diff.) | Partials exist and are continuous | Strongest common assumption; holds for smooth neural network layers |
💡 Intuition

The Jacobian as zooming in. If you zoom in on the graph of a differentiable function at a point, the nonlinear map looks increasingly like its Jacobian (linear). This is what the diagram illustrates: the green dashed grid (linear) and the purple grid (nonlinear) become indistinguishable near the query point. Differentiability is literally "locally linear." In optimization, we exploit this: gradient descent treats the loss as locally linear, which is valid when steps are small relative to the curvature radius $1/L$.

⚠️ Warning

Non-differentiability in deep learning. ReLU is not differentiable at 0. In practice, every implementation simply picks a subgradient (usually 0 at the kink). The chain rule still applies via subgradient calculus, and in practice the set of inputs landing exactly at a ReLU kink has measure zero. Nevertheless, for theoretical guarantees (convergence proofs, gradient flow analysis), one typically works with smooth approximations (SiLU/Swish, GELU) or uses tools from non-smooth analysis.
