
Differentiation in Rⁿ: Jacobians, Hessians & the Chain Rule

Differentiation in $\mathbb{R}^n$ is not just about partial derivatives — it is about finding the best linear approximation to a smooth map at each point. The Jacobian captures this approximation in matrix form, the Hessian captures second-order curvature, and the chain rule tells how these approximations compose. Together they are the machinery that makes backpropagation, implicit differentiation, and second-order optimization tractable.

Concepts

Interactive figure: Jacobian approximation of a polar-like map, with a movable point in the domain (left) and its image under $f$ (right). Purple = nonlinear image; green dashed = Jacobian linearization at the query point. Near any point the nonlinear wrap linearizes to a rotation plus scaling, and as you zoom in the two grids become indistinguishable — this is what differentiability means.

When you zoom in on any smooth curve in calculus, it starts to look like a straight line — that is the core idea of the derivative. The same thing happens in $\mathbb{R}^n$: zoom in on a smooth map and it looks like a linear map, represented by a matrix. That matrix is the Jacobian, and it is the fundamental object that makes backpropagation work. Every time PyTorch propagates gradients through a layer, it is multiplying Jacobians in the reverse direction — the chain rule in matrix form.

The Total Derivative

The fundamental concept generalizing "derivative" to maps $f : \mathbb{R}^n \to \mathbb{R}^m$ is the total derivative (or Fréchet derivative).

Definition. $f$ is differentiable at $\mathbf{x}_0$ if there exists a linear map $Df(\mathbf{x}_0) : \mathbb{R}^n \to \mathbb{R}^m$ such that

$$\lim_{\mathbf{h} \to \mathbf{0}} \frac{\|f(\mathbf{x}_0 + \mathbf{h}) - f(\mathbf{x}_0) - Df(\mathbf{x}_0)\mathbf{h}\|}{\|\mathbf{h}\|} = 0.$$

In words: $f(\mathbf{x}_0 + \mathbf{h}) \approx f(\mathbf{x}_0) + Df(\mathbf{x}_0)\mathbf{h}$ to first order in $\mathbf{h}$. The matrix representing $Df(\mathbf{x}_0)$ in the standard basis is the Jacobian:

$$J_f(\mathbf{x}_0) = Df(\mathbf{x}_0) \in \mathbb{R}^{m \times n}, \qquad (J_f)_{ij} = \frac{\partial f_i}{\partial x_j}\bigg|_{\mathbf{x}_0}.$$

The limit condition says: the error in the linear approximation must vanish faster than $\|\mathbf{h}\|$ as $\mathbf{h} \to \mathbf{0}$. This is stronger than merely requiring partial derivatives to exist — the approximation must work simultaneously in all directions, not just along coordinate axes. That is why partial derivatives alone are not enough: a function can have every directional derivative at a point yet still fail to be locally linear there.

Existence. If all partial derivatives $\partial f_i / \partial x_j$ exist and are continuous near $\mathbf{x}_0$, then $f$ is differentiable at $\mathbf{x}_0$ (the converse is false: a differentiable function need not have continuous partials). Mere existence of the partials, without continuity, does not guarantee differentiability.

Non-example. $f(x,y) = xy/(x^2+y^2)$ for $(x,y) \neq (0,0)$, $f(0,0)=0$. Both partial derivatives at the origin are $0$, but the function is not continuous at the origin (approaching along $y=x$ gives $1/2$). No linear map can approximate it.
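The defining limit can be checked numerically. A minimal sketch (the polar-style map and the base point are illustrative choices, not from the text): compute the analytic Jacobian and watch the relative error of the linear approximation shrink as $\|\mathbf{h}\| \to 0$.

```python
import numpy as np

def f(p):
    """A smooth R^2 -> R^2 map (illustrative choice)."""
    x, y = p
    return np.array([x * np.cos(y), x * np.sin(y)])

def jacobian(p):
    """Analytic Jacobian of f at p."""
    x, y = p
    return np.array([[np.cos(y), -x * np.sin(y)],
                     [np.sin(y),  x * np.cos(y)]])

x0 = np.array([1.2, 0.7])
J = jacobian(x0)

# The defining limit: ||f(x0+h) - f(x0) - J h|| / ||h|| -> 0 as h -> 0.
rng = np.random.default_rng(0)
direction = rng.normal(size=2)
direction /= np.linalg.norm(direction)

errs = []
for t in [1e-1, 1e-2, 1e-3]:
    h = t * direction
    err = np.linalg.norm(f(x0 + h) - f(x0) - J @ h) / np.linalg.norm(h)
    errs.append(err)
    print(f"||h|| = {t:.0e}  relative error = {err:.2e}")
```

The error decays roughly linearly in $\|\mathbf{h}\|$, because the leftover term is second order.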

Directional Derivatives

The directional derivative of $f : \mathbb{R}^n \to \mathbb{R}$ at $\mathbf{x}$ in the direction of a unit vector $\mathbf{v}$:

$$D_{\mathbf{v}} f(\mathbf{x}) = \lim_{t \to 0} \frac{f(\mathbf{x} + t\mathbf{v}) - f(\mathbf{x})}{t} = \nabla f(\mathbf{x}) \cdot \mathbf{v}.$$

Steepest ascent direction: $\arg\max_{\|\mathbf{v}\|=1} D_\mathbf{v} f = \nabla f / \|\nabla f\|$. The gradient direction maximizes the rate of increase — this is why gradient descent moves in the negative gradient direction.

Cauchy–Schwarz bound: $|D_\mathbf{v} f| \leq \|\nabla f\| \cdot \|\mathbf{v}\| = \|\nabla f\|$, with equality when $\mathbf{v} \parallel \nabla f$.
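A quick numerical check of $D_\mathbf{v} f = \nabla f \cdot \mathbf{v}$, using a hypothetical quadratic $f(x_1, x_2) = x_1^2 + 3x_1x_2$ (not from the text):

```python
import numpy as np

def f(x):
    return x[0]**2 + 3*x[0]*x[1]

def grad_f(x):
    # Hand-computed gradient of f.
    return np.array([2*x[0] + 3*x[1], 3*x[0]])

x = np.array([1.0, -0.5])
v = np.array([3.0, 4.0]) / 5.0          # unit vector

t = 1e-6
numeric = (f(x + t*v) - f(x - t*v)) / (2*t)   # central difference
analytic = grad_f(x) @ v
print(numeric, analytic)

# Cauchy-Schwarz: the maximal directional derivative is ||grad f||,
# attained in the direction grad f / ||grad f||.
g = grad_f(x)
print(np.linalg.norm(g))
```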

The Chain Rule in Multiple Dimensions

For composable smooth maps $g : \mathbb{R}^k \to \mathbb{R}^m$ and $f : \mathbb{R}^n \to \mathbb{R}^k$:

$$D(g \circ f)(\mathbf{x}) = Dg(f(\mathbf{x})) \circ Df(\mathbf{x}),$$

or in matrix form (numerator layout):

$$J_{g \circ f}(\mathbf{x}) = J_g(f(\mathbf{x})) \cdot J_f(\mathbf{x}) \in \mathbb{R}^{m \times n}.$$

This is matrix multiplication of Jacobians. Backpropagation is exactly this, applied to a deep composition $f = f_L \circ f_{L-1} \circ \cdots \circ f_1$:

$$J_{f}(\mathbf{x}) = J_{f_L} \cdot J_{f_{L-1}} \cdots J_{f_1}.$$

Reverse-mode AD evaluates $\mathbf{u}^T J_f = ((\mathbf{u}^T J_{f_L}) J_{f_{L-1}}) \cdots J_{f_1}$, propagating a row vector backward through the layers using only vector–matrix products — never forming a full Jacobian matrix.
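A minimal sketch of this factored product, with linear layers so each Jacobian is just a matrix (the shapes are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
J1 = rng.normal(size=(4, 3))   # f1: R^3 -> R^4
J2 = rng.normal(size=(5, 4))   # f2: R^4 -> R^5
J3 = rng.normal(size=(2, 5))   # f3: R^5 -> R^2

# Full Jacobian of f3 ∘ f2 ∘ f1 (2x3): product of the factors in reverse order.
J = J3 @ J2 @ J1

# Reverse-mode: seed a row vector u^T at the output and push it backward.
# Only three vector-matrix products; the 2x3 Jacobian is never materialized.
u = rng.normal(size=2)
vjp = ((u @ J3) @ J2) @ J1
print(vjp)
```

The same associativity argument is why reverse mode is cheap when the output dimension is small (one loss scalar) and forward mode is cheap when the input dimension is small.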

Higher-Order Derivatives: the Hessian

For $f : \mathbb{R}^n \to \mathbb{R}$, the Hessian is $H = \nabla^2 f \in \mathbb{R}^{n \times n}$ with $H_{ij} = \partial^2 f / \partial x_i \partial x_j$.

Second-order Taylor expansion around $\mathbf{x}_0$:

$$f(\mathbf{x}_0 + \mathbf{h}) = f(\mathbf{x}_0) + \nabla f(\mathbf{x}_0)^T\mathbf{h} + \frac{1}{2}\mathbf{h}^T H(\mathbf{x}_0) \mathbf{h} + O(\|\mathbf{h}\|^3).$$

Symmetry. By the Clairaut–Schwarz theorem, if all second partials are continuous then $H_{ij} = H_{ji}$ — the Hessian is symmetric.

Critical point classification:

  • $\nabla f(\mathbf{x}^*) = \mathbf{0}$ and $H(\mathbf{x}^*) \succ 0$: strict local minimum
  • $\nabla f(\mathbf{x}^*) = \mathbf{0}$ and $H(\mathbf{x}^*) \prec 0$: strict local maximum
  • $\nabla f(\mathbf{x}^*) = \mathbf{0}$ and $H(\mathbf{x}^*)$ indefinite: saddle point
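The classification above can be sketched as an eigenvalue test. The example function $f(x,y) = x^2 - y^2$ is a standard saddle (my choice of example, not from the text):

```python
import numpy as np

# Hessian of f(x, y) = x^2 - y^2 (constant, since f is quadratic).
# The critical point is the origin, where grad f = 0.
H = np.array([[2.0, 0.0],
              [0.0, -2.0]])

eigvals = np.linalg.eigvalsh(H)   # eigvalsh: H is symmetric
if np.all(eigvals > 0):
    kind = "strict local minimum"
elif np.all(eigvals < 0):
    kind = "strict local maximum"
elif np.any(eigvals > 0) and np.any(eigvals < 0):
    kind = "saddle point"
else:
    kind = "inconclusive (semidefinite)"
print(kind)   # saddle point
```

Note the fourth branch: when some eigenvalues are exactly zero the second-order test is inconclusive and higher-order terms decide.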

The Jacobian Determinant and Change of Variables

For $f : \mathbb{R}^n \to \mathbb{R}^n$ (same dimension domain and range), the Jacobian determinant $\det J_f(\mathbf{x})$ measures the factor by which $f$ scales $n$-dimensional volume near $\mathbf{x}$:

$$\text{volume}(f(S)) \approx |\det J_f(\mathbf{x})| \cdot \text{volume}(S)$$

for small regions $S$ containing $\mathbf{x}$.

Change of variables formula. For an integral and a smooth bijection $\mathbf{y} = f(\mathbf{x})$:

$$\int_{f(U)} g(\mathbf{y}) \, d\mathbf{y} = \int_U g(f(\mathbf{x})) \, |\det J_f(\mathbf{x})| \, d\mathbf{x}.$$

This is the multivariable substitution rule. The Jacobian determinant is the "stretching factor" that corrects for the change of coordinates.
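The classic instance is polar coordinates, $f(r,\theta) = (r\cos\theta, r\sin\theta)$ with $\det J_f = r$. A numerical sketch: integrate $g(x,y) = e^{-x^2-y^2}$ over the disk of radius $R$ by pulling back to $(r,\theta)$; the exact answer is $\pi(1 - e^{-R^2})$.

```python
import numpy as np

R = 2.0
n = 1000
dr = R / n
r_mid = (np.arange(n) + 0.5) * dr       # midpoint rule in r

# Pullback integrand g(f(r, θ)) * |det J_f| = exp(-r^2) * r.
# It is independent of θ here, so the θ-integral contributes a factor 2π.
numeric = 2*np.pi * np.sum(np.exp(-r_mid**2) * r_mid) * dr
exact = np.pi * (1 - np.exp(-R**2))
print(numeric, exact)
```

Without the $|\det J_f| = r$ factor the quadrature converges to the wrong value; the determinant is exactly what makes the substitution area-preserving.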

Normalizing flows in generative modeling use this formula explicitly: they learn a probability density $p(\mathbf{x})$ by pushing a simple base density $p_z(\mathbf{z})$ through an invertible map $\mathbf{x} = f(\mathbf{z})$:

$$\log p(\mathbf{x}) = \log p_z(f^{-1}(\mathbf{x})) - \log |\det J_f(f^{-1}(\mathbf{x}))|.$$

The log-determinant of the Jacobian is the key term; architectures like RealNVP and Glow are designed so that $\det J_f$ is cheap to compute.

Inverse and Implicit Function Theorems

Inverse Function Theorem. If $f : \mathbb{R}^n \to \mathbb{R}^n$ is $C^1$ near $\mathbf{x}_0$ and $\det J_f(\mathbf{x}_0) \neq 0$, then $f$ is locally invertible near $\mathbf{x}_0$, and the Jacobian of the local inverse is $J_{f^{-1}}(f(\mathbf{x}_0)) = J_f(\mathbf{x}_0)^{-1}$.

Implicit Function Theorem. If $F(\mathbf{x}, \mathbf{y}) = \mathbf{0}$ defines $\mathbf{y}$ implicitly as a function of $\mathbf{x}$ near $(\mathbf{x}_0, \mathbf{y}_0)$, and $\det \partial F / \partial \mathbf{y} \neq 0$ at $(\mathbf{x}_0, \mathbf{y}_0)$, then locally $\mathbf{y} = g(\mathbf{x})$ for a smooth $g$, with:

$$\frac{\partial g}{\partial \mathbf{x}} = -\left(\frac{\partial F}{\partial \mathbf{y}}\right)^{-1} \frac{\partial F}{\partial \mathbf{x}}.$$

Implicit differentiation through optimization. If $\hat{\mathbf{y}}(\mathbf{x}) = \arg\min_\mathbf{y} L(\mathbf{x}, \mathbf{y})$ and the optimality condition $\nabla_\mathbf{y} L = \mathbf{0}$ defines $\hat{\mathbf{y}}$ implicitly, then applying the theorem with $F = \nabla_\mathbf{y} L$:

$$\frac{d\hat{\mathbf{y}}}{d\mathbf{x}} = -\left(\nabla^2_{\mathbf{y}\mathbf{y}} L\right)^{-1} \nabla^2_{\mathbf{y}\mathbf{x}} L,$$

where $\nabla^2_{\mathbf{y}\mathbf{x}} L$ denotes the Jacobian of $\nabla_\mathbf{y} L$ with respect to $\mathbf{x}$.

This is implicit differentiation through a solver — the foundation of bilevel optimization, hyperparameter optimization via implicit gradients, and meta-learning.
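A checkable sketch with a quadratic inner problem, where everything has a closed form (the setup $L(\mathbf{x},\mathbf{y}) = \frac{1}{2}\mathbf{y}^T A \mathbf{y} + \mathbf{x}^T B \mathbf{y}$ is a hypothetical example, not from the text):

```python
import numpy as np

# L(x, y) = 1/2 y^T A y + x^T B y,  A symmetric positive definite.
# Optimality: grad_y L = A y + B^T x = 0  =>  y_hat(x) = -A^{-1} B^T x,
# so dy_hat/dx = -A^{-1} B^T = -(H_yy)^{-1} H_yx with H_yy = A, H_yx = B^T.
rng = np.random.default_rng(0)
M = rng.normal(size=(3, 3))
A = M @ M.T + 3*np.eye(3)          # SPD Hessian of the inner problem
B = rng.normal(size=(2, 3))        # cross term; H_yx = B^T has shape (3, 2)

def y_hat(x):
    """Inner argmin, solved exactly via the optimality condition."""
    return np.linalg.solve(A, -B.T @ x)

# Implicit-gradient formula.
dy_dx = -np.linalg.solve(A, B.T)   # shape (3, 2)

# Finite-difference check of the first column.
x = rng.normal(size=2)
t = 1e-6
e0 = np.array([1.0, 0.0])
fd_col0 = (y_hat(x + t*e0) - y_hat(x - t*e0)) / (2*t)
print(fd_col0, dy_dx[:, 0])
```

In real bilevel problems the inner argmin comes from an iterative solver, but the same formula applies: one linear solve against the inner Hessian replaces differentiating through the solver's iterations.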

Worked Example

Example 1: Jacobian of Softmax

For $\boldsymbol{\pi} = \operatorname{softmax}(\mathbf{z})$ with $\pi_i = e^{z_i} / \sum_j e^{z_j}$:

$$\frac{\partial \pi_i}{\partial z_j} = \pi_i(\delta_{ij} - \pi_j).$$

In matrix form: $J_{\operatorname{softmax}} = \operatorname{diag}(\boldsymbol{\pi}) - \boldsymbol{\pi}\boldsymbol{\pi}^T \in \mathbb{R}^{n \times n}$.

This matrix has rank $n-1$ when all $\pi_i > 0$: since $\boldsymbol{\pi}^T \mathbf{1} = 1$, we get $J_{\operatorname{softmax}} \mathbf{1} = \boldsymbol{\pi} - \boldsymbol{\pi} = \mathbf{0}$ — the gradient is zero in the all-ones direction, reflecting the normalization constraint.

Cross-entropy gradient. For loss $L = -\log \pi_y$ (true class $y$):

$$\frac{\partial L}{\partial \mathbf{z}} = \boldsymbol{\pi} - \mathbf{e}_y,$$

where $\mathbf{e}_y$ is the one-hot vector. The gradient is simply prediction minus truth — a clean formula that emerges from the chain rule through the softmax Jacobian.
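Both formulas are easy to verify against finite differences. A minimal sketch (the logit values are arbitrary):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())            # shift for numerical stability
    return e / e.sum()

z = np.array([1.0, -0.5, 2.0, 0.3])
p = softmax(z)

# Analytic Jacobian: diag(π) - π π^T.
J = np.diag(p) - np.outer(p, p)

# Finite-difference Jacobian, column by column.
t = 1e-6
J_fd = np.column_stack([
    (softmax(z + t*e) - softmax(z - t*e)) / (2*t)
    for e in np.eye(4)
])

# Normalization constraint: J @ 1 = 0, so rank is at most n-1.
null_check = J @ np.ones(4)

# Cross-entropy gradient through the softmax: π - e_y.
y = 2
grad = p - np.eye(4)[y]
print(grad)
```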

Example 2: Newton's Method and the Hessian

For $f(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T A\mathbf{x} - \mathbf{b}^T\mathbf{x}$ (quadratic, $A \succ 0$):

  • $\nabla f = A\mathbf{x} - \mathbf{b}$, $H = A$ (constant Hessian).
  • Newton step from $\mathbf{x}_k$: $\mathbf{x}_{k+1} = \mathbf{x}_k - A^{-1}(A\mathbf{x}_k - \mathbf{b}) = A^{-1}\mathbf{b}$ — exact solution in one step.

For general smooth $f$, Newton's method converges quadratically near the optimum: if $\|\mathbf{x}_k - \mathbf{x}^*\| \leq \varepsilon$, then $\|\mathbf{x}_{k+1} - \mathbf{x}^*\| = O(\varepsilon^2)$. The per-step cost is $O(n^3)$ to factor the Hessian — prohibitive for large $n$, motivating quasi-Newton methods (L-BFGS) and diagonal approximations (Adagrad, Adam).
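The one-step claim for the quadratic case can be sketched directly (the random SPD matrix is an arbitrary test instance):

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(5, 5))
A = M @ M.T + np.eye(5)                # A ≻ 0, so f is strictly convex
b = rng.normal(size=5)

x = rng.normal(size=5)                 # arbitrary starting point
grad = A @ x - b                       # ∇f = A x - b; Hessian H = A
x_new = x - np.linalg.solve(A, grad)   # Newton step: x - H^{-1} ∇f

# One step lands exactly on the minimizer A^{-1} b.
x_star = np.linalg.solve(A, b)
print(np.linalg.norm(x_new - x_star))
```

Note the solve against $A$ rather than an explicit inverse; that is the factorization whose $O(n^3)$ cost the text refers to.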

Example 3: The Jacobian in Normalizing Flows

RealNVP coupling layer: given input $\mathbf{x} = (\mathbf{x}_1, \mathbf{x}_2)$:

$$\mathbf{y}_1 = \mathbf{x}_1, \qquad \mathbf{y}_2 = \mathbf{x}_2 \odot \exp(s(\mathbf{x}_1)) + t(\mathbf{x}_1).$$

The Jacobian is lower-triangular (because $\mathbf{y}_1$ doesn't depend on $\mathbf{x}_2$), so:

$$\det J = \prod_i \exp(s(\mathbf{x}_1)_i) = \exp\!\left(\sum_i s(\mathbf{x}_1)_i\right).$$

The log-determinant is just $\sum_i s_i$ — computable in $O(d)$ instead of $O(d^3)$. This architectural choice (triangular Jacobian) makes density evaluation tractable and is the core insight behind the affine coupling layer.
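A sketch of the coupling layer with toy $s$ and $t$ networks (fixed affine maps standing in for learned networks — my simplification, not RealNVP's actual architecture), checking the cheap log-det against a brute-force Jacobian determinant:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                  # each half of the 2d-dim input
Ws, Wt = rng.normal(size=(d, d)), rng.normal(size=(d, d))
s = lambda x1: np.tanh(Ws @ x1)        # bounded log-scales (toy stand-in)
t = lambda x1: Wt @ x1                 # translation (toy stand-in)

def coupling(x):
    x1, x2 = x[:d], x[d:]
    y2 = x2 * np.exp(s(x1)) + t(x1)    # y1 = x1 passes through unchanged
    return np.concatenate([x1, y2])

x = rng.normal(size=2*d)
log_det = np.sum(s(x[:d]))             # log|det J| = Σ_i s(x1)_i, O(d) work

# Brute force for comparison: full finite-difference Jacobian, O(d^3) det.
eps = 1e-6
J = np.column_stack([
    (coupling(x + eps*e) - coupling(x - eps*e)) / (2*eps)
    for e in np.eye(2*d)
])
print(np.log(abs(np.linalg.det(J))), log_det)
```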

Connections

Where Your Intuition Breaks

The most dangerous assumption: that the gradient direction is always the right direction for optimization. The gradient gives the direction of steepest ascent — but "steepest" is measured in the Euclidean metric on parameter space. For neural networks with very different scales across layers (early layers receive small gradients through the chain rule, later layers large ones), the Euclidean gradient direction can be far from the steepest-descent direction in a more natural geometry. This is precisely why Adam and RMSProp rescale gradients per coordinate: they implicitly approximate a better metric on parameter space. Information geometry formalizes this — the natural gradient (the gradient preconditioned by the inverse Fisher information matrix) is the steepest-descent direction in the geometry of the model's output distributions rather than of its raw parameters.

Differentiability vs Partial Derivatives

| Property | Condition | Implication |
| --- | --- | --- |
| All partials exist | $\partial f_i/\partial x_j$ exist at $\mathbf{x}_0$ | Does NOT imply continuity or differentiability |
| All partials continuous | $\partial f_i/\partial x_j$ continuous near $\mathbf{x}_0$ | Implies differentiability (hence continuity) |
| Differentiable | Total derivative exists at $\mathbf{x}_0$ | Implies continuity and existence of all directional derivatives |
| $C^1$ (continuously diff.) | Partials exist and are continuous | Strongest common assumption; holds for smooth neural network layers |
💡 Intuition

The Jacobian as zooming in. If you zoom in on the graph of a differentiable function at a point, the nonlinear map looks increasingly like its Jacobian (linear). This is what the diagram illustrates: the green dashed grid (linear) and the purple grid (nonlinear) become indistinguishable near the query point. Differentiability is literally "locally linear." In optimization, we exploit this: gradient descent treats the loss as locally linear, which is valid when steps are small relative to the curvature radius $1/L$.

⚠️ Warning

Non-differentiability in deep learning. ReLU is not differentiable at 0. In practice, every implementation simply picks a subgradient (usually 0 at the kink). The chain rule still applies via subgradient calculus, and in practice the set of inputs landing exactly at a ReLU kink has measure zero. Nevertheless, for theoretical guarantees (convergence proofs, gradient flow analysis), one typically works with smooth approximations (SiLU/Swish, GELU) or uses tools from non-smooth analysis.
