Differentiation in Rⁿ: Jacobians, Hessians & the Chain Rule
Differentiation in $\mathbb{R}^n$ is not just about partial derivatives — it is about finding the best linear approximation to a smooth map at each point. The Jacobian captures this approximation in matrix form, the Hessian captures second-order curvature, and the chain rule tells how these approximations compose. Together they are the machinery that makes backpropagation, implicit differentiation, and second-order optimization tractable.
Concepts
[Interactive figure: Jacobian Approximation — drag the point in the domain (left). A polar-like map; near any point the nonlinear warp linearizes to a rotation plus scaling. Purple = nonlinear image, green dashed = Jacobian linearization at the yellow point. As you zoom in, they become indistinguishable — this is what differentiability means.]
When you zoom in on any smooth curve in calculus, it starts to look like a straight line — that is the core idea of the derivative. The same thing happens in $\mathbb{R}^n$: zoom in on a smooth map and it looks like a linear map, represented by a matrix. That matrix is the Jacobian, and it is the fundamental object that makes backpropagation work. Every time PyTorch propagates gradients through a layer, it is multiplying Jacobians in the reverse direction — the chain rule in matrix form.
The Total Derivative
The fundamental concept generalizing "derivative" to maps $f: \mathbb{R}^n \to \mathbb{R}^m$ is the total derivative (or Fréchet derivative).
Definition. $f: \mathbb{R}^n \to \mathbb{R}^m$ is differentiable at $x_0$ if there exists a linear map $Df(x_0): \mathbb{R}^n \to \mathbb{R}^m$ such that

$$\lim_{h \to 0} \frac{\| f(x_0 + h) - f(x_0) - Df(x_0)\,h \|}{\| h \|} = 0.$$

In words: $f(x_0 + h) \approx f(x_0) + Df(x_0)\,h$ to first order in $\|h\|$. The matrix representing $Df(x_0)$ in the standard basis is the Jacobian:

$$J_f(x_0) = \begin{bmatrix} \dfrac{\partial f_1}{\partial x_1} & \cdots & \dfrac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial f_m}{\partial x_1} & \cdots & \dfrac{\partial f_m}{\partial x_n} \end{bmatrix} \in \mathbb{R}^{m \times n}.$$
The limit condition says the error in the linear approximation must vanish faster than $\|h\|$ as $h \to 0$. This is stronger than merely requiring partial derivatives to exist: the approximation must work simultaneously in all directions, not just along the coordinate axes. That is why partial derivatives alone are not enough — a function can have all of its directional derivatives yet still fail to be locally linear when you approach along a non-coordinate path.
Existence. If all partial derivatives exist and are continuous near $x_0$, then $f$ is differentiable at $x_0$ (the converse is false). Mere existence of all partials, without continuity, does not imply differentiability.
Non-example. $f(x, y) = \dfrac{xy}{x^2 + y^2}$ for $(x, y) \neq (0, 0)$, with $f(0, 0) = 0$. Both partial derivatives at the origin are 0, but the function is not continuous at the origin (approach along $y = x$ gives $f = \tfrac{1}{2}$). No linear map can approximate it.
Directional Derivatives
The directional derivative of $f$ at $x_0$ in direction $v$ (unit vector):

$$D_v f(x_0) = \lim_{t \to 0} \frac{f(x_0 + t v) - f(x_0)}{t} = \nabla f(x_0) \cdot v \quad \text{(when $f$ is differentiable)}.$$
Steepest ascent direction: $v^* = \nabla f(x_0) / \|\nabla f(x_0)\|$. The gradient direction maximizes the rate of increase — this is why gradient descent moves in the negative gradient direction.
Cauchy–Schwarz bound: $|D_v f(x_0)| \le \|\nabla f(x_0)\|$, with equality when $v \parallel \nabla f(x_0)$.
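A quick numerical sanity check of these facts, using the toy function $f(x, y) = x^2 + 3y^2$ (chosen here purely for illustration): the finite-difference directional derivative matches $\nabla f \cdot v$, and no unit direction beats the gradient direction.

```python
import numpy as np

# Toy function f(x, y) = x^2 + 3y^2 with known gradient (2x, 6y).
def f(p):
    x, y = p
    return x**2 + 3 * y**2

def grad_f(p):
    x, y = p
    return np.array([2 * x, 6 * y])

x0 = np.array([1.0, 2.0])
g = grad_f(x0)

def directional_derivative(v, h=1e-6):
    v = v / np.linalg.norm(v)           # normalize to a unit direction
    return (f(x0 + h * v) - f(x0)) / h  # forward difference

# D_v f(x0) agrees with grad . v
v = np.array([3.0, -4.0]) / 5.0
assert abs(directional_derivative(v) - g @ v) < 1e-4

# Sweeping unit directions: none exceeds ||grad f|| (Cauchy-Schwarz),
# and the maximum is attained near the gradient direction.
best = max(directional_derivative(np.array([np.cos(t), np.sin(t)]))
           for t in np.linspace(0, 2 * np.pi, 360))
assert best <= np.linalg.norm(g) + 1e-4
assert abs(best - np.linalg.norm(g)) < 1e-2
```

The same check works for any smooth $f$; only the analytic gradient needs to be swapped out.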
The Chain Rule in Multiple Dimensions
For composable smooth maps $g: \mathbb{R}^n \to \mathbb{R}^m$ and $f: \mathbb{R}^m \to \mathbb{R}^p$:

$$D(f \circ g)(x) = Df(g(x)) \circ Dg(x),$$

or in matrix form (numerator layout):

$$J_{f \circ g}(x) = J_f(g(x)) \, J_g(x) \in \mathbb{R}^{p \times n}.$$
This is matrix multiplication of Jacobians. Backpropagation is exactly this, applied to a deep composition $f = f_L \circ \cdots \circ f_2 \circ f_1$:

$$J_f = J_{f_L} \, J_{f_{L-1}} \cdots J_{f_1}.$$

Reverse-mode AD computes $\nabla_x L = J_{f_1}^\top J_{f_2}^\top \cdots J_{f_L}^\top v$ (with $v$ the gradient at the output) from right to left, using only matrix–vector products — never forming the full Jacobian matrix.
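A minimal sketch of this, assuming a chain of purely linear layers so each layer's Jacobian is just a fixed matrix: folding the output cotangent back through the chain with vector–Jacobian products gives the same gradient as forming the full end-to-end Jacobian, without ever building a matrix–matrix product.

```python
import numpy as np

# Toy chain of linear layers: layer i has Jacobian Js[i] of shape
# (dims[i+1], dims[i]); the final output is a scalar (dims[-1] == 1).
rng = np.random.default_rng(0)
dims = [50, 40, 30, 20, 1]
Js = [rng.standard_normal((dims[i + 1], dims[i]))
      for i in range(len(dims) - 1)]

v = np.ones(1)  # dL/d(output) = 1 for a scalar loss

# Reverse mode: one vector-Jacobian product per layer, right to left.
g = v
for J in reversed(Js):
    g = J.T @ g                     # cost O(rows * cols), vector-sized result

# Reference (expensive): materialize the full end-to-end Jacobian.
J_full = Js[0]
for J in Js[1:]:
    J_full = J @ J_full             # matrix-matrix products

assert np.allclose(g, J_full.ravel())
assert g.shape == (50,)             # gradient lives in the input space
```

With nonlinear layers the structure is identical; each `J.T @ g` becomes a VJP evaluated at the values saved on the forward pass.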
Higher-Order Derivatives: the Hessian
For $f: \mathbb{R}^n \to \mathbb{R}$, the Hessian is $H_f(x) \in \mathbb{R}^{n \times n}$ with $(H_f)_{ij} = \dfrac{\partial^2 f}{\partial x_i \, \partial x_j}$.

Second-order Taylor expansion around $x_0$:

$$f(x_0 + h) = f(x_0) + \nabla f(x_0)^\top h + \tfrac{1}{2} h^\top H_f(x_0) \, h + o(\|h\|^2).$$
Symmetry. By the Clairaut–Schwarz theorem: if all second partials are continuous, $\partial_i \partial_j f = \partial_j \partial_i f$ — the Hessian is symmetric.
Critical point classification:
- $\nabla f(x_0) = 0$ and $H_f(x_0) \succ 0$: strict local minimum
- $\nabla f(x_0) = 0$ and $H_f(x_0) \prec 0$: strict local maximum
- $\nabla f(x_0) = 0$ and $H_f(x_0)$ indefinite: saddle point
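The classification above can be sketched directly in code: since the Hessian is symmetric, its eigenvalues decide the case. The helper below is a toy illustration (the `tol` threshold is an assumption for handling numerically near-zero eigenvalues).

```python
import numpy as np

# Classify a critical point by the eigenvalues of its (symmetric) Hessian.
def classify(H, tol=1e-10):
    eig = np.linalg.eigvalsh(H)           # sorted eigenvalues of symmetric H
    if np.all(eig > tol):
        return "strict local minimum"     # H positive definite
    if np.all(eig < -tol):
        return "strict local maximum"     # H negative definite
    if eig.min() < -tol and eig.max() > tol:
        return "saddle point"             # H indefinite
    return "degenerate (test inconclusive)"  # some eigenvalue ~ 0

assert classify(np.diag([2.0, 5.0])) == "strict local minimum"
assert classify(np.diag([-1.0, -3.0])) == "strict local maximum"
assert classify(np.diag([2.0, -1.0])) == "saddle point"
assert classify(np.diag([2.0, 0.0])) == "degenerate (test inconclusive)"
```

The degenerate branch matters in practice: a zero eigenvalue means the second-order test says nothing, and higher-order terms decide.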
The Jacobian Determinant and Change of Variables
For $f: \mathbb{R}^n \to \mathbb{R}^n$ (same dimension domain and range), the Jacobian determinant measures the factor by which $f$ scales $n$-dimensional volume near $x$:

$$\operatorname{vol}(f(A)) \approx |\det J_f(x)| \cdot \operatorname{vol}(A)$$

for small regions $A$ containing $x$.
Change of variables formula. For an integral $\int_V f(y)\,dy$ and a smooth bijection $g: U \to V$:

$$\int_V f(y)\,dy = \int_U f(g(x)) \, |\det J_g(x)| \, dx.$$
This is the multivariable substitution rule. The Jacobian determinant is the "stretching factor" that corrects for the change of coordinates.
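A concrete instance: for the polar map $g(r, t) = (r\cos t, r\sin t)$, the Jacobian determinant is $r$, and integrating that stretching factor over the rectangle $[0, 1] \times [0, 2\pi]$ recovers the area of the unit disk, $\pi$. A minimal numeric check:

```python
import numpy as np

# Jacobian determinant of the polar map g(r, t) = (r cos t, r sin t).
def det_J_polar(r, t):
    J = np.array([[np.cos(t), -r * np.sin(t)],
                  [np.sin(t),  r * np.cos(t)]])
    return np.linalg.det(J)

# det J = r at an arbitrary point
assert abs(det_J_polar(0.7, 1.3) - 0.7) < 1e-12

# Area of the unit disk via change of variables:
# integral over [0,1]x[0,2pi] of |det J| = r, i.e. (1/2) * 2pi = pi.
rs = np.linspace(0.0, 1.0, 2001)
dr = rs[1] - rs[0]
integral_r = np.sum(0.5 * (rs[1:] + rs[:-1])) * dr   # trapezoid: int r dr = 1/2
area = integral_r * 2 * np.pi
assert abs(area - np.pi) < 1e-9
```

The trapezoid rule is exact here because the integrand $r$ is linear; for a general integrand it would only converge.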
Normalizing flows in generative modeling use this formula explicitly: to learn a probability density $p_X$ by mapping from a simple base density $p_Z$ via an invertible map $f$ (with $x = f(z)$):

$$\log p_X(x) = \log p_Z\big(f^{-1}(x)\big) - \log \big| \det J_f\big(f^{-1}(x)\big) \big|.$$

The log-determinant of the Jacobian is the key term; architectures like RealNVP and Glow are designed so that $\log |\det J_f|$ is cheap to compute.
Inverse and Implicit Function Theorems
Inverse Function Theorem. If $f: \mathbb{R}^n \to \mathbb{R}^n$ is $C^1$ at $x_0$ and $\det J_f(x_0) \neq 0$, then $f$ is locally invertible near $x_0$, and the Jacobian of the local inverse is $J_{f^{-1}}(f(x_0)) = J_f(x_0)^{-1}$.
Implicit Function Theorem. If $F(x, y) = 0$ defines $y$ implicitly as a function of $x$ near $(x_0, y_0)$, and $\partial F / \partial y$ is invertible at $(x_0, y_0)$, then locally $y = g(x)$ for a smooth $g$, with:

$$\frac{dg}{dx} = -\left( \frac{\partial F}{\partial y} \right)^{-1} \frac{\partial F}{\partial x}.$$
Implicit differentiation through optimization. If $y^*(x) = \arg\min_y f(x, y)$ and the optimality condition $\nabla_y f(x, y^*(x)) = 0$ defines $y^*$ implicitly, then:

$$\frac{d y^*}{d x} = -\left( \nabla^2_{yy} f \right)^{-1} \nabla^2_{yx} f.$$
This is implicit differentiation through a solver — the foundation of bilevel optimization, hyperparameter optimization via implicit gradients, and meta-learning.
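A toy sketch of the idea, with an inner problem chosen so everything is checkable by hand: $f(x, y) = (y - x^2)^2$, whose optimality condition $\nabla_y f = 2(y - x^2) = 0$ gives $y^*(x) = x^2$ and hence $dy^*/dx = 2x$. The inner solver below (plain gradient descent) is an assumption for illustration; the implicit gradient never looks inside it.

```python
# Inner problem: y*(x) = argmin_y (y - x^2)^2, solved by gradient descent.
def solve_inner(x, y=0.0, lr=0.25, steps=200):
    for _ in range(steps):
        y -= lr * 2 * (y - x**2)   # gradient step on grad_y f = 2(y - x^2)
    return y

x = 1.5
y_star = solve_inner(x)
assert abs(y_star - x**2) < 1e-8   # solver found y* = x^2

# Implicit-function-theorem gradient: dy*/dx = -(H_yy)^{-1} H_yx
H_yy = 2.0                          # d/dy of 2(y - x^2)
H_yx = -4.0 * x                     # d/dx of 2(y - x^2)
dy_dx_implicit = -H_yx / H_yy       # = 2x
assert abs(dy_dx_implicit - 2 * x) < 1e-12

# Cross-check: finite differences *through* the black-box solver.
eps = 1e-5
dy_dx_fd = (solve_inner(x + eps) - solve_inner(x - eps)) / (2 * eps)
assert abs(dy_dx_fd - dy_dx_implicit) < 1e-4
```

The point of the implicit route is that it costs one linear solve at the optimum, regardless of how many iterations the inner solver took.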
Worked Examples
Example 1: Jacobian of Softmax
For $s: \mathbb{R}^n \to \mathbb{R}^n$ with $s_i(z) = \dfrac{e^{z_i}}{\sum_j e^{z_j}}$:

$$\frac{\partial s_i}{\partial z_j} = s_i \, (\delta_{ij} - s_j).$$

In matrix form: $J_s = \operatorname{diag}(s) - s s^\top$.

This is a rank-$(n-1)$ matrix (since $\sum_i s_i = 1$ implies $J_s \mathbf{1} = 0$ — the gradient is zero in the direction of all-ones, reflecting the normalization constraint).
Cross-entropy gradient. For loss $L = -\log s_c(z)$ (true class $c$):

$$\frac{\partial L}{\partial z} = s - e_c,$$

where $e_c$ is the one-hot vector. The gradient is simply prediction minus truth — a clean formula that emerges from the chain rule through the softmax Jacobian.
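Both formulas are easy to verify numerically against central finite differences; a minimal check on an arbitrary logit vector:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # shift by max for numerical stability
    return e / e.sum()

z = np.array([0.5, -1.0, 2.0, 0.0])
n = len(z)
s = softmax(z)
J = np.diag(s) - np.outer(s, s)          # analytic Jacobian diag(s) - s s^T

# Central finite-difference Jacobian, column j = d softmax / d z_j
eps = 1e-6
J_fd = np.column_stack([
    (softmax(z + eps * np.eye(n)[j]) - softmax(z - eps * np.eye(n)[j])) / (2 * eps)
    for j in range(n)
])
assert np.allclose(J, J_fd, atol=1e-6)

# J annihilates the all-ones direction (normalization constraint)
assert np.allclose(J @ np.ones(n), 0.0, atol=1e-12)

# Cross-entropy gradient: dL/dz = s - e_c for L = -log s_c
c = 2
grad_fd = np.array([
    (-np.log(softmax(z + eps * np.eye(n)[j])[c])
     + np.log(softmax(z - eps * np.eye(n)[j])[c])) / (2 * eps)
    for j in range(n)
])
assert np.allclose(grad_fd, s - np.eye(n)[c], atol=1e-6)
```

This kind of gradient check is the standard way to debug a hand-derived backward pass before trusting it.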
Example 2: Newton's Method and the Hessian
For $f(x) = \tfrac{1}{2} x^\top A x - b^\top x$ (quadratic, $A \succ 0$):
- $\nabla f(x) = A x - b$, $H_f(x) = A$ (constant Hessian).
- Newton step from any $x_0$: $x_1 = x_0 - A^{-1}(A x_0 - b) = A^{-1} b$ — the exact solution in one step.
For general smooth $f$, Newton's method converges quadratically near the optimum: if $x_k$ is close enough to $x^*$, then $\|x_{k+1} - x^*\| \le C \, \|x_k - x^*\|^2$. The per-step cost is $O(n^3)$ to factor the Hessian — prohibitive for large $n$, motivating quasi-Newton methods (L-BFGS) and diagonal approximations (Adagrad, Adam).
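The one-step claim for the quadratic case is checkable in a few lines (a random symmetric positive-definite $A$ is assumed for the demo; note the step uses a linear solve rather than an explicit inverse):

```python
import numpy as np

# Quadratic f(x) = 1/2 x^T A x - b^T x with A symmetric positive definite.
rng = np.random.default_rng(1)
M = rng.standard_normal((5, 5))
A = M @ M.T + 5 * np.eye(5)            # SPD by construction
b = rng.standard_normal(5)

x0 = rng.standard_normal(5)            # arbitrary starting point
grad = A @ x0 - b                      # gradient at x0
x1 = x0 - np.linalg.solve(A, grad)     # Newton step: solve A d = grad

x_star = np.linalg.solve(A, b)         # true minimizer A^{-1} b
assert np.allclose(x1, x_star, atol=1e-10)   # exact in one step
```

For non-quadratic $f$ the same step is applied with the local Hessian, and the one-step exactness degrades to the quadratic convergence rate quoted above.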
Example 3: The Jacobian in Normalizing Flows
RealNVP coupling layer: given input $x = (x_1, x_2)$ (the coordinates split into two blocks):

$$y_1 = x_1, \qquad y_2 = x_2 \odot \exp\big(s(x_1)\big) + t(x_1).$$

The Jacobian is lower-triangular (because $y_1$ doesn't depend on $x_2$), so:

$$\det J = \prod_i \exp\big(s(x_1)\big)_i = \exp\Big( \sum_i s(x_1)_i \Big).$$

The log-determinant is just $\sum_i s(x_1)_i$ — computable in $O(d)$ instead of $O(d^3)$. This architectural choice (triangular Jacobian) makes density evaluation tractable and is the core insight behind the affine coupling layer.
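A sketch of a coupling layer with stand-in $s$ and $t$ networks (here just `tanh` and `sin`, an assumption for illustration), checking the cheap $O(d)$ log-determinant against a brute-force finite-difference Jacobian:

```python
import numpy as np

d = 6                                   # input dimension, split in half
def s_net(x1): return np.tanh(x1)       # stand-in for a learned scale network
def t_net(x1): return np.sin(x1)        # stand-in for a learned shift network

def coupling(x):
    x1, x2 = x[:d // 2], x[d // 2:]
    y2 = x2 * np.exp(s_net(x1)) + t_net(x1)
    return np.concatenate([x1, y2])     # y1 = x1 passes through unchanged

rng = np.random.default_rng(2)
x = rng.standard_normal(d)

log_det_cheap = np.sum(s_net(x[:d // 2]))   # O(d): just sum the scales

# Brute force: finite-difference the full d x d Jacobian, then take its det.
eps = 1e-6
J = np.column_stack([(coupling(x + eps * np.eye(d)[j])
                      - coupling(x - eps * np.eye(d)[j])) / (2 * eps)
                     for j in range(d)])
assert abs(np.log(abs(np.linalg.det(J))) - log_det_cheap) < 1e-6
```

Swapping in real neural networks for `s_net` and `t_net` changes nothing structurally: the Jacobian stays triangular and the log-determinant stays a sum.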
Where Your Intuition Breaks
The most dangerous assumption: that the gradient direction is always useful for optimization. The gradient gives the direction of steepest ascent — but "steepest" is measured in the Euclidean metric on parameter space. For neural networks with very different scales across layers (early layers have small gradients due to the chain rule, later layers have large ones), the Euclidean gradient direction is far from the direction of steepest descent in a more natural geometry. This is precisely why Adam and RMSProp rescale gradients per coordinate: they are implicitly approximating a better metric on parameter space. Information geometry formalizes this — the natural gradient (the gradient preconditioned by the inverse Fisher information matrix) is the steepest descent direction in the geometry of the model's output distributions rather than of the raw parameters.
Differentiability vs Partial Derivatives
| Property | Condition | Implication |
|---|---|---|
| All partials exist | $\partial f / \partial x_i$ exist at $x_0$ | Does NOT imply continuity or differentiability |
| All partials continuous | $\partial f / \partial x_i$ continuous near $x_0$ | Implies differentiability (hence continuity) |
| Differentiable | Total derivative $Df(x_0)$ exists | Implies continuity and existence of all directional derivatives |
| $C^1$ (continuously differentiable) | Partials exist and are continuous | Strongest common assumption; holds for standard neural network layers |
The Jacobian as zooming in. If you zoom in on the graph of a differentiable function at a point, the nonlinear map looks increasingly like its Jacobian (linear). This is what the diagram illustrates: the green dashed grid (linear) and purple grid (nonlinear) become indistinguishable near the query point. Differentiability is literally "locally linear." In optimization, we exploit this: gradient descent treats the loss as linear locally, which is valid when steps are small relative to the local curvature scale, roughly $1 / \lambda_{\max}(H_f)$.
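The zoom-in picture can be checked numerically: for a smooth map, the linearization error $\|f(x_0 + h) - f(x_0) - J h\|$ is $O(\|h\|^2)$, so the *relative* error (divided by $\|h\|$) shrinks roughly tenfold with each tenfold zoom. A sketch using a polar-like warp in the spirit of the figure (the specific map is an assumption for illustration):

```python
import numpy as np

# A smooth polar-like warp: rotate each point by an angle equal to its radius.
def f(p):
    x, y = p
    r, t = np.hypot(x, y), np.arctan2(y, x)
    return np.array([r * np.cos(t + r), r * np.sin(t + r)])

# Jacobian via central finite differences (accurate enough for this check).
def jacobian_fd(p, eps=1e-7):
    return np.column_stack([(f(p + eps * e) - f(p - eps * e)) / (2 * eps)
                            for e in np.eye(2)])

x0 = np.array([1.0, 0.5])
J = jacobian_fd(x0)
v = np.array([0.6, -0.8])           # fixed direction, shrinking step length

errs = []
for scale in [1e-1, 1e-2, 1e-3]:
    h = scale * v
    err = np.linalg.norm(f(x0 + h) - f(x0) - J @ h)
    errs.append(err / np.linalg.norm(h))   # relative linearization error

# Each 10x zoom shrinks the relative error by about 10x (O(||h||) behavior),
# which is exactly the "error vanishes faster than ||h||" condition.
assert errs[0] > errs[1] > errs[2]
assert errs[2] < 2e-2
```

This is also a practical diagnostic: if the relative error does not shrink as you shrink $h$, your analytic Jacobian is wrong.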
Non-differentiability in deep learning. ReLU is not differentiable at 0. In practice, every implementation simply picks a subgradient (usually 0 at the kink). The chain rule still applies via subgradient calculus, and in practice the set of inputs landing exactly at a ReLU kink has measure zero. Nevertheless, for theoretical guarantees (convergence proofs, gradient flow analysis), one typically works with smooth approximations (SiLU/Swish, GELU) or uses tools from non-smooth analysis.