Reproducing Kernel Hilbert Spaces: Mercer's Theorem & Feature Maps
A reproducing kernel Hilbert space is a Hilbert space of functions where point evaluation is a bounded linear functional — this forces the Hilbert space to be "rich enough" to remember function values, not just equivalence classes. Mercer's theorem gives a spectral decomposition of the kernel, expressing it as an inner product of explicit feature vectors, connecting kernel methods to eigenfunction expansions and making infinite-dimensional optimization tractable.
Concepts
Figure: Gram matrix $K_{ij} = k(x_i, x_j)$ for $N = 8$ equally spaced points on $[0,1]$. Color encodes kernel value: low → high.
The kernel trick lets you compute inner products in high-dimensional — or infinite-dimensional — feature spaces without ever computing the feature vectors explicitly. A function replaces the inner product, and every algorithm that depends only on pairwise inner products immediately works in this richer space. The RKHS is the function space that this kernel defines: a Hilbert space where evaluating a function at a point is equivalent to taking an inner product, and where the kernel itself is the "feature vector" of a point.
Positive Definite Kernels
A function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a positive definite kernel if it is symmetric ($k(x,y) = k(y,x)$) and its Gram matrices are positive semidefinite: for all $n$ and $x_1, \dots, x_n \in \mathcal{X}$, the matrix $K_{ij} = k(x_i, x_j)$ satisfies $c^\top K c \ge 0$ for all $c \in \mathbb{R}^n$.
Standard examples with their interpretations:
| Kernel | Formula | Inductive bias |
|---|---|---|
| Linear | $k(x,y) = x^\top y$ | Dot-product similarity |
| Polynomial | $k(x,y) = (x^\top y + c)^d$ | Interactions up to degree $d$ |
| RBF (Gaussian) | $k(x,y) = \exp(-\lVert x-y \rVert^2 / 2\sigma^2)$ | Smooth, infinitely differentiable functions |
| Periodic | $k(x,y) = \exp(-2\sin^2(\pi \lvert x-y \rvert / p) / \ell^2)$ | Functions with period $p$ |
The RBF kernel corresponds to an infinite-dimensional feature map; as $\sigma \to 0$ the kernel approaches a delta function (very local), and as $\sigma \to \infty$ it approaches a constant (very global).
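Positive semidefiniteness of Gram matrices is easy to check numerically. A minimal sketch with NumPy, checking the RBF and polynomial kernels on random data (all parameter values here are arbitrary choices for illustration):

```python
import numpy as np

def rbf_kernel(X, Y, sigma=1.0):
    """Gaussian RBF kernel k(x,y) = exp(-||x-y||^2 / (2 sigma^2))."""
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-sq / (2 * sigma**2))

def poly_kernel(X, Y, c=1.0, d=3):
    """Polynomial kernel k(x,y) = (x . y + c)^d."""
    return (X @ Y.T + c) ** d

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
for kernel in (rbf_kernel, poly_kernel):
    K = kernel(X, X)
    eig = np.linalg.eigvalsh(K)              # symmetric -> real spectrum
    assert eig.min() > -1e-8 * eig.max()     # PSD up to floating-point noise
```

The eigenvalue check uses a relative tolerance because eigensolver round-off scales with the largest eigenvalue.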
Reproducing Kernel Hilbert Spaces
A reproducing kernel Hilbert space (RKHS) associated with a kernel $k$ on $\mathcal{X}$ is a Hilbert space $\mathcal{H}$ of functions $f : \mathcal{X} \to \mathbb{R}$ satisfying:
- For every $x \in \mathcal{X}$, the function $k(\cdot, x)$ belongs to $\mathcal{H}$
- Reproducing property: for every $f \in \mathcal{H}$ and every $x \in \mathcal{X}$, $f(x) = \langle f, k(\cdot, x) \rangle_{\mathcal{H}}$
The reproducing property means point evaluation is a bounded linear functional on $\mathcal{H}$ (by Cauchy–Schwarz: $|f(x)| = |\langle f, k(\cdot, x) \rangle| \le \lVert f \rVert_{\mathcal{H}} \sqrt{k(x,x)}$). The kernel "evaluates" functions via the inner product.
The reproducing property is not a definition but a theorem: if you require point evaluation $L_x : f \mapsto f(x)$ to be a bounded linear functional on a Hilbert space of functions (so that $f(x)$ is well-defined and stable, not just almost-everywhere), the Riesz representation theorem immediately forces $f(x) = \langle f, g_x \rangle$ for some representing element $g_x \in \mathcal{H}$. Calling that element $k(\cdot, x)$ and requiring $\langle k(\cdot, x), k(\cdot, y) \rangle = k(x,y)$ is the only consistent assignment — the kernel function and the RKHS inner product are not separate structures but two faces of the same object.
Moore-Aronszajn Theorem
Theorem. There is a bijection between positive definite kernels on $\mathcal{X}$ and reproducing kernel Hilbert spaces of functions on $\mathcal{X}$. Given $k$, the RKHS $\mathcal{H}_k$ is the completion of $\operatorname{span}\{k(\cdot, x) : x \in \mathcal{X}\}$ under the inner product defined by $\langle k(\cdot, x), k(\cdot, y) \rangle = k(x,y)$.
This is remarkable: a kernel function, which looks like just a similarity measure, completely determines an entire function space with its geometry.
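The construction can be checked numerically on a finite span: for $f = \sum_j c_j\, k(\cdot, z_j)$, the defined inner product reproduces $f(x)$ exactly, and the induced squared norm $c^\top K c$ is nonnegative. A sketch with a scalar RBF kernel (centers, coefficients, and $\sigma = 1$ are arbitrary choices):

```python
import numpy as np

# Scalar RBF kernel with sigma = 1 (arbitrary choice for illustration)
k = lambda x, y: np.exp(-(x - y) ** 2 / 2)

# A finite element of the pre-Hilbert span: f = sum_j c_j k(., z_j)
z = np.array([-1.0, 0.3, 2.0])    # arbitrary centers
c = np.array([0.5, -1.2, 0.7])    # arbitrary coefficients
f = lambda x: np.sum(c * k(x, z))

# Reproducing property on the span: <f, k(., x)> = sum_j c_j k(z_j, x) = f(x)
x0 = 0.8
assert np.isclose(f(x0), np.sum(c * k(z, x0)))

# Induced squared norm ||f||^2 = c^T K c is nonnegative because k is PSD
K = k(z[:, None], z[None, :])
assert c @ K @ c >= 0
```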
Mercer's Theorem
Theorem. Let $\mathcal{X}$ be a compact metric space and $k$ a continuous positive definite kernel. Then the integral operator $(T_k f)(x) = \int_{\mathcal{X}} k(x,y)\, f(y)\, d\mu(y)$ on $L^2(\mathcal{X}, \mu)$ is compact and self-adjoint, and admits an eigendecomposition $T_k \phi_i = \lambda_i \phi_i$ with $\lambda_1 \ge \lambda_2 \ge \cdots \ge 0$. The kernel decomposes as

$$k(x,y) = \sum_{i=1}^{\infty} \lambda_i\, \phi_i(x)\, \phi_i(y),$$

with convergence uniform and absolute. The feature map $\Phi : \mathcal{X} \to \ell^2$ defined by $\Phi(x) = \left(\sqrt{\lambda_i}\, \phi_i(x)\right)_{i \ge 1}$ satisfies $k(x,y) = \langle \Phi(x), \Phi(y) \rangle_{\ell^2}$.
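The expansion can be approximated by discretizing the integral operator on a grid and diagonalizing, in the spirit of the Nyström method. A sketch (grid size, $\sigma = 0.1$, and the truncation level are arbitrary choices):

```python
import numpy as np

# Discretize T_k on a uniform grid over [0, 1]
n = 200
x = np.linspace(0.0, 1.0, n)
w = 1.0 / n                                       # quadrature weight
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * 0.1 ** 2))

lam, phi = np.linalg.eigh(K * w)                  # symmetric eigendecomposition
lam, phi = lam[::-1], phi[:, ::-1]                # sort eigenvalues descending
phi = phi / np.sqrt(w)                            # normalize: sum_i phi_i^2 * w = 1

# Full reconstruction sum_i lam_i phi_i(x) phi_i(y) recovers K exactly
assert np.allclose((phi * lam) @ phi.T, K)

# Truncating the sum leaves only a small error: eigenvalues decay rapidly
m = 20
err = np.max(np.abs(K - (phi[:, :m] * lam[:m]) @ phi[:, :m].T))
print(f"truncation error with {m} of {n} terms: {err:.2e}")
```

The rapid eigenvalue decay is what makes low-rank kernel approximations work in practice.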
The Kernel Trick and the Representer Theorem
The kernel trick. Computing $\langle \Phi(x), \Phi(y) \rangle$ directly requires an inner product in a possibly infinite-dimensional space. The kernel trick replaces this with $k(x,y)$, which is a single scalar computation. Algorithms that depend only on pairwise inner products — SVM, kernel PCA, Gaussian process regression — can be "kernelized" without ever materializing the feature space.
Representer theorem. Consider regularized empirical risk minimization:

$$\min_{f \in \mathcal{H}} \; \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, f(x_i)) + \lambda \lVert f \rVert_{\mathcal{H}}^2$$

The minimizer lies in the finite-dimensional subspace $\operatorname{span}\{k(\cdot, x_i)\}_{i=1}^{n}$, so $f^\star = \sum_{i=1}^{n} \alpha_i\, k(\cdot, x_i)$ for some $\alpha \in \mathbb{R}^n$. Despite $\mathcal{H}$ being infinite-dimensional, the problem reduces to solving an $n \times n$ linear system for $\alpha$.
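With squared loss this becomes kernel ridge regression, where the coefficients solve $(K + \lambda n I)\alpha = y$ in closed form. A minimal sketch (data, $\sigma$, and $\lambda$ are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(40, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(40)   # noisy sine data

def k(A, B, sigma=1.0):
    """RBF Gram block between row sets A and B (1-d inputs)."""
    return np.exp(-(A[:, None, 0] - B[None, :, 0]) ** 2 / (2 * sigma ** 2))

n, lam = len(y), 1e-3
alpha = np.linalg.solve(k(X, X) + lam * n * np.eye(n), y)   # n x n system

# The representer form: f(x) = sum_i alpha_i k(x, x_i)
f = lambda Xnew: k(Xnew, X) @ alpha
train_mse = np.mean((f(X) - y) ** 2)
assert train_mse < 0.05        # fits the training data near the noise level
```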
Worked Example
Explicit Feature Map for the Polynomial Kernel
For $k(x,y) = (x^\top y + 1)^2$ on $\mathbb{R}^2$, expand using the binomial theorem. For $x = (x_1, x_2)$ and $y = (y_1, y_2)$:

$$(x^\top y + 1)^2 = x_1^2 y_1^2 + x_2^2 y_2^2 + 2 x_1 x_2 y_1 y_2 + 2 x_1 y_1 + 2 x_2 y_2 + 1$$

The explicit feature map is $\Phi(x) = \left(x_1^2,\; x_2^2,\; \sqrt{2}\, x_1 x_2,\; \sqrt{2}\, x_1,\; \sqrt{2}\, x_2,\; 1\right)$, confirming $k(x,y) = \langle \Phi(x), \Phi(y) \rangle$. For $d$-dimensional input and degree $p$, the feature map has $\binom{d+p}{p}$ components — computing it explicitly is infeasible for large $d$, but evaluating $k(x,y)$ takes $O(d)$ time.
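The identity between the explicit 6-dimensional feature map and the one-line kernel formula can be verified directly (test points are arbitrary):

```python
import numpy as np

def phi(v):
    """Explicit feature map for k(x,y) = (x . y + 1)^2 on R^2."""
    v1, v2 = v
    return np.array([v1**2, v2**2,
                     np.sqrt(2) * v1 * v2,
                     np.sqrt(2) * v1,
                     np.sqrt(2) * v2,
                     1.0])

k = lambda a, b: (np.dot(a, b) + 1.0) ** 2

x = np.array([0.5, -1.0])    # arbitrary test points
y = np.array([2.0, 0.3])
# Two routes to the same number: 6-dim inner product vs. one scalar formula
assert np.isclose(np.dot(phi(x), phi(y)), k(x, y))
```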
Computing a Gram Matrix for the RBF Kernel
For $\sigma = 1$ and four points $x \in \{0, 1, 2, 3\}$, the entries are $K_{ij} = \exp(-(x_i - x_j)^2 / 2)$:

$$K \approx \begin{pmatrix} 1 & 0.607 & 0.135 & 0.011 \\ 0.607 & 1 & 0.607 & 0.135 \\ 0.135 & 0.607 & 1 & 0.607 \\ 0.011 & 0.135 & 0.607 & 1 \end{pmatrix}$$

The matrix is symmetric positive definite; its eigenvalues give a discrete analogue (via the Nyström approximation) of the Mercer eigenvalue weights in the eigenfunction expansion.
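A few lines reproduce this Gram matrix and confirm it is strictly positive definite:

```python
import numpy as np

# The 4x4 RBF Gram matrix from the example above (sigma = 1, x = 0, 1, 2, 3)
x = np.array([0.0, 1.0, 2.0, 3.0])
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / 2)

eigvals = np.linalg.eigvalsh(K)
assert np.isclose(K[0, 1], np.exp(-0.5))   # adjacent points: ~0.607
assert np.all(eigvals > 0)                 # distinct points => strictly PD
```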
Representer Theorem Derivation
Decompose any $f \in \mathcal{H}$ as $f = f_\parallel + f_\perp$, where $f_\parallel \in V := \operatorname{span}\{k(\cdot, x_i)\}_{i=1}^{n}$ and $f_\perp \in V^\perp$. By the reproducing property, $f_\perp(x_j) = \langle f_\perp, k(\cdot, x_j) \rangle = 0$ — the perpendicular component does not affect function values at training points. Since $\lVert f \rVert^2 = \lVert f_\parallel \rVert^2 + \lVert f_\perp \rVert^2$, the objective is minimized by taking $f_\perp = 0$. Therefore $f^\star = \sum_{i=1}^{n} \alpha_i\, k(\cdot, x_i)$ for some coefficients $\alpha_i$.
Connections
Where Your Intuition Breaks
The kernel trick is often presented as a computational shortcut — you save time by computing $k(x,y)$ instead of $\langle \Phi(x), \Phi(y) \rangle$. But Mercer's theorem reveals it is an exact representation, not an approximation: $k(x,y) = \sum_{i=1}^{\infty} \lambda_i \phi_i(x) \phi_i(y)$ with uniform and absolute convergence. There is no truncation error. The more subtle misconception is that the RKHS norm and the kernel are separate choices — in reality they are the same object. Choosing $k$ determines which functions are "smooth" (small $\lVert f \rVert_{\mathcal{H}}$) and which are "rough" (large norm). Regularization by $\lVert f \rVert_{\mathcal{H}}^2$ is not adding a separate penalty — it enforces exactly the smoothness the kernel itself defines. A practitioner who tunes the regularization strength $\lambda$ without understanding what $\lVert f \rVert_{\mathcal{H}}$ actually measures is regularizing blindly against an implicit smoothness notion they never chose explicitly.
The representer theorem is a remarkable compression: despite $\mathcal{H}$ being infinite-dimensional, the solution to regularized ERM always has the finite representation $f^\star = \sum_{i=1}^{n} \alpha_i\, k(\cdot, x_i)$, reducing an optimization problem over an infinite-dimensional function space to solving an $n \times n$ linear system for $\alpha$. The RKHS norm penalizes functions that deviate from the smoothness encoded in $k$, and the representer theorem says the optimal smooth function interpolates training data in the "kernel direction."
Infinite-width neural networks trained by gradient descent are equivalent to kernel regression with the neural tangent kernel (NTK). In this regime, the network function lies in an RKHS determined by its architecture, and the representer theorem applies: the trained network is a sum of NTK evaluations at training points. This connection is why overparameterized networks can interpolate training data while generalizing — the NTK RKHS norm provides implicit regularization.
Kernel selection determines which RKHS the solution lives in, and the RKHS norm encodes a specific smoothness assumption. Choosing the wrong kernel gives the wrong inductive bias. The RBF kernel with large $\sigma$ biases toward nearly constant functions; with small $\sigma$ it biases toward very local interpolation (effectively nearest-neighbor). Mercer's theorem reveals why: different values of $\sigma$ give different eigenvalue decay rates, and the RKHS norm $\lVert f \rVert_{\mathcal{H}}^2 = \sum_i \langle f, \phi_i \rangle^2 / \lambda_i$ penalizes high-frequency eigenfunctions (large $\lambda_i$ = smooth = small penalty; small $\lambda_i$ = oscillatory = large penalty). A poor kernel choice can make regularization penalize exactly the features needed for good prediction.
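The effect of $\sigma$ on spectral decay is easy to see numerically: the numerical rank of an RBF Gram matrix on a grid shrinks as $\sigma$ grows. A sketch (grid, $\sigma$ values, and the rank threshold are arbitrary choices):

```python
import numpy as np

x = np.linspace(0.0, 1.0, 100)

def gram_eigs(sigma):
    """Descending eigenvalues of the RBF Gram matrix on the grid."""
    K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * sigma ** 2))
    return np.sort(np.linalg.eigvalsh(K))[::-1]

def n_eff(sigma, tol=1e-6):
    """Numerical rank: eigenvalues above tol relative to the largest."""
    lam = gram_eigs(sigma)
    return int(np.sum(lam > tol * lam[0]))

# A global kernel (large sigma) has far fewer usable eigenfunctions
assert n_eff(0.5) < n_eff(0.05)
print(n_eff(0.5), n_eff(0.05))
```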