
Reproducing Kernel Hilbert Spaces: Mercer's Theorem & Feature Maps

A reproducing kernel Hilbert space is a Hilbert space of functions where point evaluation is a bounded linear functional — this forces the space to be "rich enough" to remember function values, not just $L^2$ equivalence classes. Mercer's theorem gives a spectral decomposition of the kernel, expressing it as an inner product of explicit feature maps, connecting kernel methods to eigenfunction expansions and making infinite-dimensional optimization tractable.

Concepts

Figure: Gram matrix $K_{ij} = k(x_i, x_j)$ for $N = 8$ equally-spaced points on $[0, 1]$; color encodes the kernel value from low to high. RBF kernel with $\gamma = 2$: $k(x, x') = \exp(-\gamma \lVert x - x' \rVert^2)$. Large $\gamma$ → narrow kernel → nearly diagonal Gram matrix (local similarity only).

The kernel trick lets you compute inner products in high-dimensional — or infinite-dimensional — feature spaces without ever computing the feature vectors explicitly. A function $k(x, y) = \langle \phi(x), \phi(y) \rangle$ replaces the inner product, and every algorithm that depends only on pairwise inner products immediately works in this richer space. The RKHS is the function space that this kernel defines: a Hilbert space where evaluating a function at a point is equivalent to taking an inner product, and where the kernel itself is the "feature vector" of a point.

Positive Definite Kernels

A function $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a positive definite kernel if it is symmetric ($k(x, y) = k(y, x)$) and its Gram matrices are positive semidefinite: for all $n \geq 1$ and $x_1, \ldots, x_n \in \mathcal{X}$, the matrix $K_{ij} = k(x_i, x_j)$ satisfies $c^\top K c \geq 0$ for all $c \in \mathbb{R}^n$.
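The definition is easy to check numerically: $c^\top K c \geq 0$ for all $c$ is equivalent to every eigenvalue of the Gram matrix being nonnegative. A minimal sketch with the RBF kernel (the point set and length scale are arbitrary choices):

```python
import numpy as np

def rbf_gram(X, ell=1.0):
    """Gram matrix of the RBF kernel k(x, y) = exp(-||x - y||^2 / (2 ell^2))."""
    sq = np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1)
    return np.exp(-sq / (2 * ell**2))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))            # 50 arbitrary points in R^3
K = rbf_gram(X)

# Symmetry and positive semidefiniteness (eigenvalues >= 0 up to roundoff).
assert np.allclose(K, K.T)
assert np.linalg.eigvalsh(K).min() > -1e-10
```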

Standard examples with their interpretations:

| Kernel | Formula | Inductive bias |
| --- | --- | --- |
| Linear | $k(x,y) = x^\top y$ | Dot-product similarity |
| Polynomial | $k(x,y) = (x^\top y + c)^d$ | Interactions up to degree $d$ |
| RBF (Gaussian) | $k(x,y) = \exp(-\lVert x-y\rVert^2 / 2\ell^2)$ | Smooth, infinitely differentiable functions |
| Periodic | $k(x,y) = \exp(-2\sin^2(\pi\lvert x-y\rvert / p)/\ell^2)$ | Periodic functions with period $p$ |

The RBF kernel corresponds to an infinite-dimensional feature map; as $\ell \to 0$ the kernel approaches a delta function (very local), and as $\ell \to \infty$ it approaches a constant (very global).

Reproducing Kernel Hilbert Spaces

A reproducing kernel Hilbert space (RKHS) $\mathcal{H}_k$ associated with a kernel $k$ on $\mathcal{X}$ is a Hilbert space of functions $f: \mathcal{X} \to \mathbb{R}$ satisfying:

  1. For every $x \in \mathcal{X}$, the function $k(\cdot, x): \mathcal{X} \to \mathbb{R}$ belongs to $\mathcal{H}_k$
  2. Reproducing property: for every $f \in \mathcal{H}_k$ and every $x \in \mathcal{X}$,

$$f(x) = \langle f,\, k(\cdot, x) \rangle_{\mathcal{H}_k}$$

The reproducing property means point evaluation $\delta_x: f \mapsto f(x)$ is a bounded linear functional on $\mathcal{H}_k$ (by Cauchy–Schwarz: $|f(x)| = |\langle f, k(\cdot, x)\rangle| \leq \|f\|_{\mathcal{H}_k} \|k(\cdot, x)\|_{\mathcal{H}_k} = \|f\|_{\mathcal{H}_k} \sqrt{k(x, x)}$). The kernel "evaluates" functions via the inner product.

The reproducing property is not a definition but a theorem: if you require point evaluation $\delta_x$ to be a bounded linear functional on a Hilbert space (so that $f(x)$ is well-defined and stable, not just almost-everywhere), the Riesz representation theorem immediately forces $f(x) = \langle f, r_x \rangle$ for some representing element $r_x$. Calling that element $k(\cdot, x)$ and requiring $\langle k(\cdot, x), k(\cdot, y) \rangle = k(x, y)$ is the only consistent assignment — the kernel function and the RKHS inner product are not separate structures but two faces of the same object.

Moore-Aronszajn Theorem

Theorem. There is a bijection between positive definite kernels $k$ on $\mathcal{X}$ and reproducing kernel Hilbert spaces $\mathcal{H}_k$ of functions on $\mathcal{X}$. Given $k$, the RKHS is the completion of $\mathrm{span}\{k(\cdot, x) : x \in \mathcal{X}\}$ under the inner product defined by $\langle k(\cdot, x), k(\cdot, y)\rangle = k(x, y)$.

This is remarkable: a kernel function, which looks like just a similarity measure, completely determines an entire function space with its geometry.

Mercer's Theorem

Theorem. Let $\mathcal{X}$ be a compact metric space and $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ a continuous positive definite kernel. Then the integral operator $T_k f(x) = \int k(x, y) f(y)\, d\mu(y)$ on $L^2(\mathcal{X}, \mu)$ is compact and self-adjoint, and admits an eigendecomposition $T_k \phi_i = \lambda_i \phi_i$ with $\lambda_1 \geq \lambda_2 \geq \cdots \geq 0$. The kernel decomposes as

$$k(x, y) = \sum_{i=1}^\infty \lambda_i\, \phi_i(x)\, \phi_i(y)$$

with the convergence uniform and absolute. The feature map $\phi: \mathcal{X} \to \ell^2$ defined by $\phi(x) = (\sqrt{\lambda_1}\,\phi_1(x),\ \sqrt{\lambda_2}\,\phi_2(x),\ \ldots)$ satisfies $k(x, y) = \langle \phi(x), \phi(y)\rangle_{\ell^2}$.
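On a finite grid the Gram matrix plays the role of $T_k$, and its eigendecomposition is the discrete analogue of Mercer's expansion — a sketch (grid, kernel, and length scale are illustrative choices):

```python
import numpy as np

# Discrete Mercer: eigendecomposing the Gram matrix K = U diag(lam) U^T
# gives a finite feature map Phi = U sqrt(diag(lam)) with K = Phi Phi^T.
x = np.linspace(0, 1, 100)
K = np.exp(-(x[:, None] - x[None, :])**2 / (2 * 0.2**2))   # RBF, ell = 0.2

lam, U = np.linalg.eigh(K)
lam, U = lam[::-1], U[:, ::-1]          # sort eigenvalues descending
Phi = U * np.sqrt(np.clip(lam, 0, None))

assert np.allclose(Phi @ Phi.T, K, atol=1e-8)

# Eigenvalues decay rapidly, so a truncated expansion is already accurate.
r = 10
err = np.abs(Phi[:, :r] @ Phi[:, :r].T - K).max()
assert err < 0.1
```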

The Kernel Trick and the Representer Theorem

The kernel trick. Computing $\langle \phi(x), \phi(y)\rangle_{\mathcal{H}_k}$ directly requires an inner product in a possibly infinite-dimensional space. The kernel trick replaces this with $k(x, y)$, a scalar computation. Algorithms that depend only on pairwise inner products — SVM, kernel PCA, Gaussian process regression — can be "kernelized" without ever forming feature vectors.

Representer theorem. Consider regularized empirical risk minimization:

$$\min_{f \in \mathcal{H}_k} \sum_{i=1}^n L(y_i, f(x_i)) + \lambda \|f\|_{\mathcal{H}_k}^2$$

The minimizer $f^*$ lies in the finite-dimensional subspace $\mathrm{span}\{k(\cdot, x_i)\}_{i=1}^n$, so $f^*(x) = \sum_{i=1}^n \alpha_i k(x, x_i)$ for some $\alpha \in \mathbb{R}^n$. Despite $\mathcal{H}_k$ being infinite-dimensional, the problem reduces to solving for the $n$ coefficients $\alpha$ — for squared loss, an $n \times n$ linear system.
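For squared loss this is kernel ridge regression: the optimality condition gives the $n \times n$ system $(K + \lambda I)\alpha = y$. A toy sketch (the data, length scale, and $\lambda$ are illustrative choices):

```python
import numpy as np

def rbf(a, b, ell=0.3):
    return np.exp(-(a[:, None] - b[None, :])**2 / (2 * ell**2))

rng = np.random.default_rng(1)
x_train = rng.uniform(0, 1, 30)
y_train = np.sin(2 * np.pi * x_train) + 0.1 * rng.normal(size=30)

# Representer theorem: f* = sum_i alpha_i k(., x_i); for squared loss the
# coefficients solve the n x n system (K + lam I) alpha = y.
lam = 1e-3
K = rbf(x_train, x_train)
alpha = np.linalg.solve(K + lam * np.eye(len(x_train)), y_train)

def f_star(x_new):
    return rbf(x_new, x_train) @ alpha

# The fit recovers the noisy samples to within roughly the noise level.
mse = np.mean((f_star(x_train) - y_train)**2)
assert mse < 0.05
```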

Worked Example

Explicit Feature Map for the Polynomial Kernel

For $k(x, y) = (x^\top y + c)^2$ on $\mathbb{R}^d$, expand the square. For $d = 2$:

$$\begin{aligned} (x^\top y + c)^2 &= (x_1 y_1 + x_2 y_2 + c)^2 \\ &= x_1^2 y_1^2 + 2 x_1 x_2 y_1 y_2 + x_2^2 y_2^2 + 2 c x_1 y_1 + 2 c x_2 y_2 + c^2 \end{aligned}$$

The explicit feature map is $\phi(x) = (x_1^2,\ \sqrt{2}\, x_1 x_2,\ x_2^2,\ \sqrt{2c}\, x_1,\ \sqrt{2c}\, x_2,\ c)$, confirming $k(x, y) = \langle \phi(x), \phi(y)\rangle$. For $d$-dimensional input and degree $D$, the feature map has $O(d^D)$ components — computing it explicitly is infeasible for large $d$, but evaluating $k(x, y) = (x^\top y + c)^D$ takes $O(d)$ time.
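A quick numerical check that the explicit map and the kernel agree (the test points and $c$ are arbitrary choices):

```python
import numpy as np

def phi(x, c=1.0):
    """Explicit degree-2 polynomial feature map for d = 2."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2,
                     np.sqrt(2 * c) * x1, np.sqrt(2 * c) * x2, c])

def k_poly(x, y, c=1.0):
    return (x @ y + c)**2

x = np.array([0.7, -1.2])
y = np.array([2.0, 0.5])

# The 6-dimensional inner product equals the O(d) kernel evaluation.
assert np.isclose(phi(x) @ phi(y), k_poly(x, y))
```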

Computing a Gram Matrix for the RBF Kernel

For $\ell = 1$ and four points $x_1 = 0,\ x_2 = 0.5,\ x_3 = 1,\ x_4 = 1.5$:

$$K = \begin{pmatrix} 1 & e^{-0.125} & e^{-0.5} & e^{-1.125} \\ e^{-0.125} & 1 & e^{-0.125} & e^{-0.5} \\ e^{-0.5} & e^{-0.125} & 1 & e^{-0.125} \\ e^{-1.125} & e^{-0.5} & e^{-0.125} & 1 \end{pmatrix} \approx \begin{pmatrix} 1 & 0.88 & 0.61 & 0.32 \\ 0.88 & 1 & 0.88 & 0.61 \\ 0.61 & 0.88 & 1 & 0.88 \\ 0.32 & 0.61 & 0.88 & 1 \end{pmatrix}$$

The matrix is symmetric positive definite; its eigenvalues are the finite-sample analogues (via Mercer's theorem) of the weights $\lambda_i$ in the eigenfunction expansion.
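Reproducing the computation (same four points, $\ell = 1$):

```python
import numpy as np

x = np.array([0.0, 0.5, 1.0, 1.5])
K = np.exp(-(x[:, None] - x[None, :])**2 / 2)   # RBF Gram matrix, ell = 1

print(np.round(K, 2))                 # matches the matrix above
assert np.isclose(K[0, 1], np.exp(-0.125))
assert np.isclose(K[0, 3], np.exp(-1.125))

# Strictly positive eigenvalues: K is symmetric positive definite.
assert np.linalg.eigvalsh(K).min() > 0
```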

Representer Theorem Derivation

Decompose any $f \in \mathcal{H}_k$ as $f = f_\parallel + f_\perp$ where $f_\parallel \in \mathrm{span}\{k(\cdot, x_i)\}$ and $f_\perp \perp \mathrm{span}\{k(\cdot, x_i)\}$. By the reproducing property, $f(x_i) = \langle f, k(\cdot, x_i)\rangle = \langle f_\parallel, k(\cdot, x_i)\rangle = f_\parallel(x_i)$ — the perpendicular component does not affect function values at training points. Since $\|f\|^2 = \|f_\parallel\|^2 + \|f_\perp\|^2 \geq \|f_\parallel\|^2$, the objective is minimized by taking $f_\perp = 0$. Therefore $f^* = \sum_i \alpha_i k(\cdot, x_i)$ for some coefficients $\alpha$.

Connections

Where Your Intuition Breaks

The kernel trick is often presented as a computational shortcut — you save time by computing $k(x, y)$ instead of $\langle\phi(x), \phi(y)\rangle_{\mathcal{H}_k}$. But Mercer's theorem reveals it is an exact representation, not an approximation: $k(x, y) = \sum_i \lambda_i \phi_i(x)\phi_i(y)$ with uniform and absolute convergence. There is no truncation error. The more subtle misconception is that the RKHS norm and the kernel are separate choices — in reality they are the same object. Choosing $k$ determines which functions are "smooth" (small $\|f\|_{\mathcal{H}_k}^2$) and which are "rough" (large norm). Regularization by $\lambda\|f\|_{\mathcal{H}_k}^2$ is not adding a separate penalty — it enforces exactly the smoothness the kernel itself defines. A practitioner who tunes the regularization strength $\lambda$ without understanding what $\|f\|_{\mathcal{H}_k}^2$ actually measures is regularizing blindly against an implicit smoothness notion they never chose explicitly.
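To see what $\|f\|_{\mathcal{H}_k}^2$ measures: for $f = \sum_i \alpha_i k(\cdot, x_i)$ the reproducing property gives $\|f\|_{\mathcal{H}_k}^2 = \alpha^\top K \alpha$. A sketch comparing a slowly varying and a sign-alternating coefficient vector, using the sampled values of $f$ as a stand-in for its overall size (the grid and length scale are arbitrary choices):

```python
import numpy as np

# For f = sum_i alpha_i k(., x_i), the RKHS norm is ||f||^2 = alpha^T K alpha,
# and the sampled values are f(x_j) = (K alpha)_j.
x = np.linspace(0, 1, 20)
K = np.exp(-(x[:, None] - x[None, :])**2 / (2 * 0.2**2))   # RBF, ell = 0.2

def roughness(alpha):
    f_vals = K @ alpha                       # f evaluated at the x_j
    return (alpha @ K @ alpha) / (f_vals @ f_vals)

alpha_smooth = np.ones(20)                   # slowly varying combination
alpha_rough = (-1.0) ** np.arange(20)        # sign-alternating combination

# The oscillating function pays a much larger norm per unit of signal.
assert roughness(alpha_rough) > roughness(alpha_smooth)
```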

💡Intuition

The representer theorem is a remarkable compression: despite $\mathcal{H}_k$ being infinite-dimensional, the solution to regularized ERM always has the finite representation $f^* = \sum_{i=1}^n \alpha_i k(\cdot, x_i)$, reducing an optimization problem over an infinite-dimensional function space to solving an $n \times n$ linear system for $\alpha$. The RKHS norm $\|f\|_{\mathcal{H}_k}^2$ penalizes functions that deviate from the smoothness encoded in $k$, and the representer theorem says the optimal smooth function interpolates training data in the "kernel direction."

💡Intuition

Infinite-width neural networks trained by gradient descent are equivalent to kernel regression with the neural tangent kernel (NTK). In this regime, the network function lies in an RKHS determined by its architecture, and the representer theorem applies: the trained network is a sum of NTK evaluations at training points. This connection is why overparameterized networks can interpolate training data while generalizing — the NTK RKHS norm provides implicit regularization.

⚠️Warning

Kernel selection determines which RKHS the solution lives in, and the RKHS norm encodes a specific smoothness assumption. Choosing the wrong kernel gives the wrong inductive bias. The RBF kernel with large $\ell$ biases toward nearly constant functions; with small $\ell$ it biases toward very local interpolation (effectively nearest-neighbor). Mercer's theorem reveals why: different $\ell$ values give different eigenvalue decay rates $\lambda_i$, and the RKHS norm weights the coefficient of eigenfunction $\phi_i$ by $1/\lambda_i$ (small $i$ = smooth = small penalty; large $i$ = oscillatory = large penalty). A poor kernel choice can make regularization penalize exactly the features needed for good prediction.
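A sketch of this effect via the Gram-matrix spectrum (the grid size, length scales, and the $10^{-6}$ cutoff are arbitrary choices):

```python
import numpy as np

# Eigenvalue decay of the RBF Gram matrix for two length scales.
# Fast decay (large ell): only a few smooth eigenfunctions are cheap.
# Slow decay (small ell): the kernel tolerates more oscillatory functions.
x = np.linspace(0, 1, 200)

def spectrum(ell):
    K = np.exp(-(x[:, None] - x[None, :])**2 / (2 * ell**2))
    return np.sort(np.linalg.eigvalsh(K))[::-1]

wide, narrow = spectrum(1.0), spectrum(0.05)

def effective_rank(lam):
    """Number of eigenvalues carrying at least 1e-6 of the spectral mass."""
    return int(np.sum(lam / lam.sum() > 1e-6))

assert effective_rank(narrow) > effective_rank(wide)
```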
