
Reproducing Kernel Hilbert Spaces: Mercer's Theorem & Feature Maps

A reproducing kernel Hilbert space is a Hilbert space of functions where point evaluation is a bounded linear functional — this forces the space to be "rich enough" to remember function values, not just $L^2$ equivalence classes. Mercer's theorem gives a spectral decomposition of the kernel, expressing it as an inner product of explicit feature maps, connecting kernel methods to eigenfunction expansions and making infinite-dimensional optimization tractable.

Concepts

Figure: Gram matrix $K_{ij} = k(x_i, x_j)$ for $N = 8$ equally-spaced points on $[0, 1]$; color encodes the kernel value from low to high. RBF kernel with $\gamma = 2$: $k(x, x') = \exp(-\gamma \lVert x - x' \rVert^2)$. Large $\gamma$ → narrow kernel → nearly diagonal Gram matrix (local similarity only).

The kernel trick lets you compute inner products in high-dimensional — or infinite-dimensional — feature spaces without ever computing the feature vectors explicitly. A function $k(x, y) = \langle \phi(x), \phi(y) \rangle$ replaces the inner product, and every algorithm that depends only on pairwise inner products immediately works in this richer space. The RKHS is the function space that this kernel defines: a Hilbert space where evaluating a function at a point is equivalent to taking an inner product, and where the kernel itself is the "feature vector" of a point.

Positive Definite Kernels

A function $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a positive definite kernel if it is symmetric ($k(x, y) = k(y, x)$) and its Gram matrices are positive semidefinite: for all $n \geq 1$ and $x_1, \ldots, x_n \in \mathcal{X}$, the matrix $K_{ij} = k(x_i, x_j)$ satisfies $c^\top K c \geq 0$ for all $c \in \mathbb{R}^n$.
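The definition is easy to check numerically: $c^\top K c \geq 0$ for all $c$ is equivalent to every eigenvalue of the Gram matrix being nonnegative. A minimal sketch with the RBF kernel (the point set and length scale are arbitrary choices):

```python
import numpy as np

def rbf_gram(X, ell=1.0):
    """Gram matrix of the RBF kernel k(x, y) = exp(-||x - y||^2 / (2 ell^2))."""
    sq = np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1)
    return np.exp(-sq / (2 * ell**2))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))            # 50 arbitrary points in R^3
K = rbf_gram(X)

# Symmetry and positive semidefiniteness (eigenvalues >= 0 up to roundoff).
assert np.allclose(K, K.T)
assert np.linalg.eigvalsh(K).min() > -1e-10
```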

Standard examples with their interpretations:

| Kernel | Formula | Inductive bias |
| --- | --- | --- |
| Linear | $k(x,y) = x^\top y$ | Dot-product similarity |
| Polynomial | $k(x,y) = (x^\top y + c)^d$ | Interactions up to degree $d$ |
| RBF (Gaussian) | $k(x,y) = \exp(-\lVert x-y\rVert^2 / 2\ell^2)$ | Smooth, infinitely differentiable functions |
| Periodic | $k(x,y) = \exp(-2\sin^2(\pi\lvert x-y\rvert / p)/\ell^2)$ | Periodic functions with period $p$ |

The RBF kernel corresponds to an infinite-dimensional feature map; as $\ell \to 0$ the kernel approaches a delta function (very local), and as $\ell \to \infty$ it approaches a constant (very global).

Reproducing Kernel Hilbert Spaces

A reproducing kernel Hilbert space (RKHS) $\mathcal{H}_k$ associated with a kernel $k$ on $\mathcal{X}$ is a Hilbert space of functions $f: \mathcal{X} \to \mathbb{R}$ satisfying:

  1. For every $x \in \mathcal{X}$, the function $k(\cdot, x): \mathcal{X} \to \mathbb{R}$ belongs to $\mathcal{H}_k$
  2. Reproducing property: for every $f \in \mathcal{H}_k$ and every $x \in \mathcal{X}$,

$$f(x) = \langle f,\, k(\cdot, x) \rangle_{\mathcal{H}_k}$$

The reproducing property means point evaluation $\delta_x: f \mapsto f(x)$ is a bounded linear functional on $\mathcal{H}_k$ (by Cauchy–Schwarz: $|f(x)| = |\langle f, k(\cdot, x)\rangle| \leq \|f\|_{\mathcal{H}_k} \|k(\cdot, x)\|_{\mathcal{H}_k} = \|f\|_{\mathcal{H}_k} \sqrt{k(x, x)}$). The kernel "evaluates" functions via the inner product.

The reproducing property is not a definition but a theorem: if you require point evaluation $\delta_x$ to be a bounded linear functional on a Hilbert space (so that $f(x)$ is well-defined and stable, not just almost-everywhere), the Riesz representation theorem immediately forces $f(x) = \langle f, r_x \rangle$ for some representing element $r_x$. Calling that element $k(\cdot, x)$ and requiring $\langle k(\cdot, x), k(\cdot, y) \rangle = k(x, y)$ is the only consistent assignment — the kernel function and the RKHS inner product are not separate structures but two faces of the same object.

Moore-Aronszajn Theorem

Theorem. There is a bijection between positive definite kernels $k$ on $\mathcal{X}$ and reproducing kernel Hilbert spaces $\mathcal{H}_k$ of functions on $\mathcal{X}$. Given $k$, the RKHS is the completion of $\mathrm{span}\{k(\cdot, x) : x \in \mathcal{X}\}$ under the inner product defined by $\langle k(\cdot, x), k(\cdot, y)\rangle = k(x, y)$.

This is remarkable: a kernel function, which looks like just a similarity measure, completely determines an entire function space with its geometry.

Mercer's Theorem

Theorem. Let $\mathcal{X}$ be a compact metric space and $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ a continuous positive definite kernel. Then the integral operator $T_k f(x) = \int k(x, y) f(y)\, d\mu(y)$ on $L^2(\mathcal{X}, \mu)$ is compact and self-adjoint, and admits an eigendecomposition $T_k \phi_i = \lambda_i \phi_i$ with $\lambda_1 \geq \lambda_2 \geq \cdots \geq 0$. The kernel decomposes as

$$k(x, y) = \sum_{i=1}^\infty \lambda_i\, \phi_i(x)\, \phi_i(y)$$

with the convergence uniform and absolute. The feature map $\phi: \mathcal{X} \to \ell^2$ defined by $\phi(x) = (\sqrt{\lambda_1}\,\phi_1(x),\ \sqrt{\lambda_2}\,\phi_2(x),\ \ldots)$ satisfies $k(x, y) = \langle \phi(x), \phi(y)\rangle_{\ell^2}$.
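On a finite grid the Gram matrix plays the role of $T_k$, and its eigendecomposition is the discrete analogue of Mercer's expansion — a sketch (grid, kernel, and length scale are illustrative choices):

```python
import numpy as np

# Discrete Mercer: eigendecomposing the Gram matrix K = U diag(lam) U^T
# gives a finite feature map Phi = U sqrt(diag(lam)) with K = Phi Phi^T.
x = np.linspace(0, 1, 100)
K = np.exp(-(x[:, None] - x[None, :])**2 / (2 * 0.2**2))   # RBF, ell = 0.2

lam, U = np.linalg.eigh(K)
lam, U = lam[::-1], U[:, ::-1]          # sort eigenvalues descending
Phi = U * np.sqrt(np.clip(lam, 0, None))

assert np.allclose(Phi @ Phi.T, K, atol=1e-8)

# Eigenvalues decay rapidly, so a truncated expansion is already accurate.
r = 10
err = np.abs(Phi[:, :r] @ Phi[:, :r].T - K).max()
assert err < 0.1
```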

The Kernel Trick and the Representer Theorem

The kernel trick. Computing $\langle \phi(x), \phi(y)\rangle_{\mathcal{H}_k}$ directly requires an inner product in a possibly infinite-dimensional space. The kernel trick replaces this with $k(x, y)$, a scalar computation. Algorithms that depend only on pairwise inner products — SVM, kernel PCA, Gaussian process regression — can be "kernelized" without ever forming feature vectors.

Representer theorem. Consider regularized empirical risk minimization:

$$\min_{f \in \mathcal{H}_k} \sum_{i=1}^n L(y_i, f(x_i)) + \lambda \|f\|_{\mathcal{H}_k}^2$$

The minimizer $f^*$ lies in the finite-dimensional subspace $\mathrm{span}\{k(\cdot, x_i)\}_{i=1}^n$, so $f^*(x) = \sum_{i=1}^n \alpha_i k(x, x_i)$ for some $\alpha \in \mathbb{R}^n$. Despite $\mathcal{H}_k$ being infinite-dimensional, the problem reduces to solving for the $n$ coefficients $\alpha$ — for squared loss, an $n \times n$ linear system.
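For squared loss this is kernel ridge regression: the optimality condition gives the $n \times n$ system $(K + \lambda I)\alpha = y$. A toy sketch (the data, length scale, and $\lambda$ are illustrative choices):

```python
import numpy as np

def rbf(a, b, ell=0.3):
    return np.exp(-(a[:, None] - b[None, :])**2 / (2 * ell**2))

rng = np.random.default_rng(1)
x_train = rng.uniform(0, 1, 30)
y_train = np.sin(2 * np.pi * x_train) + 0.1 * rng.normal(size=30)

# Representer theorem: f* = sum_i alpha_i k(., x_i); for squared loss the
# coefficients solve the n x n system (K + lam I) alpha = y.
lam = 1e-3
K = rbf(x_train, x_train)
alpha = np.linalg.solve(K + lam * np.eye(len(x_train)), y_train)

def f_star(x_new):
    return rbf(x_new, x_train) @ alpha

# The fit recovers the noisy samples to within roughly the noise level.
mse = np.mean((f_star(x_train) - y_train)**2)
assert mse < 0.05
```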

Worked Example

Explicit Feature Map for the Polynomial Kernel

For $k(x, y) = (x^\top y + c)^2$ on $\mathbb{R}^d$, expand the square. For $d = 2$:

$$\begin{aligned} (x^\top y + c)^2 &= (x_1 y_1 + x_2 y_2 + c)^2 \\ &= x_1^2 y_1^2 + 2 x_1 x_2 y_1 y_2 + x_2^2 y_2^2 + 2 c x_1 y_1 + 2 c x_2 y_2 + c^2 \end{aligned}$$

The explicit feature map is $\phi(x) = (x_1^2,\ \sqrt{2}\, x_1 x_2,\ x_2^2,\ \sqrt{2c}\, x_1,\ \sqrt{2c}\, x_2,\ c)$, confirming $k(x, y) = \langle \phi(x), \phi(y)\rangle$. For $d$-dimensional input and degree $D$, the feature map has $O(d^D)$ components — computing it explicitly is infeasible for large $d$, but evaluating $k(x, y) = (x^\top y + c)^D$ takes $O(d)$ time.
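A quick numerical check that the explicit map and the kernel agree (the test points and $c$ are arbitrary choices):

```python
import numpy as np

def phi(x, c=1.0):
    """Explicit degree-2 polynomial feature map for d = 2."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2,
                     np.sqrt(2 * c) * x1, np.sqrt(2 * c) * x2, c])

def k_poly(x, y, c=1.0):
    return (x @ y + c)**2

x = np.array([0.7, -1.2])
y = np.array([2.0, 0.5])

# The 6-dimensional inner product equals the O(d) kernel evaluation.
assert np.isclose(phi(x) @ phi(y), k_poly(x, y))
```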

Computing a Gram Matrix for the RBF Kernel

For $\ell = 1$ and four points $x_1 = 0,\ x_2 = 0.5,\ x_3 = 1,\ x_4 = 1.5$:

$$K = \begin{pmatrix} 1 & e^{-0.125} & e^{-0.5} & e^{-1.125} \\ e^{-0.125} & 1 & e^{-0.125} & e^{-0.5} \\ e^{-0.5} & e^{-0.125} & 1 & e^{-0.125} \\ e^{-1.125} & e^{-0.5} & e^{-0.125} & 1 \end{pmatrix} \approx \begin{pmatrix} 1 & 0.88 & 0.61 & 0.32 \\ 0.88 & 1 & 0.88 & 0.61 \\ 0.61 & 0.88 & 1 & 0.88 \\ 0.32 & 0.61 & 0.88 & 1 \end{pmatrix}$$

The matrix is symmetric positive definite; its eigenvalues are the finite-sample analogues (via Mercer's theorem) of the weights $\lambda_i$ in the eigenfunction expansion.
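Reproducing the computation (same four points, $\ell = 1$):

```python
import numpy as np

x = np.array([0.0, 0.5, 1.0, 1.5])
K = np.exp(-(x[:, None] - x[None, :])**2 / 2)   # RBF Gram matrix, ell = 1

print(np.round(K, 2))                 # matches the matrix above
assert np.isclose(K[0, 1], np.exp(-0.125))
assert np.isclose(K[0, 3], np.exp(-1.125))

# Strictly positive eigenvalues: K is symmetric positive definite.
assert np.linalg.eigvalsh(K).min() > 0
```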

Representer Theorem Derivation

Decompose any $f \in \mathcal{H}_k$ as $f = f_\parallel + f_\perp$ where $f_\parallel \in \mathrm{span}\{k(\cdot, x_i)\}$ and $f_\perp \perp \mathrm{span}\{k(\cdot, x_i)\}$. By the reproducing property, $f(x_i) = \langle f, k(\cdot, x_i)\rangle = \langle f_\parallel, k(\cdot, x_i)\rangle = f_\parallel(x_i)$ — the perpendicular component does not affect function values at training points. Since $\|f\|^2 = \|f_\parallel\|^2 + \|f_\perp\|^2 \geq \|f_\parallel\|^2$, the objective is minimized by taking $f_\perp = 0$. Therefore $f^* = \sum_i \alpha_i k(\cdot, x_i)$ for some coefficients $\alpha$.

Connections

Where Your Intuition Breaks

The kernel trick is often presented as a computational shortcut — you save time by computing $k(x, y)$ instead of $\langle\phi(x), \phi(y)\rangle_{\mathcal{H}_k}$. But Mercer's theorem reveals it is an exact representation, not an approximation: $k(x, y) = \sum_i \lambda_i \phi_i(x)\phi_i(y)$ with uniform and absolute convergence. There is no truncation error. The more subtle misconception is that the RKHS norm and the kernel are separate choices — in reality they are the same object. Choosing $k$ determines which functions are "smooth" (small $\|f\|_{\mathcal{H}_k}^2$) and which are "rough" (large norm). Regularization by $\lambda\|f\|_{\mathcal{H}_k}^2$ is not adding a separate penalty — it enforces exactly the smoothness the kernel itself defines. A practitioner who tunes the regularization strength $\lambda$ without understanding what $\|f\|_{\mathcal{H}_k}^2$ actually measures is regularizing blindly against an implicit smoothness notion they never chose explicitly.
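To see what $\|f\|_{\mathcal{H}_k}^2$ measures: for $f = \sum_i \alpha_i k(\cdot, x_i)$ the reproducing property gives $\|f\|_{\mathcal{H}_k}^2 = \alpha^\top K \alpha$. A sketch comparing a slowly varying and a sign-alternating coefficient vector, using the sampled values of $f$ as a stand-in for its overall size (the grid and length scale are arbitrary choices):

```python
import numpy as np

# For f = sum_i alpha_i k(., x_i), the RKHS norm is ||f||^2 = alpha^T K alpha,
# and the sampled values are f(x_j) = (K alpha)_j.
x = np.linspace(0, 1, 20)
K = np.exp(-(x[:, None] - x[None, :])**2 / (2 * 0.2**2))   # RBF, ell = 0.2

def roughness(alpha):
    f_vals = K @ alpha                       # f evaluated at the x_j
    return (alpha @ K @ alpha) / (f_vals @ f_vals)

alpha_smooth = np.ones(20)                   # slowly varying combination
alpha_rough = (-1.0) ** np.arange(20)        # sign-alternating combination

# The oscillating function pays a much larger norm per unit of signal.
assert roughness(alpha_rough) > roughness(alpha_smooth)
```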

💡Intuition

The representer theorem is a remarkable compression: despite $\mathcal{H}_k$ being infinite-dimensional, the solution to regularized ERM always has the finite representation $f^* = \sum_{i=1}^n \alpha_i k(\cdot, x_i)$, reducing an optimization problem over an infinite-dimensional function space to solving an $n \times n$ linear system for $\alpha$. The RKHS norm $\|f\|_{\mathcal{H}_k}^2$ penalizes functions that deviate from the smoothness encoded in $k$, and the representer theorem says the optimal smooth function interpolates training data in the "kernel direction."

💡Intuition

Infinite-width neural networks trained by gradient descent are equivalent to kernel regression with the neural tangent kernel (NTK). In this regime, the network function lies in an RKHS determined by its architecture, and the representer theorem applies: the trained network is a sum of NTK evaluations at training points. This connection is why overparameterized networks can interpolate training data while generalizing — the NTK RKHS norm provides implicit regularization.

⚠️Warning

Kernel selection determines which RKHS the solution lives in, and the RKHS norm encodes a specific smoothness assumption. Choosing the wrong kernel gives the wrong inductive bias. The RBF kernel with large $\ell$ biases toward nearly constant functions; with small $\ell$ it biases toward very local interpolation (effectively nearest-neighbor). Mercer's theorem reveals why: different $\ell$ values give different eigenvalue decay rates $\lambda_i$, and the RKHS norm weights the coefficient of eigenfunction $\phi_i$ by $1/\lambda_i$ (small $i$ = smooth = small penalty; large $i$ = oscillatory = large penalty). A poor kernel choice can make regularization penalize exactly the features needed for good prediction.
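A sketch of this effect via the Gram-matrix spectrum (the grid size, length scales, and the $10^{-6}$ cutoff are arbitrary choices):

```python
import numpy as np

# Eigenvalue decay of the RBF Gram matrix for two length scales.
# Fast decay (large ell): only a few smooth eigenfunctions are cheap.
# Slow decay (small ell): the kernel tolerates more oscillatory functions.
x = np.linspace(0, 1, 200)

def spectrum(ell):
    K = np.exp(-(x[:, None] - x[None, :])**2 / (2 * ell**2))
    return np.sort(np.linalg.eigvalsh(K))[::-1]

wide, narrow = spectrum(1.0), spectrum(0.05)

def effective_rank(lam):
    """Number of eigenvalues carrying at least 1e-6 of the spectral mass."""
    return int(np.sum(lam / lam.sum() > 1e-6))

assert effective_rank(narrow) > effective_rank(wide)
```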
