
Bridge: Kernel Methods, Attention as Operators & Neural Operators (FNO)

The tools of functional analysis — Hilbert spaces, operator theory, and RKHS — are not abstract curiosities; they are the mathematical substrate of kernel SVMs, Gaussian processes, self-attention, and infinite-width neural network theory. This lesson maps each machine learning construct to its functional-analytic foundation, showing why the abstractions are load-bearing and not merely decorative.

Concepts

[Interactive figure: orthogonal projection demo — drag the tip of v to see P(v), the residual v − P(v), their norms, and the angle θ; a slider rotates the subspace W.]
The residual v − P(v) is always perpendicular to W (right-angle box). This is the Best Approximation Theorem: P(v) is the closest point in W to v.

SVMs, Gaussian processes, self-attention, and infinite-width neural networks look like unrelated models — but each is computing projections in a function space defined by a kernel, and the mathematical structure they share is the RKHS. The functional-analytic lens unifies these methods: training a kernel SVM, conditioning a GP posterior, and running gradient descent on a wide network all reduce to finding the element of a Hilbert space closest to the data, with the kernel determining what "close" means.

RKHS in Kernel SVMs and Gaussian Processes

A kernel SVM solves the dual problem $\max_\alpha \sum_i \alpha_i - \frac{1}{2}\sum_{i,j}\alpha_i\alpha_j y_i y_j k(x_i, x_j)$ — entirely in terms of kernel evaluations. The primal solution $f^*(x) = \sum_i \alpha_i y_i k(x, x_i)$ lives in the RKHS $\mathcal{H}_k$ by the representer theorem. The margin $2/\|f^*\|_{\mathcal{H}_k}$ being maximized is exactly the RKHS norm being controlled.
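As a concrete toy illustration, here is a minimal numpy sketch that solves the dual by projected gradient ascent. The RBF kernel, data, and learning rate are illustrative choices, and the bias term (and its equality constraint) is omitted for simplicity:

```python
import numpy as np

def rbf(X1, X2, gamma=1.0):
    # Gram matrix of the RBF kernel k(x, x') = exp(-gamma * ||x - x'||^2)
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

# Toy 1-D data: two well-separated clusters
X = np.array([[-2.0], [-1.5], [1.5], [2.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])

K = rbf(X, X)
Q = (y[:, None] * y[None, :]) * K      # Q_ij = y_i y_j k(x_i, x_j)

# Projected gradient ascent on the dual: max sum(a) - 0.5 a^T Q a, 0 <= a <= C
alpha, C, lr = np.zeros(4), 10.0, 0.1
for _ in range(2000):
    alpha += lr * (1.0 - Q @ alpha)    # gradient of the dual objective
    alpha = np.clip(alpha, 0.0, C)     # box constraint (bias/equality constraint omitted)

def f(x):
    # Representer-theorem form: f(x) = sum_i alpha_i y_i k(x, x_i)
    return (alpha * y) @ rbf(X, np.atleast_2d(x)).ravel()

print(np.sign(f([-1.8])), np.sign(f([1.8])))
```

The final classifier is expressed purely through kernel evaluations at the training points, exactly as the representer theorem guarantees.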

In a Gaussian process (GP), the covariance function $k(x, x')$ is the kernel, and the prior $f \sim \mathcal{GP}(0, k)$ places a Gaussian measure on a function space built from $k$ (sample paths lie just outside $\mathcal{H}_k$ itself, though the posterior mean does lie in it). The posterior mean — the MAP estimate after observing data — is exactly the representer theorem solution for squared loss: $\bar{f}(x) = k(x, X)(K + \sigma^2 I)^{-1}y$, where $k(x, X)$ is the row vector of kernel evaluations. Mercer's theorem connects the GP covariance to its eigenfunction expansion: the prior concentrates on functions with large projections onto eigenfunctions with large eigenvalues $\lambda_i$.

Neural Tangent Kernel Theory

Consider a network $f(x; \theta) = f_\theta(x)$ with parameters $\theta \in \mathbb{R}^P$. The neural tangent kernel is

$$k^\infty(x, x') = \mathbb{E}_{\theta \sim \text{init}}\left[\nabla_\theta f_\theta(x) \cdot \nabla_\theta f_\theta(x')\right]$$

In the limit of infinite width, two things happen simultaneously: (1) the NTK converges to a deterministic kernel $k^\infty$ that depends on the architecture but not on the specific weight initialization, and (2) the NTK stays constant during training (the "lazy training" or "kernel regime"). Under these conditions, gradient descent on the squared loss gives a trajectory that satisfies the linear ODE:

$$\dot{f}_t(x) = -\int k^\infty(x, x')\,(f_t(x') - y')\,d\mu(x')$$

where $\mu$ is the empirical measure on the training data. This is gradient flow in function space — an infinite-dimensional ODE whose solution is computable in terms of the NTK eigendecomposition.

Spectral bias / frequency principle. Decompose the training error $f_t - f^*$ in the basis of NTK eigenfunctions $\{\phi_i\}$ with eigenvalues $\lambda_1 \geq \lambda_2 \geq \cdots$. The component along $\phi_i$ decays as $e^{-\lambda_i t}$. Modes with large $\lambda_i$ (low-frequency, smooth components) decay fastest — networks learn low-frequency functions first. This explains empirical observations of generalization before memorization and connects to the implicit regularization of early stopping.
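The mode-by-mode decay can be checked numerically. The sketch below uses a smooth RBF Gram matrix as a stand-in for the NTK (an assumption — any PSD kernel with a decaying spectrum shows the same effect) and evaluates the closed-form solution of the linear flow on the training grid:

```python
import numpy as np

# Kernel gradient flow on the training grid: d f_t/dt = -K (f_t - y).
# With K = U diag(lambda) U^T, the error along eigenvector u_i decays as exp(-lambda_i t).
n = 64
X = np.linspace(0.0, 1.0, n)
K = np.exp(-0.5 * ((X[:, None] - X[None, :]) / 0.2) ** 2)  # smooth stand-in kernel
lam, U = np.linalg.eigh(K)                                 # eigenvalues, ascending order

y_low = np.sin(2 * np.pi * X)                  # smooth component of the target
y_high = 0.3 * np.sin(20 * np.pi * X)          # oscillatory component
y = y_low + y_high

c0 = U.T @ y                                   # error coordinates at f_0 = 0
t = 0.5
f_t = y - U @ (np.exp(-lam * t) * c0)          # exact solution of the linear flow

# At t = 0.5 the smooth component is essentially fit; the oscillatory one is barely touched.
print(np.linalg.norm(f_t - y_low), np.linalg.norm(f_t - y))
```

The residual against the smooth component is already small while the full residual is still dominated by the high-frequency part — the spectral bias in miniature.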

The spectral bias is not a property of gradient descent in weight space — it is a property of gradient descent in function space. The NTK eigendecomposition shows that gradient descent implicitly projects the target onto the RKHS defined by $k^\infty$, weighting each eigenfunction by its eigenvalue. The kernel had to be defined as $\nabla_\theta f(x) \cdot \nabla_\theta f(x')$ because this inner product is exactly what a gradient step computes: the change in $f(x)$ from a step at training point $x'$ is proportional to how correlated the parameter gradients at $x$ and $x'$ are — which is $k^\infty(x, x')$.

Gradient Flows in Function Space

The NTK describes gradient descent as a flow in function space. More generally, given a divergence or energy functional $\mathcal{F}[p]$ over probability distributions $p$, the Wasserstein gradient flow is the flow that decreases $\mathcal{F}$ most rapidly in the geometry of the Wasserstein-2 metric on probability measures:

$$\partial_t p_t = \nabla \cdot \left(p_t \nabla \frac{\delta \mathcal{F}}{\delta p_t}\right)$$

Training a generative model by minimizing KL divergence or the Wasserstein distance is a gradient flow in the space of probability measures $(\mathcal{P}_2(\mathbb{R}^d), W_2)$. The functional derivative $\delta\mathcal{F}/\delta p$ is the infinite-dimensional analogue of a gradient — evaluated at a point $x$, it gives the direction of steepest descent for the "particle" at $x$.

Stein Variational Gradient Descent (SVGD). Rather than flowing a continuous density, SVGD transports a set of $n$ particles $\{x_i\}$ that approximate $p$. The update is:

$$x_i \leftarrow x_i + \epsilon\,\phi^*(x_i), \quad \phi^*(x) = \frac{1}{n}\sum_{j=1}^n \left[k(x_j, x)\nabla_{x_j}\log p(x_j) + \nabla_{x_j}k(x_j, x)\right]$$

where $k$ is a kernel and $\phi^*$ is the optimal update direction in the unit ball of the RKHS, obtained by applying the kernelized Stein operator to the empirical measure. The first term pushes particles toward high-probability regions; the second term is a repulsion force (via the kernel gradient) that prevents collapse. SVGD is a deterministic particle transport method derived from minimizing the Kernelized Stein Discrepancy (an RKHS-based divergence).
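A minimal 1-D SVGD sketch for a standard-normal target. The RBF kernel and median-heuristic bandwidth are conventional choices, not mandated by the formula above:

```python
import numpy as np

def svgd_step(x, grad_logp, eps=0.2):
    """One SVGD update for 1-D particles with an RBF kernel."""
    n = x.shape[0]
    d = x[:, None] - x[None, :]                 # d[j, i] = x_j - x_i
    h = np.median(d**2) / np.log(n) + 1e-12     # median-heuristic bandwidth (assumption)
    Kmat = np.exp(-d**2 / h)                    # k(x_j, x_i)
    grad_k = -2.0 * d / h * Kmat                # d k(x_j, x_i) / d x_j  (repulsion)
    phi = (Kmat @ grad_logp(x) + grad_k.sum(axis=0)) / n
    return x + eps * phi

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=0.5, size=50)     # particles start off-target
for _ in range(2000):
    x = svgd_step(x, lambda z: -z)              # target N(0, 1): grad log p(z) = -z
print(x.mean(), x.std())                        # particles drift toward and spread over N(0, 1)
```

Without the `grad_k` repulsion term, all particles would collapse onto the mode at 0; the kernel gradient is what keeps the empirical measure spread out.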

Worked Example

NTK for a Single-Layer Network in the Lazy Training Regime

Consider $f(x; \theta) = \frac{1}{\sqrt{m}}\sum_{j=1}^m a_j\,\sigma(w_j^\top x + b_j)$ where only the input weights $W = \{w_j\}$ are trained (the output weights $a_j$ and biases $b_j$ are fixed at initialization). The gradient with respect to $w_j$ is $\nabla_{w_j} f = \frac{a_j}{\sqrt{m}}\sigma'(w_j^\top x + b_j)\,x$. The NTK is:

$$\begin{aligned} k_\theta(x, x') &= \nabla_\theta f(x;\theta)^\top \nabla_\theta f(x';\theta) \\ &= \frac{1}{m}\sum_{j=1}^m a_j^2\,\sigma'(w_j^\top x + b_j)\,\sigma'(w_j^\top x' + b_j)\,(x^\top x') \end{aligned}$$

As $m \to \infty$, by the law of large numbers this converges to the deterministic limit:

$$k^\infty(x, x') = \mathbb{E}_{a,w,b}\left[a^2\,\sigma'(w^\top x + b)\,\sigma'(w^\top x' + b)\right] \cdot (x^\top x')$$

Crucially, this kernel is data-dependent (through $x, x'$) but, in the limit, not dependent on the particular weight draw — it is determined entirely by the architecture and the initialization distribution. In the lazy training regime, the kernel stays approximately constant during training, so gradient descent reduces to kernel regression with $k^\infty$.
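The law-of-large-numbers convergence is easy to check numerically. The sketch below takes $\sigma = \mathrm{ReLU}$ (so $\sigma'(u) = \mathbf{1}[u > 0]$) and $a_j \in \{\pm 1\}$ — both illustrative choices — for scalar inputs:

```python
import numpy as np

# Empirical NTK of f(x) = (1/sqrt m) sum_j a_j relu(w_j x + b_j) with only W trained:
# k_theta(x, x') = (1/m) sum_j a_j^2 1[w_j x + b_j > 0] 1[w_j x' + b_j > 0] (x x')
def empirical_ntk(x, xp, m, rng):
    a = rng.choice([-1.0, 1.0], size=m)               # fixed output weights, a_j^2 = 1
    w = rng.standard_normal(m)
    b = rng.standard_normal(m)
    both_active = (w * x + b > 0) & (w * xp + b > 0)  # relu'(u) = 1[u > 0]
    return (a**2 * both_active).mean() * (x * xp)

rng = np.random.default_rng(0)
for m in (100, 10_000, 1_000_000):
    print(m, empirical_ntk(1.0, 0.7, m, rng))
# fluctuations across m shrink as 1/sqrt(m): the kernel becomes deterministic
```

For Gaussian $w, b$ the expectation is the probability that both pre-activations are positive, which depends only on the angle between the augmented inputs — so the limit is fixed by the architecture, not the draw.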

Connecting GP Posterior to Operator Projection

For a GP with kernel $k$ and observations $(X, y)$ with noise variance $\sigma^2$, the posterior mean is

$$\bar{f}(x) = k(x, X)(K_{XX} + \sigma^2 I)^{-1}y = \sum_{i=1}^n \alpha_i k(x, x_i)$$

where $\alpha = (K_{XX} + \sigma^2 I)^{-1}y$. In Mercer's basis $\{(\lambda_i, \phi_i)\}$, this is a projection: each mode $\phi_i$ is retained with weight $\lambda_i / (\lambda_i + \sigma^2)$ — a Wiener filter. Modes with $\lambda_i \gg \sigma^2$ are passed through; modes with $\lambda_i \ll \sigma^2$ are shrunk toward zero.
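The Wiener-filter view can be verified exactly at the training points, where the Mercer expansion reduces to the eigendecomposition of the Gram matrix $K = U\Lambda U^\top$ (the kernel and noise level below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.linspace(-2, 2, 20)
K = np.exp(-0.5 * (X[:, None] - X[None, :]) ** 2)   # RBF Gram matrix
y = np.sin(X) + 0.1 * rng.standard_normal(20)
sigma2 = 0.05

# Direct posterior mean at the training points: K (K + sigma^2 I)^{-1} y
mean_direct = K @ np.linalg.solve(K + sigma2 * np.eye(20), y)

# Same computation as a Wiener filter in the eigenbasis of K:
# each mode is shrunk by lambda_i / (lambda_i + sigma^2)
lam, U = np.linalg.eigh(K)
mean_filter = U @ (lam / (lam + sigma2) * (U.T @ y))

print(np.allclose(mean_direct, mean_filter))   # True
```

The agreement is an exact linear-algebra identity, $K(K + \sigma^2 I)^{-1} = U\,\Lambda(\Lambda + \sigma^2 I)^{-1}U^\top$, which is the finite-sample shadow of the Mercer-basis shrinkage.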

Connections

Where Your Intuition Breaks

The NTK analysis shows that infinitely wide networks are equivalent to kernel regression — which might suggest it gives a complete theory of deep learning. It does not. Real networks of practical width operate in the "feature learning" regime: their NTK changes throughout training as the network learns useful internal representations, violating the frozen-kernel assumption. The kernel regime and the feature learning regime make opposite predictions about the role of architecture: in the kernel regime, only the initial NTK determines generalization; in the feature learning regime, the representation learned during training matters far more. Empirically, networks operating in the feature learning regime generalize substantially better than kernel regression with the corresponding initial NTK — the theoretical clean room and the practical regime are different worlds, and intuition from one does not transfer to the other.

💡Intuition

The NTK regime is theoretically clean because it linearizes training dynamics, turning a nonlinear optimization into a linear ODE whose solution is governed by eigenvalue decay. The spectral bias follows directly: modes with large NTK eigenvalues (smooth, low-frequency functions) converge exponentially faster than high-frequency modes. This is why gradient descent on overparameterized networks first fits the smooth signal before fitting noise — it is a consequence of the functional-analytic structure, not a property specific to any particular architecture.

💡Intuition

The Wasserstein gradient flow connects deep learning optimization to optimal transport theory. Training a neural network by minimizing a divergence between the model distribution and the data distribution (as in variational inference or GANs) is a gradient flow in the space of probability measures $(\mathcal{P}_2, W_2)$. The Wasserstein metric gives this space a Riemannian structure, and the flow follows the Riemannian gradient — a functional derivative that generalizes the Euclidean gradient to the space of distributions.

⚠️Warning

The NTK regime is theoretically tractable but practically limited. Real networks of finite width operate in the "feature learning" regime: the NTK changes substantially during training as the network learns internal representations. NTK theory predicts that overparameterized networks can interpolate training data (correct) but also that they generalize no better than kernel regression with the initial NTK (often incorrect for large networks). The gap between NTK theory and empirical behavior — particularly for networks that learn useful features from data — is an active area of research.

⚠️Warning

SVGD requires computing all $n^2$ pairwise kernel evaluations $k(x_i, x_j)$ and their gradients at each iteration, giving $O(n^2 d)$ cost per step. This limits practical SVGD to particle counts of hundreds to a few thousand. For high-dimensional distributions (e.g., posterior inference over neural network weights), naive SVGD is computationally infeasible. Approximations include random feature expansions of the kernel (reducing the cost to $O(nD)$ for $D$ random features) and structured kernels that exploit problem geometry — active areas in scalable Bayesian inference.
