
Bridge: Kernel Methods, Attention as Operators & Neural Operators (FNO)

The tools of functional analysis — Hilbert spaces, operator theory, and RKHS — are not abstract curiosities; they are the mathematical substrate of kernel SVMs, Gaussian processes, self-attention, and infinite-width neural network theory. This lesson maps each machine learning construct to its functional-analytic foundation, showing why the abstractions are load-bearing and not merely decorative.

Concepts

[Interactive figure: orthogonal projection demo — drag the tip of v to see P(v), the residual v − P(v), their norms, and the angle θ; a slider rotates the subspace W.]
The residual v − P(v) is always perpendicular to W (right-angle box). This is the Best Approximation Theorem: P(v) is the closest point in W to v.

SVMs, Gaussian processes, self-attention, and infinite-width neural networks look like unrelated models — but each is computing projections in a function space defined by a kernel, and the mathematical structure they share is the RKHS. The functional-analytic lens unifies these methods: training a kernel SVM, conditioning a GP posterior, and running gradient descent on a wide network all reduce to finding the element of a Hilbert space closest to the data, with the kernel determining what "close" means.

RKHS in Kernel SVMs and Gaussian Processes

A kernel SVM solves the dual problem $\max_\alpha \sum_i \alpha_i - \frac{1}{2}\sum_{i,j}\alpha_i\alpha_j y_i y_j k(x_i, x_j)$ — entirely in terms of kernel evaluations. The primal solution $f^*(x) = \sum_i \alpha_i y_i k(x, x_i)$ lives in the RKHS $\mathcal{H}_k$ by the representer theorem. The margin $2/\|f^*\|_{\mathcal{H}_k}$ being maximized is exactly the RKHS norm being controlled.
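As a concrete toy illustration, here is a minimal numpy sketch that solves the dual by projected gradient ascent. The RBF kernel, data, and learning rate are illustrative choices, and the bias term (and its equality constraint) is omitted for simplicity:

```python
import numpy as np

def rbf(X1, X2, gamma=1.0):
    # Gram matrix of the RBF kernel k(x, x') = exp(-gamma * ||x - x'||^2)
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

# Toy 1-D data: two well-separated clusters
X = np.array([[-2.0], [-1.5], [1.5], [2.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])

K = rbf(X, X)
Q = (y[:, None] * y[None, :]) * K      # Q_ij = y_i y_j k(x_i, x_j)

# Projected gradient ascent on the dual: max sum(a) - 0.5 a^T Q a, 0 <= a <= C
alpha, C, lr = np.zeros(4), 10.0, 0.1
for _ in range(2000):
    alpha += lr * (1.0 - Q @ alpha)    # gradient of the dual objective
    alpha = np.clip(alpha, 0.0, C)     # box constraint (bias/equality constraint omitted)

def f(x):
    # Representer-theorem form: f(x) = sum_i alpha_i y_i k(x, x_i)
    return (alpha * y) @ rbf(X, np.atleast_2d(x)).ravel()

print(np.sign(f([-1.8])), np.sign(f([1.8])))
```

The final classifier is expressed purely through kernel evaluations at the training points, exactly as the representer theorem guarantees.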

In a Gaussian process (GP), the covariance function $k(x, x')$ is the kernel, and the prior $f \sim \mathcal{GP}(0, k)$ places a Gaussian measure on a function space built from $k$ (sample paths lie just outside $\mathcal{H}_k$ itself, though the posterior mean does lie in it). The posterior mean — the MAP estimate after observing data — is exactly the representer theorem solution for squared loss: $\bar{f}(x) = k(x, X)(K + \sigma^2 I)^{-1}y$, where $k(x, X)$ is the row vector of kernel evaluations. Mercer's theorem connects the GP covariance to its eigenfunction expansion: the prior concentrates on functions with large projections onto eigenfunctions with large eigenvalues $\lambda_i$.

Neural Tangent Kernel Theory

Consider a network $f(x; \theta) = f_\theta(x)$ with parameters $\theta \in \mathbb{R}^P$. The neural tangent kernel is

$$k^\infty(x, x') = \mathbb{E}_{\theta \sim \text{init}}\left[\nabla_\theta f_\theta(x) \cdot \nabla_\theta f_\theta(x')\right]$$

In the limit of infinite width, two things happen simultaneously: (1) the NTK converges to a deterministic kernel $k^\infty$ that depends on the architecture but not on the specific weight initialization, and (2) the NTK stays constant during training (the "lazy training" or "kernel regime"). Under these conditions, gradient descent on the squared loss gives a trajectory that satisfies the linear ODE:

$$\dot{f}_t(x) = -\int k^\infty(x, x')\,(f_t(x') - y')\,d\mu(x')$$

where $\mu$ is the empirical measure on the training data. This is gradient flow in function space — an infinite-dimensional ODE whose solution is computable in terms of the NTK eigendecomposition.

Spectral bias / frequency principle. Decompose the training error $f_t - f^*$ in the basis of NTK eigenfunctions $\{\phi_i\}$ with eigenvalues $\lambda_1 \geq \lambda_2 \geq \cdots$. The component along $\phi_i$ decays as $e^{-\lambda_i t}$. Modes with large $\lambda_i$ (low-frequency, smooth components) decay fastest — networks learn low-frequency functions first. This explains empirical observations of generalization before memorization and connects to the implicit regularization of early stopping.
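The mode-by-mode decay can be checked numerically. The sketch below uses a smooth RBF Gram matrix as a stand-in for the NTK (an assumption — any PSD kernel with a decaying spectrum shows the same effect) and evaluates the closed-form solution of the linear flow on the training grid:

```python
import numpy as np

# Kernel gradient flow on the training grid: d f_t/dt = -K (f_t - y).
# With K = U diag(lambda) U^T, the error along eigenvector u_i decays as exp(-lambda_i t).
n = 64
X = np.linspace(0.0, 1.0, n)
K = np.exp(-0.5 * ((X[:, None] - X[None, :]) / 0.2) ** 2)  # smooth stand-in kernel
lam, U = np.linalg.eigh(K)                                 # eigenvalues, ascending order

y_low = np.sin(2 * np.pi * X)                  # smooth component of the target
y_high = 0.3 * np.sin(20 * np.pi * X)          # oscillatory component
y = y_low + y_high

c0 = U.T @ y                                   # error coordinates at f_0 = 0
t = 0.5
f_t = y - U @ (np.exp(-lam * t) * c0)          # exact solution of the linear flow

# At t = 0.5 the smooth component is essentially fit; the oscillatory one is barely touched.
print(np.linalg.norm(f_t - y_low), np.linalg.norm(f_t - y))
```

The residual against the smooth component is already small while the full residual is still dominated by the high-frequency part — the spectral bias in miniature.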

The spectral bias is not a property of gradient descent in weight space — it is a property of gradient descent in function space. The NTK eigendecomposition shows that gradient descent implicitly projects the target onto the RKHS defined by $k^\infty$, weighting each eigenfunction by its eigenvalue. The kernel had to be defined as $\nabla_\theta f(x) \cdot \nabla_\theta f(x')$ because this inner product is exactly what a gradient step computes: the change in $f(x)$ from a step at training point $x'$ is proportional to how correlated the parameter gradients at $x$ and $x'$ are — which is $k^\infty(x, x')$.

Gradient Flows in Function Space

The NTK describes gradient descent as a flow in function space. More generally, given a divergence or energy functional $\mathcal{F}[p]$ over probability distributions $p$, the Wasserstein gradient flow is the flow that decreases $\mathcal{F}$ most rapidly in the geometry of the Wasserstein-2 metric on probability measures:

$$\partial_t p_t = \nabla \cdot \left(p_t \nabla \frac{\delta \mathcal{F}}{\delta p_t}\right)$$

Training a generative model by minimizing KL divergence or the Wasserstein distance is a gradient flow in the space of probability measures $(\mathcal{P}_2(\mathbb{R}^d), W_2)$. The functional derivative $\delta\mathcal{F}/\delta p$ is the infinite-dimensional analogue of a gradient — evaluated at a point $x$, it gives the direction of steepest descent for the "particle" at $x$.

Stein Variational Gradient Descent (SVGD). Rather than flowing a continuous density, SVGD transports a set of $n$ particles $\{x_i\}$ that approximate $p$. The update is:

$$x_i \leftarrow x_i + \epsilon\,\phi^*(x_i), \quad \phi^*(x) = \frac{1}{n}\sum_{j=1}^n \left[k(x_j, x)\nabla_{x_j}\log p(x_j) + \nabla_{x_j}k(x_j, x)\right]$$

where $k$ is a kernel and $\phi^*$ is the optimal update direction in the unit ball of the RKHS, obtained by applying the kernelized Stein operator to the empirical measure. The first term pushes particles toward high-probability regions; the second term is a repulsion force (via the kernel gradient) that prevents collapse. SVGD is a deterministic particle transport method derived from minimizing the Kernelized Stein Discrepancy (an RKHS-based divergence).
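A minimal 1-D SVGD sketch for a standard-normal target. The RBF kernel and median-heuristic bandwidth are conventional choices, not mandated by the formula above:

```python
import numpy as np

def svgd_step(x, grad_logp, eps=0.2):
    """One SVGD update for 1-D particles with an RBF kernel."""
    n = x.shape[0]
    d = x[:, None] - x[None, :]                 # d[j, i] = x_j - x_i
    h = np.median(d**2) / np.log(n) + 1e-12     # median-heuristic bandwidth (assumption)
    Kmat = np.exp(-d**2 / h)                    # k(x_j, x_i)
    grad_k = -2.0 * d / h * Kmat                # d k(x_j, x_i) / d x_j  (repulsion)
    phi = (Kmat @ grad_logp(x) + grad_k.sum(axis=0)) / n
    return x + eps * phi

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=0.5, size=50)     # particles start off-target
for _ in range(2000):
    x = svgd_step(x, lambda z: -z)              # target N(0, 1): grad log p(z) = -z
print(x.mean(), x.std())                        # particles drift toward and spread over N(0, 1)
```

Without the `grad_k` repulsion term, all particles would collapse onto the mode at 0; the kernel gradient is what keeps the empirical measure spread out.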

Worked Example

NTK for a Single-Layer Network in the Lazy Training Regime

Consider $f(x; \theta) = \frac{1}{\sqrt{m}}\sum_{j=1}^m a_j\,\sigma(w_j^\top x + b_j)$ where only the input weights $W = \{w_j\}$ are trained (the output weights $a_j$ and biases $b_j$ are fixed at initialization). The gradient with respect to $w_j$ is $\nabla_{w_j} f = \frac{a_j}{\sqrt{m}}\sigma'(w_j^\top x + b_j)\,x$. The NTK is:

$$\begin{aligned} k_\theta(x, x') &= \nabla_\theta f(x;\theta)^\top \nabla_\theta f(x';\theta) \\ &= \frac{1}{m}\sum_{j=1}^m a_j^2\,\sigma'(w_j^\top x + b_j)\,\sigma'(w_j^\top x' + b_j)\,(x^\top x') \end{aligned}$$

As $m \to \infty$, by the law of large numbers this converges to the deterministic limit:

$$k^\infty(x, x') = \mathbb{E}_{a,w,b}\left[a^2\,\sigma'(w^\top x + b)\,\sigma'(w^\top x' + b)\right] \cdot (x^\top x')$$

Crucially, this kernel is data-dependent (through $x, x'$) but, in the limit, not dependent on the particular weight draw — it is determined entirely by the architecture and the initialization distribution. In the lazy training regime, the kernel stays approximately constant during training, so gradient descent reduces to kernel regression with $k^\infty$.
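The law-of-large-numbers convergence is easy to check numerically. The sketch below takes $\sigma = \mathrm{ReLU}$ (so $\sigma'(u) = \mathbf{1}[u > 0]$) and $a_j \in \{\pm 1\}$ — both illustrative choices — for scalar inputs:

```python
import numpy as np

# Empirical NTK of f(x) = (1/sqrt m) sum_j a_j relu(w_j x + b_j) with only W trained:
# k_theta(x, x') = (1/m) sum_j a_j^2 1[w_j x + b_j > 0] 1[w_j x' + b_j > 0] (x x')
def empirical_ntk(x, xp, m, rng):
    a = rng.choice([-1.0, 1.0], size=m)               # fixed output weights, a_j^2 = 1
    w = rng.standard_normal(m)
    b = rng.standard_normal(m)
    both_active = (w * x + b > 0) & (w * xp + b > 0)  # relu'(u) = 1[u > 0]
    return (a**2 * both_active).mean() * (x * xp)

rng = np.random.default_rng(0)
for m in (100, 10_000, 1_000_000):
    print(m, empirical_ntk(1.0, 0.7, m, rng))
# fluctuations across m shrink as 1/sqrt(m): the kernel becomes deterministic
```

For Gaussian $w, b$ the expectation is the probability that both pre-activations are positive, which depends only on the angle between the augmented inputs — so the limit is fixed by the architecture, not the draw.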

Connecting GP Posterior to Operator Projection

For a GP with kernel $k$ and observations $(X, y)$ with noise variance $\sigma^2$, the posterior mean is

$$\bar{f}(x) = k(x, X)(K_{XX} + \sigma^2 I)^{-1}y = \sum_{i=1}^n \alpha_i k(x, x_i)$$

where $\alpha = (K_{XX} + \sigma^2 I)^{-1}y$. In Mercer's basis $\{(\lambda_i, \phi_i)\}$, this is a projection: each mode $\phi_i$ is retained with weight $\lambda_i / (\lambda_i + \sigma^2)$ — a Wiener filter. Modes with $\lambda_i \gg \sigma^2$ are passed through; modes with $\lambda_i \ll \sigma^2$ are shrunk toward zero.
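The Wiener-filter view can be verified exactly at the training points, where the Mercer expansion reduces to the eigendecomposition of the Gram matrix $K = U\Lambda U^\top$ (the kernel and noise level below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.linspace(-2, 2, 20)
K = np.exp(-0.5 * (X[:, None] - X[None, :]) ** 2)   # RBF Gram matrix
y = np.sin(X) + 0.1 * rng.standard_normal(20)
sigma2 = 0.05

# Direct posterior mean at the training points: K (K + sigma^2 I)^{-1} y
mean_direct = K @ np.linalg.solve(K + sigma2 * np.eye(20), y)

# Same computation as a Wiener filter in the eigenbasis of K:
# each mode is shrunk by lambda_i / (lambda_i + sigma^2)
lam, U = np.linalg.eigh(K)
mean_filter = U @ (lam / (lam + sigma2) * (U.T @ y))

print(np.allclose(mean_direct, mean_filter))   # True
```

The agreement is an exact linear-algebra identity, $K(K + \sigma^2 I)^{-1} = U\,\Lambda(\Lambda + \sigma^2 I)^{-1}U^\top$, which is the finite-sample shadow of the Mercer-basis shrinkage.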

Connections

Where Your Intuition Breaks

The NTK analysis shows that infinitely wide networks are equivalent to kernel regression — which might suggest it gives a complete theory of deep learning. It does not. Real networks of practical width operate in the "feature learning" regime: their NTK changes throughout training as the network learns useful internal representations, violating the frozen-kernel assumption. The kernel regime and the feature learning regime make opposite predictions about the role of architecture: in the kernel regime, only the initial NTK determines generalization; in the feature learning regime, the representation learned during training matters far more. Empirically, networks operating in the feature learning regime generalize substantially better than kernel regression with the corresponding initial NTK — the theoretical clean room and the practical regime are different worlds, and intuition from one does not transfer to the other.

💡Intuition

The NTK regime is theoretically clean because it linearizes training dynamics, turning a nonlinear optimization into a linear ODE whose solution is governed by eigenvalue decay. The spectral bias follows directly: modes with large NTK eigenvalues (smooth, low-frequency functions) converge exponentially faster than high-frequency modes. This is why gradient descent on overparameterized networks first fits the smooth signal before fitting noise — it is a consequence of the functional-analytic structure, not a property specific to any particular architecture.

💡Intuition

The Wasserstein gradient flow connects deep learning optimization to optimal transport theory. Training a neural network by minimizing a divergence between the model distribution and the data distribution (as in variational inference or GANs) is a gradient flow in the space of probability measures $(\mathcal{P}_2, W_2)$. The Wasserstein metric gives this space a Riemannian structure, and the flow follows the Riemannian gradient — a functional derivative that generalizes the Euclidean gradient to the space of distributions.

⚠️Warning

The NTK regime is theoretically tractable but practically limited. Real networks of finite width operate in the "feature learning" regime: the NTK changes substantially during training as the network learns internal representations. NTK theory predicts that overparameterized networks can interpolate training data (correct) but also that they generalize no better than kernel regression with the initial NTK (often incorrect for large networks). The gap between NTK theory and empirical behavior — particularly for networks that learn useful features from data — is an active area of research.

⚠️Warning

SVGD requires computing all $n^2$ pairwise kernel evaluations $k(x_i, x_j)$ and their gradients at each iteration, giving $O(n^2 d)$ cost per step. This limits practical SVGD to particle counts of hundreds to a few thousand. For high-dimensional distributions (e.g., posterior inference over neural network weights), naive SVGD is computationally infeasible. Approximations include random feature expansions of the kernel (reducing the cost to $O(nD)$ for $D$ random features) and structured kernels that exploit problem geometry — active areas in scalable Bayesian inference.
