
Bridge: Spectral Clustering, GNN Expressivity & the Weisfeiler-Leman Hierarchy

Spectral clustering uses graph Laplacian eigenvectors to partition nodes; GNN expressivity is bounded by the Weisfeiler-Leman graph isomorphism test; understanding these limits shapes architecture design. The graph-theoretic foundations of this module now connect directly to the design and limitations of learned models on graph-structured data.

Concepts

Graph Laplacian $L = D - A$. The Fiedler value λ₂ (second smallest eigenvalue) measures algebraic connectivity: λ₂ = 0 iff the graph is disconnected.

[Interactive figure: a 5-vertex example graph with vertex degrees $d = (2, 3, 2, 2, 3)$ and Laplacian eigenvalues $\lambda_1 = 0.00$, $\lambda_2 = 0.73$, $\lambda_3 = 2.00$, $\lambda_4 = 3.27$, $\lambda_5 = 4.00$. Since $\lambda_2 = 0.73 > 0$, the graph is connected (and reasonably well connected).]
L is always positive semi-definite (all λ ≥ 0) with λ₁ = 0. Number of zero eigenvalues = number of connected components.
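These spectral facts are easy to check numerically; here is a minimal numpy sketch using two illustrative 4-vertex graphs (not the figure's example):

```python
import numpy as np

def laplacian_spectrum(A):
    """Eigenvalues of the combinatorial Laplacian L = D - A, sorted ascending."""
    D = np.diag(A.sum(axis=1))
    return np.linalg.eigvalsh(D - A)

# Connected path graph 0-1-2-3.
A_path = np.array([[0,1,0,0],
                   [1,0,1,0],
                   [0,1,0,1],
                   [0,0,1,0]], dtype=float)

# Two disjoint edges: 0-1 and 2-3 (two connected components).
A_split = np.array([[0,1,0,0],
                    [1,0,0,0],
                    [0,0,0,1],
                    [0,0,1,0]], dtype=float)

ev_path  = laplacian_spectrum(A_path)
ev_split = laplacian_spectrum(A_split)

print("path  spectrum:", np.round(ev_path, 3))   # lambda_2 > 0: connected
print("split spectrum:", np.round(ev_split, 3))  # lambda_2 = 0: disconnected

# Number of (numerically) zero eigenvalues = number of connected components.
assert np.sum(np.abs(ev_path)  < 1e-9) == 1
assert np.sum(np.abs(ev_split) < 1e-9) == 2
```

For the path, $\lambda_2 = 2 - \sqrt{2} \approx 0.586 > 0$; for the two disjoint edges, $\lambda_2 = 0$ as the connectivity criterion predicts.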

Message passing neural networks, spectral graph convolution, and the WL isomorphism test look like three separate topics — but they answer the same question: what does a neural network on a graph compute, and what are its fundamental limits? The answer comes from graph theory and linear algebra, not from neural network theory.

Message Passing Neural Networks

A message passing neural network (MPNN) computes vertex representations through $L$ rounds of neighborhood aggregation. At each layer $l$:

$$h_v^{(l+1)} = \text{UPDATE}\!\left(h_v^{(l)},\ \text{AGGREGATE}\!\left(\{h_u^{(l)} : u \in N(v)\}\right)\right)$$

After $L$ layers, each vertex representation $h_v^{(L)}$ depends on the $L$-hop neighborhood of $v$. The AGGREGATE function must be permutation invariant (outputs the same regardless of neighbor ordering); natural choices include sum, mean, and max. The UPDATE function can be any learnable function (an MLP or a GRU).

The key theoretical property is permutation equivariance: if vertices are relabeled by a permutation $\pi$, the output representations are relabeled by the same $\pi$. This means the MPNN respects the symmetry of graphs: unlabeled graphs with the same structure produce the same representations.

Permutation equivariance is not a design choice — it is a requirement. A neural network on a graph must produce the same output for isomorphic graphs (same structure, different vertex labels). If it did not, two vertices with identical neighborhoods would receive different representations based on their arbitrary numerical indices, not on their structural role. This forces AGGREGATE to be permutation invariant: the output must depend on the multiset of neighbor representations, not on any ordering of them.
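The layer above can be sketched in a few lines of numpy. As a simplification, UPDATE here is a single linear map plus tanh rather than a full MLP, and AGGREGATE is a sum; the final check verifies permutation equivariance:

```python
import numpy as np

rng = np.random.default_rng(0)

def mpnn_layer(A, H, W_self, W_agg):
    """One message-passing round: sum aggregation, linear UPDATE with tanh."""
    M = A @ H                          # sum over neighbor features
    return np.tanh(H @ W_self + M @ W_agg)

# Toy graph: a triangle plus a pendant vertex.
A = np.array([[0,1,1,0],
              [1,0,1,0],
              [1,1,0,1],
              [0,0,1,0]], dtype=float)
H = rng.normal(size=(4, 3))
W_self = rng.normal(size=(3, 3))
W_agg  = rng.normal(size=(3, 3))

out = mpnn_layer(A, H, W_self, W_agg)

# Permutation equivariance: relabeling vertices permutes the outputs.
perm = np.array([2, 0, 3, 1])
P = np.eye(4)[perm]
out_perm = mpnn_layer(P @ A @ P.T, P @ H, W_self, W_agg)
assert np.allclose(out_perm, P @ out)
```

The assertion holds for any weights: sum aggregation commutes with relabeling, which is exactly the equivariance argument in the text.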

Spectral Graph Convolution and GCN

Graph Fourier transform: represent a graph signal $x \in \mathbb{R}^n$ (a scalar per vertex) in the eigenbasis of the normalized Laplacian $L_{\text{sym}} = U \Lambda U^T$: $\hat{x} = U^T x$. A spectral graph convolution applies a filter $g_\theta(\Lambda)$ in this basis:

$$y = U\, g_\theta(\Lambda)\, U^T x$$

This is exact but requires $O(n^2)$ computation. ChebConv (Defferrard et al., 2016) approximates $g_\theta(\Lambda) \approx \sum_{k=0}^{K} \theta_k T_k(\tilde{\Lambda})$ using Chebyshev polynomials $T_k$ of the rescaled eigenvalues $\tilde{\Lambda} = 2\Lambda/\lambda_{\max} - I$, reducing the cost to $K$ matrix-vector products: $O(Km)$ total for a graph with $m$ edges.
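The key property ChebConv exploits, that a polynomial filter never needs the eigendecomposition because $U\, g(\Lambda)\, U^T = g(L_{\text{sym}})$, can be verified directly (hypothetical small graph; the quadratic filter is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)

# Normalized Laplacian of a small illustrative graph.
A = np.array([[0,1,1,0,0],
              [1,0,1,1,0],
              [1,1,0,0,1],
              [0,1,0,0,1],
              [0,0,1,1,0]], dtype=float)
d = A.sum(axis=1)
L = np.eye(5) - A / np.sqrt(np.outer(d, d))   # L_sym = I - D^{-1/2} A D^{-1/2}

lam, U = np.linalg.eigh(L)
x = rng.normal(size=5)                         # a graph signal

# Exact spectral convolution with polynomial filter g(lam) = 1 - lam + 0.5 lam^2.
g = lambda lam: 1 - lam + 0.5 * lam**2
y_spectral = U @ (g(lam) * (U.T @ x))

# The same filter as a polynomial in L: only matrix-vector products, no U.
y_poly = x - L @ x + 0.5 * L @ (L @ x)

assert np.allclose(y_spectral, y_poly)
```

Because a degree-$K$ polynomial of $L_{\text{sym}}$ only mixes $K$-hop neighborhoods, this also makes the filter spatially localized.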

GCN (Kipf and Welling, 2017) takes the first-order approximation ($K = 1$), assumes $\lambda_{\max} \approx 2$, and applies the renormalization trick to avoid vanishing/exploding features:

$$H^{(l+1)} = \sigma\!\left(\tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2} H^{(l)} W^{(l)}\right)$$

where $\tilde{A} = A + I$ (self-loops added) and $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$.
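A single GCN layer is a few lines of numpy (a sketch only; the `gcn_layer` name and the choice of ReLU for $\sigma$ are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

def gcn_layer(A, H, W):
    """One GCN layer: ReLU(D~^{-1/2} A~ D~^{-1/2} H W)."""
    A_t = A + np.eye(A.shape[0])              # add self-loops: A~ = A + I
    d_t = A_t.sum(axis=1)                     # D~_ii
    P = A_t / np.sqrt(np.outer(d_t, d_t))     # D~^{-1/2} A~ D~^{-1/2}
    return np.maximum(P @ H @ W, 0.0)

# 4-vertex path graph with 2-d features.
A = np.array([[0,1,0,0],
              [1,0,1,0],
              [0,1,0,1],
              [0,0,1,0]], dtype=float)
H = rng.normal(size=(4, 2))
W = rng.normal(size=(2, 2))

out = gcn_layer(A, H, W)
print(out.shape)  # (4, 2)
```

Note that the elementwise division by $\sqrt{\tilde{d}_v \tilde{d}_u}$ is the dense-matrix form of $\tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}$; real implementations use sparse operations.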

Over-Smoothing and Over-Squashing

Over-smoothing: as the number of GCN layers $L \to \infty$, all vertex representations converge to the same value (the stationary distribution of the random walk). This is because $\tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}$ acts like a low-pass filter; repeated application erases high-frequency (locally distinctive) signal. In practice, 2–4 layers is optimal for most tasks.
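Over-smoothing is easy to demonstrate: apply the propagation matrix repeatedly with no weights or nonlinearity and watch the degree-normalized features collapse (illustrative 4-vertex path):

```python
import numpy as np

# Propagation matrix of the 4-vertex path with self-loops.
A = np.array([[0,1,0,0],
              [1,0,1,0],
              [0,1,0,1],
              [0,0,1,0]], dtype=float)
A_t = A + np.eye(4)
d_t = A_t.sum(axis=1)
P = A_t / np.sqrt(np.outer(d_t, d_t))    # D~^{-1/2} A~ D~^{-1/2}

H = np.random.default_rng(3).normal(size=(4, 2))
for _ in range(300):
    H = P @ H                             # propagation only, no W, no sigma

# Rows converge to a multiple of sqrt(d~_v): after degree normalization,
# every vertex carries the same feature vector.
H_norm = H / np.sqrt(d_t)[:, None]
spread = H_norm.max(axis=0) - H_norm.min(axis=0)
assert np.all(spread < 1e-6)
```

All eigenvalues of $P$ other than the top one have magnitude below 1, so every locally distinctive component decays geometrically with depth; only the degree-weighted constant survives.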

Over-squashing: information from exponentially many vertices must be compressed into a fixed-size vector. A vertex $v$ after $L$ layers aggregates information from all vertices within $L$ hops (for expander graphs, this is $\Omega(n)$ vertices) through a bottleneck of $O(1)$ width per edge. The Jacobian $\partial h_v^{(L)} / \partial h_u^{(0)}$ for distant vertices decays exponentially in the number of hops, making long-range dependencies hard to learn. Graph rewiring (adding shortcut edges via commute-time spanning trees or diffusion-based methods) mitigates over-squashing.
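The decay is visible even in the linear case, where the influence of $h_u^{(0)}$ on $h_v^{(L)}$ scales with $(P^L)_{vu}$ (a sketch on a 20-vertex path; the factor-of-10 threshold in the final check is arbitrary):

```python
import numpy as np

# 20-vertex path graph 0-1-...-19 with self-loops.
n = 20
A = np.zeros((n, n))
for i in range(n - 1):
    A[i, i + 1] = A[i + 1, i] = 1.0
A_t = A + np.eye(n)
d_t = A_t.sum(axis=1)
P = A_t / np.sqrt(np.outer(d_t, d_t))

PL = np.linalg.matrix_power(P, 10)        # 10 propagation steps
print(f"influence of v0 on v5:  {PL[5, 0]:.2e}")
print(f"influence of v0 on v10: {PL[10, 0]:.2e}")

# The vertex 10 hops away receives far less signal than the one 5 hops away.
assert 0 < PL[10, 0] < PL[5, 0] / 10
```

A vertex exactly 10 hops from the source is reached by a single walk of length 10, so its entry of $P^{10}$ is a product of 10 small edge weights; closer vertices accumulate many walks.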

1-WL Expressivity Bound

The Weisfeiler-Leman (1-WL) test iteratively refines a coloring of vertices: $c_v^{(t+1)} = \text{hash}\!\left(c_v^{(t)}, \{c_u^{(t)} : u \in N(v)\}\right)$, where the braces denote a multiset. If two graphs produce different color histograms at any round, they are non-isomorphic.

Theorem (Xu et al., 2019): Any MPNN is at most as powerful as the 1-WL test (no MPNN can distinguish graphs that 1-WL cannot), and an MPNN with injective AGGREGATE and UPDATE functions (different multisets of inputs produce different outputs) matches the 1-WL bound exactly. The choice of AGGREGATE matters: sum is strictly more expressive than mean or max because it preserves multiset structure (mean conflates different-sized neighborhoods with equal averages; max ignores multiplicity). Graph Isomorphism Networks (GIN) achieve the 1-WL upper bound by using sum aggregation and an MLP:

$$h_v^{(l+1)} = \text{MLP}\!\left((1+\varepsilon)\,h_v^{(l)} + \sum_{u \in N(v)} h_u^{(l)}\right)$$

To go beyond 1-WL, higher-order WL tests (2-WL, 3-WL) distinguish more graph pairs, but $k$-WL costs $O(n^k)$. Random features, distance encodings, and structural positional encodings are practical alternatives.
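The sum-versus-mean-versus-max distinction comes down to multisets, and can be checked on a pair of toy neighborhoods:

```python
import numpy as np

# Two neighbor multisets with the same mean and max but different sizes:
# {1.0, 2.0} vs {1.0, 1.0, 2.0, 2.0}.
a = np.array([1.0, 2.0])
b = np.array([1.0, 1.0, 2.0, 2.0])

assert a.mean() == b.mean()   # mean conflates them (both 1.5)
assert a.max() == b.max()     # max conflates them (both 2.0)
assert a.sum() != b.sum()     # sum separates them (3.0 vs 6.0)
```

This is why GIN uses sum aggregation: only the sum (with suitable features) injectively encodes the neighbor multiset.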

Random Graph Models for Graph Generation

Generative models for graphs must produce valid, diverse graph structures — a combinatorially challenging problem. GraphRNN autoregressively generates the adjacency matrix row by row using an RNN; each step generates one row of the adjacency matrix conditioned on all previous rows. The challenge is that graph structure is order-invariant but sequence generation is order-dependent — canonical orderings (BFS, DFS) are used to break this symmetry.
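A BFS canonical ordering is straightforward to compute (a sketch; `bfs_order` and the example graph are illustrative, not GraphRNN's actual implementation):

```python
from collections import deque
import numpy as np

def bfs_order(A, start=0):
    """Canonical vertex ordering by breadth-first search from `start`."""
    seen, order, q = {start}, [start], deque([start])
    while q:
        v = q.popleft()
        for u in np.flatnonzero(A[v]):
            if u not in seen:
                seen.add(int(u)); order.append(int(u)); q.append(int(u))
    return order

# Hypothetical 5-vertex graph: edges 0-2, 0-4, 1-2, 3-4.
A = np.array([[0,0,1,0,1],
              [0,0,1,0,0],
              [1,1,0,0,0],
              [0,0,0,0,1],
              [1,0,0,1,0]], dtype=float)

order = bfs_order(A)
print(order)                  # BFS visits 0, then 2 and 4, then 1 and 3
P = np.eye(5)[order]
A_bfs = P @ A @ P.T           # adjacency permuted into BFS order; GraphRNN
                              # generates rows of such a matrix one at a time
```

Fixing a BFS ordering also bounds how far back each new row can have nonzero entries, which is what makes autoregressive row-by-row generation tractable.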

Diffusion-based graph generation defines a forward process that adds noise to the graph (randomly adds/removes edges) and learns a reverse denoising process conditioned on graph-level labels. This avoids sequential generation and naturally handles variable graph sizes.

Graph structure learning infers the graph topology from node features when the graph is not observed: learn AA as a latent variable from data XRn×dX \in \mathbb{R}^{n \times d}. A differentiable approach uses the Gumbel-softmax reparameterization to sample discrete edges while allowing gradient flow through the graph sampling step.

Worked Example

Deriving GCN from spectral graph convolution. Start from spectral convolution with a filter $g_\theta$. Approximate with $K = 1$ Chebyshev terms: $g_\theta \approx \theta_0 + \theta_1(L_{\text{sym}} - I) = \theta_0 - \theta_1 D^{-1/2} A D^{-1/2}$ (using $L_{\text{sym}} = I - D^{-1/2} A D^{-1/2}$). Set $\theta_0 = -\theta_1 = \theta$ for a single parameter per layer: $g_\theta \approx \theta(I + D^{-1/2} A D^{-1/2})$. The matrix $I + D^{-1/2} A D^{-1/2}$ has eigenvalues in $[0, 2]$, so stacking layers can amplify or shrink features unstably. The renormalization trick replaces $A \to \tilde{A} = A + I$ and $D \to \tilde{D}$, giving a propagation matrix with eigenvalues in $(-1, 1]$:

$$H^{(l+1)} = \sigma\!\left(\tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2} H^{(l)} W^{(l)}\right)$$

On a small example with 4 vertices (the path $v_1$–$v_2$–$v_3$–$v_4$) and 2-d features, one GCN layer averages each vertex's features with its neighbors' (reweighted by degree), then applies a linear transformation $W$ and a nonlinearity. With self-loops, vertex $v_2$ (degree 2, so $\tilde{d} = 3$) aggregates from $v_1$, $v_2$, and $v_3$ with weights $1/\sqrt{\tilde{d}_{v_2}\tilde{d}_u}$; vertex $v_1$ ($\tilde{d} = 2$) aggregates from $v_1$ and $v_2$ only.
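The per-edge weights in this example can be computed explicitly (indices 0 through 3 stand for $v_1$ through $v_4$):

```python
import numpy as np

# Renormalized propagation weights for the path v1-v2-v3-v4:
# P_{vu} = 1 / sqrt(d~_v d~_u) for u in N(v) or u = v.
A = np.array([[0,1,0,0],
              [1,0,1,0],
              [0,1,0,1],
              [0,0,1,0]], dtype=float)
A_t = A + np.eye(4)
d_t = A_t.sum(axis=1)                     # d~ = (2, 3, 3, 2)
P = A_t / np.sqrt(np.outer(d_t, d_t))

print(np.round(P, 3))
assert np.isclose(P[0, 0], 1/2)           # v1 from itself: 1/sqrt(2*2)
assert np.isclose(P[0, 1], 1/np.sqrt(6))  # v1 from v2:     1/sqrt(2*3)
assert np.isclose(P[1, 1], 1/3)           # v2 from itself: 1/sqrt(3*3)
assert np.isclose(P[1, 2], 1/3)           # v2 from v3:     1/sqrt(3*3)
```

The weights within a row are not uniform: they depend on both endpoint degrees, so endpoint vertices contribute with weight $1/\sqrt{6}$ rather than $1/3$.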

1-WL expressiveness example. Consider two 3-regular graphs on 6 vertices: the complete bipartite graph $K_{3,3}$ and the triangular prism $C_3 \times K_2$. Both are 3-regular with 6 vertices and 9 edges. The 1-WL test assigns the same initial color to all vertices (all have degree 3); after every round, all neighborhoods contain the same multiset of colors, so the test cannot distinguish these non-isomorphic graphs. Any MPNN with the same initial features would also fail to distinguish them, regardless of depth or weight choice.
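This failure can be reproduced with a small 1-WL implementation (a sketch; `wl_signature` compares color histograms, which is how the test decides):

```python
import numpy as np
from collections import Counter

def wl_signature(A, rounds=5):
    """Color histogram after `rounds` of 1-WL refinement."""
    n = A.shape[0]
    colors = [0] * n                                   # uniform initial color
    for _ in range(rounds):
        sigs = [(colors[v],
                 tuple(sorted(colors[u] for u in np.flatnonzero(A[v]))))
                for v in range(n)]                     # (own color, neighbor multiset)
        relabel = {s: i for i, s in enumerate(sorted(set(sigs)))}
        colors = [relabel[s] for s in sigs]            # injective 'hash'
    return sorted(Counter(colors).items())

# K_{3,3}: parts {0,1,2} and {3,4,5}.
K33 = np.zeros((6, 6))
K33[np.ix_([0, 1, 2], [3, 4, 5])] = 1
K33 += K33.T

# Triangular prism: triangles 0-1-2 and 3-4-5, rungs i -- i+3.
prism = np.zeros((6, 6))
for a, b in [(0,1),(1,2),(0,2),(3,4),(4,5),(3,5),(0,3),(1,4),(2,5)]:
    prism[a, b] = prism[b, a] = 1

# Identical 1-WL signatures, yet the graphs are not isomorphic:
# the prism has 2 triangles, K_{3,3} (bipartite) has none.
assert wl_signature(K33) == wl_signature(prism)
assert np.trace(np.linalg.matrix_power(prism, 3)) / 6 == 2
assert np.trace(np.linalg.matrix_power(K33, 3)) == 0
```

The triangle counts via $\operatorname{tr}(A^3)/6$ certify non-isomorphism, while every WL round leaves all twelve vertices in a single color class.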

Connections

Where Your Intuition Breaks

The 1-WL upper bound might seem like a theoretical limit that does not affect real tasks; surely learned networks extract more structure than color refinement. But the ceiling is tight in specific ways that affect practical performance: MPNNs genuinely cannot count triangles, cannot detect cycles of a given length, and cannot distinguish the complete bipartite graph $K_{3,3}$ from the triangular prism, regardless of depth, width, or training. This is not a failure of a specific model; it is a logical consequence of local aggregation on unlabeled graphs. Practitioners often apply standard GCNs to tasks requiring structural information (molecular property prediction, social network analysis), observe poor performance, and attribute it to insufficient data or hyperparameter tuning rather than to fundamental expressiveness limits. Adding structural encodings (triangle counts, distance-to-landmarks, random walk statistics) breaks the WL ceiling at modest computational cost and is frequently the correct diagnosis.

💡Intuition

The 1-WL expressiveness bound shows that MPNN topology-based discrimination is equivalent to color refinement — a very old algorithm for graph isomorphism. This means MPNNs cannot count triangles, detect cycles of specific lengths, or distinguish regular graphs with the same degree sequence. This is not a failure of a specific architecture but a fundamental limit of the local aggregation paradigm. Structural positional encodings (random walk statistics, distance-to-landmarks) inject global structural information that breaks the 1-WL ceiling without the computational cost of higher-order WL.
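As an example of such an encoding, $k$-step random-walk return probabilities distinguish the two 3-regular graphs that 1-WL conflates (a sketch; `rw_encoding` is an illustrative name, not a standard API):

```python
import numpy as np

def rw_encoding(A, K=4):
    """Per-vertex return probabilities diag((D^{-1} A)^k) for k = 1..K."""
    P = A / A.sum(axis=1, keepdims=True)   # random-walk transition matrix
    feats, M = [], np.eye(A.shape[0])
    for _ in range(K):
        M = M @ P
        feats.append(np.diag(M).copy())
    return np.stack(feats, axis=1)          # shape (n, K)

# The two 3-regular graphs from the worked example.
K33 = np.zeros((6, 6))
K33[np.ix_([0, 1, 2], [3, 4, 5])] = 1
K33 += K33.T
prism = np.zeros((6, 6))
for a, b in [(0,1),(1,2),(0,2),(3,4),(4,5),(3,5),(0,3),(1,4),(2,5)]:
    prism[a, b] = prism[b, a] = 1

# 3-step return probability is 0 on K_{3,3} (bipartite: no odd closed walks)
# but positive on the prism (every vertex lies on a triangle).
assert np.all(rw_encoding(K33)[:, 2] == 0)
assert np.all(rw_encoding(prism)[:, 2] > 0)
```

Appending these $K$ numbers to each vertex's input features gives the two graphs different initial colorings, so any MPNN consuming them can separate the pair.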

💡Intuition

GCN is performing localized spectral filtering: each layer applies a low-pass graph filter (smoothing across neighborhoods), and stacking layers narrows the pass band further. The connection to the Laplacian from the previous lesson is precise: the renormalized GCN propagation matrix $\tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}$ is exactly $I - L_{\text{sym}}^{\tilde{G}}$ for the self-loop-augmented graph $\tilde{G}$. Each GCN layer is one step of a self-loop-augmented random walk on $\tilde{G}$, smoothing features along the graph's geometry. Depth controls the spectral bandwidth: shallow GCNs retain high-frequency, locally distinctive components; deep GCNs are aggressively low-pass and converge toward a global average.
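The identity relating the propagation matrix to the augmented-graph Laplacian is easy to verify numerically (arbitrary small graph):

```python
import numpy as np

# A small graph (a 4-cycle); add self-loops to form G~.
A = np.array([[0,1,1,0],
              [1,0,0,1],
              [1,0,0,1],
              [0,1,1,0]], dtype=float)
A_t = A + np.eye(4)
d_t = A_t.sum(axis=1)

# GCN propagation matrix.
P = A_t / np.sqrt(np.outer(d_t, d_t))            # D~^{-1/2} A~ D~^{-1/2}

# Normalized Laplacian of G~, built independently from L~ = D~ - A~.
L_t = np.diag(d_t) - A_t
L_sym = L_t / np.sqrt(np.outer(d_t, d_t))        # D~^{-1/2} L~ D~^{-1/2}

assert np.allclose(P, np.eye(4) - L_sym)         # P = I - L_sym of G~

# All eigenvalues of P lie in (-1, 1]: stable but low-pass, since every
# component with eigenvalue below 1 shrinks with each layer.
ev = np.linalg.eigvalsh(P)
assert ev.min() > -1 and ev.max() <= 1 + 1e-9
```

Filtering through $P = I - L_{\text{sym}}^{\tilde{G}}$ attenuates an eigencomponent with Laplacian eigenvalue $\lambda$ by the factor $(1-\lambda)^L$ after $L$ layers, which is the over-smoothing mechanism in spectral terms.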

⚠️Warning

Spectral GCN and related methods are transductive: the Laplacian $L$ encodes the specific graph seen at training time, and the learned filters are tied to this fixed graph structure. When a new node is added or the graph changes, the entire Laplacian changes and all eigenvectors shift. Spectral methods cannot generalize to unseen graphs or evolving graphs. Inductive methods like GraphSAGE (Hamilton et al., 2017) sample fixed-size neighborhoods and apply shared parameters; the same weights apply to any graph at inference time. For production systems handling dynamic graphs (citation networks, social graphs, transaction graphs), inductive spatial methods are necessary; spectral methods remain theoretically important but operationally limited.
