
Bridge: Dynamic Programming, Bellman Equations & Neural Architecture Search

Operations research is not merely a historical precursor to machine learning — its mathematical structures appear directly inside modern ML methods. LP relaxations drive structured prediction, network flows model optimal transport between distributions, and combinatorial optimization characterizes problems from neural architecture search to beam decoding. This lesson makes those connections precise.

Concepts

The feasible region of an LP is a convex polytope. The optimal solution always lies at a vertex. The objective function sweeps as a hyperplane — the last vertex it touches is optimal.

Optimal: (2,2) with objective value 4. Blue dashed = constraints. Green solid = optimal iso-line (c·x = 4).
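
The vertex claim can be checked with an off-the-shelf LP solver. A minimal sketch, assuming the figure's feasible region is the box $0 \le x_1, x_2 \le 2$ and the objective is $c = (1, 1)$ (a guess consistent with the stated optimum, not read off the figure):

```python
from scipy.optimize import linprog

# Assumed feasible region: the box 0 <= x1, x2 <= 2 with objective c = (1, 1),
# consistent with the stated optimum (2, 2) and value 4. linprog minimizes,
# so negate the objective to maximize.
res = linprog(c=[-1, -1], bounds=[(0, 2), (0, 2)], method="highs")
print(res.x, -res.fun)  # vertex (2, 2), objective value 4
```

The HiGHS solver returns a vertex solution, matching the geometric picture: the objective hyperplane last touches the polytope at a corner.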

SVMs, Wasserstein distances, and neural architecture search look like completely different methods — but each reduces to the same underlying structure: a linear program, a network flow, or a combinatorial optimization problem. Operations research provides not just algorithms for these, but a classification system: is the problem exactly solvable in polynomial time (LP, min-cost flow), or NP-hard with a certified approximation ratio (submodular greedy, Christofides), or NP-hard with no good approximation at all (MAP in loopy graphs)? Recognizing the OR structure in an ML problem tells you immediately what guarantees are achievable and where the hardness is fundamental.

LP Relaxations in Structured Prediction

Support vector machines. The hard-margin SVM finds the maximum-margin hyperplane separating two classes. Written as a QP (quadratic program, a close relative of LP):

$$\min_{w, b} \; \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i(w^\top x_i + b) \geq 1 \; \forall i$$

The Lagrangian dual (shown with the soft-margin box constraint $C$; the hard-margin case corresponds to $C = \infty$) is:

$$\max_\alpha \; \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^\top x_j \quad \text{s.t.} \quad 0 \leq \alpha_i \leq C, \; \sum_i \alpha_i y_i = 0$$

This is a QP in $\alpha$ with linear constraints (fix the quadratic term and it becomes an LP), and the kernel trick replaces $x_i^\top x_j$ with $k(x_i, x_j)$. The dual variables $\alpha_i$ are shadow prices on the margin constraints; complementary slackness says the non-zero $\alpha_i$ correspond exactly to support vectors (points on the margin).

The support vector interpretation is not a design choice — it is a theorem forced by duality. Complementary slackness requires $\alpha_i = 0$ whenever $y_i(w^\top x_i + b) > 1$: points strictly beyond the margin contribute nothing to the decision boundary. This is why SVMs are sparse in the training data, and why the kernel trick works: you only need $k(x_i, x_j)$ evaluated at support vectors. The sparsity is a mathematical consequence of duality, not an engineering optimization.
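
A minimal sketch of how duality pins down this structure, on a hand-worked two-point problem (the data, and the brute-force maximization of the dual, are illustrative choices, not part of the lesson):

```python
# Hand-worked 1-D hard-margin SVM: x = [-1, +1] with labels y = [-1, +1].
# The constraint sum_i alpha_i y_i = 0 forces alpha_1 = alpha_2 = a, and the
# dual objective collapses to 2a - 2a^2, maximized at a = 0.5 (grid search here).
X = [-1.0, 1.0]
y = [-1, 1]
a = max((k / 1000 for k in range(1001)), key=lambda t: 2 * t - 2 * t * t)

# Primal recovery: w = sum_i alpha_i y_i x_i; b from any tight margin constraint.
w = sum(a * yi * xi for yi, xi in zip(y, X))
b = y[0] - w * X[0]  # y_i (w x_i + b) = 1 with y_i in {-1, +1} gives b = y_i - w x_i

# Complementary slackness: both alphas are non-zero, so both margin constraints
# are tight, i.e. both points are support vectors.
margins = [yi * (w * xi + b) for yi, xi in zip(y, X)]
print(a, w, b, margins)  # 0.5 1.0 0.0 [1.0, 1.0]
```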

MAP inference in CRFs. A conditional random field defines a distribution $p(y|x) \propto \exp(\sum_c \theta_c \phi_c(y_c, x))$ over structured outputs $y$. Finding the MAP estimate $\hat{y} = \arg\max_y \sum_c \theta_c \phi_c(y_c, x)$ is an ILP: there are exponentially many possible $y$, but the LP relaxation over the marginal polytope gives a tractable upper bound on the maximum. When the factor graph is a tree, the LP relaxation is tight (its vertices are integral), so belief propagation solves MAP exactly. For loopy graphs, LP relaxation plus rounding gives an approximate MAP with a certificate of the optimality gap.
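
On a chain, where the relaxation is tight, exact MAP reduces to dynamic programming. A minimal Viterbi (max-product) sketch with made-up potentials:

```python
import numpy as np

def viterbi(unary, pairwise):
    """Exact MAP on a chain: maximize sum_t unary[t, y_t] + sum_t pairwise[y_t, y_{t+1}].
    On a chain the marginal-polytope LP is tight, so this DP recovers the ILP optimum.
    """
    T, K = unary.shape
    score = unary[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        # cand[i, j] = best score ending with label i at t-1 followed by label j at t
        cand = score[:, None] + pairwise + unary[t][None, :]
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1], float(score.max())

# Toy 3-step, 2-label chain with made-up potentials.
unary = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
pairwise = np.array([[0.6, 0.0], [0.0, 0.6]])  # rewards keeping the same label
path, val = viterbi(unary, pairwise)
print(path, val)  # [0, 0, 0] 3.2
```

Here the pairwise smoothness reward outweighs the middle step's unary preference for label 1, so the MAP path stays constant.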

Network Flows and Optimal Transport

Wasserstein distance. Given two discrete distributions $\mu = \sum_i a_i \delta_{x_i}$ and $\nu = \sum_j b_j \delta_{y_j}$ with $a_i, b_j \geq 0$, $\sum_i a_i = \sum_j b_j = 1$, the 1-Wasserstein (Earth Mover's) distance is:

$$W_1(\mu, \nu) = \min_{\gamma \in \Pi(\mu,\nu)} \sum_{i,j} \gamma_{ij} d(x_i, y_j)$$

where $\Pi(\mu,\nu)$ is the set of transport plans $\gamma \geq 0$ with marginals $\sum_j \gamma_{ij} = a_i$ and $\sum_i \gamma_{ij} = b_j$. This is exactly a minimum-cost flow problem on a bipartite graph: source nodes $i$ with supply $a_i$, sink nodes $j$ with demand $b_j$, edge costs $d(x_i, y_j)$. The totally unimodular constraint structure guarantees an optimal plan at a vertex of the transportation polytope — integral when the marginals are integers, and with the marginals' common denominator when the $a_i, b_j$ are rational.

LP relaxations are the computational engine for both MAP inference and optimal transport. The feasible polytope structure — whether a marginal polytope or a transportation polytope — determines the tightness of the relaxation and the quality of the approximation.

Kantorovich-Rubinstein duality (derived in detail in the Walkthrough) states that $W_1$ is also the supremum of $\mathbb{E}_\mu[f] - \mathbb{E}_\nu[f]$ over all 1-Lipschitz functions $f$. This is LP duality in function space and is the mathematical foundation of the Wasserstein GAN critic's objective.

Attention as soft assignment. The scaled dot-product attention computation $\text{Attention}(Q,K,V) = \text{softmax}(QK^\top / \sqrt{d})V$ can be interpreted as a row-stochastic soft transport plan between query positions and key positions. The softmax normalizes each row (ensuring $\sum_j \gamma_{ij} = 1$), making it a relaxed version of the assignment problem. Sinkhorn's algorithm (iterated row/column normalization) solves the entropy-regularized optimal transport LP and produces the unique doubly-stochastic plan that is optimal under entropic regularization.
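
A bare-bones Sinkhorn sketch (the weights, cost matrix, and regularization strength eps are illustrative choices; production implementations work in log space for numerical stability):

```python
import numpy as np

def sinkhorn(a, b, C, eps=0.1, iters=200):
    """Entropy-regularized OT: alternately rescale rows and columns of
    K = exp(-C / eps) until the plan's marginals match a and b."""
    K = np.exp(-C / eps)
    u = np.ones_like(a)
    for _ in range(iters):
        v = b / (K.T @ u)  # column scaling
        u = a / (K @ v)    # row scaling
    return u[:, None] * K * v[None, :]  # plan = diag(u) K diag(v)

a = np.array([0.5, 0.5])                 # source weights (illustrative)
b = np.array([0.4, 0.6])                 # target weights
C = np.array([[1.0, 3.0], [1.0, 1.0]])   # pairwise costs
P = sinkhorn(a, b, C)
print(P.sum(axis=1), P.sum(axis=0))      # marginals match a and b
```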

Combinatorial Optimization in ML

Neural architecture search (NAS). The architecture search space — number of layers, kernel sizes, skip connections — forms a discrete combinatorial space of exponential size. DARTS relaxes this to a continuous differentiable problem by replacing the discrete choice of operation $o$ at each edge with a weighted mixture $\sum_o \text{softmax}(\alpha)_o \cdot o(x)$. This is the continuous-relaxation analogue of an LP relaxation of the combinatorial architecture-selection problem. After training $\alpha$, the architecture is recovered by taking $\arg\max_o \alpha_o$ — the rounding step.
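
A toy sketch of the relax-then-round pattern; the three candidate "operations" are stand-ins for real network modules, and the $\alpha$ values are invented rather than learned:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Three candidate operations on one edge: toy stand-ins for skip / zero / conv.
ops = [lambda x: x, lambda x: 0.0 * x, lambda x: 2.0 * x]

alpha = np.array([0.1, -2.0, 1.5])  # architecture parameters (learned in DARTS)
w = softmax(alpha)

def mixed(x):
    # Continuous relaxation: softmax-weighted mixture of all candidate ops.
    return sum(wi * op(x) for wi, op in zip(w, ops))

# Rounding step: discretize by keeping only the argmax operation.
chosen = ops[int(np.argmax(alpha))]
print(mixed(1.0), chosen(1.0))
```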

Beam search. Autoregressive decoding maintains a beam of the $k$ highest-probability partial sequences, expanding each at every step and pruning back to the top $k$. This is a greedy approximation to the combinatorial problem of finding the highest-probability complete sequence — an NP-hard problem on the sequence lattice. Beam search has no provable approximation guarantee, but works well in practice because the sequence probability distribution is heavily concentrated.
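
A minimal beam-search sketch. For simplicity the toy model's per-step token log-probabilities are assumed independent of the prefix, unlike a real autoregressive decoder:

```python
import math

def beam_search(step_logprobs, k=2):
    """Keep the k best partial sequences; expand and prune at each step."""
    beams = [((), 0.0)]  # (sequence, cumulative log-probability)
    for logp in step_logprobs:
        expanded = [(seq + (v,), score + lp)
                    for seq, score in beams
                    for v, lp in enumerate(logp)]
        beams = sorted(expanded, key=lambda b: -b[1])[:k]
    return beams

# Toy per-step token log-probabilities over a 2-token vocabulary.
steps = [[math.log(0.7), math.log(0.3)],
         [math.log(0.6), math.log(0.4)],
         [math.log(0.2), math.log(0.8)]]
best, score = beam_search(steps, k=2)[0]
print(best, round(math.exp(score), 3))  # (0, 0, 1) 0.336
```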

Combinatorial bandits. In the combinatorial multi-armed bandit problem, the learner selects a subset $S$ from a combinatorial action set (e.g., a matching or a spanning tree) at each round and observes rewards. Regret bounds depend on the geometric structure of the action set, and LP-based algorithms achieve $O(\sqrt{T \log n})$ regret for matroid-constrained actions.

Submodularity: Convexity for Combinatorics

A set function $f : 2^V \to \mathbb{R}$ is submodular if for all $A \subseteq B$ and $v \notin B$:

$$f(A \cup \{v\}) - f(A) \geq f(B \cup \{v\}) - f(B)$$

This is the diminishing-returns property: adding element $v$ to a smaller set is at least as valuable as adding it to a larger set. Submodularity is the discrete analogue of concavity.
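
A quick numerical check of the definition on a small coverage function (the ground set and covered elements are made up):

```python
from itertools import chain, combinations

# A canonical submodular function: coverage. f(S) = number of elements
# covered by the union of the sets indexed by S.
cover = {"a": {1, 2}, "b": {2, 3}, "c": {3, 4, 5}}

def f(S):
    covered = set()
    for v in S:
        covered |= cover[v]
    return len(covered)

V = list(cover)
subsets = [set(c) for c in
           chain.from_iterable(combinations(V, r) for r in range(len(V) + 1))]

# Diminishing returns: the gain from adding v to A is at least the gain
# from adding v to any superset B of A.
ok = all(
    f(A | {v}) - f(A) >= f(B | {v}) - f(B)
    for A in subsets for B in subsets if A <= B
    for v in set(V) - B
)
print(ok)  # True
```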

Greedy maximization. For a monotone submodular function $f$ (where $A \subseteq B \Rightarrow f(A) \leq f(B)$) subject to a cardinality constraint $|S| \leq k$, the greedy algorithm — iteratively adding the element with the largest marginal gain — achieves:

$$f(S_{\text{greedy}}) \geq \left(1 - \frac{1}{e}\right) \text{OPT} \approx 0.632 \cdot \text{OPT}$$

Proof sketch. Let $S^*$ be the optimal set. After $t$ greedy steps, define $\Delta_t = \text{OPT} - f(S_t)$. The greedy step adds an element gaining at least $\Delta_t / k$: by submodularity, the $k$ elements of $S^*$ jointly cover the remaining gap, so some single element covers at least a $1/k$ fraction of it. Therefore $\Delta_{t+1} \leq \Delta_t (1 - 1/k)$, and after $k$ steps: $\Delta_k \leq (1 - 1/k)^k \cdot \text{OPT} \leq e^{-1} \cdot \text{OPT}$.

The bound $1 - 1/e$ is tight for the max $k$-coverage problem (Nemhauser et al., 1978) and cannot be improved by any polynomial-time algorithm unless P = NP (Feige, 1998).
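
The greedy algorithm from the proof sketch, run on a toy max $k$-coverage instance (the sets are made up; on this instance greedy happens to find the optimum):

```python
def greedy_max(f, V, k):
    """Monotone submodular maximization: add the largest-marginal-gain element k times."""
    S = set()
    for _ in range(k):
        best = max(set(V) - S, key=lambda v: f(S | {v}) - f(S))
        S.add(best)
    return S

# Toy max k-coverage instance: choose k = 2 sets covering the most elements.
cover = {"a": {1, 2, 3}, "b": {3, 4}, "c": {5, 6}, "d": {1, 2}}

def f(S):
    covered = set()
    for v in S:
        covered |= cover[v]
    return len(covered)

S = greedy_max(f, list(cover), k=2)
print(S, f(S))  # picks "a" then "c", covering 5 elements
```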

ML applications of submodularity:

  • Feature selection: mutual information $I(Y; X_S)$ is submodular in $S$ for many distributions. Greedy forward selection inherits the $(1 - 1/e)$ guarantee.
  • Facility location / data summarization: $f(S) = \sum_i \max_{j \in S} \text{sim}(i, j)$ is submodular. Greedy gives a $(1 - 1/e)$-approximation for coreset construction and exemplar-based summarization.
  • Active learning: expected information gain (a form of entropy reduction) is submodular under certain independence assumptions, justifying greedy batch active learning.

Worked Example

Wasserstein Distance as a Linear Program

Let $\mu = 0.5\,\delta_0 + 0.5\,\delta_2$ and $\nu = 0.4\,\delta_1 + 0.6\,\delta_3$ on $\mathbb{R}$ with $d(x, y) = |x - y|$.

Transport plan variables $\gamma_{ij}$ for $(i, j) \in \{0, 2\} \times \{1, 3\}$:

$$\min \; |0-1|\gamma_{01} + |0-3|\gamma_{03} + |2-1|\gamma_{21} + |2-3|\gamma_{23} = \gamma_{01} + 3\gamma_{03} + \gamma_{21} + \gamma_{23}$$

Subject to:

$$\gamma_{01} + \gamma_{03} = 0.5 \quad (\text{supply at } 0)$$
$$\gamma_{21} + \gamma_{23} = 0.5 \quad (\text{supply at } 2)$$
$$\gamma_{01} + \gamma_{21} = 0.4 \quad (\text{demand at } 1)$$
$$\gamma_{03} + \gamma_{23} = 0.6 \quad (\text{demand at } 3)$$
$$\gamma_{ij} \geq 0$$

Solving: set $\gamma_{01} = 0.4$, $\gamma_{03} = 0.1$, $\gamma_{21} = 0$, $\gamma_{23} = 0.5$. Cost $= 0.4 + 0.3 + 0 + 0.5 = 1.2$.

Kantorovich-Rubinstein dual. The dual LP assigns potentials $u_i$ to sources and $v_j$ to sinks:

$$\max \; 0.5 u_0 + 0.5 u_2 + 0.4 v_1 + 0.6 v_3 \quad \text{s.t.} \quad u_i + v_j \leq d(x_i, y_j) \; \forall i, j$$

At optimality, complementary slackness forces $u_i + v_j = d(x_i, y_j)$ on every edge carrying flow ($\gamma_{ij} > 0$), which pins down the potentials up to a constant. The dual says $W_1(\mu,\nu) = \sup_{f : \text{1-Lip}} \mathbb{E}_\mu[f] - \mathbb{E}_\nu[f]$, which we can verify directly: the 1-Lipschitz function $f(x) = -x$ gives $\mathbb{E}_\mu[f] - \mathbb{E}_\nu[f] = (-1) - (-2.2) = 1.2$, matching the primal cost. Strong duality: $W_1 = 1.2$. $\checkmark$
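
The whole worked example can be checked with an off-the-shelf LP solver; a sketch using `scipy.optimize.linprog`, with the dual verified by the 1-Lipschitz test function $f(x) = -x$:

```python
from scipy.optimize import linprog

# Variables ordered (g01, g03, g21, g23) with costs |x_i - y_j|.
c = [1.0, 3.0, 1.0, 1.0]
A_eq = [[1, 1, 0, 0],   # supply at x = 0 is 0.5
        [0, 0, 1, 1],   # supply at x = 2 is 0.5
        [1, 0, 1, 0],   # demand at y = 1 is 0.4
        [0, 1, 0, 1]]   # demand at y = 3 is 0.6
b_eq = [0.5, 0.5, 0.4, 0.6]

res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * 4, method="highs")
print(res.x, res.fun)  # plan (0.4, 0.1, 0, 0.5), W1 = 1.2

# Kantorovich-Rubinstein check: f(x) = -x is 1-Lipschitz and attains the supremum.
E_mu = 0.5 * (-0.0) + 0.5 * (-2.0)   # E_mu[f]
E_nu = 0.4 * (-1.0) + 0.6 * (-3.0)   # E_nu[f]
print(E_mu - E_nu)                   # 1.2 up to float rounding
```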

Connections

Where Your Intuition Breaks

The greedy $(1 - 1/e)$ guarantee for monotone submodular maximization sounds like a limitation — 63% of optimal, leaving a 37% gap. But the bound is exactly tight: for the maximum $k$-coverage problem, no polynomial-time algorithm achieves a ratio better than $1 - 1/e$ unless P = NP (Feige, 1998). More surprisingly, this is not a limit of greedy's particular strategy — it is a fundamental barrier. Randomization or arbitrarily sophisticated algorithms cannot help: in the value-oracle model, any algorithm making only polynomially many queries is likewise limited to $1 - 1/e$ (Nemhauser & Wolsey, 1978). Submodular maximization is one of the few problems where we know not just a bound on approximability, but the exact optimal polynomial-time ratio.

💡Intuition

The computational landscape of ML inference is largely a combinatorial optimization landscape in disguise. Sequence decoding is tree search. Structured prediction is ILP. Optimal transport is min-cost flow. Recognizing the OR structure in an ML problem immediately reveals whether polynomial-time exact algorithms exist (LP/flow), what the approximation ratio is (Christofides, greedy submodular), and where the hardness is fundamental (MAP in general Markov networks is NP-hard, so approximate inference is the only option at scale).

💡Intuition

Submodularity is often called the "convexity of combinatorics." Just as convexity gives gradient methods global guarantees in continuous optimization, submodularity gives the greedy set-addition algorithm a $(1 - 1/e)$ guarantee in discrete optimization. The analogy runs deep: the multilinear extension of a submodular function is a continuous relaxation that enables continuous optimization techniques, just as LP relaxation enables continuous methods for ILP.

When to Apply Each Framework

| ML Problem | OR Framework | Guarantee |
|---|---|---|
| SVM training | QP / LP dual | Exact (polynomial) |
| MAP in tree-structured CRF | LP (marginal polytope) | Exact (integral LP) |
| MAP in loopy CRF | LP relaxation + rounding | Approximate (gap depends on LP) |
| Wasserstein distance ($W_1$) | Min-cost flow | Exact (polynomial, TU) |
| Wasserstein distance ($W_2$) | QP (regularized: Sinkhorn) | Exact / approximate |
| Feature selection | Submodular maximization | $(1 - 1/e)$ greedy |
| NAS | LP relaxation (DARTS) | Heuristic (rounding) |
| Beam search decoding | Approximate combinatorial search | No provable ratio |
