
Bridge: Dynamic Programming, Bellman Equations & Neural Architecture Search

Operations research is not merely a historical precursor to machine learning — its mathematical structures appear directly inside modern ML methods. LP relaxations drive structured prediction, network flows model optimal transport between distributions, and combinatorial optimization characterizes problems from neural architecture search to beam decoding. This lesson makes those connections precise.

Concepts

The feasible region of an LP is a convex polytope. The optimal solution always lies at a vertex. The objective function sweeps as a hyperplane — the last vertex it touches is optimal.

Optimal: (2,2) with objective value 4. Blue dashed = constraints. Green solid = optimal iso-line (c·x = 4).
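
The vertex claim can be checked with an off-the-shelf LP solver. A minimal sketch, assuming the figure's feasible region is the box $0 \le x_1, x_2 \le 2$ and the objective is $c = (1, 1)$ (a guess consistent with the stated optimum, not read off the figure):

```python
from scipy.optimize import linprog

# Assumed feasible region: the box 0 <= x1, x2 <= 2 with objective c = (1, 1),
# consistent with the stated optimum (2, 2) and value 4. linprog minimizes,
# so negate the objective to maximize.
res = linprog(c=[-1, -1], bounds=[(0, 2), (0, 2)], method="highs")
print(res.x, -res.fun)  # vertex (2, 2), objective value 4
```

The HiGHS solver returns a vertex solution, matching the geometric picture: the objective hyperplane last touches the polytope at a corner.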

SVMs, Wasserstein distances, and neural architecture search look like completely different methods — but each reduces to the same underlying structure: a linear program, a network flow, or a combinatorial optimization problem. Operations research provides not just algorithms for these, but a classification system: is the problem exactly solvable in polynomial time (LP, min-cost flow), or NP-hard with a certified approximation ratio (submodular greedy, Christofides), or NP-hard with no good approximation at all (MAP in loopy graphs)? Recognizing the OR structure in an ML problem tells you immediately what guarantees are achievable and where the hardness is fundamental.

LP Relaxations in Structured Prediction

Support vector machines. The hard-margin SVM finds the maximum-margin hyperplane separating two classes. Written as a QP (quadratic program, a close relative of LP):

$$\min_{w, b} \; \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i(w^\top x_i + b) \geq 1 \; \forall i$$

The Lagrangian dual (shown with the soft-margin box constraint $C$; the hard-margin case corresponds to $C = \infty$) is:

$$\max_\alpha \; \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^\top x_j \quad \text{s.t.} \quad 0 \leq \alpha_i \leq C, \; \sum_i \alpha_i y_i = 0$$

This is a QP in $\alpha$ with linear constraints (fix the quadratic term and it becomes an LP), and the kernel trick replaces $x_i^\top x_j$ with $k(x_i, x_j)$. The dual variables $\alpha_i$ are shadow prices on the margin constraints; complementary slackness says the non-zero $\alpha_i$ correspond exactly to support vectors (points on the margin).

The support vector interpretation is not a design choice — it is a theorem forced by duality. Complementary slackness requires $\alpha_i = 0$ whenever $y_i(w^\top x_i + b) > 1$: points strictly beyond the margin contribute nothing to the decision boundary. This is why SVMs are sparse in the training data, and why the kernel trick works: you only need $k(x_i, x_j)$ evaluated at support vectors. The sparsity is a mathematical consequence of duality, not an engineering optimization.
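
A minimal sketch of how duality pins down this structure, on a hand-worked two-point problem (the data, and the brute-force maximization of the dual, are illustrative choices, not part of the lesson):

```python
# Hand-worked 1-D hard-margin SVM: x = [-1, +1] with labels y = [-1, +1].
# The constraint sum_i alpha_i y_i = 0 forces alpha_1 = alpha_2 = a, and the
# dual objective collapses to 2a - 2a^2, maximized at a = 0.5 (grid search here).
X = [-1.0, 1.0]
y = [-1, 1]
a = max((k / 1000 for k in range(1001)), key=lambda t: 2 * t - 2 * t * t)

# Primal recovery: w = sum_i alpha_i y_i x_i; b from any tight margin constraint.
w = sum(a * yi * xi for yi, xi in zip(y, X))
b = y[0] - w * X[0]  # y_i (w x_i + b) = 1 with y_i in {-1, +1} gives b = y_i - w x_i

# Complementary slackness: both alphas are non-zero, so both margin constraints
# are tight, i.e. both points are support vectors.
margins = [yi * (w * xi + b) for yi, xi in zip(y, X)]
print(a, w, b, margins)  # 0.5 1.0 0.0 [1.0, 1.0]
```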

MAP inference in CRFs. A conditional random field defines a distribution $p(y|x) \propto \exp(\sum_c \theta_c \phi_c(y_c, x))$ over structured outputs $y$. Finding the MAP estimate $\hat{y} = \arg\max_y \sum_c \theta_c \phi_c(y_c, x)$ is an ILP: there are exponentially many possible $y$, but the LP relaxation over the marginal polytope gives a tractable upper bound on the maximum. When the factor graph is a tree, the LP relaxation is tight (its vertices are integral), so belief propagation solves MAP exactly. For loopy graphs, LP relaxation plus rounding gives an approximate MAP with a certificate of the optimality gap.
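
On a chain, where the relaxation is tight, exact MAP reduces to dynamic programming. A minimal Viterbi (max-product) sketch with made-up potentials:

```python
import numpy as np

def viterbi(unary, pairwise):
    """Exact MAP on a chain: maximize sum_t unary[t, y_t] + sum_t pairwise[y_t, y_{t+1}].
    On a chain the marginal-polytope LP is tight, so this DP recovers the ILP optimum.
    """
    T, K = unary.shape
    score = unary[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        # cand[i, j] = best score ending with label i at t-1 followed by label j at t
        cand = score[:, None] + pairwise + unary[t][None, :]
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1], float(score.max())

# Toy 3-step, 2-label chain with made-up potentials.
unary = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
pairwise = np.array([[0.6, 0.0], [0.0, 0.6]])  # rewards keeping the same label
path, val = viterbi(unary, pairwise)
print(path, val)  # [0, 0, 0] 3.2
```

Here the pairwise smoothness reward outweighs the middle step's unary preference for label 1, so the MAP path stays constant.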

Network Flows and Optimal Transport

Wasserstein distance. Given two discrete distributions $\mu = \sum_i a_i \delta_{x_i}$ and $\nu = \sum_j b_j \delta_{y_j}$ with $a_i, b_j \geq 0$, $\sum_i a_i = \sum_j b_j = 1$, the 1-Wasserstein (Earth Mover's) distance is:

$$W_1(\mu, \nu) = \min_{\gamma \in \Pi(\mu,\nu)} \sum_{i,j} \gamma_{ij} d(x_i, y_j)$$

where $\Pi(\mu,\nu)$ is the set of transport plans $\gamma \geq 0$ with marginals $\sum_j \gamma_{ij} = a_i$ and $\sum_i \gamma_{ij} = b_j$. This is exactly a minimum-cost flow problem on a bipartite graph: source nodes $i$ with supply $a_i$, sink nodes $j$ with demand $b_j$, edge costs $d(x_i, y_j)$. The totally unimodular constraint structure guarantees an optimal plan at a vertex of the transportation polytope — integral when the marginals are integers, and with the marginals' common denominator when the $a_i, b_j$ are rational.

LP relaxations are the computational engine for both MAP inference and optimal transport. The feasible polytope structure — whether a marginal polytope or a transportation polytope — determines the tightness of the relaxation and the quality of the approximation.

Kantorovich-Rubinstein duality (derived in detail in the Walkthrough) states that $W_1$ is also the supremum of $\mathbb{E}_\mu[f] - \mathbb{E}_\nu[f]$ over all 1-Lipschitz functions $f$. This is LP duality in function space and is the mathematical foundation of the Wasserstein GAN critic's objective.

Attention as soft assignment. The scaled dot-product attention computation $\text{Attention}(Q,K,V) = \text{softmax}(QK^\top / \sqrt{d})V$ can be interpreted as a row-stochastic soft transport plan between query positions and key positions. The softmax normalizes each row (ensuring $\sum_j \gamma_{ij} = 1$), making it a relaxed version of the assignment problem. Sinkhorn's algorithm (iterated row/column normalization) solves the entropy-regularized optimal transport LP and produces the unique doubly-stochastic plan that is optimal under entropic regularization.
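
A bare-bones Sinkhorn sketch (the weights, cost matrix, and regularization strength eps are illustrative choices; production implementations work in log space for numerical stability):

```python
import numpy as np

def sinkhorn(a, b, C, eps=0.1, iters=200):
    """Entropy-regularized OT: alternately rescale rows and columns of
    K = exp(-C / eps) until the plan's marginals match a and b."""
    K = np.exp(-C / eps)
    u = np.ones_like(a)
    for _ in range(iters):
        v = b / (K.T @ u)  # column scaling
        u = a / (K @ v)    # row scaling
    return u[:, None] * K * v[None, :]  # plan = diag(u) K diag(v)

a = np.array([0.5, 0.5])                 # source weights (illustrative)
b = np.array([0.4, 0.6])                 # target weights
C = np.array([[1.0, 3.0], [1.0, 1.0]])   # pairwise costs
P = sinkhorn(a, b, C)
print(P.sum(axis=1), P.sum(axis=0))      # marginals match a and b
```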

Combinatorial Optimization in ML

Neural architecture search (NAS). The architecture search space — number of layers, kernel sizes, skip connections — forms a discrete combinatorial space of exponential size. DARTS relaxes this to a continuous differentiable problem by replacing the discrete choice of operation $o$ at each edge with a weighted mixture $\sum_o \text{softmax}(\alpha)_o \cdot o(x)$. This is the continuous-relaxation analogue of an LP relaxation of the combinatorial architecture-selection problem. After training $\alpha$, the architecture is recovered by taking $\arg\max_o \alpha_o$ — the rounding step.
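
A toy sketch of the relax-then-round pattern; the three candidate "operations" are stand-ins for real network modules, and the $\alpha$ values are invented rather than learned:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Three candidate operations on one edge: toy stand-ins for skip / zero / conv.
ops = [lambda x: x, lambda x: 0.0 * x, lambda x: 2.0 * x]

alpha = np.array([0.1, -2.0, 1.5])  # architecture parameters (learned in DARTS)
w = softmax(alpha)

def mixed(x):
    # Continuous relaxation: softmax-weighted mixture of all candidate ops.
    return sum(wi * op(x) for wi, op in zip(w, ops))

# Rounding step: discretize by keeping only the argmax operation.
chosen = ops[int(np.argmax(alpha))]
print(mixed(1.0), chosen(1.0))
```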

Beam search. Autoregressive decoding maintains a beam of the $k$ highest-probability partial sequences, expanding each at every step and pruning back to the top $k$. This is a greedy approximation to the combinatorial problem of finding the highest-probability complete sequence — an NP-hard problem on the sequence lattice. Beam search has no provable approximation guarantee, but works well in practice because the sequence probability distribution is heavily concentrated.
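
A minimal beam-search sketch. For simplicity the toy model's per-step token log-probabilities are assumed independent of the prefix, unlike a real autoregressive decoder:

```python
import math

def beam_search(step_logprobs, k=2):
    """Keep the k best partial sequences; expand and prune at each step."""
    beams = [((), 0.0)]  # (sequence, cumulative log-probability)
    for logp in step_logprobs:
        expanded = [(seq + (v,), score + lp)
                    for seq, score in beams
                    for v, lp in enumerate(logp)]
        beams = sorted(expanded, key=lambda b: -b[1])[:k]
    return beams

# Toy per-step token log-probabilities over a 2-token vocabulary.
steps = [[math.log(0.7), math.log(0.3)],
         [math.log(0.6), math.log(0.4)],
         [math.log(0.2), math.log(0.8)]]
best, score = beam_search(steps, k=2)[0]
print(best, round(math.exp(score), 3))  # (0, 0, 1) 0.336
```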

Combinatorial bandits. In the combinatorial multi-armed bandit problem, the learner selects a subset $S$ from a combinatorial action set (e.g., a matching or a spanning tree) at each round and observes rewards. Regret bounds depend on the geometric structure of the action set, and LP-based algorithms achieve $O(\sqrt{T \log n})$ regret for matroid-constrained actions.

Submodularity: Convexity for Combinatorics

A set function $f : 2^V \to \mathbb{R}$ is submodular if for all $A \subseteq B$ and $v \notin B$:

$$f(A \cup \{v\}) - f(A) \geq f(B \cup \{v\}) - f(B)$$

This is the diminishing-returns property: adding element $v$ to a smaller set is at least as valuable as adding it to a larger set. Submodularity is the discrete analogue of concavity.
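
A quick numerical check of the definition on a small coverage function (the ground set and covered elements are made up):

```python
from itertools import chain, combinations

# A canonical submodular function: coverage. f(S) = number of elements
# covered by the union of the sets indexed by S.
cover = {"a": {1, 2}, "b": {2, 3}, "c": {3, 4, 5}}

def f(S):
    covered = set()
    for v in S:
        covered |= cover[v]
    return len(covered)

V = list(cover)
subsets = [set(c) for c in
           chain.from_iterable(combinations(V, r) for r in range(len(V) + 1))]

# Diminishing returns: the gain from adding v to A is at least the gain
# from adding v to any superset B of A.
ok = all(
    f(A | {v}) - f(A) >= f(B | {v}) - f(B)
    for A in subsets for B in subsets if A <= B
    for v in set(V) - B
)
print(ok)  # True
```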

Greedy maximization. For a monotone submodular function $f$ (where $A \subseteq B \Rightarrow f(A) \leq f(B)$) subject to a cardinality constraint $|S| \leq k$, the greedy algorithm — iteratively adding the element with the largest marginal gain — achieves:

$$f(S_{\text{greedy}}) \geq \left(1 - \frac{1}{e}\right) \text{OPT} \approx 0.632 \cdot \text{OPT}$$

Proof sketch. Let $S^*$ be the optimal set. After $t$ greedy steps, define $\Delta_t = \text{OPT} - f(S_t)$. The greedy step adds an element gaining at least $\Delta_t / k$: by submodularity, the $k$ elements of $S^*$ jointly cover the remaining gap, so some single element covers at least a $1/k$ fraction of it. Therefore $\Delta_{t+1} \leq \Delta_t (1 - 1/k)$, and after $k$ steps: $\Delta_k \leq (1 - 1/k)^k \cdot \text{OPT} \leq e^{-1} \cdot \text{OPT}$.

The bound $1 - 1/e$ is tight for the max $k$-coverage problem (Nemhauser et al., 1978) and cannot be improved by any polynomial-time algorithm unless P = NP (Feige, 1998).
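
The greedy algorithm from the proof sketch, run on a toy max $k$-coverage instance (the sets are made up; on this instance greedy happens to find the optimum):

```python
def greedy_max(f, V, k):
    """Monotone submodular maximization: add the largest-marginal-gain element k times."""
    S = set()
    for _ in range(k):
        best = max(set(V) - S, key=lambda v: f(S | {v}) - f(S))
        S.add(best)
    return S

# Toy max k-coverage instance: choose k = 2 sets covering the most elements.
cover = {"a": {1, 2, 3}, "b": {3, 4}, "c": {5, 6}, "d": {1, 2}}

def f(S):
    covered = set()
    for v in S:
        covered |= cover[v]
    return len(covered)

S = greedy_max(f, list(cover), k=2)
print(S, f(S))  # picks "a" then "c", covering 5 elements
```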

ML applications of submodularity:

  • Feature selection: mutual information $I(Y; X_S)$ is submodular in $S$ for many distributions. Greedy forward selection inherits the $(1 - 1/e)$ guarantee.
  • Facility location / data summarization: $f(S) = \sum_i \max_{j \in S} \text{sim}(i, j)$ is submodular. Greedy gives a $(1 - 1/e)$-approximation for coreset construction and exemplar-based summarization.
  • Active learning: expected information gain (a form of entropy reduction) is submodular under certain independence assumptions, justifying greedy batch active learning.

Worked Example

Wasserstein Distance as a Linear Program

Let $\mu = 0.5\,\delta_0 + 0.5\,\delta_2$ and $\nu = 0.4\,\delta_1 + 0.6\,\delta_3$ on $\mathbb{R}$ with $d(x, y) = |x - y|$.

Transport plan variables $\gamma_{ij}$ for $(i, j) \in \{0, 2\} \times \{1, 3\}$:

$$\min \; |0-1|\gamma_{01} + |0-3|\gamma_{03} + |2-1|\gamma_{21} + |2-3|\gamma_{23} = \gamma_{01} + 3\gamma_{03} + \gamma_{21} + \gamma_{23}$$

Subject to:

$$\gamma_{01} + \gamma_{03} = 0.5 \quad (\text{supply at } 0)$$
$$\gamma_{21} + \gamma_{23} = 0.5 \quad (\text{supply at } 2)$$
$$\gamma_{01} + \gamma_{21} = 0.4 \quad (\text{demand at } 1)$$
$$\gamma_{03} + \gamma_{23} = 0.6 \quad (\text{demand at } 3)$$
$$\gamma_{ij} \geq 0$$

Solving: set $\gamma_{01} = 0.4$, $\gamma_{03} = 0.1$, $\gamma_{21} = 0$, $\gamma_{23} = 0.5$. Cost $= 0.4 + 0.3 + 0 + 0.5 = 1.2$.

Kantorovich-Rubinstein dual. The dual LP assigns potentials $u_i$ to sources and $v_j$ to sinks:

$$\max \; 0.5 u_0 + 0.5 u_2 + 0.4 v_1 + 0.6 v_3 \quad \text{s.t.} \quad u_i + v_j \leq d(x_i, y_j) \; \forall i, j$$

At optimality, complementary slackness forces $u_i + v_j = d(x_i, y_j)$ on every edge carrying flow ($\gamma_{ij} > 0$), which pins down the potentials up to a constant. The dual says $W_1(\mu,\nu) = \sup_{f : \text{1-Lip}} \mathbb{E}_\mu[f] - \mathbb{E}_\nu[f]$, which we can verify directly: the 1-Lipschitz function $f(x) = -x$ gives $\mathbb{E}_\mu[f] - \mathbb{E}_\nu[f] = (-1) - (-2.2) = 1.2$, matching the primal cost. Strong duality: $W_1 = 1.2$. $\checkmark$
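
The whole worked example can be checked with an off-the-shelf LP solver; a sketch using `scipy.optimize.linprog`, with the dual verified by the 1-Lipschitz test function $f(x) = -x$:

```python
from scipy.optimize import linprog

# Variables ordered (g01, g03, g21, g23) with costs |x_i - y_j|.
c = [1.0, 3.0, 1.0, 1.0]
A_eq = [[1, 1, 0, 0],   # supply at x = 0 is 0.5
        [0, 0, 1, 1],   # supply at x = 2 is 0.5
        [1, 0, 1, 0],   # demand at y = 1 is 0.4
        [0, 1, 0, 1]]   # demand at y = 3 is 0.6
b_eq = [0.5, 0.5, 0.4, 0.6]

res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * 4, method="highs")
print(res.x, res.fun)  # plan (0.4, 0.1, 0, 0.5), W1 = 1.2

# Kantorovich-Rubinstein check: f(x) = -x is 1-Lipschitz and attains the supremum.
E_mu = 0.5 * (-0.0) + 0.5 * (-2.0)   # E_mu[f]
E_nu = 0.4 * (-1.0) + 0.6 * (-3.0)   # E_nu[f]
print(E_mu - E_nu)                   # 1.2 up to float rounding
```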

Connections

Where Your Intuition Breaks

The greedy $(1 - 1/e)$ guarantee for monotone submodular maximization sounds like a limitation — 63% of optimal, leaving a 37% gap. But the bound is exactly tight: for the maximum $k$-coverage problem, no polynomial-time algorithm achieves a ratio better than $1 - 1/e$ unless P = NP (Feige, 1998). More surprisingly, this is not a limit of greedy's particular strategy — it is a fundamental barrier. Randomization or arbitrarily sophisticated algorithms cannot help: in the value-oracle model, any algorithm making only polynomially many queries is likewise limited to $1 - 1/e$ (Nemhauser & Wolsey, 1978). Submodular maximization is one of the few problems where we know not just a bound on approximability, but the exact optimal polynomial-time ratio.

💡Intuition

The computational landscape of ML inference is largely a combinatorial optimization landscape in disguise. Sequence decoding is tree search. Structured prediction is ILP. Optimal transport is min-cost flow. Recognizing the OR structure in an ML problem immediately reveals whether polynomial-time exact algorithms exist (LP/flow), what the approximation ratio is (Christofides, greedy submodular), and where the hardness is fundamental (MAP in general Markov networks is NP-hard, so approximate inference is the only option at scale).

💡Intuition

Submodularity is often called the "convexity of combinatorics." Just as convexity gives gradient methods global guarantees in continuous optimization, submodularity gives the greedy set-addition algorithm a $(1 - 1/e)$ guarantee in discrete optimization. The analogy runs deep: the multilinear extension of a submodular function is a continuous relaxation that enables continuous optimization techniques, just as LP relaxation enables continuous methods for ILP.

When to Apply Each Framework

| ML Problem | OR Framework | Guarantee |
|---|---|---|
| SVM training | QP / LP dual | Exact (polynomial) |
| MAP in tree-structured CRF | LP (marginal polytope) | Exact (integral LP) |
| MAP in loopy CRF | LP relaxation + rounding | Approximate (gap depends on LP) |
| Wasserstein distance ($W_1$) | Min-cost flow | Exact (polynomial, TU) |
| Wasserstein distance ($W_2$) | QP (regularized: Sinkhorn) | Exact / approximate |
| Feature selection | Submodular maximization | $(1 - 1/e)$ greedy |
| NAS | LP relaxation (DARTS) | Heuristic (rounding) |
| Beam search decoding | Approximate combinatorial search | No provable ratio |
