Bridge: Dynamic Programming, Bellman Equations & Neural Architecture Search
Operations research is not merely a historical precursor to machine learning — its mathematical structures appear directly inside modern ML methods. LP relaxations drive structured prediction, network flows model optimal transport between distributions, and combinatorial optimization characterizes problems from neural architecture search to beam decoding. This lesson makes those connections precise.
Concepts
The feasible region of an LP is a convex polytope. When the optimum is finite, it is attained at a vertex: the objective sweeps across the region as a hyperplane, and the last vertex it touches is optimal.
SVMs, Wasserstein distances, and neural architecture search look like completely different methods — but each reduces to the same underlying structure: a linear program, a network flow, or a combinatorial optimization problem. Operations research provides not just algorithms for these, but a classification system: is the problem exactly solvable in polynomial time (LP, min-cost flow), or NP-hard with a certified approximation ratio (submodular greedy, Christofides), or NP-hard with no good approximation at all (MAP in loopy graphs)? Recognizing the OR structure in an ML problem tells you immediately what guarantees are achievable and where the hardness is fundamental.
LP Relaxations in Structured Prediction
Support vector machines. The hard-margin SVM finds the maximum-margin hyperplane separating two classes. Written as a QP (quadratic program, a close relative of LP):

$$\min_{w, b}\ \tfrac{1}{2}\|w\|^2 \quad \text{s.t.}\quad y_i(w^\top x_i + b) \ge 1 \ \text{ for all } i$$
The dual is:

$$\max_{\alpha \ge 0}\ \sum_i \alpha_i - \tfrac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j\, x_i^\top x_j \quad \text{s.t.}\quad \sum_i \alpha_i y_i = 0$$
This is an LP in $\alpha$ after fixing the quadratic terms, and the kernel trick replaces $x_i^\top x_j$ with $K(x_i, x_j)$. The dual variables $\alpha_i$ are shadow prices on the margin constraints; complementary slackness says the non-zero $\alpha_i$ correspond exactly to support vectors (points on the margin).
The support vector interpretation is not a design choice — it is a theorem forced by LP duality. Complementary slackness requires $\alpha_i = 0$ whenever $y_i(w^\top x_i + b) > 1$: points strictly beyond the margin contribute nothing to the decision boundary. This is why SVMs are sparse in training data, and why the kernel trick works: you only need $K(x_i, x)$ evaluated at the support vectors. The sparsity is a mathematical consequence of duality, not an engineering optimization.
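To see complementary slackness in action, here is a toy sketch (pure Python, three 1-D points invented for illustration): maximize the SVM dual by brute-force grid search and observe that the point strictly beyond the margin receives zero dual weight.

```python
# Toy 1-D hard-margin SVM. Maximize the dual
#   W(alpha) = sum_i alpha_i - 1/2 * (sum_i alpha_i * y_i * x_i)^2
# subject to alpha_i >= 0 and sum_i alpha_i * y_i = 0,
# by brute-force grid search (fine for 3 points).
xs = [0.0, 2.0, 3.0]   # training points; x=3 lies strictly beyond the margin
ys = [-1, +1, +1]      # labels

def dual(alphas):
    w = sum(a * y * x for a, y, x in zip(alphas, ys, xs))
    return sum(alphas) - 0.5 * w * w

best, best_alphas = float("-inf"), None
steps = [i * 0.05 for i in range(21)]          # grid over [0, 1]
for a2 in steps:
    for a3 in steps:
        a1 = a2 + a3                           # enforces sum_i alpha_i y_i = 0
        val = dual([a1, a2, a3])
        if val > best:
            best, best_alphas = val, (a1, a2, a3)

# Complementary slackness: the interior point x=3 gets alpha_3 = 0;
# only x=0 and x=2 (the margin points) are support vectors.
```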
MAP inference in CRFs. A conditional random field defines a distribution over structured outputs $y$ given input $x$. Finding the MAP estimate $y^* = \arg\max_y p(y \mid x)$ is an ILP: exponentially many possible $y$, but LP relaxation on the marginal polytope gives a tractable lower bound. When the factor graph is a tree, the LP relaxation is tight (integral vertices), so belief propagation solves MAP exactly. For loopy graphs, LP relaxation + rounding gives approximate MAP with a certificate of optimality gap.
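For a chain (the simplest tree), exact MAP is just dynamic programming. A minimal max-product (Viterbi) sketch with made-up log-potentials, checked against brute-force enumeration:

```python
# Exact MAP on a chain-structured CRF via max-product DP (Viterbi).
# Scores are log-potentials; the chain is a tree, so the LP relaxation
# is tight and this DP recovers the exact integral optimum.
import itertools

unary = [            # unary[t][state]: log-potential for state at position t
    [0.2, 1.0],
    [0.8, 0.1],
    [0.3, 0.9],
]
pairwise = [[0.5, 0.0],  # pairwise[a][b]: log-potential for transition a -> b
            [0.0, 0.5]]

def viterbi(unary, pairwise):
    n_states = len(unary[0])
    score = list(unary[0])                 # best score ending in each state
    back = []                              # backpointers per position
    for t in range(1, len(unary)):
        new_score, ptr = [], []
        for b in range(n_states):
            cands = [score[a] + pairwise[a][b] for a in range(n_states)]
            a_star = max(range(n_states), key=lambda a: cands[a])
            new_score.append(cands[a_star] + unary[t][b])
            ptr.append(a_star)
        score, back = new_score, back + [ptr]
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(back):             # backtrack the argmax labeling
        state = ptr[state]
        path.append(state)
    return list(reversed(path)), max(score)

path, best = viterbi(unary, pairwise)

# Brute force over all 2^3 labelings agrees with the DP.
def total(y):
    s = sum(unary[t][y[t]] for t in range(len(y)))
    return s + sum(pairwise[y[t]][y[t + 1]] for t in range(len(y) - 1))

brute = max(itertools.product(range(2), repeat=3), key=total)
```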
Network Flows and Optimal Transport
Wasserstein distance. Given two discrete distributions $\mu = \sum_i a_i \delta_{x_i}$ and $\nu = \sum_j b_j \delta_{y_j}$ with $\sum_i a_i = 1$, $\sum_j b_j = 1$, the 1-Wasserstein (Earth Mover's) distance is:

$$W_1(\mu, \nu) = \min_{P \in U(a, b)} \sum_{i,j} P_{ij}\, \|x_i - y_j\|$$
where $U(a, b)$ is the set of transport plans $P \ge 0$ with marginals $\sum_j P_{ij} = a_i$ and $\sum_i P_{ij} = b_j$. This is exactly a minimum-cost flow problem on a bipartite graph: source nodes with supply $a_i$, sink nodes with demand $b_j$, edge costs $c_{ij} = \|x_i - y_j\|$. The totally unimodular structure guarantees an integer optimal plan when $a, b$ are rational.
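As a sketch of the flow view (helper function invented for illustration, not a library API): on the real line, the min-cost flow is solved by matching mass left to right across sorted supports.

```python
# 1-Wasserstein distance between two discrete distributions on the real line.
# On the line, the min-cost flow has a greedy solution: route mass left to
# right across sorted supports (the "northwest corner" rule is optimal
# for the cost |x - y| in 1-D).
def wasserstein_1d(mu, nu):
    """mu, nu: lists of (position, mass) pairs; masses sum to 1 each."""
    mu, nu = sorted(mu), sorted(nu)
    i = j = 0
    ai, bj = mu[0][1], nu[0][1]     # remaining mass at current source/sink
    cost = 0.0
    while i < len(mu) and j < len(nu):
        flow = min(ai, bj)          # ship as much as both endpoints allow
        cost += flow * abs(mu[i][0] - nu[j][0])
        ai -= flow
        bj -= flow
        if ai <= 1e-12:             # source exhausted: advance
            i += 1
            if i < len(mu):
                ai = mu[i][1]
        if bj <= 1e-12:             # sink satisfied: advance
            j += 1
            if j < len(nu):
                bj = nu[j][1]
    return cost

# Shifting a distribution one unit to the right costs exactly 1 unit of work.
mu = [(0.0, 0.5), (1.0, 0.5)]
nu = [(1.0, 0.5), (2.0, 0.5)]
d = wasserstein_1d(mu, nu)
```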
LP relaxations are the computational engine for both MAP inference and optimal transport. The feasible polytope structure — whether a marginal polytope or a transportation polytope — determines the tightness of the relaxation and the quality of the approximation.
Kantorovich-Rubinstein duality (derived in detail in the Walkthrough) states that $W_1(\mu, \nu)$ is also the supremum of $\mathbb{E}_\mu[f] - \mathbb{E}_\nu[f]$ over all 1-Lipschitz functions $f$. This is LP duality in function space and is the mathematical foundation of the Wasserstein GAN critic's objective.
Attention as soft assignment. The scaled dot-product attention computation $\mathrm{softmax}(QK^\top/\sqrt{d})$ can be interpreted as a soft transport plan between query positions and key positions. The softmax normalizes each row (ensuring $\sum_j P_{ij} = 1$), making it a row-stochastic relaxation of the assignment problem. Sinkhorn's algorithm (iterated row/column normalization) is the maximum-entropy regularized version of the optimal transport LP and produces the unique doubly-stochastic plan minimizing $\langle P, C \rangle - \varepsilon H(P)$ under entropic regularization.
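A minimal Sinkhorn sketch (pure Python, small made-up cost matrix): alternate row and column normalizations until the plan's marginals match $a$ and $b$.

```python
import math

# Entropy-regularized optimal transport via Sinkhorn iterations:
# scale the Gibbs kernel K = exp(-C / eps) by diagonal vectors u, v,
# alternately rescaling so the plan's marginals match a and b.
def sinkhorn(a, b, C, eps=0.1, iters=500):
    n, m = len(a), len(b)
    K = [[math.exp(-C[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    # Transport plan P_ij = u_i * K_ij * v_j is doubly stochastic
    # (marginals a and b) at convergence.
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

a = [0.5, 0.5]
b = [0.5, 0.5]
C = [[0.0, 1.0],      # cheap to ship 1 -> 1 and 2 -> 2, costly to cross
     [1.0, 0.0]]
P = sinkhorn(a, b, C)
row_sums = [sum(row) for row in P]
col_sums = [sum(P[i][j] for i in range(2)) for j in range(2)]
```

With a small $\varepsilon$ the plan concentrates on the zero-cost diagonal; larger $\varepsilon$ blurs it toward the maximum-entropy plan.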
Combinatorial Optimization in ML
Neural architecture search (NAS). The architecture search space — number of layers, kernel sizes, skip connections — forms a discrete combinatorial space of exponential size. DARTS relaxes this to a continuous differentiable problem by replacing the discrete choice of operation at each edge with a softmax-weighted mixture $\bar{o}(x) = \sum_o \frac{\exp(\alpha_o)}{\sum_{o'} \exp(\alpha_{o'})}\, o(x)$. This is the continuous-relaxation analogue of LP relaxation for the combinatorial architecture-selection problem. After training the $\alpha$ parameters by gradient descent, the architecture is recovered by taking $\arg\max_o \alpha_o$ at each edge — the rounding step.
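A schematic of the relax-then-round pattern (operations and logits invented for illustration; real DARTS mixes neural operations inside a supernet and learns the logits by gradient descent):

```python
import math

# DARTS-style relaxation: replace a discrete choice among candidate
# operations with a softmax-weighted mixture, then round by argmax.
ops = {
    "identity": lambda x: x,
    "double":   lambda x: 2 * x,
    "zero":     lambda x: 0 * x,
}
alpha = {"identity": 0.3, "double": 1.7, "zero": -0.5}   # toy learned logits

def softmax_weights(alpha):
    z = sum(math.exp(a) for a in alpha.values())
    return {name: math.exp(a) / z for name, a in alpha.items()}

def mixed_op(x, alpha):
    """Continuous relaxation: weighted mixture of all candidate ops."""
    w = softmax_weights(alpha)
    return sum(w[name] * op(x) for name, op in ops.items())

def discretize(alpha):
    """Rounding step: keep only the highest-weight operation."""
    return max(alpha, key=alpha.get)

y_relaxed = mixed_op(1.0, alpha)    # differentiable in alpha
chosen = discretize(alpha)          # discrete architecture after rounding
```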
Beam search. Autoregressive decoding maintains a beam of the $k$ highest-probability partial sequences, expanding each at every step and pruning back to the top $k$. This is a greedy approximation to the combinatorial problem of finding the highest-probability complete sequence — an NP-hard problem on the sequence lattice. Beam search has no provable approximation guarantee, but works well in practice because the sequence probability distribution is heavily concentrated.
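A minimal beam-search sketch over a toy two-token model (the conditional probabilities are made up; a real decoder would score prefixes with a neural network):

```python
import math

VOCAB = ["a", "b"]

def log_prob(token, prefix):
    # Toy conditional model: prefers alternating tokens.
    if prefix and prefix[-1] == token:
        return math.log(0.3)
    return math.log(0.7)

def beam_search(length, k):
    beam = [((), 0.0)]                     # (partial sequence, log-prob)
    for _ in range(length):
        # Expand every partial sequence with every token...
        candidates = [
            (seq + (tok,), lp + log_prob(tok, seq))
            for seq, lp in beam
            for tok in VOCAB
        ]
        # ...then prune to the top-k: greedy, no approximation guarantee.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beam = candidates[:k]
    return beam

best_seq, best_lp = beam_search(length=3, k=2)[0]
```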
Combinatorial bandits. In the combinatorial multi-armed bandit problem, the learner selects a subset from a combinatorial action set (e.g., a matching or a spanning tree) at each round and observes rewards. The regret bounds depend on the geometric structure of the action set, and LP-based algorithms achieve $\tilde{O}(\sqrt{T})$ regret for matroid-constrained actions.
Submodularity: Convexity for Combinatorics
A set function $f : 2^V \to \mathbb{R}$ is submodular if for all $A \subseteq B \subseteq V$ and $v \in V \setminus B$:

$$f(A \cup \{v\}) - f(A) \ \ge\ f(B \cup \{v\}) - f(B)$$

This is the diminishing returns property: adding element $v$ to a smaller set is at least as valuable as adding it to a larger set. Submodularity is the discrete analogue of concavity.
Greedy maximization. For a monotone submodular function $f$ (where $A \subseteq B \Rightarrow f(A) \le f(B)$) subject to a cardinality constraint $|S| \le k$, the greedy algorithm — iteratively adding the element with the largest marginal gain — achieves:

$$f(S_{\text{greedy}}) \ \ge\ \left(1 - \frac{1}{e}\right) f(S^*) \ \approx\ 0.632\, f(S^*)$$
Proof sketch. Let $S^*$ be the optimal set, $|S^*| = k$. After $i$ greedy steps, define the gap $\delta_i = f(S^*) - f(S_i)$. The greedy step adds an element gaining at least $\delta_i / k$: by submodularity, the $k$ elements of $S^*$ can cover the remaining gap, so at least one of them has marginal gain $\ge \delta_i / k$. So $\delta_{i+1} \le (1 - 1/k)\,\delta_i$. After $k$ steps: $\delta_k \le (1 - 1/k)^k\,\delta_0 \le e^{-1} f(S^*)$, hence $f(S_k) \ge (1 - 1/e) f(S^*)$.
The $1 - 1/e$ bound is tight for the max $k$-coverage problem (Nemhauser et al., 1978), and cannot be improved unless P = NP (Feige, 1998).
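The greedy algorithm in code, on a small max $k$-coverage instance (sets invented for illustration), compared against the brute-force optimum:

```python
import itertools

# Greedy maximization of a monotone submodular function: max k-coverage.
# f(S) = number of universe elements covered by the union of chosen sets.
sets = {
    "s1": {1, 2, 3, 4},
    "s2": {3, 4, 5, 6},
    "s3": {5, 6, 7, 8},
    "s4": {1, 8},
}

def coverage(chosen):
    covered = set()
    for name in chosen:
        covered |= sets[name]
    return len(covered)

def greedy(k):
    chosen = []
    for _ in range(k):
        # Pick the unchosen set with the largest marginal gain.
        candidates = [s for s in sets if s not in chosen]
        chosen.append(max(candidates, key=lambda s: coverage(chosen + [s])))
    return chosen

k = 2
g = coverage(greedy(k))
opt = max(coverage(list(c)) for c in itertools.combinations(sets, k))
ratio = g / opt   # guaranteed >= 1 - 1/e ~ 0.632 on any instance
```

On this instance greedy happens to find the optimum; the $1 - 1/e$ bound is the worst case over all instances.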
ML applications of submodularity:
- Feature selection: mutual information $I(Y; X_S)$ is submodular in $S$ for many distributions. Greedy forward selection inherits the $(1 - 1/e)$ guarantee.
- Facility location / data summarization: $f(S) = \sum_i \max_{j \in S} \mathrm{sim}(i, j)$ is submodular. Greedy gives a $(1 - 1/e)$-approximation for coreset construction and exemplar-based summarization.
- Active learning: expected information gain (a form of entropy reduction) is submodular under certain independence assumptions, justifying greedy batch active learning.
Worked Example
Wasserstein Distance as a Linear Program
Let $\mu = \tfrac{1}{2}\delta_0 + \tfrac{1}{2}\delta_1$ and $\nu = \tfrac{1}{2}\delta_1 + \tfrac{1}{2}\delta_2$ on $\mathbb{R}$ with cost $c(x, y) = |x - y|$.
Transport plan variables $P_{ij} \ge 0$ for $i, j \in \{1, 2\}$: minimize

$$P_{11}\,|0-1| + P_{12}\,|0-2| + P_{21}\,|1-1| + P_{22}\,|1-2| \ =\ P_{11} + 2P_{12} + P_{22}$$

Subject to:

$$P_{11} + P_{12} = \tfrac{1}{2},\quad P_{21} + P_{22} = \tfrac{1}{2},\quad P_{11} + P_{21} = \tfrac{1}{2},\quad P_{12} + P_{22} = \tfrac{1}{2}$$

Solving: set $P_{11} = \tfrac{1}{2}$, $P_{12} = 0$, $P_{21} = 0$, $P_{22} = \tfrac{1}{2}$. Cost $= \tfrac{1}{2}\cdot 1 + 0 + 0 + \tfrac{1}{2}\cdot 1 = 1$.
Kantorovich-Rubinstein dual. The dual LP assigns potentials $f_i$ to sources and $g_j$ to sinks:

$$\max\ \sum_i a_i f_i - \sum_j b_j g_j \quad \text{s.t.}\quad f_i - g_j \le c_{ij}$$

At optimality the potentials can be read off by complementary slackness (the constraints are saturated exactly where $P_{ij} > 0$). The dual says $W_1(\mu, \nu) = \sup_{\mathrm{Lip}(f) \le 1} \mathbb{E}_\mu[f] - \mathbb{E}_\nu[f]$, which we verify: the 1-Lipschitz function $f(x) = -x$ gives $\mathbb{E}_\mu[f] - \mathbb{E}_\nu[f] = -\tfrac{1}{2} - (-\tfrac{3}{2}) = 1$. More simply, $f(x) = x$ is also 1-Lipschitz and gives $-1$, matching up to sign convention. Strong duality: primal $=$ dual $= 1$, so $W_1(\mu, \nu) = 1$.
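The example can be checked numerically. The sketch below (assuming the toy instance $\mu = \tfrac{1}{2}\delta_0 + \tfrac{1}{2}\delta_1$, $\nu = \tfrac{1}{2}\delta_1 + \tfrac{1}{2}\delta_2$) parameterizes every feasible $2 \times 2$ plan by one free variable and confirms primal $=$ dual $= 1$; in this particular instance every feasible plan happens to cost exactly 1.

```python
# Verify the worked example: mu = 1/2 d_0 + 1/2 d_1, nu = 1/2 d_1 + 1/2 d_2.
# The marginal constraints leave one degree of freedom t:
#   P11 = t, P12 = 1/2 - t, P21 = 1/2 - t, P22 = t,   t in [0, 1/2].
src = [0.0, 1.0]   # support of mu, masses 1/2 each
dst = [1.0, 2.0]   # support of nu, masses 1/2 each

def cost(t):
    P = [[t, 0.5 - t], [0.5 - t, t]]
    return sum(P[i][j] * abs(src[i] - dst[j])
               for i in range(2) for j in range(2))

primal_costs = [cost(t / 100) for t in range(0, 51)]   # sweep t over [0, 1/2]
w1 = min(primal_costs)

# Dual side: f(x) = -x is 1-Lipschitz, so E_mu[f] - E_nu[f] lower-bounds W_1.
def f(x):
    return -x

dual_value = 0.5 * (f(0.0) + f(1.0)) - 0.5 * (f(1.0) + f(2.0))
# Strong duality: dual_value == w1 == 1.
```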
Connections
Where Your Intuition Breaks
The greedy guarantee $1 - 1/e \approx 0.632$ for monotone submodular maximization sounds like a limitation — 63% of optimal, leaving a 37% gap. But the bound is exactly tight and cannot be improved by any polynomial-time algorithm unless P = NP: for the maximum $k$-coverage problem, no polynomial algorithm achieves a ratio better than $1 - 1/e$ (Feige, 1998). More surprisingly, this is not a limit of greedy's particular strategy — it is a fundamental computational barrier. Randomization or arbitrarily complex polynomial-time algorithms cannot help: $1 - 1/e$ is the exact optimal ratio achievable in polynomial time. Submodular maximization is one of the few problems where we know not just a bound on approximability, but the exact optimal polynomial-time ratio.
The computational landscape of ML inference is largely a combinatorial optimization landscape in disguise. Sequence decoding is tree search. Structured prediction is ILP. Optimal transport is min-cost flow. Recognizing the OR structure in an ML problem immediately reveals whether polynomial-time exact algorithms exist (LP/flow), what the approximation ratio is (Christofides, greedy submodular), and where the hardness is fundamental (MAP in general Markov networks is NP-hard, so approximate inference is the only option at scale).
Submodularity is often called the "convexity of combinatorics." Just as convexity gives gradient methods global-optimality guarantees in continuous optimization, submodularity gives the greedy set-addition algorithm a $(1 - 1/e)$ guarantee in discrete optimization. The analogy runs deep: the multilinear extension of a submodular function is a continuous relaxation that enables continuous optimization techniques, just as LP relaxation enables continuous methods for ILP.
When to Apply Each Framework
| ML Problem | OR Framework | Guarantee |
|---|---|---|
| SVM training | QP / LP dual | Exact (polynomial) |
| MAP in tree-structured CRF | LP (marginal polytope) | Exact (integral LP) |
| MAP in loopy CRF | LP relaxation + rounding | Approximate (gap depends on LP) |
| Wasserstein distance (discrete, exact) | Min-cost flow | Exact (polynomial, TU) |
| Wasserstein distance (entropic regularization) | Sinkhorn iterations | Exact for the regularized objective; approximates $W_1$ |
| Feature selection | Submodular maximization | $(1 - 1/e)$ greedy |
| NAS | Continuous relaxation (DARTS) | Heuristic (rounding) |
| Beam search decoding | Approximate combinatorial | No provable ratio |