Duality: Lagrangians, KKT Conditions & Strong Duality
Lagrangian duality transforms a constrained optimization problem into an unconstrained one by pricing constraint violations. The resulting dual problem always provides a lower bound on the original (primal), and under mild conditions (Slater's) the bound is tight. The KKT conditions are the first-order equations that characterize optimality — and their sparse structure (complementary slackness) is why SVMs have support vectors and why LASSO selects features.
Concepts
The feasible region of an LP is a convex polyhedron. When an optimum exists, it is always attained at a vertex. The objective function sweeps as a hyperplane; the last vertex it touches is optimal.
Suppose you want to minimize cost subject to constraints. A government building a road wants minimum cost, but must respect budget limits and land area. Economic theory says: instead of directly solving the constrained problem, you can price the constraints — assign a cost per unit of constraint violation — and then solve a simpler unconstrained problem. If you price constraints correctly, the constrained and unconstrained optima coincide. This is Lagrangian duality, and the optimal prices are the KKT multipliers. The SVM, LASSO, and LP all become their most revealing forms in the dual — because the dual variable structure makes the solution's geometry visible.
The Optimization Problem and Its Lagrangian
The standard form optimization problem:

$$\min_{x} \; f_0(x) \quad \text{subject to} \quad f_i(x) \le 0, \; i = 1, \dots, m; \qquad h_j(x) = 0, \; j = 1, \dots, p.$$
The Lagrangian augments the objective with constraint penalties:

$$L(x, \lambda, \nu) = f_0(x) + \sum_{i=1}^{m} \lambda_i f_i(x) + \sum_{j=1}^{p} \nu_j h_j(x),$$

where $\lambda_i \ge 0$ are dual variables (or Lagrange multipliers) for the inequality constraints and $\nu_j$ for the equality constraints. The dual variables can be interpreted as prices: $\lambda_i$ is the cost per unit of violating constraint $i$.
Key property. For any primal feasible $x$ (satisfying all constraints) and any $\lambda \ge 0$, $\nu$:

$$L(x, \lambda, \nu) = f_0(x) + \sum_i \lambda_i f_i(x) + \sum_j \nu_j h_j(x) \le f_0(x),$$

since $f_i(x) \le 0$, $\lambda_i \ge 0$, and $h_j(x) = 0$.
The Dual Function and Dual Problem
The Lagrange dual function is:

$$g(\lambda, \nu) = \inf_{x} L(x, \lambda, \nu).$$
Critical observation. $g(\lambda, \nu)$ is always concave (it is an infimum of functions affine in $(\lambda, \nu)$), regardless of whether $f_0, f_i$ are convex. This makes the dual problem a concave maximization, which is a convex problem after negation. This is the algebraic reason duality is so powerful: even if the primal is non-convex, you can always construct a convex dual. The bound may not be tight (strong duality may fail), but it always provides a lower bound, and the dual problem itself is convex.
The Lagrange dual problem is:

$$\max_{\lambda, \nu} \; g(\lambda, \nu) \quad \text{subject to} \quad \lambda \ge 0.$$

The dual optimal value $d^*$ and primal optimal value $p^*$ satisfy weak duality: $d^* \le p^*$. The duality gap is $p^* - d^*$.
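Weak duality can be checked numerically. A minimal sketch on a 1-D toy problem (an assumed example, not from the text): minimize $x^2$ subject to $x \ge 1$, where every value of the dual function sits below the primal optimum $p^* = 1$.

```python
import numpy as np

# Weak duality on a 1-D toy (assumed example, not from the text):
# minimize f0(x) = x^2 subject to f1(x) = 1 - x <= 0, so p* = 1 at x = 1.
# L(x, lam) = x^2 + lam*(1 - x) is minimized at x = lam/2, giving
# g(lam) = lam - lam^2 / 4.
def g(lam):
    return lam - lam**2 / 4.0

p_star = 1.0
lams = np.linspace(0.0, 10.0, 1001)
assert np.all(g(lams) <= p_star + 1e-12)   # g(lam) <= p* for every lam >= 0
print(g(lams).max())   # best bound: g(2) = 1 = p*, so the gap is zero here
```

Here the best lower bound equals $p^*$, as strong duality predicts for this convex problem.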
Strong Duality and Slater's Condition
Strong duality: $d^* = p^*$; the duality gap is zero.
Slater's constraint qualification (CQ). For a convex primal problem ($f_0, f_i$ convex, $h_j$ affine), if there exists a strictly feasible point $\tilde{x}$ with $f_i(\tilde{x}) < 0$ for all $i$ and $h_j(\tilde{x}) = 0$, then strong duality holds.
Strong duality for LPs holds whenever either the primal or the dual is feasible, with no constraint qualification needed. For SDPs, a constraint qualification such as Slater's condition (strict feasibility) is required; strict feasibility of both primal and dual guarantees zero gap with both optima attained.
Why it matters. With strong duality, you can solve the easier concave dual instead of the possibly harder primal, and the optimal dual variables provide sensitivity information about constraint prices.
KKT Conditions
At a point $(x^*, \lambda^*, \nu^*)$ satisfying regularity conditions (a constraint qualification, e.g., Slater's), optimality is characterized by the KKT system:
1. Stationarity: $\nabla f_0(x^*) + \sum_{i} \lambda_i^* \nabla f_i(x^*) + \sum_{j} \nu_j^* \nabla h_j(x^*) = 0$
The gradient of the Lagrangian vanishes at the optimum.
2. Primal feasibility: $f_i(x^*) \le 0$ for all $i$, and $h_j(x^*) = 0$ for all $j$
3. Dual feasibility: $\lambda_i^* \ge 0$ for all $i$
4. Complementary slackness: $\lambda_i^* f_i(x^*) = 0$ for all $i$
Each inequality constraint is either active ($f_i(x^*) = 0$, so $\lambda_i^*$ can be positive) or inactive ($f_i(x^*) < 0$, so $\lambda_i^* = 0$; the constraint is irrelevant at the optimum).
For convex problems: KKT conditions are necessary and sufficient for global optimality (assuming strong duality / regularity). The KKT system is a set of equations and inequalities that jointly characterizes the primal and dual optimal points.
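All four conditions can be verified mechanically on a small instance. A sketch on a 1-D toy problem (an assumed example, not from the text):

```python
# KKT check on a 1-D toy (assumed example, not from the text):
# minimize f0(x) = x^2 subject to f1(x) = 1 - x <= 0.
# The optimum is x* = 1 with multiplier lam* = 2.
x_star, lam_star = 1.0, 2.0

stationarity = 2 * x_star + lam_star * (-1.0)    # d/dx [x^2 + lam*(1 - x)] at x*
primal_feasible = (1.0 - x_star) <= 0.0          # f1(x*) <= 0
dual_feasible = lam_star >= 0.0                  # lam* >= 0
comp_slack = lam_star * (1.0 - x_star)           # lam* * f1(x*), should be 0

print(stationarity, primal_feasible, dual_feasible, comp_slack)
# 0.0 True True 0.0
```

Since the problem is convex, satisfying all four conditions certifies that $x^* = 1$ is globally optimal.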
Sensitivity Analysis
From the KKT conditions, shadow prices: if the right-hand side of constraint $i$ is perturbed by $u_i$ (i.e., $f_i(x) \le u_i$), the optimal value changes as:

$$\frac{\partial p^*}{\partial u_i} = -\lambda_i^*.$$
A large $\lambda_i^*$ means constraint $i$ is "expensive": relaxing it slightly would significantly improve the objective. This is the economic interpretation: dual variables are marginal values of resources.
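The shadow-price formula can be checked by finite differences. A sketch on a 1-D toy problem (an assumed example, not from the text), where the multiplier at $u = 0$ is $\lambda^* = 2$:

```python
# Shadow-price check on a 1-D toy (assumed example): minimize x^2
# subject to 1 - x <= u. For u < 1 the optimum is x* = 1 - u, so
# p*(u) = (1 - u)^2, and the multiplier at u = 0 is lam* = 2.
def p_star(u):
    return (1.0 - u) ** 2

eps = 1e-6
deriv = (p_star(eps) - p_star(-eps)) / (2 * eps)   # central difference at u = 0
print(deriv)   # ~ -2.0, matching dp*/du = -lam*
```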
Linear Programming Duality
The LP primal and dual form a symmetric pair:

$$\text{Primal:} \quad \min_{x} \; c^\top x \;\; \text{s.t.} \;\; Ax \ge b, \; x \ge 0 \qquad\qquad \text{Dual:} \quad \max_{y} \; b^\top y \;\; \text{s.t.} \;\; A^\top y \le c, \; y \ge 0.$$
The dual constraint $A^\top y \le c$ says that the dual variables price the resources consistently with the objective costs. Complementary slackness in the LP context: $x_j \left(c_j - (A^\top y)_j\right) = 0$ for all $j$; either variable $x_j$ is zero (nonbasic) or its reduced cost $c_j - (A^\top y)_j$ is zero (basic).
Strong duality for LP (Dantzig's theorem): if both primal and dual are feasible, then $p^* = d^*$ with no duality gap.
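A quick numeric check of LP strong duality on a small assumed instance (the data below are illustrative, not from the text), solving both primal and dual by brute-force vertex enumeration, which is valid here because a bounded LP attains its optimum at a vertex:

```python
import itertools
import numpy as np

def lp_vertex_opt(c, G, h, sense="min"):
    """Optimize c @ x over the 2-D polyhedron {x : G @ x >= h} by
    enumerating vertices (intersections of pairs of constraint boundaries).
    Assumes a bounded LP, so the optimum is attained at a vertex."""
    G, h, c = np.asarray(G, float), np.asarray(h, float), np.asarray(c, float)
    best = None
    for i, j in itertools.combinations(range(len(G)), 2):
        M = G[[i, j]]
        if abs(np.linalg.det(M)) < 1e-12:
            continue                      # parallel boundaries, no vertex
        x = np.linalg.solve(M, h[[i, j]])
        if np.all(G @ x >= h - 1e-9):     # keep only feasible vertices
            v = c @ x
            if best is None or (v < best if sense == "min" else v > best):
                best = v
    return best

# Primal: min 2x1 + 3x2  s.t.  x1 + 2x2 >= 4, 3x1 + x2 >= 5, x >= 0
A = [[1, 2], [3, 1]]; b = [4, 5]; c = [2, 3]
G_p = np.vstack([A, np.eye(2)]); h_p = np.array(b + [0, 0])
p_star = lp_vertex_opt(c, G_p, h_p, "min")

# Dual: max 4y1 + 5y2  s.t.  A^T y <= c, y >= 0  (rewritten as -A^T y >= -c)
G_d = np.vstack([-np.array(A).T, np.eye(2)]); h_d = np.array([-2, -3, 0, 0])
d_star = lp_vertex_opt(b, G_d, h_d, "max")

print(p_star, d_star)   # both 6.6: zero duality gap
```

Both problems attain the same value, as the theorem predicts.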
Worked Example
Example 1: SVM Dual Derivation
The hard-margin SVM primal: minimize $\frac{1}{2}\|w\|^2$ subject to $y_i(w^\top x_i + b) \ge 1$ for all $i$. Rewrite as $f_i(w, b) = 1 - y_i(w^\top x_i + b) \le 0$.
Lagrangian:

$$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 + \sum_{i=1}^{n} \alpha_i \left(1 - y_i(w^\top x_i + b)\right).$$
Stationarity in $w$: $\nabla_w L = w - \sum_i \alpha_i y_i x_i = 0$, so $w = \sum_i \alpha_i y_i x_i$.
Stationarity in $b$: $\frac{\partial L}{\partial b} = -\sum_i \alpha_i y_i = 0$.
Substituting back into $L$ to get the dual function $g(\alpha)$:

$$g(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \, x_i^\top x_j.$$
Dual problem: $\max_{\alpha} g(\alpha)$ subject to $\alpha_i \ge 0$ for all $i$ and $\sum_i \alpha_i y_i = 0$.
This is a quadratic program in $\alpha$ with $n$ variables (one per training example). The kernel trick replaces $x_i^\top x_j$ with $K(x_i, x_j)$ — the dual formulation only requires inner products.
Complementary slackness: $\alpha_i \left(1 - y_i(w^\top x_i + b)\right) = 0$. Either $\alpha_i = 0$ (example not a support vector) or $y_i(w^\top x_i + b) = 1$ (example lies exactly on the margin). The weight vector $w = \sum_i \alpha_i y_i x_i$ is a sparse combination; only support vectors contribute.
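The derivation can be traced end to end on a tiny instance. A sketch with two assumed toy points (not from the text), where the equality constraint collapses the dual to a 1-D maximization:

```python
import numpy as np

# Two-point hard-margin SVM (assumed toy data, not from the text):
# x1 = (1, 0) with y1 = +1 and x2 = (-1, 0) with y2 = -1.
X = np.array([[1.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, -1.0])

# Dual objective g(alpha) = sum(alpha) - 1/2 alpha^T Q alpha,
# where Q_ij = y_i y_j <x_i, x_j>.
Q = (y[:, None] * y[None, :]) * (X @ X.T)

# The constraint sum_i alpha_i y_i = 0 forces alpha_1 = alpha_2 = a here,
# so the dual collapses to a 1-D concave maximization over a >= 0.
a_grid = np.linspace(0.0, 2.0, 20001)
g = 2 * a_grid - 0.5 * (a_grid**2) * Q.sum()
a_star = a_grid[np.argmax(g)]           # a* = 1/2

alpha = np.array([a_star, a_star])
w = (alpha * y) @ X                     # w = sum_i alpha_i y_i x_i
b = y[0] - w @ X[0]                     # from the active margin constraint

margins = y * (X @ w + b)
print(a_star, w, margins)   # both margins equal 1: both points are support vectors
```

With only two points, both have $\alpha_i > 0$ and sit exactly on the margin, consistent with complementary slackness.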
Example 2: LASSO via KKT
The LASSO: minimize $\frac{1}{2}\|y - X\beta\|_2^2 + \lambda \|\beta\|_1$.
Since $\|\beta\|_1$ is non-smooth, KKT uses subgradients. The stationarity condition at the optimum $\beta^*$:

$$-X^\top (y - X\beta^*) + \lambda s = 0, \qquad s \in \partial \|\beta^*\|_1,$$
where $\partial \|\beta^*\|_1$ is the subdifferential: $s_j$ is $\operatorname{sign}(\beta_j^*)$ if $\beta_j^* \neq 0$, and $s_j \in [-1, 1]$ if $\beta_j^* = 0$.
This gives the feature selection rule: $\beta_j^* = 0$ iff $|x_j^\top (y - X\beta^*)| \le \lambda$. Only features whose correlation with the residual exceeds the threshold $\lambda$ are selected. This is the KKT subgradient condition made explicit.
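The selection rule can be verified directly. A sketch assuming an orthonormal design ($X^\top X = I$), in which case the LASSO solution is coordinate-wise soft-thresholding (the data values are illustrative assumptions):

```python
import numpy as np

def soft_threshold(z, lam):
    """Per-coordinate LASSO solution when the design is orthonormal."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

# Orthonormal design X^T X = I (an assumption for this sketch) makes the
# LASSO separable: beta_j* = soft_threshold(x_j^T y, lam).
X = np.eye(5)
y = np.array([3.0, -0.5, 1.2, 0.1, -2.0])
lam = 1.0

beta = soft_threshold(X.T @ y, lam)
corr = X.T @ (y - X @ beta)          # x_j^T (residual)

# KKT check: |corr_j| <= lam everywhere, with equality (and matching sign)
# exactly on the selected features.
for j in range(5):
    if beta[j] == 0.0:
        assert abs(corr[j]) <= lam + 1e-9
    else:
        assert abs(corr[j] - lam * np.sign(beta[j])) < 1e-9
print(beta)   # selected: features 0, 2, 4; the small-correlation features are zeroed
```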
Example 3: Quadratic Program (QP) — Verifying KKT
Minimize $f_0(x) = x_1^2 + x_2^2$ subject to $x_1 + x_2 = 1$.

Lagrangian: $L(x, \nu) = x_1^2 + x_2^2 + \nu(x_1 + x_2 - 1)$.

Stationarity: $2x_1 + \nu = 0$, $2x_2 + \nu = 0$, so $x_1 = x_2 = -\nu/2$.

Primal feasibility: $x_1 + x_2 = -\nu = 1$.

Solution: $x^* = (1/2, 1/2)$, $p^* = 1/2$. Dual function: at the minimizer of $L$ in $x$, $g(\nu) = -\nu^2/2 - \nu$.

Maximizing $g$: $g'(\nu) = -\nu - 1 = 0$ gives $\nu^* = -1$, $d^* = g(-1) = 1/2 = p^*$. Strong duality holds (a linear equality constraint satisfies Slater's).
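The KKT solve for an equality-constrained QP can be done in a few lines. A minimal sketch, assuming the instance minimize $x_1^2 + x_2^2$ subject to $x_1 + x_2 = 1$:

```python
import numpy as np

# Toy equality-constrained QP (assumed instance): minimize x1^2 + x2^2
# subject to a^T x = 1 with a = (1, 1).
a = np.array([1.0, 1.0])

# Stationarity 2x + nu*a = 0 gives x = -nu*a/2; feasibility a^T x = 1
# then fixes nu = -2 / (a^T a).
nu = -2.0 / (a @ a)                     # nu* = -1
x = -nu * a / 2.0                       # x* = (1/2, 1/2)

p_star = x @ x                          # primal optimal value
g_nu = -(nu**2) * (a @ a) / 4.0 - nu    # dual function at nu*
print(x, nu, p_star, g_nu)   # strong duality: p* = g(nu*) = 0.5
```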
Connections
Where Your Intuition Breaks
The most common misapplication: using KKT conditions to verify optimality for non-convex problems. KKT conditions are necessary at a local optimum of a smooth constrained problem (under a constraint qualification), but they are not sufficient unless the problem is convex. A non-convex problem can have many KKT points that are saddle points or merely local minima, not global minima. In deep learning, the condition $\nabla L(w^*) \approx 0$ at a trained weight $w^*$ is KKT stationarity, but that doesn't mean $w^*$ minimizes the loss globally, or even locally in all directions. When you use SGD and it "converges" to a stopping criterion of small gradient, you've found an approximate KKT point; the question of whether it's a useful minimum is separate, answered by validation performance rather than the optimality equations.
Complementary slackness is the source of sparsity. The KKT condition $\lambda_i^* f_i(x^*) = 0$ means every inequality constraint is either active ($f_i(x^*) = 0$) or free ($\lambda_i^* = 0$). In the SVM, this sparsity lives in the dual: most training examples get $\alpha_i = 0$ (irrelevant to the decision boundary). In LASSO, the analogous sparsity is in the primal: most features get $\beta_j = 0$ because their KKT subgradient condition is satisfied at zero. Sparsity in ML solutions is almost always traceable to a complementary slackness condition at the heart of the optimization.
The dual is always a concave (convex) problem. No matter how messy the primal (non-convex objective, combinatorial constraints), the Lagrange dual function is always concave in $(\lambda, \nu)$, and thus the dual problem is always convex. This is why dual decomposition and ADMM can be applied to non-convex problems: you get a convex relaxation via the dual. The dual optimal $d^*$ gives a lower bound on the non-convex primal $p^*$, and the duality gap $p^* - d^*$ measures how much you lose from the relaxation.
Strong duality requires constraint qualifications. Slater's condition (strict feasibility) is the standard CQ for convex problems. It fails if all feasible points lie on constraint boundaries (e.g., the feasible set is a single point). For non-convex problems, strong duality typically fails: the duality gap $p^* - d^* > 0$ in general. The gap is exploited in SDP relaxations (e.g., MAX-CUT): solve the convex SDP dual, get a lower bound on the NP-hard primal, and the ratio gap/(primal value) measures approximation quality.