Neural-Path/Notes
45 min

Duality: Lagrangians, KKT Conditions & Strong Duality

Lagrangian duality transforms a constrained optimization problem into an unconstrained one by pricing constraint violations. The resulting dual problem always provides a lower bound on the original (primal), and under mild conditions (Slater's) the bound is tight. The KKT conditions are the first-order equations that characterize optimality — and their sparse structure (complementary slackness) is why SVMs have support vectors and why LASSO selects features.

Concepts

The feasible region of an LP is a convex polyhedron. When a finite optimum exists, it is attained at a vertex (possibly along with an entire face). The objective function sweeps as a hyperplane — the last vertex it touches is optimal.

[Figure: feasible region of a small LP with constraint lines and objective iso-lines. Optimal: (2,2) with objective value 4. Blue dashed = constraints. Green solid = optimal iso-line ($c \cdot x = 4$).]

Suppose you want to minimize cost subject to constraints. A government building a road wants minimum cost, but must respect budget limits and land area. Economic theory says: instead of directly solving the constrained problem, you can price the constraints — assign a cost per unit of constraint violation — and then solve a simpler unconstrained problem. If you price constraints correctly, the constrained and unconstrained optima coincide. This is Lagrangian duality, and the optimal prices are the KKT multipliers. The SVM, LASSO, and LP all become their most revealing forms in the dual — because the dual variable structure makes the solution's geometry visible.

The Optimization Problem and Its Lagrangian

The standard form optimization problem:

$$
\begin{aligned}
\min_{x \in \mathbb{R}^n} \quad & f_0(x) \\
\text{s.t.} \quad & f_i(x) \leq 0, \quad i = 1, \ldots, m \\
& h_j(x) = 0, \quad j = 1, \ldots, p.
\end{aligned}
$$

The Lagrangian augments the objective with constraint penalties:

$$\mathcal{L}(x, \lambda, \nu) = f_0(x) + \sum_{i=1}^m \lambda_i f_i(x) + \sum_{j=1}^p \nu_j h_j(x),$$

where $\lambda \in \mathbb{R}^m_{\geq 0}$ are dual variables (or Lagrange multipliers) for the inequality constraints and $\nu \in \mathbb{R}^p$ for the equality constraints. The dual variables can be interpreted as prices: $\lambda_i$ is the cost per unit of violating constraint $i$.

Key property. For any primal feasible $x$ (satisfying all constraints) and any $\lambda \geq 0$, $\nu$:

$$f_0(x) \geq \mathcal{L}(x, \lambda, \nu) \geq \inf_{x'} \mathcal{L}(x', \lambda, \nu).$$

The Dual Function and Dual Problem

The Lagrange dual function is:

$$g(\lambda, \nu) = \inf_{x \in \mathbb{R}^n} \mathcal{L}(x, \lambda, \nu).$$

Critical observation. $g$ is always concave (an infimum of functions affine in $(\lambda, \nu)$), regardless of whether $f_0, f_i$ are convex. This makes the dual problem always a concave maximization — which is convex after negation. This is the algebraic reason duality is so powerful: even if the primal is non-convex, you can always construct a convex dual. The dual may not be tight (strong duality may fail), but it always provides a lower bound, and the dual problem itself is convex (though evaluating $g$ may still require a global minimization over $x$).
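A quick numerical sanity check of this concavity, on a toy problem of my own choosing (not from the notes): approximate $g$ on a grid for a non-convex primal — minimize $x^4 - 3x^2$ subject to $x \leq 0.5$ — and verify midpoint concavity. The grid makes $g$ a minimum of finitely many functions affine in $\lambda$, which is exactly concave.

```python
import numpy as np

# Non-convex toy primal (hypothetical): min x^4 - 3x^2  s.t.  x - 0.5 <= 0.
xs = np.linspace(-3, 3, 2001)          # grid approximation of the inf over x

def g(lam):
    # Dual function: pointwise minimum over x of the Lagrangian.
    return np.min(xs**4 - 3 * xs**2 + lam * (xs - 0.5))

lams = np.linspace(0, 10, 201)
gs = np.array([g(l) for l in lams])

# Midpoint concavity: g((l1+l2)/2) >= (g(l1) + g(l2)) / 2 for all pairs.
mid = np.array([g((l1 + l2) / 2) for l1, l2 in zip(lams[:-1], lams[1:])])
assert np.all(mid >= (gs[:-1] + gs[1:]) / 2 - 1e-9)
```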

The Lagrange dual problem is:

$$\max_{\lambda \geq 0,\, \nu} \; g(\lambda, \nu).$$

The dual optimal value $d^*$ and primal optimal value $p^*$ satisfy weak duality: $d^* \leq p^*$. The duality gap is $p^* - d^* \geq 0$.
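Weak duality can be checked on a one-variable sketch (a hypothetical example: minimize $(x-2)^2$ subject to $x \leq 1$, where $p^* = 1$ and the dual function has a closed form):

```python
import numpy as np

# Toy problem (hypothetical): min (x - 2)^2  s.t.  x - 1 <= 0. Primal
# optimum: x* = 1, p* = 1.
p_star = 1.0

def g(lam):
    # Stationarity of the Lagrangian gives the minimizer x = 2 - lam/2;
    # substituting back yields g in closed form.
    x = 2 - lam / 2
    return (x - 2) ** 2 + lam * (x - 1)

# Weak duality: g(lam) <= p* for every lam >= 0.
lams = np.linspace(0, 10, 1001)
assert np.all(g(lams) <= p_star + 1e-12)

# Here the bound is tight at lam* = 2 (Slater holds, e.g. x = 0 is strictly
# feasible), so strong duality gives g(2) = p*.
assert abs(g(2.0) - p_star) < 1e-12
```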

Strong Duality and Slater's Condition

Strong duality: $d^* = p^*$ — the duality gap is zero.

Slater's constraint qualification (CQ). For a convex primal problem ($f_0, f_i$ convex, $h_j$ affine), if there exists a strictly feasible point $\tilde{x}$ with $f_i(\tilde{x}) < 0$ for all $i$ and $h_j(\tilde{x}) = 0$ for all $j$, then strong duality holds.

For LPs, strong duality holds whenever the primal has a finite optimal value — no constraint qualification is needed. For SDPs, strict feasibility (Slater's condition) of the primal or the dual suffices.

Why it matters. With strong duality, you can solve the easier concave dual instead of the possibly harder primal, and the optimal dual variables provide sensitivity information about constraint prices.

KKT Conditions

For a convex problem satisfying regularity conditions (a constraint qualification, e.g., Slater's), a point $x^*$ is optimal if and only if multipliers $(\lambda^*, \nu^*)$ exist satisfying the KKT system:

1. Stationarity: $\nabla f_0(x^*) + \sum_{i=1}^m \lambda_i^* \nabla f_i(x^*) + \sum_{j=1}^p \nu_j^* \nabla h_j(x^*) = 0.$

The gradient of the Lagrangian vanishes at the optimum.

2. Primal feasibility: $f_i(x^*) \leq 0 \quad \forall i, \qquad h_j(x^*) = 0 \quad \forall j.$

3. Dual feasibility: $\lambda_i^* \geq 0 \quad \forall i.$

4. Complementary slackness: $\lambda_i^* f_i(x^*) = 0 \quad \forall i.$

Each inequality constraint is either active ($f_i(x^*) = 0$, so $\lambda_i^*$ can be positive) or inactive ($f_i(x^*) < 0$, so $\lambda_i^* = 0$ — the constraint is irrelevant at the optimum).

For convex problems: KKT conditions are necessary and sufficient for global optimality (assuming strong duality / regularity). Solving the KKT system of equations and inequalities yields a primal–dual optimal pair — unique when the problem is strictly convex.
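The active/inactive dichotomy can be verified on two one-dimensional instances of $\min x^2$ (hypothetical examples chosen for illustration):

```python
# Two instances of min x^2 with one inequality constraint: the KKT system
# distinguishes active from inactive constraints.

# Case A: x >= 1, written f1(x) = 1 - x <= 0. The constraint is active.
# Stationarity 2x - lam = 0 with x* = 1 forces lam* = 2 > 0.
x_a, lam_a = 1.0, 2.0
assert abs(2 * x_a - lam_a) < 1e-12          # stationarity
assert 1 - x_a <= 0 and lam_a >= 0           # primal/dual feasibility
assert abs(lam_a * (1 - x_a)) < 1e-12        # complementary slackness

# Case B: x >= -1, written f1(x) = -1 - x <= 0. The unconstrained minimum
# x = 0 is feasible, so the constraint is inactive and lam* = 0.
x_b, lam_b = 0.0, 0.0
assert abs(2 * x_b - lam_b) < 1e-12          # stationarity
assert -1 - x_b < 0 and lam_b == 0           # inactive => multiplier zero
assert lam_b * (-1 - x_b) == 0               # complementary slackness
```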

Sensitivity Analysis

From the KKT conditions, shadow prices: if the right-hand side of constraint $i$ is perturbed by $\epsilon_i$ (i.e., $f_i(x) \leq \epsilon_i$), the optimal value changes as:

$$\frac{\partial p^*}{\partial \epsilon_i}\bigg|_{\epsilon=0} = -\lambda_i^*.$$

A large $\lambda_i^*$ means constraint $i$ is "expensive" — relaxing it slightly would significantly improve the objective. This is the economic interpretation: dual variables are marginal values of resources.
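The shadow-price formula can be sanity-checked by finite differences on a toy problem (hypothetical: minimize $(x-2)^2$ subject to $x \leq 1 + \epsilon$, where the unperturbed multiplier is $\lambda^* = 2$):

```python
import numpy as np

# Perturbed toy problem: min (x - 2)^2  s.t.  x <= 1 + eps. The optimum sits
# on the boundary for eps < 1, so p*(eps) = (1 + eps - 2)^2 in closed form.
lam_star = 2.0

def p_star(eps):
    return (1 + eps - 2) ** 2

# Finite-difference slope of p* at eps = 0 should equal -lam*.
h = 1e-6
slope = (p_star(h) - p_star(-h)) / (2 * h)
assert abs(slope - (-lam_star)) < 1e-6
```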

Linear Programming Duality

The LP primal and dual form a symmetric pair:

$$\text{Primal:} \quad \min_{x}\, c^T x \;\;\text{s.t.}\;\; Ax = b,\ x \geq 0 \qquad \longleftrightarrow \qquad \text{Dual:} \quad \max_{\nu}\, b^T \nu \;\;\text{s.t.}\;\; A^T \nu \leq c.$$

The dual constraint $A^T\nu \leq c$ says that the dual variables price the resources consistently with the objective costs. Complementary slackness in the LP context: $x_j^*(c_j - A_j^T \nu^*) = 0$ for all $j$ — either variable $j$ is zero (nonbasic) or its reduced cost is zero (basic).

Strong duality for LP (Dantzig's theorem): if both primal and dual are feasible, $p^* = d^*$ with no duality gap.
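Both sides of the pair can be solved directly, e.g. with SciPy's `linprog` on a small hypothetical instance (a sketch, not a production setup):

```python
import numpy as np
from scipy.optimize import linprog

# Primal: min c^T x  s.t.  Ax = b, x >= 0  (tiny hypothetical instance).
c = np.array([1.0, 2.0])
A = np.array([[1.0, 1.0]])
b = np.array([1.0])

primal = linprog(c, A_eq=A, b_eq=b, bounds=[(0, None)] * 2, method="highs")

# Dual: max b^T nu  s.t.  A^T nu <= c, nu free. linprog minimizes, so negate.
dual = linprog(-b, A_ub=A.T, b_ub=c, bounds=[(None, None)], method="highs")

p_star = primal.fun
d_star = -dual.fun
assert abs(p_star - d_star) < 1e-9   # strong duality: zero gap
```

Here the primal optimum is $x^* = (1, 0)$ with $p^* = 1$, matched by the dual optimum $\nu^* = 1$.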

Worked Examples

Example 1: SVM Dual Derivation

The hard-margin SVM primal: minimize $\frac{1}{2}\|w\|^2$ subject to $y_i(w^T x_i + b) \geq 1$ for all $i$. Rewrite as $f_i(w,b) = 1 - y_i(w^T x_i + b) \leq 0$.

Lagrangian:

$$\mathcal{L}(w, b, \alpha) = \frac{1}{2}\|w\|^2 + \sum_{i=1}^n \alpha_i\bigl(1 - y_i(w^T x_i + b)\bigr), \quad \alpha_i \geq 0.$$

Stationarity in $w$: $\nabla_w \mathcal{L} = w - \sum_i \alpha_i y_i x_i = 0$, so $w^* = \sum_i \alpha_i y_i x_i$.

Stationarity in $b$: $\partial \mathcal{L}/\partial b = -\sum_i \alpha_i y_i = 0$.

Substituting back into $\mathcal{L}$ to get the dual function $g(\alpha)$:

$$g(\alpha) = \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^T x_j.$$

Dual problem: $\max_{\alpha \geq 0,\ \sum_i \alpha_i y_i = 0} g(\alpha)$.

This is a quadratic program in $\alpha$ with $n$ variables (one per training example). The kernel trick replaces $x_i^T x_j$ with $k(x_i, x_j)$ — the dual formulation only requires inner products.

Complementary slackness: $\alpha_i(1 - y_i(w^T x_i + b)) = 0$. Either $\alpha_i = 0$ (example not a support vector) or $y_i(w^T x_i + b) = 1$ (example lies exactly on the margin). The weight vector $w^* = \sum_i \alpha_i y_i x_i$ is a sparse combination — only support vectors contribute.
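The dual QP can be solved on a tiny synthetic data set, e.g. with SciPy's SLSQP solver, to watch the sparsity emerge (the six points and the support threshold are assumptions for illustration):

```python
import numpy as np
from scipy.optimize import minimize

# Tiny linearly separable data set (hypothetical, six points in R^2).
X = np.array([[2., 2.], [3., 3.], [2., 3.], [0., 0.], [1., 0.], [0., 1.]])
y = np.array([1., 1., 1., -1., -1., -1.])

# Matrix Q_ij = y_i y_j x_i^T x_j from the dual objective.
Yx = y[:, None] * X
Q = Yx @ Yx.T

def neg_g(a):
    # Negated dual objective, since scipy minimizes.
    return -(a.sum() - 0.5 * a @ Q @ a)

res = minimize(
    neg_g,
    x0=np.zeros(len(y)),
    method="SLSQP",
    bounds=[(0.0, None)] * len(y),                         # dual feasibility
    constraints=[{"type": "eq", "fun": lambda a: a @ y}],  # sum_i a_i y_i = 0
)
alpha = res.x

# Complementary slackness: most alpha_i vanish; the rest mark support vectors.
support = np.where(alpha > 1e-3)[0]
assert len(support) < len(y)

# Stationarity recovers the weight vector w = sum_i alpha_i y_i x_i.
w = (alpha * y) @ X
```

For this data set only three of the six points end up on the margin, so only three $\alpha_i$ are nonzero.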

Example 2: LASSO via KKT

The LASSO: minimize $\frac{1}{2}\|Ax - b\|^2 + \lambda\|x\|_1$.

Since $\|x\|_1 = \sum_j |x_j|$ is non-smooth, KKT uses subgradients. The stationarity condition at optimum $x^*$:

$$0 \in A^T(Ax^* - b) + \lambda\, \partial\|x^*\|_1,$$

where $\partial\|x\|_1$ is the subdifferential: component $j$ is $\text{sign}(x_j^*)$ if $x_j^* \neq 0$, and any value in $[-1,1]$ if $x_j^* = 0$.

This gives the feature-selection rule: $x_j^* \neq 0$ only if $|[A^T(b - Ax^*)]_j| = \lambda$, while $x_j^* = 0$ whenever $|[A^T(b - Ax^*)]_j| < \lambda$. Only features whose correlation with the residual reaches the threshold $\lambda$ can be selected. This is the KKT subgradient condition made explicit.
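The rule can be verified in the special case of an orthonormal design ($A^TA = I$), where the LASSO solution is the closed-form soft-thresholding of $z = A^Tb$ (a sketch under that assumption, with random data):

```python
import numpy as np

# KKT check for the LASSO with an orthonormal design (A^T A = I), where the
# solution has the closed form x_j = soft_threshold(z_j, lam), z = A^T b.
rng = np.random.default_rng(0)
A, _ = np.linalg.qr(rng.normal(size=(10, 5)))       # orthonormal columns
b = rng.normal(size=10)
lam = 0.3

z = A.T @ b
x = np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)   # soft thresholding

# Correlation of each feature with the residual.
corr = A.T @ (b - A @ x)

active = x != 0
# Active features sit exactly at the threshold: |corr_j| = lam.
assert np.allclose(np.abs(corr[active]), lam)
# Inactive features stay inside: |corr_j| <= lam.
assert np.all(np.abs(corr[~active]) <= lam + 1e-12)
```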

Example 3: Quadratic Program (QP) — Verifying KKT

Minimize $f_0(x) = \frac{1}{2}x^T x = \frac{1}{2}(x_1^2 + x_2^2)$ subject to $x_1 + x_2 = 1$.

Lagrangian: $\mathcal{L}(x, \nu) = \frac{1}{2}\|x\|^2 + \nu(x_1 + x_2 - 1)$.

Stationarity: $x_1 + \nu = 0$, $x_2 + \nu = 0$, so $x_1 = x_2 = -\nu$.

Primal feasibility: $x_1 + x_2 = 1 \Rightarrow -2\nu = 1 \Rightarrow \nu^* = -1/2$.

Solution: $x^* = (1/2, 1/2)$, $\nu^* = -1/2$. Dual function: substituting the minimizer $x_1 = x_2 = -\nu$ into $\mathcal{L}$ gives $g(\nu) = \frac{1}{2}\cdot 2\nu^2 + \nu(-2\nu - 1) = \nu^2 - 2\nu^2 - \nu = -\nu^2 - \nu$.

Maximizing $g$: $dg/d\nu = -2\nu - 1 = 0 \Rightarrow \nu^* = -1/2$, and $d^* = g(-1/2) = -1/4 + 1/2 = 1/4 = f_0(x^*)$. Strong duality holds (the only constraint is affine, so Slater's condition reduces to plain feasibility).
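The whole example can be checked numerically (a minimal sketch restating the closed-form quantities above):

```python
import numpy as np

# Numerical check of Example 3: min (1/2)||x||^2  s.t.  x1 + x2 = 1.
x_star = np.array([0.5, 0.5])
nu_star = -0.5

# Stationarity: x + nu * (1, 1) = 0, and primal feasibility: x1 + x2 = 1.
assert np.allclose(x_star + nu_star * np.ones(2), 0)
assert abs(x_star.sum() - 1) < 1e-12

# Dual function g(nu) = -nu^2 - nu peaks at nu* with value p* = 1/4.
g = lambda nu: -nu**2 - nu
p_star = 0.5 * x_star @ x_star
assert abs(g(nu_star) - p_star) < 1e-12                       # zero gap
assert np.all(g(np.linspace(-3, 2, 1001)) <= p_star + 1e-12)  # weak duality
```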

Connections

Where Your Intuition Breaks

The most common misapplication: using KKT conditions to verify optimality for non-convex problems. Under a constraint qualification, the KKT conditions are necessary for optimality in any smooth constrained problem — but they are not sufficient unless the problem is convex. A non-convex problem can have many KKT points that are saddle points or merely local minima, not global minima. In deep learning, the condition $\nabla_W L = 0$ at a trained weight $W^*$ is KKT stationarity — but that doesn't mean $W^*$ minimizes the loss globally, or even locally in all directions. When SGD "converges" to a stopping criterion of small gradient, you've found an approximate stationary point; whether it's a useful minimum is a separate question, answered by validation performance rather than the optimality equations.

💡Intuition

Complementary slackness is the source of sparsity. The KKT condition $\lambda_i^* f_i(x^*) = 0$ means every inequality constraint is either active ($f_i = 0$) or free ($\lambda_i = 0$). In the SVM, this sparsity lives in the dual: most training examples get $\alpha_i = 0$ (irrelevant to the decision boundary). In LASSO, the analogous sparsity is in the primal: most features get $x_j^* = 0$ because their KKT subgradient condition is satisfied at zero. Sparsity in ML solutions is almost always traceable to a complementary slackness condition at the heart of the optimization.

💡Intuition

The dual is always a concave (convex) problem. No matter how messy the primal — non-convex objective, combinatorial constraints — the Lagrange dual $g(\lambda, \nu) = \inf_x \mathcal{L}(x, \lambda, \nu)$ is always concave in $(\lambda, \nu)$, and thus the dual problem is always convex. This is why dual decomposition and ADMM can be applied to non-convex problems: you get a convex relaxation via the dual. The dual optimum $d^*$ gives a lower bound on the non-convex primal $p^*$, and the duality gap $p^* - d^*$ measures how much you lose from the relaxation.

⚠️Warning

Strong duality requires constraint qualifications. Slater's condition (strict feasibility) is the standard CQ for convex problems. It fails if all feasible points lie on constraint boundaries (e.g., the feasible set is a single point). For non-convex problems, strong duality typically fails: the duality gap $p^* - d^* > 0$ in general. The gap is exploited in SDP relaxations (e.g., MAX-CUT): solve the convex SDP relaxation to get a bound on the NP-hard optimum (an upper bound for a maximization problem like MAX-CUT), and the relative gap measures approximation quality.
