Neural-Path/Notes
45 min

Duality: Lagrangians, KKT Conditions & Strong Duality

Lagrangian duality transforms a constrained optimization problem into an unconstrained one by pricing constraint violations. The resulting dual problem always provides a lower bound on the original (primal), and under mild conditions (Slater's) the bound is tight. The KKT conditions are the first-order equations that characterize optimality — and their sparse structure (complementary slackness) is why SVMs have support vectors and why LASSO selects features.

Concepts

The feasible region of an LP is a convex polyhedron. When a finite optimum exists, it is attained at a vertex (possibly along with an entire face). The objective function sweeps as a hyperplane — the last vertex it touches is optimal.

[Figure: feasible region of a small LP with constraint lines and objective iso-lines. Optimal: (2,2) with objective value 4. Blue dashed = constraints. Green solid = optimal iso-line ($c \cdot x = 4$).]

Suppose you want to minimize cost subject to constraints. A government building a road wants minimum cost, but must respect budget limits and land area. Economic theory says: instead of directly solving the constrained problem, you can price the constraints — assign a cost per unit of constraint violation — and then solve a simpler unconstrained problem. If you price constraints correctly, the constrained and unconstrained optima coincide. This is Lagrangian duality, and the optimal prices are the KKT multipliers. The SVM, LASSO, and LP all become their most revealing forms in the dual — because the dual variable structure makes the solution's geometry visible.

The Optimization Problem and Its Lagrangian

The standard form optimization problem:

$$
\begin{aligned}
\min_{x \in \mathbb{R}^n} \quad & f_0(x) \\
\text{s.t.} \quad & f_i(x) \leq 0, \quad i = 1, \ldots, m \\
& h_j(x) = 0, \quad j = 1, \ldots, p.
\end{aligned}
$$

The Lagrangian augments the objective with constraint penalties:

$$\mathcal{L}(x, \lambda, \nu) = f_0(x) + \sum_{i=1}^m \lambda_i f_i(x) + \sum_{j=1}^p \nu_j h_j(x),$$

where $\lambda \in \mathbb{R}^m_{\geq 0}$ are dual variables (or Lagrange multipliers) for the inequality constraints and $\nu \in \mathbb{R}^p$ for the equality constraints. The dual variables can be interpreted as prices: $\lambda_i$ is the cost per unit of violating constraint $i$.

Key property. For any primal feasible $x$ (satisfying all constraints) and any $\lambda \geq 0$, $\nu$:

$$f_0(x) \geq \mathcal{L}(x, \lambda, \nu) \geq \inf_{x'} \mathcal{L}(x', \lambda, \nu).$$

The Dual Function and Dual Problem

The Lagrange dual function is:

$$g(\lambda, \nu) = \inf_{x \in \mathbb{R}^n} \mathcal{L}(x, \lambda, \nu).$$

Critical observation. $g$ is always concave (an infimum of functions affine in $(\lambda, \nu)$), regardless of whether $f_0, f_i$ are convex. This makes the dual problem always a concave maximization — which is convex after negation. This is the algebraic reason duality is so powerful: even if the primal is non-convex, you can always construct a convex dual. The dual may not be tight (strong duality may fail), but it always provides a lower bound, and the dual problem itself is convex (though evaluating $g$ may still require a global minimization over $x$).
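A quick numerical sanity check of this concavity, on a toy problem of my own choosing (not from the notes): approximate $g$ on a grid for a non-convex primal — minimize $x^4 - 3x^2$ subject to $x \leq 0.5$ — and verify midpoint concavity. The grid makes $g$ a minimum of finitely many functions affine in $\lambda$, which is exactly concave.

```python
import numpy as np

# Non-convex toy primal (hypothetical): min x^4 - 3x^2  s.t.  x - 0.5 <= 0.
xs = np.linspace(-3, 3, 2001)          # grid approximation of the inf over x

def g(lam):
    # Dual function: pointwise minimum over x of the Lagrangian.
    return np.min(xs**4 - 3 * xs**2 + lam * (xs - 0.5))

lams = np.linspace(0, 10, 201)
gs = np.array([g(l) for l in lams])

# Midpoint concavity: g((l1+l2)/2) >= (g(l1) + g(l2)) / 2 for all pairs.
mid = np.array([g((l1 + l2) / 2) for l1, l2 in zip(lams[:-1], lams[1:])])
assert np.all(mid >= (gs[:-1] + gs[1:]) / 2 - 1e-9)
```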

The Lagrange dual problem is:

$$\max_{\lambda \geq 0,\, \nu} \; g(\lambda, \nu).$$

The dual optimal value $d^*$ and primal optimal value $p^*$ satisfy weak duality: $d^* \leq p^*$. The duality gap is $p^* - d^* \geq 0$.
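Weak duality can be checked on a one-variable sketch (a hypothetical example: minimize $(x-2)^2$ subject to $x \leq 1$, where $p^* = 1$ and the dual function has a closed form):

```python
import numpy as np

# Toy problem (hypothetical): min (x - 2)^2  s.t.  x - 1 <= 0. Primal
# optimum: x* = 1, p* = 1.
p_star = 1.0

def g(lam):
    # Stationarity of the Lagrangian gives the minimizer x = 2 - lam/2;
    # substituting back yields g in closed form.
    x = 2 - lam / 2
    return (x - 2) ** 2 + lam * (x - 1)

# Weak duality: g(lam) <= p* for every lam >= 0.
lams = np.linspace(0, 10, 1001)
assert np.all(g(lams) <= p_star + 1e-12)

# Here the bound is tight at lam* = 2 (Slater holds, e.g. x = 0 is strictly
# feasible), so strong duality gives g(2) = p*.
assert abs(g(2.0) - p_star) < 1e-12
```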

Strong Duality and Slater's Condition

Strong duality: $d^* = p^*$ — the duality gap is zero.

Slater's constraint qualification (CQ). For a convex primal problem ($f_0, f_i$ convex, $h_j$ affine), if there exists a strictly feasible point $\tilde{x}$ with $f_i(\tilde{x}) < 0$ for all $i$ and $h_j(\tilde{x}) = 0$ for all $j$, then strong duality holds.

For LPs, strong duality holds whenever the primal has a finite optimal value — no constraint qualification is needed. For SDPs, strict feasibility (Slater's condition) of the primal or the dual suffices.

Why it matters. With strong duality, you can solve the easier concave dual instead of the possibly harder primal, and the optimal dual variables provide sensitivity information about constraint prices.

KKT Conditions

For a convex problem satisfying regularity conditions (a constraint qualification, e.g., Slater's), a point $x^*$ is optimal if and only if multipliers $(\lambda^*, \nu^*)$ exist satisfying the KKT system:

1. Stationarity: $\nabla f_0(x^*) + \sum_{i=1}^m \lambda_i^* \nabla f_i(x^*) + \sum_{j=1}^p \nu_j^* \nabla h_j(x^*) = 0.$

The gradient of the Lagrangian vanishes at the optimum.

2. Primal feasibility: $f_i(x^*) \leq 0 \quad \forall i, \qquad h_j(x^*) = 0 \quad \forall j.$

3. Dual feasibility: $\lambda_i^* \geq 0 \quad \forall i.$

4. Complementary slackness: $\lambda_i^* f_i(x^*) = 0 \quad \forall i.$

Each inequality constraint is either active ($f_i(x^*) = 0$, so $\lambda_i^*$ can be positive) or inactive ($f_i(x^*) < 0$, so $\lambda_i^* = 0$ — the constraint is irrelevant at the optimum).

For convex problems: KKT conditions are necessary and sufficient for global optimality (assuming strong duality / regularity). Solving the KKT system of equations and inequalities yields a primal–dual optimal pair — unique when the problem is strictly convex.
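The active/inactive dichotomy can be verified on two one-dimensional instances of $\min x^2$ (hypothetical examples chosen for illustration):

```python
# Two instances of min x^2 with one inequality constraint: the KKT system
# distinguishes active from inactive constraints.

# Case A: x >= 1, written f1(x) = 1 - x <= 0. The constraint is active.
# Stationarity 2x - lam = 0 with x* = 1 forces lam* = 2 > 0.
x_a, lam_a = 1.0, 2.0
assert abs(2 * x_a - lam_a) < 1e-12          # stationarity
assert 1 - x_a <= 0 and lam_a >= 0           # primal/dual feasibility
assert abs(lam_a * (1 - x_a)) < 1e-12        # complementary slackness

# Case B: x >= -1, written f1(x) = -1 - x <= 0. The unconstrained minimum
# x = 0 is feasible, so the constraint is inactive and lam* = 0.
x_b, lam_b = 0.0, 0.0
assert abs(2 * x_b - lam_b) < 1e-12          # stationarity
assert -1 - x_b < 0 and lam_b == 0           # inactive => multiplier zero
assert lam_b * (-1 - x_b) == 0               # complementary slackness
```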

Sensitivity Analysis

From the KKT conditions, shadow prices: if the right-hand side of constraint $i$ is perturbed by $\epsilon_i$ (i.e., $f_i(x) \leq \epsilon_i$), the optimal value changes as:

$$\frac{\partial p^*}{\partial \epsilon_i}\bigg|_{\epsilon=0} = -\lambda_i^*.$$

A large $\lambda_i^*$ means constraint $i$ is "expensive" — relaxing it slightly would significantly improve the objective. This is the economic interpretation: dual variables are marginal values of resources.
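The shadow-price formula can be sanity-checked by finite differences on a toy problem (hypothetical: minimize $(x-2)^2$ subject to $x \leq 1 + \epsilon$, where the unperturbed multiplier is $\lambda^* = 2$):

```python
import numpy as np

# Perturbed toy problem: min (x - 2)^2  s.t.  x <= 1 + eps. The optimum sits
# on the boundary for eps < 1, so p*(eps) = (1 + eps - 2)^2 in closed form.
lam_star = 2.0

def p_star(eps):
    return (1 + eps - 2) ** 2

# Finite-difference slope of p* at eps = 0 should equal -lam*.
h = 1e-6
slope = (p_star(h) - p_star(-h)) / (2 * h)
assert abs(slope - (-lam_star)) < 1e-6
```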

Linear Programming Duality

The LP primal and dual form a symmetric pair:

$$\text{Primal:} \quad \min_{x}\, c^T x \;\;\text{s.t.}\;\; Ax = b,\ x \geq 0 \qquad \longleftrightarrow \qquad \text{Dual:} \quad \max_{\nu}\, b^T \nu \;\;\text{s.t.}\;\; A^T \nu \leq c.$$

The dual constraint $A^T\nu \leq c$ says that the dual variables price the resources consistently with the objective costs. Complementary slackness in the LP context: $x_j^*(c_j - A_j^T \nu^*) = 0$ for all $j$ — either variable $j$ is zero (nonbasic) or its reduced cost is zero (basic).

Strong duality for LP (Dantzig's theorem): if both primal and dual are feasible, $p^* = d^*$ with no duality gap.
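Both sides of the pair can be solved directly, e.g. with SciPy's `linprog` on a small hypothetical instance (a sketch, not a production setup):

```python
import numpy as np
from scipy.optimize import linprog

# Primal: min c^T x  s.t.  Ax = b, x >= 0  (tiny hypothetical instance).
c = np.array([1.0, 2.0])
A = np.array([[1.0, 1.0]])
b = np.array([1.0])

primal = linprog(c, A_eq=A, b_eq=b, bounds=[(0, None)] * 2, method="highs")

# Dual: max b^T nu  s.t.  A^T nu <= c, nu free. linprog minimizes, so negate.
dual = linprog(-b, A_ub=A.T, b_ub=c, bounds=[(None, None)], method="highs")

p_star = primal.fun
d_star = -dual.fun
assert abs(p_star - d_star) < 1e-9   # strong duality: zero gap
```

Here the primal optimum is $x^* = (1, 0)$ with $p^* = 1$, matched by the dual optimum $\nu^* = 1$.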

Worked Examples

Example 1: SVM Dual Derivation

The hard-margin SVM primal: minimize $\frac{1}{2}\|w\|^2$ subject to $y_i(w^T x_i + b) \geq 1$ for all $i$. Rewrite as $f_i(w,b) = 1 - y_i(w^T x_i + b) \leq 0$.

Lagrangian:

$$\mathcal{L}(w, b, \alpha) = \frac{1}{2}\|w\|^2 + \sum_{i=1}^n \alpha_i\bigl(1 - y_i(w^T x_i + b)\bigr), \quad \alpha_i \geq 0.$$

Stationarity in $w$: $\nabla_w \mathcal{L} = w - \sum_i \alpha_i y_i x_i = 0$, so $w^* = \sum_i \alpha_i y_i x_i$.

Stationarity in $b$: $\partial \mathcal{L}/\partial b = -\sum_i \alpha_i y_i = 0$.

Substituting back into $\mathcal{L}$ to get the dual function $g(\alpha)$:

$$g(\alpha) = \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^T x_j.$$

Dual problem: $\max_{\alpha \geq 0,\ \sum_i \alpha_i y_i = 0} g(\alpha)$.

This is a quadratic program in $\alpha$ with $n$ variables (one per training example). The kernel trick replaces $x_i^T x_j$ with $k(x_i, x_j)$ — the dual formulation only requires inner products.

Complementary slackness: $\alpha_i(1 - y_i(w^T x_i + b)) = 0$. Either $\alpha_i = 0$ (example not a support vector) or $y_i(w^T x_i + b) = 1$ (example lies exactly on the margin). The weight vector $w^* = \sum_i \alpha_i y_i x_i$ is a sparse combination — only support vectors contribute.
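The dual QP can be solved on a tiny synthetic data set, e.g. with SciPy's SLSQP solver, to watch the sparsity emerge (the six points and the support threshold are assumptions for illustration):

```python
import numpy as np
from scipy.optimize import minimize

# Tiny linearly separable data set (hypothetical, six points in R^2).
X = np.array([[2., 2.], [3., 3.], [2., 3.], [0., 0.], [1., 0.], [0., 1.]])
y = np.array([1., 1., 1., -1., -1., -1.])

# Matrix Q_ij = y_i y_j x_i^T x_j from the dual objective.
Yx = y[:, None] * X
Q = Yx @ Yx.T

def neg_g(a):
    # Negated dual objective, since scipy minimizes.
    return -(a.sum() - 0.5 * a @ Q @ a)

res = minimize(
    neg_g,
    x0=np.zeros(len(y)),
    method="SLSQP",
    bounds=[(0.0, None)] * len(y),                         # dual feasibility
    constraints=[{"type": "eq", "fun": lambda a: a @ y}],  # sum_i a_i y_i = 0
)
alpha = res.x

# Complementary slackness: most alpha_i vanish; the rest mark support vectors.
support = np.where(alpha > 1e-3)[0]
assert len(support) < len(y)

# Stationarity recovers the weight vector w = sum_i alpha_i y_i x_i.
w = (alpha * y) @ X
```

For this data set only three of the six points end up on the margin, so only three $\alpha_i$ are nonzero.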

Example 2: LASSO via KKT

The LASSO: minimize $\frac{1}{2}\|Ax - b\|^2 + \lambda\|x\|_1$.

Since $\|x\|_1 = \sum_j |x_j|$ is non-smooth, KKT uses subgradients. The stationarity condition at optimum $x^*$:

$$0 \in A^T(Ax^* - b) + \lambda\, \partial\|x^*\|_1,$$

where $\partial\|x\|_1$ is the subdifferential: component $j$ is $\text{sign}(x_j^*)$ if $x_j^* \neq 0$, and any value in $[-1,1]$ if $x_j^* = 0$.

This gives the feature-selection rule: $x_j^* \neq 0$ only if $|[A^T(b - Ax^*)]_j| = \lambda$, while $x_j^* = 0$ whenever $|[A^T(b - Ax^*)]_j| < \lambda$. Only features whose correlation with the residual reaches the threshold $\lambda$ can be selected. This is the KKT subgradient condition made explicit.
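The rule can be verified in the special case of an orthonormal design ($A^TA = I$), where the LASSO solution is the closed-form soft-thresholding of $z = A^Tb$ (a sketch under that assumption, with random data):

```python
import numpy as np

# KKT check for the LASSO with an orthonormal design (A^T A = I), where the
# solution has the closed form x_j = soft_threshold(z_j, lam), z = A^T b.
rng = np.random.default_rng(0)
A, _ = np.linalg.qr(rng.normal(size=(10, 5)))       # orthonormal columns
b = rng.normal(size=10)
lam = 0.3

z = A.T @ b
x = np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)   # soft thresholding

# Correlation of each feature with the residual.
corr = A.T @ (b - A @ x)

active = x != 0
# Active features sit exactly at the threshold: |corr_j| = lam.
assert np.allclose(np.abs(corr[active]), lam)
# Inactive features stay inside: |corr_j| <= lam.
assert np.all(np.abs(corr[~active]) <= lam + 1e-12)
```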

Example 3: Quadratic Program (QP) — Verifying KKT

Minimize $f_0(x) = \frac{1}{2}x^T x = \frac{1}{2}(x_1^2 + x_2^2)$ subject to $x_1 + x_2 = 1$.

Lagrangian: $\mathcal{L}(x, \nu) = \frac{1}{2}\|x\|^2 + \nu(x_1 + x_2 - 1)$.

Stationarity: $x_1 + \nu = 0$, $x_2 + \nu = 0$, so $x_1 = x_2 = -\nu$.

Primal feasibility: $x_1 + x_2 = 1 \Rightarrow -2\nu = 1 \Rightarrow \nu^* = -1/2$.

Solution: $x^* = (1/2, 1/2)$, $\nu^* = -1/2$. Dual function: substituting the minimizer $x_1 = x_2 = -\nu$ into $\mathcal{L}$ gives $g(\nu) = \frac{1}{2}\cdot 2\nu^2 + \nu(-2\nu - 1) = \nu^2 - 2\nu^2 - \nu = -\nu^2 - \nu$.

Maximizing $g$: $dg/d\nu = -2\nu - 1 = 0 \Rightarrow \nu^* = -1/2$, and $d^* = g(-1/2) = -1/4 + 1/2 = 1/4 = f_0(x^*)$. Strong duality holds (the only constraint is affine, so Slater's condition reduces to plain feasibility).
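The whole example can be checked numerically (a minimal sketch restating the closed-form quantities above):

```python
import numpy as np

# Numerical check of Example 3: min (1/2)||x||^2  s.t.  x1 + x2 = 1.
x_star = np.array([0.5, 0.5])
nu_star = -0.5

# Stationarity: x + nu * (1, 1) = 0, and primal feasibility: x1 + x2 = 1.
assert np.allclose(x_star + nu_star * np.ones(2), 0)
assert abs(x_star.sum() - 1) < 1e-12

# Dual function g(nu) = -nu^2 - nu peaks at nu* with value p* = 1/4.
g = lambda nu: -nu**2 - nu
p_star = 0.5 * x_star @ x_star
assert abs(g(nu_star) - p_star) < 1e-12                       # zero gap
assert np.all(g(np.linspace(-3, 2, 1001)) <= p_star + 1e-12)  # weak duality
```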

Connections

Where Your Intuition Breaks

The most common misapplication: using KKT conditions to verify optimality for non-convex problems. Under a constraint qualification, the KKT conditions are necessary for optimality in any smooth constrained problem — but they are not sufficient unless the problem is convex. A non-convex problem can have many KKT points that are saddle points or merely local minima, not global minima. In deep learning, the condition $\nabla_W L = 0$ at a trained weight $W^*$ is KKT stationarity — but that doesn't mean $W^*$ minimizes the loss globally, or even locally in all directions. When SGD "converges" to a stopping criterion of small gradient, you've found an approximate stationary point; whether it's a useful minimum is a separate question, answered by validation performance rather than the optimality equations.

💡Intuition

Complementary slackness is the source of sparsity. The KKT condition $\lambda_i^* f_i(x^*) = 0$ means every inequality constraint is either active ($f_i = 0$) or free ($\lambda_i = 0$). In the SVM, this sparsity lives in the dual: most training examples get $\alpha_i = 0$ (irrelevant to the decision boundary). In LASSO, the analogous sparsity is in the primal: most features get $x_j^* = 0$ because their KKT subgradient condition is satisfied at zero. Sparsity in ML solutions is almost always traceable to a complementary slackness condition at the heart of the optimization.

💡Intuition

The dual is always a concave (convex) problem. No matter how messy the primal — non-convex objective, combinatorial constraints — the Lagrange dual $g(\lambda, \nu) = \inf_x \mathcal{L}(x, \lambda, \nu)$ is always concave in $(\lambda, \nu)$, and thus the dual problem is always convex. This is why dual decomposition and ADMM can be applied to non-convex problems: you get a convex relaxation via the dual. The dual optimum $d^*$ gives a lower bound on the non-convex primal $p^*$, and the duality gap $p^* - d^*$ measures how much you lose from the relaxation.

⚠️Warning

Strong duality requires constraint qualifications. Slater's condition (strict feasibility) is the standard CQ for convex problems. It fails if all feasible points lie on constraint boundaries (e.g., the feasible set is a single point). For non-convex problems, strong duality typically fails: the duality gap $p^* - d^* > 0$ in general. The gap is exploited in SDP relaxations (e.g., MAX-CUT): solve the convex SDP relaxation to get a bound on the NP-hard optimum (an upper bound for a maximization problem like MAX-CUT), and the relative gap measures approximation quality.
