Proximal Methods, ADMM & Operator Splitting
Proximal operators extend gradient descent to non-smooth objectives by replacing each gradient step with a well-defined optimization subproblem. ADMM (Alternating Direction Method of Multipliers) decomposes large problems across structure — enabling parallel and distributed optimization, LASSO at scale, and federated learning. Both methods are instances of operator splitting in the theory of monotone operators.
Concepts
Gradient descent works beautifully for smooth functions, but the LASSO objective has a kink at zero, and you can't take a gradient there. The solution is the proximal operator: instead of a gradient step, you take a "soft step" that moves toward minimizing $f$ while not straying too far from where you started. The quadratic term in the proximal operator is exactly the regularization that keeps you nearby, and for the $\ell_1$ norm the proximal operator has a closed-form solution (soft thresholding) that naturally induces sparsity. This is the mechanism behind LASSO, group LASSO, nuclear norm minimization, and every other non-smooth structured regularizer in modern ML.
The Proximal Operator
For a function $f$ and step size $\lambda > 0$, the proximal operator is:

$$\operatorname{prox}_{\lambda f}(v) = \arg\min_x \left( f(x) + \frac{1}{2\lambda}\|x - v\|_2^2 \right)$$

The problem is always strongly convex in $x$ (due to the quadratic term), so the minimizer is unique and well-defined, even when $f$ is non-smooth or extended-real-valued.
Interpretation. Starting from $v$, $\operatorname{prox}_{\lambda f}(v)$ moves toward the minimizer of $f$ while not straying too far from $v$ (the quadratic acts as a proximity constraint). It is a generalized gradient step that respects non-smoothness. The quadratic regularization term is the minimal addition needed to make the subproblem strongly convex, guaranteeing a unique solution even when $f$ is not differentiable or is indicator-valued. Without it, the subproblem might have multiple minimizers or no minimizer at all.
Moreau envelope. The Moreau-Yosida regularization of $f$ is:

$$M_{\lambda f}(v) = \min_x \left( f(x) + \frac{1}{2\lambda}\|x - v\|_2^2 \right)$$

$M_{\lambda f}$ is always differentiable with $\nabla M_{\lambda f}(v) = \frac{1}{\lambda}\left(v - \operatorname{prox}_{\lambda f}(v)\right)$, even when $f$ is not, and it has the same minimizers as $f$. This is the prox operator's key regularizing effect: it smooths a non-smooth function.
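As a concrete check, take $f(x) = |x|$ in one dimension, whose prox is soft thresholding. The sketch below (an illustrative setup, not from the text) verifies the envelope's gradient formula against a finite-difference approximation:

```python
import numpy as np

lam = 0.5

def prox_abs(v):
    # prox of f(x) = |x| with parameter lam: soft thresholding
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def moreau(v):
    # M_lam(v) = f(p) + (1/(2*lam)) * (p - v)^2 evaluated at p = prox(v)
    p = prox_abs(v)
    return np.abs(p) + (p - v) ** 2 / (2 * lam)

v, h = 1.3, 1e-6
fd_grad = (moreau(v + h) - moreau(v - h)) / (2 * h)  # finite-difference gradient
formula = (v - prox_abs(v)) / lam                    # (v - prox(v)) / lam
print(fd_grad, formula)  # both ≈ 1.0: past the kink the envelope is the Huber tail
```

For $|v| > \lambda$ the envelope is $|v| - \lambda/2$, so the gradient is $\pm 1$; the two computations agree.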
Key Proximal Operators
| Regularizer $g(x)$ | $\operatorname{prox}_{\lambda g}(v)$ | Name |
|---|---|---|
| $\lVert x\rVert_1 = \sum_i \lvert x_i\rvert$ | $\operatorname{sign}(v_i)\max(\lvert v_i\rvert - \lambda, 0)$ | Soft thresholding |
| $\tfrac{1}{2}\lVert x\rVert_2^2$ | $\dfrac{v}{1+\lambda}$ | Shrinkage toward origin |
| $\lVert x\rVert_2$ (group) | $\left(1 - \dfrac{\lambda}{\lVert v\rVert_2}\right)_+ v$ | Group soft-threshold |
| $I_C(x)$: $0$ if $x \in C$, else $+\infty$ | $\Pi_C(v)$ (projection) | Projection onto $C$ |
| $\lVert X\rVert_*$ (nuclear norm) | $U\operatorname{diag}\!\big((\sigma_i - \lambda)_+\big)V^\top$ for SVD $X = U\Sigma V^\top$ | Singular value thresholding |
| $\lambda_1\lVert x\rVert_1 + \tfrac{\lambda_2}{2}\lVert x\rVert_2^2$ | $\dfrac{S_{\lambda\lambda_1}(v)}{1 + \lambda\lambda_2}$ | Elastic net prox |
Soft thresholding sets small components to zero and shrinks large ones: $S_\lambda(v)_i = \operatorname{sign}(v_i)\max(|v_i| - \lambda, 0)$. This is the key operation in LASSO solutions.
Projection onto a convex set $C$: $\operatorname{prox}_{I_C}(v) = \Pi_C(v) = \arg\min_{x \in C} \|x - v\|_2$. For the $\ell_2$ ball of radius $r$, $\Pi_C(v) = v$ if $\|v\|_2 \le r$, else $r\,v/\|v\|_2$. For the probability simplex, an efficient sorting-based $O(d \log d)$ algorithm exists.
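The two prox operators above can be sketched in a few lines of NumPy (function names are mine):

```python
import numpy as np

def soft_threshold(v, lam):
    # prox of lam * ||x||_1: componentwise shrink toward zero, zeroing small entries
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def project_l2_ball(v, r=1.0):
    # prox of the indicator of {x : ||x||_2 <= r}: Euclidean projection
    nrm = np.linalg.norm(v)
    return v if nrm <= r else (r / nrm) * v

v = np.array([3.0, -0.2, 0.7])
print(soft_threshold(v, 0.5))  # small entries zeroed, large ones shrunk by 0.5
print(project_l2_ball(v))      # rescaled onto the unit sphere
```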
Proximal Gradient Descent (ISTA)
For composite objectives $F(x) = f(x) + g(x)$ where $f$ is smooth (differentiable) and $g$ is non-smooth (but proxable):
Algorithm (ISTA, the Iterative Shrinkage-Thresholding Algorithm):
- Take a gradient step on the smooth part: $y^k = x^k - \eta \nabla f(x^k)$
- Apply the proximal operator of the non-smooth part: $x^{k+1} = \operatorname{prox}_{\eta g}(y^k)$
Convergence. For convex $L$-smooth $f$ and convex $g$, with $\eta = 1/L$:

$$F(x^k) - F^\star \le \frac{L\|x^0 - x^\star\|_2^2}{2k}$$

The same $O(1/k)$ rate as GD for smooth functions. For LASSO ($f(x) = \frac{1}{2}\|Ax - b\|_2^2$, $g(x) = \lambda\|x\|_1$): $\nabla f(x) = A^\top(Ax - b)$, and the proximal step is soft thresholding.
FISTA (Fast ISTA, Nesterov-accelerated). Adds the same momentum as Nesterov's method:

$$x^{k+1} = \operatorname{prox}_{\eta g}\!\left(y^k - \eta\nabla f(y^k)\right), \qquad t_{k+1} = \frac{1 + \sqrt{1 + 4t_k^2}}{2}, \qquad y^{k+1} = x^{k+1} + \frac{t_k - 1}{t_{k+1}}\left(x^{k+1} - x^k\right)$$

Convergence: $F(x^k) - F^\star = O(1/k^2)$, which is optimal for composite convex problems.
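The momentum schedule above is short enough to write out directly. This sketch uses a generic `grad_f`/`prox_g` interface of my own choosing, plus a tiny 1-D sanity check (both are illustrative assumptions, not from the text):

```python
import numpy as np

def fista(grad_f, prox_g, x0, step, iters=500):
    # FISTA: proximal gradient with Nesterov momentum.
    # grad_f(y): gradient of the smooth part; prox_g(v, s): prox of s*g at v.
    x, y, t = x0.copy(), x0.copy(), 1.0
    for _ in range(iters):
        x_next = prox_g(y - step * grad_f(y), step)          # prox-gradient step
        t_next = (1 + np.sqrt(1 + 4 * t * t)) / 2            # momentum schedule
        y = x_next + ((t - 1) / t_next) * (x_next - x)       # extrapolation
        x, t = x_next, t_next
    return x

# Tiny check: minimize 0.5*(x-3)^2 + |x|, whose minimizer is x* = 2.
sol = fista(lambda x: x - 3.0,
            lambda v, s: np.sign(v) * np.maximum(np.abs(v) - s, 0.0),
            np.array([0.0]), step=1.0)
print(sol)  # → [2.]
```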
ADMM: Alternating Direction Method of Multipliers
For problems with separable structure:

$$\min_{x,\,z} \; f(x) + g(z) \quad \text{subject to} \quad Ax + Bz = c$$

Augmented Lagrangian:

$$L_\rho(x, z, y) = f(x) + g(z) + y^\top(Ax + Bz - c) + \frac{\rho}{2}\|Ax + Bz - c\|_2^2$$

ADMM iteration:

$$x^{k+1} = \arg\min_x L_\rho(x, z^k, y^k)$$
$$z^{k+1} = \arg\min_z L_\rho(x^{k+1}, z, y^k)$$
$$y^{k+1} = y^k + \rho\,(Ax^{k+1} + Bz^{k+1} - c)$$

The $x$ and $z$ updates alternate (hence "alternating direction"), and the dual variable $y$ is updated by the constraint violation. Each subproblem involves only $f$ or $g$ individually, so ADMM decomposes the problem.
Convergence. For $f, g$ closed convex: convergence of the residuals and objective under mild conditions. The parameter $\rho$ is a step size for dual updates and requires tuning.
Scaled form. With $u = y/\rho$ (scaled dual variable):

$$x^{k+1} = \arg\min_x \left( f(x) + \frac{\rho}{2}\|Ax + Bz^k - c + u^k\|_2^2 \right)$$
$$z^{k+1} = \arg\min_z \left( g(z) + \frac{\rho}{2}\|Ax^{k+1} + Bz - c + u^k\|_2^2 \right)$$
$$u^{k+1} = u^k + Ax^{k+1} + Bz^{k+1} - c$$

The $z$-update is often a proximal operator, making computation very efficient.
Operator Splitting
Both ISTA and ADMM are instances of operator splitting — decomposing the problem into simpler subproblems that can be solved separately and combined.
Douglas-Rachford splitting. For $\min_x f(x) + g(x)$ (no separability required):

$$x^{k} = \operatorname{prox}_{\lambda f}(z^k), \qquad z^{k+1} = z^k + \operatorname{prox}_{\lambda g}(2x^k - z^k) - x^k$$

This operates on the sum by alternating prox operators: each step evaluates $\operatorname{prox}_{\lambda f}$ and $\operatorname{prox}_{\lambda g}$ separately, never the prox of the sum. ADMM applied to a consensus problem is equivalent to Douglas-Rachford splitting on the dual.
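A numerical sketch of the Douglas-Rachford iteration, under illustrative assumptions of mine: $f(x) = |x|$ and $g(x) = \tfrac{1}{2}(x-3)^2$ in one dimension, with $\lambda = 1$, whose sum is minimized at $x^\star = 2$:

```python
import numpy as np

def prox_abs(v, lam=1.0):
    # prox of f(x) = |x|: soft thresholding
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def prox_quad(v, lam=1.0):
    # prox of g(x) = 0.5*(x-3)^2: weighted average of v and the minimizer 3
    return (v + lam * 3.0) / (1.0 + lam)

# Douglas-Rachford: x = prox_f(z); z += prox_g(2x - z) - x
z = 0.0
for _ in range(200):
    x = prox_abs(z)
    y = prox_quad(2 * x - z)
    z = z + y - x
print(x)  # ≈ 2.0, the minimizer of |x| + 0.5*(x-3)^2
```

Note that neither prox "sees" the other function; the splitting coordinates them through the auxiliary variable $z$ (which converges to $3$ here, with $x = \operatorname{prox}_{f}(z) = 2$).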
Monotone operator theory. All these algorithms find zeros of sums of maximal monotone operators, a unified framework where convergence follows from the firm nonexpansiveness of the resolvent operators $(I + \lambda A)^{-1}$.
Worked Examples
Example 1: LASSO via ISTA
The LASSO problem: $\min_x \frac{1}{2}\|Ax - b\|_2^2 + \lambda\|x\|_1$.
Here $f(x) = \frac{1}{2}\|Ax - b\|_2^2$ (smooth), $g(x) = \lambda\|x\|_1$ (non-smooth). ISTA:
- $y^k = x^k - \eta\, A^\top(Ax^k - b)$, where $\eta = 1/L$ and $L = \sigma_{\max}(A)^2$
- $x^{k+1} = S_{\eta\lambda}(y^k)$: soft threshold each component
Each iteration costs $O(nd)$ for the matrix-vector products $Ax^k$ and $A^\top(Ax^k - b)$. Total cost to accuracy $\epsilon$: $O(nd/\epsilon)$ with ISTA, $O(nd/\sqrt{\epsilon})$ with FISTA.
The soft thresholding step explicitly zeros out components of $y^k$ with magnitude below $\eta\lambda$, producing sparse iterates. At convergence, the nonzero pattern of $x$ is the selected feature set.
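The two-step iteration above fits in a short script. Problem sizes, the regularization strength, and the sparse ground truth below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 100, 20, 1.0
A = rng.standard_normal((n, d))
x_true = np.zeros(d)
x_true[:3] = [1.0, -2.0, 0.5]                  # sparse ground truth
b = A @ x_true + 0.01 * rng.standard_normal(n)

L = np.linalg.norm(A, 2) ** 2                  # L = sigma_max(A)^2
eta = 1.0 / L
x = np.zeros(d)
for _ in range(2000):
    y = x - eta * (A.T @ (A @ x - b))          # gradient step on the smooth part
    x = np.sign(y) * np.maximum(np.abs(y) - eta * lam, 0.0)  # soft threshold
print(np.nonzero(x)[0])  # typically recovers the true support {0, 1, 2}
```

Every iterate is exactly sparse because the threshold zeroes coordinates outright, rather than merely shrinking them toward zero.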
Example 2: Distributed LASSO via ADMM
Suppose the data matrix is split by rows across $N$ machines: $A = [A_1; \dots; A_N]$, $b = (b_1, \dots, b_N)$. Each machine holds $A_i$ and $b_i$. ADMM for the consensus problem:

$$\min_{x_1, \dots, x_N,\, z} \; \sum_{i=1}^N \frac{1}{2}\|A_i x_i - b_i\|_2^2 + \lambda\|z\|_1 \quad \text{subject to} \quad x_i = z, \; i = 1, \dots, N$$

ADMM iteration:
- $x$-update (parallel, one per machine): $x_i^{k+1} = (A_i^\top A_i + \rho I)^{-1}\left(A_i^\top b_i + \rho\,(z^k - u_i^k)\right)$; each machine solves a ridge regression subproblem locally.
- $z$-update (central aggregation): $z^{k+1} = S_{\lambda/(N\rho)}\!\left(\bar{x}^{k+1} + \bar{u}^k\right)$: soft-threshold the average.
- $u$-update: $u_i^{k+1} = u_i^k + x_i^{k+1} - z^{k+1}$.
Communication: each iteration requires only broadcasting $z$ and receiving $x_i + u_i$ from each machine, i.e. $O(d)$ per machine per iteration for $d$-dimensional $x$. This is significantly cheaper than passing the full dataset.
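The three updates above can be simulated in one process, with a Python list standing in for the machines. Sizes, $\lambda$, $\rho$, and the sparse ground truth are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
N, n, d, lam, rho = 4, 50, 10, 0.5, 1.0
x_true = np.zeros(d)
x_true[:2] = [1.5, -1.0]
parts = []
for _ in range(N):
    Ai = rng.standard_normal((n, d))
    parts.append((Ai, Ai @ x_true + 0.01 * rng.standard_normal(n)))

def soft(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

z = np.zeros(d)
U = [np.zeros(d) for _ in range(N)]
# Each "machine" pre-computes its ridge system A_i^T A_i + rho*I and A_i^T b_i
systems = [(Ai.T @ Ai + rho * np.eye(d), Ai.T @ bi) for Ai, bi in parts]
for _ in range(300):
    # x-update: local ridge solves (parallelizable across machines)
    X = [np.linalg.solve(M, rhs + rho * (z - u)) for (M, rhs), u in zip(systems, U)]
    # z-update: soft-threshold the average of x_i + u_i at lam/(N*rho)
    z = soft(np.mean([x + u for x, u in zip(X, U)], axis=0), lam / (N * rho))
    # u-update: accumulate each machine's consensus violation
    U = [u + x - z for u, x in zip(U, X)]
print(np.nonzero(z)[0])  # typically recovers the true support {0, 1}
```

Note the only quantities crossing machine boundaries are $z$ and the $d$-dimensional vectors $x_i + u_i$, never the raw $(A_i, b_i)$.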
Example 3: Nuclear Norm Minimization
Matrix completion. Minimize the nuclear norm (sum of singular values) subject to observed entries matching the data: $\min_X \|X\|_*$ subject to $X_{ij} = M_{ij}$ for $(i,j) \in \Omega$.
Proximal gradient on the penalized form: $f(X) = \frac{1}{2}\sum_{(i,j)\in\Omega}(X_{ij} - M_{ij})^2$ (the smooth relaxation of the indicator that is zero if observed entries match), $g(X) = \lambda\|X\|_*$.
$\operatorname{prox}_{\lambda\|\cdot\|_*}(Y) = U\operatorname{diag}\!\big((\sigma_i - \lambda)_+\big)V^\top$ for the SVD $Y = U\Sigma V^\top$: singular value thresholding. Compute the SVD of $Y$, soft-threshold the singular values, reconstruct.
This is the convex relaxation of rank minimization, used in collaborative filtering (Netflix Prize) and compressed sensing. With sufficient observations and incoherence conditions, nuclear norm minimization exactly recovers the true low-rank matrix.
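Singular value thresholding itself is three lines around an SVD. The rank-2 test matrix and threshold below are illustrative assumptions:

```python
import numpy as np

def svt(Y, tau):
    # Singular value thresholding: prox of tau * ||X||_* at Y.
    # SVD, soft-threshold the singular values, reconstruct.
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

# Rank-2 matrix plus small noise; thresholding suppresses the noise directions.
rng = np.random.default_rng(2)
M = rng.standard_normal((5, 2)) @ rng.standard_normal((2, 5))
Z = svt(M + 0.1 * rng.standard_normal((5, 5)), tau=1.0)
print(np.linalg.matrix_rank(Z))  # at most 2 here: noise singular values fall below tau
```

Just as soft thresholding produces sparse vectors, thresholding the spectrum produces low-rank matrices.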
Connections
Where Your Intuition Breaks
Proximal gradient (ISTA/FISTA) appears to solve the LASSO problem perfectly: soft thresholding at every step, provable convergence, exact sparsity. The trap is that the step size is still governed by the smoothness constant $L$ of the smooth part. For the standard LASSO, $L = \sigma_{\max}(A)^2$, the largest singular value of the design matrix, squared. For ill-conditioned data (collinear features), $L$ is enormous, the step size $1/L$ is tiny, and ISTA converges glacially. This is why coordinate descent (which handles the geometry per-coordinate) often outperforms ISTA on ill-conditioned LASSO. The proximal operator solves the non-smoothness problem; it does not solve the conditioning problem.
The proximal operator is the resolvent of the subdifferential. In monotone operator theory, $\operatorname{prox}_{\lambda f} = (I + \lambda \partial f)^{-1}$. The operator $\partial f$ (the subdifferential) is monotone (it satisfies $\langle u - v,\, x - y\rangle \ge 0$ for all $u \in \partial f(x)$, $v \in \partial f(y)$). Its resolvent is always single-valued and firmly nonexpansive, which is why the prox operator is unique and well-behaved even for wildly non-smooth $f$ like the $\ell_1$ norm or indicator functions.
ADMM is the workhorse of distributed ML. The $x$-update and $z$-update in ADMM decouple: each involves only $f$ or $g$ separately, and only requires sharing aggregate statistics (averages, not raw data). In federated learning, the $x$-update is local model training on each device, and the $z$-update is server-side aggregation; this is the structure behind FedProx (federated proximal optimization). The parameter $\rho$ controls how much the local models are penalized for diverging from the global average, trading off local accuracy for communication efficiency.
Prox operators must be computable in closed form (or at least cheaply) to be useful. The power of proximal methods comes from the fact that $\operatorname{prox}_{\lambda g}$ can be evaluated cheaply for structured $g$. For $\ell_1$: soft-threshold. For the nuclear norm: an SVD plus singular value thresholding. For total variation (TV): via dynamic programming. But for a generic non-convex $g$, the prox operator is itself a hard optimization problem. This is why proximal methods are generally applied to convex regularizers, even when the loss function is non-convex.