Proximal Methods, ADMM & Operator Splitting
Proximal operators extend gradient descent to non-smooth objectives by replacing each gradient step with a well-defined optimization subproblem. ADMM (Alternating Direction Method of Multipliers) decomposes large problems across structure — enabling parallel and distributed optimization, LASSO at scale, and federated learning. Both methods are instances of operator splitting in the theory of monotone operators.
Concepts
Gradient descent works beautifully for smooth functions, but the LASSO objective has a kink at zero, and you can't take a gradient there. The solution is the proximal operator: instead of a gradient step, you take a "soft step" that moves toward minimizing $f$ while not straying too far from where you started. The quadratic term in the proximal operator is exactly the regularization that keeps you nearby, and for the $\ell_1$ norm the proximal operator has a closed-form solution (soft thresholding) that naturally induces sparsity. This is the mechanism behind LASSO, group LASSO, nuclear norm minimization, and every other non-smooth structured regularizer in modern ML.
The Proximal Operator
For a function $f$ and step size $\lambda > 0$, the proximal operator is:

$$\operatorname{prox}_{\lambda f}(v) = \arg\min_x \left( f(x) + \frac{1}{2\lambda}\|x - v\|_2^2 \right)$$

The problem is always strongly convex in $x$ (due to the quadratic term), so the minimizer is unique and well-defined, even when $f$ is non-smooth or extended-real-valued.
Interpretation. Starting from $v$, $\operatorname{prox}_{\lambda f}(v)$ moves toward the minimizer of $f$ while not straying too far from $v$ (the quadratic acts as a proximity constraint). It is a generalized gradient step that respects non-smoothness. The quadratic regularization term is the minimal addition needed to make the subproblem strongly convex, guaranteeing a unique solution even when $f$ is not differentiable or is indicator-valued. Without it, the subproblem might have multiple minimizers or no minimizer at all.
Moreau envelope. The Moreau-Yosida regularization of $f$ is:

$$M_{\lambda f}(v) = \min_x \left( f(x) + \frac{1}{2\lambda}\|x - v\|_2^2 \right)$$

$M_{\lambda f}$ is always differentiable with $\nabla M_{\lambda f}(v) = \frac{1}{\lambda}\left(v - \operatorname{prox}_{\lambda f}(v)\right)$, even when $f$ is not, and it has the same minimizers as $f$. This is the prox operator's key regularizing effect: it smooths a non-smooth function.
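As a concrete check, take $f(x) = |x|$ in one dimension, whose prox is soft thresholding. The sketch below (an illustrative setup, not from the text) verifies the envelope's gradient formula against a finite-difference approximation:

```python
import numpy as np

lam = 0.5

def prox_abs(v):
    # prox of f(x) = |x| with parameter lam: soft thresholding
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def moreau(v):
    # M_lam(v) = f(p) + (1/(2*lam)) * (p - v)^2 evaluated at p = prox(v)
    p = prox_abs(v)
    return np.abs(p) + (p - v) ** 2 / (2 * lam)

v, h = 1.3, 1e-6
fd_grad = (moreau(v + h) - moreau(v - h)) / (2 * h)  # finite-difference gradient
formula = (v - prox_abs(v)) / lam                    # (v - prox(v)) / lam
print(fd_grad, formula)  # both ≈ 1.0: past the kink the envelope is the Huber tail
```

For $|v| > \lambda$ the envelope is $|v| - \lambda/2$, so the gradient is $\pm 1$; the two computations agree.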
Key Proximal Operators
| Regularizer $g(x)$ | $\operatorname{prox}_{\lambda g}(v)$ | Name |
|---|---|---|
| $\lVert x\rVert_1 = \sum_i \lvert x_i\rvert$ | $\operatorname{sign}(v_i)\max(\lvert v_i\rvert - \lambda, 0)$ | Soft thresholding |
| $\tfrac{1}{2}\lVert x\rVert_2^2$ | $\dfrac{v}{1+\lambda}$ | Shrinkage toward origin |
| $\lVert x\rVert_2$ (group) | $\left(1 - \dfrac{\lambda}{\lVert v\rVert_2}\right)_+ v$ | Group soft-threshold |
| $I_C(x)$: $0$ if $x \in C$, else $+\infty$ | $\Pi_C(v)$ (projection) | Projection onto $C$ |
| $\lVert X\rVert_*$ (nuclear norm) | $U\operatorname{diag}\!\big((\sigma_i - \lambda)_+\big)V^\top$ for SVD $X = U\Sigma V^\top$ | Singular value thresholding |
| $\lambda_1\lVert x\rVert_1 + \tfrac{\lambda_2}{2}\lVert x\rVert_2^2$ | $\dfrac{S_{\lambda\lambda_1}(v)}{1 + \lambda\lambda_2}$ | Elastic net prox |
Soft thresholding sets small components to zero and shrinks large ones: $S_\lambda(v)_i = \operatorname{sign}(v_i)\max(|v_i| - \lambda, 0)$. This is the key operation in LASSO solutions.
Projection onto a convex set $C$: $\operatorname{prox}_{I_C}(v) = \Pi_C(v) = \arg\min_{x \in C} \|x - v\|_2$. For the $\ell_2$ ball of radius $r$, $\Pi_C(v) = v$ if $\|v\|_2 \le r$, else $r\,v/\|v\|_2$. For the probability simplex, an efficient sorting-based $O(d \log d)$ algorithm exists.
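The two prox operators above can be sketched in a few lines of NumPy (function names are mine):

```python
import numpy as np

def soft_threshold(v, lam):
    # prox of lam * ||x||_1: componentwise shrink toward zero, zeroing small entries
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def project_l2_ball(v, r=1.0):
    # prox of the indicator of {x : ||x||_2 <= r}: Euclidean projection
    nrm = np.linalg.norm(v)
    return v if nrm <= r else (r / nrm) * v

v = np.array([3.0, -0.2, 0.7])
print(soft_threshold(v, 0.5))  # small entries zeroed, large ones shrunk by 0.5
print(project_l2_ball(v))      # rescaled onto the unit sphere
```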
Proximal Gradient Descent (ISTA)
For composite objectives $F(x) = f(x) + g(x)$ where $f$ is smooth (differentiable) and $g$ is non-smooth (but proxable):
Algorithm (ISTA, the Iterative Shrinkage-Thresholding Algorithm):
- Take a gradient step on the smooth part: $y^k = x^k - \eta \nabla f(x^k)$
- Apply the proximal operator of the non-smooth part: $x^{k+1} = \operatorname{prox}_{\eta g}(y^k)$
Convergence. For convex $L$-smooth $f$ and convex $g$, with $\eta = 1/L$:

$$F(x^k) - F^\star \le \frac{L\|x^0 - x^\star\|_2^2}{2k}$$

The same $O(1/k)$ rate as GD for smooth functions. For LASSO ($f(x) = \frac{1}{2}\|Ax - b\|_2^2$, $g(x) = \lambda\|x\|_1$): $\nabla f(x) = A^\top(Ax - b)$, and the proximal step is soft thresholding.
FISTA (Fast ISTA, Nesterov-accelerated). Adds the same momentum as Nesterov's method:

$$x^{k+1} = \operatorname{prox}_{\eta g}\!\left(y^k - \eta\nabla f(y^k)\right), \qquad t_{k+1} = \frac{1 + \sqrt{1 + 4t_k^2}}{2}, \qquad y^{k+1} = x^{k+1} + \frac{t_k - 1}{t_{k+1}}\left(x^{k+1} - x^k\right)$$

Convergence: $F(x^k) - F^\star = O(1/k^2)$, which is optimal for composite convex problems.
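The momentum schedule above is short enough to write out directly. This sketch uses a generic `grad_f`/`prox_g` interface of my own choosing, plus a tiny 1-D sanity check (both are illustrative assumptions, not from the text):

```python
import numpy as np

def fista(grad_f, prox_g, x0, step, iters=500):
    # FISTA: proximal gradient with Nesterov momentum.
    # grad_f(y): gradient of the smooth part; prox_g(v, s): prox of s*g at v.
    x, y, t = x0.copy(), x0.copy(), 1.0
    for _ in range(iters):
        x_next = prox_g(y - step * grad_f(y), step)          # prox-gradient step
        t_next = (1 + np.sqrt(1 + 4 * t * t)) / 2            # momentum schedule
        y = x_next + ((t - 1) / t_next) * (x_next - x)       # extrapolation
        x, t = x_next, t_next
    return x

# Tiny check: minimize 0.5*(x-3)^2 + |x|, whose minimizer is x* = 2.
sol = fista(lambda x: x - 3.0,
            lambda v, s: np.sign(v) * np.maximum(np.abs(v) - s, 0.0),
            np.array([0.0]), step=1.0)
print(sol)  # → [2.]
```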
ADMM: Alternating Direction Method of Multipliers
For problems with separable structure:

$$\min_{x,\,z} \; f(x) + g(z) \quad \text{subject to} \quad Ax + Bz = c$$

Augmented Lagrangian:

$$L_\rho(x, z, y) = f(x) + g(z) + y^\top(Ax + Bz - c) + \frac{\rho}{2}\|Ax + Bz - c\|_2^2$$

ADMM iteration:

$$x^{k+1} = \arg\min_x L_\rho(x, z^k, y^k)$$
$$z^{k+1} = \arg\min_z L_\rho(x^{k+1}, z, y^k)$$
$$y^{k+1} = y^k + \rho\,(Ax^{k+1} + Bz^{k+1} - c)$$

The $x$ and $z$ updates alternate (hence "alternating direction"), and the dual variable $y$ is updated by the constraint violation. Each subproblem involves only $f$ or $g$ individually, so ADMM decomposes the problem.
Convergence. For $f, g$ closed convex: convergence of the residuals and objective under mild conditions. The parameter $\rho$ is a step size for dual updates and requires tuning.
Scaled form. With $u = y/\rho$ (scaled dual variable):

$$x^{k+1} = \arg\min_x \left( f(x) + \frac{\rho}{2}\|Ax + Bz^k - c + u^k\|_2^2 \right)$$
$$z^{k+1} = \arg\min_z \left( g(z) + \frac{\rho}{2}\|Ax^{k+1} + Bz - c + u^k\|_2^2 \right)$$
$$u^{k+1} = u^k + Ax^{k+1} + Bz^{k+1} - c$$

The $z$-update is often a proximal operator, making computation very efficient.
Operator Splitting
Both ISTA and ADMM are instances of operator splitting — decomposing the problem into simpler subproblems that can be solved separately and combined.
Douglas-Rachford splitting. For $\min_x f(x) + g(x)$ (no separability required):

$$x^{k} = \operatorname{prox}_{\lambda f}(z^k), \qquad z^{k+1} = z^k + \operatorname{prox}_{\lambda g}(2x^k - z^k) - x^k$$

This operates on the sum by alternating prox operators: each step evaluates $\operatorname{prox}_{\lambda f}$ and $\operatorname{prox}_{\lambda g}$ separately, never the prox of the sum. ADMM applied to a consensus problem is equivalent to Douglas-Rachford splitting on the dual.
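A numerical sketch of the Douglas-Rachford iteration, under illustrative assumptions of mine: $f(x) = |x|$ and $g(x) = \tfrac{1}{2}(x-3)^2$ in one dimension, with $\lambda = 1$, whose sum is minimized at $x^\star = 2$:

```python
import numpy as np

def prox_abs(v, lam=1.0):
    # prox of f(x) = |x|: soft thresholding
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def prox_quad(v, lam=1.0):
    # prox of g(x) = 0.5*(x-3)^2: weighted average of v and the minimizer 3
    return (v + lam * 3.0) / (1.0 + lam)

# Douglas-Rachford: x = prox_f(z); z += prox_g(2x - z) - x
z = 0.0
for _ in range(200):
    x = prox_abs(z)
    y = prox_quad(2 * x - z)
    z = z + y - x
print(x)  # ≈ 2.0, the minimizer of |x| + 0.5*(x-3)^2
```

Note that neither prox "sees" the other function; the splitting coordinates them through the auxiliary variable $z$ (which converges to $3$ here, with $x = \operatorname{prox}_{f}(z) = 2$).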
Monotone operator theory. All these algorithms find zeros of sums of maximal monotone operators, a unified framework where convergence follows from the firm nonexpansiveness of the resolvent operators $(I + \lambda A)^{-1}$.
Worked Examples
Example 1: LASSO via ISTA
The LASSO problem: $\min_x \frac{1}{2}\|Ax - b\|_2^2 + \lambda\|x\|_1$.
Here $f(x) = \frac{1}{2}\|Ax - b\|_2^2$ (smooth), $g(x) = \lambda\|x\|_1$ (non-smooth). ISTA:
- $y^k = x^k - \eta\, A^\top(Ax^k - b)$, where $\eta = 1/L$ and $L = \sigma_{\max}(A)^2$
- $x^{k+1} = S_{\eta\lambda}(y^k)$: soft threshold each component
Each iteration costs $O(nd)$ for the matrix-vector products $Ax^k$ and $A^\top(Ax^k - b)$. Total cost to accuracy $\epsilon$: $O(nd/\epsilon)$ with ISTA, $O(nd/\sqrt{\epsilon})$ with FISTA.
The soft thresholding step explicitly zeros out components of $y^k$ with magnitude below $\eta\lambda$, producing sparse iterates. At convergence, the nonzero pattern of $x$ is the selected feature set.
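The two-step iteration above fits in a short script. Problem sizes, the regularization strength, and the sparse ground truth below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 100, 20, 1.0
A = rng.standard_normal((n, d))
x_true = np.zeros(d)
x_true[:3] = [1.0, -2.0, 0.5]                  # sparse ground truth
b = A @ x_true + 0.01 * rng.standard_normal(n)

L = np.linalg.norm(A, 2) ** 2                  # L = sigma_max(A)^2
eta = 1.0 / L
x = np.zeros(d)
for _ in range(2000):
    y = x - eta * (A.T @ (A @ x - b))          # gradient step on the smooth part
    x = np.sign(y) * np.maximum(np.abs(y) - eta * lam, 0.0)  # soft threshold
print(np.nonzero(x)[0])  # typically recovers the true support {0, 1, 2}
```

Every iterate is exactly sparse because the threshold zeroes coordinates outright, rather than merely shrinking them toward zero.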
Example 2: Distributed LASSO via ADMM
Suppose the data matrix is split by rows across $N$ machines: $A = [A_1; \dots; A_N]$, $b = (b_1, \dots, b_N)$. Each machine holds $A_i$ and $b_i$. ADMM for the consensus problem:

$$\min_{x_1, \dots, x_N,\, z} \; \sum_{i=1}^N \frac{1}{2}\|A_i x_i - b_i\|_2^2 + \lambda\|z\|_1 \quad \text{subject to} \quad x_i = z, \; i = 1, \dots, N$$

ADMM iteration:
- $x$-update (parallel, one per machine): $x_i^{k+1} = (A_i^\top A_i + \rho I)^{-1}\left(A_i^\top b_i + \rho\,(z^k - u_i^k)\right)$; each machine solves a ridge regression subproblem locally.
- $z$-update (central aggregation): $z^{k+1} = S_{\lambda/(N\rho)}\!\left(\bar{x}^{k+1} + \bar{u}^k\right)$: soft-threshold the average.
- $u$-update: $u_i^{k+1} = u_i^k + x_i^{k+1} - z^{k+1}$.
Communication: each iteration requires only broadcasting $z$ and receiving $x_i + u_i$ from each machine, i.e. $O(d)$ per machine per iteration for $d$-dimensional $x$. This is significantly cheaper than passing the full dataset.
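The three updates above can be simulated in one process, with a Python list standing in for the machines. Sizes, $\lambda$, $\rho$, and the sparse ground truth are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
N, n, d, lam, rho = 4, 50, 10, 0.5, 1.0
x_true = np.zeros(d)
x_true[:2] = [1.5, -1.0]
parts = []
for _ in range(N):
    Ai = rng.standard_normal((n, d))
    parts.append((Ai, Ai @ x_true + 0.01 * rng.standard_normal(n)))

def soft(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

z = np.zeros(d)
U = [np.zeros(d) for _ in range(N)]
# Each "machine" pre-computes its ridge system A_i^T A_i + rho*I and A_i^T b_i
systems = [(Ai.T @ Ai + rho * np.eye(d), Ai.T @ bi) for Ai, bi in parts]
for _ in range(300):
    # x-update: local ridge solves (parallelizable across machines)
    X = [np.linalg.solve(M, rhs + rho * (z - u)) for (M, rhs), u in zip(systems, U)]
    # z-update: soft-threshold the average of x_i + u_i at lam/(N*rho)
    z = soft(np.mean([x + u for x, u in zip(X, U)], axis=0), lam / (N * rho))
    # u-update: accumulate each machine's consensus violation
    U = [u + x - z for u, x in zip(U, X)]
print(np.nonzero(z)[0])  # typically recovers the true support {0, 1}
```

Note the only quantities crossing machine boundaries are $z$ and the $d$-dimensional vectors $x_i + u_i$, never the raw $(A_i, b_i)$.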
Example 3: Nuclear Norm Minimization
Matrix completion. Minimize the nuclear norm (sum of singular values) subject to observed entries matching the data: $\min_X \|X\|_*$ subject to $X_{ij} = M_{ij}$ for $(i,j) \in \Omega$.
Proximal gradient on the penalized form: $f(X) = \frac{1}{2}\sum_{(i,j)\in\Omega}(X_{ij} - M_{ij})^2$ (the smooth relaxation of the indicator that is zero if observed entries match), $g(X) = \lambda\|X\|_*$.
$\operatorname{prox}_{\lambda\|\cdot\|_*}(Y) = U\operatorname{diag}\!\big((\sigma_i - \lambda)_+\big)V^\top$ for the SVD $Y = U\Sigma V^\top$: singular value thresholding. Compute the SVD of $Y$, soft-threshold the singular values, reconstruct.
This is the convex relaxation of rank minimization, used in collaborative filtering (Netflix Prize) and compressed sensing. With sufficient observations and incoherence conditions, nuclear norm minimization exactly recovers the true low-rank matrix.
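Singular value thresholding itself is three lines around an SVD. The rank-2 test matrix and threshold below are illustrative assumptions:

```python
import numpy as np

def svt(Y, tau):
    # Singular value thresholding: prox of tau * ||X||_* at Y.
    # SVD, soft-threshold the singular values, reconstruct.
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

# Rank-2 matrix plus small noise; thresholding suppresses the noise directions.
rng = np.random.default_rng(2)
M = rng.standard_normal((5, 2)) @ rng.standard_normal((2, 5))
Z = svt(M + 0.1 * rng.standard_normal((5, 5)), tau=1.0)
print(np.linalg.matrix_rank(Z))  # at most 2 here: noise singular values fall below tau
```

Just as soft thresholding produces sparse vectors, thresholding the spectrum produces low-rank matrices.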
Connections
Where Your Intuition Breaks
Proximal gradient (ISTA/FISTA) appears to solve the LASSO problem perfectly: soft thresholding at every step, provable convergence, exact sparsity. The trap is that the step size is still governed by the smoothness constant $L$ of the smooth part. For the standard LASSO, $L = \sigma_{\max}(A)^2$, the largest singular value of the design matrix, squared. For ill-conditioned data (collinear features), $L$ is enormous, the step size $1/L$ is tiny, and ISTA converges glacially. This is why coordinate descent (which handles the geometry per-coordinate) often outperforms ISTA on ill-conditioned LASSO. The proximal operator solves the non-smoothness problem; it does not solve the conditioning problem.
The proximal operator is the resolvent of the subdifferential. In monotone operator theory, $\operatorname{prox}_{\lambda f} = (I + \lambda \partial f)^{-1}$. The operator $\partial f$ (the subdifferential) is monotone (it satisfies $\langle u - v,\, x - y\rangle \ge 0$ for all $u \in \partial f(x)$, $v \in \partial f(y)$). Its resolvent is always single-valued and firmly nonexpansive, which is why the prox operator is unique and well-behaved even for wildly non-smooth $f$ like the $\ell_1$ norm or indicator functions.
ADMM is the workhorse of distributed ML. The $x$-update and $z$-update in ADMM decouple: each involves only $f$ or $g$ separately, and only requires sharing aggregate statistics (averages, not raw data). In federated learning, the $x$-update is local model training on each device, and the $z$-update is server-side aggregation; this is the structure behind FedProx (federated proximal optimization). The parameter $\rho$ controls how much the local models are penalized for diverging from the global average, trading off local accuracy for communication efficiency.
Prox operators must be computable in closed form (or at least cheaply) to be useful. The power of proximal methods comes from the fact that $\operatorname{prox}_{\lambda g}$ can be evaluated cheaply for structured $g$. For $\ell_1$: soft-threshold. For the nuclear norm: an SVD plus singular value thresholding. For total variation (TV): via dynamic programming. But for a generic non-convex $g$, the prox operator is itself a hard optimization problem. This is why proximal methods are generally applied to convex regularizers, even when the loss function is non-convex.