Neural-Path/Notes
40 min

Convex Sets & Functions: Definitions, Examples & Closure Properties

Convexity is the single property that makes optimization tractable at scale. A convex function has no local minima that are not global, no deceptive curvature, and gradients that carry globally meaningful information. This lesson develops the mathematical foundations — sets, functions, operations — that allow you to recognize and exploit convexity throughout ML.

Concepts

[Interactive figure: f(x) = x² with draggable points x₁, x₂; the chord value ½f(x₁) + ½f(x₂) is compared against f(½x₁ + ½x₂), and the Jensen gap is displayed live.]
Drag x₁ / x₂ triangles. For convex f, the chord (blue) always lies above the curve — Jensen's inequality.

A bowl is convex: it has a single lowest point, and rolling a marble anywhere on its surface guarantees it will reach the bottom. Most optimization surfaces in ML are not bowls — they have ridges, saddles, and flat plateaus. Convexity is the mathematical property that makes a surface bowl-like, and it is the single condition that turns optimization from hard to tractable. Gradient descent on a convex function will always find the global minimum; the entire machinery of convergence guarantees depends on it.

Convex Sets

Definition. A set $C \subseteq \mathbb{R}^n$ is convex if for every $x, y \in C$ and every $\theta \in [0,1]$:

$$\theta x + (1-\theta)y \in C.$$

Geometrically: the line segment between any two points in $C$ lies entirely within $C$.

Key examples:

| Set | Convex? | Why |
| --- | --- | --- |
| $\{x : a^T x \leq b\}$ (halfspace) | Yes | Linear constraint defines a half-space |
| $\{x : \|x\|_2 \leq r\}$ (Euclidean ball) | Yes | Triangle inequality |
| $\{x : \|x\|_p \leq r\}$ for $p \geq 1$ | Yes | Minkowski's inequality |
| $\{x : \|x\|_0 \leq k\}$ (sparsity set) | No | Line between sparse vectors can be dense |
| $\mathrm{Sym}^+(n)$ (PSD matrices) | Yes | Closed under nonnegative combinations |
| $\{(x, t) : \|x\|_2 \leq t\}$ (second-order cone) | Yes | SOC constraint in conic programming |
| A finite set with $\geq 2$ elements | No | Interior points of segment not in set |
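The table's Yes/No verdicts can be spot-checked numerically. Below is a minimal NumPy sketch (illustrative only; the helper names are made up) that tests midpoint convexity by sampling: the Euclidean ball passes, the sparsity set fails.

```python
import numpy as np

rng = np.random.default_rng(0)

def midpoint_in_set(membership, sampler, trials=1000):
    """Empirically test midpoint convexity: draw pairs from the set and
    check that their midpoint is also a member."""
    for _ in range(trials):
        x, y = sampler(), sampler()
        if not (membership(x) and membership(y)):
            continue
        if not membership(0.5 * x + 0.5 * y):
            return False
    return True

# Euclidean ball {x : ||x||_2 <= 1}: convex.
ball = lambda x: np.linalg.norm(x) <= 1.0
sample_ball = lambda: rng.uniform(-1, 1, size=3) / np.sqrt(3)  # always inside

# Sparsity set {x : ||x||_0 <= 1} in R^3: not convex.
sparse = lambda x: np.count_nonzero(x) <= 1
def sample_sparse():
    x = np.zeros(3)
    x[rng.integers(3)] = rng.uniform(-1, 1)
    return x

print(midpoint_in_set(ball, sample_ball))      # True
print(midpoint_in_set(sparse, sample_sparse))  # False
```

The sparsity set fails exactly as the table says: the midpoint of two vectors with different nonzero coordinates has two nonzeros.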

Closure properties. Convexity is preserved under:

  • Intersection: $C_1 \cap C_2$ is convex if $C_1, C_2$ are convex.
  • Affine image: $f(C) = \{Ax + b : x \in C\}$ is convex.
  • Inverse affine image: $f^{-1}(C) = \{x : Ax + b \in C\}$ is convex.
  • Cartesian product: $C_1 \times C_2$ is convex.
  • Sum: $C_1 + C_2 = \{x + y : x \in C_1, y \in C_2\}$ is convex.

Unions are generally not convex.

Convex hull. The convex hull $\mathrm{conv}(S)$ of a set $S$ is the smallest convex set containing $S$ — equivalently, the set of all convex combinations $\sum_{i=1}^k \theta_i x_i$ with $x_i \in S$, $\theta_i \geq 0$, $\sum_i \theta_i = 1$.
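A quick numeric illustration of "smallest convex set containing S" (pure NumPy, illustrative): any convex combination of points drawn from a convex set — here the unit ball — stays inside that set, because conv(S) is contained in every convex superset of S.

```python
import numpy as np

rng = np.random.default_rng(0)

# 20 points inside the unit ball (a convex set containing S).
S = rng.standard_normal((20, 2))
S /= np.maximum(np.linalg.norm(S, axis=1, keepdims=True), 1.0)

# Every convex combination of points of S lies in every convex set
# containing S — in particular, in the unit ball itself.
for _ in range(1000):
    theta = rng.dirichlet(np.ones(20))  # theta >= 0, sum(theta) = 1
    p = theta @ S                       # sum_i theta_i x_i
    assert np.linalg.norm(p) <= 1.0 + 1e-12
print("all convex combinations stayed inside the ball")
```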

Convex Functions

Definition. A function $f : C \to \mathbb{R}$ on a convex set $C$ is convex if for every $x, y \in C$ and $\theta \in [0,1]$:

$$f(\theta x + (1-\theta)y) \leq \theta f(x) + (1-\theta) f(y).$$

The right-hand side is the value on the chord from $(x, f(x))$ to $(y, f(y))$. Convexity means the chord lies above the graph.

The chord condition is the minimal algebraic statement of "no deceptive curvature": it says the function cannot dip below the straight-line interpolation between any two points. Any weaker condition would allow local minima that are not global, destroying the tractability guarantee. The epigraph characterization makes this concrete: a function is convex if and only if the set of points above its graph is a convex set — turning a function property into a geometric one.

Epigraph characterization. $f$ is convex if and only if its epigraph $\mathrm{epi}(f) = \{(x, t) : f(x) \leq t\}$ is a convex set. This turns function convexity into set convexity.

First-order characterization (requires differentiability). $f$ is convex iff:

$$f(y) \geq f(x) + \nabla f(x)^T (y - x) \quad \text{for all } x, y \in C.$$

The tangent hyperplane is a global underestimator. This is the supporting-hyperplane property — a key fact used in gradient-based optimality conditions.

Second-order characterization (requires twice-differentiability). $f$ is convex iff $\nabla^2 f(x) \succeq 0$ for all $x \in C$ (the Hessian is positive semidefinite everywhere).
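The second-order test is easy to apply numerically. A small NumPy sketch (illustrative): the least-squares objective has constant Hessian 2AᵀA, whose eigenvalues are all nonnegative, certifying convexity.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 3))
b = rng.standard_normal(5)

# f(x) = ||Ax - b||_2^2 has constant Hessian 2 A^T A (a Gram matrix,
# hence PSD regardless of A).
H = 2 * A.T @ A
eigs = np.linalg.eigvalsh(H)
print(np.all(eigs >= -1e-10))  # True: PSD everywhere, so f is convex
```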

Strong Convexity and L-Smoothness

Two quantitative strengthenings of convexity appear throughout optimization theory:

$\mu$-strong convexity ($\mu > 0$). $f$ is $\mu$-strongly convex if:

$$f(y) \geq f(x) + \nabla f(x)^T (y - x) + \frac{\mu}{2}\|y - x\|^2 \quad \text{for all } x, y.$$

Equivalently (for twice-differentiable $f$): $\nabla^2 f(x) \succeq \mu I$ everywhere. Strong convexity means the function curves at least as fast as $\frac{\mu}{2}\|x\|^2$. Consequence: a unique global minimizer $x^*$ exists, and the suboptimality gap is quadratically bounded below: $f(x) - f(x^*) \geq \frac{\mu}{2}\|x - x^*\|^2$.

$L$-smoothness ($L > 0$). $f$ is $L$-smooth if $\nabla f$ is $L$-Lipschitz:

$$\|\nabla f(x) - \nabla f(y)\| \leq L\|x - y\| \quad \text{for all } x, y.$$

Equivalently (for twice-differentiable $f$): $\nabla^2 f(x) \preceq L I$ everywhere. Consequence: $f$ cannot curve faster than the quadratic $\frac{L}{2}\|x\|^2$, which yields the descent lemma:

$$f(y) \leq f(x) + \nabla f(x)^T (y - x) + \frac{L}{2}\|y - x\|^2.$$

Condition number. When $f$ is both $\mu$-strongly convex and $L$-smooth, the condition number is $\kappa = L/\mu \geq 1$. Large $\kappa$ means the function is much steeper in some directions than others (ill-conditioned), leading to slow convergence of gradient methods.
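The effect of κ is directly observable: gradient descent with step 1/L on a diagonal quadratic contracts the slow coordinate by a factor (1 − μ/L) = (1 − 1/κ) per step. A minimal sketch (NumPy; the constants are illustrative):

```python
import numpy as np

# f(x) = 0.5 * (mu * x1^2 + L * x2^2), gradient descent with step 1/L.
# The stiff coordinate x2 is solved in one step; the slow coordinate x1
# shrinks by (1 - mu/L) = (1 - 1/kappa) per iteration.
mu, L = 1.0, 100.0          # kappa = 100
h = np.array([mu, L])       # diagonal of the Hessian
x = np.array([1.0, 1.0])
for _ in range(500):
    x = x - (1.0 / L) * (h * x)   # gradient of f is diag(h) x
print(np.linalg.norm(x) < 1e-2)   # True: 0.99^500 ≈ 0.0066
```

With κ = 100 it takes hundreds of steps to gain two digits; with κ = 1 a single step would land exactly on the minimizer.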

Jensen's Inequality

Jensen's inequality. For $f$ convex and any random variable $X$ with finite expectation:

$$f(\mathbb{E}[X]) \leq \mathbb{E}[f(X)].$$

For a finite mixture: $f\!\left(\sum_i \theta_i x_i\right) \leq \sum_i \theta_i f(x_i)$ for any $\theta_i \geq 0$, $\sum_i \theta_i = 1$.
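A Monte Carlo sanity check (illustrative; any convex f and any X with finite mean would do):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal(100_000)   # random variable with finite expectation
f = lambda t: t ** 2               # convex

lhs = f(X.mean())                  # f(E[X]) ≈ 0
rhs = f(X).mean()                  # E[f(X)] ≈ 1 (second moment)
print(lhs <= rhs)                  # True: Jensen's inequality
```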

ML applications of Jensen's inequality:

  • Evidence lower bound (ELBO): $\log p(x) = \log \mathbb{E}_{q(z)}[p(x,z)/q(z)] \geq \mathbb{E}_{q(z)}[\log p(x,z)/q(z)]$. The inequality is Jensen applied to the concave function $\log$ — the ELBO is the key object in variational inference.
  • Cross-entropy vs. entropy: $H(p, q) \geq H(p)$ — cross-entropy is always at least the true entropy. This is Jensen applied to $-\log$.
  • Nonnegativity of KL divergence: $\mathrm{KL}(p \,\|\, q) \geq 0$ follows from Jensen applied to the convex function $-\log$.
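The last bullet is easy to verify empirically. A small sketch (NumPy, illustrative) checking KL(p‖q) ≥ 0 on random pairs of distributions:

```python
import numpy as np

rng = np.random.default_rng(0)

def kl(p, q):
    """KL divergence between discrete distributions (assumes p, q > 0)."""
    return float(np.sum(p * np.log(p / q)))

for _ in range(100):
    p = rng.dirichlet(np.ones(5))   # random points on the simplex
    q = rng.dirichlet(np.ones(5))
    assert kl(p, q) >= 0.0          # Gibbs' inequality, via Jensen
print("KL(p||q) >= 0 held on all random draws")
```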

Operations Preserving Convexity

Given convex functions $f_1, \ldots, f_m$, the following are also convex:

| Operation | Result | Condition |
| --- | --- | --- |
| $f_1 + f_2$ | convex | unconditional |
| $\alpha f$ | convex | $\alpha \geq 0$ |
| $f(Ax + b)$ | convex | $f$ convex |
| $\max(f_1, f_2)$ | convex | unconditional |
| $g(f_1, \ldots, f_m)$ | convex | $g$ convex, nondecreasing in each argument; $f_i$ convex |
| $\inf_{y \in C} f(x, y)$ | convex in $x$ | $f$ jointly convex, $C$ convex |
| $f \circ \text{affine}$ | convex | unconditional ($f$ convex) |

Composition rule (crucial). $g \circ f$ is convex when: $f$ convex and $g$ convex nondecreasing, or $f$ concave and $g$ convex nonincreasing. Stacked neural networks violate this rule: once a hidden layer's output is multiplied by signed weights, the required monotonicity no longer holds, so convexity in the parameters is lost.

Conjugate Functions

The conjugate (Legendre–Fenchel transform) of $f$ is:

$$f^*(y) = \sup_{x \in \mathrm{dom}(f)} \left( y^T x - f(x) \right).$$

Properties:

  • $f^*$ is always convex (supremum of affine functions of $y$), even if $f$ is not.
  • Fenchel–Young inequality: $f(x) + f^*(y) \geq x^T y$ for all $x, y$.
  • Biconjugate: $f^{**} = f$ when $f$ is convex and closed (Fenchel–Moreau theorem).
  • If $f(x) = \frac{1}{2} x^T A x$ for positive definite $A$: $f^*(y) = \frac{1}{2} y^T A^{-1} y$.
  • If $f(x) = \|x\|_1$: $f^*(y) = \delta_{\|y\|_\infty \leq 1}$ (indicator of the $\ell_\infty$ unit ball).
  • If $f(x) = \|x\|_2$: $f^*(y) = \delta_{\|y\|_2 \leq 1}$ (indicator of the $\ell_2$ unit ball).

ML role. Conjugate functions appear in dual formulations of SVMs, sparse recovery, and optimal transport. The Wasserstein distance has a dual form involving conjugates.
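The conjugate can be approximated by brute-force maximization over a grid. A sketch (NumPy, illustrative) checking the 1D quadratic case f(x) = ½ax², whose conjugate is y²/(2a) — a scalar instance of the positive-definite-quadratic rule above:

```python
import numpy as np

a = 2.0
xs = np.linspace(-50, 50, 200_001)   # dense grid covering the suprema we need
f = 0.5 * a * xs ** 2

def conjugate(y):
    # f*(y) = sup_x (y x - f(x)), approximated by a max over the grid
    return np.max(y * xs - f)

for y in [-3.0, 0.0, 1.5]:
    closed_form = y ** 2 / (2 * a)   # known conjugate of a 1D quadratic
    print(abs(conjugate(y) - closed_form) < 1e-4)  # True
```

The supremum is attained at x = y/a, well inside the grid for these test values; the grid spacing bounds the approximation error.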

Key Convex Functions in ML

| Function | Domain | Convex? | ML role |
| --- | --- | --- | --- |
| $\|x\|_p$ for $p \geq 1$ | $\mathbb{R}^n$ | Yes | Regularization |
| $x^T A x$ for $A \succeq 0$ | $\mathbb{R}^n$ | Yes | Quadratic objectives |
| $\log \sum_i e^{x_i}$ (log-sum-exp) | $\mathbb{R}^n$ | Yes | Softmax loss |
| $-\log x$ | $\mathbb{R}_{++}$ | Yes | Barrier functions, log-likelihood |
| $-H(p) = \sum_i p_i \log p_i$ (negative entropy) | $\Delta^n$ | Yes | Entropy regularization, KL |
| $\log(1 + e^{-yx})$ | $\mathbb{R}$ | Yes | Logistic loss |
| $\max(0, 1 - yx)$ | $\mathbb{R}$ | Yes | Hinge loss (SVM) |
| $\|Ax - b\|_2^2$ | $\mathbb{R}^n$ | Yes | Least squares |
| $\max_i f_i(x)$ ($f_i$ convex) | $\mathbb{R}^n$ | Yes | Minimax objectives |
| $e^x$ | $\mathbb{R}$ | Yes | Exponential-family log-partition |
| $\mathbf{1}[x \leq 0]$ | $\mathbb{R}$ | No | 0–1 loss (NP-hard to optimize) |

Worked Example

Example 1: Log-Sum-Exp is Convex

Claim. $f(x) = \log \sum_{i=1}^n e^{x_i}$ is convex.

Proof via Hessian. Let $z_i = e^{x_i} / \sum_j e^{x_j}$ (softmax probabilities). Compute:

$$\frac{\partial f}{\partial x_i} = z_i, \qquad \frac{\partial^2 f}{\partial x_i \partial x_j} = z_i(\delta_{ij} - z_j) = [\mathrm{diag}(z) - z z^T]_{ij}.$$

The Hessian is $H = \mathrm{diag}(z) - z z^T$. For any $v \in \mathbb{R}^n$:

$$v^T H v = \sum_i z_i v_i^2 - \left( \sum_i z_i v_i \right)^2 = \mathbb{E}_z[v^2] - (\mathbb{E}_z[v])^2 = \mathrm{Var}_z(v) \geq 0.$$

Since $H \succeq 0$ everywhere, $f$ is convex. The Hessian is also bounded: $H \preceq I$ (since $\mathrm{diag}(z) - z z^T \preceq \mathrm{diag}(z) \preceq I$), so $f$ is $1$-smooth.

Significance. The cross-entropy loss for $k$-class classification is $-x_y + \log \sum_i e^{x_i}$ (log-sum-exp minus the ground-truth logit), which is convex in the logits $x$. This is why softmax regression is a convex problem.
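Both steps of the proof — the Hessian formula and its positive semidefiniteness — can be checked numerically. A sketch (NumPy, illustrative) comparing the analytic Hessian diag(z) − zzᵀ against central finite differences of the softmax gradient:

```python
import numpy as np

def lse(x):
    m = x.max()
    return m + np.log(np.sum(np.exp(x - m)))   # numerically stable log-sum-exp

x = np.array([0.3, -1.2, 2.0])
z = np.exp(x - lse(x))                         # softmax probabilities = gradient

# Analytic Hessian vs. finite differences of the gradient.
H = np.diag(z) - np.outer(z, z)
eps = 1e-5
H_fd = np.zeros((3, 3))
for j in range(3):
    e = np.zeros(3); e[j] = eps
    gp = np.exp(x + e - lse(x + e))            # gradient at x + eps*e_j
    gm = np.exp(x - e - lse(x - e))            # gradient at x - eps*e_j
    H_fd[:, j] = (gp - gm) / (2 * eps)

print(np.max(np.abs(H - H_fd)) < 1e-6)          # analytic matches numeric
print(np.all(np.linalg.eigvalsh(H) >= -1e-12))  # PSD: convexity certified
```

Note that H annihilates the all-ones vector (shifting all logits by a constant leaves the softmax unchanged), so its smallest eigenvalue is exactly zero.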

Example 2: Logistic Loss is Convex

The binary logistic loss for a single example $(x, y)$ with $y \in \{-1, +1\}$ is

$$\ell(w) = \log(1 + e^{-y w^T x}) = f(-y w^T x),$$

where $f(u) = \log(1 + e^u)$ is convex (softplus) and $-y w^T x$ is linear in $w$. Composition of a convex function with an affine map is convex. Summing over the dataset,

$$L(w) = \frac{1}{n} \sum_{i=1}^n \log(1 + e^{-y_i w^T x_i})$$

is convex in $w$. Adding $\frac{\lambda}{2}\|w\|^2$ for $\lambda > 0$ makes the $L_2$-regularized logistic-regression objective $\lambda$-strongly convex, guaranteeing a unique global minimizer.
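The "unique global minimizer" guarantee is observable: gradient descent on the regularized objective reaches the same point from very different initializations. A minimal sketch (NumPy; the data, step size, and iteration count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = np.where(X @ w_true + 0.1 * rng.standard_normal(200) > 0, 1.0, -1.0)

lam = 0.1
def loss(w):
    return np.mean(np.log1p(np.exp(-y * (X @ w)))) + 0.5 * lam * (w @ w)

def grad(w):
    s = 1.0 / (1.0 + np.exp(y * (X @ w)))      # sigmoid(-y * Xw)
    return -(X.T @ (y * s)) / len(y) + lam * w

# Gradient descent from two different starting points converges to the
# same minimizer: strong convexity => unique global minimum.
sols = []
for w0 in (np.zeros(3), np.array([5.0, 5.0, -5.0])):
    w = w0.copy()
    for _ in range(2000):
        w -= 0.5 * grad(w)
    sols.append(w)
print(np.allclose(sols[0], sols[1], atol=1e-6))
```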

Example 3: Why Neural Networks Break Convexity

Consider a two-layer network $f_\theta(x) = W_2\, \sigma(W_1 x)$ with ReLU $\sigma$. The composition rule requires the outer function to be convex and nondecreasing in the inner one (or concave and nonincreasing); multiplying the hidden activations by a signed weight $W_2$ breaks that monotonicity, so convexity in the joint parameters $(W_1, W_2)$ is lost.

Specifically: in 1D with input $x = 1$, the map $(W_1, W_2) \mapsto W_2 \cdot \mathrm{ReLU}(W_1)$ restricted to $W_1 > 0$ equals the bilinear product $W_1 W_2$, whose Hessian has eigenvalues $\pm 1$ — neither convex nor concave. With even a handful of neurons, the landscape develops saddle points and non-global local minima.
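The failure of both convexity and concavity is a two-line computation. A sketch (NumPy, illustrative) exhibiting explicit midpoint violations for g(W₁, W₂) = W₂ · ReLU(W₁ · x) at x = 1:

```python
import numpy as np

relu = lambda t: np.maximum(t, 0.0)

# g(W1, W2) = W2 * relu(W1 * x) at x = 1 — a 1D "two-layer network".
g = lambda w1, w2: w2 * relu(w1 * 1.0)

# Convexity check: g(midpoint) should be <= the chord value. It is not.
p = np.array([ 1.0, -1.0])           # (W1, W2)
q = np.array([-1.0,  1.0])
chord = 0.5 * g(*p) + 0.5 * g(*q)    # 0.5*(-1) + 0.5*0 = -0.5
curve = g(*(0.5 * (p + q)))          # g(0, 0) = 0
print(curve <= chord)                # False: convexity fails

# Concavity check: g(midpoint) should be >= the chord value. Also fails.
p2, q2 = np.array([1.0, 1.0]), np.array([-1.0, -1.0])
chord2 = 0.5 * g(*p2) + 0.5 * g(*q2) # 0.5*1 + 0.5*0 = 0.5
curve2 = g(*(0.5 * (p2 + q2)))       # g(0, 0) = 0
print(curve2 >= chord2)              # False: neither convex nor concave
```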

Connections

Where Your Intuition Breaks

The common mistake: assuming that a smooth, single-valley loss curve seen in 2D or 3D plots is what the high-dimensional loss landscape really looks like. Those 2D cross-sections are cherry-picked directions — usually a random direction or the gradient direction — and they look convex because almost any one-dimensional cross-section of a high-dimensional loss surface will appear roughly convex. The actual landscape has exponentially many directions, and even a single non-convex direction is enough to create saddle points or local minima. This is why visualizations of "the loss landscape" are almost always misleading: the surface that gradient descent navigates in $10^8$ dimensions cannot be faithfully represented in 2D. The success of SGD on non-convex neural network training is not explained by local convexity — it's explained by the benign structure of saddle points in overparameterized models, which is a separate story from convexity.

💡Intuition

Why convexity = tractable optimization. For a convex function, any local minimum is global — this follows directly from the definition. Proof: if $x^*$ is a local minimum but not global, then some $y$ has $f(y) < f(x^*)$. For small $\theta > 0$, the point $\theta y + (1-\theta) x^*$ lies in the neighborhood where $x^*$ is minimal (locality), yet has value $f(\theta y + (1-\theta) x^*) \leq \theta f(y) + (1-\theta) f(x^*) < f(x^*)$ (convexity), contradicting local minimality. QED. For strongly convex functions, the minimizer is also unique.

💡Intuition

Jensen's inequality is why the ELBO works. The variational-inference bound $\log p(x) \geq \mathcal{L}(q) = \mathbb{E}_{q(z)}[\log p(x,z) - \log q(z)]$ is an instance of Jensen's inequality applied to $\log$ (which is concave, so Jensen gives a lower bound). Maximizing the ELBO instead of the intractable log-marginal works because the bound is tight when $q(z) = p(z \mid x)$ — so we are doing the best possible within the variational family $\mathcal{Q}$.
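Both facts — the lower bound and its tightness at the exact posterior — can be verified exactly in a tiny discrete model (NumPy sketch; the five-state latent variable and likelihood values are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny discrete latent-variable model: z in {0,...,4}, one fixed observation x.
p_z = rng.dirichlet(np.ones(5))          # prior p(z)
p_x_given_z = rng.uniform(0.1, 1.0, 5)   # likelihood p(x|z) at this x
p_xz = p_z * p_x_given_z                 # joint p(x, z)
log_px = np.log(p_xz.sum())              # exact log evidence log p(x)

def elbo(q):
    """E_q[log p(x,z) - log q(z)] for a discrete variational distribution q."""
    return float(np.sum(q * (np.log(p_xz) - np.log(q))))

q = rng.dirichlet(np.ones(5))            # arbitrary variational distribution
posterior = p_xz / p_xz.sum()            # q = p(z|x) makes the bound tight

print(elbo(q) <= log_px + 1e-12)               # True: Jensen lower bound
print(abs(elbo(posterior) - log_px) < 1e-12)   # True: tight at the posterior
```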

⚠️Warning

Convexity is not preserved through neural-network compositions. The cross-entropy loss is convex in the logits, logistic regression is convex in the weights, but deep neural networks are non-convex in their weights. The reason is the product structure: $W_2\, \sigma(W_1 x)$ couples the layers multiplicatively in $(W_1, W_2)$, and such products are generally neither convex nor concave in the joint parameter space. This non-convexity is fundamental, not an artifact of how we write the objective.
