Neural-Path/Notes
40 min

Convex Sets & Functions: Definitions, Examples & Closure Properties

Convexity is the single property that makes optimization tractable at scale. A convex function has no local minima that are not global, no deceptive curvature, and gradients that carry globally meaningful information. This lesson develops the mathematical foundations — sets, functions, operations — that allow you to recognize and exploit convexity throughout ML.

Concepts

[Interactive figure: f(x) = x² with draggable points x₁, x₂; the chord value ½f(x₁) + ½f(x₂) is compared against f(½x₁ + ½x₂), and the Jensen gap is displayed live.]
Drag x₁ / x₂ triangles. For convex f, the chord (blue) always lies above the curve — Jensen's inequality.

A bowl is convex: it has a single lowest point, and rolling a marble anywhere on its surface guarantees it will reach the bottom. Most optimization surfaces in ML are not bowls — they have ridges, saddles, and flat plateaus. Convexity is the mathematical property that makes a surface bowl-like, and it is the single condition that turns optimization from hard to tractable. Gradient descent on a convex function will always find the global minimum; the entire machinery of convergence guarantees depends on it.

Convex Sets

Definition. A set $C \subseteq \mathbb{R}^n$ is convex if for every $x, y \in C$ and every $\theta \in [0,1]$:

$$\theta x + (1-\theta)y \in C.$$

Geometrically: the line segment between any two points in $C$ lies entirely within $C$.

Key examples:

| Set | Convex? | Why |
| --- | --- | --- |
| $\{x : a^T x \leq b\}$ (halfspace) | Yes | Linear constraint defines a half-space |
| $\{x : \|x\|_2 \leq r\}$ (Euclidean ball) | Yes | Triangle inequality |
| $\{x : \|x\|_p \leq r\}$ for $p \geq 1$ | Yes | Minkowski's inequality |
| $\{x : \|x\|_0 \leq k\}$ (sparsity set) | No | Line between sparse vectors can be dense |
| $\mathrm{Sym}^+(n)$ (PSD matrices) | Yes | Closed under nonnegative combinations |
| $\{(x, t) : \|x\|_2 \leq t\}$ (second-order cone) | Yes | SOC constraint in conic programming |
| A finite set with $\geq 2$ elements | No | Interior points of segment not in set |
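The table's Yes/No verdicts can be spot-checked numerically. Below is a minimal NumPy sketch (illustrative only; the helper names are made up) that tests midpoint convexity by sampling: the Euclidean ball passes, the sparsity set fails.

```python
import numpy as np

rng = np.random.default_rng(0)

def midpoint_in_set(membership, sampler, trials=1000):
    """Empirically test midpoint convexity: draw pairs from the set and
    check that their midpoint is also a member."""
    for _ in range(trials):
        x, y = sampler(), sampler()
        if not (membership(x) and membership(y)):
            continue
        if not membership(0.5 * x + 0.5 * y):
            return False
    return True

# Euclidean ball {x : ||x||_2 <= 1}: convex.
ball = lambda x: np.linalg.norm(x) <= 1.0
sample_ball = lambda: rng.uniform(-1, 1, size=3) / np.sqrt(3)  # always inside

# Sparsity set {x : ||x||_0 <= 1} in R^3: not convex.
sparse = lambda x: np.count_nonzero(x) <= 1
def sample_sparse():
    x = np.zeros(3)
    x[rng.integers(3)] = rng.uniform(-1, 1)
    return x

print(midpoint_in_set(ball, sample_ball))      # True
print(midpoint_in_set(sparse, sample_sparse))  # False
```

The sparsity set fails exactly as the table says: the midpoint of two vectors with different nonzero coordinates has two nonzeros.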

Closure properties. Convexity is preserved under:

  • Intersection: $C_1 \cap C_2$ is convex if $C_1, C_2$ are convex.
  • Affine image: $f(C) = \{Ax + b : x \in C\}$ is convex.
  • Inverse affine image: $f^{-1}(C) = \{x : Ax + b \in C\}$ is convex.
  • Cartesian product: $C_1 \times C_2$ is convex.
  • Sum: $C_1 + C_2 = \{x + y : x \in C_1, y \in C_2\}$ is convex.

Unions are generally not convex.

Convex hull. The convex hull $\mathrm{conv}(S)$ of a set $S$ is the smallest convex set containing $S$ — equivalently, the set of all convex combinations $\sum_{i=1}^k \theta_i x_i$ with $x_i \in S$, $\theta_i \geq 0$, $\sum_i \theta_i = 1$.
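A quick numeric illustration of "smallest convex set containing S" (pure NumPy, illustrative): any convex combination of points drawn from a convex set — here the unit ball — stays inside that set, because conv(S) is contained in every convex superset of S.

```python
import numpy as np

rng = np.random.default_rng(0)

# 20 points inside the unit ball (a convex set containing S).
S = rng.standard_normal((20, 2))
S /= np.maximum(np.linalg.norm(S, axis=1, keepdims=True), 1.0)

# Every convex combination of points of S lies in every convex set
# containing S — in particular, in the unit ball itself.
for _ in range(1000):
    theta = rng.dirichlet(np.ones(20))  # theta >= 0, sum(theta) = 1
    p = theta @ S                       # sum_i theta_i x_i
    assert np.linalg.norm(p) <= 1.0 + 1e-12
print("all convex combinations stayed inside the ball")
```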

Convex Functions

Definition. A function $f : C \to \mathbb{R}$ on a convex set $C$ is convex if for every $x, y \in C$ and $\theta \in [0,1]$:

$$f(\theta x + (1-\theta)y) \leq \theta f(x) + (1-\theta) f(y).$$

The right-hand side is the value on the chord from $(x, f(x))$ to $(y, f(y))$. Convexity means the chord lies above the graph.

The chord condition is the minimal algebraic statement of "no deceptive curvature": it says the function cannot dip below the straight-line interpolation between any two points. Any weaker condition would allow local minima that are not global, destroying the tractability guarantee. The epigraph characterization makes this concrete: a function is convex if and only if the set of points above its graph is a convex set — turning a function property into a geometric one.

Epigraph characterization. $f$ is convex if and only if its epigraph $\mathrm{epi}(f) = \{(x, t) : f(x) \leq t\}$ is a convex set. This turns function convexity into set convexity.

First-order characterization (requires differentiability). $f$ is convex iff:

$$f(y) \geq f(x) + \nabla f(x)^T (y - x) \quad \text{for all } x, y \in C.$$

The tangent hyperplane is a global underestimator. This is the supporting-hyperplane property — a key fact used in gradient-based optimality conditions.

Second-order characterization (requires twice-differentiability). $f$ is convex iff $\nabla^2 f(x) \succeq 0$ for all $x \in C$ (the Hessian is positive semidefinite everywhere).
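The second-order test is easy to apply numerically. A small NumPy sketch (illustrative): the least-squares objective has constant Hessian 2AᵀA, whose eigenvalues are all nonnegative, certifying convexity.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 3))
b = rng.standard_normal(5)

# f(x) = ||Ax - b||_2^2 has constant Hessian 2 A^T A (a Gram matrix,
# hence PSD regardless of A).
H = 2 * A.T @ A
eigs = np.linalg.eigvalsh(H)
print(np.all(eigs >= -1e-10))  # True: PSD everywhere, so f is convex
```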

Strong Convexity and L-Smoothness

Two quantitative strengthenings of convexity appear throughout optimization theory:

$\mu$-strong convexity ($\mu > 0$). $f$ is $\mu$-strongly convex if:

$$f(y) \geq f(x) + \nabla f(x)^T (y - x) + \frac{\mu}{2}\|y - x\|^2 \quad \text{for all } x, y.$$

Equivalently (for twice-differentiable $f$): $\nabla^2 f(x) \succeq \mu I$ everywhere. Strong convexity means the function curves at least as fast as $\frac{\mu}{2}\|x\|^2$. Consequence: a unique global minimizer $x^*$ exists, and the suboptimality gap is quadratically bounded below: $f(x) - f(x^*) \geq \frac{\mu}{2}\|x - x^*\|^2$.

$L$-smoothness ($L > 0$). $f$ is $L$-smooth if $\nabla f$ is $L$-Lipschitz:

$$\|\nabla f(x) - \nabla f(y)\| \leq L\|x - y\| \quad \text{for all } x, y.$$

Equivalently (for twice-differentiable $f$): $\nabla^2 f(x) \preceq L I$ everywhere. Consequence: $f$ cannot curve faster than the quadratic $\frac{L}{2}\|x\|^2$, which yields the descent lemma:

$$f(y) \leq f(x) + \nabla f(x)^T (y - x) + \frac{L}{2}\|y - x\|^2.$$

Condition number. When $f$ is both $\mu$-strongly convex and $L$-smooth, the condition number is $\kappa = L/\mu \geq 1$. Large $\kappa$ means the function is much steeper in some directions than others (ill-conditioned), leading to slow convergence of gradient methods.
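The effect of κ is directly observable: gradient descent with step 1/L on a diagonal quadratic contracts the slow coordinate by a factor (1 − μ/L) = (1 − 1/κ) per step. A minimal sketch (NumPy; the constants are illustrative):

```python
import numpy as np

# f(x) = 0.5 * (mu * x1^2 + L * x2^2), gradient descent with step 1/L.
# The stiff coordinate x2 is solved in one step; the slow coordinate x1
# shrinks by (1 - mu/L) = (1 - 1/kappa) per iteration.
mu, L = 1.0, 100.0          # kappa = 100
h = np.array([mu, L])       # diagonal of the Hessian
x = np.array([1.0, 1.0])
for _ in range(500):
    x = x - (1.0 / L) * (h * x)   # gradient of f is diag(h) x
print(np.linalg.norm(x) < 1e-2)   # True: 0.99^500 ≈ 0.0066
```

With κ = 100 it takes hundreds of steps to gain two digits; with κ = 1 a single step would land exactly on the minimizer.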

Jensen's Inequality

Jensen's inequality. For $f$ convex and any random variable $X$ with finite expectation:

$$f(\mathbb{E}[X]) \leq \mathbb{E}[f(X)].$$

For a finite mixture: $f\!\left(\sum_i \theta_i x_i\right) \leq \sum_i \theta_i f(x_i)$ for any $\theta_i \geq 0$, $\sum_i \theta_i = 1$.
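A Monte Carlo sanity check (illustrative; any convex f and any X with finite mean would do):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal(100_000)   # random variable with finite expectation
f = lambda t: t ** 2               # convex

lhs = f(X.mean())                  # f(E[X]) ≈ 0
rhs = f(X).mean()                  # E[f(X)] ≈ 1 (second moment)
print(lhs <= rhs)                  # True: Jensen's inequality
```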

ML applications of Jensen's inequality:

  • Evidence lower bound (ELBO): $\log p(x) = \log \mathbb{E}_{q(z)}[p(x,z)/q(z)] \geq \mathbb{E}_{q(z)}[\log p(x,z)/q(z)]$. The inequality is Jensen applied to the concave function $\log$ — the ELBO is the key object in variational inference.
  • Cross-entropy vs. entropy: $H(p, q) \geq H(p)$ — cross-entropy is always at least the true entropy. This is Jensen applied to $-\log$.
  • Nonnegativity of KL divergence: $\mathrm{KL}(p \,\|\, q) \geq 0$ follows from Jensen applied to the convex function $-\log$.
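The last bullet is easy to verify empirically. A small sketch (NumPy, illustrative) checking KL(p‖q) ≥ 0 on random pairs of distributions:

```python
import numpy as np

rng = np.random.default_rng(0)

def kl(p, q):
    """KL divergence between discrete distributions (assumes p, q > 0)."""
    return float(np.sum(p * np.log(p / q)))

for _ in range(100):
    p = rng.dirichlet(np.ones(5))   # random points on the simplex
    q = rng.dirichlet(np.ones(5))
    assert kl(p, q) >= 0.0          # Gibbs' inequality, via Jensen
print("KL(p||q) >= 0 held on all random draws")
```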

Operations Preserving Convexity

Given convex functions $f_1, \ldots, f_m$, the following are also convex:

| Operation | Result | Condition |
| --- | --- | --- |
| $f_1 + f_2$ | convex | unconditional |
| $\alpha f$ | convex | $\alpha \geq 0$ |
| $f(Ax + b)$ | convex | $f$ convex |
| $\max(f_1, f_2)$ | convex | unconditional |
| $g(f_1, \ldots, f_m)$ | convex | $g$ convex, nondecreasing in each argument; $f_i$ convex |
| $\inf_{y \in C} f(x, y)$ | convex in $x$ | $f$ jointly convex, $C$ convex |
| $f \circ \text{affine}$ | convex | unconditional ($f$ convex) |

Composition rule (crucial). $g \circ f$ is convex when: $f$ convex and $g$ convex nondecreasing, or $f$ concave and $g$ convex nonincreasing. Stacked neural networks violate this rule: once a hidden layer's output is multiplied by signed weights, the required monotonicity no longer holds, so convexity in the parameters is lost.

Conjugate Functions

The conjugate (Legendre–Fenchel transform) of $f$ is:

$$f^*(y) = \sup_{x \in \mathrm{dom}(f)} \left( y^T x - f(x) \right).$$

Properties:

  • $f^*$ is always convex (supremum of affine functions of $y$), even if $f$ is not.
  • Fenchel–Young inequality: $f(x) + f^*(y) \geq x^T y$ for all $x, y$.
  • Biconjugate: $f^{**} = f$ when $f$ is convex and closed (Fenchel–Moreau theorem).
  • If $f(x) = \frac{1}{2} x^T A x$ for positive definite $A$: $f^*(y) = \frac{1}{2} y^T A^{-1} y$.
  • If $f(x) = \|x\|_1$: $f^*(y) = \delta_{\|y\|_\infty \leq 1}$ (indicator of the $\ell_\infty$ unit ball).
  • If $f(x) = \|x\|_2$: $f^*(y) = \delta_{\|y\|_2 \leq 1}$ (indicator of the $\ell_2$ unit ball).

ML role. Conjugate functions appear in dual formulations of SVMs, sparse recovery, and optimal transport. The Wasserstein distance has a dual form involving conjugates.
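The conjugate can be approximated by brute-force maximization over a grid. A sketch (NumPy, illustrative) checking the 1D quadratic case f(x) = ½ax², whose conjugate is y²/(2a) — a scalar instance of the positive-definite-quadratic rule above:

```python
import numpy as np

a = 2.0
xs = np.linspace(-50, 50, 200_001)   # dense grid covering the suprema we need
f = 0.5 * a * xs ** 2

def conjugate(y):
    # f*(y) = sup_x (y x - f(x)), approximated by a max over the grid
    return np.max(y * xs - f)

for y in [-3.0, 0.0, 1.5]:
    closed_form = y ** 2 / (2 * a)   # known conjugate of a 1D quadratic
    print(abs(conjugate(y) - closed_form) < 1e-4)  # True
```

The supremum is attained at x = y/a, well inside the grid for these test values; the grid spacing bounds the approximation error.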

Key Convex Functions in ML

| Function | Domain | Convex? | ML role |
| --- | --- | --- | --- |
| $\|x\|_p$ for $p \geq 1$ | $\mathbb{R}^n$ | Yes | Regularization |
| $x^T A x$ for $A \succeq 0$ | $\mathbb{R}^n$ | Yes | Quadratic objectives |
| $\log \sum_i e^{x_i}$ (log-sum-exp) | $\mathbb{R}^n$ | Yes | Softmax loss |
| $-\log x$ | $\mathbb{R}_{++}$ | Yes | Barrier functions, log-likelihood |
| $-H(p) = \sum_i p_i \log p_i$ (negative entropy) | $\Delta^n$ | Yes | Entropy regularization, KL |
| $\log(1 + e^{-yx})$ | $\mathbb{R}$ | Yes | Logistic loss |
| $\max(0, 1 - yx)$ | $\mathbb{R}$ | Yes | Hinge loss (SVM) |
| $\|Ax - b\|_2^2$ | $\mathbb{R}^n$ | Yes | Least squares |
| $\max_i f_i(x)$ ($f_i$ convex) | $\mathbb{R}^n$ | Yes | Minimax objectives |
| $e^x$ | $\mathbb{R}$ | Yes | Exponential-family log-partition |
| $\mathbf{1}[x \leq 0]$ | $\mathbb{R}$ | No | 0–1 loss (NP-hard to optimize) |

Worked Example

Example 1: Log-Sum-Exp is Convex

Claim. $f(x) = \log \sum_{i=1}^n e^{x_i}$ is convex.

Proof via Hessian. Let $z_i = e^{x_i} / \sum_j e^{x_j}$ (softmax probabilities). Compute:

$$\frac{\partial f}{\partial x_i} = z_i, \qquad \frac{\partial^2 f}{\partial x_i \partial x_j} = z_i(\delta_{ij} - z_j) = [\mathrm{diag}(z) - z z^T]_{ij}.$$

The Hessian is $H = \mathrm{diag}(z) - z z^T$. For any $v \in \mathbb{R}^n$:

$$v^T H v = \sum_i z_i v_i^2 - \left( \sum_i z_i v_i \right)^2 = \mathbb{E}_z[v^2] - (\mathbb{E}_z[v])^2 = \mathrm{Var}_z(v) \geq 0.$$

Since $H \succeq 0$ everywhere, $f$ is convex. The Hessian is also bounded: $H \preceq I$ (since $\mathrm{diag}(z) - z z^T \preceq \mathrm{diag}(z) \preceq I$), so $f$ is $1$-smooth.

Significance. The cross-entropy loss for $k$-class classification is $-x_y + \log \sum_i e^{x_i}$ (log-sum-exp minus the ground-truth logit), which is convex in the logits $x$. This is why softmax regression is a convex problem.
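Both steps of the proof — the Hessian formula and its positive semidefiniteness — can be checked numerically. A sketch (NumPy, illustrative) comparing the analytic Hessian diag(z) − zzᵀ against central finite differences of the softmax gradient:

```python
import numpy as np

def lse(x):
    m = x.max()
    return m + np.log(np.sum(np.exp(x - m)))   # numerically stable log-sum-exp

x = np.array([0.3, -1.2, 2.0])
z = np.exp(x - lse(x))                         # softmax probabilities = gradient

# Analytic Hessian vs. finite differences of the gradient.
H = np.diag(z) - np.outer(z, z)
eps = 1e-5
H_fd = np.zeros((3, 3))
for j in range(3):
    e = np.zeros(3); e[j] = eps
    gp = np.exp(x + e - lse(x + e))            # gradient at x + eps*e_j
    gm = np.exp(x - e - lse(x - e))            # gradient at x - eps*e_j
    H_fd[:, j] = (gp - gm) / (2 * eps)

print(np.max(np.abs(H - H_fd)) < 1e-6)          # analytic matches numeric
print(np.all(np.linalg.eigvalsh(H) >= -1e-12))  # PSD: convexity certified
```

Note that H annihilates the all-ones vector (shifting all logits by a constant leaves the softmax unchanged), so its smallest eigenvalue is exactly zero.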

Example 2: Logistic Loss is Convex

The binary logistic loss for a single example $(x, y)$ with $y \in \{-1, +1\}$ is

$$\ell(w) = \log(1 + e^{-y w^T x}) = f(-y w^T x),$$

where $f(u) = \log(1 + e^u)$ is convex (softplus) and $-y w^T x$ is linear in $w$. Composition of a convex function with an affine map is convex. Summing over the dataset,

$$L(w) = \frac{1}{n} \sum_{i=1}^n \log(1 + e^{-y_i w^T x_i})$$

is convex in $w$. Adding $\frac{\lambda}{2}\|w\|^2$ for $\lambda > 0$ makes the $L_2$-regularized logistic-regression objective $\lambda$-strongly convex, guaranteeing a unique global minimizer.
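The "unique global minimizer" guarantee is observable: gradient descent on the regularized objective reaches the same point from very different initializations. A minimal sketch (NumPy; the data, step size, and iteration count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = np.where(X @ w_true + 0.1 * rng.standard_normal(200) > 0, 1.0, -1.0)

lam = 0.1
def loss(w):
    return np.mean(np.log1p(np.exp(-y * (X @ w)))) + 0.5 * lam * (w @ w)

def grad(w):
    s = 1.0 / (1.0 + np.exp(y * (X @ w)))      # sigmoid(-y * Xw)
    return -(X.T @ (y * s)) / len(y) + lam * w

# Gradient descent from two different starting points converges to the
# same minimizer: strong convexity => unique global minimum.
sols = []
for w0 in (np.zeros(3), np.array([5.0, 5.0, -5.0])):
    w = w0.copy()
    for _ in range(2000):
        w -= 0.5 * grad(w)
    sols.append(w)
print(np.allclose(sols[0], sols[1], atol=1e-6))
```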

Example 3: Why Neural Networks Break Convexity

Consider a two-layer network $f_\theta(x) = W_2\, \sigma(W_1 x)$ with ReLU $\sigma$. The composition rule requires the outer function to be convex and nondecreasing in the inner one (or concave and nonincreasing); multiplying the hidden activations by a signed weight $W_2$ breaks that monotonicity, so convexity in the joint parameters $(W_1, W_2)$ is lost.

Specifically: in 1D with input $x = 1$, the map $(W_1, W_2) \mapsto W_2 \cdot \mathrm{ReLU}(W_1)$ restricted to $W_1 > 0$ equals the bilinear product $W_1 W_2$, whose Hessian has eigenvalues $\pm 1$ — neither convex nor concave. With even a handful of neurons, the landscape develops saddle points and non-global local minima.
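The failure of both convexity and concavity is a two-line computation. A sketch (NumPy, illustrative) exhibiting explicit midpoint violations for g(W₁, W₂) = W₂ · ReLU(W₁ · x) at x = 1:

```python
import numpy as np

relu = lambda t: np.maximum(t, 0.0)

# g(W1, W2) = W2 * relu(W1 * x) at x = 1 — a 1D "two-layer network".
g = lambda w1, w2: w2 * relu(w1 * 1.0)

# Convexity check: g(midpoint) should be <= the chord value. It is not.
p = np.array([ 1.0, -1.0])           # (W1, W2)
q = np.array([-1.0,  1.0])
chord = 0.5 * g(*p) + 0.5 * g(*q)    # 0.5*(-1) + 0.5*0 = -0.5
curve = g(*(0.5 * (p + q)))          # g(0, 0) = 0
print(curve <= chord)                # False: convexity fails

# Concavity check: g(midpoint) should be >= the chord value. Also fails.
p2, q2 = np.array([1.0, 1.0]), np.array([-1.0, -1.0])
chord2 = 0.5 * g(*p2) + 0.5 * g(*q2) # 0.5*1 + 0.5*0 = 0.5
curve2 = g(*(0.5 * (p2 + q2)))       # g(0, 0) = 0
print(curve2 >= chord2)              # False: neither convex nor concave
```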

Connections

Where Your Intuition Breaks

The common mistake: assuming that a smooth, single-valley loss curve seen in 2D or 3D plots is what the high-dimensional loss landscape really looks like. Those 2D cross-sections are cherry-picked directions — usually a random direction or the gradient direction — and they look convex because almost any one-dimensional cross-section of a high-dimensional loss surface will appear roughly convex. The actual landscape has exponentially many directions, and even a single non-convex direction is enough to create saddle points or local minima. This is why visualizations of "the loss landscape" are almost always misleading: the surface that gradient descent navigates in $10^8$ dimensions cannot be faithfully represented in 2D. The success of SGD on non-convex neural network training is not explained by local convexity — it's explained by the benign structure of saddle points in overparameterized models, which is a separate story from convexity.

💡Intuition

Why convexity = tractable optimization. For a convex function, any local minimum is global — this follows directly from the definition. Proof: if $x^*$ is a local minimum but not global, then some $y$ has $f(y) < f(x^*)$. For small $\theta > 0$, the point $\theta y + (1-\theta) x^*$ lies in the neighborhood where $x^*$ is minimal (locality), yet has value $f(\theta y + (1-\theta) x^*) \leq \theta f(y) + (1-\theta) f(x^*) < f(x^*)$ (convexity), contradicting local minimality. QED. For strongly convex functions, the minimizer is also unique.

💡Intuition

Jensen's inequality is why the ELBO works. The variational-inference bound $\log p(x) \geq \mathcal{L}(q) = \mathbb{E}_{q(z)}[\log p(x,z) - \log q(z)]$ is an instance of Jensen's inequality applied to $\log$ (which is concave, so Jensen gives a lower bound). Maximizing the ELBO instead of the intractable log-marginal works because the bound is tight when $q(z) = p(z \mid x)$ — so we are doing the best possible within the variational family $\mathcal{Q}$.
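Both facts — the lower bound and its tightness at the exact posterior — can be verified exactly in a tiny discrete model (NumPy sketch; the five-state latent variable and likelihood values are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny discrete latent-variable model: z in {0,...,4}, one fixed observation x.
p_z = rng.dirichlet(np.ones(5))          # prior p(z)
p_x_given_z = rng.uniform(0.1, 1.0, 5)   # likelihood p(x|z) at this x
p_xz = p_z * p_x_given_z                 # joint p(x, z)
log_px = np.log(p_xz.sum())              # exact log evidence log p(x)

def elbo(q):
    """E_q[log p(x,z) - log q(z)] for a discrete variational distribution q."""
    return float(np.sum(q * (np.log(p_xz) - np.log(q))))

q = rng.dirichlet(np.ones(5))            # arbitrary variational distribution
posterior = p_xz / p_xz.sum()            # q = p(z|x) makes the bound tight

print(elbo(q) <= log_px + 1e-12)               # True: Jensen lower bound
print(abs(elbo(posterior) - log_px) < 1e-12)   # True: tight at the posterior
```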

⚠️Warning

Convexity is not preserved through neural-network compositions. The cross-entropy loss is convex in the logits, logistic regression is convex in the weights, but deep neural networks are non-convex in their weights. The reason is the product structure: $W_2\, \sigma(W_1 x)$ couples the layers multiplicatively in $(W_1, W_2)$, and such products are generally neither convex nor concave in the joint parameter space. This non-convexity is fundamental, not an artifact of how we write the objective.
