Convex Sets & Functions: Definitions, Examples & Closure Properties
Convexity is the single property that makes optimization tractable at scale. A convex function has no local minima that are not global, no deceptive curvature, and gradients that carry globally meaningful information. This lesson develops the mathematical foundations — sets, functions, operations — that allow you to recognize and exploit convexity throughout ML.
Concepts
A bowl is convex: it has a single lowest point, and rolling a marble anywhere on its surface guarantees it will reach the bottom. Most optimization surfaces in ML are not bowls — they have ridges, saddles, and flat plateaus. Convexity is the mathematical property that makes a surface bowl-like, and it is the single condition that turns optimization from hard to tractable. Gradient descent on a convex function will always find the global minimum; the entire machinery of convergence guarantees depends on it.
Convex Sets
Definition. A set $C \subseteq \mathbb{R}^n$ is convex if for every $x, y \in C$ and every $\theta \in [0, 1]$:

$$\theta x + (1 - \theta) y \in C$$

Geometrically: the line segment between any two points in $C$ lies entirely within $C$.
Key examples:
| Set | Convex? | Why |
|---|---|---|
| $\{x : a^\top x \le b\}$ (halfspace) | Yes | Linear constraint defines a halfspace |
| $\{x : \lVert x - x_0 \rVert_2 \le r\}$ (Euclidean ball) | Yes | Triangle inequality |
| $\{x : \lVert x \rVert_p \le r\}$ for $p \ge 1$ | Yes | Minkowski's inequality |
| $\{x : \lVert x \rVert_0 \le k\}$ (sparsity set) | No | Line between sparse vectors can be dense |
| $\mathbb{S}^n_+$ (PSD matrices) | Yes | Closed under positive combinations |
| $\{(x, t) : \lVert x \rVert_2 \le t\}$ (second-order cone) | Yes | SOC constraint in conic programming |
| A finite set with $\ge 2$ elements | No | Interior points of segment not in set |
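The chord condition can be checked numerically. The sketch below (assuming NumPy; the specific sets, radii, and tolerances are illustrative) samples points along the segment between two members of a set and tests membership, confirming that the $\ell_2$ ball passes while the sparsity set fails:

```python
import numpy as np

def in_l2_ball(x, r=1.0):
    return np.linalg.norm(x) <= r

def in_sparsity_set(x, k=1, tol=1e-12):
    # membership in {x : ||x||_0 <= k}
    return np.count_nonzero(np.abs(x) > tol) <= k

def segment_stays_inside(membership, x, y, n_points=11):
    """Check the chord condition on the segment between x and y."""
    return all(membership(t * x + (1 - t) * y) for t in np.linspace(0, 1, n_points))

x = np.array([0.6, 0.0])
y = np.array([0.0, 0.6])

# l2 ball: every convex combination of member points stays inside
assert in_l2_ball(x) and in_l2_ball(y)
assert segment_stays_inside(in_l2_ball, x, y)

# sparsity set {x : ||x||_0 <= 1}: both endpoints are 1-sparse,
# but interior points of the segment are 2-sparse
assert in_sparsity_set(x) and in_sparsity_set(y)
assert not segment_stays_inside(in_sparsity_set, x, y)
print("l2 ball passes the chord test; the sparsity set fails it")
```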
Closure properties. Convexity is preserved under:
- Intersection: $\bigcap_i C_i$ is convex if all $C_i$ are convex.
- Affine image: $AC + b = \{Ax + b : x \in C\}$ is convex.
- Inverse affine image: $\{x : Ax + b \in C\}$ is convex.
- Cartesian product: $C_1 \times C_2$ is convex.
- Sum: $C_1 + C_2 = \{x + y : x \in C_1,\ y \in C_2\}$ is convex.
Unions are generally not convex.
Convex hull. The convex hull $\mathrm{conv}(S)$ of a set $S$ is the smallest convex set containing $S$ — equivalently, the set of all convex combinations $\sum_i \theta_i x_i$ with $x_i \in S$, $\theta_i \ge 0$, $\sum_i \theta_i = 1$.
Convex Functions
Definition. A function $f : C \to \mathbb{R}$ on a convex set $C$ is convex if for every $x, y \in C$ and $\theta \in [0, 1]$:

$$f(\theta x + (1 - \theta) y) \le \theta f(x) + (1 - \theta) f(y)$$

The right-hand side is the value on the chord from $(x, f(x))$ to $(y, f(y))$. Convexity means the chord lies above the graph.
The chord condition is the minimal algebraic statement of "no deceptive curvature": it says the function cannot dip below the straight-line interpolation between any two points. Any weaker condition would allow local minima that are not global, destroying the tractability guarantee. The epigraph characterization makes this concrete: a function is convex if and only if the set of points above its graph is a convex set — turning a function property into a geometric one.
Epigraph characterization. $f$ is convex if and only if its epigraph $\mathrm{epi}(f) = \{(x, t) : x \in \mathrm{dom}(f),\ t \ge f(x)\}$ is a convex set. This turns function convexity into set convexity.
First-order characterization (requires differentiability). $f$ is convex iff:

$$f(y) \ge f(x) + \nabla f(x)^\top (y - x) \quad \text{for all } x, y$$
The tangent hyperplane is a global underestimator. This is the supporting hyperplane property — a key fact used in gradient-based optimality conditions.
Second-order characterization (requires twice-differentiability). $f$ is convex iff $\nabla^2 f(x) \succeq 0$ for all $x$ (the Hessian is positive semidefinite everywhere).
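The second-order test can be checked numerically with finite differences. A minimal sketch (assuming NumPy; the test function and step size are illustrative) verifies that the logistic loss $\phi(t) = \log(1 + e^{-t})$ has nonnegative curvature everywhere on a grid:

```python
import numpy as np

def logistic_loss(t):
    # phi(t) = log(1 + exp(-t)), computed stably via logaddexp
    return np.logaddexp(0.0, -t)

def second_derivative(f, t, h=1e-4):
    # central finite-difference approximation of f''(t)
    return (f(t + h) - 2.0 * f(t) + f(t - h)) / h**2

# nonnegative curvature at every sampled point is consistent with convexity
ts = np.linspace(-10, 10, 201)
curvatures = np.array([second_derivative(logistic_loss, t) for t in ts])
assert (curvatures >= -1e-8).all()
print("min sampled curvature:", curvatures.min())
```

This is only a pointwise sample, not a proof, but it catches non-convexity quickly: replacing `logistic_loss` with `np.sin` makes the assertion fail.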
Strong Convexity and L-Smoothness
Two quantitative strengthenings of convexity appear throughout optimization theory:
$\mu$-strong convexity ($\mu > 0$). $f$ is $\mu$-strongly convex if:

$$f(y) \ge f(x) + \nabla f(x)^\top (y - x) + \frac{\mu}{2}\, \lVert y - x \rVert^2$$

Equivalently: $\nabla^2 f(x) \succeq \mu I$ everywhere. Strong convexity means the function curves at least as fast as the quadratic $\frac{\mu}{2} \lVert x \rVert^2$. Consequence: a unique global minimizer $x^\star$ exists, and the suboptimality gap grows at least quadratically with distance: $f(x) - f(x^\star) \ge \frac{\mu}{2} \lVert x - x^\star \rVert^2$.
$L$-smoothness ($L > 0$). $f$ is $L$-smooth if $\nabla f$ is $L$-Lipschitz:

$$\lVert \nabla f(x) - \nabla f(y) \rVert \le L\, \lVert x - y \rVert \quad \text{for all } x, y$$

Equivalently: $\nabla^2 f(x) \preceq L I$ everywhere. Consequence: $f$ cannot curve faster than the quadratic $\frac{L}{2} \lVert x \rVert^2$, enabling the descent lemma:

$$f(y) \le f(x) + \nabla f(x)^\top (y - x) + \frac{L}{2}\, \lVert y - x \rVert^2$$
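The descent lemma can be verified empirically for a quadratic, where $L$ is exactly the largest Hessian eigenvalue. A sketch (assuming NumPy; the dimension and number of trials are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)

# Random PSD quadratic f(x) = 0.5 x^T A x; its smoothness constant is L = lambda_max(A)
M = rng.standard_normal((5, 5))
A = M @ M.T
L = np.linalg.eigvalsh(A).max()

f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

# Descent lemma: f(y) <= f(x) + <grad f(x), y - x> + (L/2) ||y - x||^2
for _ in range(100):
    x, y = rng.standard_normal(5), rng.standard_normal(5)
    upper = f(x) + grad(x) @ (y - x) + 0.5 * L * np.sum((y - x) ** 2)
    assert f(y) <= upper + 1e-9
print("descent-lemma upper bound held in all 100 trials")
```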
Condition number. When $f$ is both $\mu$-strongly convex and $L$-smooth, the condition number is $\kappa = L / \mu \ge 1$. Large $\kappa$ means the function is much steeper in some directions than others (ill-conditioned), leading to slow convergence of gradient methods.
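The effect of $\kappa$ on gradient descent is easy to demonstrate on diagonal quadratics, where the eigenvalues are the diagonal entries. A sketch (assuming NumPy; the tolerances and eigenvalue choices are illustrative) counts iterations to reach a small gradient norm:

```python
import numpy as np

def gd_iters_to_tol(eigs, tol=1e-6, max_iter=100_000):
    """Gradient descent on f(x) = 0.5 x^T diag(eigs) x from x0 = ones, step 1/L."""
    eigs = np.asarray(eigs, dtype=float)
    step = 1.0 / eigs.max()          # classic 1/L step size
    x = np.ones_like(eigs)
    for k in range(max_iter):
        g = eigs * x                 # gradient of the diagonal quadratic
        if np.linalg.norm(g) < tol:
            return k
        x = x - step * g
    return max_iter

well = gd_iters_to_tol([1.0, 2.0])     # kappa = 2
ill = gd_iters_to_tol([0.01, 2.0])     # kappa = 200
assert ill > 10 * well                  # ill-conditioning slows convergence dramatically
print(f"iterations: kappa=2 -> {well}, kappa=200 -> {ill}")
```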
Jensen's Inequality
Jensen's inequality. For convex $f$ and any random variable $X$ with finite expectation:

$$f(\mathbb{E}[X]) \le \mathbb{E}[f(X)]$$

For a finite mixture: $f\!\left(\sum_i \theta_i x_i\right) \le \sum_i \theta_i f(x_i)$ for any $\theta_i \ge 0$ with $\sum_i \theta_i = 1$.
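Both directions of Jensen can be checked by Monte Carlo. A sketch (assuming NumPy; the distribution and sample size are arbitrary) compares $f(\mathbb{E}[X])$ against $\mathbb{E}[f(X)]$ for a convex and a concave choice of $f$:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.uniform(0.1, 1.0, size=100_000)   # bounded away from 0 so log is finite

square = lambda t: t ** 2   # convex

# convex f: f(E[X]) <= E[f(X)]
assert square(X.mean()) <= square(X).mean()

# concave f: inequality flips -- this flipped direction is what gives the ELBO
assert np.log(X.mean()) >= np.log(X).mean()

print("Jensen gaps:",
      square(X).mean() - square(X.mean()),
      np.log(X.mean()) - np.log(X).mean())
```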
ML applications of Jensen's inequality:
- Evidence lower bound (ELBO): $\log p(x) = \log \mathbb{E}_{q(z)}\!\left[\frac{p(x,z)}{q(z)}\right] \ge \mathbb{E}_{q(z)}\!\left[\log \frac{p(x,z)}{q(z)}\right]$. The inequality comes from Jensen applied to the concave function $\log$ — the ELBO is the key object in variational inference.
- Cross-entropy vs entropy: $H(p, q) \ge H(p)$ — cross-entropy is always at least the true entropy. This is Jensen applied to $-\log$.
- KL divergence is nonnegative: $D_{\mathrm{KL}}(p \,\|\, q) = \mathbb{E}_p\!\left[\log \frac{p}{q}\right] \ge 0$ follows from $-\log$ being convex.
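These three facts can be confirmed on concrete discrete distributions. A sketch (assuming NumPy; the distributions are arbitrary examples with matching support):

```python
import numpy as np

def kl(p, q):
    """KL divergence between discrete distributions with matching support."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))

def entropy(p):
    p = np.asarray(p, float)
    return float(-np.sum(p * np.log(p)))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.3, 0.4, 0.3])

# KL is nonnegative in both directions, and zero iff p == q
assert kl(p, q) > 0 and kl(q, p) > 0
assert abs(kl(p, p)) < 1e-12

# cross-entropy H(p, q) = H(p) + KL(p||q) >= H(p)
cross_entropy = float(-np.sum(p * np.log(q)))
assert cross_entropy >= entropy(p)
print("KL(p||q) =", kl(p, q))
```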
Operations Preserving Convexity
Given convex functions , the following are also convex:
| Operation | Result | Condition |
|---|---|---|
| $\sum_i w_i f_i$ with $w_i \ge 0$ | convex | unconditional |
| $\max_i f_i(x)$ | convex | each $f_i$ convex |
| $\sup_{y \in \mathcal{Y}} g(x, y)$ | convex | $g(\cdot, y)$ convex for each $y$ |
| $f(Ax + b)$ | convex | unconditional |
| $h(f_1(x), \dots, f_k(x))$ | convex | $h$ convex, nondecreasing in each argument; each $f_i$ convex |
| $\inf_{y \in C} g(x, y)$ | convex in $x$ | $g$ jointly convex, $C$ convex |
| $a^\top x + b$ (affine) | convex (and concave) | unconditional |
Composition rule (crucial). $h \circ g$ is convex when: $g$ convex and $h$ convex nondecreasing, or $g$ concave and $h$ convex nonincreasing. Neural networks violate this because compositions of nonlinearities (ReLU) are not monotone in the required direction once stacked.
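A chord test makes the composition rule concrete: $e^{t^2}$ (convex, nondecreasing outer function) is convex, while $e^{-t^2}$ (outer function convex but decreasing) is not. A sketch (assuming NumPy; grid sizes and tolerances are illustrative):

```python
import numpy as np

def chord_violations(f, xs, n_theta=9):
    """Count points where f(theta*x + (1-theta)*y) exceeds the chord value."""
    bad = 0
    for x in xs:
        for y in xs:
            for th in np.linspace(0.1, 0.9, n_theta):
                chord = th * f(x) + (1 - th) * f(y)
                if f(th * x + (1 - th) * y) > chord + 1e-9 * (1 + abs(chord)):
                    bad += 1
    return bad

xs = np.linspace(-3.0, 3.0, 13)
good = lambda t: np.exp(t ** 2)    # h = exp (convex, nondecreasing), g = t^2 (convex)
bad = lambda t: np.exp(-t ** 2)    # h(u) = exp(-u) is convex but DECREASING: rule broken

assert chord_violations(good, xs) == 0
assert chord_violations(bad, xs) > 0
print("exp(t^2) passes the chord test; exp(-t^2) fails it")
```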
Conjugate Functions
The conjugate (Legendre–Fenchel transform) of $f$ is:

$$f^*(y) = \sup_x \left( y^\top x - f(x) \right)$$
Properties:
- $f^*$ is always convex (supremum of affine functions of $y$), even if $f$ is not.
- Fenchel–Young inequality: $y^\top x \le f(x) + f^*(y)$ for all $x, y$.
- Bidual: $f^{**} = f$ when $f$ is convex and closed (Fenchel–Moreau theorem).
- If $f(x) = \frac{1}{2} x^\top A x$ for PD $A$: $f^*(y) = \frac{1}{2} y^\top A^{-1} y$.
- If $f(x) = \lVert x \rVert$: $f^*$ is the indicator of the dual-norm unit ball $\{y : \lVert y \rVert_* \le 1\}$.
- If $f(x) = e^x$: $f^*(y) = y \log y - y$ for $y > 0$ (and $0$ at $y = 0$).
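The conjugate can be approximated by taking the supremum over a grid. A sketch (assuming NumPy; grid ranges and tolerances are illustrative) checks that $f(x) = \frac{1}{2}x^2$ is self-conjugate and that the Fenchel–Young inequality holds:

```python
import numpy as np

def conjugate(f, ys, xs):
    """Numerically approximate f*(y) = sup_x (y*x - f(x)) over a grid of x."""
    return np.max(ys[:, None] * xs[None, :] - f(xs)[None, :], axis=1)

xs = np.linspace(-10, 10, 20_001)
ys = np.linspace(-2, 2, 9)

f = lambda x: 0.5 * x ** 2
fstar = conjugate(f, ys, xs)

# f(x) = x^2/2 is its own conjugate: f*(y) = y^2/2
assert np.allclose(fstar, 0.5 * ys ** 2, atol=1e-4)

# Fenchel-Young: y*x <= f(x) + f*(y) for every pair on the grid
for y, fy in zip(ys, fstar):
    assert np.all(y * xs <= f(xs) + fy + 1e-6)
print("f*(y) matches y^2/2 and Fenchel-Young holds on the grid")
```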
ML role. Conjugate functions appear in dual formulations of SVMs, sparse recovery, and optimal transport. The Wasserstein distance has a dual form involving conjugates.
Key Convex Functions in ML
| Function | Domain | Convex? | ML role |
|---|---|---|---|
| $\lVert x \rVert_p$ for $p \ge 1$ | $\mathbb{R}^n$ | Yes | Regularization |
| $x^\top A x$ for $A \succeq 0$ | $\mathbb{R}^n$ | Yes | Quadratic objectives |
| $\log \sum_i e^{x_i}$ (log-sum-exp) | $\mathbb{R}^n$ | Yes | Softmax loss |
| $-\log x$ | $x > 0$ | Yes | Barrier functions, log-likelihood |
| $x \log x$ (neg. entropy) | $x > 0$ | Yes | Entropy regularization, KL |
| $\log(1 + e^{-x})$ | $\mathbb{R}$ | Yes | Logistic loss |
| $\max(0, 1 - x)$ | $\mathbb{R}$ | Yes | Hinge loss (SVM) |
| $\lVert Ax - b \rVert_2^2$ | $\mathbb{R}^n$ | Yes | Least squares |
| $\max_i x_i$ | $\mathbb{R}^n$ | Yes | Minimax objectives |
| $A(\theta) = \log \int e^{\theta^\top T(x)} h(x)\, dx$ | $\Theta$ | Yes | Exponential family log-partition |
| $\mathbf{1}[x < 0]$ | $\mathbb{R}$ | No | 0-1 loss (NP-hard to optimize) |
Worked Example
Example 1: Log-Sum-Exp is Convex
Claim. $f(x) = \log \sum_{i=1}^n e^{x_i}$ is convex.

Proof via Hessian. Let $p_i = \frac{e^{x_i}}{\sum_j e^{x_j}}$ (softmax probabilities). Compute:

$$\nabla f(x) = p, \qquad \nabla^2 f(x) = \mathrm{diag}(p) - p p^\top$$

The Hessian is $\mathrm{diag}(p) - pp^\top$. For any $v$:

$$v^\top \nabla^2 f(x)\, v = \sum_i p_i v_i^2 - \Big( \sum_i p_i v_i \Big)^2 = \mathrm{Var}_{I \sim p}(v_I) \ge 0$$

Since $\nabla^2 f(x) \succeq 0$ everywhere, $f$ is convex. The Hessian is also bounded: $\nabla^2 f(x) \preceq \mathrm{diag}(p) \preceq I$ (since $pp^\top \succeq 0$ and $p_i \le 1$), so $f$ is $1$-smooth.
Significance. The cross-entropy loss for $K$-class classification is $\ell(z) = -z_y + \log \sum_k e^{z_k}$ (log-sum-exp minus the ground-truth logit), which is convex in the logits $z$. This is why softmax regression is a convex problem.
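The Hessian formula above is easy to verify numerically. A sketch (assuming NumPy; dimension and sample count are arbitrary) builds $\mathrm{diag}(p) - pp^\top$ at random points and checks both claims at once, positive semidefiniteness and the $\preceq I$ bound:

```python
import numpy as np

def lse_hessian(x):
    """Hessian of log-sum-exp at x: diag(p) - p p^T with p = softmax(x)."""
    z = x - x.max()                    # shift for numerical stability
    p = np.exp(z) / np.exp(z).sum()
    return np.diag(p) - np.outer(p, p)

rng = np.random.default_rng(3)
for _ in range(50):
    eigs = np.linalg.eigvalsh(lse_hessian(3.0 * rng.standard_normal(6)))
    assert eigs.min() >= -1e-10        # PSD  -> convex
    assert eigs.max() <= 1.0 + 1e-10   # <= I -> 1-smooth
print("all sampled Hessians have eigenvalues in [0, 1]")
```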
Example 2: Logistic Loss is Convex
The binary logistic loss for a single example $(x, y)$ with $y \in \{-1, +1\}$:

$$\ell(w) = \log\!\left(1 + e^{-y\, w^\top x}\right)$$

where $\phi(t) = \log(1 + e^{-t})$ is convex (a reflected softplus), and $t = y\, w^\top x$ is linear in $w$. Composition of a convex function with an affine map is convex. Sum over the dataset:

$$L(w) = \sum_{i=1}^n \log\!\left(1 + e^{-y_i\, w^\top x_i}\right)$$

is convex in $w$. Add $\frac{\lambda}{2} \lVert w \rVert_2^2$ for $\lambda > 0$: the resulting $\ell_2$-regularized logistic regression objective is strongly convex with $\mu = \lambda$, guaranteeing a unique global minimizer.
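The strong-convexity claim can be checked by computing the Hessian of the regularized objective at random points. A sketch (assuming NumPy; the synthetic data, $\lambda$, and dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, lam = 200, 5, 0.1
X = rng.standard_normal((n, d))          # synthetic features
y = rng.choice([-1.0, 1.0], size=n)      # synthetic labels in {-1, +1}

def hessian(w):
    """Hessian of sum_i log(1 + exp(-y_i w^T x_i)) + (lam/2) ||w||^2."""
    t = np.clip(y * (X @ w), -30, 30)    # clip only to avoid overflow in exp
    s = 1.0 / (1.0 + np.exp(t))          # sigma(-t), in (0, 1)
    curv = s * (1.0 - s)                 # per-example curvature of the logistic loss
    return X.T @ (curv[:, None] * X) + lam * np.eye(d)

# mu-strong convexity with mu = lam: every Hessian eigenvalue is >= lam
for _ in range(20):
    eigs = np.linalg.eigvalsh(hessian(rng.standard_normal(d)))
    assert eigs.min() >= lam - 1e-9
print("smallest Hessian eigenvalue >= lambda at all sampled points")
```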
Example 3: Why Neural Networks Break Convexity
Consider a two-layer network $f(x) = w_2^\top \sigma(W_1 x)$ with ReLU $\sigma(t) = \max(0, t)$. The composition rule requires the outer map to be convex and nondecreasing in its argument (or concave and nonincreasing). ReLU is convex, but the outer weight $w_2$ can be negative, flipping the required monotonicity. Once you multiply by $w_2$, you lose convexity in the joint parameters $(W_1, w_2)$.

Specifically: consider the 1D case $f(x) = w_2 \max(0, w_1 x)$. The function is linear in $w_2$ for fixed $w_1$, but nonconvex in $(w_1, w_2)$ jointly. With even two neurons in a single layer, the landscape develops local minima and saddle points.
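A concrete chord violation demonstrates the joint nonconvexity. A sketch (assuming NumPy; the single-input, single-target setup is the simplest illustrative case) exploits the scale ambiguity $w_1 w_2 = \text{const}$, which produces two distinct zero-loss parameter settings whose midpoint has strictly positive loss:

```python
import numpy as np

def loss(params, x=1.0, target=1.0):
    """Squared loss of a one-neuron ReLU net f(x) = w2 * relu(w1 * x)."""
    w1, w2 = params
    pred = w2 * max(0.0, w1 * x)
    return (pred - target) ** 2

# two distinct global minima from the scale ambiguity w1 * w2 = 1
a = np.array([1.0, 1.0])
b = np.array([2.0, 0.5])
assert loss(a) == 0.0 and loss(b) == 0.0

# their midpoint has strictly positive loss: the chord condition fails,
# so the loss is nonconvex in (w1, w2) even for a single neuron
mid = 0.5 * (a + b)
assert loss(mid) > 0.0
print("loss at midpoint of two global minima:", loss(mid))
```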
Connections
Where Your Intuition Breaks
The common mistake: assuming that a smooth, single-valley loss curve seen in 2D or 3D plots is what the high-dimensional loss landscape really looks like. Those 2D cross-sections are cherry-picked directions — usually a random direction or the gradient direction — and they look convex because almost any one-dimensional cross-section of a high-dimensional loss surface will appear roughly convex. The actual landscape has exponentially many directions, and even a single non-convex direction is enough to create saddle points or local minima. This is why visualizations of "the loss landscape" are almost always misleading: the surface that gradient descent navigates in dimensions cannot be faithfully represented in 2D. The success of SGD on non-convex neural network training is not explained by local convexity — it's explained by the benign structure of saddle points in overparameterized models, which is a separate story from convexity.
Why convexity = tractable optimization. For a convex function, any local minimum is global — this follows directly from the definition. Proof: if $x^\star$ is a local minimum but not global, then some $z$ has $f(z) < f(x^\star)$. The point $x_\theta = (1 - \theta) x^\star + \theta z$ for small $\theta > 0$ lies in any neighborhood of $x^\star$ (local-ness) but has value $f(x_\theta) \le (1 - \theta) f(x^\star) + \theta f(z) < f(x^\star)$ (convexity), contradicting local minimality. QED. For strongly convex functions, the minimizer is also unique.
Jensen's inequality is why the ELBO works. The variational inference objective is an instance of Jensen's inequality applied to $\log$ (which is concave, so Jensen gives a lower bound). Maximizing the ELBO instead of the intractable log-marginal works because the bound is tight when $q(z) = p(z \mid x)$ — so we are doing the best possible within the variational family $\mathcal{Q}$.
Convexity is not preserved through neural network compositions. The cross-entropy loss is convex in the logits, logistic regression is convex in the weights, but deep neural networks are non-convex in their weights. The reason is the product structure: even dropping the nonlinearity, $(W_1, W_2) \mapsto W_2 W_1 x$ is bilinear in $(W_1, W_2)$, and bilinear functions are generally neither convex nor concave in the joint parameter space. This non-convexity is fundamental, not an artifact of how we write the objective.