High-Dimensional Statistics: Sparsity, RIP & Compressed Sensing
In high dimensions, intuitions from low-dimensional statistics break down: volume concentrates on spherical shells, random vectors become nearly orthogonal, and classical estimators fail. The restricted isometry property and compressed sensing show that sparse signals can be recovered exactly from far fewer measurements than classical theory would require.
Concepts
Concentration of measure: for $X \sim N(0, I_d)$, the normalized squared norm $\|X\|^2/d$ concentrates tightly around 1 as the dimension $d$ grows.
The intuitions you built for 2D and 3D geometry — that the center of a ball holds most of its volume, that nearby points are clearly closer than far ones, that "random" is spread uniformly — break catastrophically in high dimensions. A dataset of points in 1000-dimensional feature space inhabits a world where every point is nearly orthogonal to every other, nearest and farthest neighbors are almost equidistant, and essentially all volume lives in a thin shell at the boundary of any ball.
The Curse of Dimensionality
Volume concentrates on the sphere. For the unit ball $B^d = \{x \in \mathbb{R}^d : \|x\|_2 \le 1\}$, the radial density of volume is $p(r) = d\,r^{d-1}$ on $[0, 1]$. Almost all volume is near the boundary:

$$\frac{\mathrm{vol}(\epsilon B^d)}{\mathrm{vol}(B^d)} = \epsilon^d \to 0, \qquad \frac{\mathrm{vol}\bigl(B^d \setminus (1-\epsilon)B^d\bigr)}{\mathrm{vol}(B^d)} = 1 - (1-\epsilon)^d \to 1.$$

The fraction of volume within distance $\epsilon$ of the center is $\epsilon^d$, which vanishes exponentially in $d$.
The factor $r^{d-1}$ in the volume element is the geometric reason intuitions fail. In $d = 3$ the surface area grows as $r^2$, modest; in $d = 1000$ the surface density grows as $r^{999}$, so the interior is negligible. Every algorithm that relies on finding "typical" points near the center of a distribution — density estimation, nearest-neighbor classification, $k$-means — implicitly assumes low-dimensional geometry and degrades unless the intrinsic dimensionality of the data is much lower than the ambient dimension $d$.
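The shell concentration takes two lines to check: the fraction of the unit $d$-ball lying within distance $\epsilon$ of the boundary is $1 - (1-\epsilon)^d$. A minimal sketch:

```python
def outer_shell_fraction(d, eps=0.01):
    """Volume fraction of the unit d-ball within distance eps of the boundary."""
    return 1.0 - (1.0 - eps) ** d

# In d=2 the thin shell is negligible; in d=1000 it holds essentially everything.
for d in [2, 10, 100, 1000]:
    print(d, outer_shell_fraction(d))
```

For $d = 1000$ and $\epsilon = 0.01$, over 99.99% of the volume sits in the outermost 1% shell.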
Distances concentrate. For independent $X, Y \sim N(0, I_d)$:

$$\|X - Y\|_2^2 = 2d \pm O(\sqrt{d}), \qquad \text{so} \qquad \|X - Y\|_2 = \sqrt{2d}\,\bigl(1 \pm O(d^{-1/2})\bigr),$$

so all pairwise distances are approximately $\sqrt{2d}$. Nearest-neighbor methods fail: "near" and "far" become indistinguishable.
Random vectors are nearly orthogonal. For independent $X, Y \sim N(0, I_d)$:

$$\frac{\langle X, Y\rangle}{\|X\|_2\,\|Y\|_2} = O\!\left(\frac{1}{\sqrt{d}}\right) \quad \text{with high probability.}$$

A Gaussian random matrix therefore has columns that are approximately orthogonal — this is the basis for random projections.
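Both effects are easy to see by simulation. A minimal NumPy sketch (the dimension and sample count are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 1000, 100
X = rng.standard_normal((n, d))                 # n random Gaussian points in R^d

i, j = np.triu_indices(n, k=1)
dists = np.linalg.norm(X[i] - X[j], axis=1)     # all pairwise distances
print(dists.mean() / np.sqrt(2 * d))            # ~1: distances cluster at sqrt(2d)
print(dists.std() / dists.mean())               # tiny relative spread

U = X / np.linalg.norm(X, axis=1, keepdims=True)
cosines = (U[i] * U[j]).sum(axis=1)             # pairwise cosine similarities
print(np.abs(cosines).mean())                   # ~1/sqrt(d): nearly orthogonal
```

All 4,950 pairwise distances fall within a few percent of $\sqrt{2d} \approx 44.7$, and typical cosines are a few hundredths.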
Concentration of Measure for Gaussians
For $X \sim N(0, I_d)$:

$$\frac{\|X\|_2^2}{d} \to 1 \quad \text{as } d \to \infty.$$

More precisely, $\mathbb{E}\,\|X\|_2^2 = d$ and $\mathrm{Var}(\|X\|_2^2) = 2d$, so $\|X\|_2^2/d = 1 \pm O(\sqrt{2/d})$. The distribution of $\|X\|_2^2/d$ concentrates tightly around 1 as $d$ grows.
Gaussian Lipschitz concentration: for $X \sim N(0, I_d)$ and $f: \mathbb{R}^d \to \mathbb{R}$ with $|f(x) - f(y)| \le L\,\|x - y\|_2$ everywhere (Lipschitz constant $L$):

$$P\bigl(|f(X) - \mathbb{E} f(X)| > t\bigr) \le 2\exp\!\left(-\frac{t^2}{2L^2}\right).$$

This is dimension-free: the concentration depends only on $L$ and $t$, not $d$. It implies that any smooth function of a high-dimensional Gaussian concentrates tightly.
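The norm itself is a 1-Lipschitz function of the Gaussian, so its fluctuations stay $O(1)$ in every dimension while the mean grows like $\sqrt{d}$; equivalently, the spread of $\|X\|^2/d$ shrinks like $\sqrt{2/d}$. A quick empirical check (trial counts are arbitrary):

```python
import numpy as np

def norm_concentration(d, trials=2000, seed=1):
    """Mean and std of ||X||^2 / d over repeated draws X ~ N(0, I_d)."""
    rng = np.random.default_rng(seed)
    sq = (rng.standard_normal((trials, d)) ** 2).sum(axis=1) / d
    return sq.mean(), sq.std()

for d in [10, 100, 1000]:
    m, s = norm_concentration(d)
    print(d, round(m, 3), round(s, 3))          # std shrinks like sqrt(2/d)
```

The mean stays pinned at 1 while the standard deviation drops by roughly $\sqrt{10}$ per factor-of-10 increase in $d$.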
Sparse Recovery and the LASSO
When the number of features exceeds the number of observations ($p \gg n$, far more features than observations), classical OLS is impossible (underdetermined). If the true coefficient vector $\beta^* \in \mathbb{R}^p$ has at most $s$ nonzero entries ($s$-sparse), recovery is possible.
LASSO (Least Absolute Shrinkage and Selection Operator):

$$\hat\beta = \arg\min_{\beta \in \mathbb{R}^p} \; \frac{1}{2n}\|y - X\beta\|_2^2 + \lambda \|\beta\|_1.$$

The $\ell_1$ penalty induces sparsity: the LASSO solution has many exactly-zero entries.
Oracle inequality: under the restricted eigenvalue condition (RE), with $\lambda \asymp \sigma\sqrt{\log p / n}$:

$$\|\hat\beta - \beta^*\|_2^2 \lesssim \frac{\sigma^2\, s \log p}{n}.$$

This is the minimax-optimal rate for $s$-sparse estimation in $\mathbb{R}^p$ — the $\log p$ factor is unavoidable (it comes from the multiple-testing penalty for searching over $p$ features).
Comparison: OLS achieves $\sigma^2 p / n$ when $p < n$. LASSO achieves $\sigma^2 s \log p / n$ — a factor $p / (s \log p)$ better.
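The gap is visible on synthetic sparse data. The sketch below solves the LASSO with plain ISTA (proximal gradient descent with soft-thresholding) rather than a library solver, and compares against the minimum-norm pseudoinverse fit; problem sizes and the constant in $\lambda$ are illustrative choices:

```python
import numpy as np

def lasso_ista(X, y, lam, n_iter=500):
    """LASSO via ISTA: argmin_b (1/2n)||y - Xb||^2 + lam * ||b||_1."""
    n, p = X.shape
    step = n / np.linalg.norm(X, 2) ** 2        # 1/L, L = Lipschitz const of grad
    b = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ b - y) / n
        z = b - step * grad
        b = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft-threshold
    return b

rng = np.random.default_rng(0)
n, p, s, sigma = 100, 500, 5, 0.5
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:s] = 3.0                                  # s-sparse ground truth
y = X @ beta + sigma * rng.standard_normal(n)

lam = 2 * sigma * np.sqrt(np.log(p) / n)        # lambda ~ sigma * sqrt(log p / n)
b_lasso = lasso_ista(X, y, lam)
b_pinv = np.linalg.pinv(X) @ y                  # minimum-norm "OLS" (p >> n)

print(np.linalg.norm(b_lasso - beta))           # small estimation error
print(np.linalg.norm(b_pinv - beta))            # large: most of beta is lost
```

The pseudoinverse interpolates the data but misses the component of $\beta^*$ in the 400-dimensional null space of $X$; the LASSO recovers the five active coefficients with small bias.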
Restricted Isometry Property and Compressed Sensing
RIP (Restricted Isometry Property): a matrix $A \in \mathbb{R}^{m \times d}$ satisfies the $s$-RIP with constant $\delta_s$ if:

$$(1 - \delta_s)\,\|x\|_2^2 \le \|Ax\|_2^2 \le (1 + \delta_s)\,\|x\|_2^2 \quad \text{for all } s\text{-sparse } x.$$

The matrix preserves the lengths of all sparse vectors — it acts nearly isometrically on the set of sparse vectors.
Random matrices satisfy RIP. If $A_{ij} \overset{\text{iid}}{\sim} N(0, 1/m)$ and $m \gtrsim \delta^{-2}\, s \log(d/s)$, then $A$ satisfies $s$-RIP with constant $\delta$ with high probability.
Compressed sensing recovery (Candès, Romberg, Tao 2006; Donoho 2006): if $A$ satisfies $2s$-RIP with $\delta_{2s} < \sqrt{2} - 1$, then the $\ell_1$ minimization program:

$$\min_{x} \|x\|_1 \quad \text{subject to} \quad Ax = y$$

recovers the true $s$-sparse $x^*$ exactly from $m \asymp s \log(d/s)$ measurements.
Key insight: the number of measurements $m$ is proportional to the sparsity $s$ times $\log(d/s)$, not the ambient dimension $d$. For $d = 10^6$ and $s = 10^3$: need only $m \asymp s \log(d/s) \approx 7{,}000$ measurements (up to constants) to recover a million-dimensional sparse signal.
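The $\ell_1$ program can be written as a linear program over $[x; t]$ with $|x_i| \le t_i$ and solved with `scipy.optimize.linprog` (assuming SciPy is available; problem sizes below are illustrative):

```python
import numpy as np
from scipy.optimize import linprog

def l1_recover(A, y):
    """Basis pursuit: min ||x||_1 s.t. Ax = y, as an LP in z = [x; t]."""
    m, d = A.shape
    I = np.eye(d)
    c = np.concatenate([np.zeros(d), np.ones(d)])       # minimize sum(t)
    A_ub = np.block([[I, -I], [-I, -I]])                # x - t <= 0, -x - t <= 0
    b_ub = np.zeros(2 * d)
    A_eq = np.hstack([A, np.zeros((m, d))])             # equality constraint Ax = y
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=y,
                  bounds=[(None, None)] * d + [(0, None)] * d)
    return res.x[:d]

rng = np.random.default_rng(0)
d, s, m = 80, 4, 30                                     # m ~ s log(d/s), m << d
A = rng.standard_normal((m, d)) / np.sqrt(m)
x_true = np.zeros(d)
x_true[rng.choice(d, size=s, replace=False)] = 3 * rng.standard_normal(s)
y = A @ x_true

x_hat = l1_recover(A, y)
print(np.max(np.abs(x_hat - x_true)))                   # ~0: exact recovery
```

Thirty measurements suffice to recover the 80-dimensional 4-sparse signal to solver precision, even though the system $Ax = y$ is wildly underdetermined.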
Johnson-Lindenstrauss Lemma
JL lemma: for any $n$ points $x_1, \dots, x_n \in \mathbb{R}^d$ and $\epsilon \in (0, 1)$, there exists a linear map $f: \mathbb{R}^d \to \mathbb{R}^k$ with $k = O(\epsilon^{-2} \log n)$ such that for all $i, j$:

$$(1 - \epsilon)\,\|x_i - x_j\|_2^2 \le \|f(x_i) - f(x_j)\|_2^2 \le (1 + \epsilon)\,\|x_i - x_j\|_2^2.$$

A Gaussian random matrix $f(x) = \frac{1}{\sqrt{k}}\, G x$ with $G_{ij} \overset{\text{iid}}{\sim} N(0, 1)$ achieves this with high probability.
Key feature: $k$ depends on the number of points $n$, not the ambient dimension $d$! For $n = 10^6$ points with $\epsilon = 0.1$: $k \approx 8 \ln(n)/\epsilon^2 \approx 11{,}000$, regardless of whether $d = 10^4$ or $d = 10^9$.
JL implies that dimensionality reduction via random projection preserves all pairwise distances — the basis for random projections in ML, locality-sensitive hashing, and random features for kernel approximation.
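A sketch of the lemma in action: project Gaussian points from $d = 10{,}000$ down to $k \approx 8\ln(n)/\epsilon^2$ dimensions and inspect the pairwise squared-distance ratios (constants and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, eps = 50, 10000, 0.3
k = int(np.ceil(8 * np.log(n) / eps ** 2))      # ~ eps^-2 log n target dimension

X = rng.standard_normal((n, d))
G = rng.standard_normal((k, d)) / np.sqrt(k)    # random projection R^d -> R^k
Y = X @ G.T

i, j = np.triu_indices(n, k=1)
orig = np.linalg.norm(X[i] - X[j], axis=1) ** 2
proj = np.linalg.norm(Y[i] - Y[j], axis=1) ** 2
ratios = proj / orig
print(k, ratios.min(), ratios.max())            # all ratios close to 1
```

Despite a ~29× reduction in dimension ($k = 348$ here), every one of the 1,225 pairwise distances is preserved to within a modest relative error.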
Spiked Covariance and the BBP Transition
Consider data $x_i \overset{\text{iid}}{\sim} N(0, \Sigma)$ where $\Sigma = I_d + \sum_j \theta_j v_j v_j^\top$ has "spikes" of size $\theta_j$ above the noise floor.
Marchenko-Pastur law: the empirical spectral distribution of $\hat\Sigma = \frac{1}{n}\sum_{i=1}^n x_i x_i^\top$ (with $d/n \to \gamma$) converges to the Marchenko-Pastur distribution supported on $[(1 - \sqrt\gamma)^2, (1 + \sqrt\gamma)^2]$.
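A quick NumPy check of the Marchenko-Pastur support, using pure-noise data so that $\Sigma = I$ (sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 500, 1000
gamma = d / n                                   # aspect ratio 0.5

X = rng.standard_normal((n, d))                 # pure noise: Sigma = I
evals = np.linalg.eigvalsh(X.T @ X / n)         # sample covariance spectrum

edge_lo = (1 - np.sqrt(gamma)) ** 2             # MP bulk edges: ~0.086, ~2.914
edge_hi = (1 + np.sqrt(gamma)) ** 2
print(evals.min(), evals.max())                 # hug the MP edges, not 1
```

Even though every population eigenvalue equals 1, the sample eigenvalues spread across the entire MP bulk $[0.086,\ 2.914]$: high-dimensional sample covariance is badly biased.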
BBP transition (Baik-Ben Arous-Péché, 2005): for a single spike $\theta$:
- If $\theta \le \sqrt\gamma$ (weak spike): the top sample eigenvalue sticks to the bulk edge $(1 + \sqrt\gamma)^2$ and the sample eigenvector is essentially random — no information about $v$.
- If $\theta > \sqrt\gamma$ (strong spike): the top sample eigenvalue separates from the Marchenko-Pastur bulk at $(1 + \theta)(1 + \gamma/\theta)$, and the sample eigenvector aligns with $v$ (with squared overlap $(1 - \gamma/\theta^2)/(1 + \gamma/\theta)$).
Implication for PCA: sample PCA is consistent only when the signal-to-noise ratio (spike magnitude relative to noise) exceeds the BBP threshold $\sqrt\gamma = \sqrt{d/n}$. For $d = n$ (one observation per dimension), no spike smaller than 1 is detectable.
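The transition shows up clearly in a small simulation: the overlap between the top sample eigenvector and the spike direction jumps from near zero to a macroscopic value as $\theta$ crosses $\sqrt\gamma$ (sizes and $\theta$ values are illustrative):

```python
import numpy as np

def top_eig_overlap(theta, d=400, gamma=1.0, seed=0):
    """Squared overlap |<v_hat, v>|^2 of top sample eigenvector with the spike."""
    rng = np.random.default_rng(seed)
    n = int(d / gamma)
    X = rng.standard_normal((n, d))
    X[:, 0] *= np.sqrt(1 + theta)               # Sigma = I + theta * e1 e1^T
    _, V = np.linalg.eigh(X.T @ X / n)          # eigenvalues ascending
    return V[0, -1] ** 2                        # first coord of top eigenvector

print(top_eig_overlap(0.5))   # theta below sqrt(gamma)=1: overlap near 0
print(top_eig_overlap(3.0))   # theta above threshold: macroscopic overlap
```

With $\gamma = 1$ and $\theta = 3$, the asymptotic squared overlap is $(1 - 1/9)/(1 + 1/3) = 2/3$, and the simulation lands nearby; below the threshold the eigenvector carries essentially no signal.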
Worked Examples
Example 1: LASSO vs OLS
Gene expression study: $n = 100$ patients, $p = 20{,}000$ genes, true model has $s = 10$ relevant genes.
OLS: infeasible ($p \gg n$). Even with a pseudoinverse, MSE $\gtrsim \sigma^2 p / n = 200\,\sigma^2$.
LASSO with $\lambda \asymp \sigma\sqrt{\log p / n}$:
Oracle rate: MSE $\asymp \sigma^2 s \log p / n = \sigma^2 \cdot 10 \cdot \log(20{,}000) / 100 \approx \sigma^2$.
The LASSO achieves an MSE roughly 200 times smaller than OLS (for large $p$). In practice, cross-validated LASSO is a standard tool for this high-dimensional setting.
Example 2: Compressed Sensing for MRI
In MRI, the full signal $x \in \mathbb{R}^d$ with $d = 256 \times 256 \approx 65{,}000$ pixels is sparse in the wavelet domain: $Wx$ is approximately $s$-sparse with $s \approx 400$.
The MRI scanner collects Fourier measurements (frequency-domain samples). Instead of $d \approx 65{,}000$ measurements, take only $m$ random frequency samples.
With $s \approx 400$ sparse wavelet coefficients and $m \gtrsim s \log(d/s)$: about $4{,}300$ measurements instead of 65,000 — a 15× speedup in scan time, recovering the image by solving the compressed sensing $\ell_1$ minimization (up to the approximation error of the wavelet tail).
Example 3: BBP Threshold in Practice
Simulation: $d = 1000$, $n = 1000$ ($\gamma = d/n = 1$), so BBP threshold $\sqrt\gamma = 1$.
Population covariance: $\Sigma = I + \theta\, v v^\top$. For $\theta = 0.5 < 1$: the top sample eigenvector has angle $\approx 90°$ from $v$ — useless for direction estimation. For $\theta = 2 > 1$: the squared overlap converges to $(1 - \gamma/\theta^2)/(1 + \gamma/\theta) = 0.5$ as $n \to \infty$.
Practical implication: with $d/n = \gamma = 10$ (a common ratio in genomics), PCA can only detect components with signal-to-noise ratio above $\sqrt{10} \approx 3.2$. Weak signals below this threshold are indistinguishable from noise even in large samples.
Connections
Where Your Intuition Breaks
The LASSO oracle inequality is a beautiful result — but it rests on the restricted eigenvalue condition on the design matrix $X$: columns corresponding to the true support must not be too correlated with columns outside the support. For random Gaussian designs, this holds with high probability. For real-world feature matrices — gene expression data, text features, financial signals — highly correlated features are the norm, not the exception, and the RE condition can fail. When it does, the LASSO solution may not converge to the true $\beta^*$ at the oracle rate regardless of sample size. Practitioners often use adaptive variants (elastic net, group Lasso, adaptive Lasso) for correlated features, but the clean $s \log p / n$ rate applies only under structural conditions that must be verified, not assumed.
Sparsity is the key structural assumption in high dimensions. Without structure, $d$ unknown parameters require $n \gtrsim d$ samples — unavoidable in the worst case (this is the minimax lower bound). Sparsity with $s \ll d$ reduces the effective number of unknowns to $s$, yielding rates proportional to $s \log d$, not $d$. The $\log d$ factor is the price of not knowing which coordinates are nonzero — it reflects a multiple-testing penalty for searching over all $\binom{d}{s}$ possible support sets. Other structural assumptions (low-rank matrices, group sparsity, tree sparsity) give analogous results with different penalty terms.
The JL lemma says geometry survives random projection. $k = O(\epsilon^{-2} \log n)$ dimensions are enough to preserve all pairwise distances, regardless of $d$. This is not obvious: projecting from $d$ down to $k \ll d$ dimensions seems like destroying information. But the key is that only pairwise distances need to be preserved — a fixed set of $\binom{n}{2}$ constraints regardless of $d$. Random projections work because a random $k$-dimensional subspace is "spread" across all original directions, rather than aligned with any particular low-dimensional subspace.
The RIP is hard to verify in practice. While random matrices satisfy RIP with high probability, real measurement matrices (medical devices, sensor arrays) are deterministic and structured. The RIP condition must be verified separately and is NP-hard to check in general. In practice, practitioners rely on structured-random designs (partial Fourier, partial Hadamard), which are known to satisfy RIP. Additionally, compressed sensing guarantees exact recovery only for exactly sparse signals; approximate sparsity gives approximate recovery with error proportional to the tail $\|x - x_s\|_1$ of the best $s$-term approximation $x_s$.