
High-Dimensional Statistics: Sparsity, RIP & Compressed Sensing

In high dimensions, intuitions from low-dimensional statistics break down: volume concentrates on spherical shells, random vectors become nearly orthogonal, and classical estimators fail. The restricted isometry property and compressed sensing show that sparse signals can be recovered exactly from far fewer measurements than classical theory would require.

Concepts

Concentration of measure: for $X \sim \mathcal{N}(0, I_d)$, the normalized squared norm $\|X\|^2/d$ concentrates tightly around 1 as dimension $d$ grows.

Concentration: $\text{Var}(\|X\|^2/d) = 2/d \to 0$. At $d=500$, the distribution is a spike. This means random Gaussian vectors are nearly orthogonal in high dimensions: $\mathbb{E}[X \cdot Y] = 0$ but $\|X\| \approx \|Y\| \approx \sqrt{d}$ — all points live on a thin spherical shell.
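This concentration is easy to check numerically. A minimal sketch using NumPy (sample counts and dimensions are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# For X ~ N(0, I_d), ||X||^2/d has mean 1 and variance 2/d,
# so the distribution collapses onto 1 as d grows.
for d in [2, 50, 500]:
    X = rng.standard_normal((100_000, d))
    q = (X ** 2).sum(axis=1) / d          # ||X||^2 / d for each sample
    print(f"d={d:3d}  mean={q.mean():.3f}  var={q.var():.4f}  theory 2/d={2/d:.4f}")
```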

The intuitions you built for 2D and 3D geometry — that the center of a ball holds most of its volume, that nearby points are clearly closer than far ones, that "random" is spread uniformly — break catastrophically in high dimensions. A dataset of points in 1000-dimensional feature space inhabits a world where every point is nearly orthogonal to every other, nearest and farthest neighbors are almost equidistant, and essentially all volume lives in a thin shell at the boundary of any ball.

The Curse of Dimensionality

Volume concentrates on the sphere. For $X \sim \text{Uniform}(\mathbb{B}^d)$ (the unit ball in $\mathbb{R}^d$), the radial distribution is $P(r < \|X\| \leq r + dr) \propto r^{d-1}$. Almost all volume is near the boundary:

$$P\!\left(\|X\| \geq 1 - \varepsilon\right) = 1 - (1-\varepsilon)^d \to 1 \quad \text{for any fixed } \varepsilon > 0 \text{ as } d \to \infty.$$

The fraction of volume within distance $\varepsilon$ of the center is $(1-\varepsilon)^d \leq e^{-\varepsilon d} \to 0$.
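The shell fraction $1-(1-\varepsilon)^d$ can be tabulated directly (a quick sketch; $\varepsilon = 0.01$ is an arbitrary choice):

```python
# Fraction of the unit ball's volume within distance eps of the boundary,
# computed from the exact formula 1 - (1 - eps)^d.
eps = 0.01
for d in [3, 100, 1000]:
    shell = 1 - (1 - eps) ** d
    print(f"d={d:4d}  fraction within {eps} of the boundary: {shell:.4f}")
```

By $d = 1000$ essentially all of the volume sits in that 1%-thick shell, matching the $e^{-\varepsilon d}$ bound.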

The $r^{d-1}$ factor in the volume element $r^{d-1}\,dr$ is the geometric reason intuitions fail. In $d=3$ the surface area grows as $r^2$, modest; in $d=1000$ the surface density grows as $r^{999}$, so the interior is negligible. Every algorithm that relies on finding "typical" points near the center of a distribution — density estimation, nearest-neighbor classification, $k$-means — implicitly assumes low-dimensional geometry and degrades unless the intrinsic dimensionality of the data is much lower than the ambient dimension $d$.

Distances concentrate. For $X, Y \stackrel{\text{iid}}{\sim} \mathcal{N}(0, I_d)$:

$$\frac{\|X - Y\|^2}{2d} \xrightarrow{P} 1 \quad \text{as } d \to \infty,$$

so all pairwise distances are approximately $\sqrt{2d}$. Nearest-neighbor methods fail: "near" and "far" become indistinguishable.

Random vectors are nearly orthogonal. For $X, Y \stackrel{\text{iid}}{\sim} \mathcal{N}(0, I_d)$:

$$\frac{X^T Y}{\|X\|\|Y\|} \xrightarrow{P} 0 \quad \text{as } d \to \infty.$$

A Gaussian random matrix has columns that are approximately orthogonal — this is the basis for random projections.
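Both effects, distance concentration and near-orthogonality, show up in a small simulation (a sketch; the point count and dimension are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 60, 10_000
X = rng.standard_normal((n, d))

i, j = np.triu_indices(n, k=1)            # index all pairs of distinct points
sq_dists = ((X[i] - X[j]) ** 2).sum(axis=1)
norms = np.linalg.norm(X, axis=1)
cosines = (X[i] * X[j]).sum(axis=1) / (norms[i] * norms[j])

# Squared distances cluster around 2d; cosine similarities cluster around 0.
print("||X-Y||^2/(2d): mean", round(sq_dists.mean() / (2 * d), 3),
      " rel. std", round(sq_dists.std() / (2 * d), 3))
print("cosine: mean", round(cosines.mean(), 3), " std", round(cosines.std(), 3))
```

Every pair is roughly $\sqrt{2d}$ apart and roughly perpendicular, as the two limit statements predict.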

Concentration of Measure for Gaussians

For $X \sim \mathcal{N}(0, I_d)$:

$$\left|\frac{\|X\|^2}{d} - 1\right| \leq O\!\left(\sqrt{\frac{\log(1/\delta)}{d}}\right) \quad \text{with probability } \geq 1-\delta.$$

More precisely, $\|X\|^2 \sim \chi^2(d)$ and $\text{Var}(\|X\|^2/d) = 2/d \to 0$. The distribution of $\|X\|^2/d$ concentrates tightly around 1 as $d$ grows.

Gaussian Lipschitz concentration: for $f: \mathbb{R}^d \to \mathbb{R}$ with $\|\nabla f\| \leq L$ everywhere (Lipschitz constant $L$):

$$P\!\left(|f(X) - \mathbb{E}[f(X)]| \geq t\right) \leq 2e^{-t^2/(2L^2)}.$$

This is dimension-free: the bound depends only on $L$ and $t$, not on $d$. It implies that any Lipschitz function of a high-dimensional Gaussian concentrates tightly.
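For example, $f(x) = \|x\|$ is 1-Lipschitz, so its fluctuations stay $O(1)$ even as its mean grows like $\sqrt{d}$. A sketch (the dimensions and sample count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)

# f(x) = ||x|| is 1-Lipschitz, so the std of f(X) stays O(1) in every
# dimension, even though E[f(X)] ~ sqrt(d) grows without bound.
for d in [10, 1_000, 10_000]:
    norms = np.linalg.norm(rng.standard_normal((2_000, d)), axis=1)
    print(f"d={d:6d}  E||X|| ~ {norms.mean():7.2f}  std ~ {norms.std():.3f}")
```

The standard deviation hovers around $1/\sqrt{2} \approx 0.7$ at every dimension, which is the dimension-free behavior the inequality promises.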

Sparse Recovery and the LASSO

When the number of features $p \gg n$ (far more features than observations), classical OLS is impossible (the system is underdetermined). If the true coefficient vector $\beta^* \in \mathbb{R}^p$ has at most $s$ nonzero entries ($s$-sparse), recovery is possible.

LASSO (Least Absolute Shrinkage and Selection Operator):

$$\hat\beta = \arg\min_{\beta \in \mathbb{R}^p} \frac{1}{2n}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1.$$

The $\ell_1$ penalty induces sparsity: the LASSO solution has many exactly-zero entries.

Oracle inequality: under the restricted eigenvalue (RE) condition, with $\lambda = C\sigma\sqrt{\log p / n}$:

$$\|\hat\beta - \beta^*\|_2^2 \leq C' s \frac{\sigma^2 \log p}{n}.$$

This is the minimax-optimal rate for $s$-sparse estimation in $\mathbb{R}^p$ — the $\log p$ factor is unavoidable (it comes from the multiple-testing penalty for searching over $p$ features).

Comparison: OLS achieves $\|\hat\beta_{\text{OLS}} - \beta^*\|_2^2 \approx p\sigma^2/n$ when $p < n$. LASSO achieves $s\sigma^2 \log p/n$ — a factor $p/(s\log p)$ better.
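A minimal sparse-recovery sketch. This solves the LASSO by proximal gradient descent (ISTA) rather than a packaged solver; the $\lambda = 2\sigma\sqrt{\log p / n}$ choice follows the theory above, and the problem sizes are illustrative:

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1 (componentwise soft-thresholding)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, lam, iters=2000):
    """Minimize (1/2n)||y - X b||^2 + lam * ||b||_1 by proximal gradient."""
    n, p = X.shape
    L = np.linalg.norm(X, 2) ** 2 / n     # Lipschitz constant of the smooth part
    b = np.zeros(p)
    for _ in range(iters):
        grad = X.T @ (X @ b - y) / n
        b = soft_threshold(b - grad / L, lam / L)
    return b

# n = 100 samples, p = 500 features, s = 5 true nonzeros: p >> n, but s << n.
rng = np.random.default_rng(3)
n, p, s, sigma = 100, 500, 5, 0.5
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:s] = 3.0
y = X @ beta_true + sigma * rng.standard_normal(n)

lam = 2 * sigma * np.sqrt(np.log(p) / n)  # theory-driven regularization level
beta_hat = lasso_ista(X, y, lam)
print("entries with |beta_hat| > 1:", int(np.sum(np.abs(beta_hat) > 1.0)))
print("L2 error:", round(float(np.linalg.norm(beta_hat - beta_true)), 3))
```

Despite having 5x more features than samples, the estimate lands on the true 5-coordinate support, with the usual soft-thresholding shrinkage bias of order $\lambda$ on each active coefficient.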

Restricted Isometry Property and Compressed Sensing

RIP (Restricted Isometry Property): a matrix $A \in \mathbb{R}^{m \times p}$ satisfies the $s$-RIP with constant $\delta_s \in (0,1)$ if:

$$(1-\delta_s)\|\beta\|_2^2 \leq \|A\beta\|_2^2 \leq (1+\delta_s)\|\beta\|_2^2 \quad \text{for all } s\text{-sparse } \beta.$$

The matrix $A$ preserves the lengths of all sparse vectors — it acts as a near-isometry on the union of all $s$-dimensional coordinate subspaces.

Random matrices satisfy RIP. If $A_{ij} \stackrel{\text{iid}}{\sim} \mathcal{N}(0, 1/m)$ and $m \geq Cs\log(p/s)$, then $A$ satisfies the $2s$-RIP with constant $\delta_{2s} < \sqrt{2}-1$ with high probability.
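RIP is a uniform statement over all sparse vectors and cannot be certified by sampling, but spot-checking random sparse vectors illustrates the near-isometry (a sketch; the sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
m, p, s = 200, 2000, 10

# Gaussian measurement matrix with N(0, 1/m) entries.
A = rng.standard_normal((m, p)) / np.sqrt(m)

# Spot-check ||A b||^2 / ||b||^2 on random s-sparse vectors b.
ratios = []
for _ in range(1000):
    beta = np.zeros(p)
    support = rng.choice(p, size=s, replace=False)
    beta[support] = rng.standard_normal(s)
    ratios.append(np.linalg.norm(A @ beta) ** 2 / np.linalg.norm(beta) ** 2)
ratios = np.array(ratios)
print(f"||A b||^2/||b||^2 over random {s}-sparse b: "
      f"min={ratios.min():.2f}, max={ratios.max():.2f}")
```

Even though $A$ maps $\mathbb{R}^{2000}$ down to $\mathbb{R}^{200}$, sparse vectors keep their length to within a modest factor, consistent with a small $\delta_s$.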

Compressed sensing recovery (Candès, Romberg & Tao 2006; Donoho 2006): if $A$ satisfies the $2s$-RIP with $\delta_{2s} < \sqrt{2}-1$, then the $\ell_1$ minimization program:

$$\hat\beta = \arg\min_\beta \|\beta\|_1 \quad \text{subject to } A\beta = y$$

recovers the true $s$-sparse $\beta^*$ exactly from $m = O(s\log(p/s))$ measurements.

Key insight: the number of measurements $m$ is proportional to the sparsity $s$ times $\log(p/s)$, not the ambient dimension $p$. For $s = 100$, $p = 10^6$: only $m \approx 100 \cdot \log(10^4) \approx 1000$ measurements are needed to recover a million-dimensional sparse signal.
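The $\ell_1$ program is a linear program after splitting $\beta = u - v$ with $u, v \geq 0$: minimize $\mathbf{1}^T(u+v)$ subject to $A(u-v)=y$. A sketch using SciPy's `linprog` (sizes kept small so the LP solves quickly; the measurement count $m=60$ is an illustrative choice well above $s\log(p/s)$):

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(5)
p, s, m = 200, 5, 60                      # m << p measurements

beta_true = np.zeros(p)
beta_true[rng.choice(p, size=s, replace=False)] = 3 * rng.standard_normal(s)
A = rng.standard_normal((m, p)) / np.sqrt(m)
y = A @ beta_true                          # noiseless measurements

# l1 minimization as an LP over [u; v] >= 0 with beta = u - v:
# minimize sum(u) + sum(v) subject to A u - A v = y.
c = np.ones(2 * p)
A_eq = np.hstack([A, -A])
res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None))
beta_hat = res.x[:p] - res.x[p:]
print("recovery error:", np.linalg.norm(beta_hat - beta_true))
```

With only 60 linear measurements of a 200-dimensional 5-sparse signal, the LP returns $\beta^*$ to solver precision, which is the exact-recovery phenomenon the theorem describes.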

Johnson-Lindenstrauss Lemma

JL lemma: for any $n$ points $x_1, \ldots, x_n \in \mathbb{R}^p$ and $\varepsilon \in (0, 1)$, there exists a linear map $f: \mathbb{R}^p \to \mathbb{R}^k$ with $k = O(\log n/\varepsilon^2)$ such that:

$$(1-\varepsilon)\|x_i - x_j\|_2^2 \leq \|f(x_i) - f(x_j)\|_2^2 \leq (1+\varepsilon)\|x_i - x_j\|_2^2 \quad \forall i, j.$$

A Gaussian random matrix $f(x) = \frac{1}{\sqrt{k}} G x$ with $G_{ij} \stackrel{\text{iid}}{\sim} \mathcal{N}(0,1)$ achieves this with high probability.

Key feature: $k$ depends on the number of points $n$, not the ambient dimension $p$! For $n = 1000$ points with $\varepsilon = 0.1$: $k = O(\log 1000 / 0.01)$, on the order of a few thousand dimensions, regardless of whether $p = 10^6$ or $p = 10^{12}$.

JL implies that dimensionality reduction via random projection preserves all pairwise distances up to a $1 \pm \varepsilon$ factor — the basis for random projections in ML, locality-sensitive hashing, and random features for kernel approximation.
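A random-projection sketch (the point count, ambient dimension, and target dimension are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, k = 60, 10_000, 1_000

X = rng.standard_normal((n, p))           # n points in ambient dimension p
G = rng.standard_normal((k, p))
Y = X @ G.T / np.sqrt(k)                  # project every point down to k dims

# Compare all pairwise distances before and after projection.
i, j = np.triu_indices(n, k=1)
orig = np.linalg.norm(X[i] - X[j], axis=1)
proj = np.linalg.norm(Y[i] - Y[j], axis=1)
ratio = proj / orig
print(f"distance ratios after projection: "
      f"min={ratio.min():.3f}, max={ratio.max():.3f}")
```

A 10x reduction in dimension leaves every one of the $\binom{60}{2}$ pairwise distances within a few percent of its original value.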

Spiked Covariance and the BBP Transition

Consider data $X_1, \ldots, X_n \stackrel{\text{iid}}{\sim} \mathcal{N}(0, \Sigma)$ where $\Sigma = \sum_{k=1}^r \lambda_k v_k v_k^T + I_d$ has $r$ "spikes" of size $\lambda_k$ above the noise floor.

Marchenko-Pastur law: in the null case $\Sigma = I_d$, the empirical spectral distribution of $\frac{1}{n}X^TX$ (with $p/n \to \gamma$) converges to the Marchenko-Pastur distribution supported on $[(1-\sqrt\gamma)^2, (1+\sqrt\gamma)^2]$.

BBP transition (Baik, Ben Arous & Péché, 2005): for a single spike $\lambda_1$:

  • If $\lambda_1 \leq \sqrt\gamma$ (weak spike): the top sample eigenvector is essentially random — it carries no information about $v_1$.
  • If $\lambda_1 > \sqrt\gamma$ (strong spike): the top sample eigenvalue separates from the Marchenko-Pastur bulk, and the sample eigenvector aligns with $v_1$.

Implication for PCA: sample PCA is informative only when the signal-to-noise ratio (spike magnitude relative to noise) exceeds the BBP threshold $\sqrt\gamma = \sqrt{p/n}$. For $p = n$ (one observation per dimension), no spike smaller than 1 is detectable.

Worked Example

Example 1: LASSO vs OLS

Gene expression study: $n = 200$ patients, $p = 20{,}000$ genes, true model has $s = 50$ relevant genes.

OLS: infeasible ($p > n$). Even interpolating with a pseudoinverse, the error scales like $p\sigma^2/n = 20000\sigma^2/200 = 100\sigma^2$.

LASSO with $\lambda = 2\sigma\sqrt{\log(20000)/200} \approx 2\sigma \cdot 0.22 \approx 0.45\sigma$:

Oracle rate: MSE $\leq Cs\sigma^2\log(p)/n = C \cdot 50 \cdot \sigma^2 \cdot 9.9/200 \approx 2.5C\sigma^2$.

The LASSO achieves an MSE roughly 40 times smaller than OLS (for large $p/s$). In practice, cross-validated LASSO is a standard tool for this high-dimensional setting.

Example 2: Compressed Sensing for MRI

In MRI, the full signal $x \in \mathbb{R}^p$ with $p \approx 256^2 \approx 65{,}000$ pixels is sparse in the wavelet domain: $\beta = W x$ is approximately $s$-sparse with $s \ll p$.

The MRI scanner collects Fourier measurements $y = Fx$ (frequency-domain samples). Instead of $p$ measurements, take only $m = O(s\log(p/s))$ random frequency samples.

With $s = 1000$ sparse wavelet coefficients and $p = 65000$: $m \approx 1000 \cdot \log(65) \approx 4200$ measurements instead of 65000 — a 15× reduction in scan time, recovering the image by solving the compressed sensing $\ell_1$ minimization.

Example 3: BBP Threshold in Practice

Simulation: $n = 200$, $p = 100$ ($\gamma = p/n = 0.5$), so the BBP threshold is $\sqrt\gamma \approx 0.71$.

Population covariance: $\Sigma = I + \lambda_1 v_1 v_1^T$. For $\lambda_1 = 0.5 < 0.71$: the top sample eigenvector is nearly orthogonal to $v_1$ (angle $\sim 90°$) — useless for direction estimation. For $\lambda_1 = 2 > 0.71$: the top sample eigenvector aligns substantially with $v_1$, and the alignment improves as the spike strengthens.

Practical implication: with $p/n \approx 0.5$ (a common ratio in genomics), PCA can only detect components with signal-to-noise ratio above $\approx 0.71$. Weak signals below this threshold are indistinguishable from noise even in large samples.
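A sketch of that simulation. For simplicity it plants the spike along $e_1$ (so the spiked covariance just rescales the first coordinate); the trial count is an illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 200, 100                           # gamma = p/n = 0.5, threshold ~ 0.71

def alignment(lam, trials=10):
    """Mean |<e_1, top sample eigenvector>| for Sigma = I + lam * e_1 e_1^T."""
    vals = []
    for _ in range(trials):
        X = rng.standard_normal((n, p))
        X[:, 0] *= np.sqrt(1 + lam)       # spike along e_1: Var(X_1) = 1 + lam
        S = X.T @ X / n                   # sample covariance
        top = np.linalg.eigh(S)[1][:, -1] # eigenvector of the largest eigenvalue
        vals.append(abs(top[0]))
    return float(np.mean(vals))

print("weak spike   lam=0.5:", round(alignment(0.5), 3))
print("strong spike lam=2.0:", round(alignment(2.0), 3))
```

Below the threshold the overlap with $e_1$ stays small; above it, the top eigenvector picks up most of the planted direction, mirroring the BBP phase transition.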

Connections

Where Your Intuition Breaks

The LASSO oracle inequality is a beautiful result — but it rests on the restricted eigenvalue condition on the design matrix $X$: columns corresponding to the true support must not be too correlated with columns outside the support. For random Gaussian designs, this holds with high probability. For real-world feature matrices — gene expression data, text features, financial signals — highly correlated features are the norm, not the exception, and the RE condition can fail. When it does, the LASSO solution may not converge to the true $\beta^*$ at the oracle rate regardless of sample size. Practitioners often turn to variants (elastic net, group Lasso, adaptive Lasso) for correlated features, but the clean $s\sigma^2\log p/n$ rate applies only under structural conditions that must be verified, not assumed.

💡Intuition

Sparsity is the key structural assumption in high dimensions. Without structure, $p$ unknown parameters require $\Omega(p)$ samples — unavoidable in the worst case (this is the minimax lower bound). Sparsity with $s \ll p$ reduces the effective number of unknowns to $s$, yielding rates proportional to $s$, not $p$. The $\log p$ factor is the price of not knowing which $s$ coordinates are nonzero — it reflects a multiple-testing penalty for searching over all $\binom{p}{s}$ possible support sets. Other structural assumptions (low-rank matrices, group sparsity, tree sparsity) give analogous results with different penalty terms.

💡Intuition

The JL lemma says geometry survives random projection. Keeping only $k = O(\log n)$ dimensions is enough to preserve all pairwise distances, regardless of $p$. This is not obvious: projecting from $p = 10^6$ down to $k = 3000$ dimensions seems like destroying information. But the key is that only $\binom{n}{2}$ pairwise distances need to be preserved — a fixed set of constraints regardless of $p$. Random projections work because a random $k$-dimensional subspace is "spread" across all $p$ original directions, rather than aligned with any particular low-dimensional subspace.

⚠️Warning

The RIP is hard to verify in practice. While random matrices satisfy RIP with high probability, real measurement matrices (medical devices, sensor arrays) are deterministic and structured. Checking whether a given matrix satisfies RIP is NP-hard in general. In practice, practitioners rely on randomized or structured designs (partial Fourier, partial Hadamard) that are known to satisfy RIP-type guarantees. Additionally, compressed sensing guarantees exact recovery only for exactly sparse signals; approximate sparsity gives approximate recovery with error proportional to the tail of $\beta^*$.
