
High-Dimensional Statistics: Sparsity, RIP & Compressed Sensing

In high dimensions, intuitions from low-dimensional statistics break down: volume concentrates on spherical shells, random vectors become nearly orthogonal, and classical estimators fail. The restricted isometry property and compressed sensing show that sparse signals can be recovered exactly from far fewer measurements than classical theory would require.

Concepts

Concentration of measure: for $X \sim \mathcal{N}(0, I_d)$, the normalized squared norm $\|X\|^2/d$ concentrates tightly around 1 as dimension $d$ grows.

Concentration: $\text{Var}(\|X\|^2/d) = 2/d \to 0$. At $d=500$, the distribution is a spike. This means random Gaussian vectors are nearly orthogonal in high dimensions: $\mathbb{E}[X \cdot Y] = 0$ but $\|X\| \approx \|Y\| \approx \sqrt{d}$ — all points live on a thin spherical shell.
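This concentration is easy to check numerically. A minimal sketch using NumPy (sample counts and dimensions are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# For X ~ N(0, I_d), ||X||^2/d has mean 1 and variance 2/d,
# so the distribution collapses onto 1 as d grows.
for d in [2, 50, 500]:
    X = rng.standard_normal((100_000, d))
    q = (X ** 2).sum(axis=1) / d          # ||X||^2 / d for each sample
    print(f"d={d:3d}  mean={q.mean():.3f}  var={q.var():.4f}  theory 2/d={2/d:.4f}")
```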

The intuitions you built for 2D and 3D geometry — that the center of a ball holds most of its volume, that nearby points are clearly closer than far ones, that "random" is spread uniformly — break catastrophically in high dimensions. A dataset of points in 1000-dimensional feature space inhabits a world where every point is nearly orthogonal to every other, nearest and farthest neighbors are almost equidistant, and essentially all volume lives in a thin shell at the boundary of any ball.

The Curse of Dimensionality

Volume concentrates on the sphere. For $X \sim \text{Uniform}(\mathbb{B}^d)$ (the unit ball in $\mathbb{R}^d$), the radial distribution is $P(r < \|X\| \leq r + dr) \propto r^{d-1}$. Almost all volume is near the boundary:

$$P\!\left(\|X\| \geq 1 - \varepsilon\right) = 1 - (1-\varepsilon)^d \to 1 \quad \text{for any fixed } \varepsilon > 0 \text{ as } d \to \infty.$$

The fraction of volume within distance $\varepsilon$ of the center is $(1-\varepsilon)^d \leq e^{-\varepsilon d} \to 0$.
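The shell fraction $1-(1-\varepsilon)^d$ can be tabulated directly (a quick sketch; $\varepsilon = 0.01$ is an arbitrary choice):

```python
# Fraction of the unit ball's volume within distance eps of the boundary,
# computed from the exact formula 1 - (1 - eps)^d.
eps = 0.01
for d in [3, 100, 1000]:
    shell = 1 - (1 - eps) ** d
    print(f"d={d:4d}  fraction within {eps} of the boundary: {shell:.4f}")
```

By $d = 1000$ essentially all of the volume sits in that 1%-thick shell, matching the $e^{-\varepsilon d}$ bound.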

The $r^{d-1}$ factor in the volume element $r^{d-1}\,dr$ is the geometric reason intuitions fail. In $d=3$ the surface area grows as $r^2$, modest; in $d=1000$ the surface density grows as $r^{999}$, so the interior is negligible. Every algorithm that relies on finding "typical" points near the center of a distribution — density estimation, nearest-neighbor classification, $k$-means — implicitly assumes low-dimensional geometry and degrades unless the intrinsic dimensionality of the data is much lower than the ambient dimension $d$.

Distances concentrate. For $X, Y \stackrel{\text{iid}}{\sim} \mathcal{N}(0, I_d)$:

$$\frac{\|X - Y\|^2}{2d} \xrightarrow{P} 1 \quad \text{as } d \to \infty,$$

so all pairwise distances are approximately $\sqrt{2d}$. Nearest-neighbor methods fail: "near" and "far" become indistinguishable.

Random vectors are nearly orthogonal. For $X, Y \stackrel{\text{iid}}{\sim} \mathcal{N}(0, I_d)$:

$$\frac{X^T Y}{\|X\|\|Y\|} \xrightarrow{P} 0 \quad \text{as } d \to \infty.$$

A Gaussian random matrix has columns that are approximately orthogonal — this is the basis for random projections.
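Both effects, distance concentration and near-orthogonality, show up in a small simulation (a sketch; the point count and dimension are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 60, 10_000
X = rng.standard_normal((n, d))

i, j = np.triu_indices(n, k=1)            # index all pairs of distinct points
sq_dists = ((X[i] - X[j]) ** 2).sum(axis=1)
norms = np.linalg.norm(X, axis=1)
cosines = (X[i] * X[j]).sum(axis=1) / (norms[i] * norms[j])

# Squared distances cluster around 2d; cosine similarities cluster around 0.
print("||X-Y||^2/(2d): mean", round(sq_dists.mean() / (2 * d), 3),
      " rel. std", round(sq_dists.std() / (2 * d), 3))
print("cosine: mean", round(cosines.mean(), 3), " std", round(cosines.std(), 3))
```

Every pair is roughly $\sqrt{2d}$ apart and roughly perpendicular, as the two limit statements predict.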

Concentration of Measure for Gaussians

For $X \sim \mathcal{N}(0, I_d)$:

$$\left|\frac{\|X\|^2}{d} - 1\right| \leq O\!\left(\sqrt{\frac{\log(1/\delta)}{d}}\right) \quad \text{with probability } \geq 1-\delta.$$

More precisely, $\|X\|^2 \sim \chi^2(d)$ and $\text{Var}(\|X\|^2/d) = 2/d \to 0$. The distribution of $\|X\|^2/d$ concentrates tightly around 1 as $d$ grows.

Gaussian Lipschitz concentration: for $f: \mathbb{R}^d \to \mathbb{R}$ with $\|\nabla f\| \leq L$ everywhere (Lipschitz constant $L$):

$$P\!\left(|f(X) - \mathbb{E}[f(X)]| \geq t\right) \leq 2e^{-t^2/(2L^2)}.$$

This is dimension-free: the bound depends only on $L$ and $t$, not on $d$. It implies that any Lipschitz function of a high-dimensional Gaussian concentrates tightly.
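For example, $f(x) = \|x\|$ is 1-Lipschitz, so its fluctuations stay $O(1)$ even as its mean grows like $\sqrt{d}$. A sketch (the dimensions and sample count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)

# f(x) = ||x|| is 1-Lipschitz, so the std of f(X) stays O(1) in every
# dimension, even though E[f(X)] ~ sqrt(d) grows without bound.
for d in [10, 1_000, 10_000]:
    norms = np.linalg.norm(rng.standard_normal((2_000, d)), axis=1)
    print(f"d={d:6d}  E||X|| ~ {norms.mean():7.2f}  std ~ {norms.std():.3f}")
```

The standard deviation hovers around $1/\sqrt{2} \approx 0.7$ at every dimension, which is the dimension-free behavior the inequality promises.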

Sparse Recovery and the LASSO

When the number of features $p \gg n$ (far more features than observations), classical OLS is impossible (the system is underdetermined). If the true coefficient vector $\beta^* \in \mathbb{R}^p$ has at most $s$ nonzero entries ($s$-sparse), recovery is possible.

LASSO (Least Absolute Shrinkage and Selection Operator):

$$\hat\beta = \arg\min_{\beta \in \mathbb{R}^p} \frac{1}{2n}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1.$$

The $\ell_1$ penalty induces sparsity: the LASSO solution has many exactly-zero entries.

Oracle inequality: under the restricted eigenvalue (RE) condition, with $\lambda = C\sigma\sqrt{\log p / n}$:

$$\|\hat\beta - \beta^*\|_2^2 \leq C' s \frac{\sigma^2 \log p}{n}.$$

This is the minimax-optimal rate for $s$-sparse estimation in $\mathbb{R}^p$ — the $\log p$ factor is unavoidable (it comes from the multiple-testing penalty for searching over $p$ features).

Comparison: OLS achieves $\|\hat\beta_{\text{OLS}} - \beta^*\|_2^2 \approx p\sigma^2/n$ when $p < n$. LASSO achieves $s\sigma^2 \log p/n$ — a factor $p/(s\log p)$ better.
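A minimal sparse-recovery sketch. This solves the LASSO by proximal gradient descent (ISTA) rather than a packaged solver; the $\lambda = 2\sigma\sqrt{\log p / n}$ choice follows the theory above, and the problem sizes are illustrative:

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1 (componentwise soft-thresholding)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, lam, iters=2000):
    """Minimize (1/2n)||y - X b||^2 + lam * ||b||_1 by proximal gradient."""
    n, p = X.shape
    L = np.linalg.norm(X, 2) ** 2 / n     # Lipschitz constant of the smooth part
    b = np.zeros(p)
    for _ in range(iters):
        grad = X.T @ (X @ b - y) / n
        b = soft_threshold(b - grad / L, lam / L)
    return b

# n = 100 samples, p = 500 features, s = 5 true nonzeros: p >> n, but s << n.
rng = np.random.default_rng(3)
n, p, s, sigma = 100, 500, 5, 0.5
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:s] = 3.0
y = X @ beta_true + sigma * rng.standard_normal(n)

lam = 2 * sigma * np.sqrt(np.log(p) / n)  # theory-driven regularization level
beta_hat = lasso_ista(X, y, lam)
print("entries with |beta_hat| > 1:", int(np.sum(np.abs(beta_hat) > 1.0)))
print("L2 error:", round(float(np.linalg.norm(beta_hat - beta_true)), 3))
```

Despite having 5x more features than samples, the estimate lands on the true 5-coordinate support, with the usual soft-thresholding shrinkage bias of order $\lambda$ on each active coefficient.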

Restricted Isometry Property and Compressed Sensing

RIP (Restricted Isometry Property): a matrix $A \in \mathbb{R}^{m \times p}$ satisfies the $s$-RIP with constant $\delta_s \in (0,1)$ if:

$$(1-\delta_s)\|\beta\|_2^2 \leq \|A\beta\|_2^2 \leq (1+\delta_s)\|\beta\|_2^2 \quad \text{for all } s\text{-sparse } \beta.$$

The matrix $A$ preserves the lengths of all sparse vectors — it acts as a near-isometry on the union of all $s$-dimensional coordinate subspaces.

Random matrices satisfy RIP. If $A_{ij} \stackrel{\text{iid}}{\sim} \mathcal{N}(0, 1/m)$ and $m \geq Cs\log(p/s)$, then $A$ satisfies the $2s$-RIP with constant $\delta_{2s} < \sqrt{2}-1$ with high probability.
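RIP is a uniform statement over all sparse vectors and cannot be certified by sampling, but spot-checking random sparse vectors illustrates the near-isometry (a sketch; the sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
m, p, s = 200, 2000, 10

# Gaussian measurement matrix with N(0, 1/m) entries.
A = rng.standard_normal((m, p)) / np.sqrt(m)

# Spot-check ||A b||^2 / ||b||^2 on random s-sparse vectors b.
ratios = []
for _ in range(1000):
    beta = np.zeros(p)
    support = rng.choice(p, size=s, replace=False)
    beta[support] = rng.standard_normal(s)
    ratios.append(np.linalg.norm(A @ beta) ** 2 / np.linalg.norm(beta) ** 2)
ratios = np.array(ratios)
print(f"||A b||^2/||b||^2 over random {s}-sparse b: "
      f"min={ratios.min():.2f}, max={ratios.max():.2f}")
```

Even though $A$ maps $\mathbb{R}^{2000}$ down to $\mathbb{R}^{200}$, sparse vectors keep their length to within a modest factor, consistent with a small $\delta_s$.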

Compressed sensing recovery (Candès, Romberg & Tao 2006; Donoho 2006): if $A$ satisfies the $2s$-RIP with $\delta_{2s} < \sqrt{2}-1$, then the $\ell_1$ minimization program:

$$\hat\beta = \arg\min_\beta \|\beta\|_1 \quad \text{subject to } A\beta = y$$

recovers the true $s$-sparse $\beta^*$ exactly from $m = O(s\log(p/s))$ measurements.

Key insight: the number of measurements $m$ is proportional to the sparsity $s$ times $\log(p/s)$, not the ambient dimension $p$. For $s = 100$, $p = 10^6$: only $m \approx 100 \cdot \log(10^4) \approx 1000$ measurements are needed to recover a million-dimensional sparse signal.
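The $\ell_1$ program is a linear program after splitting $\beta = u - v$ with $u, v \geq 0$: minimize $\mathbf{1}^T(u+v)$ subject to $A(u-v)=y$. A sketch using SciPy's `linprog` (sizes kept small so the LP solves quickly; the measurement count $m=60$ is an illustrative choice well above $s\log(p/s)$):

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(5)
p, s, m = 200, 5, 60                      # m << p measurements

beta_true = np.zeros(p)
beta_true[rng.choice(p, size=s, replace=False)] = 3 * rng.standard_normal(s)
A = rng.standard_normal((m, p)) / np.sqrt(m)
y = A @ beta_true                          # noiseless measurements

# l1 minimization as an LP over [u; v] >= 0 with beta = u - v:
# minimize sum(u) + sum(v) subject to A u - A v = y.
c = np.ones(2 * p)
A_eq = np.hstack([A, -A])
res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None))
beta_hat = res.x[:p] - res.x[p:]
print("recovery error:", np.linalg.norm(beta_hat - beta_true))
```

With only 60 linear measurements of a 200-dimensional 5-sparse signal, the LP returns $\beta^*$ to solver precision, which is the exact-recovery phenomenon the theorem describes.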

Johnson-Lindenstrauss Lemma

JL lemma: for any $n$ points $x_1, \ldots, x_n \in \mathbb{R}^p$ and $\varepsilon \in (0, 1)$, there exists a linear map $f: \mathbb{R}^p \to \mathbb{R}^k$ with $k = O(\log n/\varepsilon^2)$ such that:

$$(1-\varepsilon)\|x_i - x_j\|_2^2 \leq \|f(x_i) - f(x_j)\|_2^2 \leq (1+\varepsilon)\|x_i - x_j\|_2^2 \quad \forall i, j.$$

A Gaussian random matrix $f(x) = \frac{1}{\sqrt{k}} G x$ with $G_{ij} \stackrel{\text{iid}}{\sim} \mathcal{N}(0,1)$ achieves this with high probability.

Key feature: $k$ depends on the number of points $n$, not the ambient dimension $p$! For $n = 1000$ points with $\varepsilon = 0.1$: $k = O(\log 1000 / 0.01)$, on the order of a few thousand dimensions, regardless of whether $p = 10^6$ or $p = 10^{12}$.

JL implies that dimensionality reduction via random projection preserves all pairwise distances up to a $1 \pm \varepsilon$ factor — the basis for random projections in ML, locality-sensitive hashing, and random features for kernel approximation.
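A random-projection sketch (the point count, ambient dimension, and target dimension are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, k = 60, 10_000, 1_000

X = rng.standard_normal((n, p))           # n points in ambient dimension p
G = rng.standard_normal((k, p))
Y = X @ G.T / np.sqrt(k)                  # project every point down to k dims

# Compare all pairwise distances before and after projection.
i, j = np.triu_indices(n, k=1)
orig = np.linalg.norm(X[i] - X[j], axis=1)
proj = np.linalg.norm(Y[i] - Y[j], axis=1)
ratio = proj / orig
print(f"distance ratios after projection: "
      f"min={ratio.min():.3f}, max={ratio.max():.3f}")
```

A 10x reduction in dimension leaves every one of the $\binom{60}{2}$ pairwise distances within a few percent of its original value.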

Spiked Covariance and the BBP Transition

Consider data $X_1, \ldots, X_n \stackrel{\text{iid}}{\sim} \mathcal{N}(0, \Sigma)$ where $\Sigma = \sum_{k=1}^r \lambda_k v_k v_k^T + I_d$ has $r$ "spikes" of size $\lambda_k$ above the noise floor.

Marchenko-Pastur law: in the null case $\Sigma = I_d$, the empirical spectral distribution of $\frac{1}{n}X^TX$ (with $p/n \to \gamma$) converges to the Marchenko-Pastur distribution supported on $[(1-\sqrt\gamma)^2, (1+\sqrt\gamma)^2]$.

BBP transition (Baik, Ben Arous & Péché, 2005): for a single spike $\lambda_1$:

  • If $\lambda_1 \leq \sqrt\gamma$ (weak spike): the top sample eigenvector is essentially random — it carries no information about $v_1$.
  • If $\lambda_1 > \sqrt\gamma$ (strong spike): the top sample eigenvalue separates from the Marchenko-Pastur bulk, and the sample eigenvector aligns with $v_1$.

Implication for PCA: sample PCA is informative only when the signal-to-noise ratio (spike magnitude relative to noise) exceeds the BBP threshold $\sqrt\gamma = \sqrt{p/n}$. For $p = n$ (one observation per dimension), no spike smaller than 1 is detectable.

Worked Example

Example 1: LASSO vs OLS

Gene expression study: $n = 200$ patients, $p = 20{,}000$ genes, true model has $s = 50$ relevant genes.

OLS: infeasible ($p > n$). Even interpolating with a pseudoinverse, the error scales like $p\sigma^2/n = 20000\sigma^2/200 = 100\sigma^2$.

LASSO with $\lambda = 2\sigma\sqrt{\log(20000)/200} \approx 2\sigma \cdot 0.22 \approx 0.45\sigma$:

Oracle rate: MSE $\leq Cs\sigma^2\log(p)/n = C \cdot 50 \cdot \sigma^2 \cdot 9.9/200 \approx 2.5C\sigma^2$.

The LASSO achieves an MSE roughly 40 times smaller than OLS (for large $p/s$). In practice, cross-validated LASSO is a standard tool for this high-dimensional setting.

Example 2: Compressed Sensing for MRI

In MRI, the full signal $x \in \mathbb{R}^p$ with $p \approx 256^2 \approx 65{,}000$ pixels is sparse in the wavelet domain: $\beta = W x$ is approximately $s$-sparse with $s \ll p$.

The MRI scanner collects Fourier measurements $y = Fx$ (frequency-domain samples). Instead of $p$ measurements, take only $m = O(s\log(p/s))$ random frequency samples.

With $s = 1000$ sparse wavelet coefficients and $p = 65000$: $m \approx 1000 \cdot \log(65) \approx 4200$ measurements instead of 65000 — a 15× reduction in scan time, recovering the image by solving the compressed sensing $\ell_1$ minimization.

Example 3: BBP Threshold in Practice

Simulation: $n = 200$, $p = 100$ ($\gamma = p/n = 0.5$), so the BBP threshold is $\sqrt\gamma \approx 0.71$.

Population covariance: $\Sigma = I + \lambda_1 v_1 v_1^T$. For $\lambda_1 = 0.5 < 0.71$: the top sample eigenvector is nearly orthogonal to $v_1$ (angle $\sim 90°$) — useless for direction estimation. For $\lambda_1 = 2 > 0.71$: the top sample eigenvector aligns substantially with $v_1$, and the alignment improves as the spike strengthens.

Practical implication: with $p/n \approx 0.5$ (a common ratio in genomics), PCA can only detect components with signal-to-noise ratio above $\approx 0.71$. Weak signals below this threshold are indistinguishable from noise even in large samples.
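A sketch of that simulation. For simplicity it plants the spike along $e_1$ (so the spiked covariance just rescales the first coordinate); the trial count is an illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 200, 100                           # gamma = p/n = 0.5, threshold ~ 0.71

def alignment(lam, trials=10):
    """Mean |<e_1, top sample eigenvector>| for Sigma = I + lam * e_1 e_1^T."""
    vals = []
    for _ in range(trials):
        X = rng.standard_normal((n, p))
        X[:, 0] *= np.sqrt(1 + lam)       # spike along e_1: Var(X_1) = 1 + lam
        S = X.T @ X / n                   # sample covariance
        top = np.linalg.eigh(S)[1][:, -1] # eigenvector of the largest eigenvalue
        vals.append(abs(top[0]))
    return float(np.mean(vals))

print("weak spike   lam=0.5:", round(alignment(0.5), 3))
print("strong spike lam=2.0:", round(alignment(2.0), 3))
```

Below the threshold the overlap with $e_1$ stays small; above it, the top eigenvector picks up most of the planted direction, mirroring the BBP phase transition.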

Connections

Where Your Intuition Breaks

The LASSO oracle inequality is a beautiful result — but it rests on the restricted eigenvalue condition on the design matrix $X$: columns corresponding to the true support must not be too correlated with columns outside the support. For random Gaussian designs, this holds with high probability. For real-world feature matrices — gene expression data, text features, financial signals — highly correlated features are the norm, not the exception, and the RE condition can fail. When it does, the LASSO solution may not converge to the true $\beta^*$ at the oracle rate regardless of sample size. Practitioners often turn to variants (elastic net, group Lasso, adaptive Lasso) for correlated features, but the clean $s\sigma^2\log p/n$ rate applies only under structural conditions that must be verified, not assumed.

💡Intuition

Sparsity is the key structural assumption in high dimensions. Without structure, $p$ unknown parameters require $\Omega(p)$ samples — unavoidable in the worst case (this is the minimax lower bound). Sparsity with $s \ll p$ reduces the effective number of unknowns to $s$, yielding rates proportional to $s$, not $p$. The $\log p$ factor is the price of not knowing which $s$ coordinates are nonzero — it reflects a multiple-testing penalty for searching over all $\binom{p}{s}$ possible support sets. Other structural assumptions (low-rank matrices, group sparsity, tree sparsity) give analogous results with different penalty terms.

💡Intuition

The JL lemma says geometry survives random projection. Keeping only $k = O(\log n)$ dimensions is enough to preserve all pairwise distances, regardless of $p$. This is not obvious: projecting from $p = 10^6$ down to $k = 3000$ dimensions seems like destroying information. But the key is that only $\binom{n}{2}$ pairwise distances need to be preserved — a fixed set of constraints regardless of $p$. Random projections work because a random $k$-dimensional subspace is "spread" across all $p$ original directions, rather than aligned with any particular low-dimensional subspace.

⚠️Warning

The RIP is hard to verify in practice. While random matrices satisfy RIP with high probability, real measurement matrices (medical devices, sensor arrays) are deterministic and structured. Checking whether a given matrix satisfies RIP is NP-hard in general. In practice, practitioners rely on randomized or structured designs (partial Fourier, partial Hadamard) that are known to satisfy RIP-type guarantees. Additionally, compressed sensing guarantees exact recovery only for exactly sparse signals; approximate sparsity gives approximate recovery with error proportional to the tail of $\beta^*$.
