
Positive Semidefinite Matrices & Quadratic Forms

Positive semidefinite (PSD) matrices are the matrix analogue of non-negative numbers: they define valid inner products, encode covariances, kernels, and curvatures, and form a convex cone under matrix addition and positive scaling. The PSD condition is the central constraint in semidefinite programming, Gaussian process models, and the analysis of loss surfaces in optimization. Understanding the multiple equivalent characterizations of PSD-ness — via eigenvalues, Cholesky, principal minors, and Gram matrices — is essential for both theory and numerical practice.

Concepts

PSD Cone — trace-2 symmetric matrices (interactive diagram)

[Interactive figure: a point $(b, \Delta)$ selects the trace-2 symmetric matrix $A = \begin{pmatrix}1+\Delta & b \\ b & 1-\Delta\end{pmatrix}$; the display shows $A$, its eigenvalues $\lambda_1, \lambda_2$, $\det(A)$, $\operatorname{tr}(A) = 2$ (fixed), and the Cholesky factor $A = LL^T$. The green disk is the PD cone cross-section (trace fixed = 2): drag inside for PD ($b^2 + \Delta^2 < 1$), on the boundary for PSD, outside for indefinite.]

A covariance matrix must not assign negative variance to any direction — that would be geometrically impossible. Positive semidefinite (PSD) matrices are the formal expression of this constraint: they are the matrix analogue of non-negative numbers, forming a convex cone. Every covariance matrix, kernel matrix, Gram matrix, and positive-curvature Hessian is PSD, and the theory of these objects hinges on this single constraint.

Definitions and Notation

A symmetric matrix $A \in \mathbb{R}^{n \times n}$ is:

| Name | Symbol | Condition |
|---|---|---|
| Positive definite | $A \succ 0$ | $\mathbf{x}^T A \mathbf{x} > 0$ for all $\mathbf{x} \neq \mathbf{0}$ |
| Positive semidefinite | $A \succeq 0$ | $\mathbf{x}^T A \mathbf{x} \geq 0$ for all $\mathbf{x}$ |
| Negative definite | $A \prec 0$ | $\mathbf{x}^T A \mathbf{x} < 0$ for all $\mathbf{x} \neq \mathbf{0}$ |
| Negative semidefinite | $A \preceq 0$ | $\mathbf{x}^T A \mathbf{x} \leq 0$ for all $\mathbf{x}$ |
| Indefinite | — | $\mathbf{x}^T A \mathbf{x}$ takes both positive and negative values |

The condition $\mathbf{x}^T A \mathbf{x} \geq 0$ for all $\mathbf{x}$ is exactly what ensures $A$ defines a valid squared length. A matrix violating it would assign negative "squared distance" to some direction — geometrically impossible, and numerically catastrophic (Cholesky factorization would fail, log-likelihoods would be undefined).

The ordering $A \preceq B$ (the Loewner order) means $B - A \succeq 0$. This is a partial order on symmetric matrices that is compatible with scalar ordering: $A \preceq B$ implies $\alpha A \preceq \alpha B$ for $\alpha \geq 0$, and $A \preceq B,\ B \preceq C \implies A \preceq C$.

Equivalent Characterizations

Theorem. For a symmetric $A \in \mathbb{R}^{n \times n}$, the following are equivalent:

  1. $A \succeq 0$ (definition)
  2. All eigenvalues $\lambda_i \geq 0$
  3. $A = B^T B$ for some $B \in \mathbb{R}^{k \times n}$ (Gram matrix form)
  4. All principal minors (not only the leading ones) are $\geq 0$
  5. A Cholesky factorization $A = LL^T$ exists (with possible zero diagonal entries in $L$)

For positive definite $A \succ 0$, all the inequalities become strict:

  • All $\lambda_i > 0$
  • All leading principal minors $\det(A_{1:k,1:k}) > 0$ (Sylvester's criterion)
  • The Cholesky factorization $A = LL^T$ with $L$ lower triangular and positive diagonal exists and is unique
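These characterizations are easy to check numerically. A minimal sketch using NumPy (the helper `classify_symmetric` and its tolerance are illustrative choices, not a standard API):

```python
import numpy as np

def classify_symmetric(A, tol=1e-10):
    """Classify a symmetric matrix by the signs of its eigenvalues."""
    w = np.linalg.eigvalsh(A)          # real eigenvalues, sorted ascending
    if w[0] > tol:
        return "positive definite"
    if w[0] >= -tol:
        return "positive semidefinite"
    if w[-1] < -tol:
        return "negative definite"
    if w[-1] <= tol:
        return "negative semidefinite"
    return "indefinite"

A = np.array([[2.0, 1.0], [1.0, 3.0]])
print(classify_symmetric(A))           # positive definite

# For PD matrices, Cholesky succeeds; for indefinite ones it raises LinAlgError.
L = np.linalg.cholesky(A)
assert np.allclose(L @ L.T, A)
```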

The PSD Cone

The set of $n \times n$ PSD matrices, denoted $\mathbb{S}^n_+$, is a convex cone:

  • $A, B \succeq 0 \implies \alpha A + \beta B \succeq 0$ for $\alpha, \beta \geq 0$
  • The set is closed and convex

For $n = 2$, a trace-2 symmetric matrix has the form $A = \begin{pmatrix}1+\Delta & b \\ b & 1-\Delta\end{pmatrix}$. It is PD iff $\det(A) = 1 - \Delta^2 - b^2 > 0$, i.e., $b^2 + \Delta^2 < 1$ — the PD cone cross-section (at fixed trace) is a disk (illustrated in the diagram at the top of this lesson).
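The disk condition can be checked directly; a quick numerical sketch (the helper `trace2_matrix` is purely illustrative):

```python
import numpy as np

def trace2_matrix(b, delta):
    """Trace-2 symmetric matrix parameterized by the point (b, delta)."""
    return np.array([[1.0 + delta, b], [b, 1.0 - delta]])

# Inside the unit disk (b^2 + delta^2 < 1): positive definite.
assert np.linalg.eigvalsh(trace2_matrix(0.3, 0.4)).min() > 0
# Outside the disk: determinant is negative, so the matrix is indefinite.
assert np.linalg.eigvalsh(trace2_matrix(0.8, 0.7)).min() < 0
```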

Gram Matrices and Kernel Matrices

Gram matrix. Given vectors $\mathbf{x}_1, \ldots, \mathbf{x}_n \in \mathbb{R}^d$, the Gram matrix $G \in \mathbb{R}^{n \times n}$ with $G_{ij} = \langle \mathbf{x}_i, \mathbf{x}_j \rangle$ is PSD. Proof: for any $\mathbf{c} \in \mathbb{R}^n$,

$$\mathbf{c}^T G \mathbf{c} = \sum_{i,j} c_i c_j \langle \mathbf{x}_i, \mathbf{x}_j \rangle = \left\langle \sum_i c_i \mathbf{x}_i, \sum_j c_j \mathbf{x}_j \right\rangle = \left\| \sum_i c_i \mathbf{x}_i \right\|^2 \geq 0.$$
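The Gram construction can be verified numerically; a minimal NumPy sketch with random data (the data and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))   # 5 vectors in R^3, one per row
G = X @ X.T                   # Gram matrix: G_ij = <x_i, x_j>, shape (5, 5)

# Every quadratic form c^T G c equals ||sum_i c_i x_i||^2 >= 0.
c = rng.normal(size=5)
quad = c @ G @ c
assert np.isclose(quad, np.linalg.norm(X.T @ c) ** 2)
assert quad >= 0

# Equivalently, the smallest eigenvalue is >= 0 (here rank <= 3, so some are ~0).
print(np.linalg.eigvalsh(G).min())
```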

Kernel matrix. A function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a positive definite kernel if for all $n$ and all $\{x_1, \ldots, x_n\} \subset \mathcal{X}$, the kernel (Gram) matrix $K_{ij} = k(x_i, x_j)$ is PSD. This is Mercer's condition. Examples:

| Kernel | Formula $k(x,y)$ | Notes |
|---|---|---|
| Linear | $x^T y$ | Gram matrix of the data |
| Polynomial | $(x^T y + c)^d$ | $c \geq 0$, $d \in \mathbb{Z}^+$ |
| Gaussian RBF | $\exp(-\lVert x-y\rVert^2 / 2\ell^2)$ | $\ell > 0$ length scale |
| Laplace | $\exp(-\lVert x-y\rVert / \ell)$ | Rougher than RBF |
| Matérn | (various) | Parameterized smoothness |
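As a sketch of Mercer's condition in practice, one can build an RBF kernel matrix and inspect its eigenvalues (the helper `rbf_kernel` is hand-rolled for illustration, not a library function):

```python
import numpy as np

def rbf_kernel(X, lengthscale=1.0):
    """Gaussian RBF kernel matrix: K_ij = exp(-||x_i - x_j||^2 / (2 l^2))."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T   # pairwise squared distances
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * lengthscale**2))

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 2))
K = rbf_kernel(X)

# Mercer's condition: the kernel matrix is PSD for any point set.
assert np.linalg.eigvalsh(K).min() > -1e-10
assert np.allclose(np.diag(K), 1.0)   # k(x, x) = 1 for the RBF kernel
```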

The Schur Complement

For a block symmetric matrix $M = \begin{pmatrix}A & B \\ B^T & C\end{pmatrix}$ with $A \succ 0$:

Theorem. $M \succeq 0 \iff C - B^T A^{-1} B \succeq 0$.

The matrix $S = C - B^T A^{-1} B$ is the Schur complement of $A$ in $M$.

Why it matters: The Schur complement arises in Gaussian conditioning. If $\begin{pmatrix}\mathbf{x}_1 \\ \mathbf{x}_2\end{pmatrix} \sim \mathcal{N}\left(\mathbf{0}, \begin{pmatrix}\Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22}\end{pmatrix}\right)$, then the conditional covariance of $\mathbf{x}_2 \mid \mathbf{x}_1$ is exactly the Schur complement $\Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12}$ — the PSD Schur complement theorem guarantees this is a valid (PSD) covariance.

Completing the square. The block factorization

$$\begin{pmatrix}A & B \\ B^T & C\end{pmatrix} = \begin{pmatrix}I & 0 \\ B^T A^{-1} & I\end{pmatrix}\begin{pmatrix}A & 0 \\ 0 & S\end{pmatrix}\begin{pmatrix}I & A^{-1}B \\ 0 & I\end{pmatrix}$$

shows that $\det(M) = \det(A)\det(S)$ — the Schur complement determinant formula.
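A small numerical check of the Schur complement theorem and the determinant formula (the example blocks are arbitrary):

```python
import numpy as np

# Block matrix M = [[A, B], [B^T, C]] with A > 0.
A = np.array([[4.0, 1.0], [1.0, 3.0]])
B = np.array([[1.0], [0.5]])
C = np.array([[2.0]])
M = np.block([[A, B], [B.T, C]])

S = C - B.T @ np.linalg.solve(A, B)    # Schur complement of A in M

# Given A > 0: M is PSD iff S is PSD. Here both are in fact PD.
assert np.linalg.eigvalsh(M).min() > 0
assert np.linalg.eigvalsh(S).min() > 0

# det(M) = det(A) * det(S)
assert np.isclose(np.linalg.det(M), np.linalg.det(A) * np.linalg.det(S))
```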

Matrix Square Root and Powers

For $A \succeq 0$ with spectral decomposition $A = Q\Lambda Q^T$, define:

$$A^{1/2} = Q\Lambda^{1/2}Q^T, \qquad \Lambda^{1/2} = \operatorname{diag}(\sqrt{\lambda_1}, \ldots, \sqrt{\lambda_n}).$$

Properties: $(A^{1/2})^2 = A$, $A^{1/2} \succeq 0$, and $A^{1/2}$ is the unique PSD square root of $A$.
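The spectral construction of the square root takes only a few lines of NumPy (the example matrix is arbitrary):

```python
import numpy as np

A = np.array([[2.0, 1.0], [1.0, 2.0]])     # symmetric PD (eigenvalues 1 and 3)
w, Q = np.linalg.eigh(A)                   # A = Q diag(w) Q^T
A_half = Q @ np.diag(np.sqrt(w)) @ Q.T     # the unique PSD square root

assert np.allclose(A_half @ A_half, A)     # (A^{1/2})^2 = A
assert np.linalg.eigvalsh(A_half).min() > 0   # A^{1/2} is itself PD
```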

More generally, $f(A) = Qf(\Lambda)Q^T$ for any function $f$ applied entrywise to the eigenvalues. For a PD matrix, $A^{-1/2} = Q\Lambda^{-1/2}Q^T$ is the inverse square root, used in whitening transformations.

Whitening. Given covariance $\Sigma \succ 0$ and a data vector $\mathbf{x}$ with $\operatorname{Cov}(\mathbf{x}) = \Sigma$, the whitened vector $\mathbf{z} = \Sigma^{-1/2}\mathbf{x}$ has $\operatorname{Cov}(\mathbf{z}) = I$. Whitening is the preprocessing step that removes all linear correlations, mapping arbitrary Gaussians to standard Gaussians.
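Whitening can be sketched as follows, using synthetic Gaussian data (the sample size and covariance are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(2)
Sigma = np.array([[4.0, 2.0], [2.0, 3.0]])        # target covariance (PD)

# Sample correlated Gaussian data with covariance Sigma via the Cholesky factor.
Xw = rng.normal(size=(100_000, 2)) @ np.linalg.cholesky(Sigma).T

# Inverse square root via the spectral decomposition Sigma = Q diag(w) Q^T.
w, Q = np.linalg.eigh(Sigma)
Sigma_inv_sqrt = Q @ np.diag(w ** -0.5) @ Q.T

Z = Xw @ Sigma_inv_sqrt.T                         # whitened data
print(np.cov(Z.T))                                # approximately the identity
```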

Operations Preserving PSD-ness

The following operations map PSD matrices to PSD matrices:

| Operation | Result | Proof |
|---|---|---|
| $\alpha A + \beta B$ for $A, B \succeq 0$ and $\alpha, \beta \geq 0$ | $\succeq 0$ | Convex cone |
| $B^T A B$ for $A \succeq 0$ and any $B$ | $\succeq 0$ | Gram form |
| $A \circ B$ (Hadamard product) for $A, B \succeq 0$ | $\succeq 0$ | Schur product theorem |
| $e^A$ for $A \succeq 0$ | $\succ 0$ | $e^A = Q e^\Lambda Q^T$, all $e^{\lambda_i} > 0$ |
| $A^{-1}$ for $A \succ 0$ | $\succ 0$ | Eigenvalues $1/\lambda_i > 0$ |

Schur product theorem. If $A, B \succeq 0$, then the elementwise (Hadamard) product $A \circ B$ (with $(A \circ B)_{ij} = A_{ij}B_{ij}$) is also PSD. Corollary: the pointwise product of two positive definite kernels is a positive definite kernel.
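A quick numerical illustration of the Schur product theorem, with random PSD matrices built via the Gram form (the sizes and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
X, Y = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))
A, B = X.T @ X, Y.T @ Y            # PSD by the Gram construction

H = A * B                          # Hadamard (elementwise) product, NOT A @ B
# Schur product theorem: H is PSD whenever A and B are.
assert np.linalg.eigvalsh(H).min() > -1e-8
```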

Worked Example

Example 1: Sylvester's Criterion

Test $A = \begin{pmatrix}2 & 1 & 0 \\ 1 & 3 & 1 \\ 0 & 1 & 2\end{pmatrix}$ for positive definiteness.

Leading principal minors:

  • $k=1$: $\det(2) = 2 > 0$
  • $k=2$: $\det\begin{pmatrix}2 & 1 \\ 1 & 3\end{pmatrix} = 6 - 1 = 5 > 0$
  • $k=3$: $\det(A) = 2(6-1) - 1(2-0) + 0 = 10 - 2 = 8 > 0$

All leading principal minors are positive $\implies A \succ 0$ by Sylvester's criterion. The Cholesky factor is:

$$L = \begin{pmatrix}\sqrt{2} & 0 & 0 \\ 1/\sqrt{2} & \sqrt{5/2} & 0 \\ 0 & \sqrt{2/5} & \sqrt{8/5}\end{pmatrix}.$$
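The minors and the Cholesky factor above can be confirmed numerically:

```python
import numpy as np

A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])

# Leading principal minors: 2, 5, 8 -- all positive, so A is PD.
minors = [np.linalg.det(A[:k, :k]) for k in (1, 2, 3)]
print(minors)

L = np.linalg.cholesky(A)          # succeeds exactly because A is PD
assert np.allclose(L @ L.T, A)
print(L)                           # matches the closed-form factor above
```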

Example 2: Covariance Matrix and Its Inverse

If the $2 \times 2$ covariance matrix is $\Sigma = \begin{pmatrix}4 & 2 \\ 2 & 3\end{pmatrix}$, then $\Sigma \succ 0$ (eigenvalues $(7 \pm \sqrt{17})/2 \approx 5.56$ and $1.44$, both positive).

The precision matrix $\Sigma^{-1} = \frac{1}{8}\begin{pmatrix}3 & -2 \\ -2 & 4\end{pmatrix}$ is also PD (its eigenvalues are the reciprocals $1/1.44 \approx 0.69$ and $1/5.56 \approx 0.18$). The precision matrix encodes partial correlations: $(\Sigma^{-1})_{12} = -2/8 \neq 0$ means $x_1$ and $x_2$ are conditionally dependent (their partial correlation is nonzero).

The Cholesky factor of the precision matrix $\Omega = \Sigma^{-1}$ is used in sparse Gaussian graphical models (e.g., the graphical lasso), where sparsity of $\Omega$ encodes conditional independence structure.
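A sketch confirming the numbers above (the partial-correlation formula $-\Omega_{12}/\sqrt{\Omega_{11}\Omega_{22}}$ is the standard identity for Gaussian graphical models):

```python
import numpy as np

Sigma = np.array([[4.0, 2.0], [2.0, 3.0]])
print(np.linalg.eigvalsh(Sigma))         # ~ [1.44, 5.56] -> PD

Omega = np.linalg.inv(Sigma)             # precision matrix: (1/8) [[3, -2], [-2, 4]]
assert np.allclose(Omega, np.array([[3.0, -2.0], [-2.0, 4.0]]) / 8.0)

# Partial correlation between x1 and x2 from the precision matrix.
partial = -Omega[0, 1] / np.sqrt(Omega[0, 0] * Omega[1, 1])
print(partial)                           # nonzero -> conditionally dependent
```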

Example 3: Gaussian Conditioning via Schur Complement

Let $\begin{pmatrix}x \\ y\end{pmatrix} \sim \mathcal{N}\left(\begin{pmatrix}\mu_x \\ \mu_y\end{pmatrix}, \begin{pmatrix}\sigma_x^2 & \rho\sigma_x\sigma_y \\ \rho\sigma_x\sigma_y & \sigma_y^2\end{pmatrix}\right)$.

Given $x = x_0$, the conditional distribution of $y \mid x = x_0$ is:

$$y \mid x = x_0 \sim \mathcal{N}\!\left(\mu_y + \rho\frac{\sigma_y}{\sigma_x}(x_0 - \mu_x),\; \sigma_y^2(1-\rho^2)\right).$$

The conditional variance $\sigma_y^2(1-\rho^2)$ is the Schur complement of $\sigma_x^2$ in $\Sigma$:

$$\sigma_y^2 - \rho\sigma_x\sigma_y \cdot \frac{1}{\sigma_x^2} \cdot \rho\sigma_x\sigma_y = \sigma_y^2(1 - \rho^2) \geq 0.$$

The PSD-ness of the Schur complement is what guarantees the conditional variance is non-negative (and zero only when $|\rho| = 1$, i.e., a perfect linear relationship).
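The bivariate conditioning formula fits in a short helper (an illustrative function, with $\sigma_x = 2$, $\sigma_y = 1$, $\rho = 0.5$ as example values):

```python
import numpy as np

def condition_gaussian(mu, Sigma, x0):
    """Condition a bivariate Gaussian (x, y) on x = x0.

    Returns the mean and variance of y | x = x0.
    """
    mu_x, mu_y = mu
    sxx, sxy, syy = Sigma[0, 0], Sigma[0, 1], Sigma[1, 1]
    cond_mean = mu_y + sxy / sxx * (x0 - mu_x)
    cond_var = syy - sxy**2 / sxx      # Schur complement of sxx in Sigma
    return cond_mean, cond_var

# sigma_x = 2, sigma_y = 1, rho = 0.5  ->  cov(x, y) = 1
Sigma = np.array([[4.0, 1.0], [1.0, 1.0]])
m, v = condition_gaussian((0.0, 0.0), Sigma, x0=2.0)
print(m, v)   # mean = rho (sy/sx) x0 = 0.5, var = sy^2 (1 - rho^2) = 0.75
```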

Connections

Where Your Intuition Breaks

The claim "a matrix with all positive entries is positive definite" is false: these are entirely different properties. The matrix $A = \begin{pmatrix}1 & 2 \\ 2 & 1\end{pmatrix}$ has all positive entries, but $\det(A) = 1 - 4 = -3 < 0$, so it is indefinite — there exist directions where $\mathbf{x}^T A \mathbf{x} < 0$. Conversely, the matrix $\begin{pmatrix}1 & -0.9 \\ -0.9 & 1\end{pmatrix}$ has negative off-diagonal entries but is positive definite (eigenvalues 1.9 and 0.1). Positive definiteness is a property of the quadratic form the matrix defines, not of its individual entries. Always check eigenvalues or Sylvester's criterion, not entry signs.

PSD in Optimization: Second-Order Conditions

Theorem (Second-Order Optimality). For a twice-differentiable $f : \mathbb{R}^n \to \mathbb{R}$:

  • Necessary condition for a local minimum: $\nabla f(\mathbf{x}^*) = \mathbf{0}$ and $\nabla^2 f(\mathbf{x}^*) \succeq 0$
  • Sufficient condition for a strict local minimum: $\nabla f(\mathbf{x}^*) = \mathbf{0}$ and $\nabla^2 f(\mathbf{x}^*) \succ 0$
  • Convexity: $f$ is convex $\iff \nabla^2 f(\mathbf{x}) \succeq 0$ for all $\mathbf{x}$

The eigenvalues of the Hessian at a critical point determine its nature: all positive → minimum, all negative → maximum, mixed signs → saddle point.
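Classifying a critical point from Hessian eigenvalues, as a sketch (the helper name and tolerance are illustrative choices):

```python
import numpy as np

def classify_critical_point(H, tol=1e-10):
    """Classify a critical point from its symmetric Hessian H."""
    w = np.linalg.eigvalsh(H)          # sorted ascending
    if w[0] > tol:
        return "local minimum"
    if w[-1] < -tol:
        return "local maximum"
    if w[0] < -tol and w[-1] > tol:
        return "saddle point"
    return "degenerate (semidefinite Hessian; the test is inconclusive)"

# f(x, y) = x^2 - y^2 has Hessian diag(2, -2) at the origin: a saddle.
print(classify_critical_point(np.diag([2.0, -2.0])))   # saddle point
print(classify_critical_point(np.diag([2.0, 3.0])))    # local minimum
```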

💡Intuition

Saddle points dominate in high dimensions. For a random function of $n$ variables, a critical point has probability roughly $2^{-n}$ of being a local minimum (all $n$ Hessian eigenvalues must be positive). In high-dimensional loss landscapes (neural networks with millions of parameters), virtually all critical points are saddle points. This is one reason gradient descent in deep learning doesn't get stuck — it escapes saddles efficiently, often faster than theory predicts.

💡Intuition

PSD matrices as valid metrics. A PSD matrix $A$ defines a (possibly degenerate) inner product $\langle \mathbf{x}, \mathbf{y} \rangle_A = \mathbf{x}^T A \mathbf{y}$. When $A \succ 0$, this is a genuine inner product (positive definite). Mahalanobis distance uses the precision matrix as the metric: $d_\Sigma(\mathbf{x}, \mathbf{y})^2 = (\mathbf{x}-\mathbf{y})^T \Sigma^{-1} (\mathbf{x}-\mathbf{y})$. The condition $\Sigma \succ 0$ is what guarantees $d_\Sigma$ is a true metric.
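A sketch of the Mahalanobis metric with a PD precision matrix (the helper function and example covariance are illustrative):

```python
import numpy as np

Sigma = np.array([[4.0, 2.0], [2.0, 3.0]])   # PD covariance
Sigma_inv = np.linalg.inv(Sigma)             # PD precision

def mahalanobis(x, y, P):
    """Mahalanobis distance with precision matrix P = Sigma^{-1}."""
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.sqrt(d @ P @ d))

# PD precision -> genuine metric: zero iff x == y, positive otherwise.
assert mahalanobis([1, 0], [1, 0], Sigma_inv) == 0.0
assert mahalanobis([1, 0], [0, 1], Sigma_inv) > 0
```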

⚠️Warning

Near-singular covariance matrices. In practice, sample covariance matrices from high-dimensional data ($d > n$) are always singular (rank at most $n-1$): PSD but not positive definite. Regularization via $\hat{\Sigma}_\alpha = \hat{\Sigma} + \alpha I$ (diagonal loading, closely related to Ledoit-Wolf shrinkage) restores positive definiteness and enables stable Cholesky factorization. The choice of $\alpha$ trades off between the sample covariance and the identity — a bias-variance tradeoff in covariance estimation.
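Diagonal loading in a few lines (the dimensions and $\alpha$ are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 20, 50                      # fewer samples than dimensions: d > n
X = rng.normal(size=(n, d))
S = np.cov(X.T, bias=True)         # sample covariance, rank <= n - 1 < d

# The raw sample covariance is singular, so Cholesky fails.
try:
    np.linalg.cholesky(S)
except np.linalg.LinAlgError:
    print("raw sample covariance is singular")

# Diagonal loading restores strict positive definiteness.
alpha = 0.1
S_reg = S + alpha * np.eye(d)
L = np.linalg.cholesky(S_reg)      # now succeeds
assert np.linalg.eigvalsh(S_reg).min() >= alpha - 1e-6
```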

Semidefinite Programming (SDP)

A semidefinite program optimizes a linear objective over the PSD cone:

$$\min_{X \succeq 0} \operatorname{tr}(CX) \quad \text{subject to} \quad \operatorname{tr}(A_i X) = b_i, \quad i=1,\ldots,m.$$

SDPs generalize both linear programs (an LP is an SDP whose variable is a diagonal matrix, so the PSD constraint reduces to entrywise non-negativity) and second-order cone programs (SOCPs). They are solvable to any fixed accuracy in polynomial time via interior-point methods and arise in:

  • Relaxations of combinatorial problems (MAX-CUT via Goemans-Williamson SDP relaxation)
  • Sum-of-squares (SOS) proofs of polynomial nonnegativity
  • Controller design in control theory (linear matrix inequalities)
  • Metric learning (learning a PSD Mahalanobis metric from pairwise constraints)
  • Low-rank matrix completion (Netflix prize problem)
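Not a full SDP solver, but the basic primitive of handling the cone constraint — Euclidean projection onto $\mathbb{S}^n_+$ by eigenvalue clipping — is a few lines (the helper `project_psd` is an illustrative sketch, not taken from any particular solver):

```python
import numpy as np

def project_psd(M):
    """Frobenius-norm projection of a symmetric matrix onto the PSD cone.

    Clipping negative eigenvalues to zero gives the nearest PSD matrix.
    """
    M = (M + M.T) / 2.0                # symmetrize first
    w, Q = np.linalg.eigh(M)
    return Q @ np.diag(np.maximum(w, 0.0)) @ Q.T

M = np.array([[1.0, 2.0], [2.0, 1.0]])   # indefinite: eigenvalues 3 and -1
P = project_psd(M)
print(P)                                  # [[1.5, 1.5], [1.5, 1.5]]
assert np.linalg.eigvalsh(P).min() >= -1e-12
```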
