
Positive Semidefinite Matrices & Quadratic Forms

Positive semidefinite (PSD) matrices are the matrix analogue of non-negative numbers: they define valid inner products, encode covariances, kernels, and curvatures, and form a convex cone under matrix addition and positive scaling. The PSD condition is the central constraint in semidefinite programming, Gaussian process models, and the analysis of loss surfaces in optimization. Understanding the multiple equivalent characterizations of PSD-ness — via eigenvalues, Cholesky, principal minors, and Gram matrices — is essential for both theory and numerical practice.

Concepts

PSD Cone — trace-2 symmetric matrices (interactive diagram)

[Interactive figure: a point $(b, \Delta)$ selects the trace-2 symmetric matrix $A = \begin{pmatrix}1+\Delta & b \\ b & 1-\Delta\end{pmatrix}$; the display shows $A$, its eigenvalues $\lambda_1, \lambda_2$, $\det(A)$, $\operatorname{tr}(A) = 2$ (fixed), and the Cholesky factor $A = LL^T$. The green disk is the PD cone cross-section (trace fixed = 2): drag inside for PD ($b^2 + \Delta^2 < 1$), on the boundary for PSD, outside for indefinite.]

A covariance matrix must not assign negative variance to any direction — that would be geometrically impossible. Positive semidefinite (PSD) matrices are the formal expression of this constraint: they are the matrix analogue of non-negative numbers, forming a convex cone. Every covariance matrix, kernel matrix, Gram matrix, and positive-curvature Hessian is PSD, and the theory of these objects hinges on this single constraint.

Definitions and Notation

A symmetric matrix $A \in \mathbb{R}^{n \times n}$ is:

| Name | Symbol | Condition |
|---|---|---|
| Positive definite | $A \succ 0$ | $\mathbf{x}^T A \mathbf{x} > 0$ for all $\mathbf{x} \neq \mathbf{0}$ |
| Positive semidefinite | $A \succeq 0$ | $\mathbf{x}^T A \mathbf{x} \geq 0$ for all $\mathbf{x}$ |
| Negative definite | $A \prec 0$ | $\mathbf{x}^T A \mathbf{x} < 0$ for all $\mathbf{x} \neq \mathbf{0}$ |
| Negative semidefinite | $A \preceq 0$ | $\mathbf{x}^T A \mathbf{x} \leq 0$ for all $\mathbf{x}$ |
| Indefinite | — | $\mathbf{x}^T A \mathbf{x}$ takes both positive and negative values |

The condition $\mathbf{x}^T A \mathbf{x} \geq 0$ for all $\mathbf{x}$ is exactly what ensures $A$ defines a valid squared length. A matrix violating it would assign negative "squared distance" to some direction — geometrically impossible, and numerically catastrophic (Cholesky factorization would fail, log-likelihoods would be undefined).

The ordering $A \preceq B$ (the Loewner order) means $B - A \succeq 0$. This is a partial order on symmetric matrices that is compatible with scalar ordering: $A \preceq B$ implies $\alpha A \preceq \alpha B$ for $\alpha \geq 0$, and $A \preceq B,\ B \preceq C \implies A \preceq C$.

Equivalent Characterizations

Theorem. For a symmetric $A \in \mathbb{R}^{n \times n}$, the following are equivalent:

  1. $A \succeq 0$ (definition)
  2. All eigenvalues $\lambda_i \geq 0$
  3. $A = B^T B$ for some $B \in \mathbb{R}^{k \times n}$ (Gram matrix form)
  4. All principal minors (not only the leading ones) are $\geq 0$
  5. A Cholesky factorization $A = LL^T$ exists (with possible zero diagonal entries in $L$)

For positive definite $A \succ 0$, all the inequalities become strict:

  • All $\lambda_i > 0$
  • All leading principal minors $\det(A_{1:k,1:k}) > 0$ (Sylvester's criterion)
  • The Cholesky factorization $A = LL^T$ with $L$ lower triangular and positive diagonal exists and is unique
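These characterizations are easy to check numerically. A minimal sketch using NumPy (the helper `classify_symmetric` and its tolerance are illustrative choices, not a standard API):

```python
import numpy as np

def classify_symmetric(A, tol=1e-10):
    """Classify a symmetric matrix by the signs of its eigenvalues."""
    w = np.linalg.eigvalsh(A)          # real eigenvalues, sorted ascending
    if w[0] > tol:
        return "positive definite"
    if w[0] >= -tol:
        return "positive semidefinite"
    if w[-1] < -tol:
        return "negative definite"
    if w[-1] <= tol:
        return "negative semidefinite"
    return "indefinite"

A = np.array([[2.0, 1.0], [1.0, 3.0]])
print(classify_symmetric(A))           # positive definite

# For PD matrices, Cholesky succeeds; for indefinite ones it raises LinAlgError.
L = np.linalg.cholesky(A)
assert np.allclose(L @ L.T, A)
```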

The PSD Cone

The set of $n \times n$ PSD matrices, denoted $\mathbb{S}^n_+$, is a convex cone:

  • $A, B \succeq 0 \implies \alpha A + \beta B \succeq 0$ for $\alpha, \beta \geq 0$
  • The set is closed and convex

For $n = 2$, a trace-2 symmetric matrix has the form $A = \begin{pmatrix}1+\Delta & b \\ b & 1-\Delta\end{pmatrix}$. It is PD iff $\det(A) = 1 - \Delta^2 - b^2 > 0$, i.e., $b^2 + \Delta^2 < 1$ — the PD cone cross-section (at fixed trace) is a disk (illustrated in the diagram at the top of this lesson).
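The disk condition can be checked directly; a quick numerical sketch (the helper `trace2_matrix` is purely illustrative):

```python
import numpy as np

def trace2_matrix(b, delta):
    """Trace-2 symmetric matrix parameterized by the point (b, delta)."""
    return np.array([[1.0 + delta, b], [b, 1.0 - delta]])

# Inside the unit disk (b^2 + delta^2 < 1): positive definite.
assert np.linalg.eigvalsh(trace2_matrix(0.3, 0.4)).min() > 0
# Outside the disk: determinant is negative, so the matrix is indefinite.
assert np.linalg.eigvalsh(trace2_matrix(0.8, 0.7)).min() < 0
```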

Gram Matrices and Kernel Matrices

Gram matrix. Given vectors $\mathbf{x}_1, \ldots, \mathbf{x}_n \in \mathbb{R}^d$, the Gram matrix $G \in \mathbb{R}^{n \times n}$ with $G_{ij} = \langle \mathbf{x}_i, \mathbf{x}_j \rangle$ is PSD. Proof: for any $\mathbf{c} \in \mathbb{R}^n$,

$$\mathbf{c}^T G \mathbf{c} = \sum_{i,j} c_i c_j \langle \mathbf{x}_i, \mathbf{x}_j \rangle = \left\langle \sum_i c_i \mathbf{x}_i, \sum_j c_j \mathbf{x}_j \right\rangle = \left\| \sum_i c_i \mathbf{x}_i \right\|^2 \geq 0.$$
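The Gram construction can be verified numerically; a minimal NumPy sketch with random data (the data and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))   # 5 vectors in R^3, one per row
G = X @ X.T                   # Gram matrix: G_ij = <x_i, x_j>, shape (5, 5)

# Every quadratic form c^T G c equals ||sum_i c_i x_i||^2 >= 0.
c = rng.normal(size=5)
quad = c @ G @ c
assert np.isclose(quad, np.linalg.norm(X.T @ c) ** 2)
assert quad >= 0

# Equivalently, the smallest eigenvalue is >= 0 (here rank <= 3, so some are ~0).
print(np.linalg.eigvalsh(G).min())
```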

Kernel matrix. A function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a positive definite kernel if for all $n$ and all $\{x_1, \ldots, x_n\} \subset \mathcal{X}$, the kernel (Gram) matrix $K_{ij} = k(x_i, x_j)$ is PSD. This is Mercer's condition. Examples:

| Kernel | Formula $k(x,y)$ | Notes |
|---|---|---|
| Linear | $x^T y$ | Gram matrix of the data |
| Polynomial | $(x^T y + c)^d$ | $c \geq 0$, $d \in \mathbb{Z}^+$ |
| Gaussian RBF | $\exp(-\lVert x-y\rVert^2 / 2\ell^2)$ | $\ell > 0$ length scale |
| Laplace | $\exp(-\lVert x-y\rVert / \ell)$ | Rougher than RBF |
| Matérn | (various) | Parameterized smoothness |
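As a sketch of Mercer's condition in practice, one can build an RBF kernel matrix and inspect its eigenvalues (the helper `rbf_kernel` is hand-rolled for illustration, not a library function):

```python
import numpy as np

def rbf_kernel(X, lengthscale=1.0):
    """Gaussian RBF kernel matrix: K_ij = exp(-||x_i - x_j||^2 / (2 l^2))."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T   # pairwise squared distances
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * lengthscale**2))

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 2))
K = rbf_kernel(X)

# Mercer's condition: the kernel matrix is PSD for any point set.
assert np.linalg.eigvalsh(K).min() > -1e-10
assert np.allclose(np.diag(K), 1.0)   # k(x, x) = 1 for the RBF kernel
```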

The Schur Complement

For a block symmetric matrix $M = \begin{pmatrix}A & B \\ B^T & C\end{pmatrix}$ with $A \succ 0$:

Theorem. $M \succeq 0 \iff C - B^T A^{-1} B \succeq 0$.

The matrix $S = C - B^T A^{-1} B$ is the Schur complement of $A$ in $M$.

Why it matters: The Schur complement arises in Gaussian conditioning. If $\begin{pmatrix}\mathbf{x}_1 \\ \mathbf{x}_2\end{pmatrix} \sim \mathcal{N}\left(\mathbf{0}, \begin{pmatrix}\Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22}\end{pmatrix}\right)$, then the conditional covariance of $\mathbf{x}_2 \mid \mathbf{x}_1$ is exactly the Schur complement $\Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12}$ — the PSD Schur complement theorem guarantees this is a valid (PSD) covariance.

Completing the square. The block factorization

$$\begin{pmatrix}A & B \\ B^T & C\end{pmatrix} = \begin{pmatrix}I & 0 \\ B^T A^{-1} & I\end{pmatrix}\begin{pmatrix}A & 0 \\ 0 & S\end{pmatrix}\begin{pmatrix}I & A^{-1}B \\ 0 & I\end{pmatrix}$$

shows that $\det(M) = \det(A)\det(S)$ — the Schur complement determinant formula.
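A small numerical check of the Schur complement theorem and the determinant formula (the example blocks are arbitrary):

```python
import numpy as np

# Block matrix M = [[A, B], [B^T, C]] with A > 0.
A = np.array([[4.0, 1.0], [1.0, 3.0]])
B = np.array([[1.0], [0.5]])
C = np.array([[2.0]])
M = np.block([[A, B], [B.T, C]])

S = C - B.T @ np.linalg.solve(A, B)    # Schur complement of A in M

# Given A > 0: M is PSD iff S is PSD. Here both are in fact PD.
assert np.linalg.eigvalsh(M).min() > 0
assert np.linalg.eigvalsh(S).min() > 0

# det(M) = det(A) * det(S)
assert np.isclose(np.linalg.det(M), np.linalg.det(A) * np.linalg.det(S))
```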

Matrix Square Root and Powers

For $A \succeq 0$ with spectral decomposition $A = Q\Lambda Q^T$, define:

$$A^{1/2} = Q\Lambda^{1/2}Q^T, \qquad \Lambda^{1/2} = \operatorname{diag}(\sqrt{\lambda_1}, \ldots, \sqrt{\lambda_n}).$$

Properties: $(A^{1/2})^2 = A$, $A^{1/2} \succeq 0$, and $A^{1/2}$ is the unique PSD square root of $A$.
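The spectral construction of the square root takes only a few lines of NumPy (the example matrix is arbitrary):

```python
import numpy as np

A = np.array([[2.0, 1.0], [1.0, 2.0]])     # symmetric PD (eigenvalues 1 and 3)
w, Q = np.linalg.eigh(A)                   # A = Q diag(w) Q^T
A_half = Q @ np.diag(np.sqrt(w)) @ Q.T     # the unique PSD square root

assert np.allclose(A_half @ A_half, A)     # (A^{1/2})^2 = A
assert np.linalg.eigvalsh(A_half).min() > 0   # A^{1/2} is itself PD
```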

More generally, $f(A) = Qf(\Lambda)Q^T$ for any function $f$ applied entrywise to the eigenvalues. For a PD matrix, $A^{-1/2} = Q\Lambda^{-1/2}Q^T$ is the inverse square root, used in whitening transformations.

Whitening. Given covariance $\Sigma \succ 0$ and a data vector $\mathbf{x}$ with $\operatorname{Cov}(\mathbf{x}) = \Sigma$, the whitened vector $\mathbf{z} = \Sigma^{-1/2}\mathbf{x}$ has $\operatorname{Cov}(\mathbf{z}) = I$. Whitening is the preprocessing step that removes all linear correlations, mapping arbitrary Gaussians to standard Gaussians.
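Whitening can be sketched as follows, using synthetic Gaussian data (the sample size and covariance are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(2)
Sigma = np.array([[4.0, 2.0], [2.0, 3.0]])        # target covariance (PD)

# Sample correlated Gaussian data with covariance Sigma via the Cholesky factor.
Xw = rng.normal(size=(100_000, 2)) @ np.linalg.cholesky(Sigma).T

# Inverse square root via the spectral decomposition Sigma = Q diag(w) Q^T.
w, Q = np.linalg.eigh(Sigma)
Sigma_inv_sqrt = Q @ np.diag(w ** -0.5) @ Q.T

Z = Xw @ Sigma_inv_sqrt.T                         # whitened data
print(np.cov(Z.T))                                # approximately the identity
```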

Operations Preserving PSD-ness

The following operations map PSD matrices to PSD matrices:

| Operation | Result | Proof |
|---|---|---|
| $\alpha A + \beta B$ for $A, B \succeq 0$ and $\alpha, \beta \geq 0$ | $\succeq 0$ | Convex cone |
| $B^T A B$ for $A \succeq 0$ and any $B$ | $\succeq 0$ | Gram form |
| $A \circ B$ (Hadamard product) for $A, B \succeq 0$ | $\succeq 0$ | Schur product theorem |
| $e^A$ for $A \succeq 0$ | $\succ 0$ | $e^A = Q e^\Lambda Q^T$, all $e^{\lambda_i} > 0$ |
| $A^{-1}$ for $A \succ 0$ | $\succ 0$ | Eigenvalues $1/\lambda_i > 0$ |

Schur product theorem. If $A, B \succeq 0$, then the elementwise (Hadamard) product $A \circ B$ (with $(A \circ B)_{ij} = A_{ij}B_{ij}$) is also PSD. Corollary: the pointwise product of two positive definite kernels is a positive definite kernel.
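A quick numerical illustration of the Schur product theorem, with random PSD matrices built via the Gram form (the sizes and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
X, Y = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))
A, B = X.T @ X, Y.T @ Y            # PSD by the Gram construction

H = A * B                          # Hadamard (elementwise) product, NOT A @ B
# Schur product theorem: H is PSD whenever A and B are.
assert np.linalg.eigvalsh(H).min() > -1e-8
```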

Worked Example

Example 1: Sylvester's Criterion

Test $A = \begin{pmatrix}2 & 1 & 0 \\ 1 & 3 & 1 \\ 0 & 1 & 2\end{pmatrix}$ for positive definiteness.

Leading principal minors:

  • $k=1$: $\det(2) = 2 > 0$
  • $k=2$: $\det\begin{pmatrix}2 & 1 \\ 1 & 3\end{pmatrix} = 6 - 1 = 5 > 0$
  • $k=3$: $\det(A) = 2(6-1) - 1(2-0) + 0 = 10 - 2 = 8 > 0$

All leading principal minors are positive $\implies A \succ 0$ by Sylvester's criterion. The Cholesky factor is:

$$L = \begin{pmatrix}\sqrt{2} & 0 & 0 \\ 1/\sqrt{2} & \sqrt{5/2} & 0 \\ 0 & \sqrt{2/5} & \sqrt{8/5}\end{pmatrix}.$$
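The minors and the Cholesky factor above can be confirmed numerically:

```python
import numpy as np

A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])

# Leading principal minors: 2, 5, 8 -- all positive, so A is PD.
minors = [np.linalg.det(A[:k, :k]) for k in (1, 2, 3)]
print(minors)

L = np.linalg.cholesky(A)          # succeeds exactly because A is PD
assert np.allclose(L @ L.T, A)
print(L)                           # matches the closed-form factor above
```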

Example 2: Covariance Matrix and Its Inverse

If the $2 \times 2$ covariance matrix is $\Sigma = \begin{pmatrix}4 & 2 \\ 2 & 3\end{pmatrix}$, then $\Sigma \succ 0$ (eigenvalues $(7 \pm \sqrt{17})/2 \approx 5.56$ and $1.44$, both positive).

The precision matrix $\Sigma^{-1} = \frac{1}{8}\begin{pmatrix}3 & -2 \\ -2 & 4\end{pmatrix}$ is also PD (its eigenvalues are the reciprocals $1/1.44 \approx 0.69$ and $1/5.56 \approx 0.18$). The precision matrix encodes partial correlations: $(\Sigma^{-1})_{12} = -2/8 \neq 0$ means $x_1$ and $x_2$ are conditionally dependent (their partial correlation is nonzero).

The Cholesky factor of the precision matrix $\Omega = \Sigma^{-1}$ is used in sparse Gaussian graphical models (e.g., the graphical lasso), where sparsity of $\Omega$ encodes conditional independence structure.
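A sketch confirming the numbers above (the partial-correlation formula $-\Omega_{12}/\sqrt{\Omega_{11}\Omega_{22}}$ is the standard identity for Gaussian graphical models):

```python
import numpy as np

Sigma = np.array([[4.0, 2.0], [2.0, 3.0]])
print(np.linalg.eigvalsh(Sigma))         # ~ [1.44, 5.56] -> PD

Omega = np.linalg.inv(Sigma)             # precision matrix: (1/8) [[3, -2], [-2, 4]]
assert np.allclose(Omega, np.array([[3.0, -2.0], [-2.0, 4.0]]) / 8.0)

# Partial correlation between x1 and x2 from the precision matrix.
partial = -Omega[0, 1] / np.sqrt(Omega[0, 0] * Omega[1, 1])
print(partial)                           # nonzero -> conditionally dependent
```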

Example 3: Gaussian Conditioning via Schur Complement

Let $\begin{pmatrix}x \\ y\end{pmatrix} \sim \mathcal{N}\left(\begin{pmatrix}\mu_x \\ \mu_y\end{pmatrix}, \begin{pmatrix}\sigma_x^2 & \rho\sigma_x\sigma_y \\ \rho\sigma_x\sigma_y & \sigma_y^2\end{pmatrix}\right)$.

Given $x = x_0$, the conditional distribution of $y \mid x = x_0$ is:

$$y \mid x = x_0 \sim \mathcal{N}\!\left(\mu_y + \rho\frac{\sigma_y}{\sigma_x}(x_0 - \mu_x),\; \sigma_y^2(1-\rho^2)\right).$$

The conditional variance $\sigma_y^2(1-\rho^2)$ is the Schur complement of $\sigma_x^2$ in $\Sigma$:

$$\sigma_y^2 - \rho\sigma_x\sigma_y \cdot \frac{1}{\sigma_x^2} \cdot \rho\sigma_x\sigma_y = \sigma_y^2(1 - \rho^2) \geq 0.$$

The PSD-ness of the Schur complement is what guarantees the conditional variance is non-negative (and zero only when $|\rho| = 1$, i.e., a perfect linear relationship).
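The bivariate conditioning formula fits in a short helper (an illustrative function, with $\sigma_x = 2$, $\sigma_y = 1$, $\rho = 0.5$ as example values):

```python
import numpy as np

def condition_gaussian(mu, Sigma, x0):
    """Condition a bivariate Gaussian (x, y) on x = x0.

    Returns the mean and variance of y | x = x0.
    """
    mu_x, mu_y = mu
    sxx, sxy, syy = Sigma[0, 0], Sigma[0, 1], Sigma[1, 1]
    cond_mean = mu_y + sxy / sxx * (x0 - mu_x)
    cond_var = syy - sxy**2 / sxx      # Schur complement of sxx in Sigma
    return cond_mean, cond_var

# sigma_x = 2, sigma_y = 1, rho = 0.5  ->  cov(x, y) = 1
Sigma = np.array([[4.0, 1.0], [1.0, 1.0]])
m, v = condition_gaussian((0.0, 0.0), Sigma, x0=2.0)
print(m, v)   # mean = rho (sy/sx) x0 = 0.5, var = sy^2 (1 - rho^2) = 0.75
```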

Connections

Where Your Intuition Breaks

The claim "a matrix with all positive entries is positive definite" is false: these are entirely different properties. The matrix $A = \begin{pmatrix}1 & 2 \\ 2 & 1\end{pmatrix}$ has all positive entries, but $\det(A) = 1 - 4 = -3 < 0$, so it is indefinite — there exist directions where $\mathbf{x}^T A \mathbf{x} < 0$. Conversely, the matrix $\begin{pmatrix}1 & -0.9 \\ -0.9 & 1\end{pmatrix}$ has negative off-diagonal entries but is positive definite (eigenvalues 1.9 and 0.1). Positive definiteness is a property of the quadratic form the matrix defines, not of its individual entries. Always check eigenvalues or Sylvester's criterion, not entry signs.

PSD in Optimization: Second-Order Conditions

Theorem (Second-Order Optimality). For a twice-differentiable $f : \mathbb{R}^n \to \mathbb{R}$:

  • Necessary condition for a local minimum: $\nabla f(\mathbf{x}^*) = \mathbf{0}$ and $\nabla^2 f(\mathbf{x}^*) \succeq 0$
  • Sufficient condition for a strict local minimum: $\nabla f(\mathbf{x}^*) = \mathbf{0}$ and $\nabla^2 f(\mathbf{x}^*) \succ 0$
  • Convexity: $f$ is convex $\iff \nabla^2 f(\mathbf{x}) \succeq 0$ for all $\mathbf{x}$

The eigenvalues of the Hessian at a critical point determine its nature: all positive → minimum, all negative → maximum, mixed signs → saddle point.
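Classifying a critical point from Hessian eigenvalues, as a sketch (the helper name and tolerance are illustrative choices):

```python
import numpy as np

def classify_critical_point(H, tol=1e-10):
    """Classify a critical point from its symmetric Hessian H."""
    w = np.linalg.eigvalsh(H)          # sorted ascending
    if w[0] > tol:
        return "local minimum"
    if w[-1] < -tol:
        return "local maximum"
    if w[0] < -tol and w[-1] > tol:
        return "saddle point"
    return "degenerate (semidefinite Hessian; the test is inconclusive)"

# f(x, y) = x^2 - y^2 has Hessian diag(2, -2) at the origin: a saddle.
print(classify_critical_point(np.diag([2.0, -2.0])))   # saddle point
print(classify_critical_point(np.diag([2.0, 3.0])))    # local minimum
```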

💡Intuition

Saddle points dominate in high dimensions. For a random function of $n$ variables, a critical point has probability roughly $2^{-n}$ of being a local minimum (all $n$ Hessian eigenvalues must be positive). In high-dimensional loss landscapes (neural networks with millions of parameters), virtually all critical points are saddle points. This is one reason gradient descent in deep learning doesn't get stuck — it escapes saddles efficiently, often faster than theory predicts.

💡Intuition

PSD matrices as valid metrics. A PSD matrix $A$ defines a (possibly degenerate) inner product $\langle \mathbf{x}, \mathbf{y} \rangle_A = \mathbf{x}^T A \mathbf{y}$. When $A \succ 0$, this is a genuine inner product (positive definite). Mahalanobis distance uses the precision matrix as the metric: $d_\Sigma(\mathbf{x}, \mathbf{y})^2 = (\mathbf{x}-\mathbf{y})^T \Sigma^{-1} (\mathbf{x}-\mathbf{y})$. The condition $\Sigma \succ 0$ is what guarantees $d_\Sigma$ is a true metric.
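A sketch of the Mahalanobis metric with a PD precision matrix (the helper function and example covariance are illustrative):

```python
import numpy as np

Sigma = np.array([[4.0, 2.0], [2.0, 3.0]])   # PD covariance
Sigma_inv = np.linalg.inv(Sigma)             # PD precision

def mahalanobis(x, y, P):
    """Mahalanobis distance with precision matrix P = Sigma^{-1}."""
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.sqrt(d @ P @ d))

# PD precision -> genuine metric: zero iff x == y, positive otherwise.
assert mahalanobis([1, 0], [1, 0], Sigma_inv) == 0.0
assert mahalanobis([1, 0], [0, 1], Sigma_inv) > 0
```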

⚠️Warning

Near-singular covariance matrices. In practice, sample covariance matrices from high-dimensional data ($d > n$) are always singular (rank at most $n-1$): PSD but not positive definite. Regularization via $\hat{\Sigma}_\alpha = \hat{\Sigma} + \alpha I$ (diagonal loading, closely related to Ledoit-Wolf shrinkage) restores positive definiteness and enables stable Cholesky factorization. The choice of $\alpha$ trades off between the sample covariance and the identity — a bias-variance tradeoff in covariance estimation.
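Diagonal loading in a few lines (the dimensions and $\alpha$ are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 20, 50                      # fewer samples than dimensions: d > n
X = rng.normal(size=(n, d))
S = np.cov(X.T, bias=True)         # sample covariance, rank <= n - 1 < d

# The raw sample covariance is singular, so Cholesky fails.
try:
    np.linalg.cholesky(S)
except np.linalg.LinAlgError:
    print("raw sample covariance is singular")

# Diagonal loading restores strict positive definiteness.
alpha = 0.1
S_reg = S + alpha * np.eye(d)
L = np.linalg.cholesky(S_reg)      # now succeeds
assert np.linalg.eigvalsh(S_reg).min() >= alpha - 1e-6
```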

Semidefinite Programming (SDP)

A semidefinite program optimizes a linear objective over the PSD cone:

$$\min_{X \succeq 0} \operatorname{tr}(CX) \quad \text{subject to} \quad \operatorname{tr}(A_i X) = b_i, \quad i=1,\ldots,m.$$

SDPs generalize both linear programs (an LP is an SDP whose variable is a diagonal matrix, so the PSD constraint reduces to entrywise non-negativity) and second-order cone programs (SOCPs). They are solvable to any fixed accuracy in polynomial time via interior-point methods and arise in:

  • Relaxations of combinatorial problems (MAX-CUT via Goemans-Williamson SDP relaxation)
  • Sum-of-squares (SOS) proofs of polynomial nonnegativity
  • Controller design in control theory (linear matrix inequalities)
  • Metric learning (learning a PSD Mahalanobis metric from pairwise constraints)
  • Low-rank matrix completion (Netflix prize problem)
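Not a full SDP solver, but the basic primitive of handling the cone constraint — Euclidean projection onto $\mathbb{S}^n_+$ by eigenvalue clipping — is a few lines (the helper `project_psd` is an illustrative sketch, not taken from any particular solver):

```python
import numpy as np

def project_psd(M):
    """Frobenius-norm projection of a symmetric matrix onto the PSD cone.

    Clipping negative eigenvalues to zero gives the nearest PSD matrix.
    """
    M = (M + M.T) / 2.0                # symmetrize first
    w, Q = np.linalg.eigh(M)
    return Q @ np.diag(np.maximum(w, 0.0)) @ Q.T

M = np.array([[1.0, 2.0], [2.0, 1.0]])   # indefinite: eigenvalues 3 and -1
P = project_psd(M)
print(P)                                  # [[1.5, 1.5], [1.5, 1.5]]
assert np.linalg.eigvalsh(P).min() >= -1e-12
```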
