
Eigenvalues, Eigenvectors & Diagonalization

Eigenvalues and eigenvectors reveal the special directions along which a linear map acts by pure scaling — no rotation, no shear, just stretch or compression. Diagonalization rewrites a matrix in terms of these invariant directions, transforming matrix powers into scalar powers and turning coupled differential equations into independent ones. In machine learning, PCA, spectral clustering, PageRank, and the analysis of gradient flow in neural networks all reduce, at their core, to an eigenvalue problem.

Concepts

Eigenvalue Geometry — Unit Circle → Transformed Ellipse

For $A = \begin{pmatrix}2 & 1 \\ 1 & 2\end{pmatrix}$:

- $\lambda_1 = 3.000$, with $\mathbf{v}_1 = (-0.707, -0.707)$
- $\lambda_2 = 1.000$, with $\mathbf{v}_2 = (-0.707, 0.707)$
- $\operatorname{tr}(A) = 4.000 = \lambda_1 + \lambda_2$
- $\det(A) = 3.000 = \lambda_1 \cdot \lambda_2$
- Real distinct eigenvalues — matrix is diagonalizable over $\mathbb{R}$

The dashed circle is the input. The solid shape is the output after applying A. Eigenvectors (colored arrows) are the only directions that stay aligned — they scale but do not rotate.

When a guitar string resonates, it vibrates at specific natural frequencies — special modes where every point moves in sync, stretched and compressed without being twisted. Eigenvectors are the matrix analogue: the special directions that a linear map stretches or flips without rotating. Every covariance matrix, graph Laplacian, and Hessian in machine learning has these invariant directions, and understanding them is what makes PCA, spectral clustering, and optimization analysis tractable.

The Eigenvalue Equation

Let $A \in \mathbb{R}^{n \times n}$. A nonzero vector $\mathbf{v} \in \mathbb{R}^n$ is an eigenvector of $A$ with eigenvalue $\lambda \in \mathbb{C}$ if

$$A\mathbf{v} = \lambda \mathbf{v}.$$

Requiring $\mathbf{v} \neq \mathbf{0}$ and $A\mathbf{v} = \lambda\mathbf{v}$ is the only algebraic way to say "a direction that is not rotated." Any weaker condition — allowing $\mathbf{v} = \mathbf{0}$, or permitting a component orthogonal to $\mathbf{v}$ in the output — would fail to capture invariance under the map.

Geometrically: $\mathbf{v}$ is a fixed direction of $A$ — applying $A$ scales $\mathbf{v}$ by $\lambda$ but does not rotate it. Eigenvalues can be zero (the eigenvector is in the null space), negative (the direction flips), or complex (indicating rotation when $A$ is real but non-symmetric).

Non-example. For $A = \begin{pmatrix}0 & -1 \\ 1 & 0\end{pmatrix}$ (90° rotation), no real vector satisfies $A\mathbf{v} = \lambda \mathbf{v}$ — every vector is rotated, so there are no real invariant directions.
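Both claims are easy to verify numerically. The sketch below (NumPy, using the symmetric matrix from the figure above as a hand-picked example) checks $A\mathbf{v} = \lambda\mathbf{v}$ along each eigen-direction, and confirms that the rotation non-example has no real eigenvalues:

```python
import numpy as np

# A symmetric example: A v = lambda v holds along the eigen-directions.
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
lam, V = np.linalg.eig(A)          # columns of V are eigenvectors

for i in range(2):
    v = V[:, i]
    # A v should equal lambda * v up to floating-point error
    assert np.allclose(A @ v, lam[i] * v)

# Non-example: the 90-degree rotation has no real eigenvalues.
R = np.array([[0.0, -1.0],
              [1.0,  0.0]])
lam_R = np.linalg.eigvals(R)
print(lam_R)                        # purely imaginary pair ±1j
```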

The Characteristic Polynomial

Rearranging the eigenvalue equation:

$$A\mathbf{v} = \lambda \mathbf{v} \iff (A - \lambda I)\mathbf{v} = \mathbf{0}.$$

For a nonzero solution to exist, $A - \lambda I$ must be singular:

$$\det(A - \lambda I) = 0.$$

The function $p(\lambda) = \det(A - \lambda I)$ is the characteristic polynomial of $A$, a degree-$n$ polynomial in $\lambda$. Its roots are the eigenvalues of $A$.

For $2 \times 2$ matrices, if $A = \begin{pmatrix}a & b \\ c & d\end{pmatrix}$:

$$p(\lambda) = \lambda^2 - \underbrace{(a+d)}_{\operatorname{tr}(A)}\lambda + \underbrace{(ad - bc)}_{\det(A)} = 0.$$

This gives the quadratic formula:

$$\lambda = \frac{\operatorname{tr}(A) \pm \sqrt{\operatorname{tr}(A)^2 - 4\det(A)}}{2}.$$

Key identities (true for any $n \times n$ matrix):

$$\operatorname{tr}(A) = \sum_{i=1}^n \lambda_i, \qquad \det(A) = \prod_{i=1}^n \lambda_i.$$

These hold even when eigenvalues are complex. They give cheap sanity checks after computing eigenvalues.
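The $2 \times 2$ quadratic formula and the trace/determinant checks translate directly into code. A minimal sketch (the helper `eig2x2` is ours, not a library function):

```python
import numpy as np

def eig2x2(A):
    """Eigenvalues of a 2x2 matrix via the trace/determinant quadratic."""
    tr, det = np.trace(A), np.linalg.det(A)
    disc = np.sqrt(complex(tr**2 - 4 * det))   # complex sqrt handles rotations
    return (tr + disc) / 2, (tr - disc) / 2

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
l1, l2 = eig2x2(A)
print(l1, l2)                      # 3 and 1 (with zero imaginary parts)

# Cheap sanity checks: trace = sum of eigenvalues, det = product.
assert np.isclose(np.trace(A), (l1 + l2).real)
assert np.isclose(np.linalg.det(A), (l1 * l2).real)
```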

Algebraic and Geometric Multiplicity

Once an eigenvalue $\lambda_0$ is found, its eigenvectors span the eigenspace:

$$E_{\lambda_0} = \ker(A - \lambda_0 I) = \{\mathbf{v} : A\mathbf{v} = \lambda_0 \mathbf{v}\}.$$

Two notions of "how many times" an eigenvalue appears:

| Concept | Definition | Notation |
| --- | --- | --- |
| Algebraic multiplicity | Power of $(\lambda - \lambda_0)$ as a factor of $p(\lambda)$ | $a(\lambda_0)$ |
| Geometric multiplicity | $\dim \ker(A - \lambda_0 I)$ — dimension of the eigenspace | $g(\lambda_0)$ |

Always $1 \leq g(\lambda_0) \leq a(\lambda_0)$. When $g(\lambda_0) < a(\lambda_0)$, the eigenvalue is defective and the matrix is not diagonalizable.

Example. The matrix $A = \begin{pmatrix}2 & 1 \\ 0 & 2\end{pmatrix}$ has characteristic polynomial $(\lambda-2)^2$, so $a(2) = 2$. But $\ker(A - 2I) = \ker\begin{pmatrix}0 & 1 \\ 0 & 0\end{pmatrix} = \operatorname{span}\{(1, 0)^T\}$, so $g(2) = 1 < 2$. Defective — not diagonalizable.
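The geometric multiplicity can be computed as $n - \operatorname{rank}(A - \lambda_0 I)$ by the rank-nullity theorem. A quick check on the Jordan-block example above:

```python
import numpy as np

# Jordan block: characteristic polynomial (lambda - 2)^2, so a(2) = 2.
A = np.array([[2.0, 1.0],
              [0.0, 2.0]])

# Geometric multiplicity = dim ker(A - 2I) = n - rank(A - 2I).
g = 2 - np.linalg.matrix_rank(A - 2 * np.eye(2))
print(g)                           # 1 < 2: defective, not diagonalizable
```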

Diagonalization

A matrix $A \in \mathbb{R}^{n \times n}$ is diagonalizable if there exists an invertible $P$ and diagonal $D$ such that

$$A = PDP^{-1}, \qquad D = \operatorname{diag}(\lambda_1, \ldots, \lambda_n).$$

The columns of $P$ are linearly independent eigenvectors of $A$; the diagonal entries of $D$ are the corresponding eigenvalues.

Theorem (Diagonalizability Criterion). $A$ is diagonalizable over $\mathbb{C}$ if and only if $g(\lambda_i) = a(\lambda_i)$ for every eigenvalue $\lambda_i$.

Sufficient condition. If $A$ has $n$ distinct eigenvalues, it is diagonalizable (eigenvectors corresponding to distinct eigenvalues are automatically linearly independent).

Change of basis interpretation. In the eigenbasis, $A$ acts simply as scaling:

$$D = P^{-1}AP \implies D\mathbf{e}_i = \lambda_i \mathbf{e}_i.$$
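Both directions of the change of basis can be verified numerically. A minimal sketch, reusing the symmetric example from the figure:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
lam, P = np.linalg.eig(A)
D = np.diag(lam)

# A = P D P^{-1}: change to the eigenbasis, scale, change back.
assert np.allclose(P @ D @ np.linalg.inv(P), A)

# In the eigenbasis, A acts as pure scaling: P^{-1} A P is diagonal.
D_check = np.linalg.inv(P) @ A @ P
assert np.allclose(D_check, D)
```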

Powers and Functions of Matrices

Diagonalization makes matrix powers trivial:

Ak=PDkP1,Dk=diag(λ1k,,λnk).A^k = PD^kP^{-1}, \qquad D^k = \operatorname{diag}(\lambda_1^k, \ldots, \lambda_n^k).

This is the key that unlocks recurrences. The Fibonacci sequence $F_{n+1} = F_n + F_{n-1}$ encodes as a matrix power, and diagonalization yields the closed-form Binet formula.
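The Fibonacci encoding uses the standard companion matrix $M = \begin{pmatrix}1 & 1 \\ 1 & 0\end{pmatrix}$, whose eigenvalues are the golden ratio and its conjugate; its $(0,1)$ entry after $n$ multiplications is $F_n$. A sketch of the diagonalization route:

```python
import numpy as np

# Fibonacci as a matrix power: M = [[1, 1], [1, 0]] satisfies
# M^n = [[F_{n+1}, F_n], [F_n, F_{n-1}]].
M = np.array([[1.0, 1.0],
              [1.0, 0.0]])
lam, P = np.linalg.eig(M)          # lam holds the golden ratio and its conjugate

def fib(n):
    # M^n = P diag(lam^n) P^{-1}; entry (0, 1) is F_n (Binet in disguise)
    Mn = P @ np.diag(lam**n) @ np.linalg.inv(P)
    return round(Mn[0, 1])

print([fib(n) for n in range(1, 11)])   # [1, 1, 2, 3, 5, 8, 13, 21, 34, 55]
```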

Matrix exponential. For the differential equation $\dot{\mathbf{x}} = A\mathbf{x}$, the solution is $\mathbf{x}(t) = e^{At}\mathbf{x}_0$ where

$$e^{At} = Pe^{Dt}P^{-1}, \qquad e^{Dt} = \operatorname{diag}(e^{\lambda_1 t}, \ldots, e^{\lambda_n t}).$$

Stability of the system is determined entirely by the signs of $\operatorname{Re}(\lambda_i)$: if all real parts are negative, the system decays to zero.
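The eigendecomposition formula for $e^{At}$ can be implemented directly (for a diagonalizable $A$; the matrix below is a hand-picked stable example with eigenvalues $-0.5$ and $-1.5$):

```python
import numpy as np

# Solve x' = A x via the eigendecomposition: e^{At} = P e^{Dt} P^{-1}.
A = np.array([[-1.0,  0.5],
              [ 0.5, -1.0]])       # symmetric, eigenvalues -0.5 and -1.5
lam, P = np.linalg.eig(A)
assert np.all(lam.real < 0)        # all real parts negative -> stable

def expAt(t):
    return P @ np.diag(np.exp(lam * t)) @ np.linalg.inv(P)

x0 = np.array([1.0, -2.0])
x_late = expAt(10.0) @ x0
print(np.linalg.norm(x_late))      # tiny: the trajectory decays to zero
```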

Eigenvalues of Symmetric Matrices

Real symmetric matrices ($A = A^T$) enjoy exceptional properties:

Theorem (Spectral Theorem for symmetric matrices). If $A \in \mathbb{R}^{n \times n}$ is symmetric, then:

  1. All eigenvalues of $A$ are real.
  2. Eigenvectors corresponding to distinct eigenvalues are orthogonal.
  3. $A$ is orthogonally diagonalizable: $A = Q\Lambda Q^T$ where $Q$ is orthogonal ($Q^TQ = I$) and $\Lambda$ is diagonal with real entries.

Proof sketch of (1). Suppose $A\mathbf{v} = \lambda\mathbf{v}$ with $\mathbf{v} \neq \mathbf{0}$, working over $\mathbb{C}$. Then
$$\bar{\lambda}\|\mathbf{v}\|^2 = \bar{\mathbf{v}}^T A^T \mathbf{v} = \bar{\mathbf{v}}^T A \mathbf{v} = \lambda\|\mathbf{v}\|^2,$$
so $\bar{\lambda} = \lambda$, meaning $\lambda \in \mathbb{R}$.

Proof sketch of (2). Let $A\mathbf{u} = \lambda\mathbf{u}$, $A\mathbf{v} = \mu\mathbf{v}$, $\lambda \neq \mu$:
$$\lambda \langle \mathbf{u}, \mathbf{v} \rangle = \langle A\mathbf{u}, \mathbf{v} \rangle = \langle \mathbf{u}, A^T\mathbf{v} \rangle = \langle \mathbf{u}, A\mathbf{v} \rangle = \mu \langle \mathbf{u}, \mathbf{v} \rangle.$$
Since $\lambda \neq \mu$, we get $\langle \mathbf{u}, \mathbf{v} \rangle = 0$.

The Spectral Theorem is foundational for PCA, kernel methods, and the analysis of graph Laplacians.
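All three conclusions of the Spectral Theorem can be observed numerically. In NumPy, `np.linalg.eigh` is the routine specialized for symmetric/Hermitian matrices (a random symmetrized matrix serves as the example here):

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
A = (B + B.T) / 2                  # symmetrize: A = A^T

# eigh exploits symmetry: real eigenvalues (ascending), orthonormal eigenvectors.
lam, Q = np.linalg.eigh(A)
assert np.all(np.isreal(lam))                   # (1) real spectrum
assert np.allclose(Q.T @ Q, np.eye(4))          # (2)-(3) Q is orthogonal
assert np.allclose(Q @ np.diag(lam) @ Q.T, A)   # (3) A = Q Lambda Q^T
```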

Rayleigh Quotient and Variational Characterization

For a symmetric matrix $A \in \mathbb{R}^{n \times n}$ with eigenvalues $\lambda_1 \leq \lambda_2 \leq \cdots \leq \lambda_n$, the Rayleigh quotient is

$$R(\mathbf{x}) = \frac{\mathbf{x}^T A \mathbf{x}}{\mathbf{x}^T \mathbf{x}}, \qquad \mathbf{x} \neq \mathbf{0}.$$

Theorem (Min-Max). The extremal eigenvalues satisfy:

$$\lambda_{\min} = \min_{\mathbf{x} \neq 0} R(\mathbf{x}), \qquad \lambda_{\max} = \max_{\mathbf{x} \neq 0} R(\mathbf{x}).$$

More generally (Courant-Fischer):

$$\lambda_k = \min_{\dim V = k}\ \max_{\mathbf{x} \in V,\, \mathbf{x} \neq 0} R(\mathbf{x}).$$

This variational characterization is the basis for PCA (maximize variance = maximize Rayleigh quotient of the covariance matrix) and spectral graph theory (the Fiedler vector minimizes the graph cut Rayleigh quotient).
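The min-max bounds are easy to see empirically: every Rayleigh quotient of a hand-picked symmetric matrix (eigenvalues 2 and 4) lands in $[\lambda_{\min}, \lambda_{\max}]$, and the eigenvectors attain the endpoints:

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 3.0]])         # eigenvalues 2 and 4

def rayleigh(A, x):
    return (x @ A @ x) / (x @ x)

# Random vectors: R(x) always lies in [lambda_min, lambda_max].
rng = np.random.default_rng(1)
vals = [rayleigh(A, rng.standard_normal(2)) for _ in range(1000)]
assert 2.0 - 1e-9 <= min(vals) and max(vals) <= 4.0 + 1e-9

# The top eigenvector (1, 1)/sqrt(2) attains the maximum.
v_max = np.array([1.0, 1.0]) / np.sqrt(2)
assert np.isclose(rayleigh(A, v_max), 4.0)
```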

Spectral Radius and Convergence

The spectral radius of $A$ is $\rho(A) = \max_i |\lambda_i|$. For any induced matrix norm:

$$\rho(A) \leq \|A\|.$$

Power iteration converges to the dominant eigenvector (the one belonging to the eigenvalue of largest magnitude) at rate $|\lambda_2/\lambda_1|$ per iteration. Convergence is slow when the top two eigenvalues are close in magnitude — this drives the need for deflation or Krylov subspace methods.
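Power iteration itself is a few lines: repeatedly apply $A$ and renormalize; the Rayleigh quotient of the limit estimates the dominant eigenvalue. A minimal sketch (the iteration count is a hand-picked assumption, generous for the example's $|\lambda_2/\lambda_1| = 1/3$):

```python
import numpy as np

def power_iteration(A, iters=200, seed=0):
    """Dominant eigenpair; error shrinks by ~|lambda_2/lambda_1| per step."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(A.shape[0])
    for _ in range(iters):
        v = A @ v
        v /= np.linalg.norm(v)     # renormalize to avoid overflow
    lam = v @ A @ v                # Rayleigh quotient estimate
    return lam, v

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])         # dominant eigenvalue 3, then 1
lam, v = power_iteration(A)
print(lam)                         # ~3.0
```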

For gradient descent on a quadratic $f(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T A \mathbf{x} - \mathbf{b}^T \mathbf{x}$ with $A \succ 0$:

$$\text{convergence rate} = \left(\frac{\kappa - 1}{\kappa + 1}\right)^2, \qquad \kappa = \frac{\lambda_{\max}}{\lambda_{\min}} = \kappa(A).$$

The condition number $\kappa(A)$ directly controls optimization speed — the eigenvalue spread of the Hessian determines whether gradient descent converges in a few steps or thousands.
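The effect is easy to demonstrate on two hand-picked diagonal quadratics, one well-conditioned ($\kappa = 2$) and one ill-conditioned ($\kappa = 100$), each run with the classical optimal fixed step size $2/(\lambda_{\min} + \lambda_{\max})$:

```python
import numpy as np

def gd_steps(A, b, lr, tol=1e-6, max_iter=100_000):
    """Steps gradient descent needs on f(x) = 0.5 x'Ax - b'x to reach tol."""
    x = np.zeros_like(b)
    for k in range(max_iter):
        grad = A @ x - b
        if np.linalg.norm(grad) < tol:
            return k
        x -= lr * grad
    return max_iter

b = np.array([1.0, 1.0])
well = np.diag([1.0, 2.0])         # kappa = 2
ill  = np.diag([1.0, 100.0])       # kappa = 100
# Optimal fixed step size is 2 / (lambda_min + lambda_max).
print(gd_steps(well, b, 2 / 3))    # converges in a handful of steps
print(gd_steps(ill, b, 2 / 101))   # hundreds of steps: kappa sets the rate
```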

Summary Table

| Property | Condition | Consequence |
| --- | --- | --- |
| Diagonalizable | $g(\lambda_i) = a(\lambda_i)$ for all $i$ | $A = PDP^{-1}$, $A^k = PD^kP^{-1}$ |
| Symmetric | $A = A^T$ | All $\lambda_i \in \mathbb{R}$, orthogonal eigenvectors |
| Positive definite | $A = A^T$, all $\lambda_i > 0$ | Unique minimum, stable gradient descent |
| Defective | $\exists i: g(\lambda_i) < a(\lambda_i)$ | No full eigenbasis; use Jordan form |
| Normal | $AA^T = A^TA$ | Unitarily diagonalizable over $\mathbb{C}$ |

Worked Example

Example 1: Full Diagonalization of a $2 \times 2$ Symmetric Matrix

Let $A = \begin{pmatrix}3 & 1 \\ 1 & 3\end{pmatrix}$.

Step 1: Characteristic polynomial.

$$\det(A - \lambda I) = (3-\lambda)^2 - 1 = \lambda^2 - 6\lambda + 8 = (\lambda - 4)(\lambda - 2) = 0.$$

Eigenvalues: $\lambda_1 = 4$, $\lambda_2 = 2$. Verify: $\operatorname{tr}(A) = 6 = 4 + 2$ ✓, $\det(A) = 8 = 4 \cdot 2$ ✓.

Step 2: Eigenvectors.

For $\lambda_1 = 4$: $(A - 4I)\mathbf{v} = \begin{pmatrix}-1 & 1 \\ 1 & -1\end{pmatrix}\mathbf{v} = \mathbf{0}$. Row reduce → $v_1 = v_2$. Eigenvector: $\mathbf{p}_1 = \frac{1}{\sqrt{2}}\begin{pmatrix}1 \\ 1\end{pmatrix}$.

For $\lambda_2 = 2$: $(A - 2I)\mathbf{v} = \begin{pmatrix}1 & 1 \\ 1 & 1\end{pmatrix}\mathbf{v} = \mathbf{0}$. Row reduce → $v_1 = -v_2$. Eigenvector: $\mathbf{p}_2 = \frac{1}{\sqrt{2}}\begin{pmatrix}1 \\ -1\end{pmatrix}$.

Verify orthogonality: $\mathbf{p}_1 \cdot \mathbf{p}_2 = \frac{1}{2}(1 \cdot 1 + 1 \cdot (-1)) = 0$ ✓.

Step 3: Write the decomposition.

$$A = Q\Lambda Q^T = \frac{1}{\sqrt{2}}\begin{pmatrix}1 & 1 \\ 1 & -1\end{pmatrix} \begin{pmatrix}4 & 0 \\ 0 & 2\end{pmatrix} \frac{1}{\sqrt{2}}\begin{pmatrix}1 & 1 \\ 1 & -1\end{pmatrix}.$$

Step 4: Compute $A^{10}$.

$$A^{10} = Q\Lambda^{10}Q^T = \frac{1}{2}\begin{pmatrix}1 & 1 \\ 1 & -1\end{pmatrix}\begin{pmatrix}4^{10} & 0 \\ 0 & 2^{10}\end{pmatrix}\begin{pmatrix}1 & 1 \\ 1 & -1\end{pmatrix} = \frac{1}{2}\begin{pmatrix}4^{10}+2^{10} & 4^{10}-2^{10} \\ 4^{10}-2^{10} & 4^{10}+2^{10}\end{pmatrix}.$$

Compare this to naively multiplying AA ten times — the eigenbasis makes it a one-liner.
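The closed form above can be checked against brute-force repeated multiplication:

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 3.0]])

# Closed form from the eigendecomposition: entries (4^10 ± 2^10) / 2.
s, d = 4**10 + 2**10, 4**10 - 2**10
A10_closed = 0.5 * np.array([[s, d],
                             [d, s]])

# Compare against naive repeated multiplication.
A10 = np.linalg.matrix_power(A, 10)
assert np.allclose(A10, A10_closed)
print(A10_closed[0, 0])            # 524800.0
```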

Example 2: Complex Eigenvalues — Rotation Matrix

Let $A = \begin{pmatrix}0 & -1 \\ 1 & 0\end{pmatrix}$ (rotation by 90°).

Characteristic polynomial: $\lambda^2 + 1 = 0 \implies \lambda = \pm i$.

Since both eigenvalues are purely imaginary, $A$ has no real eigenvectors. The rotation has no invariant direction over $\mathbb{R}$.

General rotation matrix. $R_\theta = \begin{pmatrix}\cos\theta & -\sin\theta \\ \sin\theta & \cos\theta\end{pmatrix}$ has eigenvalues $e^{\pm i\theta} = \cos\theta \pm i\sin\theta$. For $\theta \neq 0, \pi$, all eigenvalues are complex — consistent with the geometric fact that rotations (other than 0° and 180°) leave no real direction fixed.
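Numerically, the eigenvalues of any rotation land on the unit circle at angle $\pm\theta$ (a 60° rotation is used here as a hand-picked example):

```python
import numpy as np

theta = np.pi / 3                  # 60-degree rotation
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

lam = np.linalg.eigvals(R)
print(lam)                         # cos(theta) ± i sin(theta) = e^{±i theta}

# Both lie on the unit circle; the real part recovers cos(theta).
assert np.allclose(np.abs(lam), 1.0)
assert np.allclose(lam.real, np.cos(theta))
```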

Example 3: PCA as an Eigenvalue Problem

Given a data matrix $X \in \mathbb{R}^{n \times d}$ (centered, $n$ observations, $d$ features), the empirical covariance matrix is

$$C = \frac{1}{n-1}X^TX \in \mathbb{R}^{d \times d}.$$

$C$ is symmetric positive semidefinite, so its eigendecomposition is

$$C = Q\Lambda Q^T, \qquad \Lambda = \operatorname{diag}(\lambda_1, \ldots, \lambda_d), \quad \lambda_1 \geq \cdots \geq \lambda_d \geq 0.$$

The $k$-th column of $Q$ (i.e., the $k$-th eigenvector) is the $k$-th principal component — the direction of the $k$-th largest variance. The corresponding eigenvalue $\lambda_k$ is exactly that variance.

Why? For a unit vector $\mathbf{u}$, the variance of the projections $X\mathbf{u}$ is $\mathbf{u}^T C \mathbf{u}$ — the Rayleigh quotient of $C$. PCA maximizes this over $\mathbf{u}$, and by the Spectral Theorem's variational characterization, the maximizer is $\mathbf{q}_1$ (the top eigenvector).

Explained variance ratio. The fraction of total variance captured by the top $k$ components:

$$\text{EVR}(k) = \frac{\sum_{j=1}^k \lambda_j}{\sum_{j=1}^d \lambda_j} = \frac{\sum_{j=1}^k \lambda_j}{\operatorname{tr}(C)}.$$

A standard rule of thumb: keep components until EVR $\geq 0.90$.
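The whole pipeline — center, form $C$, eigendecompose, sort, compute EVR — fits in a few lines. A sketch on synthetic 2-D data deliberately stretched along the $(1, 1)$ direction (the data-generating choices are ours, for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 2-D data: std 3 vs 0.5, then rotated 45 degrees so the
# dominant variance direction is (1, 1)/sqrt(2).
X = rng.standard_normal((500, 2)) @ np.diag([3.0, 0.5])
X = X @ (np.array([[1.0, 1.0], [-1.0, 1.0]]) / np.sqrt(2))
X -= X.mean(axis=0)                # center before computing covariance

C = X.T @ X / (len(X) - 1)
lam, Q = np.linalg.eigh(C)         # eigh returns ascending order
lam, Q = lam[::-1], Q[:, ::-1]     # sort descending for PCA

evr = lam / lam.sum()
print(evr)                         # first component captures ~97% of variance
top_pc = Q[:, 0]
print(top_pc)                      # close to ±(1, 1)/sqrt(2)
```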

Connections

Where Your Intuition Breaks

The eigenvector with the largest eigenvalue is always the most important. This is true for variance maximization (PCA) but wrong for almost everything else. For stability analysis, large eigenvalues of the recurrent weight matrix cause exploding gradients — the dominant direction is the dangerous one, not the useful one. For optimization, a large eigenvalue of the Hessian means sharp curvature, which slows down gradient descent rather than speeding it up. And eigenvalues can be negative: a negative eigenvalue of the Hessian at a critical point means that direction is a descent direction — a saddle, not a minimum. The eigenvalue's sign and context determine its meaning, not its magnitude alone.

Eigenvalue Problem — Method Comparison

| Method | Time complexity | When to use |
| --- | --- | --- |
| Characteristic polynomial (exact) | $O(n^3)$ for $n \times n$ in general | $2 \times 2$, $3 \times 3$ by hand only |
| Power iteration | $O(n^2)$ per step, $k$ steps | Only largest eigenvalue; sparse $A$ |
| QR algorithm | $O(n^3)$ (dense); standard method | All eigenvalues of dense matrices |
| Lanczos / Arnoldi | $O(kn^2)$ for $k$ eigenvalues | Large sparse matrices (graphs, PCA on high-$d$ data) |
| Randomized SVD | $O(ndk)$ for rank-$k$ approximation | Very large matrices; PCA with $d, n \gg k$ |

The QR algorithm (Francis's implicitly shifted QR iteration) is the workhorse behind numpy.linalg.eig. For deep learning, the dominant eigenvalue of the Hessian (relevant for learning-rate selection and loss landscape analysis) is usually estimated via power iteration or Lanczos, never via full eigendecomposition.

Common Pitfalls

⚠️Warning

Numerical near-degeneracy. When two eigenvalues are very close ($|\lambda_i - \lambda_j| \ll \epsilon_{\text{machine}} \|A\|$), their eigenvectors are numerically ill-conditioned. Any small perturbation can rotate the eigenspace dramatically. Algorithms that return individual eigenvectors in this regime may be meaningless — report eigenspaces (i.e., spans) instead.

⚠️Warning

Defective matrices are measure-zero but show up. A random matrix is almost surely diagonalizable, but engineered matrices in ML (e.g., certain recurrent weight matrices trained to specific dynamics) can be defective or nearly defective. Do not assume diagonalizability without checking the multiplicity condition.

💡Intuition

Eigenvalues encode stability. For discrete-time linear systems $\mathbf{x}_{t+1} = A\mathbf{x}_t$: the system is stable (trajectories decay to zero) if and only if $\rho(A) < 1$. For continuous-time $\dot{\mathbf{x}} = A\mathbf{x}$: stability requires all eigenvalues to have negative real parts. This is why exploding/vanishing gradients in RNNs are analyzed via the spectral radius of the recurrent weight matrix.

💡Intuition

Condition number and optimization. If the Hessian $H$ of a loss surface has $\kappa(H) = \lambda_{\max}/\lambda_{\min} \gg 1$, gradient descent behaves like a ball in a narrow valley — fast along one axis, slow along the perpendicular one. This is why batch normalization accelerates training: it implicitly regularizes the condition number, making the curvature of the loss surface more isotropic.

Eigenvalues in ML: A Conceptual Map

| Application | Matrix | Relevant eigenstructure |
| --- | --- | --- |
| PCA | Covariance $C = \frac{1}{n-1}X^TX$ | Top $k$ eigenvalues/vectors = principal components |
| Spectral clustering | Graph Laplacian $L = D - W$ | Fiedler vector ($\lambda_2$ eigenvector) encodes cluster boundary |
| PageRank | (Row-stochastic) adjacency $A$ | Dominant left eigenvector ($\lambda = 1$) = stationary distribution |
| RNN stability | Recurrent weight $W_{hh}$ | $\rho(W_{hh}) > 1$ → exploding gradients |
| Ridge regression | $X^TX + \alpha I$ | Eigenvalue shrinkage: $\lambda_i \mapsto \lambda_i + \alpha$ |
| Transformer attention | Attention weight matrix | Effective rank via eigenvalue decay = attention collapse diagnostic |
| Neural collapse | Last-layer features + classifier | Within-class covariance collapses; class means form a simplex ETF |

Algebraic Structures: Why Symmetric Matrices are Special

The Spectral Theorem is not a coincidence — it holds precisely because real symmetric matrices (and more generally, normal matrices, those satisfying $AA^T = A^TA$) belong to the special class of unitarily diagonalizable matrices.

For non-symmetric matrices, the correct generalization is the Jordan normal form: every matrix is similar to a block-diagonal matrix of Jordan blocks. Jordan blocks arise from defective eigenvalues and contribute polynomial factors (powers of $k$ multiplying $\lambda^k$) to the growth of matrix powers, which is why they appear in the analysis of unstable recurrences.

The practical lesson: in ML, covariance matrices, graph Laplacians, kernel matrices, and Gram matrices are always symmetric (or Hermitian), so the full Spectral Theorem applies and we get real eigenvalues, orthogonal eigenvectors, and stable numerical algorithms.
