Eigenvalues, Eigenvectors & Diagonalization
Eigenvalues and eigenvectors reveal the special directions along which a linear map acts by pure scaling — no rotation, no shear, just stretch or compression. Diagonalization rewrites a matrix in terms of these invariant directions, transforming matrix powers into scalar powers and turning coupled differential equations into independent ones. In machine learning, PCA, spectral clustering, PageRank, and the analysis of gradient flow in neural networks all reduce, at their core, to an eigenvalue problem.
Concepts
Eigenvalue Geometry — Unit Circle → Transformed Ellipse

[Interactive figure. For the displayed matrix: $\lambda_1 = 3$, $\lambda_2 = 1$; unit eigenvectors $v_1 \approx (-0.707, -0.707)$ and $v_2 \approx (-0.707, 0.707)$; $\operatorname{tr}(A) = 4 = \lambda_1 + \lambda_2$ and $\det(A) = 3 = \lambda_1 \lambda_2$. Real distinct eigenvalues — the matrix is diagonalizable over $\mathbb{R}$.]

The dashed circle is the input. The solid shape is the output after applying $A$. Eigenvectors (colored arrows) are the only directions that stay aligned — they scale but do not rotate.
When a guitar string resonates, it vibrates at specific natural frequencies — special modes where every point moves in sync, stretched and compressed without being twisted. Eigenvectors are the matrix analogue: the special directions that a linear map stretches or flips without rotating. Every covariance matrix, graph Laplacian, and Hessian in machine learning has these invariant directions, and understanding them is what makes PCA, spectral clustering, and optimization analysis tractable.
The Eigenvalue Equation
Let $A \in \mathbb{R}^{n \times n}$. A nonzero vector $v \in \mathbb{R}^n$ is an eigenvector of $A$ with eigenvalue $\lambda$ if

$$Av = \lambda v.$$
Requiring $Av = \lambda v$ with $v \neq 0$ is the only algebraic way to say "a direction that is not rotated." Any weaker condition — allowing $v = 0$, or permitting a component orthogonal to $v$ in the output — would fail to capture invariance under the map.
Geometrically: $v$ spans a fixed direction of $A$ — applying $A$ scales $v$ by $\lambda$ but does not rotate it. Eigenvalues can be zero (the eigenvector is in the null space), negative (the direction flips), or complex (indicating rotation when $A$ is real but non-symmetric).
Non-example. For $A = \begin{pmatrix} 0 & -1 \\ 1 & 0 \end{pmatrix}$ (90° rotation), no real vector satisfies $Av = \lambda v$ — every vector is rotated, so there are no real invariant directions.
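Both cases are easy to verify numerically. A minimal sketch with NumPy (the symmetric matrix is an illustrative example chosen to have a clean eigenpair):

```python
import numpy as np

# Illustrative symmetric matrix: stretches (1,1) by 3 and (1,-1) by 1.
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
v = np.array([1.0, 1.0]) / np.sqrt(2)     # candidate unit eigenvector

Av = A @ v
lam = v @ Av                              # Rayleigh quotient recovers lambda for unit v
assert np.allclose(Av, lam * v)           # A v = 3 v: the direction is invariant
assert np.isclose(lam, 3.0)

# Non-example: the 90-degree rotation has no real invariant direction.
R = np.array([[0.0, -1.0],
              [1.0,  0.0]])
eigvals = np.linalg.eigvals(R)
assert np.all(np.abs(eigvals.imag) > 0)   # eigenvalues are +-i, purely imaginary
```

The assertion on `R` makes the non-example concrete: NumPy returns the complex pair $\pm i$, so no real eigenvector exists.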
The Characteristic Polynomial
Rearranging the eigenvalue equation:

$$(A - \lambda I)v = 0.$$

For a nonzero solution $v$ to exist, $A - \lambda I$ must be singular:

$$\det(A - \lambda I) = 0.$$
The function $p(\lambda) = \det(A - \lambda I)$ is the characteristic polynomial of $A$, a degree-$n$ polynomial in $\lambda$. Its roots are the eigenvalues of $A$.
For $2 \times 2$ matrices, if $A = \begin{pmatrix} a & b \\ c & d \end{pmatrix}$:

$$p(\lambda) = \lambda^2 - (a + d)\lambda + (ad - bc) = \lambda^2 - \operatorname{tr}(A)\,\lambda + \det(A).$$

This gives, via the quadratic formula:

$$\lambda_{1,2} = \frac{\operatorname{tr}(A) \pm \sqrt{\operatorname{tr}(A)^2 - 4\det(A)}}{2}.$$
Key identities (true for any $n \times n$ matrix):

$$\sum_{i=1}^{n} \lambda_i = \operatorname{tr}(A), \qquad \prod_{i=1}^{n} \lambda_i = \det(A).$$
These hold even when eigenvalues are complex. They give cheap sanity checks after computing eigenvalues.
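A quick numerical sanity check of both identities on a random non-symmetric matrix (NumPy assumed; the eigenvalues come back complex, but their sum and product are real up to floating-point error):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))       # generic non-symmetric matrix

lam = np.linalg.eigvals(A)            # complex in general
# Sum of eigenvalues = trace, product of eigenvalues = determinant (over C)
assert np.isclose(lam.sum().real, np.trace(A))
assert abs(lam.sum().imag) < 1e-10    # conjugate pairs cancel
assert np.isclose(np.prod(lam).real, np.linalg.det(A))
```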
Algebraic and Geometric Multiplicity
Once an eigenvalue $\lambda$ is found, its eigenvectors span the eigenspace:

$$E_\lambda = \ker(A - \lambda I) = \{ v : Av = \lambda v \}.$$
Two notions of "how many times" an eigenvalue appears:
| Concept | Definition | Notation |
|---|---|---|
| Algebraic multiplicity | Power of $(\lambda - \lambda_i)$ as a factor of $p(\lambda)$ | $m_a(\lambda_i)$ |
| Geometric multiplicity | $\dim \ker(A - \lambda_i I)$ — dimension of the eigenspace | $m_g(\lambda_i)$ |
Always $1 \le m_g(\lambda) \le m_a(\lambda)$. When $m_g(\lambda) < m_a(\lambda)$, the eigenvalue is defective: the matrix lacks a full set of eigenvectors for that eigenvalue and is not diagonalizable.
Example. The matrix $A = \begin{pmatrix} 2 & 1 \\ 0 & 2 \end{pmatrix}$ has characteristic polynomial $p(\lambda) = (\lambda - 2)^2$, so $m_a(2) = 2$. But $\ker(A - 2I) = \operatorname{span}\{(1, 0)^\top\}$, so $m_g(2) = 1$. Defective — not diagonalizable.
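The defect is easy to detect numerically: the geometric multiplicity is the nullity of $A - \lambda I$, and an eigensolver applied to a defective matrix returns (numerically) dependent eigenvectors. A sketch with NumPy:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [0.0, 2.0]])              # Jordan block: m_a(2) = 2 but m_g(2) = 1

# Geometric multiplicity = dim ker(A - 2I) = 2 - rank(A - 2I)
rank = np.linalg.matrix_rank(A - 2 * np.eye(2))
assert 2 - rank == 1                    # one-dimensional eigenspace: defective

# numpy still returns two eigenvectors, but they are numerically parallel
w, V = np.linalg.eig(A)
assert np.allclose(w, [2.0, 2.0])
assert abs(np.linalg.det(V)) < 1e-6     # columns nearly linearly dependent
```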
Diagonalization
A matrix $A$ is diagonalizable if there exists an invertible $P$ and diagonal $D$ such that

$$A = P D P^{-1}.$$

The columns of $P$ are $n$ linearly independent eigenvectors of $A$; the diagonal entries of $D$ are the corresponding eigenvalues.
Theorem (Diagonalizability Criterion). $A$ is diagonalizable (over a field containing all its eigenvalues) if and only if $m_g(\lambda) = m_a(\lambda)$ for every eigenvalue $\lambda$.
Sufficient condition. If $A$ has $n$ distinct eigenvalues, it is diagonalizable (eigenvectors corresponding to distinct eigenvalues are automatically linearly independent).
Change of basis interpretation. In the eigenbasis, $A$ acts simply as scaling: writing $x = Py$ (so $y$ holds the coordinates of $x$ in the eigenbasis), $Ax = PDy$ — each eigen-coordinate is multiplied by its own $\lambda_i$.
Powers and Functions of Matrices
Diagonalization makes matrix powers trivial:

$$A^k = P D^k P^{-1}, \qquad D^k = \operatorname{diag}(\lambda_1^k, \dots, \lambda_n^k).$$

This is the key that unlocks recurrences. The Fibonacci recurrence can be encoded as a power of the matrix $\begin{pmatrix} 1 & 1 \\ 1 & 0 \end{pmatrix}$, and diagonalization yields the closed-form Binet formula.
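A minimal NumPy sketch of this idea: diagonalize the Fibonacci matrix once, then compute any $F(n)$ as a scalar power of its eigenvalues (the golden ratio and its conjugate) — this is Binet's formula in matrix form.

```python
import numpy as np

# F^n = [[F(n+1), F(n)], [F(n), F(n-1)]] for the Fibonacci matrix F.
F = np.array([[1.0, 1.0],
              [1.0, 0.0]])
lam, P = np.linalg.eig(F)             # lam = (1 +- sqrt(5)) / 2

def fib(n):
    # A^n = P D^n P^{-1}; entry (0, 1) of F^n is F(n)
    Dn = np.diag(lam ** n)
    return int(round((P @ Dn @ np.linalg.inv(P))[0, 1]))

assert [fib(n) for n in range(1, 11)] == [1, 1, 2, 3, 5, 8, 13, 21, 34, 55]
```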
Matrix exponential. For the differential equation $\dot{x} = Ax$, the solution is $x(t) = e^{tA} x(0)$, where (for diagonalizable $A$)

$$e^{tA} = P\, \operatorname{diag}\!\left(e^{\lambda_1 t}, \dots, e^{\lambda_n t}\right) P^{-1}.$$

Stability of the system is determined entirely by the signs of $\operatorname{Re}(\lambda_i)$: if all real parts are negative, the system decays to zero.
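A sketch of the eigendecomposition route to $e^{tA}$ for an illustrative stable system (both eigenvalues have negative real parts, so the trajectory decays):

```python
import numpy as np

# Stable linear system xdot = A x: eigenvalues -1 and -3 (upper triangular).
A = np.array([[-1.0,  2.0],
              [ 0.0, -3.0]])
lam, P = np.linalg.eig(A)
assert np.all(lam.real < 0)            # => trajectories decay to zero

def expm(t):
    # e^{tA} = P diag(e^{t lam}) P^{-1}, valid because A is diagonalizable
    return (P @ np.diag(np.exp(t * lam)) @ np.linalg.inv(P)).real

x0 = np.array([1.0, 1.0])
x = expm(10.0) @ x0
assert np.linalg.norm(x) < 1e-3        # decayed almost to the origin
```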
Eigenvalues of Symmetric Matrices
Real symmetric matrices ($A = A^\top$) enjoy exceptional properties:
Theorem (Spectral Theorem for symmetric matrices). If $A \in \mathbb{R}^{n \times n}$ is symmetric, then:
- All eigenvalues of are real.
- Eigenvectors corresponding to distinct eigenvalues are orthogonal.
- $A$ is orthogonally diagonalizable: $A = Q \Lambda Q^\top$, where $Q$ is orthogonal ($Q^\top Q = I$) and $\Lambda$ is diagonal with real entries.
Proof sketch of (1). Suppose $Av = \lambda v$ with $v \neq 0$, working over $\mathbb{C}$. Then

$$\bar{v}^\top A v = \lambda\, \bar{v}^\top v \quad \text{and} \quad \bar{v}^\top A v = (A \bar{v})^\top v = \bar{\lambda}\, \bar{v}^\top v,$$

so $\lambda\, \bar{v}^\top v = \bar{\lambda}\, \bar{v}^\top v$ with $\bar{v}^\top v = \|v\|^2 > 0$, meaning $\lambda = \bar{\lambda} \in \mathbb{R}$.
Proof sketch of (2). Let $A v_1 = \lambda_1 v_1$, $A v_2 = \lambda_2 v_2$, $\lambda_1 \neq \lambda_2$:

$$\lambda_1\, v_2^\top v_1 = v_2^\top A v_1 = (A v_2)^\top v_1 = \lambda_2\, v_2^\top v_1.$$

Since $\lambda_1 \neq \lambda_2$, we get $v_2^\top v_1 = 0$.
The Spectral Theorem is foundational for PCA, kernel methods, and the analysis of graph Laplacians.
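All three conclusions of the theorem can be verified in a few lines with NumPy's `eigh`, the eigensolver specialized for symmetric/Hermitian matrices (a sketch on a random symmetric matrix):

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.standard_normal((4, 4))
S = (B + B.T) / 2                     # random symmetric matrix

lam, Q = np.linalg.eigh(S)            # eigh: for symmetric/Hermitian input
assert np.all(np.isreal(lam))                     # (1) real eigenvalues
assert np.allclose(Q.T @ Q, np.eye(4))            # (2) orthonormal eigenvectors
assert np.allclose(Q @ np.diag(lam) @ Q.T, S)     # (3) S = Q Lambda Q^T
```

Using `eigh` instead of `eig` on symmetric input is both faster and numerically safer: it guarantees real eigenvalues and an exactly orthogonal $Q$.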
Rayleigh Quotient and Variational Characterization
For a symmetric matrix $A$ with eigenvalues $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_n$, the Rayleigh quotient is

$$R_A(x) = \frac{x^\top A x}{x^\top x}, \qquad x \neq 0.$$
Theorem (Min-Max). The extremal eigenvalues satisfy:

$$\lambda_1 = \max_{x \neq 0} R_A(x), \qquad \lambda_n = \min_{x \neq 0} R_A(x).$$
More generally (Courant-Fischer):

$$\lambda_k = \max_{\substack{S \subseteq \mathbb{R}^n \\ \dim S = k}}\; \min_{\substack{x \in S \\ x \neq 0}} R_A(x).$$
This variational characterization is the basis for PCA (maximize variance = maximize Rayleigh quotient of the covariance matrix) and spectral graph theory (the Fiedler vector minimizes the graph cut Rayleigh quotient).
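The min-max property is directly observable: random directions never produce a Rayleigh quotient outside $[\lambda_n, \lambda_1]$, and the top eigenvector attains the maximum. A sketch on a random PSD matrix:

```python
import numpy as np

rng = np.random.default_rng(2)
B = rng.standard_normal((6, 6))
S = B @ B.T                            # symmetric PSD test matrix

lam, Q = np.linalg.eigh(S)             # ascending eigenvalues

def rayleigh(x):
    return (x @ S @ x) / (x @ x)

# Random directions never beat the extremal eigenvalues (min-max theorem)
for _ in range(1000):
    x = rng.standard_normal(6)
    assert lam[0] - 1e-9 <= rayleigh(x) <= lam[-1] + 1e-9

# The top eigenvector attains the maximum
assert np.isclose(rayleigh(Q[:, -1]), lam[-1])
```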
Spectral Radius and Convergence
The spectral radius of $A$ is $\rho(A) = \max_i |\lambda_i|$. For any induced matrix norm:

$$\rho(A) \le \|A\|, \qquad \lim_{k \to \infty} \|A^k\|^{1/k} = \rho(A).$$
Power iteration converges to the dominant eigenvector (the eigenvector for $\lambda_1$, the eigenvalue of largest magnitude) at rate $|\lambda_2 / \lambda_1|$. Convergence is slow when the top two eigenvalues are close in magnitude — this drives the need for deflation or Krylov subspace methods.
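Power iteration itself fits in a few lines — repeatedly multiply by $A$ and renormalize; the Rayleigh quotient of the converged vector gives the eigenvalue. A minimal sketch (the `power_iteration` helper and the test matrix are illustrative):

```python
import numpy as np

def power_iteration(A, iters=500, seed=0):
    """Dominant eigenpair of A by repeated multiplication and normalization."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(A.shape[0])
    for _ in range(iters):
        v = A @ v
        v /= np.linalg.norm(v)         # renormalize to avoid overflow
    return v @ A @ v, v                # Rayleigh quotient, eigenvector

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])             # eigenvalues 3 and 1: rate |1/3| per step
lam, v = power_iteration(A)
assert np.isclose(lam, 3.0)            # dominant eigenvalue
assert np.allclose(np.abs(v), np.ones(2) / np.sqrt(2))   # direction (1,1)/sqrt(2)
```

Note the sign of `v` is arbitrary — power iteration determines the invariant direction, not an orientation.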
For gradient descent on a quadratic $f(x) = \tfrac{1}{2} x^\top A x$ with $A \succ 0$ and step size $\eta$:

$$x_{k+1} = x_k - \eta A x_k = (I - \eta A)\, x_k,$$

which converges if and only if $\eta < 2 / \lambda_{\max}(A)$. The condition number $\kappa = \lambda_{\max} / \lambda_{\min}$ directly controls optimization speed — the eigenvalue spread of the Hessian determines whether gradient descent converges in a few steps or thousands.
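The effect is easy to measure. A sketch (the `gd_steps` helper is illustrative) comparing step counts on a well-conditioned vs. an ill-conditioned diagonal quadratic:

```python
import numpy as np

def gd_steps(eigs, tol=1e-6, max_steps=100000):
    """Steps for GD on f(x) = 0.5 x^T A x, A = diag(eigs), step size 1/L."""
    A = np.diag(eigs)
    L = max(eigs)                      # largest Hessian eigenvalue
    x = np.ones(len(eigs))
    for k in range(max_steps):
        if np.linalg.norm(x) < tol:
            return k
        x = x - (1.0 / L) * (A @ x)    # gradient of f is A x
    return max_steps

# kappa = 2 converges in tens of steps; kappa = 100 needs over a thousand:
# the component along the small-eigenvalue axis shrinks by only (1 - 1/kappa).
assert gd_steps([1.0, 2.0]) < gd_steps([0.02, 2.0])
```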
Summary Table
| Property | Condition | Consequence |
|---|---|---|
| Diagonalizable | $m_g(\lambda) = m_a(\lambda)$ for all $\lambda$ | $A = PDP^{-1}$, $A^k = PD^kP^{-1}$ |
| Symmetric | $A = A^\top$ | All $\lambda \in \mathbb{R}$, orthogonal eigenvectors |
| Positive definite | $x^\top A x > 0$ for $x \neq 0$; all $\lambda > 0$ | Unique minimum, stable gradient descent |
| Defective | $m_g(\lambda) < m_a(\lambda)$ for some $\lambda$ | No full eigenbasis; use Jordan form |
| Normal | $A^\top A = A A^\top$ | Orthogonally (unitarily) diagonalizable over $\mathbb{C}$ |
Worked Example
Example 1: Full Diagonalization of a Symmetric Matrix
Let $A = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}$.
Step 1: Characteristic polynomial.

$$p(\lambda) = \det\begin{pmatrix} 2 - \lambda & 1 \\ 1 & 2 - \lambda \end{pmatrix} = (2 - \lambda)^2 - 1 = \lambda^2 - 4\lambda + 3 = (\lambda - 3)(\lambda - 1).$$

Eigenvalues: $\lambda_1 = 3$, $\lambda_2 = 1$. Verify: $\operatorname{tr}(A) = 4 = 3 + 1$ ✓, $\det(A) = 3 = 3 \cdot 1$ ✓.
Step 2: Eigenvectors.
For $\lambda_1 = 3$: $A - 3I = \begin{pmatrix} -1 & 1 \\ 1 & -1 \end{pmatrix}$. Row reduce → $x_1 = x_2$. Eigenvector: $v_1 = \frac{1}{\sqrt{2}} (1, 1)^\top$.
For $\lambda_2 = 1$: $A - I = \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}$. Row reduce → $x_1 = -x_2$. Eigenvector: $v_2 = \frac{1}{\sqrt{2}} (1, -1)^\top$.
Verify orthogonality: $v_1^\top v_2 = \tfrac{1}{2}\big(1 \cdot 1 + 1 \cdot (-1)\big) = 0$ ✓.
Step 3: Write the decomposition.

$$A = Q \Lambda Q^\top, \qquad Q = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}, \qquad \Lambda = \begin{pmatrix} 3 & 0 \\ 0 & 1 \end{pmatrix}.$$
Step 4: Compute $A^{10}$.

$$A^{10} = Q \Lambda^{10} Q^\top = Q \begin{pmatrix} 3^{10} & 0 \\ 0 & 1 \end{pmatrix} Q^\top = \frac{1}{2} \begin{pmatrix} 3^{10} + 1 & 3^{10} - 1 \\ 3^{10} - 1 & 3^{10} + 1 \end{pmatrix} = \begin{pmatrix} 29525 & 29524 \\ 29524 & 29525 \end{pmatrix}.$$

Compare this to naively multiplying $A$ ten times — the eigenbasis makes it a one-liner.
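The two routes can be checked against each other in NumPy — eigendecomposition on one side, brute-force `matrix_power` on the other:

```python
import numpy as np

A = np.array([[2, 1],
              [1, 2]])

# Eigendecomposition route: A^10 = Q diag(3^10, 1) Q^T
Q = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
L10 = np.diag([3.0 ** 10, 1.0])
via_eig = Q @ L10 @ Q.T

# Brute force: repeated matrix multiplication
assert np.allclose(via_eig, np.linalg.matrix_power(A, 10))
assert np.isclose(via_eig[0, 0], (3 ** 10 + 1) / 2)   # = 29525
```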
Example 2: Complex Eigenvalues — Rotation Matrix
Let $A = \begin{pmatrix} 0 & -1 \\ 1 & 0 \end{pmatrix}$ (rotation by $90°$).

Characteristic polynomial: $p(\lambda) = \lambda^2 + 1 = 0 \;\Rightarrow\; \lambda = \pm i$.

Since both eigenvalues are purely imaginary, $A$ has no real eigenvectors. The rotation has no invariant direction over $\mathbb{R}$.
General rotation matrix. $R_\theta = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}$ has eigenvalues $e^{\pm i\theta}$. For $\theta \neq 0, \pi$, all eigenvalues are complex — consistent with the geometric fact that rotations (other than by $0$ and $\pi$) leave no real direction fixed.
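A numerical sketch of both the generic case and the $\theta = \pi$ exception:

```python
import numpy as np

def rotation(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

for theta in [0.3, 1.0, np.pi / 2]:
    lam = np.linalg.eigvals(rotation(theta))
    # Eigenvalues are e^{+-i theta}: unit modulus, nonzero imaginary part
    assert np.allclose(np.abs(lam), 1.0)
    assert np.allclose(sorted(lam.imag), [-np.sin(theta), np.sin(theta)])

# theta = pi is the exception: R = -I has real eigenvalue -1 (every direction flips)
assert np.allclose(np.linalg.eigvals(rotation(np.pi)), [-1.0, -1.0])
```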
Example 3: PCA as an Eigenvalue Problem
Given a centered data matrix $X \in \mathbb{R}^{n \times d}$ ($n$ observations, $d$ features), the empirical covariance matrix is

$$\Sigma = \frac{1}{n} X^\top X.$$

$\Sigma$ is symmetric positive semidefinite, so its eigendecomposition is

$$\Sigma = Q \Lambda Q^\top, \qquad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_d \ge 0.$$
The $k$-th column of $Q$ (i.e., the $k$-th eigenvector $q_k$) is the $k$-th principal component — the direction of the $k$-th largest variance. The corresponding eigenvalue $\lambda_k$ is exactly that variance.
Why? For a unit vector $w$, the variance of the projections $Xw$ is $w^\top \Sigma w$ — the Rayleigh quotient of $\Sigma$. PCA maximizes this over $\|w\| = 1$, and by the Spectral Theorem's variational characterization, the maximizer is $q_1$ (the top eigenvector).
Explained variance ratio. The fraction of total variance captured by the top $k$ components:

$$\mathrm{EVR}_k = \frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{d} \lambda_i}.$$

A standard rule of thumb: keep components until the EVR exceeds a chosen threshold (0.9 and 0.95 are common choices).
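PCA from scratch is just the pipeline above: center, form $\Sigma$, call `eigh`, sort descending. A sketch on synthetic 2-D data engineered so that most variance lies along the $(1,1)$ direction:

```python
import numpy as np

rng = np.random.default_rng(3)
# Anisotropic data: std 3 along one latent axis, std 0.5 along the other,
# then rotated so the dominant axis is (1,1)/sqrt(2).
X = rng.standard_normal((1000, 2)) @ np.diag([3.0, 0.5])
X = X @ np.array([[1, 1], [1, -1]]) / np.sqrt(2)
X -= X.mean(axis=0)                    # center (PCA requires this)

cov = X.T @ X / len(X)
lam, Q = np.linalg.eigh(cov)           # ascending order
lam, Q = lam[::-1], Q[:, ::-1]         # sort descending

pc1 = Q[:, 0]
assert np.allclose(np.abs(pc1), np.ones(2) / np.sqrt(2), atol=0.05)  # ~ (1,1)
evr = lam / lam.sum()
assert evr[0] > 0.9                    # first PC captures ~ 9 / 9.25 of variance
```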
Connections
Where Your Intuition Breaks
"The eigenvector with the largest eigenvalue is always the most important." This is true for variance maximization (PCA) but wrong for almost everything else. For stability analysis, large eigenvalues of the recurrent weight matrix cause exploding gradients — the dominant direction is the dangerous one, not the useful one. For optimization, a large eigenvalue of the Hessian means sharp curvature, which slows down gradient descent rather than speeding it up. And eigenvalues can be negative: a negative eigenvalue of the Hessian at a critical point means that direction is a descent direction — a saddle, not a minimum. The eigenvalue's sign and context determine its meaning, not its magnitude alone.
Eigenvalue Problem — Method Comparison
| Method | Time complexity | When to use |
|---|---|---|
| Characteristic polynomial (exact) | No closed-form roots for $n \ge 5$ in general | $n \le 3$, by hand only |
| Power iteration | $O(n^2)$ per step (dense), $O\!\big(\tfrac{\log(1/\epsilon)}{\log\lvert\lambda_1/\lambda_2\rvert}\big)$ steps | Only largest eigenvalue; sparse matrices |
| QR algorithm | $O(n^3)$ (dense); standard method | All eigenvalues of dense matrices |
| Lanczos / Arnoldi | $O(k \cdot \mathrm{nnz}(A))$ for $k$ eigenvalues | Large sparse matrices (graphs, PCA on high-$d$ data) |
| Randomized SVD | $O(ndk)$ for a rank-$k$ approximation | Very large matrices; PCA with $k \ll d$ |
The QR algorithm (Francis, Kublanovskaya) is the workhorse behind `numpy.linalg.eig`. For deep learning, the dominant eigenvalue of the Hessian (relevant for learning-rate selection and loss-landscape analysis) is usually estimated via power iteration or Lanczos with Hessian-vector products, never via full eigendecomposition.
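Assuming SciPy is available, the Lanczos route can be sketched with `scipy.sparse.linalg.eigsh` (ARPACK), which only needs matrix-vector products — here on a dense stand-in, checked against the full dense solver:

```python
import numpy as np
from scipy.sparse.linalg import eigsh

rng = np.random.default_rng(4)
B = rng.standard_normal((200, 200))
S = B @ B.T                            # symmetric PSD "large" matrix stand-in

# Lanczos: only the k = 3 largest-algebraic ('LA') eigenpairs are computed
top_lam, top_vec = eigsh(S, k=3, which='LA')

full = np.linalg.eigvalsh(S)           # dense O(n^3) reference
assert np.allclose(np.sort(top_lam), full[-3:])
```

In practice `S` would be a sparse matrix or a `LinearOperator` wrapping Hessian-vector products, so the $n \times n$ matrix is never materialized.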
Common Pitfalls
Numerical near-degeneracy. When two eigenvalues are very close ($|\lambda_i - \lambda_j|$ comparable to machine precision times $\|A\|$), their individual eigenvectors are numerically ill-conditioned. Any small perturbation can rotate the eigenspace dramatically. Algorithms that return individual eigenvectors in this regime may be meaningless — report eigenspaces (i.e., spans) instead.
Defective matrices are measure-zero but show up. A random matrix is almost surely diagonalizable, but engineered matrices in ML (e.g., certain recurrent weight matrices trained to specific dynamics) can be defective or nearly defective. Do not assume diagonalizability without checking the multiplicity condition.
Eigenvalues encode stability. For discrete-time linear systems $x_{t+1} = A x_t$: the system is stable (trajectories decay to zero) if and only if $\rho(A) < 1$. For continuous-time $\dot{x} = Ax$: stability requires all eigenvalues to have negative real parts. This is why exploding/vanishing gradients in RNNs are analyzed via the spectral radius of the recurrent weight matrix.
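The discrete-time criterion can be demonstrated directly — iterate two illustrative systems whose spectral radii straddle 1 and watch one decay and the other explode:

```python
import numpy as np

def simulate(A, steps=200):
    x = np.ones(A.shape[0])
    for _ in range(steps):
        x = A @ x                      # discrete-time system x_{t+1} = A x_t
    return np.linalg.norm(x)

stable   = np.array([[0.5, 0.2], [0.1, 0.6]])   # eigenvalues 0.7, 0.4: rho < 1
unstable = np.array([[1.1, 0.2], [0.1, 1.0]])   # eigenvalues 1.2, 0.9: rho > 1

assert max(abs(np.linalg.eigvals(stable))) < 1
assert max(abs(np.linalg.eigvals(unstable))) > 1
assert simulate(stable) < 1e-8         # decays toward zero
assert simulate(unstable) > 1e6        # explodes like 1.2^t
```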
Condition number and optimization. If the Hessian of a loss surface has $\kappa \gg 1$, gradient descent behaves like a ball in a narrow valley — fast along one axis, slow along the perpendicular one. This is part of why normalization layers such as batch normalization accelerate training: they improve the conditioning of the optimization problem, making the curvature of the loss surface more isotropic.
Eigenvalues in ML: A Conceptual Map
| Application | Matrix | Relevant eigenstructure |
|---|---|---|
| PCA | Covariance $\Sigma$ | Top eigenvalues/vectors = principal components |
| Spectral clustering | Graph Laplacian $L$ | Fiedler vector (eigenvector of the second-smallest eigenvalue) encodes the cluster boundary |
| PageRank | (Row-stochastic) adjacency | Dominant left eigenvector ($\lambda = 1$) = stationary distribution |
| RNN stability | Recurrent weight $W$ | $\rho(W) > 1$ → exploding gradients |
| Ridge regression | $X^\top X + \alpha I$ | Eigenvalue shrinkage: coefficients along $q_i$ scaled by $\frac{\lambda_i}{\lambda_i + \alpha}$ |
| Transformer attention | Attention weight matrix | Effective rank via eigenvalue decay = attention-collapse diagnostic |
| Neural collapse | Last-layer features + classifier | Within-class covariance eigenvalues collapse to zero; class means form a simplex ETF |
Algebraic Structures: Why Symmetric Matrices are Special
The Spectral Theorem is not a coincidence — it holds precisely because real symmetric matrices (and more generally, normal matrices, those satisfying $A^* A = A A^*$) belong to a special class where the matrix is unitarily diagonalizable.
For non-symmetric matrices, the correct generalization is the Jordan normal form: every matrix is similar to a block-diagonal matrix of Jordan blocks. Jordan blocks arise from defective eigenvalues and encode polynomial (not exponential) growth in matrix powers, which is why they appear in the analysis of unstable recurrences.
The practical lesson: in ML, covariance matrices, graph Laplacians, kernel matrices, and Gram matrices are always symmetric (or Hermitian), so the full Spectral Theorem applies and we get real eigenvalues, orthogonal eigenvectors, and stable numerical algorithms.