
Eigenvalues, Eigenvectors & Diagonalization

Eigenvalues and eigenvectors reveal the special directions along which a linear map acts by pure scaling — no rotation, no shear, just stretch or compression. Diagonalization rewrites a matrix in terms of these invariant directions, transforming matrix powers into scalar powers and turning coupled differential equations into independent ones. In machine learning, PCA, spectral clustering, PageRank, and the analysis of gradient flow in neural networks all reduce, at their core, to an eigenvalue problem.

Concepts

Eigenvalue Geometry — Unit Circle → Transformed Ellipse

For $A = \begin{pmatrix}2 & 1 \\ 1 & 2\end{pmatrix}$:

- $\lambda_1 = 3.000$, with $\mathbf{v}_1 = (-0.707, -0.707)$
- $\lambda_2 = 1.000$, with $\mathbf{v}_2 = (-0.707, 0.707)$
- $\operatorname{tr}(A) = 4.000 = \lambda_1 + \lambda_2$
- $\det(A) = 3.000 = \lambda_1 \cdot \lambda_2$
- Real distinct eigenvalues — matrix is diagonalizable over $\mathbb{R}$

The dashed circle is the input. The solid shape is the output after applying A. Eigenvectors (colored arrows) are the only directions that stay aligned — they scale but do not rotate.

When a guitar string resonates, it vibrates at specific natural frequencies — special modes where every point moves in sync, stretched and compressed without being twisted. Eigenvectors are the matrix analogue: the special directions that a linear map stretches or flips without rotating. Every covariance matrix, graph Laplacian, and Hessian in machine learning has these invariant directions, and understanding them is what makes PCA, spectral clustering, and optimization analysis tractable.

The Eigenvalue Equation

Let $A \in \mathbb{R}^{n \times n}$. A nonzero vector $\mathbf{v} \in \mathbb{R}^n$ is an eigenvector of $A$ with eigenvalue $\lambda \in \mathbb{C}$ if

$$A\mathbf{v} = \lambda \mathbf{v}.$$

Requiring $\mathbf{v} \neq \mathbf{0}$ and $A\mathbf{v} = \lambda\mathbf{v}$ is the only algebraic way to say "a direction that is not rotated." Any weaker condition — allowing $\mathbf{v} = \mathbf{0}$, or permitting a component orthogonal to $\mathbf{v}$ in the output — would fail to capture invariance under the map.

Geometrically: $\mathbf{v}$ is a fixed direction of $A$ — applying $A$ scales $\mathbf{v}$ by $\lambda$ but does not rotate it. Eigenvalues can be zero (the eigenvector is in the null space), negative (the direction flips), or complex (indicating rotation when $A$ is real but non-symmetric).

Non-example. For $A = \begin{pmatrix}0 & -1 \\ 1 & 0\end{pmatrix}$ (90° rotation), no real vector satisfies $A\mathbf{v} = \lambda \mathbf{v}$ — every vector is rotated, so there are no real invariant directions.
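Both claims are easy to verify numerically. The sketch below (NumPy, using the symmetric matrix from the figure above as a hand-picked example) checks $A\mathbf{v} = \lambda\mathbf{v}$ along each eigen-direction, and confirms that the rotation non-example has no real eigenvalues:

```python
import numpy as np

# A symmetric example: A v = lambda v holds along the eigen-directions.
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
lam, V = np.linalg.eig(A)          # columns of V are eigenvectors

for i in range(2):
    v = V[:, i]
    # A v should equal lambda * v up to floating-point error
    assert np.allclose(A @ v, lam[i] * v)

# Non-example: the 90-degree rotation has no real eigenvalues.
R = np.array([[0.0, -1.0],
              [1.0,  0.0]])
lam_R = np.linalg.eigvals(R)
print(lam_R)                        # purely imaginary pair ±1j
```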

The Characteristic Polynomial

Rearranging the eigenvalue equation:

$$A\mathbf{v} = \lambda \mathbf{v} \iff (A - \lambda I)\mathbf{v} = \mathbf{0}.$$

For a nonzero solution to exist, $A - \lambda I$ must be singular:

$$\det(A - \lambda I) = 0.$$

The function $p(\lambda) = \det(A - \lambda I)$ is the characteristic polynomial of $A$, a degree-$n$ polynomial in $\lambda$. Its roots are the eigenvalues of $A$.

For $2 \times 2$ matrices, if $A = \begin{pmatrix}a & b \\ c & d\end{pmatrix}$:

$$p(\lambda) = \lambda^2 - \underbrace{(a+d)}_{\operatorname{tr}(A)}\lambda + \underbrace{(ad - bc)}_{\det(A)} = 0.$$

This gives the quadratic formula:

$$\lambda = \frac{\operatorname{tr}(A) \pm \sqrt{\operatorname{tr}(A)^2 - 4\det(A)}}{2}.$$

Key identities (true for any $n \times n$ matrix):

$$\operatorname{tr}(A) = \sum_{i=1}^n \lambda_i, \qquad \det(A) = \prod_{i=1}^n \lambda_i.$$

These hold even when eigenvalues are complex. They give cheap sanity checks after computing eigenvalues.
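The $2 \times 2$ quadratic formula and the trace/determinant checks translate directly into code. A minimal sketch (the helper `eig2x2` is ours, not a library function):

```python
import numpy as np

def eig2x2(A):
    """Eigenvalues of a 2x2 matrix via the trace/determinant quadratic."""
    tr, det = np.trace(A), np.linalg.det(A)
    disc = np.sqrt(complex(tr**2 - 4 * det))   # complex sqrt handles rotations
    return (tr + disc) / 2, (tr - disc) / 2

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
l1, l2 = eig2x2(A)
print(l1, l2)                      # 3 and 1 (with zero imaginary parts)

# Cheap sanity checks: trace = sum of eigenvalues, det = product.
assert np.isclose(np.trace(A), (l1 + l2).real)
assert np.isclose(np.linalg.det(A), (l1 * l2).real)
```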

Algebraic and Geometric Multiplicity

Once an eigenvalue $\lambda_0$ is found, its eigenvectors span the eigenspace:

$$E_{\lambda_0} = \ker(A - \lambda_0 I) = \{\mathbf{v} : A\mathbf{v} = \lambda_0 \mathbf{v}\}.$$

Two notions of "how many times" an eigenvalue appears:

| Concept | Definition | Notation |
| --- | --- | --- |
| Algebraic multiplicity | Power of $(\lambda - \lambda_0)$ as a factor of $p(\lambda)$ | $a(\lambda_0)$ |
| Geometric multiplicity | $\dim \ker(A - \lambda_0 I)$ — dimension of the eigenspace | $g(\lambda_0)$ |

Always $1 \leq g(\lambda_0) \leq a(\lambda_0)$. When $g(\lambda_0) < a(\lambda_0)$, the eigenvalue is defective and the matrix is not diagonalizable.

Example. The matrix $A = \begin{pmatrix}2 & 1 \\ 0 & 2\end{pmatrix}$ has characteristic polynomial $(\lambda-2)^2$, so $a(2) = 2$. But $\ker(A - 2I) = \ker\begin{pmatrix}0 & 1 \\ 0 & 0\end{pmatrix} = \operatorname{span}\{(1, 0)^T\}$, so $g(2) = 1 < 2$. Defective — not diagonalizable.
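The geometric multiplicity can be computed as $n - \operatorname{rank}(A - \lambda_0 I)$ by the rank-nullity theorem. A quick check on the Jordan-block example above:

```python
import numpy as np

# Jordan block: characteristic polynomial (lambda - 2)^2, so a(2) = 2.
A = np.array([[2.0, 1.0],
              [0.0, 2.0]])

# Geometric multiplicity = dim ker(A - 2I) = n - rank(A - 2I).
g = 2 - np.linalg.matrix_rank(A - 2 * np.eye(2))
print(g)                           # 1 < 2: defective, not diagonalizable
```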

Diagonalization

A matrix $A \in \mathbb{R}^{n \times n}$ is diagonalizable if there exists an invertible $P$ and diagonal $D$ such that

$$A = PDP^{-1}, \qquad D = \operatorname{diag}(\lambda_1, \ldots, \lambda_n).$$

The columns of $P$ are linearly independent eigenvectors of $A$; the diagonal entries of $D$ are the corresponding eigenvalues.

Theorem (Diagonalizability Criterion). $A$ is diagonalizable over $\mathbb{C}$ if and only if $g(\lambda_i) = a(\lambda_i)$ for every eigenvalue $\lambda_i$.

Sufficient condition. If $A$ has $n$ distinct eigenvalues, it is diagonalizable (eigenvectors corresponding to distinct eigenvalues are automatically linearly independent).

Change of basis interpretation. In the eigenbasis, $A$ acts simply as scaling:

$$D = P^{-1}AP \implies D\mathbf{e}_i = \lambda_i \mathbf{e}_i.$$
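Both directions of the change of basis can be verified numerically. A minimal sketch, reusing the symmetric example from the figure:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
lam, P = np.linalg.eig(A)
D = np.diag(lam)

# A = P D P^{-1}: change to the eigenbasis, scale, change back.
assert np.allclose(P @ D @ np.linalg.inv(P), A)

# In the eigenbasis, A acts as pure scaling: P^{-1} A P is diagonal.
D_check = np.linalg.inv(P) @ A @ P
assert np.allclose(D_check, D)
```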

Powers and Functions of Matrices

Diagonalization makes matrix powers trivial:

Ak=PDkP1,Dk=diag(λ1k,,λnk).A^k = PD^kP^{-1}, \qquad D^k = \operatorname{diag}(\lambda_1^k, \ldots, \lambda_n^k).

This is the key that unlocks recurrences. The Fibonacci sequence $F_{n+1} = F_n + F_{n-1}$ encodes as a matrix power, and diagonalization yields the closed-form Binet formula.
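The Fibonacci encoding uses the standard companion matrix $M = \begin{pmatrix}1 & 1 \\ 1 & 0\end{pmatrix}$, whose eigenvalues are the golden ratio and its conjugate; its $(0,1)$ entry after $n$ multiplications is $F_n$. A sketch of the diagonalization route:

```python
import numpy as np

# Fibonacci as a matrix power: M = [[1, 1], [1, 0]] satisfies
# M^n = [[F_{n+1}, F_n], [F_n, F_{n-1}]].
M = np.array([[1.0, 1.0],
              [1.0, 0.0]])
lam, P = np.linalg.eig(M)          # lam holds the golden ratio and its conjugate

def fib(n):
    # M^n = P diag(lam^n) P^{-1}; entry (0, 1) is F_n (Binet in disguise)
    Mn = P @ np.diag(lam**n) @ np.linalg.inv(P)
    return round(Mn[0, 1])

print([fib(n) for n in range(1, 11)])   # [1, 1, 2, 3, 5, 8, 13, 21, 34, 55]
```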

Matrix exponential. For the differential equation $\dot{\mathbf{x}} = A\mathbf{x}$, the solution is $\mathbf{x}(t) = e^{At}\mathbf{x}_0$ where

$$e^{At} = Pe^{Dt}P^{-1}, \qquad e^{Dt} = \operatorname{diag}(e^{\lambda_1 t}, \ldots, e^{\lambda_n t}).$$

Stability of the system is determined entirely by the signs of $\operatorname{Re}(\lambda_i)$: if all real parts are negative, the system decays to zero.
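The eigendecomposition formula for $e^{At}$ can be implemented directly (for a diagonalizable $A$; the matrix below is a hand-picked stable example with eigenvalues $-0.5$ and $-1.5$):

```python
import numpy as np

# Solve x' = A x via the eigendecomposition: e^{At} = P e^{Dt} P^{-1}.
A = np.array([[-1.0,  0.5],
              [ 0.5, -1.0]])       # symmetric, eigenvalues -0.5 and -1.5
lam, P = np.linalg.eig(A)
assert np.all(lam.real < 0)        # all real parts negative -> stable

def expAt(t):
    return P @ np.diag(np.exp(lam * t)) @ np.linalg.inv(P)

x0 = np.array([1.0, -2.0])
x_late = expAt(10.0) @ x0
print(np.linalg.norm(x_late))      # tiny: the trajectory decays to zero
```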

Eigenvalues of Symmetric Matrices

Real symmetric matrices ($A = A^T$) enjoy exceptional properties:

Theorem (Spectral Theorem for symmetric matrices). If $A \in \mathbb{R}^{n \times n}$ is symmetric, then:

  1. All eigenvalues of $A$ are real.
  2. Eigenvectors corresponding to distinct eigenvalues are orthogonal.
  3. $A$ is orthogonally diagonalizable: $A = Q\Lambda Q^T$ where $Q$ is orthogonal ($Q^TQ = I$) and $\Lambda$ is diagonal with real entries.

Proof sketch of (1). Suppose $A\mathbf{v} = \lambda\mathbf{v}$ with $\mathbf{v} \neq \mathbf{0}$, working over $\mathbb{C}$. Then
$$\bar{\lambda}\|\mathbf{v}\|^2 = \bar{\mathbf{v}}^T A^T \mathbf{v} = \bar{\mathbf{v}}^T A \mathbf{v} = \lambda\|\mathbf{v}\|^2,$$
so $\bar{\lambda} = \lambda$, meaning $\lambda \in \mathbb{R}$.

Proof sketch of (2). Let $A\mathbf{u} = \lambda\mathbf{u}$, $A\mathbf{v} = \mu\mathbf{v}$, $\lambda \neq \mu$:
$$\lambda \langle \mathbf{u}, \mathbf{v} \rangle = \langle A\mathbf{u}, \mathbf{v} \rangle = \langle \mathbf{u}, A^T\mathbf{v} \rangle = \langle \mathbf{u}, A\mathbf{v} \rangle = \mu \langle \mathbf{u}, \mathbf{v} \rangle.$$
Since $\lambda \neq \mu$, we get $\langle \mathbf{u}, \mathbf{v} \rangle = 0$.

The Spectral Theorem is foundational for PCA, kernel methods, and the analysis of graph Laplacians.
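All three conclusions of the Spectral Theorem can be observed numerically. In NumPy, `np.linalg.eigh` is the routine specialized for symmetric/Hermitian matrices (a random symmetrized matrix serves as the example here):

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
A = (B + B.T) / 2                  # symmetrize: A = A^T

# eigh exploits symmetry: real eigenvalues (ascending), orthonormal eigenvectors.
lam, Q = np.linalg.eigh(A)
assert np.all(np.isreal(lam))                   # (1) real spectrum
assert np.allclose(Q.T @ Q, np.eye(4))          # (2)-(3) Q is orthogonal
assert np.allclose(Q @ np.diag(lam) @ Q.T, A)   # (3) A = Q Lambda Q^T
```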

Rayleigh Quotient and Variational Characterization

For a symmetric matrix $A \in \mathbb{R}^{n \times n}$ with eigenvalues $\lambda_1 \leq \lambda_2 \leq \cdots \leq \lambda_n$, the Rayleigh quotient is

$$R(\mathbf{x}) = \frac{\mathbf{x}^T A \mathbf{x}}{\mathbf{x}^T \mathbf{x}}, \qquad \mathbf{x} \neq \mathbf{0}.$$

Theorem (Min-Max). The extremal eigenvalues satisfy:

$$\lambda_{\min} = \min_{\mathbf{x} \neq 0} R(\mathbf{x}), \qquad \lambda_{\max} = \max_{\mathbf{x} \neq 0} R(\mathbf{x}).$$

More generally (Courant-Fischer):

$$\lambda_k = \min_{\dim V = k}\ \max_{\mathbf{x} \in V,\, \mathbf{x} \neq 0} R(\mathbf{x}).$$

This variational characterization is the basis for PCA (maximize variance = maximize Rayleigh quotient of the covariance matrix) and spectral graph theory (the Fiedler vector minimizes the graph cut Rayleigh quotient).
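The min-max bounds are easy to see empirically: every Rayleigh quotient of a hand-picked symmetric matrix (eigenvalues 2 and 4) lands in $[\lambda_{\min}, \lambda_{\max}]$, and the eigenvectors attain the endpoints:

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 3.0]])         # eigenvalues 2 and 4

def rayleigh(A, x):
    return (x @ A @ x) / (x @ x)

# Random vectors: R(x) always lies in [lambda_min, lambda_max].
rng = np.random.default_rng(1)
vals = [rayleigh(A, rng.standard_normal(2)) for _ in range(1000)]
assert 2.0 - 1e-9 <= min(vals) and max(vals) <= 4.0 + 1e-9

# The top eigenvector (1, 1)/sqrt(2) attains the maximum.
v_max = np.array([1.0, 1.0]) / np.sqrt(2)
assert np.isclose(rayleigh(A, v_max), 4.0)
```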

Spectral Radius and Convergence

The spectral radius of $A$ is $\rho(A) = \max_i |\lambda_i|$. For any induced matrix norm:

$$\rho(A) \leq \|A\|.$$

Power iteration converges to the dominant eigenvector (the one belonging to the eigenvalue of largest magnitude) at rate $|\lambda_2/\lambda_1|$ per iteration. Convergence is slow when the top two eigenvalues are close in magnitude — this drives the need for deflation or Krylov subspace methods.
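Power iteration itself is a few lines: repeatedly apply $A$ and renormalize; the Rayleigh quotient of the limit estimates the dominant eigenvalue. A minimal sketch (the iteration count is a hand-picked assumption, generous for the example's $|\lambda_2/\lambda_1| = 1/3$):

```python
import numpy as np

def power_iteration(A, iters=200, seed=0):
    """Dominant eigenpair; error shrinks by ~|lambda_2/lambda_1| per step."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(A.shape[0])
    for _ in range(iters):
        v = A @ v
        v /= np.linalg.norm(v)     # renormalize to avoid overflow
    lam = v @ A @ v                # Rayleigh quotient estimate
    return lam, v

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])         # dominant eigenvalue 3, then 1
lam, v = power_iteration(A)
print(lam)                         # ~3.0
```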

For gradient descent on a quadratic $f(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T A \mathbf{x} - \mathbf{b}^T \mathbf{x}$ with $A \succ 0$:

$$\text{convergence rate} = \left(\frac{\kappa - 1}{\kappa + 1}\right)^2, \qquad \kappa = \frac{\lambda_{\max}}{\lambda_{\min}} = \kappa(A).$$

The condition number $\kappa(A)$ directly controls optimization speed — the eigenvalue spread of the Hessian determines whether gradient descent converges in a few steps or thousands.
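The effect is easy to demonstrate on two hand-picked diagonal quadratics, one well-conditioned ($\kappa = 2$) and one ill-conditioned ($\kappa = 100$), each run with the classical optimal fixed step size $2/(\lambda_{\min} + \lambda_{\max})$:

```python
import numpy as np

def gd_steps(A, b, lr, tol=1e-6, max_iter=100_000):
    """Steps gradient descent needs on f(x) = 0.5 x'Ax - b'x to reach tol."""
    x = np.zeros_like(b)
    for k in range(max_iter):
        grad = A @ x - b
        if np.linalg.norm(grad) < tol:
            return k
        x -= lr * grad
    return max_iter

b = np.array([1.0, 1.0])
well = np.diag([1.0, 2.0])         # kappa = 2
ill  = np.diag([1.0, 100.0])       # kappa = 100
# Optimal fixed step size is 2 / (lambda_min + lambda_max).
print(gd_steps(well, b, 2 / 3))    # converges in a handful of steps
print(gd_steps(ill, b, 2 / 101))   # hundreds of steps: kappa sets the rate
```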

Summary Table

| Property | Condition | Consequence |
| --- | --- | --- |
| Diagonalizable | $g(\lambda_i) = a(\lambda_i)$ for all $i$ | $A = PDP^{-1}$, $A^k = PD^kP^{-1}$ |
| Symmetric | $A = A^T$ | All $\lambda_i \in \mathbb{R}$, orthogonal eigenvectors |
| Positive definite | $A = A^T$, all $\lambda_i > 0$ | Unique minimum, stable gradient descent |
| Defective | $\exists i: g(\lambda_i) < a(\lambda_i)$ | No full eigenbasis; use Jordan form |
| Normal | $AA^T = A^TA$ | Unitarily diagonalizable over $\mathbb{C}$ |

Worked Example

Example 1: Full Diagonalization of a $2 \times 2$ Symmetric Matrix

Let $A = \begin{pmatrix}3 & 1 \\ 1 & 3\end{pmatrix}$.

Step 1: Characteristic polynomial.

$$\det(A - \lambda I) = (3-\lambda)^2 - 1 = \lambda^2 - 6\lambda + 8 = (\lambda - 4)(\lambda - 2) = 0.$$

Eigenvalues: $\lambda_1 = 4$, $\lambda_2 = 2$. Verify: $\operatorname{tr}(A) = 6 = 4 + 2$ ✓, $\det(A) = 8 = 4 \cdot 2$ ✓.

Step 2: Eigenvectors.

For $\lambda_1 = 4$: $(A - 4I)\mathbf{v} = \begin{pmatrix}-1 & 1 \\ 1 & -1\end{pmatrix}\mathbf{v} = \mathbf{0}$. Row reduce → $v_1 = v_2$. Eigenvector: $\mathbf{p}_1 = \frac{1}{\sqrt{2}}\begin{pmatrix}1 \\ 1\end{pmatrix}$.

For $\lambda_2 = 2$: $(A - 2I)\mathbf{v} = \begin{pmatrix}1 & 1 \\ 1 & 1\end{pmatrix}\mathbf{v} = \mathbf{0}$. Row reduce → $v_1 = -v_2$. Eigenvector: $\mathbf{p}_2 = \frac{1}{\sqrt{2}}\begin{pmatrix}1 \\ -1\end{pmatrix}$.

Verify orthogonality: $\mathbf{p}_1 \cdot \mathbf{p}_2 = \frac{1}{2}(1 \cdot 1 + 1 \cdot (-1)) = 0$ ✓.

Step 3: Write the decomposition.

$$A = Q\Lambda Q^T = \frac{1}{\sqrt{2}}\begin{pmatrix}1 & 1 \\ 1 & -1\end{pmatrix} \begin{pmatrix}4 & 0 \\ 0 & 2\end{pmatrix} \frac{1}{\sqrt{2}}\begin{pmatrix}1 & 1 \\ 1 & -1\end{pmatrix}.$$

Step 4: Compute $A^{10}$.

$$A^{10} = Q\Lambda^{10}Q^T = \frac{1}{2}\begin{pmatrix}1 & 1 \\ 1 & -1\end{pmatrix}\begin{pmatrix}4^{10} & 0 \\ 0 & 2^{10}\end{pmatrix}\begin{pmatrix}1 & 1 \\ 1 & -1\end{pmatrix} = \frac{1}{2}\begin{pmatrix}4^{10}+2^{10} & 4^{10}-2^{10} \\ 4^{10}-2^{10} & 4^{10}+2^{10}\end{pmatrix}.$$

Compare this to naively multiplying AA ten times — the eigenbasis makes it a one-liner.
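The closed form above can be checked against brute-force repeated multiplication:

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 3.0]])

# Closed form from the eigendecomposition: entries (4^10 ± 2^10) / 2.
s, d = 4**10 + 2**10, 4**10 - 2**10
A10_closed = 0.5 * np.array([[s, d],
                             [d, s]])

# Compare against naive repeated multiplication.
A10 = np.linalg.matrix_power(A, 10)
assert np.allclose(A10, A10_closed)
print(A10_closed[0, 0])            # 524800.0
```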

Example 2: Complex Eigenvalues — Rotation Matrix

Let $A = \begin{pmatrix}0 & -1 \\ 1 & 0\end{pmatrix}$ (rotation by 90°).

Characteristic polynomial: $\lambda^2 + 1 = 0 \implies \lambda = \pm i$.

Since both eigenvalues are purely imaginary, $A$ has no real eigenvectors. The rotation has no invariant direction over $\mathbb{R}$.

General rotation matrix. $R_\theta = \begin{pmatrix}\cos\theta & -\sin\theta \\ \sin\theta & \cos\theta\end{pmatrix}$ has eigenvalues $e^{\pm i\theta} = \cos\theta \pm i\sin\theta$. For $\theta \neq 0, \pi$, all eigenvalues are complex — consistent with the geometric fact that rotations (other than 0° and 180°) leave no real direction fixed.
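Numerically, the eigenvalues of any rotation land on the unit circle at angle $\pm\theta$ (a 60° rotation is used here as a hand-picked example):

```python
import numpy as np

theta = np.pi / 3                  # 60-degree rotation
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

lam = np.linalg.eigvals(R)
print(lam)                         # cos(theta) ± i sin(theta) = e^{±i theta}

# Both lie on the unit circle; the real part recovers cos(theta).
assert np.allclose(np.abs(lam), 1.0)
assert np.allclose(lam.real, np.cos(theta))
```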

Example 3: PCA as an Eigenvalue Problem

Given a data matrix $X \in \mathbb{R}^{n \times d}$ (centered, $n$ observations, $d$ features), the empirical covariance matrix is

$$C = \frac{1}{n-1}X^TX \in \mathbb{R}^{d \times d}.$$

$C$ is symmetric positive semidefinite, so its eigendecomposition is

$$C = Q\Lambda Q^T, \qquad \Lambda = \operatorname{diag}(\lambda_1, \ldots, \lambda_d), \quad \lambda_1 \geq \cdots \geq \lambda_d \geq 0.$$

The $k$-th column of $Q$ (i.e., the $k$-th eigenvector) is the $k$-th principal component — the direction of the $k$-th largest variance. The corresponding eigenvalue $\lambda_k$ is exactly that variance.

Why? For a unit vector $\mathbf{u}$, the variance of the projections $X\mathbf{u}$ is $\mathbf{u}^T C \mathbf{u}$ — the Rayleigh quotient of $C$. PCA maximizes this over $\mathbf{u}$, and by the Spectral Theorem's variational characterization, the maximizer is $\mathbf{q}_1$ (the top eigenvector).

Explained variance ratio. The fraction of total variance captured by the top $k$ components:

$$\text{EVR}(k) = \frac{\sum_{j=1}^k \lambda_j}{\sum_{j=1}^d \lambda_j} = \frac{\sum_{j=1}^k \lambda_j}{\operatorname{tr}(C)}.$$

A standard rule of thumb: keep components until EVR $\geq 0.90$.
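The whole pipeline — center, form $C$, eigendecompose, sort, compute EVR — fits in a few lines. A sketch on synthetic 2-D data deliberately stretched along the $(1, 1)$ direction (the data-generating choices are ours, for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 2-D data: std 3 vs 0.5, then rotated 45 degrees so the
# dominant variance direction is (1, 1)/sqrt(2).
X = rng.standard_normal((500, 2)) @ np.diag([3.0, 0.5])
X = X @ (np.array([[1.0, 1.0], [-1.0, 1.0]]) / np.sqrt(2))
X -= X.mean(axis=0)                # center before computing covariance

C = X.T @ X / (len(X) - 1)
lam, Q = np.linalg.eigh(C)         # eigh returns ascending order
lam, Q = lam[::-1], Q[:, ::-1]     # sort descending for PCA

evr = lam / lam.sum()
print(evr)                         # first component captures ~97% of variance
top_pc = Q[:, 0]
print(top_pc)                      # close to ±(1, 1)/sqrt(2)
```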

Connections

Where Your Intuition Breaks

The eigenvector with the largest eigenvalue is always the most important. This is true for variance maximization (PCA) but wrong for almost everything else. For stability analysis, large eigenvalues of the recurrent weight matrix cause exploding gradients — the dominant direction is the dangerous one, not the useful one. For optimization, a large eigenvalue of the Hessian means sharp curvature, which slows down gradient descent rather than speeding it up. And eigenvalues can be negative: a negative eigenvalue of the Hessian at a critical point means that direction is a descent direction — a saddle, not a minimum. The eigenvalue's sign and context determine its meaning, not its magnitude alone.

Eigenvalue Problem — Method Comparison

| Method | Time complexity | When to use |
| --- | --- | --- |
| Characteristic polynomial (exact) | $O(n^3)$ for $n \times n$ in general | $2 \times 2$, $3 \times 3$ by hand only |
| Power iteration | $O(n^2)$ per step, $k$ steps | Only largest eigenvalue; sparse $A$ |
| QR algorithm | $O(n^3)$ (dense); standard method | All eigenvalues of dense matrices |
| Lanczos / Arnoldi | $O(kn^2)$ for $k$ eigenvalues | Large sparse matrices (graphs, PCA on high-$d$ data) |
| Randomized SVD | $O(ndk)$ for rank-$k$ approximation | Very large matrices; PCA with $d, n \gg k$ |

The QR algorithm (Francis's implicitly shifted QR iteration) is the workhorse behind numpy.linalg.eig. For deep learning, the dominant eigenvalue of the Hessian (relevant for learning-rate selection and loss landscape analysis) is usually estimated via power iteration or Lanczos, never via full eigendecomposition.

Common Pitfalls

⚠️Warning

Numerical near-degeneracy. When two eigenvalues are very close ($|\lambda_i - \lambda_j| \ll \epsilon_{\text{machine}} \|A\|$), their eigenvectors are numerically ill-conditioned. Any small perturbation can rotate the eigenspace dramatically. Algorithms that return individual eigenvectors in this regime may be meaningless — report eigenspaces (i.e., spans) instead.

⚠️Warning

Defective matrices are measure-zero but show up. A random matrix is almost surely diagonalizable, but engineered matrices in ML (e.g., certain recurrent weight matrices trained to specific dynamics) can be defective or nearly defective. Do not assume diagonalizability without checking the multiplicity condition.

💡Intuition

Eigenvalues encode stability. For discrete-time linear systems $\mathbf{x}_{t+1} = A\mathbf{x}_t$: the system is stable (trajectories decay to zero) if and only if $\rho(A) < 1$. For continuous-time $\dot{\mathbf{x}} = A\mathbf{x}$: stability requires all eigenvalues to have negative real parts. This is why exploding/vanishing gradients in RNNs are analyzed via the spectral radius of the recurrent weight matrix.

💡Intuition

Condition number and optimization. If the Hessian $H$ of a loss surface has $\kappa(H) = \lambda_{\max}/\lambda_{\min} \gg 1$, gradient descent behaves like a ball in a narrow valley — fast along one axis, slow along the perpendicular one. This is why batch normalization accelerates training: it implicitly regularizes the condition number, making the curvature of the loss surface more isotropic.

Eigenvalues in ML: A Conceptual Map

| Application | Matrix | Relevant eigenstructure |
| --- | --- | --- |
| PCA | Covariance $C = \frac{1}{n-1}X^TX$ | Top $k$ eigenvalues/vectors = principal components |
| Spectral clustering | Graph Laplacian $L = D - W$ | Fiedler vector ($\lambda_2$ eigenvector) encodes cluster boundary |
| PageRank | (Row-stochastic) adjacency $A$ | Dominant left eigenvector ($\lambda = 1$) = stationary distribution |
| RNN stability | Recurrent weight $W_{hh}$ | $\rho(W_{hh}) > 1$ → exploding gradients |
| Ridge regression | $X^TX + \alpha I$ | Eigenvalue shrinkage: $\lambda_i \mapsto \lambda_i + \alpha$ |
| Transformer attention | Attention weight matrix | Effective rank via eigenvalue decay = attention collapse diagnostic |
| Neural collapse | Last-layer features + classifier | Within-class covariance collapses; class means form a simplex ETF |

Algebraic Structures: Why Symmetric Matrices are Special

The Spectral Theorem is not a coincidence — it holds precisely because real symmetric matrices (and more generally, normal matrices, those satisfying $AA^T = A^TA$) belong to the special class of unitarily diagonalizable matrices.

For non-symmetric matrices, the correct generalization is the Jordan normal form: every matrix is similar to a block-diagonal matrix of Jordan blocks. Jordan blocks arise from defective eigenvalues and contribute polynomial factors (powers of $k$ multiplying $\lambda^k$) to the growth of matrix powers, which is why they appear in the analysis of unstable recurrences.

The practical lesson: in ML, covariance matrices, graph Laplacians, kernel matrices, and Gram matrices are always symmetric (or Hermitian), so the full Spectral Theorem applies and we get real eigenvalues, orthogonal eigenvectors, and stable numerical algorithms.
