
Floating Point Arithmetic, Numerical Stability & Condition Numbers

Every computation in neural network training passes through floating-point arithmetic. Understanding how rounding errors accumulate, how subtraction causes catastrophic cancellation, and how condition numbers govern the amplification of input errors into output errors is essential for diagnosing numerical failures and designing stable algorithms.

Concepts

Every floating-point number is a rounded approximation: most real numbers cannot be represented exactly in binary, so the computer rounds to the nearest representable value. For a single operation the error is tiny, at most \varepsilon_{\text{mach}} \approx 10^{-7} for float32. The danger is compounding: hundreds of operations, each introducing a small relative error, can produce results with no accurate digits. Understanding which operations amplify errors and which cancel them is what separates stable algorithms from silently wrong ones.

IEEE 754 Floating-Point Representation

IEEE 754 represents a floating-point number as (-1)^s \cdot m \cdot 2^e, where s \in \{0,1\} is the sign, m is the significand (mantissa), and e is the exponent. For float32 (single precision): 1 sign bit, 8 exponent bits, 23 mantissa bits (plus 1 implicit leading bit). For float64 (double precision): 1 sign bit, 11 exponent bits, 52 mantissa bits.

Machine epsilon \varepsilon_{\text{mach}} is the gap between 1 and the next representable number; adding anything much smaller than \varepsilon_{\text{mach}} to 1 is rounded away entirely:

  • float32: \varepsilon_{\text{mach}} \approx 1.19 \times 10^{-7} (about 7 significant decimal digits)
  • float64: \varepsilon_{\text{mach}} \approx 2.22 \times 10^{-16} (about 15–16 significant decimal digits)
  • float16: \varepsilon_{\text{mach}} \approx 9.77 \times 10^{-4} (about 3 significant decimal digits)
  • bfloat16: \varepsilon_{\text{mach}} \approx 7.81 \times 10^{-3} (8 exponent bits like float32, but only 7 mantissa bits)

The fundamental axiom of floating-point arithmetic (rounding model): for any elementary operation \circ \in \{+, -, \times, /\},

\text{fl}(x \circ y) = (x \circ y)(1 + \delta), \quad |\delta| \leq \varepsilon_{\text{mach}}.

Each operation introduces a relative error of at most \varepsilon_{\text{mach}}.

This rounding model is the foundation of backward error analysis: instead of asking "how wrong is my answer?", ask "for which perturbed input would my answer be exact?" An algorithm is backward stable if the perturbation needed to make the computed result exact is comparable in size to \varepsilon_{\text{mach}}. This reframing is powerful: it separates the error introduced by the algorithm (backward error) from the amplification by the problem itself (condition number).

Catastrophic Cancellation

When two nearly equal numbers are subtracted, the leading significant digits cancel and few accurate digits remain. The subtraction itself satisfies

\text{fl}(x - y) = (x - y)(1 + \delta), \quad |\delta| \leq \varepsilon_{\text{mach}},

so the operation is nearly exact. The danger is that rounding errors already present in x and y, of size about \varepsilon_{\text{mach}}|x|, are unchanged in absolute terms but become enormous relative to the small difference x - y.

Example: compute x = 1.000001 and y = 1.000000 in float32. The true difference is 10^{-6}, but float32 resolves only about 10^{-7} near 1: x rounds to 1 + 8 \cdot 2^{-23} \approx 1.00000095, so the computed difference is \approx 9.54 \times 10^{-7}, an error of almost 5% from a single subtraction.
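This example runs as-is in NumPy (a minimal sketch):

```python
import numpy as np

x = np.float32(1.000001)   # stored as 1 + 8 * 2**-23 ≈ 1.00000095
y = np.float32(1.0)        # exactly representable
diff = float(x - y)        # the subtraction exposes x's storage error

true_diff = 1e-6
rel_err = abs(diff - true_diff) / true_diff
print(diff, rel_err)       # relative error of a few percent
```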

Remedy: algebraic reformulation avoids subtraction of nearly-equal values.

| Unstable form | Stable form |
| --- | --- |
| \sqrt{x+1} - 1 (near x = 0) | x / (\sqrt{x+1} + 1) |
| \log(1+x) (near x = 0) | Use log1p(x) |
| e^x - 1 (near x = 0) | Use expm1(x) |
| a^2 - b^2 (when a \approx b) | (a-b)(a+b) |
| Softmax e^{x_i} / \sum_j e^{x_j} (large logits) | Subtract \max_j x_j first (log-sum-exp trick) |
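The log1p/expm1 rows can be checked numerically; a sketch for e^x - 1 near zero:

```python
import numpy as np

x = 1e-12
naive = np.exp(x) - 1.0    # exp(x) rounds near 1, then the subtraction cancels
stable = np.expm1(x)       # computes e^x - 1 without ever forming e^x

# For tiny x, e^x - 1 = x + x^2/2 + ... ≈ x
print(naive)   # only the first few digits are right
print(stable)  # accurate to full precision
```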

Error Analysis and Forward/Backward Stability

Forward error: |\hat f(x) - f(x)|, the absolute difference between computed and true output.

Backward stability: an algorithm \hat f is backward stable if \hat f(x) = f(\hat x) for some \hat x with |\hat x - x| / |x| \leq \text{a few} \cdot \varepsilon_{\text{mach}}. The algorithm gives the exact answer to a slightly perturbed input.

Rule of thumb (backward stability + conditioning): |\hat f(x) - f(x)| \lesssim \kappa(f, x) \cdot \varepsilon_{\text{mach}} \cdot |f(x)|, where \kappa is the condition number. A backward stable algorithm's forward error is at most the condition number times the backward error.

Condition Numbers

The condition number of a problem f: \mathbb{R}^n \to \mathbb{R}^m at x measures the amplification of relative input errors:

\kappa(f, x) = \lim_{\varepsilon \to 0} \sup_{\|\delta x\| \leq \varepsilon \|x\|} \frac{\|\delta f\| / \|f\|}{\|\delta x\| / \|x\|} = \frac{\|x\| \cdot \|J_f(x)\|}{\|f(x)\|}.

For a linear system Ax = b, the condition number of A is

\kappa(A) = \|A\| \cdot \|A^{-1}\| = \frac{\sigma_{\max}(A)}{\sigma_{\min}(A)},

the ratio of the largest to smallest singular value. The forward error bound is

\frac{\|\hat x - x\|}{\|x\|} \leq \kappa(A) \cdot \frac{\|\hat b - b\|}{\|b\|} + O(\kappa(A)^2 \varepsilon_{\text{mach}}).

A system with \kappa(A) = 10^k loses about k decimal digits of accuracy. If \kappa(A) \cdot \varepsilon_{\text{mach}} \geq 1, the solution is essentially meaningless.
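A classic illustration of lost digits, using the notoriously ill-conditioned Hilbert matrix (a sketch; the size n is an illustrative choice):

```python
import numpy as np

n = 10
# Hilbert matrix H[i, j] = 1 / (i + j + 1); its condition number grows
# exponentially with n, reaching roughly 1e13 at n = 10
H = np.array([[1.0 / (i + j + 1) for j in range(n)] for i in range(n)])
x_true = np.ones(n)
b = H @ x_true                  # right-hand side with known exact solution

x_hat = np.linalg.solve(H, b)   # backward stable LU solve in float64
kappa = np.linalg.cond(H)
rel_err = np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true)
print(f"kappa = {kappa:.1e}")
print(f"relative error = {rel_err:.1e}")  # most of float64's ~16 digits are gone
```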

Condition of eigenvalue problems: for a symmetric matrix, the condition number for computing eigenvalue \lambda_i is \kappa(\lambda_i) = 1 (eigenvalues are perfectly conditioned). For a non-symmetric matrix, eigenvalue condition numbers can be arbitrarily large; they blow up as the corresponding left and right eigenvectors become nearly orthogonal to each other.

Condition of solving a linear system: computing x = A^{-1}b has condition number \kappa(A) regardless of method. Forming A^{-1} explicitly and multiplying costs more and is less backward stable than solving Ax = b directly (e.g., by LU factorization with partial pivoting), so prefer solving the system to inverting the matrix.

Numerical Stability of Common Operations

Dot product: c = \sum_i a_i b_i in floating point incurs O(n\varepsilon_{\text{mach}}) forward error. Compensated (Kahan) summation reduces this to O(\varepsilon_{\text{mach}}) independent of n:

def kahan_sum(xs):
    s = 0.0     # running sum
    comp = 0.0  # compensation: low-order bits lost from s so far
    for x in xs:
        y = x - comp        # reintroduce the error from the last step
        t = s + y           # big + small: low-order bits of y are dropped here...
        comp = (t - s) - y  # ...and recovered exactly by this algebra
        s = t
    return s
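To see what compensation buys, compare plain left-to-right accumulation against Python's math.fsum, an exact-summation relative of Kahan's algorithm:

```python
import math

xs = [0.1] * 10_000_000   # 0.1 is not exactly representable in binary

naive = 0.0
for x in xs:              # plain left-to-right accumulation
    naive += x

accurate = math.fsum(xs)  # exact sum of the float values, then a single rounding

print(abs(naive - 1e6))     # the plain loop drifts measurably
print(abs(accurate - 1e6))  # fsum stays at the correctly rounded result
```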

Matrix-vector product: stable; error O(n\varepsilon_{\text{mach}}) \|A\| \|x\|.

Triangular solve (forward/back substitution): backward stable. It is Gaussian elimination that needs partial pivoting; without pivoting, elimination can be arbitrarily unstable.

Worked Examples

Example 1: Log-Sum-Exp and Numerical Softmax

Computing softmax \sigma_i = e^{x_i} / \sum_j e^{x_j} for logits x = (1000, 1001, 1002):

Naive: e^{1002} > 10^{434}, which overflows in float32 (max \approx 3.4 \times 10^{38}) to \infty; the ratio \infty/\infty then evaluates to NaN.

Stable: subtract M = \max_j x_j = 1002 first:

\sigma_i = \frac{e^{x_i - M}}{\sum_j e^{x_j - M}} = \frac{e^{x_i - 1002}}{e^{-2} + e^{-1} + e^0} = \frac{e^{x_i - 1002}}{1.503}.

Now all exponents are \leq 0, so nothing overflows. The log-sum-exp trick is the same idea: \log\sum_j e^{x_j} = M + \log\sum_j e^{x_j - M}.
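A sketch of both versions in float32 NumPy; the naive path overflows exactly as described:

```python
import numpy as np

logits = np.array([1000.0, 1001.0, 1002.0], dtype=np.float32)

with np.errstate(over="ignore", invalid="ignore"):  # silence the expected warnings
    naive = np.exp(logits) / np.exp(logits).sum()   # inf / inf -> nan

shifted = logits - logits.max()   # exponents now in {-2, -1, 0}
e = np.exp(shifted)
stable = e / e.sum()

print(naive)    # [nan nan nan]
print(stable)   # ≈ [0.090, 0.245, 0.665], sums to 1
```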

This exact pattern appears in every deep learning framework's cross-entropy implementation.

Example 2: Condition Number in Linear Regression

The normal equations A^T A \hat\beta = A^T y have condition number \kappa(A^T A) = \kappa(A)^2. Forming the normal equations explicitly squares the condition number, which is potentially catastrophic for ill-conditioned A.

Remedy: solve via QR factorization A = QR, then R\hat\beta = Q^T y. The QR solve has condition number \kappa(A), not \kappa(A)^2.

For example, if A has \kappa(A) = 10^7 and we use float32 (\varepsilon_{\text{mach}} \approx 10^{-7}): the normal equations have \kappa = 10^{14} > 1/\varepsilon_{\text{mach}}, so the solution is numerically meaningless. QR works with \kappa = 10^7, which just barely fits in float32.
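The same gap is visible in float64: a sketch constructing a design matrix with known \kappa(A) = 10^7, so that \kappa^2 = 10^{14} consumes most of float64's 16 digits (sizes, seed, and singular values are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 100, 5

# Build A = U diag(s) V^T with singular values from 1 down to 1e-7,
# giving kappa(A) = 1e7 by construction
U, _ = np.linalg.qr(rng.standard_normal((m, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
s = np.logspace(0, -7, n)
A = (U * s) @ V.T

beta_true = np.ones(n)
y = A @ beta_true

# Normal equations: condition number kappa(A)^2 = 1e14
beta_ne = np.linalg.solve(A.T @ A, A.T @ y)

# QR route: condition number kappa(A) = 1e7
Q, R = np.linalg.qr(A)
beta_qr = np.linalg.solve(R, Q.T @ y)

print("normal equations error:", np.linalg.norm(beta_ne - beta_true))
print("QR error:", np.linalg.norm(beta_qr - beta_true))
```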

Example 3: Condition Number of the Hessian in Optimization

For minimizing f(\theta) with gradient descent, the condition number \kappa(\nabla^2 f) of the Hessian controls convergence. For a quadratic f(\theta) = \frac{1}{2}\theta^T H \theta - b^T \theta:

  • Gradient descent converges in O(\kappa(H) \log(1/\varepsilon)) steps
  • Conjugate gradient converges in O(\sqrt{\kappa(H)} \log(1/\varepsilon)) steps
  • Newton's method (with exact H): O(\log\log(1/\varepsilon)) steps

A neural network loss with \kappa(H) \approx 10^6 requires on the order of 10^6 gradient descent steps vs 10^3 conjugate gradient steps. Preconditioning reduces the effective \kappa; Adam's adaptive learning rates are an approximation to diagonal preconditioning.
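The O(\kappa) scaling for gradient descent is easy to verify on a 2-D quadratic (a sketch; the step size and tolerance are illustrative choices):

```python
import numpy as np

def gd_steps(kappa, tol=1e-6, max_iter=1_000_000):
    """Steps for gradient descent to reach ||theta|| < tol on
    f(theta) = 0.5 * theta^T H theta with H = diag(1, kappa)."""
    H = np.diag([1.0, float(kappa)])
    theta = np.ones(2)
    lr = 2.0 / (1.0 + kappa)   # optimal fixed step: 2 / (lambda_min + lambda_max)
    for k in range(max_iter):
        if np.linalg.norm(theta) < tol:
            return k
        theta = theta - lr * (H @ theta)
    return max_iter

# Step counts grow roughly linearly with kappa
print(gd_steps(10), gd_steps(100), gd_steps(1000))
```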

Connections

Where Your Intuition Breaks

Condition numbers describe the worst-case amplification of input error to output error, but an ill-conditioned problem solved by a backward stable algorithm can still produce accurate results if the rounding errors happen not to excite the worst-case direction. Conversely, a well-conditioned problem solved by an unstable algorithm (e.g., classical Gram-Schmidt orthogonalization without re-orthogonalization) can produce inaccurate results. The condition number is a property of the problem and limits what any algorithm can guarantee; backward stability is a property of the algorithm and guarantees forward error proportional to the condition number. A guarantee of accuracy needs both; neither alone is sufficient.

💡Intuition

Condition number is a property of the problem, not the algorithm. An ill-conditioned problem cannot be solved accurately by any algorithm, no matter how clever. If \kappa(A) = 10^{14} and you use float64 (about 16 decimal digits), roughly 14 of those digits are consumed by the problem's inherent sensitivity, leaving only a digit or two for the answer. Preconditioning does not reduce \kappa(A); it replaces A with P^{-1}A, where \kappa(P^{-1}A) is smaller. This is a change of problem, not of algorithm.

💡Intuition

The log-sum-exp trick is everywhere in ML because probabilities are tiny. Computing log-likelihoods involves \log\sum_i p_i where each p_i is exponentially small (e.g., e^{-100}). Direct summation underflows to zero; taking the log then gives -\infty. The log-sum-exp factored form keeps values in range throughout. The same trick appears in attention (softmax of logits), CTC loss, the HMM forward algorithm, normalizing flows, and any computation involving probabilities of sequences. It is not optional: without it, training fails silently with NaN losses.

⚠️Warning

bfloat16 and float16 have very different numerical properties. float16 has \varepsilon_{\text{mach}} \approx 10^{-3} and max value \approx 65504: it overflows for activations or weights that exceed \sim 10^4. bfloat16 has \varepsilon_{\text{mach}} \approx 10^{-2} (worse precision than float16) but max value \approx 3.4 \times 10^{38} (same as float32): it rarely overflows. Modern LLM training uses bfloat16 for activations (avoiding overflow) with float32 master weights for gradient accumulation (maintaining precision). Mixing float16 and bfloat16 in a pipeline silently produces incorrect results due to their different overflow behaviors.
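The float16 overflow threshold is easy to trip (a quick NumPy check; NumPy has no bfloat16 dtype, so only the float16 half is shown):

```python
import numpy as np

print(float(np.finfo(np.float16).max))        # largest finite float16: 65504.0

with np.errstate(over="ignore"):              # silence the expected warning
    big = np.float16(30000.0)
    print(big * np.float16(3.0))              # 90000 > 65504 overflows to inf

print(np.float32(30000.0) * np.float32(3.0))  # the same product is fine in float32
```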
