Floating Point Arithmetic, Numerical Stability & Condition Numbers
Every computation in neural network training passes through floating-point arithmetic. Understanding how rounding errors accumulate, how subtraction causes catastrophic cancellation, and how condition numbers determine how input errors are amplified into output errors is essential for diagnosing numerical failures and designing stable algorithms.
Concepts
Every floating-point number is a rounded approximation: most real numbers cannot be represented exactly in binary, so the computer rounds to the nearest representable value. For a single operation the error is tiny — at most $\epsilon \approx 1.2 \times 10^{-7}$ for float32. The danger is compounding: hundreds of operations, each introducing a small relative error, can produce results with no accurate digits. Understanding which operations amplify errors and which cancel them is what separates stable algorithms from silently wrong ones.
IEEE 754 Floating-Point Representation
IEEE 754 represents a floating-point number as $x = (-1)^s \cdot m \cdot 2^e$ where $s$ is the sign, $m$ is the significand (mantissa), and $e$ is the exponent. For float32 (single precision): 1 sign bit, 8 exponent bits, 23 mantissa bits (plus 1 implicit leading bit). For float64 (double precision): 1 sign bit, 11 exponent bits, 52 mantissa bits.
Machine epsilon $\epsilon$ is the smallest number such that $1 + \epsilon \neq 1$ in floating point:
- float32: $\epsilon = 2^{-23} \approx 1.19 \times 10^{-7}$ (about 7 significant decimal digits)
- float64: $\epsilon = 2^{-52} \approx 2.22 \times 10^{-16}$ (about 15–16 significant decimal digits)
- float16: $\epsilon = 2^{-10} \approx 9.77 \times 10^{-4}$ (about 3 significant decimal digits)
- bfloat16: $\epsilon = 2^{-7} \approx 7.81 \times 10^{-3}$ (8 exponent bits like float32, but only 7 mantissa bits)
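These values can be checked directly. A quick sketch with NumPy — bfloat16 is not a native NumPy dtype, so its epsilon is computed from its 7 mantissa bits:

```python
import numpy as np

# Machine epsilon for the IEEE types NumPy provides. np.finfo reports
# eps = 2**-(mantissa bits), the gap between 1.0 and the next float.
for dtype in (np.float16, np.float32, np.float64):
    print(dtype.__name__, np.finfo(dtype).eps)

# bfloat16 has 7 explicit mantissa bits, so eps = 2**-7.
print("bfloat16", 2.0 ** -7)
```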
The fundamental axiom of floating-point arithmetic (Rounding Model): for any elementary operation $\circ \in \{+, -, \times, \div\}$:
$$\mathrm{fl}(x \circ y) = (x \circ y)(1 + \delta), \qquad |\delta| \le \epsilon.$$
Each operation introduces a relative error of at most $\epsilon$.
This rounding model is the foundation of backward error analysis: instead of asking "how wrong is my answer?", ask "for which perturbed input would my answer be exact?" An algorithm is backward stable if the perturbation needed to make the computed result exact is comparable in size to $\epsilon$. This reframing is powerful — it separates the error introduced by the algorithm (backward error) from the amplification by the problem itself (condition number).
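The rounding model can be verified for a single operation using exact rational arithmetic — a small sketch with Python's `fractions` module, which converts a float to the exact rational value it stores:

```python
from fractions import Fraction

# Verify fl(x + y) = (x + y)(1 + delta), |delta| <= eps, for one
# float64 addition. Fraction(f) gives the exact rational value of a
# float, so the exact sum of the *stored* operands is computable.
x, y = 0.1, 0.2
exact = Fraction(x) + Fraction(y)      # exact sum of the stored values
computed = Fraction(x + y)             # what float64 addition returned
delta = abs(computed - exact) / exact  # relative rounding error
print(float(delta))                    # below 2**-52
assert delta <= Fraction(2) ** -52
```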
Catastrophic Cancellation
When two nearly equal numbers are subtracted, the significant digits cancel and the result has few accurate digits:
If $\hat{x} = x(1 + \delta_1)$ and $\hat{y} = y(1 + \delta_2)$ with $|\delta_i| \le \epsilon$, then
$$\frac{|(\hat{x} - \hat{y}) - (x - y)|}{|x - y|} = \frac{|x\delta_1 - y\delta_2|}{|x - y|},$$
and if $x \approx y$, the relative error of the result is large even though $|\delta_1|, |\delta_2| \le \epsilon$.
Example: compute $x = 1 + 10^{-7}$ and $y = 1$ in float32. Their true difference is $10^{-7}$, but float32 represents each to only $\sim 10^{-7}$ relative precision — the result has essentially 0 correct digits.
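Cancellation is easy to reproduce in NumPy — subtracting two float32 values that differ by roughly one part in $10^7$ returns a full ulp of 1.0 rather than the true difference:

```python
import numpy as np

# Catastrophic cancellation in float32: the true difference is 1e-7,
# but each operand carries only ~1e-7 relative precision, so the
# subtraction returns one ulp of 1.0 (2**-23) instead.
x = np.float32(1.0 + 1e-7)
y = np.float32(1.0)
diff = x - y
print(diff)                              # 1.1920929e-07, not 1e-07
rel_err = abs(float(diff) - 1e-7) / 1e-7
print(rel_err)                           # ~0.19: about 19% relative error
```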
Remedy: algebraic reformulation avoids subtraction of nearly-equal values.
| Unstable form | Stable form |
|---|---|
| $1 - \cos x$ (near $x = 0$) | $2\sin^2(x/2)$ |
| $\log(1 + x)$ (near $x = 0$) | Use `log1p(x)` |
| $e^x - 1$ (near $x = 0$) | Use `expm1(x)` |
| Softmax $e^{x_i} / \sum_j e^{x_j}$ | Subtract $\max_j x_j$ first (log-sum-exp trick) |
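The `log1p`/`expm1` rows can be demonstrated in one line each: for tiny $x$, forming $1 + x$ first rounds back to exactly $1.0$ and destroys all information about $x$:

```python
import math

x = 1e-16
# Naive log(1 + x): 1 + 1e-16 rounds to exactly 1.0 in float64,
# so the logarithm is 0.0 -- all information about x is lost.
print(math.log(1 + x))    # 0.0
# log1p computes log(1 + x) without forming 1 + x explicitly.
print(math.log1p(x))      # 1e-16

# Same story for exp(x) - 1 near 0:
print(math.exp(x) - 1)    # 0.0
print(math.expm1(x))      # 1e-16
```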
Error Analysis and Forward/Backward Stability
Forward error: $\|\hat{f}(x) - f(x)\|$, the difference between the computed output $\hat{f}(x)$ and the true output $f(x)$.
Backward stability: an algorithm $\hat{f}$ is backward stable if $\hat{f}(x) = f(x + \Delta x)$ for some $\Delta x$ with $\|\Delta x\| \le c\,\epsilon\,\|x\|$. The algorithm gives the exact answer to a slightly perturbed input.
Combining the two: relative forward error $\lesssim \kappa(x) \times$ relative backward error, where $\kappa(x)$ is the condition number of the problem.
Condition Numbers
The condition number of a problem $f$ at input $x$ measures the amplification of relative errors:
$$\kappa(x) = \lim_{\delta \to 0} \sup_{\|\Delta x\| \le \delta \|x\|} \frac{\|f(x + \Delta x) - f(x)\| / \|f(x)\|}{\|\Delta x\| / \|x\|}$$
For a linear system $Ax = b$, the condition number of $A$ is:
$$\kappa(A) = \|A\|\,\|A^{-1}\| = \frac{\sigma_{\max}(A)}{\sigma_{\min}(A)},$$
the ratio of the largest to smallest singular value. The forward error bound:
$$\frac{\|\hat{x} - x\|}{\|x\|} \lesssim \kappa(A)\,\epsilon.$$
A system with $\kappa(A) = 10^k$ loses about $k$ decimal digits of accuracy. If $\kappa(A) \gtrsim 1/\epsilon$, the solution is essentially meaningless.
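Hilbert matrices are the textbook example of rapidly worsening conditioning; a short NumPy sketch shows $\kappa$ exploding with size until even float64 is exhausted:

```python
import numpy as np

# Hilbert matrix H[i, j] = 1 / (i + j + 1): a classic ill-conditioned
# family whose condition number grows roughly exponentially in n.
def hilbert(n):
    i = np.arange(n)
    return 1.0 / (i[:, None] + i[None, :] + 1)

for n in (4, 8, 12):
    print(n, np.linalg.cond(hilbert(n)))
# kappa ~1e4 at n=4, ~1e10 at n=8, ~1e16 (the float64 limit) at n=12
```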
Condition of eigenvalue problems: for a symmetric matrix, the condition number for computing each eigenvalue is 1 (eigenvalues are perfectly conditioned). For a non-symmetric matrix, eigenvalue condition numbers can be arbitrarily large (ill-conditioned eigenvectors).
Condition of matrix inversion: $\kappa(A^{-1}) = \kappa(A)$. Computing the inverse explicitly roughly doubles the error in practice — prefer to solve the linear system directly (`solve(A, b)`, not `inv(A) @ b`).
Numerical Stability of Common Operations
Dot product: computing $x^\top y$ for $n$-dimensional vectors in floating point incurs $O(n\epsilon)$ forward error. Compensated summation (Kahan) reduces this to $O(\epsilon)$, independent of $n$:
```python
def kahan_sum(xs):
    s = 0.0   # running sum
    c = 0.0   # compensation: low-order bits lost from s so far
    for x in xs:
        y = x - c          # re-inject previously lost bits
        t = s + y          # s absorbs the high-order part of y
        c = (t - s) - y    # recover the part of y that did not fit in s
        s = t
    return s
```
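A quick check of compensated summation, self-contained, with the compensation loop written inline and `math.fsum` as an exact reference:

```python
import math

# Summing 1.0 followed by 1e6 copies of 1e-16: each naive addition
# 1.0 + 1e-16 rounds back to 1.0 (1e-16 < eps/2), so every tiny term
# vanishes. Kahan compensation preserves their total of 1e-10.
xs = [1.0] + [1e-16] * 10**6

naive = 0.0
for x in xs:
    naive += x

s, c = 0.0, 0.0
for x in xs:
    y = x - c          # re-inject previously lost low-order bits
    t = s + y
    c = (t - s) - y    # bits of y that did not fit into s
    s = t

print(naive)           # 1.0 -- the million tiny terms vanished
print(s)               # ~1.0000000001
print(math.fsum(xs))   # exactly rounded reference sum
```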
Matrix-vector product: backward stable; forward error $O(n\epsilon)\,\|A\\|\,\|x\|$.
Triangular solve (back substitution): backward stable. Gaussian elimination: stable with partial pivoting; can be unstable without it.
Worked Example
Example 1: Log-Sum-Exp and Numerical Softmax
Computing softmax for logits $x = (x_1, \dots, x_n)$ with large entries, say $x_i \approx 1000$:
Naive: $\mathrm{softmax}(x)_i = e^{x_i} / \sum_j e^{x_j}$ — overflow in float32 ($e^x$ overflows past $x \approx 88.7$, since the float32 max is $\approx 3.4 \times 10^{38}$). Result: $\infty/\infty = $ NaN.
Stable: subtract $m = \max_j x_j$ first:
$$\mathrm{softmax}(x)_i = \frac{e^{x_i - m}}{\sum_j e^{x_j - m}}$$
Now all exponents are $\le 0$, so no overflow. The log-sum-exp trick: $\log \sum_i e^{x_i} = m + \log \sum_i e^{x_i - m}$.
This exact pattern appears in every deep learning framework's cross-entropy implementation.
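A minimal sketch of the two versions side by side (the function name `softmax_stable` and the example logits are illustrative choices):

```python
import numpy as np

def softmax_stable(x):
    # Subtract the max before exponentiating: the shift leaves the
    # result unchanged mathematically, and the largest exponent
    # becomes 0, so np.exp can never overflow.
    z = x - np.max(x)
    e = np.exp(z)
    return e / e.sum()

logits = np.array([1000.0, 1001.0, 1002.0], dtype=np.float32)

with np.errstate(over="ignore", invalid="ignore"):
    naive = np.exp(logits) / np.exp(logits).sum()  # inf / inf -> nan
probs = softmax_stable(logits)

print(naive)   # [nan nan nan]
print(probs)   # ~[0.0900 0.2447 0.6652], sums to 1
```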
Example 2: Condition Number in Linear Regression
The normal equations $A^\top A x = A^\top b$ have condition number $\kappa(A^\top A) = \kappa(A)^2$. Forming the normal equations explicitly squares the condition number — potentially catastrophic for ill-conditioned $A$.
Remedy: solve via QR factorization $A = QR$, then $Rx = Q^\top b$. The QR solve has condition number $\kappa(A)$, not $\kappa(A)^2$.
For example, if $A$ has $\kappa(A) = 10^4$ and we use float32 ($\epsilon \approx 10^{-7}$): the normal equations have $\kappa = 10^8 > 1/\epsilon$ — the solution is numerically meaningless. QR gives $\kappa = 10^4$, which just barely works in float32.
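The squaring is easy to observe numerically. A sketch that synthesizes a matrix with a chosen condition number from its SVD and compares $\kappa(A)$ with $\kappa(A^\top A)$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Build a 100x4 matrix with kappa(A) = 1e4 via A = U diag(s) V^T,
# where U and V have orthonormal columns and s are the singular values.
U, _ = np.linalg.qr(rng.standard_normal((100, 4)))
V, _ = np.linalg.qr(rng.standard_normal((4, 4)))
s = np.array([1.0, 1e-1, 1e-2, 1e-4])
A = U @ np.diag(s) @ V.T

print(np.linalg.cond(A))        # ~1e4
print(np.linalg.cond(A.T @ A))  # ~1e8: forming A^T A squares kappa
```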
Example 3: Condition Number of the Hessian in Optimization
For minimizing $f(x)$ with gradient descent, the condition number $\kappa$ of the Hessian controls convergence. For a quadratic $f(x) = \frac{1}{2} x^\top A x - b^\top x$, $\kappa = \kappa(A) = \lambda_{\max}/\lambda_{\min}$:
- Gradient descent converges in $O(\kappa \log(1/\epsilon))$ steps
- Conjugate gradient converges in $O(\sqrt{\kappa} \log(1/\epsilon))$ steps
- Newton's method (with exact Hessian): $O(1)$ steps (a single step on a quadratic)
A neural network loss with $\kappa \approx 10^6$ requires $\sim 10^6$ gradient descent steps vs $\sim 10^3$ conjugate gradient steps. Preconditioning reduces the effective $\kappa$ — Adam's adaptive learning rates are an approximation to diagonal preconditioning.
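The per-step contraction factor $(\kappa - 1)/(\kappa + 1)$ can be checked on a tiny diagonal quadratic (a sketch; the eigenvalues $1$ and $100$ and the step count are illustrative choices):

```python
import numpy as np

# Gradient descent on f(x) = 0.5 x^T A x with A = diag(1, 100), so
# kappa = 100. With the optimal step 2/(lambda_min + lambda_max), the
# error contracts by (kappa - 1)/(kappa + 1) = 99/101 ~ 0.98 per step,
# hence O(kappa) steps to shrink the error by a constant factor.
lams = np.array([1.0, 100.0])
alpha = 2.0 / (lams[0] + lams[1])
x = np.array([1.0, 1.0])          # minimizer is x* = 0
e0 = np.linalg.norm(x)
for _ in range(50):
    x = x - alpha * lams * x      # gradient of f is A x = lams * x
ratio = np.linalg.norm(x) / e0
print(ratio)                      # ~(99/101)**50 ~ 0.37 after 50 steps
```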
Connections
Where Your Intuition Breaks
Condition numbers describe the worst-case amplification of input error to output error — but an ill-conditioned problem solved by a backward-stable algorithm can still produce accurate results if the errors happen not to excite the worst case. Conversely, a well-conditioned problem solved by an unstable algorithm (e.g., Gram-Schmidt orthogonalization without re-orthogonalization) produces inaccurate results. The condition number bounds the error for any algorithm; backward stability guarantees the error is proportional to the condition number for a specific algorithm. Both properties are necessary for a guarantee; neither alone is sufficient.
Condition number is a property of the problem, not the algorithm. An ill-conditioned problem cannot be solved accurately by any algorithm, no matter how clever. If $\kappa = 10^{14}$ and you use float64, 14 of your $\sim$16 digits of accuracy are entirely consumed by the problem's inherent sensitivity — almost no floating-point precision remains for the answer. Preconditioning does not reduce $\kappa(A)$; it replaces the problem $Ax = b$ with $M^{-1}Ax = M^{-1}b$, where $\kappa(M^{-1}A)$ is smaller. This is a change of problem, not algorithm.
The log-sum-exp trick is everywhere in ML because probabilities are tiny. Computing log-likelihoods involves $\log \sum_i p_i$ where each $p_i$ is exponentially small (e.g., $p_i \sim e^{-1000}$). Direct summation underflows to zero; taking the log gives $-\infty$. The log-sum-exp factored form keeps values in range throughout. The same trick appears in: attention (softmax of logits), CTC loss, the HMM forward algorithm, normalizing flows, and any computation involving probabilities of sequences. It is not optional — without it, training fails silently with NaN losses.
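The underflow side of the trick, in a short sketch (log-probabilities around $-1000$ are an illustrative choice):

```python
import numpy as np

# Log-probabilities around -1000: the probabilities exp(logp) underflow
# to exactly 0.0 in float64, so the naive log of their sum is -inf.
logp = np.array([-1000.0, -1001.0, -1002.0])

with np.errstate(divide="ignore"):
    naive = np.log(np.sum(np.exp(logp)))        # log(0.0) = -inf

m = logp.max()
stable = m + np.log(np.sum(np.exp(logp - m)))   # log-sum-exp form

print(naive)    # -inf
print(stable)   # ~-999.59: the correct log-likelihood
```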
bfloat16 and float16 have very different numerical properties. float16 has $\epsilon \approx 10^{-3}$ and max value $65504$ — it overflows for activations or weights that exceed $\sim 6.5 \times 10^4$. bfloat16 has $\epsilon \approx 8 \times 10^{-3}$ (worse precision than float16) but max value $\approx 3.4 \times 10^{38}$ (same as float32) — it rarely overflows. Modern LLM training uses bfloat16 for activations (avoiding overflow) with float32 master weights for gradient accumulation (maintaining precision). Mixing float16 and bfloat16 in a pipeline silently produces incorrect results due to different overflow behaviors.
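The float16 side is directly checkable in NumPy (bfloat16 is not a native NumPy dtype; float32, which shares its 8-bit exponent, stands in for the range comparison):

```python
import numpy as np

# float16 overflows just above its max of 65504...
print(np.float16(65504))    # largest finite float16
print(np.float16(70000))    # inf
# ...and carries only ~3 decimal digits: 1 + 0.0004 rounds back to 1.0
# because 0.0004 is below half of float16's eps (2**-10 ~ 0.001).
print(np.float16(1.0) + np.float16(0.0004))   # 1.0
# float32 (and bfloat16) have an 8-bit exponent, so the same value fits:
print(np.float32(70000))    # 70000.0
```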