
Floating Point Arithmetic, Numerical Stability & Condition Numbers

Every computation in neural network training passes through floating-point arithmetic. Understanding how rounding errors accumulate, how subtraction causes catastrophic cancellation, and how condition numbers govern the amplification of input errors into output errors is essential for diagnosing numerical failures and designing stable algorithms.

Concepts

Every floating-point number is a rounded approximation: most real numbers cannot be represented exactly in binary, so the computer rounds to the nearest representable value. For a single operation the error is tiny, at most \varepsilon_{\text{mach}} \approx 10^{-7} for float32. The danger is compounding: hundreds of operations, each introducing a small relative error, can produce results with no accurate digits. Understanding which operations amplify errors and which cancel them is what separates stable algorithms from silently wrong ones.

IEEE 754 Floating-Point Representation

IEEE 754 represents a floating-point number as (-1)^s \cdot m \cdot 2^e, where s \in \{0,1\} is the sign, m is the significand (mantissa), and e is the exponent. For float32 (single precision): 1 sign bit, 8 exponent bits, 23 mantissa bits (plus 1 implicit leading bit). For float64 (double precision): 1 sign bit, 11 exponent bits, 52 mantissa bits.

Machine epsilon \varepsilon_{\text{mach}} is the gap between 1 and the next representable number; adding anything much smaller than \varepsilon_{\text{mach}} to 1 is rounded away entirely:

  • float32: \varepsilon_{\text{mach}} \approx 1.19 \times 10^{-7} (about 7 significant decimal digits)
  • float64: \varepsilon_{\text{mach}} \approx 2.22 \times 10^{-16} (about 15–16 significant decimal digits)
  • float16: \varepsilon_{\text{mach}} \approx 9.77 \times 10^{-4} (about 3 significant decimal digits)
  • bfloat16: \varepsilon_{\text{mach}} \approx 7.81 \times 10^{-3} (8 exponent bits like float32, but only 7 mantissa bits)

The fundamental axiom of floating-point arithmetic (rounding model): for any elementary operation \circ \in \{+, -, \times, /\},

\text{fl}(x \circ y) = (x \circ y)(1 + \delta), \quad |\delta| \leq \varepsilon_{\text{mach}}.

Each operation introduces a relative error of at most \varepsilon_{\text{mach}}.

This rounding model is the foundation of backward error analysis: instead of asking "how wrong is my answer?", ask "for which perturbed input would my answer be exact?" An algorithm is backward stable if the perturbation needed to make the computed result exact is comparable in size to \varepsilon_{\text{mach}}. This reframing is powerful: it separates the error introduced by the algorithm (backward error) from the amplification by the problem itself (condition number).

Catastrophic Cancellation

When two nearly equal numbers are subtracted, the leading significant digits cancel and few accurate digits remain. The subtraction itself satisfies

\text{fl}(x - y) = (x - y)(1 + \delta), \quad |\delta| \leq \varepsilon_{\text{mach}},

so the operation is nearly exact. The danger is that rounding errors already present in x and y, of size about \varepsilon_{\text{mach}}|x|, are unchanged in absolute terms but become enormous relative to the small difference x - y.

Example: compute x = 1.000001 and y = 1.000000 in float32. The true difference is 10^{-6}, but float32 resolves only about 10^{-7} near 1: x rounds to 1 + 8 \cdot 2^{-23} \approx 1.00000095, so the computed difference is \approx 9.54 \times 10^{-7}, an error of almost 5% from a single subtraction.
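This example runs as-is in NumPy (a minimal sketch):

```python
import numpy as np

x = np.float32(1.000001)   # stored as 1 + 8 * 2**-23 ≈ 1.00000095
y = np.float32(1.0)        # exactly representable
diff = float(x - y)        # the subtraction exposes x's storage error

true_diff = 1e-6
rel_err = abs(diff - true_diff) / true_diff
print(diff, rel_err)       # relative error of a few percent
```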

Remedy: algebraic reformulation avoids subtraction of nearly-equal values.

| Unstable form | Stable form |
| --- | --- |
| \sqrt{x+1} - 1 (near x = 0) | x / (\sqrt{x+1} + 1) |
| \log(1+x) (near x = 0) | Use log1p(x) |
| e^x - 1 (near x = 0) | Use expm1(x) |
| a^2 - b^2 (when a \approx b) | (a-b)(a+b) |
| Softmax e^{x_i} / \sum_j e^{x_j} (large logits) | Subtract \max_j x_j first (log-sum-exp trick) |
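The log1p/expm1 rows can be checked numerically; a sketch for e^x - 1 near zero:

```python
import numpy as np

x = 1e-12
naive = np.exp(x) - 1.0    # exp(x) rounds near 1, then the subtraction cancels
stable = np.expm1(x)       # computes e^x - 1 without ever forming e^x

# For tiny x, e^x - 1 = x + x^2/2 + ... ≈ x
print(naive)   # only the first few digits are right
print(stable)  # accurate to full precision
```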

Error Analysis and Forward/Backward Stability

Forward error: |\hat f(x) - f(x)|, the absolute difference between computed and true output.

Backward stability: an algorithm \hat f is backward stable if \hat f(x) = f(\hat x) for some \hat x with |\hat x - x| / |x| \leq \text{a few} \cdot \varepsilon_{\text{mach}}. The algorithm gives the exact answer to a slightly perturbed input.

Rule of thumb (backward stability + conditioning): |\hat f(x) - f(x)| \lesssim \kappa(f, x) \cdot \varepsilon_{\text{mach}} \cdot |f(x)|, where \kappa is the condition number. A backward stable algorithm's forward error is at most the condition number times the backward error.

Condition Numbers

The condition number of a problem f: \mathbb{R}^n \to \mathbb{R}^m at x measures the amplification of relative input errors:

\kappa(f, x) = \lim_{\varepsilon \to 0} \sup_{\|\delta x\| \leq \varepsilon \|x\|} \frac{\|\delta f\| / \|f\|}{\|\delta x\| / \|x\|} = \frac{\|x\| \cdot \|J_f(x)\|}{\|f(x)\|}.

For a linear system Ax = b, the condition number of A is

\kappa(A) = \|A\| \cdot \|A^{-1}\| = \frac{\sigma_{\max}(A)}{\sigma_{\min}(A)},

the ratio of the largest to smallest singular value. The forward error bound is

\frac{\|\hat x - x\|}{\|x\|} \leq \kappa(A) \cdot \frac{\|\hat b - b\|}{\|b\|} + O(\kappa(A)^2 \varepsilon_{\text{mach}}).

A system with \kappa(A) = 10^k loses about k decimal digits of accuracy. If \kappa(A) \cdot \varepsilon_{\text{mach}} \geq 1, the solution is essentially meaningless.
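A classic illustration of lost digits, using the notoriously ill-conditioned Hilbert matrix (a sketch; the size n is an illustrative choice):

```python
import numpy as np

n = 10
# Hilbert matrix H[i, j] = 1 / (i + j + 1); its condition number grows
# exponentially with n, reaching roughly 1e13 at n = 10
H = np.array([[1.0 / (i + j + 1) for j in range(n)] for i in range(n)])
x_true = np.ones(n)
b = H @ x_true                  # right-hand side with known exact solution

x_hat = np.linalg.solve(H, b)   # backward stable LU solve in float64
kappa = np.linalg.cond(H)
rel_err = np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true)
print(f"kappa = {kappa:.1e}")
print(f"relative error = {rel_err:.1e}")  # most of float64's ~16 digits are gone
```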

Condition of eigenvalue problems: for a symmetric matrix, the condition number for computing eigenvalue \lambda_i is \kappa(\lambda_i) = 1 (eigenvalues are perfectly conditioned). For a non-symmetric matrix, eigenvalue condition numbers can be arbitrarily large; they blow up as the corresponding left and right eigenvectors become nearly orthogonal to each other.

Condition of solving a linear system: computing x = A^{-1}b has condition number \kappa(A) regardless of method. Forming A^{-1} explicitly and multiplying costs more and is less backward stable than solving Ax = b directly (e.g., by LU factorization with partial pivoting), so prefer solving the system to inverting the matrix.

Numerical Stability of Common Operations

Dot product: c = \sum_i a_i b_i in floating point incurs O(n\varepsilon_{\text{mach}}) forward error. Compensated (Kahan) summation reduces this to O(\varepsilon_{\text{mach}}) independent of n:

def kahan_sum(xs):
    s = 0.0     # running sum
    comp = 0.0  # compensation: low-order bits lost from s so far
    for x in xs:
        y = x - comp        # reintroduce the error from the last step
        t = s + y           # big + small: low-order bits of y are dropped here...
        comp = (t - s) - y  # ...and recovered exactly by this algebra
        s = t
    return s
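To see what compensation buys, compare plain left-to-right accumulation against Python's math.fsum, an exact-summation relative of Kahan's algorithm:

```python
import math

xs = [0.1] * 10_000_000   # 0.1 is not exactly representable in binary

naive = 0.0
for x in xs:              # plain left-to-right accumulation
    naive += x

accurate = math.fsum(xs)  # exact sum of the float values, then a single rounding

print(abs(naive - 1e6))     # the plain loop drifts measurably
print(abs(accurate - 1e6))  # fsum stays at the correctly rounded result
```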

Matrix-vector product: stable; error O(n\varepsilon_{\text{mach}}) \|A\| \|x\|.

Triangular solve (forward/back substitution): backward stable. It is Gaussian elimination that needs partial pivoting; without pivoting, elimination can be arbitrarily unstable.

Worked Examples

Example 1: Log-Sum-Exp and Numerical Softmax

Computing softmax \sigma_i = e^{x_i} / \sum_j e^{x_j} for logits x = (1000, 1001, 1002):

Naive: e^{1002} > 10^{434}, which overflows in float32 (max \approx 3.4 \times 10^{38}) to \infty; the ratio \infty/\infty then evaluates to NaN.

Stable: subtract M = \max_j x_j = 1002 first:

\sigma_i = \frac{e^{x_i - M}}{\sum_j e^{x_j - M}} = \frac{e^{x_i - 1002}}{e^{-2} + e^{-1} + e^0} = \frac{e^{x_i - 1002}}{1.503}.

Now all exponents are \leq 0, so nothing overflows. The log-sum-exp trick is the same idea: \log\sum_j e^{x_j} = M + \log\sum_j e^{x_j - M}.
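A sketch of both versions in float32 NumPy; the naive path overflows exactly as described:

```python
import numpy as np

logits = np.array([1000.0, 1001.0, 1002.0], dtype=np.float32)

with np.errstate(over="ignore", invalid="ignore"):  # silence the expected warnings
    naive = np.exp(logits) / np.exp(logits).sum()   # inf / inf -> nan

shifted = logits - logits.max()   # exponents now in {-2, -1, 0}
e = np.exp(shifted)
stable = e / e.sum()

print(naive)    # [nan nan nan]
print(stable)   # ≈ [0.090, 0.245, 0.665], sums to 1
```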

This exact pattern appears in every deep learning framework's cross-entropy implementation.

Example 2: Condition Number in Linear Regression

The normal equations A^T A \hat\beta = A^T y have condition number \kappa(A^T A) = \kappa(A)^2. Forming the normal equations explicitly squares the condition number, which is potentially catastrophic for ill-conditioned A.

Remedy: solve via QR factorization A = QR, then R\hat\beta = Q^T y. The QR solve has condition number \kappa(A), not \kappa(A)^2.

For example, if A has \kappa(A) = 10^7 and we use float32 (\varepsilon_{\text{mach}} \approx 10^{-7}): the normal equations have \kappa = 10^{14} > 1/\varepsilon_{\text{mach}}, so the solution is numerically meaningless. QR works with \kappa = 10^7, which just barely fits in float32.
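The same gap is visible in float64: a sketch constructing a design matrix with known \kappa(A) = 10^7, so that \kappa^2 = 10^{14} consumes most of float64's 16 digits (sizes, seed, and singular values are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 100, 5

# Build A = U diag(s) V^T with singular values from 1 down to 1e-7,
# giving kappa(A) = 1e7 by construction
U, _ = np.linalg.qr(rng.standard_normal((m, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
s = np.logspace(0, -7, n)
A = (U * s) @ V.T

beta_true = np.ones(n)
y = A @ beta_true

# Normal equations: condition number kappa(A)^2 = 1e14
beta_ne = np.linalg.solve(A.T @ A, A.T @ y)

# QR route: condition number kappa(A) = 1e7
Q, R = np.linalg.qr(A)
beta_qr = np.linalg.solve(R, Q.T @ y)

print("normal equations error:", np.linalg.norm(beta_ne - beta_true))
print("QR error:", np.linalg.norm(beta_qr - beta_true))
```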

Example 3: Condition Number of the Hessian in Optimization

For minimizing f(\theta) with gradient descent, the condition number \kappa(\nabla^2 f) of the Hessian controls convergence. For a quadratic f(\theta) = \frac{1}{2}\theta^T H \theta - b^T \theta:

  • Gradient descent converges in O(\kappa(H) \log(1/\varepsilon)) steps
  • Conjugate gradient converges in O(\sqrt{\kappa(H)} \log(1/\varepsilon)) steps
  • Newton's method (with exact H): O(\log\log(1/\varepsilon)) steps

A neural network loss with \kappa(H) \approx 10^6 requires on the order of 10^6 gradient descent steps vs 10^3 conjugate gradient steps. Preconditioning reduces the effective \kappa; Adam's adaptive learning rates are an approximation to diagonal preconditioning.
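The O(\kappa) scaling for gradient descent is easy to verify on a 2-D quadratic (a sketch; the step size and tolerance are illustrative choices):

```python
import numpy as np

def gd_steps(kappa, tol=1e-6, max_iter=1_000_000):
    """Steps for gradient descent to reach ||theta|| < tol on
    f(theta) = 0.5 * theta^T H theta with H = diag(1, kappa)."""
    H = np.diag([1.0, float(kappa)])
    theta = np.ones(2)
    lr = 2.0 / (1.0 + kappa)   # optimal fixed step: 2 / (lambda_min + lambda_max)
    for k in range(max_iter):
        if np.linalg.norm(theta) < tol:
            return k
        theta = theta - lr * (H @ theta)
    return max_iter

# Step counts grow roughly linearly with kappa
print(gd_steps(10), gd_steps(100), gd_steps(1000))
```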

Connections

Where Your Intuition Breaks

Condition numbers describe the worst-case amplification of input error to output error, but an ill-conditioned problem solved by a backward stable algorithm can still produce accurate results if the rounding errors happen not to excite the worst-case direction. Conversely, a well-conditioned problem solved by an unstable algorithm (e.g., classical Gram-Schmidt orthogonalization without re-orthogonalization) can produce inaccurate results. The condition number is a property of the problem and limits what any algorithm can guarantee; backward stability is a property of the algorithm and guarantees forward error proportional to the condition number. A guarantee of accuracy needs both; neither alone is sufficient.

💡Intuition

Condition number is a property of the problem, not the algorithm. An ill-conditioned problem cannot be solved accurately by any algorithm, no matter how clever. If \kappa(A) = 10^{14} and you use float64 (about 16 decimal digits), roughly 14 of those digits are consumed by the problem's inherent sensitivity, leaving only a digit or two for the answer. Preconditioning does not reduce \kappa(A); it replaces A with P^{-1}A, where \kappa(P^{-1}A) is smaller. This is a change of problem, not of algorithm.

💡Intuition

The log-sum-exp trick is everywhere in ML because probabilities are tiny. Computing log-likelihoods involves \log\sum_i p_i where each p_i is exponentially small (e.g., e^{-100}). Direct summation underflows to zero; taking the log then gives -\infty. The log-sum-exp factored form keeps values in range throughout. The same trick appears in attention (softmax of logits), CTC loss, the HMM forward algorithm, normalizing flows, and any computation involving probabilities of sequences. It is not optional: without it, training fails silently with NaN losses.

⚠️Warning

bfloat16 and float16 have very different numerical properties. float16 has \varepsilon_{\text{mach}} \approx 10^{-3} and max value \approx 65504: it overflows for activations or weights that exceed \sim 10^4. bfloat16 has \varepsilon_{\text{mach}} \approx 10^{-2} (worse precision than float16) but max value \approx 3.4 \times 10^{38} (same as float32): it rarely overflows. Modern LLM training uses bfloat16 for activations (avoiding overflow) with float32 master weights for gradient accumulation (maintaining precision). Mixing float16 and bfloat16 in a pipeline silently produces incorrect results due to their different overflow behaviors.
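The float16 overflow threshold is easy to trip (a quick NumPy check; NumPy has no bfloat16 dtype, so only the float16 half is shown):

```python
import numpy as np

print(float(np.finfo(np.float16).max))        # largest finite float16: 65504.0

with np.errstate(over="ignore"):              # silence the expected warning
    big = np.float16(30000.0)
    print(big * np.float16(3.0))              # 90000 > 65504 overflows to inf

print(np.float32(30000.0) * np.float32(3.0))  # the same product is fine in float32
```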
