Floating-Point Representation¶
Real numbers form a continuous set, but computers store numbers using finite binary representations. As a result, most real numbers cannot be represented exactly in memory.
To balance range, precision, and performance, modern computers use the IEEE 754 floating-point standard. This representation underlies floating-point arithmetic in languages such as C, Python, Java, and many scientific libraries.
Understanding floating-point representation explains many surprising behaviors in numerical computing, including:
- rounding errors such as `0.1 + 0.2 != 0.3`
- loss of precision when subtracting nearly equal numbers
- overflow and underflow
- the existence of `NaN` and `Infinity`
1. Scientific Notation in Binary¶
Floating-point numbers are essentially binary scientific notation.
In decimal scientific notation, a number is written as:
[ x = \pm m \times 10^e ]
Example:
[ 314.5 = 3.145 \times 10^2 ]
Binary floating-point uses the same idea but with base 2 instead of 10:
[ x = \pm m \times 2^e ]
Example:
[ 13.25 = 1.10101_2 \times 2^3 ]
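Python's `float.hex` method prints exactly this decomposition, a hexadecimal significand and a binary exponent, which makes the example above easy to verify:

```python
# 0x1.a8p+3 means 1.10101000...₂ × 2³ (a8 hex = 10101000 binary),
# matching 13.25 = 1.10101₂ × 2³
print((13.25).hex())   # 0x1.a800000000000p+3
```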
Floating-point numbers store:
- the sign
- the exponent
- the significand (mantissa)
2. IEEE 754 Floating-Point Format¶
IEEE 754 defines the standard binary floating-point formats used by modern hardware.
The value of a floating-point number is:
[ \text{value} = (-1)^S \times 1.\text{fraction} \times 2^{(\text{exponent} - \text{bias})} ]
Where:
- S = sign bit
- exponent = encoded exponent
- fraction = significand bits
Structure of floating-point numbers¶
float32 (single precision)¶
| Field | Bits |
|---|---|
| Sign | 1 |
| Exponent | 8 |
| Fraction | 23 |
Total: 32 bits
float64 (double precision)¶
| Field | Bits |
|---|---|
| Sign | 1 |
| Exponent | 11 |
| Fraction | 52 |
Total: 64 bits
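These fields can be extracted at runtime with the standard library's `struct` module. The sketch below (the helper name is illustrative) splits a float64 into its three fields:

```python
import struct

def float64_fields(x):
    """Split a float64 into its sign, exponent, and fraction fields."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]  # raw 64-bit pattern
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF       # 11 exponent bits
    fraction = bits & ((1 << 52) - 1)     # 52 fraction bits
    return sign, exponent, fraction

# 13.25 = 1.10101₂ × 2³: exponent field is 3 + 1023 = 1026,
# fraction bits are 10101 followed by 47 zeros
print(float64_fields(13.25))
```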
Bit layout visualization¶
```mermaid
flowchart LR
    subgraph float64
        A["Sign (1 bit)"] --> B["Exponent (11 bits)"]
        B --> C["Fraction (52 bits)"]
    end
```
3. The Implicit Leading Bit¶
In normalized floating-point numbers, the leading bit of the significand is always 1.
For example:
1.010101₂
Because this leading 1 is guaranteed, IEEE 754 does not store it explicitly.
This provides one additional bit of precision.
So:
| Format | Stored bits | Effective precision |
|---|---|---|
| float32 | 23 | 24 |
| float64 | 52 | 53 |
This is called the hidden bit or implicit leading 1.
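The 53-bit effective precision of float64 is easy to observe: adding 1 still changes 2⁵², but no longer changes 2⁵³, because the spacing between consecutive floats doubles there.

```python
# Floats in [2**52, 2**53) are spaced exactly 1 apart; at 2**53 the gap becomes 2.
print(2.0**52 + 1 == 2.0**52)   # False: one spare bit remains
print(2.0**53 + 1 == 2.0**53)   # True: the 53-bit significand is exhausted
```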
4. Exponent Bias¶
The exponent field is stored using a bias so that both positive and negative exponents can be represented using unsigned integers.
| Format | Exponent bits | Bias |
|---|---|---|
| float32 | 8 | 127 |
| float64 | 11 | 1023 |
The actual exponent is:
[ e = E - \text{bias} ]
Example:
Exponent field = 130
Bias = 127
Actual exponent = 3
5. Example: Interpreting a Floating-Point Number¶
Consider a simplified example:
Sign = 0
Exponent = 130
Fraction = 010000...
Step 1: sign
(-1)^0 = 1
Step 2: exponent
130 - 127 = 3
Step 3: significand
1.01₂
Step 4: value
[ 1.01_2 \times 2^3 ]
= 1.25 × 8
= 10
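The decoding can be double-checked by assembling the same bit pattern with Python's `struct` module (field values taken from the steps above, using the float32 layout of 1 + 8 + 23 bits):

```python
import struct

sign, exponent = 0, 130
fraction = 0b01 << 21          # fraction bits 0100...0 (23 bits total)
bits = (sign << 31) | (exponent << 23) | fraction
value = struct.unpack(">f", bits.to_bytes(4, "big"))[0]
print(value)   # 10.0
```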
Visualization¶
```mermaid
flowchart TD
    A["Sign bit"] --> E["Sign multiplier"]
    B["Exponent field"] --> C["Exponent - bias"]
    C --> D["Power of 2"]
    D --> F["Scale significand"]
    G["1.fraction"] --> F
    E --> H["Final value"]
    F --> H
```
6. Why Many Decimal Numbers Cannot Be Represented Exactly¶
Some decimal fractions have infinite binary expansions.
Example:
0.1 (decimal)
In binary:
0.000110011001100110011...
The pattern repeats forever.
Because floating-point numbers store only finite bits, the number must be rounded.
This approximation explains many floating-point surprises.
Famous example¶
```python
print(0.1 + 0.2)
```
Output:
0.30000000000000004
This occurs because both 0.1 and 0.2 are stored as approximations.
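The exact binary value that 0.1 is rounded to can be printed with the standard library's `decimal` module:

```python
from decimal import Decimal

# Decimal(float) converts the *stored* binary value exactly, digit for digit
print(Decimal(0.1))
# 0.1000000000000000055511151231257827021181583404541015625
```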
7. Machine Epsilon¶
Machine epsilon is the gap between 1.0 and the next larger representable floating-point number (one unit in the last place at 1.0).
For float64:
[ \epsilon \approx 2^{-52} ]
Numerically:
2.22 × 10⁻¹⁶
For float32:
1.19 × 10⁻⁷
Visualization of spacing¶
```mermaid
flowchart LR
    A["1.0"] --> B["1 + ε"]
    B --> C["1 + 2ε"]
```
Machine epsilon determines the relative precision of floating-point arithmetic.
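Both facts can be checked at runtime; for float64, `sys.float_info.epsilon` and `math.ulp` (Python 3.9+) agree:

```python
import math
import sys

eps = sys.float_info.epsilon       # 2**-52 ≈ 2.220446049250313e-16
print(eps)
print(math.ulp(1.0) == eps)        # True: the gap between 1.0 and the next float
print(1.0 + eps == 1.0)            # False: eps is still resolvable at 1.0
```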
8. Non-Uniform Precision¶
Floating-point numbers are not evenly spaced.
The gap between representable numbers increases with magnitude.
Near value (v):
[ \text{spacing} \approx v \times \epsilon ]
Examples:
| Value | Approx gap |
|---|---|
| 1 | 2e-16 |
| 1e6 | 2e-10 |
| 1e16 | ≈ 2 |
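`math.ulp` (Python 3.9+) reports this spacing directly for any magnitude:

```python
import math

# math.ulp(v) is the gap between v and the next representable float;
# it grows roughly like v * 2**-52
for v in (1.0, 1e6, 1e16):
    print(v, math.ulp(v))
```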
Example: absorption¶
```python
print(1e16 + 1 == 1e16)
```
Output:
True
The 1 is smaller than the spacing between numbers near 1e16, so it disappears.
Visualization¶
```mermaid
flowchart LR
    A["1e16"] --> B["+1"]
    B --> C["rounded result"]
    C --> D["1e16"]
```
9. Catastrophic Cancellation¶
When subtracting two nearly equal numbers, most significant digits cancel out.
Example:
a = 1.0000001
b = 1.0000000
a - b = 0.0000001
If the inputs contain rounding error, the result may lose nearly all meaningful digits.
This phenomenon is called catastrophic cancellation.
Example¶
```python
a = 1.0000001
b = 1.0000000
print(a - b)   # close to 1e-07, but not exact: rounding error in the inputs leaks through
```
Small rounding errors may dominate the result.
Often the solution is algebraic reformulation.
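A classic illustration is computing √(x+1) − √x for large x: multiplying by the conjugate turns the subtraction into an addition, which does not cancel. A sketch with the standard library:

```python
import math

x = 1e12
# naive: subtracts two nearly equal ~1e6 values, losing most significant digits
naive = math.sqrt(x + 1) - math.sqrt(x)
# stable: algebraically identical, but no subtraction of close values occurs
stable = 1.0 / (math.sqrt(x + 1) + math.sqrt(x))
print(naive, stable)
```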
10. Special Floating-Point Values¶
IEEE 754 defines special bit patterns for exceptional conditions.
Infinity¶
Occurs when numbers exceed representable range.
+∞
-∞
Example:
```python
import numpy as np

print(np.inf * 2)   # inf
```
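Overflow in ordinary arithmetic produces the same value; a standard-library check (no NumPy needed):

```python
import math
import sys

print(sys.float_info.max)       # largest finite float64, ≈ 1.7976931348623157e+308
print(sys.float_info.max * 2)   # inf: the product overflows the representable range
print(math.inf == float("inf")) # True
```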
NaN (Not a Number)¶
Represents undefined operations.
Examples:
0 / 0
∞ - ∞
sqrt(-1)
Example:
```python
import numpy as np

print(np.inf - np.inf)
```
Output:
nan
NaN comparison behavior¶
NaN is not equal to anything, including itself.
```python
import numpy as np

print(np.nan == np.nan)
```
Output:
False
Correct check:
`np.isnan(x)`
11. Memory and Precision Tradeoff¶
Different floating-point types balance precision and memory usage.
| Type | Bytes | Precision |
|---|---|---|
| float32 | 4 | ~7 decimal digits |
| float64 | 8 | ~16 decimal digits |
Example:
```python
import numpy as np

n = 10_000_000
print(np.zeros(n, dtype=np.float64).nbytes)   # 80000000 bytes
print(np.zeros(n, dtype=np.float32).nbytes)   # 40000000 bytes
```
Visualization¶
```mermaid
flowchart LR
    A["float32"] --> B["4 bytes"]
    C["float64"] --> D["8 bytes"]
```
float32 uses half the memory of float64 but stores far fewer significand bits (24 vs. 53), roughly 7 versus 16 decimal digits of precision.
12. Practical Guidelines¶
When working with floating-point numbers:
Avoid direct equality comparisons¶
Use tolerances instead.
`np.isclose(a, b)`
Prefer numerically stable formulas¶
Example:
Instead of
[ \sqrt{x^2 + y^2} ]
use
`np.hypot(x, y)`
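The benefit is easy to demonstrate with the standard library's `math.hypot`, which behaves like `np.hypot`: squaring a large input overflows in the naive formula, but hypot rescales internally.

```python
import math

x = y = 1e200
# naive formula: x*x is 1e400, which overflows to inf
print(math.sqrt(x * x + y * y))   # inf
# hypot avoids forming the squares and returns the true magnitude
print(math.hypot(x, y))           # ≈ 1.4142135623730951e+200
```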
Be cautious with subtraction¶
Subtracting nearly equal numbers can destroy precision.
Choose data types carefully¶
- float32: memory efficient
- float64: higher precision
13. Worked Examples¶
Example 1¶
Explain why:
0.1 + 0.2 != 0.3
Both operands are approximated in binary, so rounding errors accumulate.
Example 2¶
Determine the approximate precision of float32.
≈ 2⁻²³ ≈ 1.19e-7
Example 3¶
Show absorption:
1e16 + 1 = 1e16
The difference is below floating-point resolution.
14. Exercises¶
- What are the three fields of IEEE 754 floating-point numbers?
- What is the bias for float64?
- Why can 0.1 not be represented exactly in binary?
- What is machine epsilon?
- Why is `np.nan == np.nan` false?
- What happens when a floating-point value exceeds the maximum representable number?
- Why does adding a small number to a very large number sometimes have no effect?
15. Short Answers¶
- Sign, exponent, significand
- 1023
- Its binary expansion is infinite
- The gap between 1.0 and the next representable floating-point number
- NaN is defined as unequal to everything
- It becomes infinity
- Because spacing between floats grows with magnitude
16. Summary¶
- Floating-point numbers represent real values using binary scientific notation.
- IEEE 754 stores numbers using sign, exponent, and significand fields.
- The implicit leading 1 increases precision.
- Many decimal fractions cannot be represented exactly in binary.
- Machine epsilon defines floating-point precision.
- Floating-point spacing grows with magnitude.
- Catastrophic cancellation can destroy precision when subtracting nearly equal numbers.
- IEEE 754 includes special values such as NaN and Infinity.
Understanding floating-point representation is essential for writing correct numerical software and scientific code.