Floating-Point Representation¶
Real numbers form a continuous set, but computers store numbers using finite binary representations. As a result, most real numbers cannot be represented exactly in memory.
To balance range, precision, and performance, modern computers use the IEEE 754 floating-point standard. This representation underlies floating-point arithmetic in languages such as C, Python, Java, and many scientific libraries.
Understanding floating-point representation explains many surprising behaviors in numerical computing, including:
- rounding errors such as `0.1 + 0.2 != 0.3`
- loss of precision when subtracting nearly equal numbers
- overflow and underflow
- the existence of `NaN` and `Infinity`
1. Scientific Notation in Binary¶
Floating-point numbers are essentially binary scientific notation.
In decimal scientific notation, a number is written as:
[ x = \pm m \times 10^e ]
Example:
[ 314.5 = 3.145 \times 10^2 ]
Binary floating-point uses the same idea but with base 2 instead of 10:
[ x = \pm m \times 2^e ]
Example:
[ 13.25 = 1.10101_2 \times 2^3 ]
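Python's `float.hex` method prints exactly this decomposition, a hexadecimal significand and a binary exponent, which makes the example above easy to verify:

```python
# 0x1.a8p+3 means 1.10101000...₂ × 2³ (a8 hex = 10101000 binary),
# matching 13.25 = 1.10101₂ × 2³
print((13.25).hex())   # 0x1.a800000000000p+3
```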
Floating-point numbers store:
- the sign
- the exponent
- the significand (mantissa)
2. IEEE 754 Floating-Point Format¶
IEEE 754 defines the standard binary floating-point formats used by modern hardware.
The value of a floating-point number is:
[ \text{value} = (-1)^S \times 1.\text{fraction} \times 2^{(\text{exponent} - \text{bias})} ]
Where:
- S = sign bit
- exponent = encoded exponent
- fraction = significand bits
Structure of floating-point numbers¶
float32 (single precision)¶
| Field | Bits |
|---|---|
| Sign | 1 |
| Exponent | 8 |
| Fraction | 23 |
Total: 32 bits
float64 (double precision)¶
| Field | Bits |
|---|---|
| Sign | 1 |
| Exponent | 11 |
| Fraction | 52 |
Total: 64 bits
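These fields can be extracted at runtime with the standard library's `struct` module. The sketch below (the helper name is illustrative) splits a float64 into its three fields:

```python
import struct

def float64_fields(x):
    """Split a float64 into its sign, exponent, and fraction fields."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]  # raw 64-bit pattern
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF       # 11 exponent bits
    fraction = bits & ((1 << 52) - 1)     # 52 fraction bits
    return sign, exponent, fraction

# 13.25 = 1.10101₂ × 2³: exponent field is 3 + 1023 = 1026,
# fraction bits are 10101 followed by 47 zeros
print(float64_fields(13.25))
```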
Bit layout visualization¶
```mermaid
flowchart LR
    subgraph float64
        A["Sign (1 bit)"] --> B["Exponent (11 bits)"]
        B --> C["Fraction (52 bits)"]
    end
```
3. The Implicit Leading Bit¶
In normalized floating-point numbers, the leading bit of the significand is always 1.
For example:
1.010101₂
Because this leading 1 is guaranteed, IEEE 754 does not store it explicitly.
This provides one additional bit of precision.
So:
| Format | Stored bits | Effective precision |
|---|---|---|
| float32 | 23 | 24 |
| float64 | 52 | 53 |
This is called the hidden bit or implicit leading 1.
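The 53-bit effective precision of float64 is easy to observe: adding 1 still changes 2⁵², but no longer changes 2⁵³, because the spacing between consecutive floats doubles there.

```python
# Floats in [2**52, 2**53) are spaced exactly 1 apart; at 2**53 the gap becomes 2.
print(2.0**52 + 1 == 2.0**52)   # False: one spare bit remains
print(2.0**53 + 1 == 2.0**53)   # True: the 53-bit significand is exhausted
```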
4. Exponent Bias¶
The exponent field is stored using a bias so that both positive and negative exponents can be represented using unsigned integers.
| Format | Exponent bits | Bias |
|---|---|---|
| float32 | 8 | 127 |
| float64 | 11 | 1023 |
The actual exponent is:
[ e = E - \text{bias} ]
Example:
Exponent field = 130
Bias = 127
Actual exponent = 3
5. Example: Interpreting a Floating-Point Number¶
Consider a simplified example:
Sign = 0
Exponent = 130
Fraction = 010000...
Step 1: sign
(-1)^0 = 1
Step 2: exponent
130 - 127 = 3
Step 3: significand
1.01₂
Step 4: value
[ 1.01_2 \times 2^3 ]
= 1.25 × 8
= 10
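The decoding can be double-checked by assembling the same bit pattern with Python's `struct` module (field values taken from the steps above, using the float32 layout of 1 + 8 + 23 bits):

```python
import struct

sign, exponent = 0, 130
fraction = 0b01 << 21          # fraction bits 0100...0 (23 bits total)
bits = (sign << 31) | (exponent << 23) | fraction
value = struct.unpack(">f", bits.to_bytes(4, "big"))[0]
print(value)   # 10.0
```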
Visualization¶
```mermaid
flowchart TD
    A["Sign bit"] --> E["Sign multiplier"]
    B["Exponent field"] --> C["Exponent - bias"]
    C --> D["Power of 2"]
    D --> F["Scale significand"]
    G["1.fraction"] --> F
    E --> H["Final value"]
    F --> H
```
6. Why Many Decimal Numbers Cannot Be Represented Exactly¶
Some decimal fractions have infinite binary expansions.
Example:
0.1 (decimal)
In binary:
0.000110011001100110011...
The pattern repeats forever.
Because floating-point numbers store only finite bits, the number must be rounded.
This approximation explains many floating-point surprises.
Famous example¶
```python
print(0.1 + 0.2)
```
Output:
0.30000000000000004
This occurs because both 0.1 and 0.2 are stored as approximations.
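The exact binary value that 0.1 is rounded to can be printed with the standard library's `decimal` module:

```python
from decimal import Decimal

# Decimal(float) converts the *stored* binary value exactly, digit for digit
print(Decimal(0.1))
# 0.1000000000000000055511151231257827021181583404541015625
```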
7. Machine Epsilon¶
Machine epsilon is the gap between 1.0 and the next larger representable floating-point number (one unit in the last place at 1.0).
For float64:
[ \epsilon \approx 2^{-52} ]
Numerically:
2.22 × 10⁻¹⁶
For float32:
1.19 × 10⁻⁷
Visualization of spacing¶
```mermaid
flowchart LR
    A["1.0"] --> B["1 + ε"]
    B --> C["1 + 2ε"]
```
Machine epsilon determines the relative precision of floating-point arithmetic.
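Both facts can be checked at runtime; for float64, `sys.float_info.epsilon` and `math.ulp` (Python 3.9+) agree:

```python
import math
import sys

eps = sys.float_info.epsilon       # 2**-52 ≈ 2.220446049250313e-16
print(eps)
print(math.ulp(1.0) == eps)        # True: the gap between 1.0 and the next float
print(1.0 + eps == 1.0)            # False: eps is still resolvable at 1.0
```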
8. Non-Uniform Precision¶
Floating-point numbers are not evenly spaced.
The gap between representable numbers increases with magnitude.
Near value (v):
[ \text{spacing} \approx v \times \epsilon ]
Examples:
| Value | Approx gap |
|---|---|
| 1 | 2e-16 |
| 1e6 | 2e-10 |
| 1e16 | ≈ 2 |
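`math.ulp` (Python 3.9+) reports this spacing directly for any magnitude:

```python
import math

# math.ulp(v) is the gap between v and the next representable float;
# it grows roughly like v * 2**-52
for v in (1.0, 1e6, 1e16):
    print(v, math.ulp(v))
```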
Example: absorption¶
```python
print(1e16 + 1 == 1e16)
```
Output:
True
The 1 is smaller than the spacing between numbers near 1e16, so it disappears.
Visualization¶
```mermaid
flowchart LR
    A["1e16"] --> B["+1"]
    B --> C["rounded result"]
    C --> D["1e16"]
```
9. Catastrophic Cancellation¶
When subtracting two nearly equal numbers, most significant digits cancel out.
Example:
a = 1.0000001
b = 1.0000000
a - b = 0.0000001
If the inputs contain rounding error, the result may lose nearly all meaningful digits.
This phenomenon is called catastrophic cancellation.
Example¶
```python
a = 1.0000001
b = 1.0000000
print(a - b)   # close to 1e-07, but not exact: rounding error in the inputs leaks through
```
Small rounding errors may dominate the result.
Often the solution is algebraic reformulation.
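A classic illustration is computing √(x+1) − √x for large x: multiplying by the conjugate turns the subtraction into an addition, which does not cancel. A sketch with the standard library:

```python
import math

x = 1e12
# naive: subtracts two nearly equal ~1e6 values, losing most significant digits
naive = math.sqrt(x + 1) - math.sqrt(x)
# stable: algebraically identical, but no subtraction of close values occurs
stable = 1.0 / (math.sqrt(x + 1) + math.sqrt(x))
print(naive, stable)
```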
10. Special Floating-Point Values¶
IEEE 754 defines special bit patterns for exceptional conditions.
Infinity¶
Occurs when numbers exceed representable range.
+∞
-∞
Example:
```python
import numpy as np

print(np.inf * 2)   # inf
```
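Overflow in ordinary arithmetic produces the same value; a standard-library check (no NumPy needed):

```python
import math
import sys

print(sys.float_info.max)       # largest finite float64, ≈ 1.7976931348623157e+308
print(sys.float_info.max * 2)   # inf: the product overflows the representable range
print(math.inf == float("inf")) # True
```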
NaN (Not a Number)¶
Represents undefined operations.
Examples:
0 / 0
∞ - ∞
sqrt(-1)
Example:
```python
import numpy as np

print(np.inf - np.inf)
```
Output:
nan
NaN comparison behavior¶
NaN is not equal to anything, including itself.
```python
import numpy as np

print(np.nan == np.nan)
```
Output:
False
Correct check:
`np.isnan(x)`
11. Memory and Precision Tradeoff¶
Different floating-point types balance precision and memory usage.
| Type | Bytes | Precision |
|---|---|---|
| float32 | 4 | ~7 decimal digits |
| float64 | 8 | ~16 decimal digits |
Example:
```python
import numpy as np

n = 10_000_000
print(np.zeros(n, dtype=np.float64).nbytes)   # 80000000 bytes
print(np.zeros(n, dtype=np.float32).nbytes)   # 40000000 bytes
```
Visualization¶
```mermaid
flowchart LR
    A["float32"] --> B["4 bytes"]
    C["float64"] --> D["8 bytes"]
```
float32 uses half the memory of float64 but stores far fewer significand bits (24 vs. 53), roughly 7 versus 16 decimal digits of precision.
12. Practical Guidelines¶
When working with floating-point numbers:
Avoid direct equality comparisons¶
Use tolerances instead.
`np.isclose(a, b)`
Prefer numerically stable formulas¶
Example:
Instead of
[ \sqrt{x^2 + y^2} ]
use
`np.hypot(x, y)`
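The benefit is easy to demonstrate with the standard library's `math.hypot`, which behaves like `np.hypot`: squaring a large input overflows in the naive formula, but hypot rescales internally.

```python
import math

x = y = 1e200
# naive formula: x*x is 1e400, which overflows to inf
print(math.sqrt(x * x + y * y))   # inf
# hypot avoids forming the squares and returns the true magnitude
print(math.hypot(x, y))           # ≈ 1.4142135623730951e+200
```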
Be cautious with subtraction¶
Subtracting nearly equal numbers can destroy precision.
Choose data types carefully¶
- float32: memory efficient
- float64: higher precision
13. Worked Examples¶
Example 1¶
Explain why:
0.1 + 0.2 != 0.3
Both operands are approximated in binary, so rounding errors accumulate.
Example 2¶
Determine the approximate precision of float32.
≈ 2⁻²³ ≈ 1.19e-7
Example 3¶
Show absorption:
1e16 + 1 = 1e16
The difference is below floating-point resolution.
14. Exercises¶
- What are the three fields of IEEE 754 floating-point numbers?
- What is the bias for float64?
- Why can 0.1 not be represented exactly in binary?
- What is machine epsilon?
- Why is `np.nan == np.nan` false?
- What happens when a floating-point value exceeds the maximum representable number?
- Why does adding a small number to a very large number sometimes have no effect?
15. Short Answers¶
- Sign, exponent, significand
- 1023
- Its binary expansion is infinite
- The gap between 1.0 and the next representable floating-point number
- NaN is defined as unequal to everything
- It becomes infinity
- Because spacing between floats grows with magnitude
16. Summary¶
- Floating-point numbers represent real values using binary scientific notation.
- IEEE 754 stores numbers using sign, exponent, and significand fields.
- The implicit leading 1 increases precision.
- Many decimal fractions cannot be represented exactly in binary.
- Machine epsilon defines floating-point precision.
- Floating-point spacing grows with magnitude.
- Catastrophic cancellation can destroy precision when subtracting nearly equal numbers.
- IEEE 754 includes special values such as NaN and Infinity.
Understanding floating-point representation is essential for writing correct numerical software and scientific code.