Skip to content

IEEE 754 Standard

The IEEE 754 floating-point standard defines how computers represent real numbers in binary. Understanding this representation explains why certain decimal values cannot be stored exactly and why floating-point arithmetic behaves unexpectedly.

Mental Model

A 64-bit float packs three fields into 8 bytes: a sign bit, an 11-bit exponent, and a 52-bit mantissa. This is binary scientific notation -- the mantissa gives precision and the exponent gives range. Because the mantissa has only 52 bits, most decimal fractions (like 0.1) cannot be represented exactly, which is the root cause of floating-point surprises.

Binary Scientific Notation

Floating-point numbers use binary scientific notation, analogous to decimal scientific notation.

1. Decimal vs Binary

In decimal, we write \(1234.5 = 1.2345 \times 10^3\). Binary follows the same pattern.

```python

Decimal scientific notation

print(f"{1234.5:.4e}") # 1.2345e+03

Binary representation of 5.75

5.75 = 4 + 1 + 0.5 + 0.25 = 2^2 + 2^0 + 2^-1 + 2^-2

Binary: 101.11 = 1.0111 × 2^2

print(5.75) # 5.75

Verify

print(4 + 1 + 0.5 + 0.25) # 5.75 ```

2. Floating-Point Formula

IEEE 754 encodes numbers as:

\[x = (-1)^s \times (1 + m) \times 2^{e-\text{bias}}\]

Where:

  • \(s\) is the sign bit (0 = positive, 1 = negative)
  • \(m\) is the mantissa (fractional part after the implicit 1)
  • \(e\) is the biased exponent
  • bias = 1023 for double precision, 127 for single precision

```python import struct

def float_to_binary(x): """Show IEEE 754 binary representation.""" # Pack as double, unpack as 8 bytes packed = struct.pack('>d', x) # Convert to binary string bits = ''.join(f'{byte:08b}' for byte in packed)

sign = bits[0]
exponent = bits[1:12]
mantissa = bits[12:]

return sign, exponent, mantissa

Example: 5.75

s, e, m = float_to_binary(5.75) print(f"Sign: {s}") print(f"Exponent: {e} = {int(e, 2)} (biased)") print(f"Mantissa: {m[:20]}...")

Decode

bias = 1023 exp_val = int(e, 2) - bias print(f"\nActual exponent: {int(e, 2)} - {bias} = {exp_val}") ```

Double Precision Format

Python's float uses IEEE 754 double precision (64 bits).

1. Bit Layout

The 64-bit double precision format allocates bits as follows:

Component Bits Range
Sign 1 bit 0 or 1
Exponent 11 bits 0–2047 (biased)
Mantissa 52 bits Fraction after implicit 1

```python import sys

Python float info

print(f"Max float: {sys.float_info.max:.6e}") print(f"Min positive: {sys.float_info.min:.6e}") print(f"Mantissa digits: {sys.float_info.mant_dig}") print(f"Max exponent: {sys.float_info.max_exp}") print(f"Min exponent: {sys.float_info.min_exp}") ```

2. Precision Limits

52 mantissa bits provide approximately 15–17 significant decimal digits.

```python

52 bits of precision

2^52 ≈ 4.5 × 10^15

Numbers up to 2^52 are exact integers

print(252) # 4503599627370496 (exact) print(252 + 1) # 4503599627370497 (exact) print(float(2**52)) # 4503599627370496.0 (exact)

Beyond 2^53, not all integers are representable

print(253) # 9007199254740992 print(253 + 1) # 9007199254740993 print(float(2**53 + 1))# 9007199254740992.0 (lost precision!) ```

3. Exponent Range

The 11-bit exponent allows magnitudes from \(10^{-308}\) to \(10^{308}\).

```python import sys

Exponent limits

print(f"Largest: {sys.float_info.max:.6e}") # ~1.8e+308 print(f"Smallest: {sys.float_info.min:.6e}") # ~2.2e-308

Overflow to infinity

print(1e308 * 10) # inf

Underflow to zero (gradual)

print(1e-323) # 1e-323 (subnormal) print(1e-324) # 0.0 (underflow) ```

Single Precision

NumPy provides single precision (32-bit) floats.

1. Bit Layout

Single precision uses fewer bits:

Component Bits Range
Sign 1 bit 0 or 1
Exponent 8 bits 0–255 (biased)
Mantissa 23 bits Fraction after implicit 1

```python import numpy as np

Single precision info

info = np.finfo(np.float32) print(f"Max: {info.max:.6e}") print(f"Min: {info.min:.6e}") print(f"Precision: {info.precision} decimal digits") print(f"Epsilon: {info.eps:.6e}") ```

2. Double vs Single

Compare precision between formats.

```python import numpy as np

Same value in different precisions

val = 1.23456789012345

d = np.float64(val) s = np.float32(val)

print(f"Original: {val}") print(f"float64: {d}") print(f"float32: {s}") # Lost digits!

Precision difference

print(f"\nfloat64 epsilon: {np.finfo(np.float64).eps:.2e}") print(f"float32 epsilon: {np.finfo(np.float32).eps:.2e}") ```

Decimal Fractions Problem

Many common decimal fractions have infinite binary representations.

1. Why 0.1 Is Inexact

The decimal 0.1 cannot be represented exactly in binary.

\[0.1_{10} = 0.0\overline{0011}_{2}\]

The binary expansion repeats infinitely.

```python

0.1 is not exactly representable

print(f"{0.1:.20f}") # 0.10000000000000000555...

Show the actual stored value

from decimal import Decimal print(Decimal(0.1))

0.1000000000000000055511151231257827021181583404541015625

Compare with true 0.1

true_01 = Decimal('0.1') stored_01 = Decimal(0.1) print(f"Error: {stored_01 - true_01:.2e}") ```

2. Exact Representations

Numbers that are sums of powers of 2 are exact.

```python

Powers of 2 are exact

print(f"{0.5:.20f}") # 0.50000000000000000000 (exact: 2^-1) print(f"{0.25:.20f}") # 0.25000000000000000000 (exact: 2^-2) print(f"{0.125:.20f}") # 0.12500000000000000000 (exact: 2^-3)

Sums of powers of 2 are exact

print(f"{0.75:.20f}") # 0.75000000000000000000 (2^-1 + 2^-2)

But 0.1, 0.2, 0.3 are not

print(f"{0.1:.20f}") # Error in last digits print(f"{0.2:.20f}") # Error in last digits print(f"{0.3:.20f}") # Error in last digits ```

3. Common Inexact Values

Decimal Binary (approximate) Exact?
0.5 0.1
0.25 0.01
0.1 0.0001100110011...
0.2 0.0011001100110...
0.3 0.0100110011001...

```python

Quick test for exact representation

def is_exact_binary(x, tol=1e-16): """Check if x can be represented exactly.""" from decimal import Decimal stored = Decimal(x) intended = Decimal(str(x)) return abs(stored - intended) < Decimal(str(tol))

print(f"0.5 exact: {is_exact_binary(0.5)}") # True print(f"0.1 exact: {is_exact_binary(0.1)}") # False print(f"0.125 exact: {is_exact_binary(0.125)}") # True ```

Rounding Modes

IEEE 754 defines rounding modes for operations.

1. Round to Nearest Even

Default mode—rounds to nearest, ties go to even.

```python

Python's round() uses banker's rounding (round half to even)

print(round(0.5)) # 0 (tie goes to even) print(round(1.5)) # 2 (tie goes to even) print(round(2.5)) # 2 (tie goes to even) print(round(3.5)) # 4 (tie goes to even)

Contrast with away-from-zero rounding

import math print(math.floor(2.5 + 0.5)) # 3 (traditional rounding) ```

2. Directed Rounding

Other rounding directions available via math module.

```python import math

x = 2.7

print(f"floor({x}): {math.floor(x)}") # 2 (toward -∞) print(f"ceil({x}): {math.ceil(x)}") # 3 (toward +∞) print(f"trunc({x}): {math.trunc(x)}") # 2 (toward 0)

Negative numbers show the difference

y = -2.7 print(f"\nfloor({y}): {math.floor(y)}") # -3 print(f"ceil({y}): {math.ceil(y)}") # -2 print(f"trunc({y}): {math.trunc(y)}") # -2 ```

Practical Implications

Understanding IEEE 754 helps write correct numerical code.

1. Integer Range in Floats

Floats can store integers exactly up to \(2^{53}\).

```python

Safe integer range for floats

MAX_SAFE_INT = 2**53

print(f"Max safe integer: {MAX_SAFE_INT:,}")

Within range: exact

a = float(MAX_SAFE_INT - 1) print(a == MAX_SAFE_INT - 1) # True

Beyond range: may lose precision

b = float(MAX_SAFE_INT + 1) print(b == MAX_SAFE_INT + 1) # False! print(f"Stored: {b:.0f}, Expected: {MAX_SAFE_INT + 1}") ```

2. Format String Precision

Avoid false precision in output.

```python x = 1/3

Too many digits suggests false precision

print(f"{x:.20f}") # 0.33333333333333331483

Match actual precision (~15-17 digits)

print(f"{x:.15f}") # 0.333333333333333 print(f"{x:.6f}") # 0.333333 (reasonable for display)

General rule: 15-16 significant digits max

import sys print(f"Reliable digits: {sys.float_info.dig}") # 15 ```

3. Hexadecimal Float Literal

Python supports exact hexadecimal float representation.

```python

Hexadecimal float literals (exact representation)

x = 0x1.999999999999ap-4 # Exact representation of ~0.1 print(x) # 0.1

Get hex representation of any float

print((0.1).hex()) # 0x1.999999999999ap-4 print((0.5).hex()) # 0x1.0000000000000p-1

Round-trip without precision loss

original = 3.141592653589793 hex_rep = original.hex() recovered = float.fromhex(hex_rep) print(original == recovered) # True ```


Exercises

Exercise 1. Use Python's struct module to extract the sign, exponent, and mantissa bits of the float −6.5. Verify your result by reconstructing the value from the extracted components.

Solution to Exercise 1

```python import struct

def float_parts(x): packed = struct.pack('>d', x) bits = ''.join(f'{b:08b}' for b in packed) sign = int(bits[0]) exponent = int(bits[1:12], 2) mantissa = bits[12:] return sign, exponent, mantissa

s, e, m = float_parts(-6.5) print(f"Sign: {s}") # 1 (negative) print(f"Exponent: {e}") # 1025 print(f"Mantissa: {m[:10]}...")

Reconstruct: (-1)^1 * (1 + mantissa_fraction) * 2^(1025-1023)

= -1 * 1.625 * 4 = -6.5

bias = 1023 mantissa_val = int(m, 2) / (252) reconstructed = ((-1)s) * (1 + mantissa_val) * (2**(e - bias)) print(f"Reconstructed: {reconstructed}") # -6.5 ```

The sign bit is 1 (negative), the biased exponent gives an actual exponent of 2, and the mantissa encodes the fraction 0.625, so the value is -(1.625) * 4 = -6.5.


Exercise 2. Demonstrate that float(2**53 + 1) == float(2**53) is True. Explain why this happens in terms of IEEE 754 mantissa bits.

Solution to Exercise 2

```python a = 253 b = 253 + 1

print(float(a) == float(b)) # True print(f"2^53 = {a}") print(f"2^53 + 1 = {b}") print(f"float(2^53) = {float(a):.1f}") print(f"float(2^53 + 1) = {float(b):.1f}") ```

IEEE 754 double precision has 52 mantissa bits (plus one implicit bit), so it can represent integers exactly up to 2^53. Beyond that, consecutive floats are spaced 2 apart, and 2^53 + 1 rounds down to 2^53.


Exercise 3. Write a function is_exact_float(x) that checks whether a decimal string like "0.375" can be represented exactly as a Python float. Test it with "0.5", "0.1", and "0.375".

Solution to Exercise 3

```python from decimal import Decimal

def is_exact_float(s): float_val = float(s) return Decimal(float_val) == Decimal(s)

print(is_exact_float("0.5")) # True print(is_exact_float("0.1")) # False print(is_exact_float("0.375")) # True ```

Values that are sums of powers of 2 (like 0.5 = 2^-1 and 0.375 = 2^-2 + 2^-3) are exact. Values like 0.1 have infinite binary expansions and cannot be stored exactly.