Why Python is Slow¶
The Performance Gap¶
Python is often 10-100x slower than C for computational tasks. Understanding why helps you write faster Python code.
Performance Comparison (sum of 10 million integers):
C: ████ 0.01s
Java: ████████ 0.02s
JavaScript: ████████████ 0.03s
Python: ████████████████████████████████████████ 0.85s
Python is ~85x slower than C for this task!
The Five Reasons Python is Slow¶
1. Dynamic Typing¶
Python checks types at runtime, not compile time:
```python
# Python doesn't know what 'x' is until runtime
def add(x, y):
    return x + y

add(1, 2)      # int + int
add("a", "b")  # str + str
add([1], [2])  # list + list

# Each call requires:
# 1. Check the type of x
# 2. Check the type of y
# 3. Find the appropriate __add__ method
# 4. Call it
```
C equivalent:
```c
// Type known at compile time - no runtime checks
int add(int x, int y) {
    return x + y;  // just one CPU instruction
}
```
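To make the contrast concrete, here is a rough Python sketch of the dispatch that happens on every `+`. It is a simplification: the real protocol (which also handles `__radd__`, `NotImplemented`, and subclasses) runs in C inside the interpreter.

```python
# A sketch of what every x + y costs Python at runtime (simplified).
def runtime_add(x, y):
    add_method = type(x).__add__  # type inspection + method lookup on every call
    return add_method(x, y)       # only now does the actual addition happen

print(runtime_add(1, 2))      # 3
print(runtime_add("a", "b"))  # ab
```

The point is that the lookup happens again on every single call; a C compiler resolves all of this once, at compile time.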
2. Everything is an Object¶
In Python, even simple integers are full objects:
```python
import sys

x = 42
print(sys.getsizeof(x))  # 28 bytes!

# A CPython int contains:
# - Reference count (8 bytes)
# - Type pointer (8 bytes)
# - Size field (8 bytes)
# - Actual value (variable length)
```
Python int object (28 bytes):
┌────────────────────────────────┐
│ Reference Count (8 bytes) │
├────────────────────────────────┤
│ Type Pointer (8 bytes) │ → points to int type
├────────────────────────────────┤
│ Object Size (8 bytes) │
├────────────────────────────────┤
│ Actual Value (4 bytes) │ ← the number 42
└────────────────────────────────┘
C int (4 bytes):
┌────────────────────────────────┐
│ Actual Value (4 bytes) │ ← the number 42
└────────────────────────────────┘
7x more memory, plus indirection overhead!
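You can measure this overhead directly. The sketch below compares a list of boxed Python ints with the standard-library `array` module, which stores raw machine values in one buffer (sizes are CPython-specific and approximate):

```python
import sys
from array import array

n = 1000
py_list = list(range(n))
c_style = array('q', range(n))  # 'q' = signed 64-bit ints, stored contiguously

# Total cost of the list: the list's reference array plus every int object.
# (CPython caches small ints below 257, so this slightly overcounts
# shared objects -- it is a rough estimate, not an exact figure.)
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(i) for i in py_list)
array_bytes = sys.getsizeof(c_style)

print(f"list of {n} ints:  ~{list_bytes:,} bytes")
print(f"array of {n} ints: ~{array_bytes:,} bytes")
```

On a typical 64-bit CPython the list side comes out several times larger, before even counting the cost of chasing each reference.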
3. Interpreted Execution¶
Python bytecode runs on a virtual machine:
```python
import dis

def simple_loop():
    total = 0
    for i in range(1000):
        total += i
    return total

dis.dis(simple_loop)
```
Output (CPython 3.10; exact bytecode varies by version):

```text
  2           0 LOAD_CONST               1 (0)
              2 STORE_FAST               0 (total)

  3           4 LOAD_GLOBAL              0 (range)
              6 LOAD_CONST               2 (1000)
              8 CALL_FUNCTION            1
             10 GET_ITER
        >>   12 FOR_ITER                 6 (to 26)
             14 STORE_FAST               1 (i)

  4          16 LOAD_FAST                0 (total)
             18 LOAD_FAST                1 (i)
             20 INPLACE_ADD
             22 STORE_FAST               0 (total)
             24 JUMP_ABSOLUTE            6 (to 12)

  5     >>   26 LOAD_FAST                0 (total)
             28 RETURN_VALUE
```
Each bytecode instruction involves:
1. Fetch the instruction
2. Decode it
3. Dispatch to its handler
4. Execute the handler
5. Repeat
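The cycle above can be sketched as a toy interpreter. The opcodes here are invented for illustration; CPython's real dispatch loop lives in `ceval.c` and handles over a hundred opcodes.

```python
# A toy interpreter illustrating the fetch/decode/dispatch cycle.
def run(code):
    stack = []
    pc = 0
    while pc < len(code):
        op, arg = code[pc]        # 1-2. fetch and decode the instruction
        if op == "PUSH":          # 3. dispatch to the matching handler
            stack.append(arg)     # 4. execute the handler
        elif op == "ADD":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        pc += 1                   # 5. repeat for the next instruction
    return stack

print(run([("PUSH", 2), ("PUSH", 3), ("ADD", None)]))  # [5]
```

Every real operation, however trivial, pays for the surrounding loop machinery; compiled code has no such loop, because dispatch was resolved at compile time.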
4. Memory Indirection¶
Python lists store references, not values:
Python List of integers:
┌─────────────────┐
│ List Object │
│ ┌───────────┐ │
│ │ ref[0] ──┼──┼──▶ [PyInt: 1] (somewhere in heap)
│ │ ref[1] ──┼──┼──▶ [PyInt: 2] (somewhere else)
│ │ ref[2] ──┼──┼──▶ [PyInt: 3] (somewhere else)
│ └───────────┘ │
└─────────────────┘
Each access:
1. Load list pointer
2. Load reference from list
3. Follow reference to object
4. Check object type
5. Extract value from object
C Array:
┌─────────────────┐
│ 1 │ 2 │ 3 │ Contiguous in memory
└─────────────────┘
Each access:
1. Calculate offset
2. Load value
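A small CPython-specific sketch makes the difference visible. In CPython, `id()` happens to be the object's memory address, so we can see that list elements are separately allocated heap objects, while an `array` keeps one contiguous buffer:

```python
from array import array

# Large values, so CPython cannot reuse its cached small-int objects:
nums = [10**12 + i for i in range(5)]
print([hex(id(x)) for x in nums])  # five separately allocated objects

arr = array('q', nums)             # one contiguous buffer of raw 64-bit values
addr, length = arr.buffer_info()   # (address of first element, element count)
print(f"array buffer at {addr:#x}: {length} elements x {arr.itemsize} bytes each")
```

The scattered addresses are exactly why a Python loop over a list tends to miss the CPU cache, while a C-style array enjoys sequential, prefetch-friendly access.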
5. Global Interpreter Lock (GIL)¶
Python can't truly parallelize CPU-bound threads:
```python
import threading
import time

counter = 0

def increment():
    global counter
    for _ in range(1_000_000):
        counter += 1

# Two threads
t1 = threading.Thread(target=increment)
t2 = threading.Thread(target=increment)

start = time.perf_counter()
t1.start()
t2.start()
t1.join()
t2.join()
elapsed = time.perf_counter() - start

print(f"Time: {elapsed:.2f}s")
print(f"Counter: {counter}")  # may be less than 2,000,000: += is not atomic

# Two threads are NOT faster than one,
# because the GIL prevents parallel bytecode execution
```
Quantifying the Overhead¶
```python
import time
import numpy as np

def benchmark_overhead():
    n = 10_000_000

    # Pure Python loop
    start = time.perf_counter()
    total = 0
    for i in range(n):
        total += i
    python_time = time.perf_counter() - start

    # Python with a list and built-in sum()
    data = list(range(n))
    start = time.perf_counter()
    total = sum(data)
    builtin_time = time.perf_counter() - start

    # NumPy (C code)
    arr = np.arange(n)
    start = time.perf_counter()
    total = np.sum(arr)
    numpy_time = time.perf_counter() - start

    print(f"Python loop:  {python_time:.3f}s (1.0x)")
    print(f"Python sum(): {builtin_time:.3f}s ({python_time/builtin_time:.1f}x)")
    print(f"NumPy sum():  {numpy_time:.3f}s ({python_time/numpy_time:.1f}x)")

benchmark_overhead()
```
Typical output:
```text
Python loop:  0.850s (1.0x)
Python sum(): 0.120s (7.1x)
NumPy sum():  0.008s (106.3x)
```
Where Does the Time Go?¶
Illustrative time breakdown for the Python loop (total += i):
┌─────────────────────────────────────────────────────────────┐
│ │
│ Bytecode dispatch: ████████████████ 30% │
│ Type checking: ████████████ 25% │
│ Object creation: ██████████ 20% │
│ Dictionary lookups: ██████ 15% │
│ Reference counting: ████ 10% │
│ Actual computation: █ <1% │
│ │
└─────────────────────────────────────────────────────────────┘
Most time is overhead, not actual computation!
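A quick `timeit` sketch gives a feel for the per-iteration overhead; absolute numbers vary by machine and Python version, but the gap between a Python-level loop and the same reduction in C is consistently large.

```python
import timeit

n = 100_000
reps = 20

# The same summation, once as Python bytecode, once inside C via sum():
loop_total = timeit.timeit(
    "total = 0\nfor i in range(n):\n    total += i",
    globals={"n": n}, number=reps)
sum_total = timeit.timeit("sum(range(n))", globals={"n": n}, number=reps)

print(f"Python loop: {loop_total / reps / n * 1e9:6.1f} ns per iteration")
print(f"C-level sum: {sum_total / reps / n * 1e9:6.1f} ns per iteration")
```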
When Python is "Fast Enough"¶
Python's slowness often doesn't matter:
```python
# I/O bound - Python speed doesn't matter
response = requests.get(url)   # network is the bottleneck
data = f.read()                # disk is the bottleneck

# Human interaction - Python speed doesn't matter
user_input = input("Enter: ")  # human is the bottleneck

# Calling fast libraries - Python is just orchestrating
result = np.dot(huge_matrix_a, huge_matrix_b)  # NumPy does the work

# One-time scripts - development time matters more
df = pd.read_csv('data.csv').groupby('x').mean()
```
Strategies to Speed Up Python¶
1. Use Built-in Functions¶
```python
# Slow
total = 0
for x in data:
    total += x

# Fast (the built-in loop runs in C)
total = sum(data)
```
2. Use NumPy/Pandas¶
```python
# Slow
result = [x * 2 for x in data]

# Fast
result = np.array(data) * 2
```
3. Use List Comprehensions¶
```python
# Slower
result = []
for x in data:
    result.append(x * 2)

# Faster (optimized bytecode, no repeated append lookup)
result = [x * 2 for x in data]
```
4. Avoid Global Variables¶
```python
# Slower (global lookup on each iteration)
multiplier = 2

def slow():
    return [x * multiplier for x in range(1_000_000)]

# Faster (local lookup)
def fast():
    mult = 2  # local variable
    return [x * mult for x in range(1_000_000)]
```
5. Use Appropriate Data Structures¶
```python
# Slow (O(n) lookup)
if item in large_list:
    pass

# Fast (O(1) average-case lookup)
if item in large_set:
    pass
```
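A quick timing sketch shows the gap in practice (numbers vary by machine, but the ordering does not):

```python
import timeit

n = 100_000
large_list = list(range(n))
large_set = set(large_list)
target = n - 1  # worst case for the list: a full scan

list_time = timeit.timeit(lambda: target in large_list, number=200)
set_time = timeit.timeit(lambda: target in large_set, number=200)

print(f"list membership: {list_time:.4f}s")
print(f"set membership:  {set_time:.6f}s")
print(f"speedup: ~{list_time / set_time:.0f}x")
```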
Summary¶
| Overhead Source | Impact | Mitigation |
|---|---|---|
| Dynamic typing | Type checks every operation | Use NumPy (typed arrays) |
| Object overhead | Memory and indirection | Use NumPy (raw data) |
| Interpretation | Bytecode dispatch | Use compiled extensions |
| Memory layout | Cache misses | Use contiguous arrays |
| GIL | No CPU parallelism | Use multiprocessing |
Key insight:
Python is slow for tight computational loops. The solution isn't to avoid Python—it's to move the hot loops into compiled code (NumPy, Cython, C extensions) while keeping Python for orchestration and glue code.