Why Python is Slow¶
Mental Model
In a tight Python loop, less than 1% of CPU time goes to actual arithmetic -- the rest is spent on type checking, object creation, dictionary lookups, and reference counting. Python is not slow because your CPU is slow; it is slow because it does enormous bookkeeping for every operation. The fix is to push hot loops into compiled code (NumPy, C extensions) and let Python orchestrate.
The Performance Gap¶
Python is often 10-100x slower than C for computational tasks. Understanding why helps you write faster Python code.
``` Performance Comparison (sum of 10 million integers):
C: ████ 0.01s Java: ████████ 0.02s JavaScript: ████████████ 0.03s Python: ████████████████████████████████████████ 0.85s
Python is ~85x slower than C for this task! ```
The Five Reasons Python is Slow¶
1. Dynamic Typing¶
Python checks types at runtime, not compile time:
```python
Python doesn't know what 'x' is until runtime¶
def add(x, y): return x + y
add(1, 2) # int + int add("a", "b") # str + str add([1], [2]) # list + list
Each call requires:¶
1. Check type of x¶
2. Check type of y¶
3. Find appropriate add method¶
4. Call it¶
```
C equivalent:
c
// Type known at compile time - no runtime checks
int add(int x, int y) {
return x + y; // Just one CPU instruction
}
2. Everything is an Object¶
In Python, even simple integers are full objects:
```python import sys
x = 42 print(sys.getsizeof(x)) # 28 bytes!
A Python int contains:¶
- Reference count (8 bytes)¶
- Type pointer (8 bytes)¶
- Value size (8 bytes)¶
- Actual value (variable)¶
```
``` Python int object (28 bytes): ┌────────────────────────────────┐ │ Reference Count (8 bytes) │ ├────────────────────────────────┤ │ Type Pointer (8 bytes) │ → points to int type ├────────────────────────────────┤ │ Object Size (8 bytes) │ ├────────────────────────────────┤ │ Actual Value (4 bytes) │ ← the number 42 └────────────────────────────────┘
C int (4 bytes): ┌────────────────────────────────┐ │ Actual Value (4 bytes) │ ← the number 42 └────────────────────────────────┘
7x more memory, plus indirection overhead! ```
3. Interpreted Execution¶
Python bytecode runs on a virtual machine:
```python import dis
def simple_loop(): total = 0 for i in range(1000): total += i return total
dis.dis(simple_loop) ```
Output: ``` 2 0 LOAD_CONST 1 (0) 2 STORE_FAST 0 (total)
3 4 LOAD_GLOBAL 0 (range) 6 LOAD_CONST 2 (1000) 8 CALL_FUNCTION 1 10 GET_ITER >> 12 FOR_ITER 6 (to 26) 14 STORE_FAST 1 (i)
4 16 LOAD_FAST 0 (total) 18 LOAD_FAST 1 (i) 20 INPLACE_ADD 22 STORE_FAST 0 (total) 24 JUMP_ABSOLUTE 6 (to 12)
5 >> 26 LOAD_FAST 0 (total) 28 RETURN_VALUE ```
Each bytecode instruction involves:
- Fetch instruction
- Decode instruction
- Dispatch to handler
- Execute handler
- Repeat
4. Memory Indirection¶
Python lists store references, not values:
``` Python List of integers: ┌─────────────────┐ │ List Object │ │ ┌───────────┐ │ │ │ ref[0] ──┼──┼──▶ [PyInt: 1] (somewhere in heap) │ │ ref[1] ──┼──┼──▶ [PyInt: 2] (somewhere else) │ │ ref[2] ──┼──┼──▶ [PyInt: 3] (somewhere else) │ └───────────┘ │ └─────────────────┘
Each access: 1. Load list pointer 2. Load reference from list 3. Follow reference to object 4. Check object type 5. Extract value from object
C Array: ┌─────────────────┐ │ 1 │ 2 │ 3 │ Contiguous in memory └─────────────────┘
Each access: 1. Calculate offset 2. Load value ```
5. Global Interpreter Lock (GIL)¶
Python can't truly parallelize CPU-bound threads:
```python import threading import time
counter = 0
def increment(): global counter for _ in range(1_000_000): counter += 1
Two threads¶
t1 = threading.Thread(target=increment) t2 = threading.Thread(target=increment)
start = time.perf_counter() t1.start() t2.start() t1.join() t2.join() elapsed = time.perf_counter() - start
print(f"Time: {elapsed:.2f}s") print(f"Counter: {counter}") # Likely not 2,000,000!
Two threads are NOT faster than one¶
because GIL prevents parallel execution¶
```
Quantifying the Overhead¶
```python import time import numpy as np
def benchmark_overhead(): n = 10_000_000
# Pure Python loop
start = time.perf_counter()
total = 0
for i in range(n):
total += i
python_time = time.perf_counter() - start
# Python with list
data = list(range(n))
start = time.perf_counter()
total = sum(data)
builtin_time = time.perf_counter() - start
# NumPy (C code)
arr = np.arange(n)
start = time.perf_counter()
total = np.sum(arr)
numpy_time = time.perf_counter() - start
print(f"Python loop: {python_time:.3f}s (1.0x)")
print(f"Python sum(): {builtin_time:.3f}s ({python_time/builtin_time:.1f}x)")
print(f"NumPy sum(): {numpy_time:.3f}s ({python_time/numpy_time:.1f}x)")
benchmark_overhead() ```
Typical output:
Python loop: 0.850s (1.0x)
Python sum(): 0.120s (7.1x)
NumPy sum(): 0.008s (106.3x)
Where Does the Time Go?¶
``` Time breakdown for Python loop (x += 1):
┌─────────────────────────────────────────────────────────────┐ │ │ │ Bytecode dispatch: ████████████████ 30% │ │ Type checking: ████████████ 25% │ │ Object creation: ██████████ 20% │ │ Dictionary lookups: ██████ 15% │ │ Reference counting: ████ 10% │ │ Actual computation: █ <1% │ │ │ └─────────────────────────────────────────────────────────────┘
Most time is overhead, not actual computation! ```
When Python is "Fast Enough"¶
Python's slowness often doesn't matter:
```python
I/O bound - Python speed doesn't matter¶
response = requests.get(url) # Network is bottleneck data = f.read() # Disk is bottleneck
Human interaction - Python speed doesn't matter¶
user_input = input("Enter: ") # Human is bottleneck
Calling fast libraries - Python is just orchestrating¶
result = np.dot(huge_matrix_a, huge_matrix_b) # NumPy does work
One-time scripts - Development time matters more¶
df = pd.read_csv('data.csv').groupby('x').mean() ```
Strategies to Speed Up Python¶
1. Use Built-in Functions¶
```python
Slow¶
total = 0 for x in data: total += x
Fast (built-in is C)¶
total = sum(data) ```
2. Use NumPy/Pandas¶
```python
Slow¶
result = [x * 2 for x in data]
Fast¶
result = np.array(data) * 2 ```
3. Use List Comprehensions¶
```python
Slower¶
result = [] for x in data: result.append(x * 2)
Faster (optimized bytecode)¶
result = [x * 2 for x in data] ```
4. Avoid Global Variables¶
```python
Slower (global lookup each iteration)¶
multiplier = 2 def slow(): return [x * multiplier for x in range(1000000)]
Faster (local lookup)¶
def fast(): mult = 2 # Local variable return [x * mult for x in range(1000000)] ```
5. Use Appropriate Data Structures¶
```python
Slow (O(n) lookup)¶
if item in large_list: pass
Fast (O(1) lookup)¶
if item in large_set: pass ```
Summary¶
| Overhead Source | Impact | Mitigation |
|---|---|---|
| Dynamic typing | Type checks every operation | Use NumPy (typed arrays) |
| Object overhead | Memory and indirection | Use NumPy (raw data) |
| Interpretation | Bytecode dispatch | Use compiled extensions |
| Memory layout | Cache misses | Use contiguous arrays |
| GIL | No CPU parallelism | Use multiprocessing |
Key insight:
Python is slow for tight computational loops. The solution isn't to avoid Python—it's to move the hot loops into compiled code (NumPy, Cython, C extensions) while keeping Python for orchestration and glue code.
Exercises¶
Exercise 1. List three reasons why Python is slower than compiled languages like C or Rust for numerical computation.
Solution to Exercise 1
```python
Conceptual solution - see page content for details¶
import sys import platform
print(f"Python version: {sys.version}") print(f"Platform: {platform.platform()}") print(f"Architecture: {platform.machine()}") ```
Exercise 2. Explain what dynamic typing costs at runtime. How does it affect performance compared to static typing?
Solution to Exercise 2
See the main content for the detailed explanation. The key concept involves understanding the hardware-software interaction and how it affects Python performance.
Exercise 3. Write Python code that benchmarks a pure Python loop versus an equivalent NumPy operation, and calculate the speedup factor.
Solution to Exercise 3
```python import time
Simple benchmark¶
n = 10_000_000 start = time.perf_counter() total = sum(range(n)) elapsed = time.perf_counter() - start print(f"Sum of {n} integers: {total}") print(f"Time: {elapsed:.4f} seconds") ```
Exercise 4. Explain the concept of 'boxing' and 'unboxing' in Python. How does it contribute to Python's overhead for numerical operations?
Solution to Exercise 4
```python import numpy as np import time
n = 1_000_000
Python loop¶
start = time.perf_counter() result_py = sum(i * i for i in range(n)) time_py = time.perf_counter() - start
NumPy vectorized¶
arr = np.arange(n) start = time.perf_counter() result_np = np.sum(arr * arr) time_np = time.perf_counter() - start
print(f"Python: {time_py:.4f}s, NumPy: {time_np:.4f}s") print(f"Speedup: {time_py / time_np:.1f}x") ```