GIL and Hardware¶

What is the GIL?¶

The Global Interpreter Lock (GIL) is a mutex that protects access to Python objects, preventing multiple threads from executing Python bytecode simultaneously.

Without GIL (hypothetical):
┌────────────────────────────────────────────────────────────┐
│ Thread 1: refcount++    Thread 2: refcount++              │
│                                                            │
│ Both read refcount = 1                                     │
│ Both compute 1 + 1 = 2                                     │
│ Both write refcount = 2                                    │
│                                                            │
│ Expected: 3, Got: 2 → Memory corruption!                  │
└────────────────────────────────────────────────────────────┘

With GIL:
┌────────────────────────────────────────────────────────────┐
│ Thread 1: [acquire GIL][refcount++][release GIL]          │
│ Thread 2:              [wait........][acquire GIL][ref++] │
│                                                            │
│ Operations are serialized → Safe, but not parallel         │
└────────────────────────────────────────────────────────────┘

GIL and Multi-Core CPUs¶

Modern CPUs have multiple cores, but Python can only use one at a time for Python code:

8-Core CPU with Python Threads:

Core 0: [Python][Python][Python][Python][Python]  ← All Python here
Core 1: [idle ][idle ][idle ][idle ][idle ]
Core 2: [idle ][idle ][idle ][idle ][idle ]
Core 3: [idle ][idle ][idle ][idle ][idle ]
Core 4: [idle ][idle ][idle ][idle ][idle ]
Core 5: [idle ][idle ][idle ][idle ][idle ]
Core 6: [idle ][idle ][idle ][idle ][idle ]
Core 7: [idle ][idle ][idle ][idle ][idle ]

7 cores sitting idle despite multiple threads!

Demonstrating the GIL¶

import threading
import time

def cpu_bound_task(n):
    """CPU-intensive: count to n."""
    count = 0
    for _ in range(n):
        count += 1
    return count

n = 50_000_000

# Single-threaded
start = time.perf_counter()
cpu_bound_task(n)
single_time = time.perf_counter() - start
print(f"Single thread: {single_time:.2f}s")

# Two threads (should be 2x faster, right?)
start = time.perf_counter()
t1 = threading.Thread(target=cpu_bound_task, args=(n//2,))
t2 = threading.Thread(target=cpu_bound_task, args=(n//2,))
t1.start()
t2.start()
t1.join()
t2.join()
two_thread_time = time.perf_counter() - start
print(f"Two threads:   {two_thread_time:.2f}s")

print(f"Speedup: {single_time/two_thread_time:.2f}x")

Typical output:

Single thread: 2.50s
Two threads:   2.80s  ← Actually SLOWER!
Speedup: 0.89x

Two threads are slower due to GIL contention overhead!

When the GIL is Released¶

The GIL is released during:

1. I/O Operations¶

import threading
import time
import urllib.request

def download(url):
    """I/O bound: downloads URL."""
    urllib.request.urlopen(url).read()

urls = ['http://example.com'] * 4

# Sequential
start = time.perf_counter()
for url in urls:
    download(url)
sequential_time = time.perf_counter() - start

# Parallel (GIL released during network I/O)
start = time.perf_counter()
threads = [threading.Thread(target=download, args=(url,)) for url in urls]
for t in threads:
    t.start()
for t in threads:
    t.join()
parallel_time = time.perf_counter() - start

print(f"Sequential: {sequential_time:.2f}s")
print(f"Parallel:   {parallel_time:.2f}s")
print(f"Speedup:    {sequential_time/parallel_time:.2f}x")

Sequential: 2.00s
Parallel:   0.55s
Speedup:    3.64x  ← Threading works for I/O!

2. NumPy Operations¶

import numpy as np
import threading
import time

def numpy_operation(arr):
    """NumPy releases GIL during computation."""
    for _ in range(100):
        np.dot(arr, arr)

arr = np.random.rand(1000, 1000)

# NumPy operations CAN run in parallel
# because NumPy releases the GIL

3. C Extensions That Release GIL¶

// C extension code
Py_BEGIN_ALLOW_THREADS
// ... long computation without Python objects ...
Py_END_ALLOW_THREADS

Working Around the GIL¶

Solution 1: Multiprocessing¶

Use separate processes instead of threads:

from multiprocessing import Pool
import time

def cpu_bound_task(n):
    count = 0
    for _ in range(n):
        count += 1
    return count

n = 50_000_000

# Single process
start = time.perf_counter()
cpu_bound_task(n)
single_time = time.perf_counter() - start

# Multiple processes (no GIL issue!)
start = time.perf_counter()
with Pool(4) as pool:
    pool.map(cpu_bound_task, [n//4] * 4)
multi_time = time.perf_counter() - start

print(f"Single process:    {single_time:.2f}s")
print(f"Four processes:    {multi_time:.2f}s")
print(f"Speedup:           {single_time/multi_time:.2f}x")

Single process:    2.50s
Four processes:    0.70s
Speedup:           3.57x  ← Real parallelism!

Multiprocessing vs Threading:

Threading:
┌──────────┐
│ Process  │
│ ┌──────┐ │   GIL
│ │Thread│◀┼───────────────────────▶ Only one runs Python
│ │Thread│ │
│ │Thread│ │
│ └──────┘ │
└──────────┘

Multiprocessing:
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Process  │ │ Process  │ │ Process  │
│ ┌──────┐ │ │ ┌──────┐ │ │ ┌──────┐ │
│ │ GIL  │ │ │ │ GIL  │ │ │ │ GIL  │ │  Each has own GIL
│ └──────┘ │ │ └──────┘ │ │ └──────┘ │
└──────────┘ └──────────┘ └──────────┘
     ▲             ▲             ▲
     └─────────────┼─────────────┘
              True parallelism!

Solution 2: Use NumPy/SciPy¶

import numpy as np
import time

n = 10_000_000

# Pure Python (GIL-bound)
data = list(range(n))
start = time.perf_counter()
result = sum(x**2 for x in data)
python_time = time.perf_counter() - start

# NumPy (releases GIL, uses SIMD)
arr = np.arange(n)
start = time.perf_counter()
result = np.sum(arr**2)
numpy_time = time.perf_counter() - start

print(f"Python: {python_time:.2f}s")
print(f"NumPy:  {numpy_time:.3f}s")
print(f"Speedup: {python_time/numpy_time:.0f}x")

Solution 3: Numba with `nogil`¶

from numba import jit, prange
import numpy as np
import time

@jit(nopython=True, parallel=True)
def parallel_sum_squares(arr):
    """Numba can release GIL and parallelize."""
    total = 0.0
    for i in prange(len(arr)):  # prange = parallel range
        total += arr[i] ** 2
    return total

arr = np.random.rand(10_000_000)

# Warm up JIT
parallel_sum_squares(arr)

# Benchmark
start = time.perf_counter()
result = parallel_sum_squares(arr)
elapsed = time.perf_counter() - start

print(f"Time: {elapsed:.4f}s")

GIL and Hardware Utilization¶

Task Type           Threading    Multiprocessing    NumPy
────────────────────────────────────────────────────────────
CPU-bound Python    ✗ No gain    ✓ Full parallel    N/A
CPU-bound NumPy     ✓ Can help   ✓ Full parallel    ✓ Built-in
I/O-bound           ✓ Works      ✓ Works            N/A
Memory-bound        ✗ Limited    ✗ Limited          ✓ Optimized

The Future: Free-threaded Python¶

Python 3.13+ introduces experimental GIL-free mode:

# Build Python with --disable-gil (experimental)
# Or use the free-threaded build
python3.13t script.py  # 't' suffix = free-threaded

# In free-threaded Python, true parallelism is possible
import threading

# This will actually use multiple cores!
threads = [threading.Thread(target=cpu_task) for _ in range(4)]

Summary¶

Scenario	GIL Impact	Solution
CPU-bound Python	Serialized	multiprocessing
I/O-bound	Released during I/O	threading works
NumPy computation	Released	threading can help
C extensions	Can release	depends on extension

Key points:

GIL prevents true threading parallelism for Python code
GIL is released during I/O and many C extensions
Use multiprocessing for CPU-bound parallelism
Use threading for I/O-bound concurrency
NumPy releases GIL, enabling parallel computation
Free-threaded Python (3.13+) removes GIL (experimental)

Decision Tree:

Is your code CPU-bound?
├── Yes: Pure Python?
│   ├── Yes → Use multiprocessing
│   └── No (NumPy) → Threading may help, NumPy parallelizes internally
└── No (I/O-bound) → Use threading or asyncio