Introduction to Concurrency¶
Concurrency is about dealing with multiple things at once. This chapter covers Python's tools for concurrent and parallel execution.
Why Concurrency?¶
The Problem: Sequential Execution¶
import time
def download_file(url):
print(f"Downloading {url}...")
time.sleep(2) # Simulate network delay
print(f"Finished {url}")
# Sequential: 6 seconds total
urls = ["file1.zip", "file2.zip", "file3.zip"]
for url in urls:
download_file(url)
Each download waits for the previous one to complete. Total time: 6 seconds.
The Solution: Concurrent Execution¶
import time
from concurrent.futures import ThreadPoolExecutor
def download_file(url):
print(f"Downloading {url}...")
time.sleep(2)
print(f"Finished {url}")
# Concurrent: ~2 seconds total
urls = ["file1.zip", "file2.zip", "file3.zip"]
with ThreadPoolExecutor() as executor:
executor.map(download_file, urls)
All downloads happen simultaneously. Total time: ~2 seconds.
Key Terminology¶
Concurrency vs Parallelism¶
| Term | Definition | Analogy |
|---|---|---|
| Concurrency | Managing multiple tasks at once | One chef juggling multiple dishes |
| Parallelism | Executing multiple tasks simultaneously | Multiple chefs cooking simultaneously |
Concurrency is about structure — organizing code to handle multiple tasks. Parallelism is about execution — actually running tasks at the same time.
Concurrency (single core):
Task A: ██░░██░░██
Task B: ░░██░░██░░
Time →
Parallelism (multiple cores):
Task A: ██████████ (Core 1)
Task B: ██████████ (Core 2)
Time →
Threads vs Processes¶
| Aspect | Thread | Process |
|---|---|---|
| Memory | Shared memory space | Separate memory space |
| Creation | Fast, lightweight | Slower, heavier |
| Communication | Easy (shared variables) | Requires IPC (queues, pipes) |
| GIL impact | Affected by GIL | Not affected by GIL |
| Best for | I/O-bound tasks | CPU-bound tasks |
Synchronous vs Asynchronous¶
| Mode | Description |
|---|---|
| Synchronous | Wait for each operation to complete before starting next |
| Asynchronous | Start operations without waiting, handle results when ready |
Python's Concurrency Tools¶
Standard Library Modules¶
| Module | Purpose | Use Case |
|---|---|---|
threading |
Thread-based concurrency | I/O-bound tasks |
multiprocessing |
Process-based parallelism | CPU-bound tasks |
concurrent.futures |
High-level interface | Both (recommended) |
asyncio |
Async I/O | High-concurrency I/O |
queue |
Thread-safe queues | Producer-consumer patterns |
Which to Use?¶
Start here
│
├─ Is the task CPU-intensive (computation)?
│ │
│ ├─ Yes → multiprocessing / ProcessPoolExecutor
│ │
│ └─ No → Continue
│
├─ Is the task I/O-intensive (network, disk)?
│ │
│ ├─ Yes → threading / ThreadPoolExecutor
│ │ or asyncio for very high concurrency
│ │
│ └─ No → Sequential is probably fine
│
└─ Simple parallel map over data?
│
└─ Yes → concurrent.futures (easiest)
Real-World Examples¶
I/O-Bound: Web Scraping¶
import requests
from concurrent.futures import ThreadPoolExecutor
urls = [
"https://example.com/page1",
"https://example.com/page2",
"https://example.com/page3",
]
def fetch(url):
response = requests.get(url)
return len(response.content)
# Threads work well — waiting for network I/O
with ThreadPoolExecutor(max_workers=10) as executor:
results = list(executor.map(fetch, urls))
CPU-Bound: Number Crunching¶
from concurrent.futures import ProcessPoolExecutor
def compute_heavy(n):
"""CPU-intensive calculation."""
return sum(i * i for i in range(n))
numbers = [10_000_000, 20_000_000, 30_000_000]
# Processes work well — true parallel computation
with ProcessPoolExecutor() as executor:
results = list(executor.map(compute_heavy, numbers))
Mixed: Data Pipeline¶
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
def download_data(url):
"""I/O-bound: fetch from network."""
import requests
return requests.get(url).json()
def process_data(data):
"""CPU-bound: heavy computation."""
return expensive_computation(data)
# Stage 1: Download (I/O-bound) — use threads
with ThreadPoolExecutor() as executor:
raw_data = list(executor.map(download_data, urls))
# Stage 2: Process (CPU-bound) — use processes
with ProcessPoolExecutor() as executor:
results = list(executor.map(process_data, raw_data))
Performance Comparison¶
Benchmark: I/O-Bound Task¶
import time
import requests
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
def fetch_url(url):
requests.get(url)
return url
urls = ["https://httpbin.org/delay/1"] * 5
# Sequential: ~5 seconds
# ThreadPool: ~1 second ✓ Best
# ProcessPool: ~1.5 seconds (overhead)
Benchmark: CPU-Bound Task¶
import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
def compute(n):
return sum(i ** 2 for i in range(n))
numbers = [5_000_000] * 4
# Sequential: ~4 seconds (on 4-core machine)
# ThreadPool: ~4 seconds (GIL blocks parallelism)
# ProcessPool: ~1 second ✓ Best
Common Pitfalls¶
1. Using Threads for CPU-Bound Work¶
# Bad: Threads don't help with CPU-bound tasks
with ThreadPoolExecutor() as executor:
results = executor.map(heavy_computation, data) # No speedup!
# Good: Use processes for CPU-bound tasks
with ProcessPoolExecutor() as executor:
results = executor.map(heavy_computation, data) # Real parallelism
2. Too Many Workers¶
# Bad: 1000 threads/processes is wasteful
with ThreadPoolExecutor(max_workers=1000) as executor:
...
# Good: Match workers to task type
# I/O-bound: 10-50 threads typically sufficient
# CPU-bound: Match CPU cores
import os
with ProcessPoolExecutor(max_workers=os.cpu_count()) as executor:
...
3. Shared State Without Synchronization¶
# Bad: Race condition
counter = 0
def increment():
global counter
counter += 1 # Not thread-safe!
# Good: Use synchronization
import threading
counter = 0
lock = threading.Lock()
def increment():
global counter
with lock:
counter += 1
Chapter Overview¶
This chapter covers:
- Concurrency Concepts — GIL, CPU vs I/O bound, threads vs processes
- threading Module — Creating threads, synchronization, communication
- multiprocessing Module — Processes, pools, sharing state
- concurrent.futures — Modern, high-level API (recommended)
- Practical Patterns — Decision guide, common patterns, error handling
Key Takeaways¶
- Concurrency = managing multiple tasks; Parallelism = running simultaneously
- Threads share memory, affected by GIL — best for I/O-bound tasks
- Processes have separate memory, bypass GIL — best for CPU-bound tasks
- concurrent.futures provides the cleanest API for most use cases
- Match your concurrency strategy to your task type
- Always consider synchronization when sharing state