CPU-Memory Communication¶
The Memory Controller¶
Modern CPUs have an integrated memory controller that manages communication with RAM:
┌─────────────────────────────────────────────────────────────┐
│ CPU │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ CPU Cores │ │
│ │ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ │ │
│ │ │Core 0│ │Core 1│ │Core 2│ │Core 3│ │ │
│ │ └──┬───┘ └──┬───┘ └──┬───┘ └──┬───┘ │ │
│ │ └────────┼────────┼────────┘ │ │
│ │ ▼ │ │
│ │ ┌─────────────┐ │ │
│ │ │ L3 Cache │ │ │
│ │ └──────┬──────┘ │ │
│ └───────────────┼──────────────────────────────────────┘ │
│ ▼ │
│ ┌───────────────────────────────────┐ │
│ │ Integrated Memory Controller │ │
│ │ ┌─────────┐ ┌─────────┐ │ │
│ │ │Channel A│ │Channel B│ │ │
│ │ └────┬────┘ └────┬────┘ │ │
│ └───────┼─────────────────┼────────┘ │
└──────────┼─────────────────┼────────────────────────────────┘
│ │
▼ ▼
┌────────┐ ┌────────┐
│ DIMM 1 │ │ DIMM 2 │
└────────┘ └────────┘
Memory Access Flow¶
Read Request¶
1. CPU Core needs data at address X
│
▼
2. Check L1 Cache ──── Hit? ──→ Return data (1 ns)
│ Miss
▼
3. Check L2 Cache ──── Hit? ──→ Return data (3 ns)
│ Miss
▼
4. Check L3 Cache ──── Hit? ──→ Return data (10 ns)
│ Miss
▼
5. Memory Controller receives request
│
▼
6. Controller sends address to RAM via memory bus
│
▼
7. RAM retrieves data (Row → Column activation)
│
▼
8. Data returns to CPU (~60 ns total from request)
│
▼
9. Data cached in L1/L2/L3 for future access
Memory Channels¶
Multiple channels allow parallel access:
Single Channel:
┌─────────────────┐
│ Memory Controller│
│ ┌──────┐ │
│ │Chan A│ │
│ └──┬───┘ │
└───────┼─────────┘
│
┌────┴────┐
│ DIMM │ Bandwidth: ~25 GB/s
└─────────┘
Dual Channel:
┌─────────────────┐
│ Memory Controller│
│ ┌──────┐ ┌──────┐│
│ │Chan A│ │Chan B││
│ └──┬───┘ └──┬───┘│
└────┼────────┼────┘
│ │
┌────┴────┐┌────┴────┐
│ DIMM ││ DIMM │ Bandwidth: ~50 GB/s (2×)
└─────────┘└─────────┘
Channel Interleaving¶
Data is striped across channels for parallelism:
Memory Address Interleaving:
Address 0x0000: Channel A, DIMM 0
Address 0x0040: Channel B, DIMM 0 (64-byte offset)
Address 0x0080: Channel A, DIMM 0
Address 0x00C0: Channel B, DIMM 0
...
Sequential access automatically uses both channels!
Memory Timing¶
DDR SDRAM Timing Parameters¶
Memory Access Timeline:
tCL (CAS Latency): Column access time
tRCD: Row to Column Delay
tRP: Row Precharge time
tRAS: Row Active time
Example DDR4-3200 CL16:
Timings: 16-18-18-36
tRCD tCL
│ │
┌─────┴─────┐ ┌───┴───┐
│ │ │ │
────[Row Cmd]───[Col Cmd]─[Data]────
│ │
└─────────────────────┘
Total: ~20 ns
Timing Impact¶
import numpy as np
import time
def measure_memory_latency():
"""Measure effective memory access latency."""
# Create array larger than cache
size = 100 * 1024 * 1024 # 100 MB
arr = np.zeros(size // 8, dtype=np.float64)
# Random access pattern defeats prefetching
indices = np.random.permutation(len(arr))
# Pointer chasing to measure latency
n_accesses = 1_000_000
start = time.perf_counter()
total = 0.0
for i in range(n_accesses):
total += arr[indices[i % len(indices)]]
elapsed = time.perf_counter() - start
latency_ns = elapsed / n_accesses * 1e9
print(f"Effective latency: {latency_ns:.0f} ns")
measure_memory_latency() # Typically 60-100 ns
Cache Coherency¶
When multiple cores access the same memory, coherency must be maintained:
MESI Protocol States:
M (Modified): This cache has the only valid copy (dirty)
E (Exclusive): This cache has the only copy (clean)
S (Shared): Multiple caches have copies (clean)
I (Invalid): Cache line is not valid
State Transitions:
┌───────────┐ Read by ┌───────────┐
│ Invalid │ ────────────▶ │ Shared │
└───────────┘ this core └───────────┘
▲ │
│ Other core │ Write by
│ writes │ this core
│ ▼
┌───────────┐ ┌───────────┐
│ Modified │ ◀──────────── │ Exclusive │
└───────────┘ Write by └───────────┘
this core
False Sharing¶
When cores modify different data in the same cache line:
import numpy as np
from concurrent.futures import ThreadPoolExecutor
import time
def false_sharing_demo():
# Bad: Adjacent data (same cache line)
shared_array = np.zeros(2, dtype=np.int64)
def increment_0():
for _ in range(10_000_000):
shared_array[0] += 1
def increment_1():
for _ in range(10_000_000):
shared_array[1] += 1
# Both threads fight over same cache line!
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=2) as ex:
ex.submit(increment_0)
ex.submit(increment_1)
bad_time = time.perf_counter() - start
# Better: Pad to separate cache lines (64 bytes apart)
padded = np.zeros(16, dtype=np.int64) # 128 bytes
# Thread 0 uses padded[0], Thread 1 uses padded[8]
print(f"Adjacent (false sharing): {bad_time:.2f}s")
Memory Bandwidth Measurement¶
import numpy as np
import time
def measure_bandwidth():
"""Measure achievable memory bandwidth."""
sizes_mb = [1, 10, 100, 1000]
for size_mb in sizes_mb:
n = size_mb * 1024 * 1024 // 8
arr = np.random.rand(n)
# Read bandwidth (sum reads all elements)
start = time.perf_counter()
for _ in range(10):
_ = np.sum(arr)
elapsed = time.perf_counter() - start
bytes_read = n * 8 * 10
bandwidth = bytes_read / elapsed / 1e9
print(f"{size_mb:4d} MB: {bandwidth:.1f} GB/s")
measure_bandwidth()
Expected output:
1 MB: 80.0 GB/s (fits in L3 cache)
10 MB: 50.0 GB/s (partially cached)
100 MB: 35.0 GB/s (RAM limited)
1000 MB: 32.0 GB/s (RAM limited)
NUMA: Non-Uniform Memory Access¶
Multi-socket systems have local and remote memory:
NUMA Architecture (2 sockets)
┌──────────────────────┐ ┌──────────────────────┐
│ Socket 0 │ │ Socket 1 │
│ ┌────────────────┐ │ │ ┌────────────────┐ │
│ │ CPU Cores │ │ │ │ CPU Cores │ │
│ └───────┬────────┘ │ │ └───────┬────────┘ │
│ │ │ │ │ │
│ ┌───────┴────────┐ │ │ ┌───────┴────────┐ │
│ │ Memory Ctrl │ │◀═══▶│ │ Memory Ctrl │ │
│ └───────┬────────┘ │ QPI │ └───────┬────────┘ │
│ │ │ │ │ │
│ ┌───┴───┐ │ │ ┌───┴───┐ │
│ │ RAM │ │ │ │ RAM │ │
│ │(Local)│ │ │ │(Local)│ │
│ └───────┘ │ │ └───────┘ │
└──────────────────────┘ └──────────────────────┘
Local access: ~60 ns
Remote access: ~100 ns (must cross QPI link)
NUMA-Aware Allocation¶
import numpy as np
# NumPy doesn't directly control NUMA
# But OS may place memory on local node
# For NUMA-aware code, use:
# - numactl command-line tool
# - numa library bindings
# - Process pinning to specific nodes
Summary¶
| Concept | Description |
|---|---|
| Memory Controller | Manages CPU-RAM communication |
| Channels | Parallel paths to memory (dual/quad) |
| Interleaving | Striping data across channels |
| CAS Latency | Cycles from column command to data |
| Cache Coherency | Keeping caches consistent (MESI) |
| False Sharing | Performance loss from shared cache lines |
| NUMA | Non-uniform memory access in multi-socket |
Key insights for Python:
- Memory bandwidth (~30-50 GB/s) limits large array operations
- Sequential access enables prefetching and channel interleaving
- Random access suffers full memory latency (~60 ns)
- False sharing can hurt multi-threaded code
- NumPy operations are often memory-bound, not compute-bound