# Registers and Cache
Registers and caches are the fastest storage locations in a computer. They sit between the CPU's execution units and main memory, allowing programs to operate on data with extremely low latency.
Because modern processors execute instructions much faster than data can be retrieved from RAM, these small but fast storage layers play a critical role in overall performance.
Understanding registers and caches explains why:
- sequential array processing is fast
- random memory access is slow
- vectorized operations outperform loops
- NumPy is dramatically faster than Python lists for numerical computation
## 1. Registers: The Fastest Storage
Registers are small storage locations located directly inside the CPU.
They are used to hold:
- intermediate computation results
- operands for arithmetic instructions
- memory addresses
- loop counters and temporary variables
Because registers are part of the processor itself, accessing them requires only one CPU cycle.
### Register characteristics
Typical properties for modern x86-64 CPUs:
| Property | Value |
|---|---|
| Number of general-purpose registers | 16 |
| Register size | 64 bits |
| Total capacity | ~128 bytes |
| Access latency | ~1 CPU cycle |
Compared to other memory layers, registers are extremely small but extremely fast.
### Register usage example
A simple arithmetic operation in machine code might look like:
```text id="j30v79"
ADD RAX, RBX
```
This instruction adds the value in register `RBX` to register `RAX`.
Because both operands are already in registers, the operation completes in a single cycle.
---
#### Visualization
```mermaid id="pso53r"
flowchart LR
A[Register RAX] --> C[ALU]
B[Register RBX] --> C
C --> D[Result stored in RAX]
```
## 2. Register Allocation
Registers are limited resources. Programs often require more temporary values than available registers.
Two mechanisms help manage this constraint:
### Compiler allocation
Compilers attempt to keep frequently used values in registers.
Example:
```c id="0t86d4"
c = a + b;
```
The compiler loads `a` and `b` into registers, performs the addition, and stores the result.
---
### Register renaming
Modern CPUs dynamically map logical registers to physical registers using **register renaming**.
This allows processors to:
* avoid false dependencies
* execute instructions out of order
* improve pipeline utilization
These mechanisms are invisible to software but critical for high performance.
---
## 3. Cache Memory
Caches store recently accessed memory values closer to the CPU.
They are implemented using fast **SRAM** rather than the slower **DRAM** used for main memory.
---
### Cache hierarchy
Most processors use multiple cache levels.
| Cache | Size | Latency |
| ----- | -------- | ---------- |
| L1 | ~32 KB | ~4 cycles |
| L2 | ~256 KB | ~12 cycles |
| L3 | ~8–32 MB | ~40 cycles |
---
### L1 cache split
The L1 cache is usually divided into two separate caches:
| Cache | Purpose |
| ----- | ----------------- |
| L1-I | instruction cache |
| L1-D | data cache |
This separation allows instruction fetching and data access to occur simultaneously.
---
#### Cache hierarchy visualization
```mermaid id="0z48xp"
flowchart TD
CPU --> L1[L1 Cache]
L1 --> L2[L2 Cache]
L2 --> L3[L3 Cache]
L3 --> RAM[Main Memory]
```
When requested data is not found at one level, the request falls through to the next, slower level; once found, the data is copied back up toward the CPU.
## 4. Cache Lines
Caches store memory in blocks called cache lines.
Typical size:
```text id="r1k5tq"
64 bytes
```
When a single byte is requested, the entire cache line containing that byte is loaded.
---
### Example
Suppose a program reads memory address:
```text id="9z8htc"
100
```
The CPU loads the block:
```text id="ck7p9p"
64-byte region containing address 100
```
This block might include addresses:
```text id="cy2og1"
64–127
```
#### Visualization

```mermaid id="uz1ipf"
flowchart LR
A[Memory block 64 bytes] --> B[Cache line]
B --> C[CPU accesses data]
```
This design improves performance when nearby data is accessed.
---
## 5. Sequential Access and Cache Efficiency
Cache lines make **sequential memory access** extremely efficient.
Consider a NumPy array of `float64` values.
Each element uses:
```text id="a4kbf7"
8 bytes
```
Because cache lines contain 64 bytes, each cache line stores:
```text id="z3k82c"
8 float64 values
```
Thus a single cache miss loads eight elements.
---
### Example access pattern
Sequential access:
```text id="l2jqfh"
arr[0], arr[1], arr[2], arr[3]
```
Behavior:
| Access | Result |
|---|---|
| arr[0] | cache miss |
| arr[1] | cache hit |
| arr[2] | cache hit |
| arr[3] | cache hit |
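This effect is easy to measure. The sketch below (illustrative only; absolute timings depend on the machine) sums the same array once in sequential order and once in a randomly permuted order, so the only difference is the memory access pattern:

```python
import numpy as np
import time

n = 2_000_000
arr = np.random.rand(n)
idx_seq = np.arange(n)               # visit elements in memory order
idx_rand = np.random.permutation(n)  # visit the same elements in random order

def timed_sum(indices):
    start = time.perf_counter()
    total = arr[indices].sum()
    return total, time.perf_counter() - start

total_seq, t_seq = timed_sum(idx_seq)
total_rand, t_rand = timed_sum(idx_rand)

# Both orders visit every element exactly once, so the sums agree
# (up to floating-point rounding); only the access pattern differs.
print(f"sequential: {t_seq:.4f} s")
print(f"random    : {t_rand:.4f} s")
```

On typical hardware the random-order gather is noticeably slower, because most accesses land on cache lines that are no longer resident.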
#### Visualization

```mermaid id="k4qnh2"
flowchart LR
A["arr[0]"] --> B[cache line loaded]
B --> C["arr[1]"]
B --> D["arr[2]"]
B --> E["arr[3]"]
```
---
## 6. Cache Associativity
Caches are organized into **sets** containing multiple cache lines.
The number of lines per set defines the **associativity**.
---
### Types of caches
| Type | Description |
| ----------------- | ----------------- |
| Direct-mapped | one line per set |
| N-way associative | N lines per set |
| Fully associative | any line anywhere |
Most modern caches are **8-way or 16-way associative**.
---
### Conflict misses
A **conflict miss** occurs when multiple memory addresses map to the same cache set.
If more addresses compete for a set than the associativity allows, lines must be repeatedly evicted.
This can degrade performance significantly.
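One way to provoke conflict misses is to walk memory with a large power-of-two stride, as happens when summing a column of a 2D array. The sketch below is illustrative; how large the effect is depends on the cache geometry of the machine, and the specific column counts (1024 vs 1025) are chosen here only to contrast a power-of-two stride with one just past it:

```python
import numpy as np
import time

def time_column_sum(ncols, rows=4096, reps=50):
    """Time repeated sums of the first column of a (rows x ncols) array."""
    a = np.ones((rows, ncols))
    col = a[:, 0]          # consecutive elements are ncols * 8 bytes apart
    col.sum()              # warm-up pass
    start = time.perf_counter()
    for _ in range(reps):
        s = col.sum()
    return s, time.perf_counter() - start

# With 1024 columns the stride is 8 KB, a power of two, so successive
# accesses tend to map to the same cache set; with 1025 columns each
# access shifts into a different set.
s_pow2, t_pow2 = time_column_sum(1024)
s_odd, t_odd = time_column_sum(1025)
print(f"power-of-two stride: {t_pow2:.4f} s")
print(f"off-by-one stride  : {t_odd:.4f} s")
```

When the power-of-two version is slower despite doing identical arithmetic, the difference is attributable to repeated evictions within a single cache set.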
---
#### Visualization
```mermaid id="pd1c88"
flowchart TD
A[Memory address] --> B[Cache set]
C[Memory address] --> B
D[Memory address] --> B
B --> E[Cache lines]
```
If too many addresses map to the same set, older lines are evicted.
## 7. SIMD Registers and Vectorization
Modern CPUs include SIMD (Single Instruction Multiple Data) registers.
These registers allow a single instruction to operate on multiple data elements simultaneously.
### SIMD register types
| Register | Size |
|---|---|
| XMM | 128 bits |
| YMM | 256 bits |
| ZMM | 512 bits |
Example capacities:
| Data type | Values per register (256-bit) |
|---|---|
| float32 | 8 |
| float64 | 4 |
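The capacities above follow from a simple division of register width by element width. A short snippet computing the lane counts for all three register sizes:

```python
# Lanes per SIMD register = register width // element width (both in bits).
register_bits = {"XMM": 128, "YMM": 256, "ZMM": 512}
element_bits = {"float32": 32, "float64": 64}

lanes = {
    (reg, dtype): rbits // ebits
    for reg, rbits in register_bits.items()
    for dtype, ebits in element_bits.items()
}

for (reg, dtype), n in sorted(lanes.items()):
    print(f"{reg}: {n:2d} x {dtype}")
```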
#### Visualization

```mermaid id="zj2in6"
flowchart LR
A[SIMD register] --> B[value1]
A --> C[value2]
A --> D[value3]
A --> E[value4]
```
Multiple values are processed in parallel.
---
## 8. NumPy and Vectorized Computation
NumPy operations are fast because they:
1. operate on **contiguous memory**
2. use **SIMD instructions**
3. run in **compiled C code**
Example:
```python id="aq9p5c"
import numpy as np
a = np.arange(1_000_000)
b = np.arange(1_000_000)
c = a + b
```
Instead of performing one addition at a time, the CPU processes multiple elements per instruction using SIMD registers.
## 9. Python Interpreter Overhead
Pure Python arithmetic is much slower because each operation involves many steps.
Example:
```python id="y10tdg"
x = a + b
```
Internally this requires:
1. type checking
2. method dispatch
3. object allocation
4. reference counting
Instead of a single machine instruction, Python may execute **dozens of instructions**.
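The overhead is easy to measure. A minimal sketch (timings vary by machine and Python version) comparing a list-comprehension addition with the NumPy equivalent:

```python
import numpy as np
import time

n = 1_000_000
a = list(range(n))
b = list(range(n))

# Pure Python: every addition goes through type checks, method dispatch,
# and object allocation.
start = time.perf_counter()
c_py = [x + y for x, y in zip(a, b)]
t_py = time.perf_counter() - start

# NumPy: one compiled loop over contiguous memory.
a_np = np.arange(n)
b_np = np.arange(n)
start = time.perf_counter()
c_np = a_np + b_np
t_np = time.perf_counter() - start

print(f"pure Python: {t_py * 1e3:8.1f} ms")
print(f"NumPy      : {t_np * 1e3:8.1f} ms")
```

The results agree element for element; only the per-element bookkeeping differs.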
---
#### Visualization
```mermaid id="pnjv1k"
flowchart TD
A[Python expression] --> B[type checks]
B --> C[object operations]
C --> D[allocation]
D --> E[result]
```
NumPy bypasses this overhead by operating on raw memory arrays in compiled code.
## 10. Observing Cache Effects
The impact of cache size can be observed experimentally.
When arrays fit in cache, operations are very fast.
When arrays exceed cache capacity, performance drops because data must be fetched from slower memory.
### Example experiment
```python id="i21ylt"
import numpy as np
import time

for name, n in [('L1 32KB', 4000), ('L2 256KB', 32000), ('L3 8MB', 1000000), ('RAM 64MB', 8000000)]:
    arr = np.random.rand(n)
    _ = np.sum(arr)  # warm-up: pull the array into cache
    start = time.perf_counter()
    for _ in range(100):
        _ = np.sum(arr)
    elapsed = time.perf_counter() - start
    bw = (n * 8 * 100) / elapsed / 1e9
    print(f"{name:12}: {bw:6.1f} GB/s")
```

Typical result:

- small arrays → high bandwidth
- large arrays → slower bandwidth
## 11. Strided Access and Cache Waste
Cache lines improve sequential access but can be wasted by strided access patterns.
### Example
```python id="5j7bq1"
import numpy as np

arr = np.arange(1_000_000, dtype=np.float64)

total_seq = np.sum(arr)
total_strided = np.sum(arr[::64])
```
The strided version accesses only one element per cache line, wasting most of the data fetched.
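The waste can be timed directly. The sketch below (illustrative; it assumes an array larger than a typical L3 cache, here 64 MB) uses a stride of 8 float64 values, i.e. exactly one value per 64-byte cache line, so every line the strided sum touches is transferred but only one eighth of it is used:

```python
import numpy as np
import time

arr = np.arange(8_000_000, dtype=np.float64)  # 64 MB, larger than typical L3
arr.sum()  # warm-up

start = time.perf_counter()
total_seq = arr.sum()
t_seq = time.perf_counter() - start

start = time.perf_counter()
# Stride of 8 float64s = 64 bytes: one value per cache line is used,
# but the memory system still transfers the full line.
total_strided = arr[::8].sum()
t_strided = time.perf_counter() - start

print(f"sequential : {t_seq:.4f} s ({arr.nbytes / 1e6:.0f} MB summed)")
print(f"strided ::8: {t_strided:.4f} s (1/8 of the elements)")
```

Although the strided sum performs one eighth of the additions, it is typically far less than eight times faster, because the memory traffic is nearly the same.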
#### Visualization

```mermaid id="qjot8r"
flowchart LR
A[Cache line] --> B[value used]
A --> C[value unused]
A --> D[value unused]
A --> E[value unused]
```
## 12. Worked Examples
### Example 1

How many float64 values fit in a 256-bit SIMD register?

$$ 256 / 64 = 4 $$

### Example 2

If a cache line is 64 bytes and each value is 8 bytes:

$$ 64 / 8 = 8 $$

So one cache miss loads 8 elements.

### Example 3

Explain why NumPy operations outperform Python loops.

NumPy performs operations in compiled code using SIMD registers and contiguous memory, avoiding interpreter overhead.
## 13. Exercises
- What are CPU registers used for?
- How large are general-purpose registers on x86-64?
- What is a cache line?
- Why is sequential memory access faster than random access?
- What is SIMD?
- How many float64 values fit in a 512-bit register?
- What causes conflict misses?
- Why is NumPy faster than Python loops?
## 14. Short Answers
- Temporary storage for CPU operations
- 64 bits
- Block of memory transferred between cache and RAM
- Cache lines preload nearby values
- Single Instruction Multiple Data processing
- 8
- Multiple addresses mapping to the same cache set
- Vectorized compiled operations and contiguous memory
## 15. Summary
- Registers are the fastest storage in a computer and hold temporary computation data.
- Caches store recently accessed memory to reduce RAM access latency.
- Memory is transferred in cache lines, typically 64 bytes.
- Sequential access benefits from spatial locality.
- Cache associativity affects how memory addresses map to cache sets.
- SIMD registers allow multiple values to be processed in parallel.
- NumPy exploits SIMD and contiguous memory to achieve much higher performance than Python loops.
Understanding registers and cache behavior is essential for writing efficient numerical and high-performance code.