Memory Overview¶
Modern processors can execute billions of instructions per second, but accessing main memory is far slower than executing arithmetic operations. This large gap between CPU speed and memory speed is known as the memory wall.
To bridge this gap, computers use a memory hierarchy: multiple layers of storage with different speeds and capacities. Data moves between these layers automatically, allowing frequently used data to be accessed quickly while still supporting large datasets.
Understanding the memory hierarchy is essential for writing high-performance programs. Many performance differences between algorithms arise not from computation but from how efficiently they access memory.
1. The Memory Hierarchy¶
The memory hierarchy organizes storage into layers with different characteristics.
Higher levels are:
- smaller
- faster
- more expensive
Lower levels are:
- larger
- slower
- cheaper
Typical hierarchy¶
| Level | Size | Latency | Managed By |
|---|---|---|---|
| Registers | ~1 KB | ~0.25 ns | Compiler |
| L1 Cache | ~32 KB | ~1 ns | Hardware |
| L2 Cache | ~256–512 KB | ~4 ns | Hardware |
| L3 Cache | ~4–32 MB | ~12 ns | Hardware |
| RAM | 8–128 GB | ~80–120 ns | Operating system |
| SSD / Disk | 256 GB–4 TB | ~100 µs | Operating system |
Latency increases dramatically as we move down the hierarchy.
For comparison:
CPU register access: ~0.25 ns
RAM access: ~100 ns
SSD access: ~100,000 ns
This means RAM can be hundreds of times slower than CPU operations.
Hierarchy visualization¶
flowchart TD
A[Registers
~1 KB] --> B[L1 Cache
~32 KB]
B --> C[L2 Cache
~256 KB]
C --> D[L3 Cache
~4–32 MB]
D --> E[RAM
8–128 GB]
E --> F[SSD / Disk
256 GB–4 TB]
As capacity increases, speed decreases.
2. How the CPU Accesses Memory¶
When a program requests data, the CPU checks each level of the hierarchy in order.
- registers
- L1 cache
- L2 cache
- L3 cache
- RAM
If the data is not found in a level, the next level must be accessed.
Cache hit vs cache miss¶
| Event | Meaning |
|---|---|
| Cache hit | Data found in cache |
| Cache miss | Data must be fetched from lower level |
A cache miss is expensive because data must be copied from a slower layer.
Data flow visualization¶
flowchart LR
CPU --> L1
L1 --> L2
L2 --> L3
L3 --> RAM
If the requested data is found in L1, the access completes immediately.
If not, the hardware fetches the data from lower levels.
3. Cache Lines¶
Caches do not load individual bytes. Instead, they load blocks of memory called cache lines.
Typical cache line size:
64 bytes
When a program accesses a single byte, the entire 64-byte block containing that byte is loaded into cache.
Example¶
Suppose the program accesses:
address 100
The CPU loads the entire block:
64-byte block containing addresses 64–127
This means nearby data becomes available for free.
Visualization¶
flowchart LR
A[Memory block
64 bytes] --> B[Cache line]
B --> C[CPU access]
4. Locality of Reference¶
The memory hierarchy works because programs exhibit locality.
Two important patterns appear in most programs.
Temporal locality¶
Recently accessed data is likely to be used again soon.
Example:
for i in range(1000):
total += x
The variable x is reused repeatedly.
Spatial locality¶
Data near recently accessed memory is likely to be accessed soon.
Example:
for i in range(len(arr)):
total += arr[i]
Array elements are stored next to each other.
Because cache lines load neighboring data, sequential access is very efficient.
Locality visualization¶
flowchart LR
A[Access memory] --> B[Nearby data loaded]
B --> C[Next accesses hit cache]
5. Why Sequential Access Is Fast¶
Consider an array of float64 values.
Each value uses:
8 bytes
Since cache lines are 64 bytes, each cache line contains:
8 float64 values
When one value is accessed, the other seven values are already loaded.
Therefore sequential access results in many cache hits.
Example¶
Access pattern:
arr[0], arr[1], arr[2], arr[3]
First access:
cache miss
Remaining accesses:
cache hits
Visualization¶
flowchart LR
A[arr[0]] --> B[load cache line]
B --> C[arr[1]]
B --> D[arr[2]]
B --> E[arr[3]]
6. Python Memory Layout¶
Python data structures often store objects indirectly, which affects memory performance.
Python integers¶
Each integer is a full Python object stored on the heap.
Example:
import sys
x = 42
print(sys.getsizeof(x))
Typical result:
28 bytes
This is much larger than a simple 4-byte or 8-byte integer.
Python lists¶
A list stores pointers to objects, not the objects themselves.
Example:
lst = [1, 2, 3]
Memory layout:
list → pointer → int object
This means elements are scattered throughout memory.
Visualization¶
flowchart LR
A[List] --> B[Pointer]
A --> C[Pointer]
A --> D[Pointer]
B --> E[Int object]
C --> F[Int object]
D --> G[Int object]
This layout results in poor spatial locality.
7. NumPy Arrays and Contiguous Memory¶
NumPy arrays store raw numeric values in contiguous memory.
Example:
import numpy as np
arr = np.array([1, 2, 3], dtype=np.int64)
print(arr.nbytes)
Output:
24
Memory layout:
[1][2][3]
Each element occupies exactly 8 bytes.
Visualization¶
flowchart LR
A[NumPy array] --> B[1]
B --> C[2]
C --> D[3]
Because the values are stored next to each other, NumPy achieves excellent spatial locality.
This is one of the primary reasons NumPy operations are dramatically faster than equivalent Python loops.
8. Measuring Memory Bandwidth¶
The effect of the memory hierarchy can be observed experimentally.
When arrays are small enough to fit in cache, operations are very fast.
As array size increases and exceeds cache capacity, performance drops.
Example experiment¶
import numpy as np
import time
def measure_bandwidth(size_mb):
n = size_mb * 1024 * 1024 // 8
arr = np.random.rand(n)
_ = np.sum(arr) # warm up
start = time.perf_counter()
for _ in range(10):
_ = np.sum(arr)
elapsed = time.perf_counter() - start
bandwidth = (n * 8 * 10) / elapsed / 1e9
print(f"{size_mb:6.2f} MB: {bandwidth:.1f} GB/s")
for size in [0.01, 0.1, 1, 10, 100]:
measure_bandwidth(size)
Typical observation:
- small arrays → high bandwidth (cache)
- large arrays → lower bandwidth (RAM)
9. Memory Latency vs Bandwidth¶
Two important performance metrics describe memory.
Latency¶
Latency measures how long it takes to access memory.
Example:
RAM latency ≈ 100 ns
This is the delay before data begins to arrive.
Bandwidth¶
Bandwidth measures how quickly large amounts of data can be transferred.
Example:
Memory bandwidth ≈ 30–80 GB/s
Bandwidth determines how quickly large arrays can be processed.
Visualization¶
flowchart LR
A[Latency] --> B[Delay before first byte]
C[Bandwidth] --> D[Rate of data transfer]
10. Performance Implications¶
Understanding memory behavior can dramatically improve program performance.
Prefer contiguous data structures¶
Good:
numpy arrays
Poor:
lists of objects
Access memory sequentially¶
Sequential access benefits from spatial locality.
Random access defeats caching.
Avoid unnecessary allocations¶
Frequent memory allocation can reduce cache efficiency.
Vectorized operations¶
NumPy operations operate on contiguous memory blocks, maximizing cache efficiency.
11. Worked Examples¶
Example 1¶
Explain why NumPy is faster than Python lists for numerical loops.
NumPy stores values contiguously, allowing efficient cache usage and vectorized instructions.
Example 2¶
If a cache line is 64 bytes and each value is 8 bytes, how many values fit in one cache line?
[ 64 / 8 = 8 ]
Example 3¶
Explain the performance drop when arrays exceed cache size.
When arrays no longer fit in cache, accesses must fetch data from RAM, which is much slower.
12. Exercises¶
- What is the purpose of the memory hierarchy?
- What is a cache hit?
- What is a cache miss?
- What is a cache line?
- What is temporal locality?
- What is spatial locality?
- Why are NumPy arrays faster than Python lists for numerical work?
- How many
float64values fit in a 64-byte cache line?
13. Short Answers¶
- To balance speed and capacity of memory
- Data found in cache
- Data must be fetched from lower memory
- Block of memory transferred between cache and RAM
- Recently used data reused again
- Nearby memory accessed together
- Contiguous memory improves cache efficiency
- Eight
14. Summary¶
- The memory hierarchy organizes storage into layers of different speeds and capacities.
- Registers and caches are extremely fast but small; RAM and disk are large but slower.
- Data moves between levels in cache lines, typically 64 bytes.
- Programs benefit from temporal locality and spatial locality.
- Sequential memory access is faster because cache lines preload nearby data.
- Python objects are scattered in memory, reducing cache efficiency.
- NumPy arrays store contiguous data, enabling much faster numerical computation.
Understanding memory hierarchy and data layout is crucial for designing high-performance programs and efficient numerical algorithms.