GPU Architecture
Graphics Processing Units (GPUs) are specialized processors designed to execute massively parallel workloads.
They power many modern computational tasks including:
- deep learning
- scientific simulations
- computer graphics
- large-scale numerical computation
While CPUs prioritize low latency for individual threads, GPUs prioritize high throughput across thousands of threads.
Instead of executing a few complex threads quickly, GPUs execute many lightweight threads simultaneously.
This design is particularly effective for workloads exhibiting data parallelism, where the same operation must be applied to many elements.
1. CPU vs GPU Design Philosophy
CPUs and GPUs are built with fundamentally different architectural goals.
CPU
CPUs optimize:
- low-latency execution
- complex control flow
- sequential performance
Typical features include:
- few powerful cores
- large caches
- complex out-of-order execution
GPU
GPUs optimize:
- extremely high throughput
- massive thread-level parallelism
- predictable memory access patterns
Typical features include:
- thousands of simple cores
- small caches
- simple in-order execution
- hardware thread scheduling
CPU vs GPU architecture
flowchart LR
subgraph CPU["CPU: Few Complex Cores"]
C1[Core]
C2[Core]
C3[Core]
C4[Core]
end
subgraph GPU["GPU: Many Simple Cores"]
G1[Core]
G2[Core]
G3[Core]
G4[Core]
G5[Core]
G6[Core]
G7[Core]
G8[Core]
end
Comparison
| Property | CPU | GPU |
|---|---|---|
| Cores | few (4–64) | thousands |
| Core complexity | high | simple |
| Execution model | out-of-order | mostly in-order |
| Optimization | latency | throughput |
| Best workloads | branching logic | data parallelism |
2. System Architecture
GPUs operate as accelerators attached to a CPU system.
The CPU orchestrates execution while the GPU performs large parallel computations.
System overview
flowchart LR
RAM[System RAM]
CPU[CPU]
BUS[PCIe / NVLink]
GPU[GPU]
VRAM[GPU Memory]
RAM <--> CPU
CPU <--> BUS
BUS <--> GPU
GPU <--> VRAM
Execution flow:
- CPU loads data into system RAM
- Data is copied from system RAM to GPU memory over PCIe or NVLink
- CPU launches a GPU kernel
- GPU executes the kernel across many parallel threads
- Results are copied back to system RAM when the CPU needs them
Thus the conceptual model is:
CPU = controller
GPU = accelerator
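A rough cost model helps show why the copy steps in this flow matter. The sketch below is only an illustration: the link bandwidth and compute rate are assumed placeholder values, not measurements of any particular bus or GPU, and the function names are invented for this example.

```python
def transfer_seconds(num_bytes, link_bytes_per_s):
    """Time to move num_bytes across the CPU-GPU link, ignoring fixed latency."""
    return num_bytes / link_bytes_per_s


def compute_seconds(num_flops, flops_per_s):
    """Time to execute num_flops at a sustained rate of flops_per_s."""
    return num_flops / flops_per_s


# Assumed, illustrative hardware numbers (not taken from any datasheet).
LINK_BYTES_PER_S = 32e9   # hypothetical CPU-GPU link bandwidth
GPU_FLOPS_PER_S = 10e12   # hypothetical sustained GPU compute rate

n = 100_000_000           # elements in a float32 vector
bytes_moved = 2 * n * 4   # copy the input to the GPU, copy the result back
flops = n                 # one operation per element, e.g. scaling by a constant

t_copy = transfer_seconds(bytes_moved, LINK_BYTES_PER_S)
t_compute = compute_seconds(flops, GPU_FLOPS_PER_S)

print(f"copy:    {t_copy * 1e3:.2f} ms")
print(f"compute: {t_compute * 1e3:.4f} ms")
```

Under these assumptions the two copies take far longer than the arithmetic, which is why data is usually kept in GPU memory across several kernel launches.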
3. Kernel Execution Model
GPU programs run functions called kernels.
A kernel executes many parallel threads organized in a hierarchical structure.
Execution hierarchy
flowchart TD
Kernel --> Grid
Grid --> Block1[Thread Block]
Grid --> Block2[Thread Block]
Grid --> Block3[Thread Block]
Block1 --> WarpA
Block1 --> WarpB
WarpA --> ThreadsA[32 Threads]
WarpB --> ThreadsB[32 Threads]
Hierarchy:
Kernel
└ Grid
  └ Thread Blocks
    └ Warps
      └ Threads
Key rules:
- Each thread block executes on a single SM
- Threads in a block can cooperate through shared memory and synchronize with one another
- Warps are the hardware scheduling unit
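The hierarchy can be made concrete with a little index arithmetic. The following sketch models a one-dimensional launch in plain Python; the helper names `launch_shape` and `global_index` are hypothetical, and the index formula mirrors the common CUDA idiom `blockIdx.x * blockDim.x + threadIdx.x`.

```python
import math

WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def launch_shape(num_elements, threads_per_block):
    """Return (blocks in the grid, warps per block) for a 1-D launch."""
    blocks = math.ceil(num_elements / threads_per_block)
    warps_per_block = math.ceil(threads_per_block / WARP_SIZE)
    return blocks, warps_per_block


def global_index(block_idx, block_dim, thread_idx):
    """Element index handled by one thread of a 1-D grid."""
    return block_idx * block_dim + thread_idx


blocks, warps_per_block = launch_shape(num_elements=1000, threads_per_block=256)
print(blocks, warps_per_block)

# The last block is only partly used, so a kernel needs a bounds check.
last_thread = global_index(blocks - 1, 256, 255)
print(last_thread, last_thread < 1000)
```

Because the grid is rounded up to whole blocks, some threads in the final block fall past the end of the data and must do nothing.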
4. Streaming Multiprocessors (SM)
The fundamental compute unit of a GPU is the Streaming Multiprocessor (SM).
A GPU contains many SMs that operate independently.
SM architecture
flowchart TB
SM[Streaming Multiprocessor]
SM --> WS0[Warp Scheduler]
SM --> WS1[Warp Scheduler]
SM --> ALU[CUDA Cores]
SM --> LSU[Load/Store Units]
SM --> SFU[Special Function Units]
SM --> TC[Tensor Cores]
SM --> RF[Register File]
SM --> SHMEM[Shared Memory / L1]
An SM manages many active threads simultaneously.
Typical hardware inside an SM:
- warp schedulers
- CUDA cores (scalar ALUs)
- tensor cores
- load/store units
- register file
- shared memory
5. Warps and SIMT Execution
GPU threads execute in groups called warps.
On NVIDIA GPUs, a warp contains 32 threads.
Warps execute using the SIMT (Single Instruction, Multiple Threads) model.
At each step, all active threads in a warp execute the same instruction but operate on different data.
SIMT model
flowchart TD
Warp --> Instruction
Instruction --> Thread1
Instruction --> Thread2
Instruction --> Thread3
Each thread maintains its own:
- registers
- memory addresses
- control state
6. Warp Divergence
Performance suffers when threads within a warp take different branches.
Example:
if condition:
    path_A
else:
    path_B
When divergence occurs, the warp executes both paths one after the other, masking off the threads that did not take the path currently running.
Warp divergence
flowchart TD
Warp --> Branch
Branch --> PathA
Branch --> PathB
PathA --> Reconverge
PathB --> Reconverge
This reduces effective parallelism.
Thus GPUs perform best when threads follow similar execution paths.
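The cost of divergence can be illustrated with a toy model that counts how many instruction issue slots a warp spends on an if/else. This is a simplified sketch, not a description of any real scheduler: the function name `issue_slots` and the per-path instruction counts are assumptions made for the example.

```python
WARP_SIZE = 32


def issue_slots(conditions, cost_a, cost_b):
    """Count instruction issue slots a warp spends on an if/else.

    conditions: one boolean per thread (True takes path A, False takes path B).
    cost_a, cost_b: number of instructions in each path.
    A path is issued only if at least one thread takes it; threads on the
    other path are masked off while it runs.
    """
    assert len(conditions) == WARP_SIZE
    slots = 0
    if any(conditions):        # at least one thread takes path A
        slots += cost_a
    if not all(conditions):    # at least one thread takes path B
        slots += cost_b
    return slots


uniform = [True] * WARP_SIZE                              # every thread takes path A
divergent = [lane % 2 == 0 for lane in range(WARP_SIZE)]  # lanes alternate paths

print(issue_slots(uniform, cost_a=10, cost_b=10))
print(issue_slots(divergent, cost_a=10, cost_b=10))
```

In this model the divergent warp needs twice as many issue slots as the uniform one, even though each thread still executes only one path.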
7. Latency Hiding Through Concurrency
Global memory accesses may take hundreds of cycles.
Instead of relying on large caches, GPUs hide this latency by switching between warps.
Warp scheduling
flowchart LR
Warp1[Warp waiting for memory]
Warp2[Ready warp]
Warp3[Ready warp]
Scheduler --> Warp2
Scheduler --> Warp3
If one warp stalls on memory, the scheduler issues instructions from another ready warp.
Switching has effectively no context-switch cost because every resident warp keeps its registers in the SM's register file.
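A back-of-the-envelope estimate shows how many warps a scheduler needs to stay busy. The sketch below assumes a deliberately simple model: every warp alternates between a fixed number of cycles of independent work and one memory request with a fixed latency. Both numbers and the function name are illustrative assumptions.

```python
import math


def warps_to_hide_latency(latency_cycles, work_cycles_per_warp):
    """Warps needed so that some warp is always ready to issue.

    While one warp waits latency_cycles for memory, the other warps must
    supply that many cycles of work, work_cycles_per_warp at a time.
    """
    return 1 + math.ceil(latency_cycles / work_cycles_per_warp)


# Assumed values: 400-cycle memory latency, varying work between loads.
print(warps_to_hide_latency(400, 20))
print(warps_to_hide_latency(400, 100))
```

Kernels that do more arithmetic between memory requests need fewer resident warps to cover the same latency.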
Occupancy
Occupancy is the ratio of active warps on an SM to the maximum number of warps the SM supports.
High occupancy gives the scheduler more ready warps to switch to, which helps hide memory latency.
Occupancy depends on:
- registers per thread
- shared memory usage
- threads per block
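How these three resources interact can be sketched with a simplified occupancy calculator, expressing occupancy as a fraction of the SM's maximum warp count. The per-SM limits below are assumed values chosen for illustration; real limits differ between architectures, and real hardware also rounds allocations to fixed granularities, which this sketch ignores.

```python
WARP_SIZE = 32

# Assumed per-SM limits for illustration only; real values vary by architecture.
SM_LIMITS = {
    "max_warps": 64,
    "max_blocks": 32,
    "registers": 65536,
    "shared_mem_bytes": 96 * 1024,
}


def occupancy(threads_per_block, regs_per_thread, shared_mem_per_block,
              limits=SM_LIMITS):
    """Fraction of the SM's warp slots that can be filled by this kernel."""
    warps_per_block = -(-threads_per_block // WARP_SIZE)  # ceiling division

    # Number of blocks that fit under each resource limit.
    by_warps = limits["max_warps"] // warps_per_block
    by_regs = limits["registers"] // (regs_per_thread * warps_per_block * WARP_SIZE)
    if shared_mem_per_block > 0:
        by_smem = limits["shared_mem_bytes"] // shared_mem_per_block
    else:
        by_smem = limits["max_blocks"]

    # The scarcest resource decides how many blocks are resident.
    blocks = min(limits["max_blocks"], by_warps, by_regs, by_smem)
    return blocks * warps_per_block / limits["max_warps"]


print(occupancy(256, regs_per_thread=32, shared_mem_per_block=0))
print(occupancy(256, regs_per_thread=64, shared_mem_per_block=0))
print(occupancy(256, regs_per_thread=32, shared_mem_per_block=48 * 1024))
```

With these assumed limits, doubling the registers per thread halves occupancy, and a large shared memory allocation per block reduces it further.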
8. GPU Memory Hierarchy
GPUs use a memory hierarchy similar to that of CPUs but optimized for throughput.
GPU memory hierarchy
flowchart TB
Registers --> Shared
Shared --> L2
L2 --> Global
Global --> Host
Memory types
| Memory | Scope | Approximate latency |
|---|---|---|
| Registers | per thread | ~1 cycle |
| Shared memory | per block | ~20–30 cycles |
| L2 cache | whole GPU | ~200 cycles |
| Global memory | whole GPU (VRAM) | ~400–800 cycles |
These figures are order-of-magnitude estimates; exact latencies vary by architecture.
Additional specialized memory types include:
- constant memory
- texture memory
- local memory
9. Memory Coalescing
GPU memory bandwidth is maximized when threads in a warp access contiguous addresses.
Coalesced access
flowchart LR
T1 --> Addr0
T2 --> Addr1
T3 --> Addr2
T4 --> Addr3
These accesses can be combined into a single memory transaction.
Uncoalesced access
flowchart LR
T1 --> Addr100
T2 --> Addr3
T3 --> Addr900
This requires multiple memory transactions and reduces performance.
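The difference between the two patterns can be quantified by counting how many aligned memory segments one warp-wide load touches. The sketch below assumes a 128-byte segment size for illustration; actual transaction and sector sizes depend on the GPU generation, and the function name is hypothetical.

```python
WARP_SIZE = 32
ELEMENT_BYTES = 4  # float32


def memory_transactions(byte_addresses, segment_bytes=128):
    """Number of distinct aligned segments touched by one warp-wide load.

    segment_bytes is an assumed transaction size, not a hardware constant.
    """
    return len({addr // segment_bytes for addr in byte_addresses})


# Thread i reads element i: addresses are contiguous.
contiguous = [lane * ELEMENT_BYTES for lane in range(WARP_SIZE)]

# Thread i reads element 32 * i: addresses are 128 bytes apart.
strided = [lane * 32 * ELEMENT_BYTES for lane in range(WARP_SIZE)]

print(memory_transactions(contiguous))
print(memory_transactions(strided))
```

In this model the contiguous pattern is served by one segment, while the strided pattern touches a different segment for every thread and uses only 4 of the 128 bytes fetched each time.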
10. Tensor Cores
Modern GPUs contain tensor cores, specialized units designed for matrix multiplication.
They perform the fused multiply-accumulate operation
\[ D = A \times B + C \]
on small matrix tiles in a single instruction.
Tensor core operation
flowchart LR
A[Matrix Tile A] --> TC[Tensor Core]
B[Matrix Tile B] --> TC
C[Accumulator Tile C] --> TC
TC --> D[Result Tile D]
Tensor cores dramatically accelerate deep learning workloads.
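The arithmetic performed by the fused operation can be written out on a tiny tile. This plain-Python sketch only shows what is computed, not how the hardware computes it: real tensor cores work on fixed tile shapes and specific data types that depend on the architecture, and the function name `tile_mma` is hypothetical.

```python
def tile_mma(a, b, c):
    """Return D = A x B + C for small square tiles given as lists of rows."""
    n = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(n)) + c[i][j] for j in range(n)]
        for i in range(n)
    ]


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c = [[1, 1], [1, 1]]   # accumulator tile

d = tile_mma(a, b, c)
print(d)  # [[20, 23], [44, 51]]
```

A large matrix product is built from many such tile operations, with the result of one step fed back in as the accumulator of the next.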
11. Tiling for High Performance
GPU kernels often use tiling to reduce global memory access.
Instead of repeatedly reading data from global memory, blocks load tiles into shared memory.
Tiling strategy
flowchart LR
GlobalMemory --> TileA
GlobalMemory --> TileB
TileA --> SharedMemory
TileB --> SharedMemory
SharedMemory --> Threads
This allows many threads to reuse the same data.
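The effect of this reuse can be measured with a small simulation that counts reads from "global memory" in a naive and a tiled matrix multiplication. The code below is a sequential Python model under simplifying assumptions: local lists stand in for shared memory, loops stand in for thread blocks, and the matrix size must be a multiple of the tile size.

```python
def matmul_naive(a, b, n):
    """Every multiply reads both of its operands from global memory."""
    loads = 0
    c = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
                loads += 2
    return c, loads


def matmul_tiled(a, b, n, tile):
    """Each block copies one tile of A and one of B, then reuses them."""
    assert n % tile == 0
    loads = 0
    c = [[0] * n for _ in range(n)]
    for bi in range(0, n, tile):              # block row
        for bj in range(0, n, tile):          # block column
            for bk in range(0, n, tile):      # walk along the shared dimension
                # Cooperative load: each tile element is read from global memory once.
                tile_a = [[a[bi + i][bk + k] for k in range(tile)] for i in range(tile)]
                tile_b = [[b[bk + k][bj + j] for j in range(tile)] for k in range(tile)]
                loads += 2 * tile * tile
                # All threads of the block compute from the local tiles.
                for i in range(tile):
                    for j in range(tile):
                        for k in range(tile):
                            c[bi + i][bj + j] += tile_a[i][k] * tile_b[k][j]
    return c, loads


n, tile = 8, 4
a = [[i + j for j in range(n)] for i in range(n)]
b = [[i - j for j in range(n)] for i in range(n)]

c_naive, loads_naive = matmul_naive(a, b, n)
c_tiled, loads_tiled = matmul_tiled(a, b, n, tile)

print(c_naive == c_tiled)
print(loads_naive, loads_tiled)
```

Both versions produce the same result, but in this model the tiled version reads global memory a factor of `tile` less often.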
12. Arithmetic Intensity and the Roofline Model
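GPU performance depends on arithmetic intensity, the number of floating-point operations performed per byte of memory traffic:
\[ \text{arithmetic intensity} = \frac{\text{FLOPs}}{\text{bytes transferred}} \]
Programs with:
- low arithmetic intensity → memory bound
- high arithmetic intensity → compute bound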
Roofline model
Performance
(FLOPS)
│
│           ________ compute limit
│          /
│         /
│        /
│_______/
│
└────────────────
  arithmetic intensity
The goal of GPU optimization is to increase data reuse, pushing kernels toward the compute limit.
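The roofline bound is simple enough to compute directly: attainable performance is the smaller of the compute limit and the memory bandwidth multiplied by the arithmetic intensity (operations per byte moved). The sketch below uses assumed peak values that do not describe any specific GPU, and it assumes ideal data reuse for the matrix multiplication case.

```python
def attainable_flops(intensity, peak_flops, bandwidth_bytes_per_s):
    """Roofline bound: limited by compute or by memory traffic, whichever is lower."""
    return min(peak_flops, intensity * bandwidth_bytes_per_s)


# Assumed, illustrative hardware limits.
PEAK_FLOPS = 10e12        # hypothetical compute limit, FLOP/s
BANDWIDTH = 1e12          # hypothetical memory bandwidth, bytes/s

# Intensity at which a kernel stops being memory bound.
ridge_point = PEAK_FLOPS / BANDWIDTH

# float32 vector add: 1 FLOP per 12 bytes (two loads and one store).
vector_add = 1 / 12

# float32 n x n matrix multiply with ideal reuse:
# 2 * n^3 FLOPs over three n x n matrices of 4-byte elements.
n = 4096
matmul = (2 * n**3) / (3 * n * n * 4)

for name, intensity in [("vector add", vector_add), ("matmul", matmul)]:
    bound = attainable_flops(intensity, PEAK_FLOPS, BANDWIDTH)
    kind = "memory bound" if intensity < ridge_point else "compute bound"
    print(f"{name}: {intensity:.2f} FLOP/byte, {bound / 1e12:.2f} TFLOP/s, {kind}")
```

Under these assumptions the vector add sits far below the compute limit, while the matrix multiplication can reach it only if tiles are actually reused as the ideal count supposes.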
13. Why GPUs Excel at Deep Learning
Neural networks rely heavily on:
- matrix multiplication
- convolution
- large batch processing
These operations exhibit:
- massive data parallelism
- high arithmetic intensity
Thus GPU hardware matches deep learning workloads extremely well.
14. Example GPU Usage
CuPy example
import cupy as cp
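a = cp.random.rand(10000, 10000)  # 10000 x 10000 float64 matrix allocated in GPU memory
b = cp.random.rand(10000, 10000)
c = cp.dot(a, b)  # matrix product computed on the GPU
The arrays are created in GPU memory and the matrix product runs on the GPU; nothing is copied to the host unless requested, for example with cp.asnumpy.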
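PyTorch example
import torch
device = torch.device("cuda")  # requires a CUDA-capable GPU
x = torch.randn(4096, 4096, device=device)  # tensors are created directly in GPU memory
y = torch.randn(4096, 4096, device=device)
z = torch.mm(x, y)  # matrix product computed on the GPU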
15. Key GPU Optimization Principles
GPU performance depends on several factors.
| Factor | Impact |
|---|---|
| Occupancy | more active warps hide latency |
| Memory coalescing | maximizes bandwidth |
| Warp divergence | reduces parallel efficiency |
| Arithmetic intensity | determines compute vs memory bound |
| CPU-GPU transfers | may dominate runtime |
16. Summary
| Concept | Explanation |
|---|---|
| GPU | massively parallel processor |
| SM | primary compute unit |
| Kernel | function executed on GPU |
| Warp | 32-thread execution group |
| SIMT | single instruction across many threads |
| Occupancy | ratio of active warps to the SM maximum |
| Memory coalescing | contiguous memory access |
| Tensor cores | specialized matrix units |
| Roofline model | compute vs memory limits |
GPUs achieve extraordinary performance by combining:
- thousands of lightweight threads
- massive parallel execution
- high memory bandwidth
- latency hiding through concurrency
These architectural principles explain why GPUs excel at deep learning, scientific computing, and large-scale numerical workloads.