GPU Architecture¶

Mental Model

A CPU is like a team of a few expert chefs who can each prepare any dish quickly; a GPU is like a kitchen with thousands of line cooks, each handling one simple step. When the same operation must be applied to millions of data points -- as in deep learning or numerical simulation -- the GPU's army of simple cores overwhelms the CPU's small team of complex ones.

Graphics Processing Units (GPUs) are specialized processors designed to execute massively parallel workloads.

They power many modern computational tasks including:

deep learning
scientific simulations
computer graphics
large-scale numerical computation

While CPUs prioritize low latency for individual threads, GPUs prioritize high throughput across thousands of threads.

Instead of executing a few complex threads quickly, GPUs execute many lightweight threads simultaneously.

This design is particularly effective for workloads exhibiting data parallelism, where the same operation must be applied to many elements.

1. CPU vs GPU Design Philosophy¶

CPUs and GPUs are built with fundamentally different architectural goals.

CPU¶

CPUs optimize:

low-latency execution
complex control flow
sequential performance

Typical features include:

few powerful cores
large caches
complex out-of-order execution

GPU¶

GPUs optimize:

extremely high throughput
massive thread-level parallelism
predictable memory access patterns

Typical features include:

thousands of simple cores
small caches
simple in-order execution
hardware thread scheduling

CPU vs GPU architecture¶

flowchart LR
    subgraph CPU["CPU: Few Complex Cores"]
        C1[Core]
        C2[Core]
        C3[Core]
        C4[Core]
    end

    subgraph GPU["GPU: Many Simple Cores"]
        G1[Core]
        G2[Core]
        G3[Core]
        G4[Core]
        G5[Core]
        G6[Core]
        G7[Core]
        G8[Core]
    end

Comparison¶

Property	CPU	GPU
Cores	few (4–64)	thousands
Core complexity	high	simple
Execution model	out-of-order	mostly in-order
Optimization	latency	throughput
Best workloads	branching logic	data parallelism

2. System Architecture¶

GPUs operate as accelerators attached to a CPU system.

The CPU orchestrates execution while the GPU performs large parallel computations.

System overview¶

flowchart LR
    RAM[System RAM]
    CPU[CPU]
    BUS[PCIe / NVLink]
    GPU[GPU]
    VRAM[GPU Memory]

    RAM <--> CPU
    CPU <--> BUS
    BUS <--> GPU
    GPU <--> VRAM

Execution flow:

CPU loads data into system RAM
Data is copied to GPU memory
CPU launches a GPU kernel
GPU executes parallel computation
Results may be copied back to CPU

Thus the conceptual model is:

CPU = controller GPU = accelerator

3. Kernel Execution Model¶

GPU programs run functions called kernels.

A kernel executes many parallel threads organized in a hierarchical structure.

Execution hierarchy¶

flowchart TD
    Kernel --> Grid
    Grid --> Block1[Thread Block]
    Grid --> Block2[Thread Block]
    Grid --> Block3[Thread Block]

    Block1 --> WarpA
    Block1 --> WarpB

    WarpA --> ThreadsA[32 Threads]
    WarpB --> ThreadsB[32 Threads]

Hierarchy:

Kernel └ Grid └ Thread Blocks └ Warps └ Threads

Key rules:

Thread blocks execute on a single SM
Threads in a block share memory
Warps are the hardware scheduling unit

4. Streaming Multiprocessors (SM)¶

The fundamental compute unit of a GPU is the Streaming Multiprocessor (SM).

A GPU contains many SMs that operate independently.

SM architecture¶

flowchart TB
    SM[Streaming Multiprocessor]

    SM --> WS0[Warp Scheduler]
    SM --> WS1[Warp Scheduler]

    SM --> ALU[CUDA Cores]
    SM --> LSU[Load/Store Units]
    SM --> SFU[Special Function Units]
    SM --> TC[Tensor Cores]

    SM --> RF[Register File]
    SM --> SHMEM[Shared Memory / L1]

An SM manages many active threads simultaneously.

Typical hardware inside an SM:

warp schedulers
CUDA cores (scalar ALUs)
tensor cores
load/store units
register file
shared memory

5. Warps and SIMT Execution¶

GPU threads execute in groups called warps.

A warp contains:

32 threads

Warps execute using the SIMT (Single Instruction Multiple Threads) model.

All threads execute the same instruction but operate on different data.

SIMT model¶

flowchart TD
    Warp --> Instruction
    Instruction --> Thread1
    Instruction --> Thread2
    Instruction --> Thread3

Each thread maintains its own:

registers
memory addresses
control state

6. Warp Divergence¶

Problems occur when threads within a warp follow different branches.

Example:

if condition: path_A else: path_B

When divergence occurs, the warp must execute both paths sequentially.

Warp divergence¶

flowchart TD
    Warp --> Branch

    Branch --> PathA
    Branch --> PathB

    PathA --> Reconverge
    PathB --> Reconverge

This reduces effective parallelism.

Thus GPUs perform best when threads follow similar execution paths.

7. Latency Hiding Through Concurrency¶

GPU memory accesses may take hundreds of cycles.

Instead of relying on large caches, GPUs hide latency by switching between warps.

Warp scheduling¶

flowchart LR
    Warp1[Warp waiting for memory]
    Warp2[Ready warp]
    Warp3[Ready warp]

    Scheduler --> Warp2
    Scheduler --> Warp3

If one warp stalls on memory, another warp immediately runs.

This switching occurs with zero context switch cost.

Occupancy¶

The number of active warps on an SM is called occupancy.

High occupancy allows the GPU to hide memory latency effectively.

Occupancy depends on:

registers per thread
shared memory usage
threads per block

8. GPU Memory Hierarchy¶

GPUs use a memory hierarchy similar to CPUs but optimized for throughput.

GPU memory hierarchy¶

flowchart TB
    Registers --> Shared
    Shared --> L2
    L2 --> Global
    Global --> Host

Memory types¶

Memory	Scope	Latency
Registers	per thread	~1 cycle
Shared memory	per block	~20–30 cycles
L2 cache	whole GPU	~200 cycles
Global memory	VRAM	~400–800 cycles

Additional specialized memory types include:

constant memory
texture memory
local memory

9. Memory Coalescing¶

GPU memory bandwidth is maximized when threads in a warp access contiguous addresses.

Coalesced access¶

flowchart LR
    T1 --> Addr0
    T2 --> Addr1
    T3 --> Addr2
    T4 --> Addr3

These accesses combine into a single memory transaction.

Uncoalesced access¶

flowchart LR
    T1 --> Addr100
    T2 --> Addr3
    T3 --> Addr900

This requires multiple memory transactions and reduces performance.

10. Tensor Cores¶

Modern GPUs contain tensor cores, specialized units designed for matrix multiplication.

They perform the fused operation:

[ D = A \times B + C ]

in a single instruction.

Tensor core operation¶

flowchart LR
    A[Matrix Tile A] --> TC[Tensor Core]
    B[Matrix Tile B] --> TC
    TC --> C[Result Tile]

Tensor cores dramatically accelerate deep learning workloads.

11. Tiling for High Performance¶

GPU kernels often use tiling to reduce global memory access.

Instead of repeatedly reading data from global memory, blocks load tiles into shared memory.

Tiling strategy¶

flowchart LR
    GlobalMemory --> TileA
    GlobalMemory --> TileB

    TileA --> SharedMemory
    TileB --> SharedMemory

    SharedMemory --> Threads

This allows many threads to reuse the same data.

12. Arithmetic Intensity and the Roofline Model¶

GPU performance depends on arithmetic intensity:

\[ \text{Arithmetic Intensity} =========================== \frac{\text{FLOPs}}{\text{bytes transferred}} \]

Programs with:

low arithmetic intensity → memory bound
high arithmetic intensity → compute bound

Roofline model¶

Performance (FLOPS) │ │ ________ compute limit │ / │ / │ / │_______/ │ └──────────────── arithmetic intensity

The goal of GPU optimization is to increase data reuse, pushing kernels toward the compute limit.

13. Why GPUs Excel at Deep Learning¶

Neural networks rely heavily on:

matrix multiplication
convolution
large batch processing

These operations exhibit:

massive data parallelism
high arithmetic intensity

Thus GPU hardware matches deep learning workloads extremely well.

14. Example GPU Usage¶

CuPy example¶

```python import cupy as cp

a = cp.random.rand(10000,10000) b = cp.random.rand(10000,10000)

c = cp.dot(a,b) ```

The computation executes entirely on the GPU.

PyTorch example¶

```python import torch

device = torch.device("cuda")

x = torch.randn(4096,4096,device=device) y = torch.randn(4096,4096,device=device)

z = torch.mm(x,y) ```

15. Key GPU Optimization Principles¶

GPU performance depends on several factors.

Factor	Impact
Occupancy	more active warps hide latency
Memory coalescing	maximizes bandwidth
Warp divergence	reduces parallel efficiency
Arithmetic intensity	determines compute vs memory bound
CPU-GPU transfers	may dominate runtime

16. Summary¶

Concept	Explanation
GPU	massively parallel processor
SM	primary compute unit
Kernel	function executed on GPU
Warp	32-thread execution group
SIMT	single instruction across many threads
Occupancy	number of active warps
Memory coalescing	contiguous memory access
Tensor cores	specialized matrix units
Roofline model	compute vs memory limits

GPUs achieve extraordinary performance by combining:

thousands of lightweight threads
massive parallel execution
high memory bandwidth
latency hiding through concurrency

These architectural principles explain why GPUs excel at deep learning, scientific computing, and large-scale numerical workloads.

Exercises¶

Exercise 1. GPUs and CPUs have fundamentally different design philosophies. A modern CPU has ~8 cores optimized for low latency, while a GPU has ~10,000 CUDA cores optimized for high throughput.

(a) Why does a CPU core have large caches and sophisticated branch prediction, while a GPU core has minimal cache and no branch prediction? (b) For which type of workload is each design better: (i) parsing a complex JSON file, (ii) multiplying two 10,000x10,000 matrices?

Solution to Exercise 1

(a) CPU cores have large caches and branch prediction because they run complex, branchy, serial code (operating systems, compilers, web browsers) where a single thread needs low latency. GPUs have minimal per-core cache because they hide latency through massive parallelism -- instead of caching, they switch to another warp when one stalls on memory. Branch prediction is unnecessary because GPUs execute all paths and mask inactive threads.

(b) (i) CPU -- JSON parsing is a serial, branchy task with data dependencies between tokens. A single CPU core with branch prediction and out-of-order execution handles this efficiently. (ii) GPU -- matrix multiplication is massively data-parallel (each output element is independent) with high arithmetic intensity, perfectly suited for thousands of GPU threads.

Exercise 2. Warp divergence reduces GPU performance. Consider this kernel executed by a warp of 32 threads:

if thread_id % 2 == 0: path_A (10 instructions) else: path_B (10 instructions)

(a) How many instruction cycles does this warp take (assuming no divergence would take 10 cycles)? (b) What if the condition were if thread_id < 16 instead? Would divergence still occur? (c) Why is warp divergence a problem unique to GPU SIMT execution and not an issue for CPU threads?

Solution to Exercise 2

(a) With divergence, the warp takes 20 cycles: first it executes path_A for the even threads (10 cycles, odd threads masked), then path_B for the odd threads (10 cycles, even threads masked). Without divergence, it would take 10 cycles. Divergence halves the effective throughput.

(b) Yes, divergence still occurs -- threads 0-15 take path_A and threads 16-31 take path_B. Any split within a warp causes divergence. However, this pattern is slightly better than alternating because the two groups are contiguous, which may help the hardware reconverge faster.

(c) CPU threads are fully independent -- each has its own instruction pointer and pipeline. Thread 1 can execute path_A while Thread 2 simultaneously executes path_B on different cores. In GPU SIMT, all 32 threads in a warp share a single instruction pointer, so they must all execute the same instruction at the same time. Divergence forces serialization of the divergent paths.

Exercise 3. Memory coalescing is critical for GPU performance. A warp of 32 threads accesses memory:

Pattern A: Thread i accesses address base + i * 4 (sequential, stride-1)
Pattern B: Thread i accesses address base + i * 1024 (strided)
Pattern C: Thread i accesses a random address

(a) Which pattern achieves the best memory bandwidth and why? (b) Approximately how many memory transactions does each pattern require (assuming a 128-byte cache line)? (c) How does this relate to the difference between row-major and column-major matrix storage?

Solution to Exercise 3

(a) Pattern A (sequential) achieves the best bandwidth. 32 threads accessing consecutive 4-byte addresses span 128 bytes, which fits in a single cache line. One memory transaction serves the entire warp.

(b)

Pattern A: 1 transaction (128 bytes covers all 32 * 4 = 128 bytes)
Pattern B: 32 transactions (each thread's address is 1024 bytes apart, requiring separate cache lines)
Pattern C: up to 32 transactions (worst case: every address hits a different cache line)

Pattern A is 32x more bandwidth-efficient than Patterns B and C.

(c) In a row-major matrix, accessing a row is sequential (stride-1, coalesced). Accessing a column requires stride-N jumps (uncoalesced). In column-major, it is reversed. Choosing the right storage order for your access pattern is critical for GPU performance. This is why transposing a matrix before a GPU kernel can dramatically improve performance.