GPU Architecture¶

Graphics Processing Units (GPUs) are specialized processors designed to execute massively parallel workloads.

They power many modern computational tasks including:

deep learning
scientific simulations
computer graphics
large-scale numerical computation

While CPUs prioritize low latency for individual threads, GPUs prioritize high throughput across thousands of threads.

Instead of executing a few complex threads quickly, GPUs execute many lightweight threads simultaneously.

This design is particularly effective for workloads exhibiting data parallelism, where the same operation must be applied to many elements.

1. CPU vs GPU Design Philosophy¶

CPUs and GPUs are built with fundamentally different architectural goals.

CPU¶

CPUs optimize:

low-latency execution
complex control flow
sequential performance

Typical features include:

few powerful cores
large caches
complex out-of-order execution

GPU¶

GPUs optimize:

extremely high throughput
massive thread-level parallelism
predictable memory access patterns

Typical features include:

thousands of simple cores
small caches
simple in-order execution
hardware thread scheduling

CPU vs GPU architecture¶

flowchart LR
    subgraph CPU["CPU: Few Complex Cores"]
        C1[Core]
        C2[Core]
        C3[Core]
        C4[Core]
    end

    subgraph GPU["GPU: Many Simple Cores"]
        G1[Core]
        G2[Core]
        G3[Core]
        G4[Core]
        G5[Core]
        G6[Core]
        G7[Core]
        G8[Core]
    end

Comparison¶

Property	CPU	GPU
Cores	few (4–64)	thousands
Core complexity	high	simple
Execution model	out-of-order	mostly in-order
Optimization	latency	throughput
Best workloads	branching logic	data parallelism

2. System Architecture¶

GPUs operate as accelerators attached to a CPU system.

The CPU orchestrates execution while the GPU performs large parallel computations.

System overview¶

flowchart LR
    RAM[System RAM]
    CPU[CPU]
    BUS[PCIe / NVLink]
    GPU[GPU]
    VRAM[GPU Memory]

    RAM <--> CPU
    CPU <--> BUS
    BUS <--> GPU
    GPU <--> VRAM

Execution flow:

CPU loads data into system RAM
Data is copied to GPU memory
CPU launches a GPU kernel
GPU executes parallel computation
Results may be copied back to CPU

Thus the conceptual model is:

CPU = controller
GPU = accelerator

3. Kernel Execution Model¶

GPU programs run functions called kernels.

A kernel executes many parallel threads organized in a hierarchical structure.

Execution hierarchy¶

flowchart TD
    Kernel --> Grid
    Grid --> Block1[Thread Block]
    Grid --> Block2[Thread Block]
    Grid --> Block3[Thread Block]

    Block1 --> WarpA
    Block1 --> WarpB

    WarpA --> ThreadsA[32 Threads]
    WarpB --> ThreadsB[32 Threads]

Hierarchy:

Kernel
 └ Grid
     └ Thread Blocks
         └ Warps
             └ Threads

Key rules:

Thread blocks execute on a single SM
Threads in a block share memory
Warps are the hardware scheduling unit

4. Streaming Multiprocessors (SM)¶

The fundamental compute unit of a GPU is the Streaming Multiprocessor (SM).

A GPU contains many SMs that operate independently.

SM architecture¶

flowchart TB
    SM[Streaming Multiprocessor]

    SM --> WS0[Warp Scheduler]
    SM --> WS1[Warp Scheduler]

    SM --> ALU[CUDA Cores]
    SM --> LSU[Load/Store Units]
    SM --> SFU[Special Function Units]
    SM --> TC[Tensor Cores]

    SM --> RF[Register File]
    SM --> SHMEM[Shared Memory / L1]

An SM manages many active threads simultaneously.

Typical hardware inside an SM:

warp schedulers
CUDA cores (scalar ALUs)
tensor cores
load/store units
register file
shared memory

5. Warps and SIMT Execution¶

GPU threads execute in groups called warps.

A warp contains:

32 threads

Warps execute using the SIMT (Single Instruction Multiple Threads) model.

All threads execute the same instruction but operate on different data.

SIMT model¶

flowchart TD
    Warp --> Instruction
    Instruction --> Thread1
    Instruction --> Thread2
    Instruction --> Thread3

Each thread maintains its own:

registers
memory addresses
control state

6. Warp Divergence¶

Problems occur when threads within a warp follow different branches.

Example:

if condition:
    path_A
else:
    path_B

When divergence occurs, the warp must execute both paths sequentially.

Warp divergence¶

flowchart TD
    Warp --> Branch

    Branch --> PathA
    Branch --> PathB

    PathA --> Reconverge
    PathB --> Reconverge

This reduces effective parallelism.

Thus GPUs perform best when threads follow similar execution paths.

7. Latency Hiding Through Concurrency¶

GPU memory accesses may take hundreds of cycles.

Instead of relying on large caches, GPUs hide latency by switching between warps.

Warp scheduling¶

flowchart LR
    Warp1[Warp waiting for memory]
    Warp2[Ready warp]
    Warp3[Ready warp]

    Scheduler --> Warp2
    Scheduler --> Warp3

If one warp stalls on memory, another warp immediately runs.

This switching occurs with zero context switch cost.

Occupancy¶

The number of active warps on an SM is called occupancy.

High occupancy allows the GPU to hide memory latency effectively.

Occupancy depends on:

registers per thread
shared memory usage
threads per block

8. GPU Memory Hierarchy¶

GPUs use a memory hierarchy similar to CPUs but optimized for throughput.

GPU memory hierarchy¶

flowchart TB
    Registers --> Shared
    Shared --> L2
    L2 --> Global
    Global --> Host

Memory types¶

Memory	Scope	Latency
Registers	per thread	~1 cycle
Shared memory	per block	~20–30 cycles
L2 cache	whole GPU	~200 cycles
Global memory	VRAM	~400–800 cycles

Additional specialized memory types include:

constant memory
texture memory
local memory

9. Memory Coalescing¶

GPU memory bandwidth is maximized when threads in a warp access contiguous addresses.

Coalesced access¶

flowchart LR
    T1 --> Addr0
    T2 --> Addr1
    T3 --> Addr2
    T4 --> Addr3

These accesses combine into a single memory transaction.

Uncoalesced access¶

flowchart LR
    T1 --> Addr100
    T2 --> Addr3
    T3 --> Addr900

This requires multiple memory transactions and reduces performance.

10. Tensor Cores¶

Modern GPUs contain tensor cores, specialized units designed for matrix multiplication.

They perform the fused operation:

[ D = A \times B + C ]

in a single instruction.

Tensor core operation¶

flowchart LR
    A[Matrix Tile A] --> TC[Tensor Core]
    B[Matrix Tile B] --> TC
    TC --> C[Result Tile]

Tensor cores dramatically accelerate deep learning workloads.

11. Tiling for High Performance¶

GPU kernels often use tiling to reduce global memory access.

Instead of repeatedly reading data from global memory, blocks load tiles into shared memory.

Tiling strategy¶

flowchart LR
    GlobalMemory --> TileA
    GlobalMemory --> TileB

    TileA --> SharedMemory
    TileB --> SharedMemory

    SharedMemory --> Threads

This allows many threads to reuse the same data.

12. Arithmetic Intensity and the Roofline Model¶

GPU performance depends on arithmetic intensity:

\[ \text{Arithmetic Intensity} =========================== \frac{\text{FLOPs}}{\text{bytes transferred}} \]

Programs with:

low arithmetic intensity → memory bound
high arithmetic intensity → compute bound

Roofline model¶

Performance
(FLOPS)
      │
      │           ________  compute limit
      │          /
      │         /
      │        /
      │_______/
      │
      └────────────────
       arithmetic intensity

The goal of GPU optimization is to increase data reuse, pushing kernels toward the compute limit.

13. Why GPUs Excel at Deep Learning¶

Neural networks rely heavily on:

matrix multiplication
convolution
large batch processing

These operations exhibit:

massive data parallelism
high arithmetic intensity

Thus GPU hardware matches deep learning workloads extremely well.

14. Example GPU Usage¶

CuPy example¶

import cupy as cp

a = cp.random.rand(10000,10000)
b = cp.random.rand(10000,10000)

c = cp.dot(a,b)

The computation executes entirely on the GPU.

PyTorch example¶

import torch

device = torch.device("cuda")

x = torch.randn(4096,4096,device=device)
y = torch.randn(4096,4096,device=device)

z = torch.mm(x,y)

15. Key GPU Optimization Principles¶

GPU performance depends on several factors.

Factor	Impact
Occupancy	more active warps hide latency
Memory coalescing	maximizes bandwidth
Warp divergence	reduces parallel efficiency
Arithmetic intensity	determines compute vs memory bound
CPU-GPU transfers	may dominate runtime

16. Summary¶

Concept	Explanation
GPU	massively parallel processor
SM	primary compute unit
Kernel	function executed on GPU
Warp	32-thread execution group
SIMT	single instruction across many threads
Occupancy	number of active warps
Memory coalescing	contiguous memory access
Tensor cores	specialized matrix units
Roofline model	compute vs memory limits

GPUs achieve extraordinary performance by combining:

thousands of lightweight threads
massive parallel execution
high memory bandwidth
latency hiding through concurrency

These architectural principles explain why GPUs excel at deep learning, scientific computing, and large-scale numerical workloads.