GPU Clusters

Mental Model

A GPU cluster is a team of powerful machines, each packed with GPUs, connected by a fast network. The challenge is not raw compute (individual GPUs are fast) but coordination: synchronizing gradients across nodes costs time that grows with cluster size. Scaling efficiency is the fraction of the parallelism you buy that shows up as actual speedup.

Why GPU Clusters?

A single GPU is limited in both memory (8-80 GB) and compute throughput. Large-scale deep learning and scientific computing often need more than that, which means multiple GPUs spread across multiple machines.

```
GPU Scaling Journey:

Single GPU:          Multi-GPU (one machine):       GPU Cluster:
┌────────────┐       ┌────────────────────┐         ┌─────────┐ ┌─────────┐
│    GPU     │   →   │  GPU GPU GPU GPU   │    →    │ 8 GPUs  │ │ 8 GPUs  │
│ (8-80 GB)  │       │  (connected via    │         │ Node 1  │ │ Node 2  │
└────────────┘       │   NVLink/PCIe)     │         └────┬────┘ └────┬────┘
                     └────────────────────┘              │           │
                                                         └─────┬─────┘
                                                               │
                                                       ┌───────┴───────┐
                                                       │    Network    │
                                                       │ (InfiniBand)  │
                                                       └───────────────┘
```

GPU Cluster Architecture

Typical Node Configuration

```
GPU Node (e.g., DGX A100):

┌─────────────────────────────────────────────────────────────┐
│                          GPU Node                           │
│                                                             │
│  ┌───────────────────────────────────────────────────────┐  │
│  │                     8 × A100 GPUs                     │  │
│  │  ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐                      │  │
│  │  │ GPU │═│ GPU │═│ GPU │═│ GPU │                      │  │
│  │  │  0  │ │  1  │ │  2  │ │  3  │   NVLink: 600 GB/s   │  │
│  │  └──╪──┘ └──╪──┘ └──╪──┘ └──╪──┘                      │  │
│  │     ╪═══════╪═══════╪═══════╪                         │  │
│  │  ┌──╪──┐ ┌──╪──┐ ┌──╪──┐ ┌──╪──┐                      │  │
│  │  │ GPU │═│ GPU │═│ GPU │═│ GPU │                      │  │
│  │  │  4  │ │  5  │ │  6  │ │  7  │                      │  │
│  │  └─────┘ └─────┘ └─────┘ └─────┘                      │  │
│  └───────────────────────────────────────────────────────┘  │
│                          │                                  │
│  ┌───────────────────────┴───────────────────────────────┐  │
│  │            2 × CPU (AMD EPYC / Intel Xeon)            │  │
│  │                      1-2 TB RAM                       │  │
│  └───────────────────────────────────────────────────────┘  │
│                          │                                  │
│                   InfiniBand NIC                            │
│                   (200-400 Gbps)                            │
└─────────────────────────────────────────────────────────────┘
```

Cluster Interconnect

```
GPU Cluster Network:

┌──────────┐     ┌──────────┐     ┌──────────┐
│  Node 1  │     │  Node 2  │     │  Node N  │
│  8 GPUs  │     │  8 GPUs  │     │  8 GPUs  │
└────┬─────┘     └────┬─────┘     └────┬─────┘
     │                │                │
     │        InfiniBand Fabric        │
     │       (200-400 Gbps/node)       │
     └────────────────┴────────────────┘

Bandwidth Hierarchy:
  GPU Memory:        ~2000 GB/s
  NVLink (in-node):  ~600 GB/s
  InfiniBand:        ~25-50 GB/s (200-400 Gbps)
  Ethernet:          ~12 GB/s (100 Gbps)
```
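Link rates are usually quoted in gigabits per second, while memory and NVLink figures are quoted in gigabytes per second, so the two need a factor of 8 to compare. The sketch below converts the units and estimates the ideal time to move one set of gradients over each link. The 1-billion-parameter fp16 payload is an assumption chosen for illustration, and real transfers add latency and protocol overhead that this ignores.

```python
def gbps_to_gb_per_s(gigabits_per_second: float) -> float:
    """Convert a link rate in gigabits/s to gigabytes/s (8 bits per byte)."""
    return gigabits_per_second / 8


def transfer_time_ms(payload_gb: float, bandwidth_gb_per_s: float) -> float:
    """Ideal time to move a payload at full link bandwidth, in milliseconds."""
    return payload_gb / bandwidth_gb_per_s * 1000


# Assumption: 1 billion parameters with 2-byte (fp16) gradients = 2 GB payload.
payload_gb = 1_000_000_000 * 2 / 1e9

links_gb_per_s = {
    "GPU memory": 2000.0,
    "NVLink (in-node)": 600.0,
    "InfiniBand 400 Gbps": gbps_to_gb_per_s(400),
    "InfiniBand 200 Gbps": gbps_to_gb_per_s(200),
    "Ethernet 100 Gbps": gbps_to_gb_per_s(100),
}

for name, bandwidth in links_gb_per_s.items():
    time_ms = transfer_time_ms(payload_gb, bandwidth)
    print(f"{name:<22} {bandwidth:>7.1f} GB/s  {time_ms:>7.1f} ms")
```

The same payload that crosses NVLink in a few milliseconds takes tens of milliseconds over the inter-node network, which is why inter-node synchronization dominates communication cost.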

Distributed Training Strategies

Data Parallelism

Each GPU holds a full copy of the model and processes a different slice of the data:

```
Data Parallelism:

                ┌──────────────┐
                │ Full Dataset │
                └──────┬───────┘
       ┌───────────────┼───────────────┐
       ▼               ▼               ▼
  ┌────────┐      ┌────────┐      ┌────────┐
  │Batch 1 │      │Batch 2 │      │Batch 3 │
  └────┬───┘      └────┬───┘      └────┬───┘
       │               │               │
  ┌────▼───┐      ┌────▼───┐      ┌────▼───┐
  │ GPU 0  │      │ GPU 1  │      │ GPU 2  │
  │ Model  │      │ Model  │      │ Model  │
  │ Copy   │      │ Copy   │      │ Copy   │
  └────┬───┘      └────┬───┘      └────┬───┘
       │               │               │
       └───────────────┼───────────────┘
                       ▼
             ┌─────────────────┐
             │ Sync Gradients  │
             │   (AllReduce)   │
             └─────────────────┘

```
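To make the gradient synchronization step concrete, here is a minimal pure-Python simulation, not how any framework implements it. Three simulated GPUs each compute a gradient on their own shard, a stand-in `all_reduce_mean` averages them, and every replica applies the same update. The one-parameter model, the toy dataset, and all function names are assumptions made for this illustration.

```python
def local_gradient(w: float, shard: list) -> float:
    """Mean gradient of the squared error (w*x - y)**2 w.r.t. w over one shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def all_reduce_mean(values: list) -> list:
    """Stand-in for AllReduce: every worker ends up with the same average."""
    mean = sum(values) / len(values)
    return [mean] * len(values)


# Toy dataset for y = 3x, split into equal shards for three simulated GPUs.
dataset = [(float(x), 3.0 * x) for x in range(1, 7)]
shards = [dataset[0:2], dataset[2:4], dataset[4:6]]

w = 0.0   # every replica starts from the same weight
lr = 0.01

for step in range(200):
    # On real hardware these gradients are computed in parallel, one per GPU.
    grads = [local_gradient(w, shard) for shard in shards]
    synced = all_reduce_mean(grads)
    w -= lr * synced[0]  # identical update on every replica

full_batch_grad = local_gradient(0.0, dataset)
averaged_grad = all_reduce_mean([local_gradient(0.0, s) for s in shards])[0]
print(f"learned w = {w:.4f}")
print(f"full-batch grad = {full_batch_grad:.4f}, averaged shard grads = {averaged_grad:.4f}")
```

With equal-sized shards, averaging the per-shard gradients gives the same value as the gradient over the combined batch, which is why the replicas behave like one model trained on a larger batch.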

Model Parallelism

The model itself is split across GPUs, so each GPU holds only some of the layers:

```
Model Parallelism:

Large Model (won't fit on one GPU):

┌─────────────────────────────────────────────────┐
│              Neural Network Layers              │
│  ┌───────┐ ┌───────┐ ┌───────┐ ┌───────┐        │
│  │Layer 1│→│Layer 2│→│Layer 3│→│Layer 4│        │
│  └───────┘ └───────┘ └───────┘ └───────┘        │
└─────────────────────────────────────────────────┘

Split across GPUs:

┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐
│  GPU 0  │───▶│  GPU 1  │───▶│  GPU 2  │───▶│  GPU 3  │
│ Layer 1 │    │ Layer 2 │    │ Layer 3 │    │ Layer 4 │
└─────────┘    └─────────┘    └─────────┘    └─────────┘
```
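The layer-per-GPU split in the diagram can be sketched in plain Python. The device names are only labels here and the four "layers" are made-up arithmetic functions, so this is an illustration of the data flow under those assumptions rather than real GPU code.

```python
from typing import Callable

# Each stage is (device label, layer function). In a real framework each
# layer's weights would live in the memory of the GPU named by the label.
stages: list = [
    ("gpu:0", lambda x: x * 2.0),   # Layer 1
    ("gpu:1", lambda x: x + 1.0),   # Layer 2
    ("gpu:2", lambda x: x * x),     # Layer 3
    ("gpu:3", lambda x: x - 3.0),   # Layer 4
]


def forward(x: float) -> tuple:
    """Run the layers in order, recording each device-to-device activation transfer."""
    transfers = []
    previous_device = None
    for device, layer in stages:
        if previous_device is not None:
            transfers.append(f"{previous_device} -> {device}")
        x = layer(x)
        previous_device = device
    return x, transfers


output, transfers = forward(2.0)
print(output)
print(transfers)
```

Only one stage runs at a time for a given input, so the other GPUs sit idle while they wait for activations.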

Pipeline Parallelism

Pipeline parallelism combines model parallelism with micro-batching, so that different GPUs work on different micro-batches at the same time:

```
Pipeline Parallelism:

Time →
        ┌─────┬─────┬─────┬─────┬─────┬─────┐
GPU 0:  │ B1  │ B2  │ B3  │ B4  │     │     │
        └─────┴──┬──┴──┬──┴──┬──┴─────┴─────┘
                 │     │     │
        ┌────────▼──┬──▼──┬──▼──┬─────┬─────┐
GPU 1:  │     │ B1  │ B2  │ B3  │ B4  │     │
        └─────┴─────┴──┬──┴──┬──┴──┬──┴─────┘
                       │     │     │
        ┌──────────────▼──┬──▼──┬──▼──┬─────┐
GPU 2:  │     │     │ B1  │ B2  │ B3  │ B4  │
        └─────┴─────┴─────┴─────┴─────┴─────┘

B1, B2, ... = Micro-batches
GPUs stay busy with different batches
```
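The schedule in the diagram follows a simple rule: micro-batch b reaches stage s in time slot b + s. The sketch below generates that forward-only grid and computes how much of it is useful work. It ignores the backward pass and assumes every stage takes the same time, so treat the numbers as a rough guide.

```python
def pipeline_schedule(num_stages: int, num_micro_batches: int) -> list:
    """Forward-only schedule: micro-batch b runs on stage s during time slot b + s."""
    num_slots = num_micro_batches + num_stages - 1
    grid = [["--"] * num_slots for _ in range(num_stages)]
    for stage in range(num_stages):
        for batch in range(num_micro_batches):
            grid[stage][stage + batch] = f"B{batch + 1}"
    return grid


def utilization(num_stages: int, num_micro_batches: int) -> float:
    """Fraction of (stage, time slot) cells that do useful work."""
    return num_micro_batches / (num_micro_batches + num_stages - 1)


for stage, row in enumerate(pipeline_schedule(3, 4)):
    print(f"GPU {stage}: " + " ".join(row))

print(f"utilization with 4 micro-batches:  {utilization(3, 4):.2f}")
print(f"utilization with 32 micro-batches: {utilization(3, 32):.2f}")
```

The idle cells at the start and end of each row are the pipeline "bubble". Using more micro-batches per step shrinks the bubble relative to the useful work.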

PyTorch Distributed Training

Data Parallel (Single Node)

```python
import torch
import torch.nn as nn

# Simple DataParallel (single node, multiple GPUs)
# MyModel, dataloader, criterion, and optimizer are assumed to be defined elsewhere.
model = MyModel()
model = nn.DataParallel(model)  # Wraps model; replicates it on each visible GPU
model = model.cuda()

# Training loop is unchanged apart from the wrapper
for data, target in dataloader:
    data, target = data.cuda(), target.cuda()
    optimizer.zero_grad()  # Clear gradients left over from the previous step
    output = model(data)   # Batch is automatically split across GPUs
    loss = criterion(output, target)
    loss.backward()
    optimizer.step()
```

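Distributed Data Parallel (Multi-Node)

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

# torchrun sets these environment variables for every process it starts
rank = int(os.environ["RANK"])
local_rank = int(os.environ["LOCAL_RANK"])
world_size = int(os.environ["WORLD_SIZE"])

# Initialize distributed process group
dist.init_process_group(
    backend='nccl',  # NVIDIA Collective Communications Library
    init_method='env://',
    world_size=world_size,
    rank=rank
)
torch.cuda.set_device(local_rank)  # Bind this process to its own GPU

# Create model and wrap with DDP
# MyModel, dataset, criterion, optimizer, and num_epochs are assumed to be defined elsewhere.
model = MyModel().cuda()
model = DDP(model, device_ids=[local_rank])

# Use DistributedSampler for data
sampler = DistributedSampler(dataset)
dataloader = DataLoader(dataset, sampler=sampler, batch_size=32)

# Training loop
for epoch in range(num_epochs):
    sampler.set_epoch(epoch)  # Shuffle differently each epoch
    for data, target in dataloader:
        data, target = data.cuda(), target.cuda()
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()  # Gradients automatically synchronized
        optimizer.step()

dist.destroy_process_group()
```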
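The sampler is what keeps the processes from training on the same samples. The following pure-Python sketch shows the general idea of per-rank sharding with an epoch-dependent shuffle. It is a simplified illustration, not PyTorch's implementation, and the function name and seeding scheme are assumptions.

```python
import random


def shard_indices(dataset_len: int, world_size: int, rank: int,
                  epoch: int, seed: int = 0) -> list:
    """Indices one rank would see in one epoch (simplified sketch)."""
    indices = list(range(dataset_len))
    # Every rank uses the same seed, so all ranks agree on the permutation.
    random.Random(seed + epoch).shuffle(indices)
    per_rank = -(-dataset_len // world_size)  # ceiling division
    # Repeat a few samples so that every rank gets the same number of items.
    padded = indices + indices[: per_rank * world_size - dataset_len]
    return padded[rank::world_size]  # strided slice for this rank


world_size = 4
shards = [shard_indices(10, world_size, rank, epoch=0) for rank in range(world_size)]
for rank, shard in enumerate(shards):
    print(f"rank {rank}: {shard}")
```

Each rank gets an equal-sized, mostly disjoint slice, and changing `epoch` changes the permutation. This is the role `sampler.set_epoch(epoch)` plays in the training loop.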

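Launch Script

```bash
# Launch across 4 nodes, 8 GPUs each (run the same command on every node).
# "master" is the hostname of the rendezvous node; train_job is an arbitrary job id.
torchrun \
    --nnodes=4 \
    --nproc_per_node=8 \
    --rdzv_backend=c10d \
    --rdzv_id=train_job \
    --rdzv_endpoint=master:29500 \
    train.py
```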
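For the 4-node, 8-GPU launch above, each of the 32 processes receives its own `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` environment variables. The sketch below enumerates the layout you would typically see when every node runs the same number of processes. The `node_rank` field and the function name are illustrative, and which physical machine gets which node rank is decided at launch time.

```python
def rank_layout(nnodes: int, nproc_per_node: int) -> list:
    """Enumerate the rank variables for every process in a homogeneous launch."""
    layout = []
    for node_rank in range(nnodes):
        for local_rank in range(nproc_per_node):
            layout.append({
                "node_rank": node_rank,      # which machine
                "LOCAL_RANK": local_rank,    # which GPU on that machine
                "RANK": node_rank * nproc_per_node + local_rank,  # global id
                "WORLD_SIZE": nnodes * nproc_per_node,
            })
    return layout


layout = rank_layout(nnodes=4, nproc_per_node=8)
print(len(layout))   # total number of processes
print(layout[0])     # first GPU on the first node
print(layout[-1])    # last GPU on the last node
```

`LOCAL_RANK` selects the GPU within a node, while `RANK` identifies the process across the whole job.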

Scaling Efficiency

Communication Overhead

```
Scaling Efficiency:

Perfect linear scaling (theoretical):
  1 GPU:    100 samples/sec
  8 GPUs:   800 samples/sec
  64 GPUs:  6400 samples/sec

Actual (with communication overhead):
  1 GPU:    100 samples/sec
  8 GPUs:   700 samples/sec   (87.5% efficiency)
  64 GPUs:  4000 samples/sec  (62.5% efficiency)

Efficiency drops as:
- More GPUs = more synchronization
- Smaller batches = higher communication ratio
- Slower interconnect = longer waits
```
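The efficiency percentages are the measured throughput divided by the ideal linear throughput. A small sketch of that arithmetic, using the sample numbers from the block above (the function names are illustrative):

```python
def scaling_efficiency(throughput: float, num_gpus: int,
                       single_gpu_throughput: float) -> float:
    """Measured throughput as a fraction of perfect linear scaling."""
    return throughput / (num_gpus * single_gpu_throughput)


def speedup(throughput: float, single_gpu_throughput: float) -> float:
    """How many times faster than one GPU the cluster actually runs."""
    return throughput / single_gpu_throughput


# samples/sec taken from the example figures above
measured = {1: 100.0, 8: 700.0, 64: 4000.0}
baseline = measured[1]

for num_gpus, throughput in measured.items():
    eff = scaling_efficiency(throughput, num_gpus, baseline)
    print(f"{num_gpus:>3} GPUs: speedup {speedup(throughput, baseline):>5.1f}x, "
          f"efficiency {eff:.1%}")
```

At 64 GPUs the cluster is 40x faster than one GPU, not 64x, so more than a third of the purchased compute goes to waiting.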

Batch Size Considerations

```python
# Effective batch size = per_gpu_batch × num_gpus
# Single GPU: batch_size = 32
# 8 GPUs:  effective_batch = 32 × 8 = 256
# 64 GPUs: effective_batch = 32 × 64 = 2048

# May need to adjust learning rate
# Linear scaling rule: lr_new = lr_base × num_gpus
base_lr = 0.1    # example value: learning rate tuned on a single GPU
world_size = 8   # example value: total number of GPUs
learning_rate = base_lr * world_size
```
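Jumping straight to the scaled learning rate can be unstable early in training, so the linear scaling rule is often paired with a gradual warmup. Warmup is not covered in the text above. The sketch below is one common way to combine the two, and the function name and step counts are assumptions for illustration.

```python
def scaled_lr(base_lr: float, world_size: int, step: int, warmup_steps: int) -> float:
    """Linear scaling rule with a linear warmup from base_lr to base_lr * world_size."""
    target = base_lr * world_size
    if warmup_steps <= 0 or step >= warmup_steps:
        return target
    # Interpolate between the single-GPU rate and the scaled rate.
    return base_lr + (target - base_lr) * step / warmup_steps


per_gpu_batch = 32
for gpus in (1, 8, 64):
    effective_batch = per_gpu_batch * gpus
    print(f"{gpus:>2} GPUs: effective batch {effective_batch:>4}, "
          f"target lr {scaled_lr(0.1, gpus, step=10_000, warmup_steps=500):.2f}")

for step in (0, 250, 500):
    print(f"step {step:>3}: lr = {scaled_lr(0.1, 8, step, warmup_steps=500):.3f}")
```

Whether the linear rule holds at very large batch sizes depends on the model and optimizer, so the scaled value is a starting point to validate rather than a guarantee.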

Frameworks for GPU Clusters

Framework	Use Case	Complexity
PyTorch DDP	General distributed training	Medium
DeepSpeed	Large model training	Medium-High
Megatron-LM	Massive language models	High
Horovod	Framework-agnostic	Medium
Ray	General distributed ML	Medium

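DeepSpeed Example

```python
import deepspeed

# Config for ZeRO optimization
ds_config = {
    "train_batch_size": 256,
    "gradient_accumulation_steps": 4,
    "fp16": {"enabled": True},
    # An optimizer must come from the config or be passed to initialize();
    # the type and learning rate here are illustrative values.
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
    "zero_optimization": {
        "stage": 2,  # Partition gradients and optimizer states
        "offload_optimizer": {"device": "cpu"}
    }
}

# Initialize DeepSpeed (model is an existing torch.nn.Module)
model, optimizer, _, _ = deepspeed.initialize(
    model=model,
    config=ds_config,
    model_parameters=model.parameters()
)

# Training loop (dataloader and compute_loss are assumed to be defined elsewhere)
for batch in dataloader:
    outputs = model(batch)
    loss = compute_loss(outputs)
    model.backward(loss)  # Engine handles loss scaling and gradient reduction
    model.step()          # Optimizer step and gradient zeroing
```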
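The two batch settings in the config interact with the number of GPUs: the global `train_batch_size` equals the per-GPU micro-batch times the gradient accumulation steps times the number of GPUs. The helper below, whose name is an assumption for this sketch, works out the per-GPU micro-batch implied by the config values for different cluster sizes.

```python
def micro_batch_per_gpu(train_batch_size: int,
                        gradient_accumulation_steps: int,
                        world_size: int) -> int:
    """Per-GPU micro-batch implied by the global batch size."""
    denominator = gradient_accumulation_steps * world_size
    if train_batch_size % denominator != 0:
        raise ValueError(
            f"train_batch_size={train_batch_size} is not divisible by "
            f"gradient_accumulation_steps * world_size = {denominator}"
        )
    return train_batch_size // denominator


for gpus in (8, 16, 64):
    micro = micro_batch_per_gpu(256, 4, gpus)
    print(f"{gpus:>2} GPUs: micro-batch per GPU = {micro}")

try:
    micro_batch_per_gpu(256, 4, 128)
except ValueError as err:
    print(f"invalid combination: {err}")
```

With a batch size of 256 and 4 accumulation steps, the config cannot be used unchanged beyond 64 GPUs; one of the two values has to be adjusted.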

Cost Considerations

```
GPU Cluster Costs (Cloud):

┌─────────────────────────┬─────────────┬────────────────────┐
│ Configuration           │ Hourly Cost │ Monthly (24/7)     │
├─────────────────────────┼─────────────┼────────────────────┤
│ 1 × A100 (40GB)         │ ~$3         │ ~$2,200            │
│ 8 × A100 (one node)     │ ~$25        │ ~$18,000           │
│ 64 × A100 (8 nodes)     │ ~$200       │ ~$144,000          │
│ 512 × A100 (64 nodes)   │ ~$1,600     │ ~$1,150,000        │
└─────────────────────────┴─────────────┴────────────────────┘

Cost optimization:
- Spot instances (60-70% cheaper, can be interrupted)
- Reserved capacity (30-40% cheaper, commitment)
- Right-size your cluster (don't over-provision)
```
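Hourly price alone hides the effect of scaling efficiency, so it can help to look at cost per unit of work. The sketch below combines the approximate hourly prices from the table with the sample throughput figures from the scaling efficiency section. Pairing those two sets of numbers is an assumption made for illustration, as are the 730 hours per month and the function names.

```python
HOURS_PER_MONTH = 730  # assumption: average month, running 24/7


def monthly_cost(hourly_cost: float, discount: float = 0.0) -> float:
    """Monthly cost at a given hourly rate, optionally with a fractional discount."""
    return hourly_cost * HOURS_PER_MONTH * (1 - discount)


def cost_per_million_samples(hourly_cost: float, samples_per_sec: float) -> float:
    """Dollars spent to process one million training samples."""
    return hourly_cost / (samples_per_sec * 3600) * 1_000_000


# (hourly cost in dollars, samples/sec) per configuration
configs = {
    "8 x A100": (25.0, 700.0),
    "64 x A100": (200.0, 4000.0),
}

for name, (hourly, throughput) in configs.items():
    print(f"{name:<10} ${monthly_cost(hourly):>9,.0f}/month on demand, "
          f"${monthly_cost(hourly, discount=0.6):>9,.0f}/month at a 60% spot discount, "
          f"${cost_per_million_samples(hourly, throughput):.2f} per million samples")
```

The larger cluster finishes sooner, but each sample costs more because the price grows linearly while throughput does not.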

Summary

Aspect	Single GPU	Multi-GPU Node	GPU Cluster
Memory	8-80 GB	64-640 GB	Terabytes
Interconnect	N/A	NVLink (600 GB/s)	InfiniBand (25-50 GB/s)
Complexity	Simple	Medium	High
Use Case	Development, small models	Medium models	Large models, fast training

Key points:

GPU clusters enable training models too large for single GPUs
Data parallelism is the simplest strategy; model parallelism is for models too large to fit on one GPU
Communication overhead limits scaling efficiency
NVLink (intra-node) >> InfiniBand (inter-node) >> Ethernet
Choose cluster size based on model size and time constraints
Cost scales roughly linearly; efficiency doesn't

Exercises

Exercise 1. Explain the difference between data parallelism and model parallelism in distributed GPU training.

Solution to Exercise 1

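Data parallelism keeps a full copy of the model on every GPU and gives each GPU a different batch of data; the gradients are then synchronized (AllReduce) so that all copies stay identical. Model parallelism splits the model itself across GPUs, for example one group of layers per GPU, and is used when the model does not fit in the memory of a single GPU. The snippet below only reports the local Python environment.

```python
# Conceptual solution - see page content for details
import sys
import platform

print(f"Python version: {sys.version}")
print(f"Platform: {platform.platform()}")
print(f"Architecture: {platform.machine()}")
```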

Exercise 2. Explain what CUDA is and why it is important for GPU computing. Can Python code use CUDA?

Solution to Exercise 2

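CUDA is NVIDIA's parallel computing platform and programming model for running general-purpose code on NVIDIA GPUs. It matters because deep learning frameworks rely on it to execute their tensor operations on the GPU. Python code can use CUDA through libraries built on top of it: in the PyTorch examples above, model.cuda() and data.cuda() move the model and the data onto a CUDA device.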

Exercise 3. Describe the role of NCCL (NVIDIA Collective Communication Library) in multi-GPU training.

Solution to Exercise 3

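NCCL provides the collective communication operations, such as AllReduce and broadcast, that GPUs use to exchange data within a node and across nodes. In the DDP example above it is selected with backend='nccl', and it carries out the gradient synchronization that happens during loss.backward(). The snippet below is a simple single-process timing benchmark and does not use NCCL.

```python
import time

# Simple benchmark
n = 10_000_000
start = time.perf_counter()
total = sum(range(n))
elapsed = time.perf_counter() - start
print(f"Sum of {n} integers: {total}")
print(f"Time: {elapsed:.4f} seconds")
```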

Exercise 4. Explain what GPU memory (VRAM) is and why running out of VRAM is a common issue in deep learning. What strategies help mitigate this?

Solution to Exercise 4

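GPU memory (VRAM) is the memory on the GPU itself, 8-80 GB per GPU in the summary table. Training has to fit the model parameters, gradients, optimizer states, and activations in it, so large models or large batches can exhaust it. Mitigations that appear on this page include a smaller per-GPU batch with gradient accumulation, fp16 training, ZeRO partitioning with optimizer offload to the CPU, and model or pipeline parallelism. The snippet below is a CPU-side comparison of a Python loop and a vectorized NumPy sum.

```python
import numpy as np
import time

n = 1_000_000

# Python loop
start = time.perf_counter()
result_py = sum(i * i for i in range(n))
time_py = time.perf_counter() - start

# NumPy vectorized (int64 avoids overflow on platforms with a 32-bit default)
arr = np.arange(n, dtype=np.int64)
start = time.perf_counter()
result_np = np.sum(arr * arr)
time_np = time.perf_counter() - start

print(f"Python: {time_py:.4f}s, NumPy: {time_np:.4f}s")
print(f"Speedup: {time_py / time_np:.1f}x")
```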