GPU Clusters¶
Mental Model
A GPU cluster is a team of powerful machines, each packed with GPUs, connected by a fast network. The challenge is not raw compute -- individual GPUs are fast -- but coordination: synchronizing gradients across nodes costs time that grows with cluster size. Scaling efficiency is the gap between the parallelism you buy and the speedup you actually get.
Why GPU Clusters?¶
Single GPUs have limits. Large-scale deep learning and scientific computing often require multiple GPUs across multiple machines.
``` GPU Scaling Journey:
Single GPU: Multi-GPU (one machine): GPU Cluster: ┌────────────┐ ┌────────────────────┐ ┌─────────┐ ┌─────────┐ │ GPU │ → │ GPU GPU GPU GPU │ → │ 8 GPUs │ │ 8 GPUs │ │ (8-80 GB) │ │ (connected via │ │ Node 1 │ │ Node 2 │ └────────────┘ │ NVLink/PCIe) │ └────┬────┘ └────┬────┘ └────────────────────┘ │ │ └────┬─────┘ │ ┌─────┴─────┐ │ Network │ │(InfiniBand)│ └───────────┘ ```
GPU Cluster Architecture¶
Typical Node Configuration¶
``` GPU Node (e.g., DGX A100):
┌─────────────────────────────────────────────────────────────┐ │ GPU Node │ │ │ │ ┌───────────────────────────────────────────────────────┐ │ │ │ 8 × A100 GPUs │ │ │ │ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ │ │ │ │ │ GPU │═│ GPU │═│ GPU │═│ GPU │ │ │ │ │ │ 0 │ │ 1 │ │ 2 │ │ 3 │ NVLink: 600 GB/s │ │ │ │ └──╪──┘ └──╪──┘ └──╪──┘ └──╪──┘ │ │ │ │ ╪═══════╪═══════╪═══════╪ │ │ │ │ ┌──╪──┐ ┌──╪──┐ ┌──╪──┐ ┌──╪──┐ │ │ │ │ │ GPU │═│ GPU │═│ GPU │═│ GPU │ │ │ │ │ │ 4 │ │ 5 │ │ 6 │ │ 7 │ │ │ │ │ └─────┘ └─────┘ └─────┘ └─────┘ │ │ │ └───────────────────────────────────────────────────────┘ │ │ │ │ │ ┌───────────────────────┴───────────────────────────────┐ │ │ │ 2 × CPU (AMD EPYC / Intel Xeon) │ │ │ │ 1-2 TB RAM │ │ │ └───────────────────────────────────────────────────────┘ │ │ │ │ │ InfiniBand NIC │ │ (200-400 Gbps) │ └─────────────────────────────────────────────────────────────┘ ```
Cluster Interconnect¶
``` GPU Cluster Network:
┌──────────┐ ┌──────────┐ ┌──────────┐ │ Node 1 │ │ Node 2 │ │ Node N │ │ 8 GPUs │ │ 8 GPUs │ │ 8 GPUs │ └────┬─────┘ └────┬─────┘ └────┬─────┘ │ │ │ │ InfiniBand Fabric │ │ (200-400 Gbps/node) │ └────────────────┴────────────────┘
Bandwidth Hierarchy: GPU Memory: ~2000 GB/s NVLink (in-node): ~600 GB/s InfiniBand: ~50 GB/s (200 Gbps) Ethernet: ~12 GB/s (100 Gbps) ```
Distributed Training Strategies¶
Data Parallelism¶
Each GPU has full model copy, different data:
``` Data Parallelism:
┌─────────────┐
│ Full Dataset │
└──────┬──────┘
┌───────────────┼───────────────┐
▼ ▼ ▼
┌────────┐ ┌────────┐ ┌────────┐
│Batch 1 │ │Batch 2 │ │Batch 3 │
└────┬───┘ └────┬───┘ └────┬───┘
│ │ │
┌────▼───┐ ┌────▼───┐ ┌────▼───┐
│ GPU 0 │ │ GPU 1 │ │ GPU 2 │
│ Model │ │ Model │ │ Model │
│ Copy │ │ Copy │ │ Copy │
└────┬───┘ └────┬───┘ └────┬───┘
│ │ │
└───────────────┼───────────────┘
▼
┌─────────────────┐
│ Sync Gradients │
│ (AllReduce) │
└─────────────────┘
```
Model Parallelism¶
Model split across GPUs:
``` Model Parallelism:
Large Model (won't fit on one GPU):
┌─────────────────────────────────────────────────┐ │ Neural Network Layers │ │ ┌───────┐ ┌───────┐ ┌───────┐ ┌───────┐ │ │ │Layer 1│→│Layer 2│→│Layer 3│→│Layer 4│ │ │ └───────┘ └───────┘ └───────┘ └───────┘ │ └─────────────────────────────────────────────────┘
Split across GPUs:
┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │ GPU 0 │───▶│ GPU 1 │───▶│ GPU 2 │───▶│ GPU 3 │ │ Layer 1 │ │ Layer 2 │ │ Layer 3 │ │ Layer 4 │ └─────────┘ └─────────┘ └─────────┘ └─────────┘ ```
Pipeline Parallelism¶
Combine model parallelism with micro-batching:
``` Pipeline Parallelism:
Time → ┌─────┬─────┬─────┬─────┬─────┬─────┐ GPU 0: │ B1 │ B2 │ B3 │ B4 │ │ │ └─────┴──┬──┴──┬──┴──┬──┴─────┴─────┘ │ │ │ ┌────────▼──┬──▼──┬──▼──┬─────┬─────┐ GPU 1: │ │ B1 │ B2 │ B3 │ B4 │ │ └─────┴─────┴──┬──┴──┬──┴──┬──┴─────┘ │ │ │ ┌──────────────▼──┬──▼──┬──▼──┬─────┐ GPU 2: │ │ │ B1 │ B2 │ B3 │ B4 │ └─────┴─────┴─────┴─────┴─────┴─────┘
B1, B2, ... = Micro-batches GPUs stay busy with different batches ```
PyTorch Distributed Training¶
Data Parallel (Single Node)¶
```python import torch import torch.nn as nn
Simple DataParallel (single node, multiple GPUs)¶
model = MyModel() model = nn.DataParallel(model) # Wraps model model = model.cuda()
Training loop unchanged¶
for data, target in dataloader: data, target = data.cuda(), target.cuda() output = model(data) # Automatically split across GPUs loss = criterion(output, target) loss.backward() optimizer.step() ```
Distributed Data Parallel (Multi-Node)¶
```python import torch import torch.distributed as dist from torch.nn.parallel import DistributedDataParallel as DDP
Initialize distributed process group¶
dist.init_process_group( backend='nccl', # NVIDIA Collective Communications Library init_method='env://', world_size=world_size, rank=rank )
Create model and wrap with DDP¶
model = MyModel().cuda() model = DDP(model, device_ids=[local_rank])
Use DistributedSampler for data¶
sampler = torch.utils.data.distributed.DistributedSampler(dataset) dataloader = DataLoader(dataset, sampler=sampler, batch_size=32)
Training loop¶
for epoch in range(num_epochs): sampler.set_epoch(epoch) # Shuffle differently each epoch for data, target in dataloader: data, target = data.cuda(), target.cuda() output = model(data) loss = criterion(output, target) loss.backward() # Gradients automatically synchronized optimizer.step() ```
Launch Script¶
```bash
Launch across 4 nodes, 8 GPUs each¶
torchrun \ --nnodes=4 \ --nproc_per_node=8 \ --rdzv_endpoint=master:29500 \ train.py ```
Scaling Efficiency¶
Communication Overhead¶
``` Scaling Efficiency:
Perfect linear scaling (theoretical): 1 GPU: 100 samples/sec 8 GPUs: 800 samples/sec 64 GPUs: 6400 samples/sec
Actual (with communication overhead): 1 GPU: 100 samples/sec 8 GPUs: 700 samples/sec (87.5% efficiency) 64 GPUs: 4000 samples/sec (62.5% efficiency)
Efficiency drops as: - More GPUs = more synchronization - Smaller batches = higher communication ratio - Slower interconnect = longer waits ```
Batch Size Considerations¶
```python
Effective batch size = per_gpu_batch × num_gpus¶
Single GPU: batch_size = 32¶
8 GPUs: effective_batch = 32 × 8 = 256¶
64 GPUs: effective_batch = 32 × 64 = 2048¶
May need to adjust learning rate¶
Linear scaling rule: lr_new = lr_base × num_gpus¶
learning_rate = base_lr * world_size ```
Frameworks for GPU Clusters¶
| Framework | Use Case | Complexity |
|---|---|---|
| PyTorch DDP | General distributed training | Medium |
| DeepSpeed | Large model training | Medium-High |
| Megatron-LM | Massive language models | High |
| Horovod | Framework-agnostic | Medium |
| Ray | General distributed ML | Medium |
DeepSpeed Example¶
```python import deepspeed
Config for ZeRO optimization¶
ds_config = { "train_batch_size": 256, "gradient_accumulation_steps": 4, "fp16": {"enabled": True}, "zero_optimization": { "stage": 2, # Partition gradients and optimizer states "offload_optimizer": {"device": "cpu"} } }
Initialize DeepSpeed¶
model, optimizer, _, _ = deepspeed.initialize( model=model, config=ds_config, model_parameters=model.parameters() )
Training loop¶
for batch in dataloader: outputs = model(batch) loss = compute_loss(outputs) model.backward(loss) model.step() ```
Cost Considerations¶
``` GPU Cluster Costs (Cloud):
┌─────────────────────────────────────────────────────────────┐ │ Configuration │ Hourly Cost │ Monthly (24/7) │ ├─────────────────────────┼─────────────┼────────────────────┤ │ 1 × A100 (40GB) │ ~$3 │ ~$2,200 │ │ 8 × A100 (one node) │ ~$25 │ ~$18,000 │ │ 64 × A100 (8 nodes) │ ~$200 │ ~$144,000 │ │ 512 × A100 (64 nodes) │ ~$1,600 │ ~$1,150,000 │ └─────────────────────────┴─────────────┴────────────────────┘
Cost optimization: - Spot instances (60-70% cheaper, can be interrupted) - Reserved capacity (30-40% cheaper, commitment) - Right-size your cluster (don't over-provision) ```
Summary¶
| Aspect | Single GPU | Multi-GPU Node | GPU Cluster |
|---|---|---|---|
| Memory | 8-80 GB | 64-640 GB | Terabytes |
| Interconnect | N/A | NVLink (600 GB/s) | InfiniBand (50 GB/s) |
| Complexity | Simple | Medium | High |
| Use Case | Development, small models | Medium models | Large models, fast training |
Key points:
- GPU clusters enable training models too large for single GPUs
- Data parallelism is simplest; model parallelism for huge models
- Communication overhead limits scaling efficiency
- NVLink (intra-node) >> InfiniBand (inter-node) >> Ethernet
- Choose cluster size based on model size and time constraints
- Cost scales roughly linearly; efficiency doesn't
Exercises¶
Exercise 1. Explain the difference between data parallelism and model parallelism in distributed GPU training.
Solution to Exercise 1
```python
Conceptual solution - see page content for details¶
import sys import platform
print(f"Python version: {sys.version}") print(f"Platform: {platform.platform()}") print(f"Architecture: {platform.machine()}") ```
Exercise 2. Explain what CUDA is and why it is important for GPU computing. Can Python code use CUDA?
Solution to Exercise 2
See the main content for the detailed explanation. The key concept involves understanding the hardware-software interaction and how it affects Python performance.
Exercise 3. Describe the role of NCCL (NVIDIA Collective Communication Library) in multi-GPU training.
Solution to Exercise 3
```python import time
Simple benchmark¶
n = 10_000_000 start = time.perf_counter() total = sum(range(n)) elapsed = time.perf_counter() - start print(f"Sum of {n} integers: {total}") print(f"Time: {elapsed:.4f} seconds") ```
Exercise 4. Explain what GPU memory (VRAM) is and why running out of VRAM is a common issue in deep learning. What strategies help mitigate this?
Solution to Exercise 4
```python import numpy as np import time
n = 1_000_000
Python loop¶
start = time.perf_counter() result_py = sum(i * i for i in range(n)) time_py = time.perf_counter() - start
NumPy vectorized¶
arr = np.arange(n) start = time.perf_counter() result_np = np.sum(arr * arr) time_np = time.perf_counter() - start
print(f"Python: {time_py:.4f}s, NumPy: {time_np:.4f}s") print(f"Speedup: {time_py / time_np:.1f}x") ```