GPU Clusters

Why GPU Clusters?

A single GPU is limited in both memory (8-80 GB) and compute throughput. Large-scale deep learning and scientific computing often need more than that, which means using multiple GPUs spread across multiple machines.

GPU Scaling Journey:

Single GPU:           Multi-GPU (one machine):     GPU Cluster:
┌────────────┐        ┌────────────────────┐      ┌─────────┐ ┌─────────┐
│    GPU     │   →    │ GPU  GPU  GPU  GPU │  →   │ 8 GPUs  │ │ 8 GPUs  │
│  (8-80 GB) │        │ (connected via     │      │ Node 1  │ │ Node 2  │
└────────────┘        │  NVLink/PCIe)      │      └────┬────┘ └────┬────┘
                      └────────────────────┘           │          │
                                                       └────┬─────┘
                                                            │
                                                      ┌─────┴─────┐
                                                      │ Network   │
                                                      │(InfiniBand)│
                                                      └───────────┘

GPU Cluster Architecture

Typical Node Configuration

GPU Node (e.g., DGX A100):

┌─────────────────────────────────────────────────────────────┐
│                        GPU Node                             │
│                                                             │
│  ┌───────────────────────────────────────────────────────┐ │
│  │                    8 × A100 GPUs                       │ │
│  │   ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐                     │ │
│  │   │ GPU │═│ GPU │═│ GPU │═│ GPU │                     │ │
│  │   │  0  │ │  1  │ │  2  │ │  3  │  NVLink: 600 GB/s   │ │
│  │   └──╪──┘ └──╪──┘ └──╪──┘ └──╪──┘                     │ │
│  │      ╪═══════╪═══════╪═══════╪                        │ │
│  │   ┌──╪──┐ ┌──╪──┐ ┌──╪──┐ ┌──╪──┐                     │ │
│  │   │ GPU │═│ GPU │═│ GPU │═│ GPU │                     │ │
│  │   │  4  │ │  5  │ │  6  │ │  7  │                     │ │
│  │   └─────┘ └─────┘ └─────┘ └─────┘                     │ │
│  └───────────────────────────────────────────────────────┘ │
│                           │                                 │
│  ┌───────────────────────┴───────────────────────────────┐ │
│  │          2 × CPU (AMD EPYC / Intel Xeon)              │ │
│  │                    1-2 TB RAM                          │ │
│  └───────────────────────────────────────────────────────┘ │
│                           │                                 │
│                    InfiniBand NIC                          │
│                     (200-400 Gbps)                         │
└─────────────────────────────────────────────────────────────┘

Cluster Interconnect

GPU Cluster Network:

┌──────────┐     ┌──────────┐     ┌──────────┐
│  Node 1  │     │  Node 2  │     │  Node N  │
│  8 GPUs  │     │  8 GPUs  │     │  8 GPUs  │
└────┬─────┘     └────┬─────┘     └────┬─────┘
     │                │                │
     │     InfiniBand Fabric           │
     │     (200-400 Gbps/node)         │
     └────────────────┴────────────────┘

Bandwidth Hierarchy (approximate):
  GPU Memory:        ~2000 GB/s
  NVLink (in-node):  ~600 GB/s
  InfiniBand:        ~25-50 GB/s (200-400 Gbps)
  Ethernet:          ~12.5 GB/s (100 Gbps)
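
Network links are quoted in gigabits per second, while memory and NVLink figures are in gigabytes per second, so a division by 8 is needed before comparing them. The sketch below does that conversion and estimates an ideal transfer time per link. The payload (fp16 gradients of a 1-billion-parameter model) and the helper names are assumptions for illustration, and latency and protocol overhead are ignored.

```python
# Rough transfer-time estimate for one gradient exchange over each link.
# Bandwidth figures are round illustrative numbers, not measurements.

def gbps_to_gb_per_s(gbps: float) -> float:
    """Convert gigabits per second to gigabytes per second (8 bits per byte)."""
    return gbps / 8.0


def transfer_time_ms(payload_gb: float, bandwidth_gb_per_s: float) -> float:
    """Ideal time to move payload_gb over a link, ignoring latency and overhead."""
    return payload_gb / bandwidth_gb_per_s * 1000.0


# Hypothetical payload: fp16 gradients of a 1-billion-parameter model (2 bytes each).
payload_gb = 1_000_000_000 * 2 / 1e9

links = {
    "NVLink (in-node)": 600.0,
    "InfiniBand 200 Gbps": gbps_to_gb_per_s(200),
    "Ethernet 100 Gbps": gbps_to_gb_per_s(100),
}

for name, bandwidth in links.items():
    time_ms = transfer_time_ms(payload_gb, bandwidth)
    print(f"{name:22s} {bandwidth:7.1f} GB/s  {time_ms:7.1f} ms")
```

Even in this idealized model, the same payload takes over twenty times longer on a 200 Gbps link than on NVLink, which is why communication-heavy steps are kept inside a node where possible.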

Distributed Training Strategies

Data Parallelism

Each GPU holds a full copy of the model and processes a different slice of the data:

Data Parallelism:

                    ┌─────────────┐
                    │ Full Dataset │
                    └──────┬──────┘
           ┌───────────────┼───────────────┐
           ▼               ▼               ▼
      ┌────────┐      ┌────────┐      ┌────────┐
      │Batch 1 │      │Batch 2 │      │Batch 3 │
      └────┬───┘      └────┬───┘      └────┬───┘
           │               │               │
      ┌────▼───┐      ┌────▼───┐      ┌────▼───┐
      │ GPU 0  │      │ GPU 1  │      │ GPU 2  │
      │ Model  │      │ Model  │      │ Model  │
      │ Copy   │      │ Copy   │      │ Copy   │
      └────┬───┘      └────┬───┘      └────┬───┘
           │               │               │
           └───────────────┼───────────────┘
                           ▼
                 ┌─────────────────┐
                 │ Sync Gradients  │
                 │   (AllReduce)   │
                 └─────────────────┘
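
The gradient synchronization step can be illustrated without any GPUs. The following is a minimal single-process sketch, assuming a one-parameter linear model with mean-squared-error loss and equal-sized shards; `all_reduce_mean` is a hypothetical stand-in for a real AllReduce followed by division by the number of workers.

```python
# Simulated data parallelism in one process: each "worker" computes the
# gradient on its own shard, then the gradients are averaged.

def mse_gradient(w: float, xs: list, ys: list) -> float:
    """Gradient of mean((w*x - y)^2) with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def all_reduce_mean(values: list) -> float:
    """Stand-in for AllReduce(sum) divided by the number of workers."""
    return sum(values) / len(values)


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
w = 0.5
num_workers = 3
shard = len(xs) // num_workers  # equal-sized shards

local_grads = [
    mse_gradient(w, xs[r * shard:(r + 1) * shard], ys[r * shard:(r + 1) * shard])
    for r in range(num_workers)
]
synced_grad = all_reduce_mean(local_grads)
full_batch_grad = mse_gradient(w, xs, ys)

print("per-worker gradients:", local_grads)
print("after averaging:     ", synced_grad)
print("full-batch gradient: ", full_batch_grad)
```

With equal-sized shards the averaged gradient matches the full-batch gradient, so every model copy applies the same update and the copies stay identical.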

Model Parallelism

The model itself is split across GPUs, with each GPU holding a subset of the layers:

Model Parallelism:

Large Model (won't fit on one GPU):

┌─────────────────────────────────────────────────┐
│              Neural Network Layers              │
│  ┌───────┐ ┌───────┐ ┌───────┐ ┌───────┐      │
│  │Layer 1│→│Layer 2│→│Layer 3│→│Layer 4│      │
│  └───────┘ └───────┘ └───────┘ └───────┘      │
└─────────────────────────────────────────────────┘

Split across GPUs:

┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐
│  GPU 0  │───▶│  GPU 1  │───▶│  GPU 2  │───▶│  GPU 3  │
│ Layer 1 │    │ Layer 2 │    │ Layer 3 │    │ Layer 4 │
└─────────┘    └─────────┘    └─────────┘    └─────────┘
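
As a rough sketch of the split shown above, the code below assigns contiguous layers to GPUs and then runs a forward pass stage by stage. The layers are plain Python functions standing in for real network layers, and `partition_layers` and `run_stages` are hypothetical helpers, not part of any framework.

```python
# Assign contiguous layers to GPUs and run them in order.
# No real devices are involved; this only shows the bookkeeping.

def partition_layers(num_layers: int, num_gpus: int) -> list:
    """Return one list of layer indices per GPU, split as evenly as possible."""
    if num_gpus < 1 or num_layers < num_gpus:
        raise ValueError("need at least one layer per GPU")
    base, extra = divmod(num_layers, num_gpus)
    stages, start = [], 0
    for gpu in range(num_gpus):
        size = base + (1 if gpu < extra else 0)
        stages.append(list(range(start, start + size)))
        start += size
    return stages


def run_stages(x, layers, stages):
    """Run the layers stage by stage and count stage-to-stage hand-offs."""
    transfers = 0
    for gpu, layer_ids in enumerate(stages):
        if gpu > 0:
            transfers += 1  # activations would be copied between GPUs here
        for i in layer_ids:
            x = layers[i](x)
    return x, transfers


# Toy "layers": simple functions in place of real network layers.
layers = [lambda v: v + 1, lambda v: v * 2, lambda v: v - 3, lambda v: v ** 2]
stages = partition_layers(len(layers), 4)
result, transfers = run_stages(2, layers, stages)

print("layers per GPU:", stages)
print("output:", result, "| inter-GPU transfers:", transfers)
```

Note that only one stage is active at any moment in this scheme, so the other GPUs sit idle while they wait for activations.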

Pipeline Parallelism

Pipeline parallelism combines model parallelism with micro-batching, so that different stages work on different micro-batches at the same time:

Pipeline Parallelism:

Time →
       ┌─────┬─────┬─────┬─────┬─────┬─────┐
GPU 0: │ B1  │ B2  │ B3  │ B4  │     │     │
       └─────┴──┬──┴──┬──┴──┬──┴─────┴─────┘
                │     │     │
       ┌────────▼──┬──▼──┬──▼──┬─────┬─────┐
GPU 1: │     │ B1  │ B2  │ B3  │ B4  │     │
       └─────┴─────┴──┬──┴──┬──┴──┬──┴─────┘
                      │     │     │
       ┌──────────────▼──┬──▼──┬──▼──┬─────┐
GPU 2: │     │     │ B1  │ B2  │ B3  │ B4  │
       └─────┴─────┴─────┴─────┴─────┴─────┘

B1, B2, ... = micro-batches
Each GPU works on a different micro-batch at the same time; the empty slots at the start and end are idle time (the pipeline "bubble")
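
The schedule in the diagram can be generated programmatically, which also makes the idle time easy to quantify. This is a simplified sketch that models forward passes only, with one time slot per micro-batch per stage; real schedules also interleave backward passes. The function names are hypothetical.

```python
# Forward-only pipeline schedule: stage s processes micro-batch m at time slot s + m.

def pipeline_schedule(num_stages: int, num_micro_batches: int) -> list:
    """Return one row per stage; each entry is a micro-batch label or '--' when idle."""
    total_slots = num_micro_batches + num_stages - 1
    grid = []
    for stage in range(num_stages):
        row = []
        for t in range(total_slots):
            mb = t - stage
            row.append(f"B{mb + 1}" if 0 <= mb < num_micro_batches else "--")
        grid.append(row)
    return grid


def bubble_fraction(num_stages: int, num_micro_batches: int) -> float:
    """Fraction of time slots in which a given stage is idle."""
    return (num_stages - 1) / (num_micro_batches + num_stages - 1)


for stage, row in enumerate(pipeline_schedule(3, 4)):
    print(f"GPU {stage}: " + " ".join(row))

print(f"idle fraction with 4 micro-batches:  {bubble_fraction(3, 4):.2f}")
print(f"idle fraction with 16 micro-batches: {bubble_fraction(3, 16):.2f}")
```

Under this model, increasing the number of micro-batches shrinks the idle fraction, which is the usual reason for choosing many small micro-batches.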

PyTorch Distributed Training

Data Parallel (Single Node)

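import torch
import torch.nn as nn

# Simple DataParallel (single node, multiple GPUs)
# MyModel, criterion, optimizer, and dataloader are assumed to be defined elsewhere
model = MyModel()
model = nn.DataParallel(model)  # Replicates the model on every visible GPU
model = model.cuda()

# Training loop is the same as for a single GPU
for data, target in dataloader:
    data, target = data.cuda(), target.cuda()
    optimizer.zero_grad()  # Clear gradients left over from the previous step
    output = model(data)  # Input batch is split across GPUs automatically
    loss = criterion(output, target)
    loss.backward()
    optimizer.step()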

Distributed Data Parallel (Multi-Node)

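import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

# torchrun sets LOCAL_RANK (along with RANK and WORLD_SIZE) for every process
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)  # Bind this process to its own GPU

# Initialize distributed process group
dist.init_process_group(
    backend='nccl',  # NVIDIA Collective Communications Library
    init_method='env://',  # Rank and world size are read from the environment
)

# Create model and wrap with DDP
# MyModel, dataset, criterion, optimizer, and num_epochs are assumed to be defined elsewhere
model = MyModel().cuda()
model = DDP(model, device_ids=[local_rank])

# Use DistributedSampler so each process sees a different shard of the data
sampler = DistributedSampler(dataset)
dataloader = DataLoader(dataset, sampler=sampler, batch_size=32)  # batch_size is per GPU

# Training loop
for epoch in range(num_epochs):
    sampler.set_epoch(epoch)  # Shuffle differently each epoch
    for data, target in dataloader:
        data, target = data.cuda(), target.cuda()
        optimizer.zero_grad()  # Clear gradients left over from the previous step
        output = model(data)
        loss = criterion(output, target)
        loss.backward()  # Gradients automatically synchronized
        optimizer.step()

dist.destroy_process_group()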
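
To make the role of the sampler concrete, here is a simplified, unshuffled sketch of how a dataset's indices can be divided among processes: the index list is padded so it divides evenly, then each rank takes every `world_size`-th index starting at its own rank. `shard_indices` is a hypothetical helper written for illustration; the real `DistributedSampler` additionally shuffles per epoch and supports dropping the tail instead of padding.

```python
import math


def shard_indices(dataset_len: int, world_size: int, rank: int) -> list:
    """Indices assigned to one rank, with padding so all ranks get the same count."""
    if dataset_len <= 0:
        raise ValueError("dataset_len must be positive")
    per_rank = math.ceil(dataset_len / world_size)
    total = per_rank * world_size
    indices = list(range(dataset_len))
    # Pad by repeating indices from the start until the length divides evenly.
    indices = (indices * math.ceil(total / dataset_len))[:total]
    # Each rank takes a strided slice, so shards interleave rather than form blocks.
    return indices[rank:total:world_size]


world_size = 4
for rank in range(world_size):
    print(f"rank {rank}: {shard_indices(10, world_size, rank)}")
```

With 10 samples and 4 processes, two indices are repeated so that every process runs the same number of steps, which keeps the gradient synchronization calls aligned across ranks.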

Launch Script

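# Launch across 4 nodes, 8 GPUs each (run the same command on every node)
# job1 is a placeholder job ID; any string shared by all nodes works
torchrun \
    --nnodes=4 \
    --nproc_per_node=8 \
    --rdzv_id=job1 \
    --rdzv_backend=c10d \
    --rdzv_endpoint=master:29500 \
    train.py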
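
Inside `train.py`, each process can find out who it is from the `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` environment variables that `torchrun` sets. The sketch below reads them with single-process defaults so the script also runs without a launcher. `read_torchrun_env` and `expected_rank` are hypothetical helpers, and the rank formula assumes every node runs the same number of processes.

```python
import os


def read_torchrun_env(env=None) -> dict:
    """Read process identity from the environment, defaulting to a single process."""
    env = os.environ if env is None else env
    return {
        "rank": int(env.get("RANK", 0)),              # global index across all nodes
        "local_rank": int(env.get("LOCAL_RANK", 0)),  # index within this node
        "world_size": int(env.get("WORLD_SIZE", 1)),  # total number of processes
    }


def expected_rank(node_rank: int, nproc_per_node: int, local_rank: int) -> int:
    """Global rank when every node runs nproc_per_node processes."""
    return node_rank * nproc_per_node + local_rank


# Example: the last process of a 4-node, 8-GPU-per-node job.
example_env = {"RANK": "31", "LOCAL_RANK": "7", "WORLD_SIZE": "32"}
info = read_torchrun_env(example_env)
print(info)
print("expected rank:", expected_rank(node_rank=3, nproc_per_node=8, local_rank=7))
```

The local rank is what selects the GPU on a node, while the global rank is typically used for decisions such as which process writes checkpoints or logs.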

Scaling Efficiency

Communication Overhead

Scaling Efficiency:

Perfect linear scaling (theoretical):
  1 GPU:  100 samples/sec
  8 GPUs: 800 samples/sec
  64 GPUs: 6400 samples/sec

Actual (with communication overhead):
  1 GPU:  100 samples/sec
  8 GPUs: 700 samples/sec (87.5% efficiency)
  64 GPUs: 4000 samples/sec (62.5% efficiency)

Efficiency drops as:
  - More GPUs = more synchronization
  - Smaller batches = higher communication ratio
  - Slower interconnect = longer waits
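
The efficiency percentages above come from dividing measured throughput by the ideal linear throughput. A small sketch, using the illustrative throughput figures from this section (the function names are hypothetical):

```python
def scaling_efficiency(throughput: float, num_gpus: int, single_gpu_throughput: float) -> float:
    """Measured throughput divided by ideal linear throughput."""
    return throughput / (num_gpus * single_gpu_throughput)


def speedup(throughput: float, single_gpu_throughput: float) -> float:
    """How many times faster than one GPU."""
    return throughput / single_gpu_throughput


single = 100.0  # samples/sec on one GPU
measurements = {8: 700.0, 64: 4000.0}  # num_gpus -> samples/sec

for gpus, rate in measurements.items():
    eff = scaling_efficiency(rate, gpus, single)
    print(f"{gpus:3d} GPUs: {speedup(rate, single):5.1f}x speedup, {eff:.1%} efficiency")
```

Going from 8 to 64 GPUs here multiplies the hardware by 8 but the throughput by less than 6.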

Batch Size Considerations

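# Effective batch size = per_gpu_batch × num_gpus
per_gpu_batch = 32
world_size = 8  # Total number of GPUs (example value)
effective_batch = per_gpu_batch * world_size

# Single GPU: effective_batch = 32
# 8 GPUs: effective_batch = 32 × 8 = 256
# 64 GPUs: effective_batch = 32 × 64 = 2048

# A larger effective batch usually calls for a larger learning rate
# Linear scaling rule: lr_new = lr_base × num_gpus
base_lr = 0.001  # Learning rate tuned for a single GPU (example value)
learning_rate = base_lr * world_size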
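
Jumping straight to the scaled learning rate can destabilize the first few steps, so the linear scaling rule is commonly paired with a warmup that ramps up gradually. The following is a minimal sketch of that idea; the warmup length, the base rate, and the helper names are assumptions chosen for illustration, not recommended values.

```python
def scaled_lr(base_lr: float, world_size: int) -> float:
    """Linear scaling rule: multiply the single-GPU rate by the number of GPUs."""
    return base_lr * world_size


def warmup_lr(step: int, warmup_steps: int, base_lr: float, target_lr: float) -> float:
    """Ramp linearly from base_lr to target_lr over warmup_steps, then stay constant."""
    if step >= warmup_steps:
        return target_lr
    return base_lr + (target_lr - base_lr) * step / warmup_steps


base_lr = 0.1       # assumed single-GPU learning rate
world_size = 8      # assumed number of GPUs
warmup_steps = 1000  # assumed warmup length

target = scaled_lr(base_lr, world_size)
for step in (0, 250, 500, 1000, 5000):
    print(f"step {step:5d}: lr = {warmup_lr(step, warmup_steps, base_lr, target):.3f}")
```

The rule is a heuristic: at very large effective batch sizes it tends to break down, so the scaled value is best treated as a starting point for tuning.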

Frameworks for GPU Clusters

Framework	Use Case	Complexity
PyTorch DDP	General distributed training	Medium
DeepSpeed	Large model training	Medium-High
Megatron-LM	Massive language models	High
Horovod	Framework-agnostic	Medium
Ray	General distributed ML	Medium

DeepSpeed Example

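import deepspeed

# Config for ZeRO optimization
# train_batch_size = micro-batch per GPU × gradient_accumulation_steps × number of GPUs
ds_config = {
    "train_batch_size": 256,
    "gradient_accumulation_steps": 4,
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},  # Example values (assumed)
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,  # Partition gradients and optimizer states
        "offload_optimizer": {"device": "cpu"}  # Keep optimizer states in host memory
    }
}

# Initialize DeepSpeed; the returned model is a DeepSpeed engine wrapping the original
model, optimizer, _, _ = deepspeed.initialize(
    model=model,
    config=ds_config,
    model_parameters=model.parameters()
)

# Training loop (compute_loss is a user-defined helper)
for batch in dataloader:
    outputs = model(batch)
    loss = compute_loss(outputs)
    model.backward(loss)  # Used instead of loss.backward()
    model.step()  # Used instead of optimizer.step()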
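
To see why partitioning gradients and optimizer states matters, it helps to estimate the per-GPU memory for model state at each ZeRO stage. The sketch below assumes mixed-precision training with Adam, using the commonly cited accounting of 2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and 12 for optimizer state. Activations, buffers, and fragmentation are ignored, and `zero_bytes_per_param` is a hypothetical helper, not a DeepSpeed API.

```python
def zero_bytes_per_param(stage: int, num_gpus: int) -> float:
    """Approximate model-state bytes per parameter held on each GPU."""
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    params, grads, optim = 2.0, 2.0, 12.0
    if stage >= 1:
        optim /= num_gpus   # stage 1: partition optimizer states
    if stage >= 2:
        grads /= num_gpus   # stage 2: also partition gradients
    if stage >= 3:
        params /= num_gpus  # stage 3: also partition parameters
    return params + grads + optim


num_params = 1_000_000_000  # hypothetical 1-billion-parameter model
num_gpus = 8

for stage in range(4):
    gb = zero_bytes_per_param(stage, num_gpus) * num_params / 1e9
    print(f"ZeRO stage {stage}: ~{gb:.2f} GB of model state per GPU")
```

Offloading the optimizer to the CPU, as in the config above, moves the optimizer-state share to host memory instead, at the cost of extra traffic over PCIe.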

Cost Considerations

GPU Cluster Costs (Cloud):

┌─────────────────────────┬─────────────┬────────────────────┐
│  Configuration          │ Hourly Cost │ Monthly (24/7)     │
├─────────────────────────┼─────────────┼────────────────────┤
│  1 × A100 (40GB)        │    ~$3      │    ~$2,200         │
│  8 × A100 (one node)    │    ~$25     │    ~$18,000        │
│  64 × A100 (8 nodes)    │    ~$200    │    ~$144,000       │
│  512 × A100 (64 nodes)  │    ~$1,600  │    ~$1,150,000     │
└─────────────────────────┴─────────────┴────────────────────┘

Cost optimization:
  - Spot instances (60-70% cheaper, can be interrupted)
  - Reserved capacity (30-40% cheaper, commitment)
  - Right-size your cluster (don't over-provision)
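
Hourly price alone hides the effect of scaling efficiency, so a more useful comparison is cost per unit of work. This sketch combines the approximate hourly prices above with the illustrative throughput figures from the Scaling Efficiency section; both sets of numbers are rough examples, and the function name is hypothetical.

```python
def cost_per_million_samples(hourly_cost: float, samples_per_sec: float) -> float:
    """Dollars spent to process one million training samples."""
    hours = 1_000_000 / samples_per_sec / 3600
    return hourly_cost * hours


# (num_gpus, approximate hourly cost in dollars, samples/sec)
configs = [
    (1, 3.0, 100.0),
    (8, 25.0, 700.0),
    (64, 200.0, 4000.0),
]

for gpus, hourly, rate in configs:
    cost = cost_per_million_samples(hourly, rate)
    minutes = 1_000_000 / rate / 60
    print(f"{gpus:3d} GPUs: ${cost:6.2f} per million samples, {minutes:6.1f} min")
```

With these figures the 64-GPU cluster finishes 40 times sooner than one GPU but costs about two thirds more per sample, so the larger cluster is buying time rather than saving money.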

Summary

Aspect	Single GPU	Multi-GPU Node	GPU Cluster
Memory	8-80 GB	64-640 GB	Terabytes
Interconnect	N/A	NVLink (600 GB/s)	InfiniBand (25-50 GB/s)
Complexity	Simple	Medium	High
Use Case	Development, small models	Medium models	Large models, fast training

Key points:

- GPU clusters enable training models too large for a single GPU
- Data parallelism is the simplest strategy; model and pipeline parallelism are for models that do not fit on one GPU
- Communication overhead limits scaling efficiency
- Bandwidth: NVLink (intra-node) >> InfiniBand (inter-node) >> Ethernet
- Choose cluster size based on model size and time constraints
- Cost scales roughly linearly with GPU count; throughput does not