CPU-GPU Communication¶

The CPU-GPU Boundary¶

CPUs and GPUs have separate memory spaces connected by the PCIe bus:

┌────────────────────────┐          ┌────────────────────────┐
│         CPU            │          │          GPU           │
│  ┌──────────────────┐  │          │  ┌──────────────────┐  │
│  │    CPU Cores     │  │          │  │   CUDA Cores     │  │
│  └────────┬─────────┘  │          │  └────────┬─────────┘  │
│           │            │          │           │            │
│  ┌────────┴─────────┐  │          │  ┌────────┴─────────┐  │
│  │    System RAM    │  │   PCIe   │  │   GPU Memory     │  │
│  │    (16-128 GB)   │◀═╬════════▶╬═▶│    (8-80 GB)     │  │
│  └──────────────────┘  │ ~32 GB/s │  └──────────────────┘  │
└────────────────────────┘          └────────────────────────┘
                              ▲
                              │
                    Major bottleneck!

PCIe: The Connection¶

PCIe Bandwidth¶

Generation	Per-Lane	x16 Slot	Bidirectional
PCIe 3.0	~1 GB/s	~16 GB/s	~32 GB/s
PCIe 4.0	~2 GB/s	~32 GB/s	~64 GB/s
PCIe 5.0	~4 GB/s	~64 GB/s	~128 GB/s

Bandwidth Comparison¶

GPU Internal Memory:    ████████████████████████████  ~1000 GB/s
System RAM:             ████████████                  ~50 GB/s
PCIe 4.0 x16:           █████                         ~32 GB/s

PCIe is 30-60x slower than GPU memory!

Data Transfer Operations¶

Basic Transfer Pattern¶

import torch
import time

# Create data on CPU
cpu_tensor = torch.randn(10000, 10000)  # 400 MB

# Transfer to GPU (copy)
start = time.perf_counter()
gpu_tensor = cpu_tensor.to('cuda')
torch.cuda.synchronize()  # Wait for transfer to complete
h2d_time = time.perf_counter() - start

# Transfer back to CPU (copy)
start = time.perf_counter()
result = gpu_tensor.to('cpu')
torch.cuda.synchronize()
d2h_time = time.perf_counter() - start

size_gb = cpu_tensor.numel() * 4 / 1e9

print(f"Host→Device: {size_gb/h2d_time:.1f} GB/s")
print(f"Device→Host: {size_gb/d2h_time:.1f} GB/s")

Typical output:

Host→Device: 12.5 GB/s
Device→Host: 11.8 GB/s

Why Measured < Theoretical?¶

PCIe 4.0 x16 theoretical: 32 GB/s

Actual achieved: 12-15 GB/s

Reasons for gap:
  - Protocol overhead
  - DMA setup time
  - Memory allocation
  - Driver overhead
  - System interrupts

Transfer Methods¶

Synchronous Transfer (Default)¶

import torch

# Blocks until complete
gpu_data = cpu_data.to('cuda')  # CPU waits
result = gpu_data @ gpu_data    # Then compute

Asynchronous Transfer¶

import torch

# Non-blocking transfer with streams
stream = torch.cuda.Stream()

with torch.cuda.stream(stream):
    # Transfer happens asynchronously
    gpu_data = cpu_data.to('cuda', non_blocking=True)

# Can do other CPU work here...
cpu_work()

# Wait when you need the result
stream.synchronize()
result = gpu_data @ gpu_data

Pinned (Page-Locked) Memory¶

import torch

# Regular memory (pageable)
regular = torch.randn(10000, 10000)

# Pinned memory (faster transfers)
pinned = torch.randn(10000, 10000, pin_memory=True)

# Transfer comparison
def time_transfer(tensor):
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    start.record()
    gpu = tensor.to('cuda', non_blocking=True)
    end.record()
    torch.cuda.synchronize()

    return start.elapsed_time(end)

print(f"Regular: {time_transfer(regular):.1f} ms")
print(f"Pinned:  {time_transfer(pinned):.1f} ms")

Why Pinned Memory is Faster¶

Regular (Pageable) Memory Transfer:
┌────────────────┐    ┌────────────────┐    ┌──────────────┐
│ User Buffer    │ →  │ Pinned Buffer  │ →  │ GPU Memory   │
│ (may be paged) │    │ (DMA staging)  │    │              │
└────────────────┘    └────────────────┘    └──────────────┘
     Extra copy needed!

Pinned Memory Transfer:
┌────────────────┐                         ┌──────────────┐
│ Pinned Buffer  │ ──────── DMA ────────→ │ GPU Memory   │
│ (always in RAM)│                         │              │
└────────────────┘                         └──────────────┘
     Direct transfer!

Minimizing Transfer Overhead¶

Strategy 1: Batch Transfers¶

# Bad: Transfer every iteration
for epoch in range(100):
    for batch in dataloader:
        batch_gpu = batch.to('cuda')  # Transfer every batch!
        result = model(batch_gpu)

# Good: Use DataLoader with pin_memory
dataloader = DataLoader(dataset, pin_memory=True, num_workers=4)

for epoch in range(100):
    for batch in dataloader:
        # Transfer overlaps with previous batch processing
        batch_gpu = batch.to('cuda', non_blocking=True)
        result = model(batch_gpu)

Strategy 2: Keep Data on GPU¶

# Bad: Round-trip every operation
x_gpu = x_cpu.to('cuda')
y = model(x_gpu)
y_cpu = y.to('cpu')  # Why transfer back?
z_gpu = y_cpu.to('cuda')  # Just to transfer again?

# Good: Stay on GPU
x_gpu = x_cpu.to('cuda')
y_gpu = model(x_gpu)
z_gpu = another_model(y_gpu)  # Stay on GPU
final = z_gpu.to('cpu')  # Transfer only at the end

Strategy 3: Overlap Transfer and Compute¶

import torch

# Use multiple streams to overlap
stream1 = torch.cuda.Stream()
stream2 = torch.cuda.Stream()

# While computing on batch N, transfer batch N+1
for i in range(num_batches):
    with torch.cuda.stream(stream1):
        # Transfer next batch
        next_batch = batches[i+1].to('cuda', non_blocking=True)

    with torch.cuda.stream(stream2):
        # Compute on current batch
        result = model(current_batch)

    current_batch = next_batch
    torch.cuda.synchronize()

Unified Memory (CUDA Managed Memory)¶

CUDA can automatically manage CPU-GPU transfers:

import cupy as cp

# Managed memory - system handles transfers
managed = cp.cuda.managed_memory.alloc(size)

# Data migrates on demand between CPU and GPU
# Simpler but less control over performance

Trade-offs¶

Approach	Performance	Complexity
Manual Transfers	Best (with optimization)	High
Pinned Memory	Very Good	Medium
Managed Memory	Good	Low

NVLink: High-Speed GPU Connection¶

Some systems have NVLink for faster GPU-GPU communication:

PCIe (CPU ↔ GPU):     ~32 GB/s
NVLink (GPU ↔ GPU):   ~600 GB/s (per link)

┌───────────┐           ┌───────────┐
│   GPU 0   │◀═══════▶│   GPU 1   │
│           │  NVLink   │           │
└─────┬─────┘  600 GB/s └─────┬─────┘
      │                       │
      │         PCIe          │
      └───────────┬───────────┘
                  │
              ┌───┴───┐
              │  CPU  │
              └───────┘

Summary¶

Concept	Description
PCIe	Bus connecting CPU and GPU (~32 GB/s)
Host→Device	CPU to GPU transfer (H2D)
Device→Host	GPU to CPU transfer (D2H)
Pinned Memory	Page-locked RAM for faster transfers
Async Transfer	Non-blocking transfers with streams
NVLink	High-speed GPU-GPU interconnect

Key optimization strategies:

Minimize transfers: Keep data on GPU as long as possible
Use pinned memory: 1.5-2x faster transfers
Batch operations: Transfer large chunks, not small pieces
Overlap compute and transfer: Use async operations
Profile transfer time: Often dominates total time for small operations

# Rule of thumb for GPU benefit:
compute_time = time_on_gpu(operation)
transfer_time = data_size / 12e9  # ~12 GB/s practical

if compute_time > transfer_time:
    print("GPU beneficial")
else:
    print("Transfer overhead too high")