CPU-GPU Communication¶

Mental Model

Think of the CPU and GPU as two offices in different buildings connected by a narrow hallway (PCIe bus). Each office is fast internally, but passing documents between them is slow. The key to performance is minimizing trips through the hallway and carrying as much as possible each time.

The CPU-GPU Boundary¶

CPUs and GPUs have separate memory spaces connected by the PCIe bus:

    ┌────────────────────────┐            ┌────────────────────────┐
    │          CPU           │            │          GPU           │
    │  ┌──────────────────┐  │            │  ┌──────────────────┐  │
    │  │    CPU Cores     │  │            │  │    CUDA Cores    │  │
    │  └────────┬─────────┘  │            │  └────────┬─────────┘  │
    │           │            │            │           │            │
    │  ┌────────┴─────────┐  │            │  ┌────────┴─────────┐  │
    │  │    System RAM    │  │    PCIe    │  │    GPU Memory    │  │
    │  │   (16-128 GB)    │◀═╬════════════╬═▶│    (8-80 GB)     │  │
    │  └──────────────────┘  │  ~32 GB/s  │  └──────────────────┘  │
    └────────────────────────┘            └────────────────────────┘
                                   ▲
                                   │
                           Major bottleneck!

PCIe: The Connection¶

PCIe Bandwidth¶

Generation	Per-Lane	x16 Slot	Bidirectional
PCIe 3.0	~1 GB/s	~16 GB/s	~32 GB/s
PCIe 4.0	~2 GB/s	~32 GB/s	~64 GB/s
PCIe 5.0	~4 GB/s	~64 GB/s	~128 GB/s
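
The x16 and bidirectional columns follow directly from the per-lane figure. A minimal sketch of that arithmetic, using the rounded per-lane values from the table (the dictionary and function names are illustrative, not part of any library):

```python
# Approximate usable bandwidth per lane, per direction, in GB/s
# (rounded values taken from the table above).
PER_LANE_GBPS = {"PCIe 3.0": 1.0, "PCIe 4.0": 2.0, "PCIe 5.0": 4.0}


def slot_bandwidth(generation, lanes=16):
    """Return (per-direction, bidirectional) bandwidth in GB/s for a slot."""
    one_way = PER_LANE_GBPS[generation] * lanes
    return one_way, 2 * one_way


for gen in PER_LANE_GBPS:
    one_way, both = slot_bandwidth(gen)
    print(f"{gen} x16: ~{one_way:.0f} GB/s per direction, ~{both:.0f} GB/s bidirectional")
```

A GPU placed in an x8 slot gets half of these figures, which `slot_bandwidth(gen, lanes=8)` shows.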

Bandwidth Comparison¶

```
GPU Internal Memory: ████████████████████████████ ~1000 GB/s
System RAM:          ████████████                 ~50 GB/s
PCIe 4.0 x16:        █████                        ~32 GB/s

PCIe is 30-60x slower than GPU memory!
```
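
To put these numbers in terms of time, the sketch below computes the best-case time to move a 400 MB payload over each path. The bandwidths are the rounded peak figures from the comparison above, and the helper name is illustrative; real transfers are slower.

```python
# Rounded peak bandwidths from the comparison above, in GB/s.
BANDWIDTH_GBPS = {
    "GPU internal memory": 1000.0,
    "System RAM": 50.0,
    "PCIe 4.0 x16": 32.0,
}


def ideal_transfer_ms(num_bytes, gbps):
    """Best-case time in milliseconds to move num_bytes at gbps (1 GB = 1e9 bytes)."""
    return num_bytes / (gbps * 1e9) * 1e3


payload = 10_000 * 10_000 * 4  # a 10000 x 10000 float32 matrix, 400 MB
for name, gbps in BANDWIDTH_GBPS.items():
    print(f"{name:<20} {ideal_transfer_ms(payload, gbps):6.1f} ms")
```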

Data Transfer Operations¶

Basic Transfer Pattern¶

```python
import time

import torch

# Create data on CPU
cpu_tensor = torch.randn(10000, 10000)  # float32: 400 MB

# Warm up so CUDA context creation is not counted in the first timing
torch.zeros(1, device='cuda')
torch.cuda.synchronize()

# Transfer to GPU (copy)
start = time.perf_counter()
gpu_tensor = cpu_tensor.to('cuda')
torch.cuda.synchronize()  # Wait for transfer to complete
h2d_time = time.perf_counter() - start

# Transfer back to CPU (copy)
start = time.perf_counter()
result = gpu_tensor.to('cpu')
torch.cuda.synchronize()
d2h_time = time.perf_counter() - start

size_gb = cpu_tensor.numel() * cpu_tensor.element_size() / 1e9

print(f"Host→Device: {size_gb/h2d_time:.1f} GB/s")
print(f"Device→Host: {size_gb/d2h_time:.1f} GB/s")
```

Typical output:

    Host→Device: 12.5 GB/s
    Device→Host: 11.8 GB/s

Why Measured < Theoretical?¶

```
PCIe 4.0 x16 theoretical: 32 GB/s

Actual achieved: 12-15 GB/s

Reasons for gap:
- Protocol overhead
- DMA setup time
- Memory allocation
- Driver overhead
- System interrupts
```

Transfer Methods¶

Synchronous Transfer (Default)¶

```python
import torch

cpu_data = torch.randn(4096, 4096)  # example input in pageable host memory

# Blocks until complete
gpu_data = cpu_data.to('cuda')  # CPU waits for the copy
result = gpu_data @ gpu_data    # Then compute
```

Asynchronous Transfer¶

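```python
import torch

# non_blocking=True only overlaps with CPU work when the source is pinned
cpu_data = torch.randn(4096, 4096).pin_memory()


def cpu_work():
    """Placeholder for independent CPU-side work."""
    return sum(range(1_000_000))


# Non-blocking transfer with streams
stream = torch.cuda.Stream()

with torch.cuda.stream(stream):
    # Transfer happens asynchronously
    gpu_data = cpu_data.to('cuda', non_blocking=True)

# Can do other CPU work here...
cpu_work()

# Wait when you need the result
stream.synchronize()
result = gpu_data @ gpu_data
```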

Pinned (Page-Locked) Memory¶

```python
import torch

# Regular memory (pageable)
regular = torch.randn(10000, 10000)

# Pinned memory (faster transfers)
pinned = torch.randn(10000, 10000, pin_memory=True)


# Transfer comparison
def time_transfer(tensor):
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    start.record()
    gpu = tensor.to('cuda', non_blocking=True)
    end.record()
    torch.cuda.synchronize()

    return start.elapsed_time(end)  # milliseconds


print(f"Regular: {time_transfer(regular):.1f} ms")
print(f"Pinned: {time_transfer(pinned):.1f} ms")
```

Why Pinned Memory is Faster¶

```
Regular (Pageable) Memory Transfer:
┌────────────────┐    ┌────────────────┐    ┌──────────────┐
│  User Buffer   │ →  │ Pinned Buffer  │ →  │  GPU Memory  │
│ (may be paged) │    │ (DMA staging)  │    │              │
└────────────────┘    └────────────────┘    └──────────────┘
                  Extra copy needed!

Pinned Memory Transfer:
┌────────────────┐                          ┌──────────────┐
│ Pinned Buffer  │ ──────── DMA ────────→   │  GPU Memory  │
│ (always in RAM)│                          │              │
└────────────────┘                          └──────────────┘
                  Direct transfer!
```
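
The cost of the extra staging copy can be estimated with a simple model. This is a rough sketch under stated assumptions: the staging copy runs at an assumed 50 GB/s host memory bandwidth, the DMA copy at an assumed 25 GB/s, and the two steps run back to back. Real drivers typically stage in chunks and overlap the steps, so this is closer to an upper bound for the pageable path. The function names are illustrative.

```python
def pinned_transfer_ms(num_bytes, pcie_gbps=25.0):
    """One DMA copy straight from pinned host memory to the GPU."""
    return num_bytes / (pcie_gbps * 1e9) * 1e3


def pageable_transfer_ms(num_bytes, pcie_gbps=25.0, host_copy_gbps=50.0):
    """Host-side staging copy followed by the same DMA copy (no pipelining)."""
    staging_ms = num_bytes / (host_copy_gbps * 1e9) * 1e3
    return staging_ms + pinned_transfer_ms(num_bytes, pcie_gbps)


size = 400_000_000  # 400 MB
pinned = pinned_transfer_ms(size)
pageable = pageable_transfer_ms(size)
print(f"Pinned:   {pinned:.1f} ms")
print(f"Pageable: {pageable:.1f} ms ({pageable / pinned:.2f}x slower)")
```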

Minimizing Transfer Overhead¶

Strategy 1: Batch Transfers¶

```python
import torch
from torch.utils.data import DataLoader

# `dataset` and `model` are assumed to be defined elsewhere;
# each batch is assumed to be a single tensor.

# Bad: blocking copy from pageable memory every iteration
dataloader = DataLoader(dataset)

for epoch in range(100):
    for batch in dataloader:
        batch_gpu = batch.to('cuda')  # CPU stalls on every copy
        result = model(batch_gpu)

# Good: Use DataLoader with pin_memory
dataloader = DataLoader(dataset, pin_memory=True, num_workers=4)

for epoch in range(100):
    for batch in dataloader:
        # Transfer overlaps with previous batch processing
        batch_gpu = batch.to('cuda', non_blocking=True)
        result = model(batch_gpu)
```

Strategy 2: Keep Data on GPU¶

```python
# Bad: Round-trip every operation
x_gpu = x_cpu.to('cuda')
y = model(x_gpu)
y_cpu = y.to('cpu')        # Why transfer back?
z_gpu = y_cpu.to('cuda')   # Just to transfer again?

# Good: Stay on GPU
x_gpu = x_cpu.to('cuda')
y_gpu = model(x_gpu)
z_gpu = another_model(y_gpu)  # Stay on GPU
final = z_gpu.to('cpu')       # Transfer only at the end
```
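
To make the difference concrete, a small sketch can tally the bus traffic of both patterns. `BusCounter` and the two pipeline functions are hypothetical stand-ins, not PyTorch APIs, and every stage output is assumed to be the same size as its input.

```python
class BusCounter:
    """Tallies bytes that cross the CPU-GPU bus (a stand-in for real transfers)."""

    def __init__(self):
        self.bytes_moved = 0
        self.transfers = 0

    def copy(self, num_bytes):
        self.bytes_moved += num_bytes
        self.transfers += 1


def round_trip_pipeline(bus, num_bytes, stages):
    """Copy to the GPU and back to the CPU around every stage."""
    for _ in range(stages):
        bus.copy(num_bytes)  # host -> device
        bus.copy(num_bytes)  # device -> host


def resident_pipeline(bus, num_bytes, stages):
    """Copy in once, run every stage on the GPU, copy out once."""
    bus.copy(num_bytes)  # host -> device
    # All stages run on the device here; nothing crosses the bus.
    bus.copy(num_bytes)  # device -> host


bad, good = BusCounter(), BusCounter()
round_trip_pipeline(bad, 400_000_000, stages=2)
resident_pipeline(good, 400_000_000, stages=2)
print(f"Round-trip: {bad.transfers} copies, {bad.bytes_moved / 1e9:.1f} GB")
print(f"Resident:   {good.transfers} copies, {good.bytes_moved / 1e9:.1f} GB")
```

The resident version always needs two copies, while the round-trip version grows by two copies for every stage added.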

Strategy 3: Overlap Transfer and Compute¶

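```python
import torch

# `batches` is assumed to be a list of pinned CPU tensors, `model` a module on the GPU

# Use multiple streams to overlap
stream1 = torch.cuda.Stream()  # transfers
stream2 = torch.cuda.Stream()  # compute

# Load the first batch before the loop starts
current_batch = batches[0].to('cuda', non_blocking=True)
torch.cuda.synchronize()

# While computing on batch N, transfer batch N+1
for i in range(len(batches)):
    next_batch = None
    if i + 1 < len(batches):  # the last batch has no successor
        with torch.cuda.stream(stream1):
            # Transfer next batch
            next_batch = batches[i + 1].to('cuda', non_blocking=True)

    with torch.cuda.stream(stream2):
        # Compute on current batch
        result = model(current_batch)

    # Wait for both streams before swapping batches
    torch.cuda.synchronize()
    current_batch = next_batch
```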
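
How much overlap can save is easy to estimate with a two-stage pipeline model. The sketch below assumes every batch has the same transfer and compute time and that one transfer can fully overlap one compute step; the function names and the 4 ms / 10 ms figures are illustrative.

```python
def serial_time_ms(num_batches, transfer_ms, compute_ms):
    """Each batch is transferred, then computed, with no overlap."""
    return num_batches * (transfer_ms + compute_ms)


def overlapped_time_ms(num_batches, transfer_ms, compute_ms):
    """Transfer of batch N+1 runs while batch N is computed.

    The first transfer and the last compute cannot be hidden; every
    step in between costs whichever of the two is slower.
    """
    if num_batches == 0:
        return 0.0
    steady_state = (num_batches - 1) * max(transfer_ms, compute_ms)
    return transfer_ms + steady_state + compute_ms


n, t, c = 100, 4.0, 10.0
print(f"Serial:     {serial_time_ms(n, t, c):.0f} ms")
print(f"Overlapped: {overlapped_time_ms(n, t, c):.0f} ms")
```

When compute is the slower stage, as in this example, the transfers are almost entirely hidden. When transfer is slower, the pipeline runs at the speed of the bus and overlap cannot help further.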

Unified Memory (CUDA Managed Memory)¶

CUDA can automatically manage CPU-GPU transfers:

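```python
import cupy as cp

# Managed memory - system handles transfers
# Route CuPy allocations through cudaMallocManaged
pool = cp.cuda.MemoryPool(cp.cuda.malloc_managed)
cp.cuda.set_allocator(pool.malloc)

managed = cp.zeros(1_000_000, dtype=cp.float32)  # backed by managed memory

# Data migrates on demand between CPU and GPU
# Simpler but less control over performance
```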

Trade-offs¶

Approach	Performance	Complexity
Manual Transfers	Best (with optimization)	High
Pinned Memory	Very Good	Medium
Managed Memory	Good	Low

NVLink: High-Speed GPU Connection¶

Some systems have NVLink for faster GPU-GPU communication:

```
PCIe (CPU ↔ GPU):   ~32 GB/s
NVLink (GPU ↔ GPU): ~600 GB/s (aggregate per GPU, e.g. NVLink 3.0 on A100)

┌───────────┐         ┌───────────┐
│   GPU 0   │◀═══════▶│   GPU 1   │
│           │ NVLink  │           │
└─────┬─────┘ 600 GB/s└─────┬─────┘
      │                     │
      │        PCIe         │
      └──────────┬──────────┘
                 │
             ┌───┴───┐
             │  CPU  │
             └───────┘
```

Summary¶

Concept	Description
PCIe	Bus connecting CPU and GPU (~32 GB/s)
Host→Device	CPU to GPU transfer (H2D)
Device→Host	GPU to CPU transfer (D2H)
Pinned Memory	Page-locked RAM for faster transfers
Async Transfer	Non-blocking transfers with streams
NVLink	High-speed GPU-GPU interconnect

Key optimization strategies:

Minimize transfers: Keep data on GPU as long as possible
Use pinned memory: 1.5-2x faster transfers
Batch operations: Transfer large chunks, not small pieces
Overlap compute and transfer: Use async operations
Profile transfer time: Often dominates total time for small operations

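```python
# Rule of thumb for GPU benefit:
# (the timing and size below are placeholder values; measure your own workload)
compute_time = 0.050              # seconds the operation takes on the GPU
data_size = 400e6                 # bytes moved across PCIe
transfer_time = data_size / 12e9  # ~12 GB/s practical

if compute_time > transfer_time:
    print("GPU beneficial")
else:
    print("Transfer overhead too high")
```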
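
A slightly fuller version of this check compares the CPU time against the GPU time plus the transfers in both directions. This is a sketch under stated assumptions: the function name is hypothetical, the 12 GB/s default is the practical figure used above, and the timings in the two calls are made-up examples rather than measurements.

```python
def gpu_is_worthwhile(cpu_ms, gpu_ms, bytes_in, bytes_out, pcie_gbps=12.0):
    """Return (worthwhile, transfer_ms) for offloading one operation.

    Offloading pays off only if the CPU time exceeds the GPU compute
    time plus the time to copy inputs in and results out.
    """
    transfer_ms = (bytes_in + bytes_out) / (pcie_gbps * 1e9) * 1e3
    return cpu_ms > gpu_ms + transfer_ms, transfer_ms


# Cheap element-wise operation on 400 MB: the transfers dominate
print(gpu_is_worthwhile(cpu_ms=50, gpu_ms=1, bytes_in=400e6, bytes_out=400e6))

# Expensive operation on the same data: the compute savings dominate
print(gpu_is_worthwhile(cpu_ms=5000, gpu_ms=50, bytes_in=400e6, bytes_out=400e6))
```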

Exercises¶

Exercise 1. Explain the fundamental difference between CPU and GPU architectures. Why are GPUs better for parallel workloads?

Solution to Exercise 1

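CPUs have a small number of complex cores optimized for low-latency sequential execution, while GPUs have thousands of simpler cores optimized for throughput. Workloads that apply the same operation to many independent data elements map well onto those cores. The snippet below only reports the host environment, a useful first step before any benchmarking.

```python
# Report the host environment
import sys
import platform

print(f"Python version: {sys.version}")
print(f"Platform: {platform.platform()}")
print(f"Architecture: {platform.machine()}")
```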

Exercise 2. Write Python code using numpy to perform a large matrix multiplication, and explain why a GPU would be faster for this operation.

Solution to Exercise 2

See the main content for the detailed explanation. The key concept involves understanding the hardware-software interaction and how it affects Python performance.
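
One possible solution sketch, assuming NumPy is installed. The matrix size is kept small so the example runs quickly; the variable names are illustrative.

```python
import time

import numpy as np

n = 512  # increase to 4096 or more to see a substantial CPU runtime
rng = np.random.default_rng(0)
a = rng.standard_normal((n, n), dtype=np.float32)
b = rng.standard_normal((n, n), dtype=np.float32)

start = time.perf_counter()
c = a @ b
elapsed = time.perf_counter() - start

# An n x n matrix product needs about 2 * n^3 floating-point operations
flops = 2 * n**3
print(f"{n}x{n} matmul: {elapsed * 1e3:.2f} ms, {flops / elapsed / 1e9:.1f} GFLOP/s")
```

Each of the n² output elements is an independent dot product, so the work can be spread across thousands of GPU cores at once. The operation also performs about 2n³ arithmetic operations on only 3n² values, so for large n the compute time outweighs the cost of moving the matrices across PCIe.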

Exercise 3. Explain what data transfer overhead between CPU and GPU means. Why is it important to minimize transfers?

Solution to Exercise 3

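Transfer overhead is the time spent copying data across the PCIe bus between system RAM and GPU memory. Because PCIe is 30-60x slower than GPU memory, frequent or small transfers can take longer than the computation they feed, so data should stay on the GPU as long as possible. The benchmark below shows the `time.perf_counter()` timing pattern that can be used to measure such costs.

```python
import time

# Simple benchmark
n = 10_000_000
start = time.perf_counter()
total = sum(range(n))
elapsed = time.perf_counter() - start
print(f"Sum of {n} integers: {total}")
print(f"Time: {elapsed:.4f} seconds")
```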

Exercise 4. Describe a scenario where running a computation on the CPU would be faster than transferring data to the GPU, computing, and transferring back.

Solution to Exercise 4

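An element-wise operation on a small array is the typical case: the arithmetic is cheap, so copying the data to the GPU and back costs more than the CPU computation itself. The code below times such a small computation on the CPU, first as a Python loop and then vectorized with NumPy.

```python
import numpy as np
import time

n = 1_000_000

# Python loop
start = time.perf_counter()
result_py = sum(i * i for i in range(n))
time_py = time.perf_counter() - start

# NumPy vectorized
arr = np.arange(n)
start = time.perf_counter()
result_np = np.sum(arr * arr)
time_np = time.perf_counter() - start

print(f"Python: {time_py:.4f}s, NumPy: {time_np:.4f}s")
print(f"Speedup: {time_py / time_np:.1f}x")
```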