System Bus

Mental Model

A bus is the highway system inside your computer. The address bus names the destination, the data bus carries the cargo, and the control bus acts as the traffic signals. Just like a real highway, bandwidth is finite and congestion creates bottlenecks -- understanding bus speeds tells you why some operations are inherently slower than others.

What is a Bus?

A bus is a shared communication pathway that transfers data between computer components such as the CPU, RAM, GPU, and I/O devices.

┌─────────────────────────────────────────────────────────────┐
│                         System Bus                          │
│   ═══════════════════════════════════════════════════════   │
│        │             │             │             │          │
│    ┌───┴───┐     ┌───┴───┐     ┌───┴───┐     ┌───┴───┐      │
│    │  CPU  │     │  RAM  │     │  GPU  │     │  I/O  │      │
│    └───────┘     └───────┘     └───────┘     └───────┘      │
└─────────────────────────────────────────────────────────────┘

Bus Components

A bus consists of three types of lines:

1. Address Bus

Specifies where to read/write data:

```
Address Bus (unidirectional: CPU → Memory)

┌─────┐                              ┌─────────┐
│ CPU │ ══════ Address Lines ══════▶ │ Memory  │
└─────┘      (e.g., 0x7FFF0000)      └─────────┘

Width determines addressable memory:
  32-bit address bus → 2³² bytes = 4 GiB addressable
  64-bit address bus → 2⁶⁴ bytes = 16 EiB addressable (theoretical)
```
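The width-to-capacity relationship can be checked with a few lines of Python. This is a minimal sketch that assumes byte-addressable memory; `addressable_bytes` and `human_readable` are hypothetical helper names introduced here, and sizes are printed in binary units (1 GiB = 2³⁰ bytes).

```python
def addressable_bytes(address_bits: int) -> int:
    """Number of distinct byte addresses an address bus of this width can name."""
    return 2 ** address_bits


def human_readable(num_bytes: int) -> str:
    """Format a byte count using binary (1024-based) units."""
    units = ["B", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB"]
    value = float(num_bytes)
    for unit in units:
        # Stop at the first unit that keeps the value below 1024,
        # or at the largest unit we know about.
        if value < 1024 or unit == units[-1]:
            return f"{value:g} {unit}"
        value /= 1024


for bits in (16, 32, 64):
    print(f"{bits}-bit address bus -> {human_readable(addressable_bytes(bits))}")
```

Each extra address line doubles the addressable range, which is why going from 32 to 64 bits multiplies the limit by about four billion rather than by two.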

2. Data Bus

Transfers actual data between components:

```
Data Bus (bidirectional)

┌─────┐                             ┌─────────┐
│ CPU │ ◀══════ Data Lines ═══════▶ │ Memory  │
└─────┘    (e.g., 64 bits wide)     └─────────┘

Width determines transfer size per cycle:
  32-bit data bus → 4 bytes per transfer
  64-bit data bus → 8 bytes per transfer
```
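To see what data-bus width means in practice, the sketch below counts how many transfers are needed to move one cache line. The 64-byte cache-line size is an assumption (it is common on x86-64 CPUs but not stated above), and `transfers_needed` is a hypothetical helper.

```python
import math


def transfers_needed(payload_bytes: int, bus_width_bits: int) -> int:
    """Bus transfers required to move a payload, rounding up partial transfers."""
    bytes_per_transfer = bus_width_bits // 8
    return math.ceil(payload_bytes / bytes_per_transfer)


CACHE_LINE_BYTES = 64  # assumed cache-line size, not taken from the text above

for width in (32, 64):
    n = transfers_needed(CACHE_LINE_BYTES, width)
    print(f"{width}-bit data bus: {n} transfers per {CACHE_LINE_BYTES}-byte cache line")
```

Doubling the bus width halves the number of transfers for the same payload, which is the same effect as doubling the clock at a fixed width.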

3. Control Bus

Carries command signals:

```
Control Bus (various directions)

┌─────┐                              ┌─────────┐
│ CPU │ ◀═══ Control Signals ══════▶ │ Memory  │
└─────┘                              └─────────┘

Signals include:
  - Read/Write
  - Clock
  - Interrupt
  - Bus Request/Grant
```

Bus Operation

Read Operation

```
CPU wants to read from address 0x1000:

1. CPU places 0x1000 on Address Bus       ────▶
2. CPU sets Read signal on Control Bus    ────▶
3. Memory reads data at 0x1000
4. Memory places data on Data Bus         ◀────
5. CPU reads data from Data Bus
```

Write Operation

```
CPU wants to write 42 to address 0x1000:

1. CPU places 0x1000 on Address Bus       ────▶
2. CPU places 42 on Data Bus              ────▶
3. CPU sets Write signal on Control Bus   ────▶
4. Memory stores 42 at address 0x1000
```
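The read and write sequences above can be mimicked with a toy model. This is only an illustrative sketch: the `Bus` and `Memory` classes and the `cpu_read`/`cpu_write` functions are hypothetical names, and real hardware adds clocking, handshaking, and timing constraints that are left out here.

```python
from dataclasses import dataclass, field


@dataclass
class Bus:
    """The three groups of lines shared by the CPU and memory."""
    address: int = 0
    data: int = 0
    control: str = "IDLE"  # one of "READ", "WRITE", "IDLE"


@dataclass
class Memory:
    cells: dict = field(default_factory=dict)

    def respond(self, bus: Bus) -> None:
        """React to whatever the control lines currently signal."""
        if bus.control == "READ":
            # Memory drives the data bus with the stored value (0 if unset)
            bus.data = self.cells.get(bus.address, 0)
        elif bus.control == "WRITE":
            # Memory latches the value the CPU placed on the data bus
            self.cells[bus.address] = bus.data


def cpu_write(bus: Bus, memory: Memory, address: int, value: int) -> None:
    bus.address = address      # 1. address on the address bus
    bus.data = value           # 2. value on the data bus
    bus.control = "WRITE"      # 3. assert the write signal
    memory.respond(bus)        # 4. memory stores the value
    bus.control = "IDLE"


def cpu_read(bus: Bus, memory: Memory, address: int) -> int:
    bus.address = address      # 1. address on the address bus
    bus.control = "READ"       # 2. assert the read signal
    memory.respond(bus)        # 3-4. memory looks up and drives the data bus
    value = bus.data           # 5. CPU samples the data bus
    bus.control = "IDLE"
    return value


bus, memory = Bus(), Memory()
cpu_write(bus, memory, 0x1000, 42)
print(cpu_read(bus, memory, 0x1000))  # prints 42
```

Note that the CPU and memory never call each other with the value directly; everything passes through the shared `Bus` object, which is the point of the model.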

Bus Hierarchy

Modern computers have multiple buses at different speeds:

```
┌─────────────────────────────────────────────────────────────┐
│                             CPU                             │
│  ┌─────────────────────────────────────────────────────┐    │
│  │               Internal Bus (fastest)                │    │
│  │              Registers ←→ ALU ←→ Cache              │    │
│  └─────────────────────────────────────────────────────┘    │
└────────────────────────────┬────────────────────────────────┘
                             │
              ┌──────────────┴──────────────┐
              │    Front-Side Bus / QPI     │  (~25 GB/s)
              │  (CPU ↔ Memory Controller)  │
              └──────────────┬──────────────┘
                             │
              ┌──────────────┴──────────────┐
              │         Memory Bus          │  (~50 GB/s)
              │         (DDR4/DDR5)         │
              └──────────────┬──────────────┘
                             │
                         ┌───┴───┐
                         │  RAM  │
                         └───────┘

          ┌──────────────────────────────┐
          │         PCIe Bus             │  (~32 GB/s per x16)
          │   (CPU ↔ GPU, NVMe, etc.)    │
          └──────────────────────────────┘
```

Bus Speed and Bandwidth

Calculating Bus Bandwidth

```
Bandwidth = Bus Width × Clock Speed × Transfers per Clock

Example DDR4-3200:
  Width:     64 bits = 8 bytes
  Speed:     1600 MHz (I/O bus clock)
  Transfers: 2 per clock (DDR = Double Data Rate)

Bandwidth = 8 bytes × 1600 MHz × 2 = 25,600 MB/s ≈ 25.6 GB/s per channel
```
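The formula translates directly into code. The sketch below reproduces the DDR4-3200 example; `bus_bandwidth_mb_s` is a hypothetical helper, and the two-channel line assumes a dual-channel configuration, which the example above does not mention.

```python
def bus_bandwidth_mb_s(width_bits: int, clock_mhz: float, transfers_per_clock: int) -> float:
    """Peak bandwidth in MB/s: bytes per transfer x clock (MHz) x transfers per clock."""
    return (width_bits / 8) * clock_mhz * transfers_per_clock


ddr4_3200 = bus_bandwidth_mb_s(width_bits=64, clock_mhz=1600, transfers_per_clock=2)
print(f"DDR4-3200, one channel:  {ddr4_3200:,.0f} MB/s")
print(f"DDR4-3200, two channels: {2 * ddr4_3200:,.0f} MB/s")  # assumes dual-channel
```

These are peak figures; measured bandwidth is lower because of refresh cycles, row activations, and other protocol overhead.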

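Common Bus Bandwidths

| Bus Type | Bandwidth (approx.) | Use |
|---|---|---|
| CPU Internal | ~1 TB/s | Register ↔ ALU |
| L1 Cache | ~500 GB/s | L1 ↔ CPU |
| L3 Cache | ~200 GB/s | L3 ↔ L2 |
| Memory (DDR4) | ~25 GB/s per channel | RAM ↔ CPU |
| PCIe 4.0 x16 | ~32 GB/s | GPU ↔ CPU |
| SATA III | ~600 MB/s | SSD ↔ CPU |
| USB 3.0 | ~500 MB/s usable (5 Gbit/s raw, 8b/10b encoded) | Peripherals |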
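One way to get a feel for these numbers is to ask how long it takes, at best, to move a fixed payload over each bus. The sketch below uses the approximate figures from a subset of the table rows; `min_transfer_seconds` is a hypothetical helper, and the result is a lower bound that ignores latency and protocol overhead.

```python
# Approximate peak bandwidths in GB/s, taken from a subset of the table rows
BANDWIDTH_GB_S = {
    "L1 Cache": 500,
    "Memory (DDR4)": 25,
    "PCIe 4.0 x16": 32,
    "SATA III": 0.6,
}


def min_transfer_seconds(num_bytes: int, bandwidth_gb_s: float) -> float:
    """Lower bound on transfer time: payload size divided by peak bandwidth."""
    return num_bytes / (bandwidth_gb_s * 1e9)


payload = 1_000_000_000  # 1 GB
for name, bw in BANDWIDTH_GB_S.items():
    ms = min_transfer_seconds(payload, bw) * 1e3
    print(f"{name:<15} {ms:10.1f} ms")
```

The spread covers roughly three orders of magnitude, which is why the same 1 GB can feel instantaneous in cache-friendly code and slow when it has to come from disk.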

PCIe: Modern Expansion Bus

PCIe (Peripheral Component Interconnect Express) is the primary expansion bus:

```
PCIe Lane Configuration

x1:  [──────]                                      ~2 GB/s (PCIe 4.0)
x4:  [──────────────]                              ~8 GB/s
x8:  [──────────────────────────]                  ~16 GB/s
x16: [──────────────────────────────────────────]  ~32 GB/s
```

PCIe Generations

| Generation | Per-Lane Bandwidth (per direction) | x16 Total (per direction) |
|---|---|---|
| PCIe 3.0 | ~1 GB/s | ~16 GB/s |
| PCIe 4.0 | ~2 GB/s | ~32 GB/s |
| PCIe 5.0 | ~4 GB/s | ~64 GB/s |
| PCIe 6.0 | ~8 GB/s | ~128 GB/s |
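Because link bandwidth scales linearly with lane count, the whole table can be derived from the per-lane column. The following sketch does that and estimates the best-case time to move a batch of data to a GPU; `pcie_link_gb_s` and `transfer_ms` are hypothetical helpers, and the 512 MiB batch size is an arbitrary illustration.

```python
# Approximate per-lane bandwidth in GB/s (per direction), copied from the table
PER_LANE_GB_S = {"3.0": 1, "4.0": 2, "5.0": 4, "6.0": 8}


def pcie_link_gb_s(generation: str, lanes: int) -> float:
    """Approximate link bandwidth: per-lane bandwidth times lane count."""
    return PER_LANE_GB_S[generation] * lanes


def transfer_ms(num_bytes: int, generation: str, lanes: int) -> float:
    """Lower bound on the time to move num_bytes across the link, in milliseconds."""
    return num_bytes / (pcie_link_gb_s(generation, lanes) * 1e9) * 1e3


batch_bytes = 512 * 1024 * 1024  # hypothetical 512 MiB batch headed for a GPU
for gen in PER_LANE_GB_S:
    print(f"PCIe {gen} x16: {pcie_link_gb_s(gen, 16):>4} GB/s, "
          f"{transfer_ms(batch_bytes, gen, 16):6.1f} ms for 512 MiB")
```

Each generation doubles the per-lane rate, so a PCIe 4.0 x8 link and a PCIe 3.0 x16 link offer about the same bandwidth.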

Bus Contention

Contention occurs when multiple components need the bus at the same time:

```
Problem: Bus Contention

Time →
Device A: [Request]────[Wait]────[Wait]────[Transfer]
Device B: ─────────[Request]────[Wait]────[Wait]────[Transfer]
Device C: ─────────────────[Request]────[Wait]────[Wait]────[Transfer]

Only one device can use the bus at a time!
```

Bus Arbitration

A bus arbiter decides who gets access:

```
Arbitration Methods:

1. Priority-based: Higher priority devices go first
2. Round-robin: Fair rotation among devices
3. First-come-first-served: Queue-based
```
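The difference between the first two policies shows up clearly in a small simulation. This is a simplified sketch, not a model of any real arbiter: `grant_order` is a hypothetical function, every transfer is assumed to occupy the bus for exactly one cycle, and all requests are assumed to be pending from the start. First-come-first-served is omitted because with simultaneous requests it reduces to draining a queue in arrival order.

```python
def grant_order(pending: dict, policy: str) -> list:
    """Return the order in which one-cycle bus transfers are granted.

    pending: maps device name -> number of transfers it still needs;
             dict order doubles as priority order (first key = highest).
    policy:  "priority" or "round_robin".
    """
    remaining = dict(pending)   # copy so the caller's dict is untouched
    devices = list(remaining)
    order = []
    next_index = 0              # round-robin pointer

    while any(remaining.values()):
        if policy == "priority":
            # Highest-priority device that still has work always wins
            winner = next(d for d in devices if remaining[d] > 0)
        elif policy == "round_robin":
            # Skip devices with nothing to send, then advance the pointer
            while remaining[devices[next_index]] == 0:
                next_index = (next_index + 1) % len(devices)
            winner = devices[next_index]
            next_index = (next_index + 1) % len(devices)
        else:
            raise ValueError(f"unknown policy: {policy}")
        remaining[winner] -= 1
        order.append(winner)
    return order


requests = {"A": 2, "B": 2, "C": 1}
print(grant_order(requests, "priority"))     # ['A', 'A', 'B', 'B', 'C']
print(grant_order(requests, "round_robin"))  # ['A', 'B', 'C', 'A', 'B']
```

Under strict priority, device C waits until A and B are completely finished, which is how low-priority devices can starve; round-robin bounds each device's wait at the cost of delaying the most urgent one.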

Python Perspective

Why Bus Speed Matters

```python
import numpy as np
import time

# Memory bandwidth limits computation
def memory_bound_operation():
    arr = np.random.rand(125_000_000)  # 125M float64 values = 1 GB

    start = time.perf_counter()
    # Simple reduction: one add per 8 bytes read, so the memory bus is the limit
    result = np.sum(arr)
    elapsed = time.perf_counter() - start

    bandwidth = arr.nbytes / elapsed / 1e9
    print(f"Achieved bandwidth: {bandwidth:.1f} GB/s")
    # Varies by machine (often tens of GB/s); bounded above by peak memory bandwidth

memory_bound_operation()
```

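PCIe and GPU Operations

```python
import time
import torch

# GPU data transfer goes over PCIe
data = torch.randn(8192, 8192)  # 256 MiB of float32, large enough to hide fixed overhead

# Warm-up: the first CUDA call also pays for one-time context creation
torch.zeros(1).to('cuda')
torch.cuda.synchronize()

# CPU → GPU (over PCIe)
start = time.perf_counter()
data_gpu = data.to('cuda')
torch.cuda.synchronize()
transfer_time = time.perf_counter() - start

bytes_transferred = data.numel() * data.element_size()  # 4 bytes per float32
bandwidth = bytes_transferred / transfer_time / 1e9
print(f"CPU→GPU bandwidth: {bandwidth:.1f} GB/s")

# Typically ~12-15 GB/s on a PCIe 3.0 x16 link (PCIe limited);
# pageable host memory may measure lower
```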

Summary

| Component | Function | Direction |
|---|---|---|
| Address Bus | Specifies memory location | CPU → Memory |
| Data Bus | Transfers actual data | Bidirectional |
| Control Bus | Command signals | Various |

Key points:

- Bus bandwidth often limits system performance
- There are multiple bus levels with different speeds
- PCIe connects high-speed devices (GPU, NVMe)
- Bus contention can create bottlenecks
- Memory bandwidth (~25 GB/s per DDR4-3200 channel, ~50 GB/s dual-channel) often limits Python/NumPy performance
- CPU ↔ GPU transfer bandwidth over PCIe (~15 GB/s in practice) limits data movement

Exercises

Exercise 1. Explain what a system bus is and name its three main components (data bus, address bus, control bus).

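Solution to Exercise 1

A system bus is the shared communication pathway that moves data between the CPU, memory, and I/O devices. It consists of three groups of lines:

- Address bus: carries the memory location to read or write. It runs from the CPU to memory, and its width sets how much memory can be addressed (a 32-bit address bus reaches 2³² bytes).
- Data bus: carries the actual data in either direction. Its width sets how many bytes move per transfer (a 64-bit data bus moves 8 bytes).
- Control bus: carries command and timing signals such as Read/Write, Clock, Interrupt, and Bus Request/Grant.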

Exercise 2. Explain how bus bandwidth limits the rate at which data can move between the CPU and memory. How does this relate to the 'memory wall'?

Solution to Exercise 2

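Every byte the CPU processes from RAM must cross the memory bus, so sustained throughput cannot exceed Bus Width × Clock Speed × Transfers per Clock (about 25 GB/s for one DDR4-3200 channel). An operation that does little work per byte, such as summing a large array, finishes only as fast as the bus can deliver data, no matter how fast the ALU is. The 'memory wall' is the name for this widening gap: processor compute throughput has improved faster than memory bandwidth and latency, so more and more programs, including NumPy code on large arrays, are limited by data movement rather than by arithmetic.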

Exercise 3. Write Python code that demonstrates the concept of bandwidth limitation by comparing the time to process data from memory versus from CPU cache (using small vs large arrays).

Solution to Exercise 3

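```python
import time
import numpy as np

def gb_per_s(n_elements, repeats):
    """Sum a float64 array `repeats` times and return the achieved GB/s."""
    arr = np.ones(n_elements)
    arr.sum()  # warm-up pass so the small array is already in cache
    start = time.perf_counter()
    for _ in range(repeats):
        arr.sum()
    elapsed = time.perf_counter() - start
    return arr.nbytes * repeats / elapsed / 1e9

# 256 KB fits in L2 cache on most CPUs; 400 MB does not fit in any cache level
print(f"Small array (cache): {gb_per_s(32_000, 20_000):.1f} GB/s")
print(f"Large array (RAM):   {gb_per_s(50_000_000, 5):.1f} GB/s")
```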

Exercise 4. Explain what DMA (Direct Memory Access) is and how it helps reduce CPU overhead during I/O operations.

Solution to Exercise 4

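DMA (Direct Memory Access) lets a device, or a dedicated DMA controller, transfer data to and from main memory without the CPU copying each word. The CPU only sets up the transfer (source, destination, and length). The DMA engine then requests the bus through the Bus Request/Grant control signals, moves the data itself, and raises an interrupt when the transfer is complete. In the meantime the CPU is free to run other code, so I/O such as disk reads, network traffic, and CPU ↔ GPU copies costs far less CPU time than programmed I/O, where the CPU moves every byte itself. DMA does not remove the bus bandwidth limit: the transfer still occupies the bus and can contend with CPU memory accesses.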