Single Machine vs Cluster
Mental Model
A single machine is like one strong worker: simple to manage, but limited by what one person can carry. A cluster is a team of workers that can share the load, but now you pay the cost of communication and coordination. Use one machine until you hit a hard wall (RAM, time, or availability), then distribute.
When One Computer Isn't Enough
A single machine has hard limits. When you exceed them, you need multiple machines working together.
```
Single Machine Limits:

┌─────────────────────────────────────────────────────────────┐
│                        One Computer                         │
│                                                             │
│  CPU Cores:    Limited (4-128)                              │
│  RAM:          Limited (8 GB - 2 TB)                        │
│  Storage:      Limited (256 GB - 100 TB)                    │
│  GPU Memory:   Limited (8-80 GB per GPU)                    │
│  Availability: Single point of failure                      │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Cluster of Machines:

┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐
│  Node 1  │   │  Node 2  │   │  Node 3  │   │  Node N  │
│ 64 cores │   │ 64 cores │   │ 64 cores │   │ 64 cores │
│  256 GB  │   │  256 GB  │   │  256 GB  │   │  256 GB  │
└────┬─────┘   └────┬─────┘   └────┬─────┘   └────┬─────┘
     └──────────────┴──────────────┴──────────────┘
                     Network Fabric

Combined: 64N cores, 256N GB RAM, fault tolerant
```
Scaling Strategies
Vertical Scaling (Scale Up)
Get a bigger machine:
```
Vertical Scaling:

Before:              After:
┌──────────┐         ┌────────────────┐
│ 4 cores  │    →    │ 64 cores       │
│ 16 GB    │         │ 512 GB         │
│ 1 GPU    │         │ 8 GPUs         │
└──────────┘         └────────────────┘

Pros:
+ Simple (no code changes)
+ No network overhead
+ Easy to manage

Cons:
- Hard limits exist
- Expensive at high end
- Single point of failure
```
Horizontal Scaling (Scale Out)
Add more machines:
```
Horizontal Scaling:

Before:              After:
┌──────────┐         ┌──────────┐ ┌──────────┐ ┌──────────┐
│ 4 cores  │    →    │ 4 cores  │ │ 4 cores  │ │ 4 cores  │
│ 16 GB    │         │ 16 GB    │ │ 16 GB    │ │ 16 GB    │
└──────────┘         └──────────┘ └──────────┘ └──────────┘

Pros:
+ Nearly unlimited scale
+ Fault tolerant
+ Cost-effective (commodity hardware)

Cons:
- Code complexity
- Network overhead
- Coordination challenges
```
When to Use a Cluster
Decision Framework
```python
def need_cluster(
    data_size_gb,
    memory_required_gb,
    compute_hours_single,
    availability_requirement,
    local_machine_specs
):
    """Determine if a cluster is needed."""
    reasons = []

    # Memory constraint (the 0.8 factor leaves 20% headroom)
    if memory_required_gb > local_machine_specs['ram_gb'] * 0.8:
        reasons.append(f"Data ({memory_required_gb} GB) exceeds usable RAM")

    # Time constraint
    if compute_hours_single > 24:
        reasons.append(f"Computation too slow ({compute_hours_single}h)")

    # Availability constraint
    if availability_requirement == '99.99%':
        reasons.append("High availability requires redundancy")

    # Storage constraint (same 20% headroom)
    if data_size_gb > local_machine_specs['storage_gb'] * 0.8:
        reasons.append("Data exceeds storage capacity")

    if reasons:
        print("Cluster recommended:")
        for r in reasons:
            print(f"  - {r}")
        return True
    else:
        print("Single machine sufficient")
        return False
```
Common Thresholds
| Constraint | Single Machine Limit | Cluster Benefit |
|---|---|---|
| RAM | ~2 TB max | Aggregate memory |
| Compute | Hours/days | Parallel speedup |
| Storage | ~100 TB | Distributed storage |
| Availability | ~99% | Redundancy → 99.99%+ |
| Throughput | Limited I/O | Parallel I/O |
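The availability row can be checked with a little arithmetic. The sketch below assumes that node failures are independent and that the service stays up as long as at least one replica is up. Both are simplifying assumptions, and `combined_availability` is a name chosen here for illustration.

```python
def combined_availability(node_availability: float, replicas: int) -> float:
    """Probability that at least one of `replicas` independent nodes is up."""
    if not 0.0 <= node_availability <= 1.0:
        raise ValueError("node_availability must be between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    # The service is down only when every replica is down at the same time.
    return 1.0 - (1.0 - node_availability) ** replicas


for replicas in (1, 2, 3):
    print(f"{replicas} replica(s): {combined_availability(0.99, replicas):.6f}")
```

Under these assumptions, two nodes that are each 99% available already reach 99.99%. Real clusters see correlated failures (shared power, network, or software bugs), so treat this as an upper bound.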
Cluster Architectures
Shared-Nothing Architecture
Each node is independent:
```
Shared-Nothing:

┌─────────────┐   ┌─────────────┐   ┌─────────────┐
│   Node 1    │   │   Node 2    │   │   Node 3    │
│ ┌─────────┐ │   │ ┌─────────┐ │   │ ┌─────────┐ │
│ │   CPU   │ │   │ │   CPU   │ │   │ │   CPU   │ │
│ │   RAM   │ │   │ │   RAM   │ │   │ │   RAM   │ │
│ │  Disk   │ │   │ │  Disk   │ │   │ │  Disk   │ │
│ └─────────┘ │   │ └─────────┘ │   │ └─────────┘ │
└──────┬──────┘   └──────┬──────┘   └──────┬──────┘
       └─────────────────┴─────────────────┘
                   Network Only

Examples: Hadoop, Spark, Cassandra
Pros: Scales well, fault tolerant
Cons: Data must be partitioned
```
Shared-Storage Architecture
Nodes share a storage layer:
```
Shared-Storage:

┌─────────────┐   ┌─────────────┐   ┌─────────────┐
│   Node 1    │   │   Node 2    │   │   Node 3    │
│ ┌─────────┐ │   │ ┌─────────┐ │   │ ┌─────────┐ │
│ │   CPU   │ │   │ │   CPU   │ │   │ │   CPU   │ │
│ │   RAM   │ │   │ │   RAM   │ │   │ │   RAM   │ │
│ └─────────┘ │   │ └─────────┘ │   │ └─────────┘ │
└──────┬──────┘   └──────┬──────┘   └──────┬──────┘
       └─────────────────┴─────────────────┘
                         │
               ┌─────────┴─────────┐
               │  Shared Storage   │
               │    (SAN / NFS)    │
               └───────────────────┘

Examples: Traditional databases, HPC
Pros: Simpler data management
Cons: Storage can be bottleneck
```
Data Parallelism
Partition Data Across Nodes
```
Data Parallelism (MapReduce pattern):

Original Data: [A B C D E F G H I J K L]

Partition:
  Node 1: [A B C D]
  Node 2: [E F G H]
  Node 3: [I J K L]

Process in parallel:
  Node 1: process([A B C D]) → result1
  Node 2: process([E F G H]) → result2
  Node 3: process([I J K L]) → result3

Combine:
  final_result = combine(result1, result2, result3)
```
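The partition, process, and combine steps above can be sketched in plain Python. This is a minimal illustration, not a cluster implementation: threads stand in for nodes, and the `partition`, `process`, and `combine` functions are hypothetical helpers named after the steps in the diagram.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(items, n_parts):
    """Split items into n_parts contiguous chunks of near-equal size."""
    size, extra = divmod(len(items), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        # The first `extra` chunks take one additional item each.
        stop = start + size + (1 if i < extra else 0)
        chunks.append(items[start:stop])
        start = stop
    return chunks


def process(chunk):
    """Stand-in for per-node work: count the items in the chunk."""
    return len(chunk)


def combine(results):
    """Merge the partial results into the final answer."""
    return sum(results)


data = list("ABCDEFGHIJKL")
chunks = partition(data, 3)

# Threads play the role of nodes; a real cluster would ship each chunk
# to a different machine over the network.
with ThreadPoolExecutor(max_workers=3) as pool:
    partials = list(pool.map(process, chunks))

final_result = combine(partials)
print(chunks)
print(final_result)
```

The pattern only works this cleanly when `process` needs nothing outside its own chunk and `combine` is cheap compared with the per-chunk work.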
Python with Dask
```python
import dask.dataframe as dd

# Single machine pandas - limited by RAM
# import pandas as pd
# df = pd.read_csv('huge_file.csv')  # Fails if > RAM

# Dask - partitions the file; runs on a cluster when a distributed scheduler is attached
df = dd.read_csv('huge_file.csv')  # Lazy, partitioned

# Pandas-like API; nothing executes until .compute() is called
result = df.groupby('category').sum().compute()
```
Python with PySpark
```python
from pyspark.sql import SparkSession

# Create Spark session (connects to cluster)
spark = SparkSession.builder \
    .appName("MyApp") \
    .master("spark://cluster:7077") \
    .getOrCreate()

# Load distributed dataset
# header=True keeps the column names; inferSchema=True makes numeric columns summable
df = spark.read.csv("hdfs://cluster/huge_file.csv", header=True, inferSchema=True)

# Operations distributed across cluster
result = df.groupBy("category").sum().collect()
```
Challenges of Distributed Computing
1. Network Overhead
```
Single Machine:
  Memory access:  ~60 ns

Cluster:
  Network access: ~100,000 ns (100 μs)
                  ~1,000,000 ns (1 ms) across datacenter

Network is roughly 1,000-10,000x slower than memory!
```
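To see what these latencies mean in practice, the following sketch computes the slowdown ratios and the time spent waiting when accesses happen one after another. The constants are the figures quoted above; the function names and the one-million-access scenario are assumptions made for illustration.

```python
MEMORY_NS = 60               # memory access, as quoted above
NETWORK_FAST_NS = 100_000    # ~100 microseconds
NETWORK_SLOW_NS = 1_000_000  # ~1 millisecond, across the datacenter


def slowdown(remote_ns, local_ns=MEMORY_NS):
    """How many times slower an access is than a local memory access."""
    return remote_ns / local_ns


def sequential_wait_seconds(accesses, latency_ns):
    """Total wait if `accesses` lookups are issued strictly one at a time."""
    return accesses * latency_ns / 1e9


cases = [
    ("memory", MEMORY_NS),
    ("network (fast)", NETWORK_FAST_NS),
    ("network (slow)", NETWORK_SLOW_NS),
]
for label, latency in cases:
    ratio = slowdown(latency)
    wait = sequential_wait_seconds(1_000_000, latency)
    print(f"{label:<15} {ratio:>8,.0f}x  {wait:>8.2f} s per million accesses")
```

With these figures the ratios come out near 1,700x and 17,000x, the same order of magnitude as the range quoted above. A million sequential remote lookups cost minutes rather than a fraction of a second, which is why distributed frameworks batch work instead of fetching items one at a time.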
2. Partial Failures
```
Single Machine: Either works or doesn't

Cluster:
┌──────────┐   ┌──────────┐   ┌──────────┐
│  Node 1  │   │  Node 2  │   │  Node 3  │
│    OK    │   │  FAILED  │   │    OK    │
└──────────┘   └──────────┘   └──────────┘

What happens to Node 2's work?
Need: Retry, redundancy, checkpointing
```
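Of the three remedies, retry is the simplest to sketch. The example below simulates a node that fails a fixed number of times before succeeding. `NodeFailure`, `run_with_retry`, and `make_flaky_task` are hypothetical names, and real frameworks typically combine retry with rescheduling onto a different node.

```python
import time


class NodeFailure(Exception):
    """Raised by a task when the (simulated) node running it fails."""


def run_with_retry(task, max_attempts=3, backoff_seconds=0.0):
    """Run task(); on NodeFailure, retry up to max_attempts times in total."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except NodeFailure:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            time.sleep(backoff_seconds * attempt)  # simple linear backoff


def make_flaky_task(failures_before_success):
    """Build a task that fails a fixed number of times, then succeeds."""
    state = {"calls": 0}

    def task():
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise NodeFailure(f"node failed on call {state['calls']}")
        return "result2"

    return task


# Fails twice, then succeeds on the third and final attempt.
print(run_with_retry(make_flaky_task(2)))
```

Retrying is only safe when the task can run more than once without side effects piling up, which is why idempotent operations appear among the coordination solutions.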
3. Coordination
```
Problems that arise:
- Which node handles which data?
- How to synchronize results?
- What if nodes disagree?
- How to handle stragglers?

Solutions:
- Consensus protocols (Paxos, Raft)
- Distributed coordination (ZooKeeper)
- Idempotent operations
- Speculative execution
```
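Idempotent operations are the easiest of these solutions to illustrate without a cluster. In the sketch below, results are keyed by a task ID so that a duplicate delivery (from a retry or a speculative copy of a straggler) has no effect. `ResultStore` and the task IDs are hypothetical, and the store lives in one process here rather than behind a coordination service.

```python
class ResultStore:
    """Records each task's result once, keyed by a unique task ID."""

    def __init__(self):
        self._results = {}

    def record(self, task_id, value):
        """Store value for task_id; return False if it was already recorded."""
        if task_id in self._results:
            return False  # duplicate from a retry or speculative copy
        self._results[task_id] = value
        return True

    def total(self):
        """Sum of all recorded results, each task counted once."""
        return sum(self._results.values())


store = ResultStore()
store.record("partition-1", 10)
store.record("partition-2", 20)
store.record("partition-2", 20)  # the straggler's duplicate arrives late
print(store.total())  # 30: the duplicate was ignored
```

Because recording the same result twice is harmless, the scheduler is free to launch backup copies of slow tasks and keep whichever finishes first.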
Comparison Summary
| Aspect | Single Machine | Cluster |
|---|---|---|
| Setup | Simple | Complex |
| Scaling | Limited | Nearly unlimited |
| Latency | Nanoseconds | Microseconds-milliseconds |
| Fault tolerance | None | Built-in |
| Cost | Lower initially | Higher, but scales better |
| Code complexity | Simple | Distributed algorithms |
| Debugging | Easy | Hard |
Decision Checklist
```
Use Single Machine when:
□ Data fits in memory (with headroom)
□ Computation completes in acceptable time
□ Downtime is acceptable
□ Simpler is better

Use Cluster when:
□ Data exceeds single machine capacity
□ Need faster results (parallel speedup)
□ Require high availability
□ Workload is embarrassingly parallel
□ Already using distributed frameworks
```
Starting Point Recommendations
```
Data Size          Recommended Approach
─────────────────────────────────────────────
< 10 GB            Single machine, pandas
10-100 GB          Single machine, chunked processing
100 GB - 1 TB      Consider Dask (single or cluster)
1-10 TB            Spark cluster
> 10 TB            Dedicated big data infrastructure
```
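The table can be expressed as a small lookup function. The table does not say which side owns each boundary, so the sketch assumes 1 TB = 1,000 GB and that each range includes its lower bound; `recommend_approach` is a hypothetical helper, not part of any library.

```python
def recommend_approach(data_size_gb: float) -> str:
    """Map a dataset size in GB to the starting point suggested in the table."""
    if data_size_gb < 0:
        raise ValueError("data_size_gb must be non-negative")
    if data_size_gb < 10:
        return "Single machine, pandas"
    if data_size_gb < 100:
        return "Single machine, chunked processing"
    if data_size_gb < 1_000:
        return "Consider Dask (single or cluster)"
    if data_size_gb <= 10_000:
        return "Spark cluster"
    return "Dedicated big data infrastructure"


for size in (2, 50, 400, 5_000, 50_000):
    print(f"{size:>6} GB -> {recommend_approach(size)}")
```

Size is only a first filter. Time and availability requirements from the decision framework can push a small dataset onto a cluster as well.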
Exercises
Exercise 1. Explain the advantages and disadvantages of using a single powerful machine versus a cluster of machines for computation.
Solution to Exercise 1
```python
# Conceptual solution - see page content for details
import sys
import platform

print(f"Python version: {sys.version}")
print(f"Platform: {platform.platform()}")
print(f"Architecture: {platform.machine()}")
```
Exercise 2. Describe Amdahl's Law and explain how it limits the speedup from parallelization.
Solution to Exercise 2
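Amdahl's Law states that if a fraction p of a program's running time can be parallelized and the remaining 1 - p must run serially, the speedup on N workers is S(N) = 1 / ((1 - p) + p / N). As N grows, the speedup approaches 1 / (1 - p), so the serial portion caps the benefit: a program that is 90% parallel can never run more than 10 times faster, no matter how many machines are added.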
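Amdahl's Law gives the speedup on N workers as S(N) = 1 / ((1 - p) + p / N), where p is the fraction of the running time that can be parallelized. The sketch below evaluates it for a workload assumed to be 90% parallel; the 0.9 figure and the function name `amdahl_speedup` are illustrative choices.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Speedup predicted by Amdahl's Law: 1 / ((1 - p) + p / N)."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


for workers in (1, 2, 8, 64, 1024):
    print(f"{workers:>5} workers: {amdahl_speedup(0.9, workers):.2f}x")
```

Even with 1,024 workers the predicted speedup stays below 10x, because the 10% serial portion takes the same time however many machines are added.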
Exercise 3. Write Python code using the multiprocessing module to parallelize a simple computation across multiple CPU cores.
Solution to Exercise 3
```python
import time

# Simple benchmark
n = 10_000_000
start = time.perf_counter()
total = sum(range(n))
elapsed = time.perf_counter() - start
print(f"Sum of {n} integers: {total}")
print(f"Time: {elapsed:.4f} seconds")
```
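The benchmark above runs in a single process. A version that uses the `multiprocessing` module, as the exercise asks, might look like the following sketch. It assumes four worker processes, and `make_ranges`, `partial_sum`, and `parallel_sum` are hypothetical helper names.

```python
import time
from multiprocessing import Pool


def make_ranges(n, parts):
    """Split range(n) into at most `parts` contiguous (start, stop) pairs."""
    if n <= 0:
        return []
    step = -(-n // parts)  # ceiling division so the last range reaches n
    return [(start, min(start + step, n)) for start in range(0, n, step)]


def partial_sum(bounds):
    """Sum the integers in [start, stop); runs inside a worker process."""
    start, stop = bounds
    return sum(range(start, stop))


def parallel_sum(n, workers=4):
    """Sum range(n) by farming sub-ranges out to a pool of processes."""
    with Pool(processes=workers) as pool:
        return sum(pool.map(partial_sum, make_ranges(n, workers)))


if __name__ == "__main__":  # guard needed where workers are spawned, not forked
    n = 10_000_000
    start = time.perf_counter()
    total = parallel_sum(n)
    elapsed = time.perf_counter() - start
    print(f"Sum of {n} integers: {total}")
    print(f"Time with 4 processes: {elapsed:.4f} seconds")
```

Starting processes and passing results between them has a cost of its own, so for a computation this small the parallel version is not guaranteed to beat the serial one. Measure both on your machine.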
Exercise 4. Explain the communication overhead in distributed computing. Why doesn't doubling the number of machines always halve the computation time?
Solution to Exercise 4
```python
import numpy as np
import time

n = 1_000_000

# Python loop
start = time.perf_counter()
result_py = sum(i * i for i in range(n))
time_py = time.perf_counter() - start

# NumPy vectorized
arr = np.arange(n)
start = time.perf_counter()
result_np = np.sum(arr * arr)
time_np = time.perf_counter() - start

print(f"Python: {time_py:.4f}s, NumPy: {time_np:.4f}s")
print(f"Speedup: {time_py / time_np:.1f}x")
```
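The benchmark above measures speedup on one machine. To connect it to communication overhead, the toy model below splits the work evenly across nodes and adds a coordination cost that grows with each extra node. The model and its constants (one hour of compute, 5 seconds of overhead per additional node) are invented for illustration, not measurements.

```python
def estimated_time(nodes, compute_seconds=3600.0, overhead_per_node=5.0):
    """Toy model: evenly divided compute plus overhead for each extra node."""
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    compute = compute_seconds / nodes                # shrinks as nodes are added
    communication = overhead_per_node * (nodes - 1)  # grows as nodes are added
    return compute + communication


previous = None
for nodes in (1, 2, 4, 8, 16, 32, 64):
    seconds = estimated_time(nodes)
    note = ""
    if previous is not None:
        note = f"  ({previous / seconds:.2f}x vs previous row)"
    print(f"{nodes:>3} nodes: {seconds:>8.2f} s{note}")
    previous = seconds
```

In this model each doubling gains a little less than 2x, and past a certain size the communication term dominates, so going from 32 to 64 nodes makes the job slower. That is the sense in which doubling the number of machines does not always halve the computation time.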