Single Machine vs Cluster¶
When One Computer Isn't Enough¶
A single machine has hard limits. When you exceed them, you need multiple machines working together.
Single Machine Limits:
┌─────────────────────────────────────────────┐
│                One Computer                 │
│                                             │
│  CPU Cores:    Limited (4-128)              │
│  RAM:          Limited (8 GB - 2 TB)        │
│  Storage:      Limited (256 GB - 100 TB)    │
│  GPU Memory:   Limited (8-80 GB per GPU)    │
│  Availability: Single point of failure      │
│                                             │
└─────────────────────────────────────────────┘
Cluster of Machines:
┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐
│  Node 1  │  │  Node 2  │  │  Node 3  │  │  Node N  │
│ 64 cores │  │ 64 cores │  │ 64 cores │  │ 64 cores │
│  256 GB  │  │  256 GB  │  │  256 GB  │  │  256 GB  │
└────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘
     └─────────────┴─────────────┴─────────────┘
                  Network Fabric

Combined: 64N cores, 256N GB RAM, fault tolerant
Scaling Strategies¶
Vertical Scaling (Scale Up)¶
Get a bigger machine:
Vertical Scaling:

Before:            After:
┌──────────┐       ┌────────────────┐
│ 4 cores  │   →   │    64 cores    │
│  16 GB   │       │     512 GB     │
│  1 GPU   │       │     8 GPUs     │
└──────────┘       └────────────────┘
Pros:
+ Simple (no code changes)
+ No network overhead
+ Easy to manage
Cons:
- Hard limits exist
- Expensive at high end
- Single point of failure
Horizontal Scaling (Scale Out)¶
Add more machines:
Horizontal Scaling:

Before:            After:
┌──────────┐       ┌──────────┐  ┌──────────┐  ┌──────────┐
│ 4 cores  │   →   │ 4 cores  │  │ 4 cores  │  │ 4 cores  │
│  16 GB   │       │  16 GB   │  │  16 GB   │  │  16 GB   │
└──────────┘       └──────────┘  └──────────┘  └──────────┘
Pros:
+ Nearly unlimited scale
+ Fault tolerant
+ Cost-effective (commodity hardware)
Cons:
- Code complexity
- Network overhead
- Coordination challenges
When to Use a Cluster¶
Decision Framework¶
def need_cluster(
    data_size_gb,
    memory_required_gb,
    compute_hours_single,
    availability_requirement,
    local_machine_specs,
):
    """Determine whether a cluster is needed for a given workload."""
    reasons = []

    # Memory constraint: leave ~20% headroom for the OS and other processes
    if memory_required_gb > local_machine_specs['ram_gb'] * 0.8:
        reasons.append(f"Working set ({memory_required_gb} GB) exceeds RAM")

    # Time constraint: more than a day on one machine is usually too slow
    if compute_hours_single > 24:
        reasons.append(f"Computation too slow ({compute_hours_single}h)")

    # Availability constraint: four nines requires redundant machines
    if availability_requirement == '99.99%':
        reasons.append("High availability requires redundancy")

    # Storage constraint: same 20% headroom rule
    if data_size_gb > local_machine_specs['storage_gb'] * 0.8:
        reasons.append(f"Data ({data_size_gb} GB) exceeds storage capacity")

    if reasons:
        print("Cluster recommended:")
        for r in reasons:
            print(f"  - {r}")
        return True
    print("Single machine sufficient")
    return False

# Example: 500 GB working set, 40 h of compute, on a 256 GB RAM workstation
specs = {'ram_gb': 256, 'storage_gb': 4000}
need_cluster(500, 500, 40, '99.99%', specs)  # True: memory, time, availability
Common Thresholds¶
| Constraint | Single Machine Limit | Cluster Benefit |
|---|---|---|
| RAM | ~2 TB max | Aggregate memory |
| Compute | Hours/days | Parallel speedup |
| Storage | ~100 TB | Distributed storage |
| Availability | ~99% | Redundancy → 99.99%+ |
| Throughput | Limited I/O | Parallel I/O |
Cluster Architectures¶
Shared-Nothing Architecture¶
Each node is independent:
Shared-Nothing:
┌─────────────┐   ┌─────────────┐   ┌─────────────┐
│   Node 1    │   │   Node 2    │   │   Node 3    │
│ ┌─────────┐ │   │ ┌─────────┐ │   │ ┌─────────┐ │
│ │   CPU   │ │   │ │   CPU   │ │   │ │   CPU   │ │
│ │   RAM   │ │   │ │   RAM   │ │   │ │   RAM   │ │
│ │  Disk   │ │   │ │  Disk   │ │   │ │  Disk   │ │
│ └─────────┘ │   │ └─────────┘ │   │ └─────────┘ │
└──────┬──────┘   └──────┬──────┘   └──────┬──────┘
       └─────────────────┴─────────────────┘
                   Network Only
Examples: Hadoop, Spark, Cassandra
Pros: Scales well, fault tolerant
Cons: Data must be partitioned
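The "data must be partitioned" requirement can be sketched with simple hash partitioning. This is a toy illustration; production shared-nothing systems such as Cassandra use consistent hashing so that adding a node does not reshuffle every key.

```python
import zlib

def partition(key, num_nodes):
    """Route a record to a node by hashing its key.

    zlib.crc32 is used instead of the built-in hash(), which is
    randomized per process and so not stable across machines.
    """
    return zlib.crc32(key.encode()) % num_nodes

keys = ["user:1", "user:2", "user:3", "user:42"]
nodes = {n: [] for n in range(3)}
for key in keys:
    nodes[partition(key, 3)].append(key)

# Each key lives on exactly one node; a read for "user:42" goes
# straight to node partition("user:42", 3) with no cross-node search.
print(nodes)
```

Because every node can answer for its own partition without consulting the others, the only cross-node traffic is for queries that span partitions.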
Shared-Storage Architecture¶
Nodes share a storage layer:
Shared-Storage:
┌─────────────┐   ┌─────────────┐   ┌─────────────┐
│   Node 1    │   │   Node 2    │   │   Node 3    │
│ ┌─────────┐ │   │ ┌─────────┐ │   │ ┌─────────┐ │
│ │   CPU   │ │   │ │   CPU   │ │   │ │   CPU   │ │
│ │   RAM   │ │   │ │   RAM   │ │   │ │   RAM   │ │
│ └─────────┘ │   │ └─────────┘ │   │ └─────────┘ │
└──────┬──────┘   └──────┬──────┘   └──────┬──────┘
       └─────────────────┼─────────────────┘
                         │
               ┌─────────┴─────────┐
               │  Shared Storage   │
               │    (SAN / NFS)    │
               └───────────────────┘
Examples: Traditional databases, HPC
Pros: Simpler data management
Cons: Storage can be bottleneck
Data Parallelism¶
Partition Data Across Nodes¶
Data Parallelism (MapReduce pattern):

Original data:  [A B C D E F G H I J K L]

Partition:
    Node 1: [A B C D]
    Node 2: [E F G H]
    Node 3: [I J K L]

Process in parallel:
    Node 1: process([A B C D]) → result1
    Node 2: process([E F G H]) → result2
    Node 3: process([I J K L]) → result3

Combine:
    final_result = combine(result1, result2, result3)
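The same partition → process → combine flow in plain Python. This is a single-process sketch; on a real cluster each `process` call would run on a different node.

```python
data = list(range(12))  # stands in for [A B C D E F G H I J K L]

def process(chunk):
    # The per-node work: here, a partial sum over one partition
    return sum(chunk)

# Partition: contiguous chunks, one per node
chunks = [data[0:4], data[4:8], data[8:12]]

# "Parallel" phase: each call would execute on a separate node
partials = [process(c) for c in chunks]

# Combine: merge the partial results into the final answer
final_result = sum(partials)
print(final_result == sum(data))  # the split changes nothing: True
```

The pattern works whenever the combine step is associative (sums, counts, maxima); operations like medians need more care because partial results cannot simply be merged.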
Python with Dask¶
import dask.dataframe as dd
# Single machine pandas - limited by RAM
# import pandas as pd
# df = pd.read_csv('huge_file.csv') # Fails if > RAM
# Dask - works across cluster
df = dd.read_csv('huge_file.csv') # Lazy, partitioned
# Same pandas API, distributed execution
result = df.groupby('category').sum().compute()
Python with PySpark¶
from pyspark.sql import SparkSession

# Create a Spark session (connects to the cluster)
spark = SparkSession.builder \
    .appName("MyApp") \
    .master("spark://cluster:7077") \
    .getOrCreate()

# Load a distributed dataset; without header/inferSchema every column
# is read as a string and sum() would have nothing to aggregate
df = spark.read.csv("hdfs://cluster/huge_file.csv",
                    header=True, inferSchema=True)

# Operations are distributed across the cluster
result = df.groupBy("category").sum().collect()
Challenges of Distributed Computing¶
1. Network Overhead¶
Single machine:
    Memory access:  ~60 ns

Cluster:
    Network access: ~100,000 ns (100 μs)
                    ~1,000,000 ns (1 ms) across a datacenter

The network is roughly 1,700-17,000x slower than local memory, so the
first rule of distributed computing is to minimize data movement.
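A back-of-envelope model of when that overhead matters. The figures here are illustrative assumptions (100 μs round-trip latency, 10 Gbit/s links), and the model ignores stragglers and coordination cost.

```python
def distributed_speedup(compute_s, transfer_bytes, nodes,
                        bandwidth_bps=10e9, latency_s=100e-6):
    """Ideal speedup: perfect parallelism plus the cost of moving the data once."""
    network_s = latency_s + transfer_bytes * 8 / bandwidth_bps
    return compute_s / (compute_s / nodes + network_s)

# An hour of compute over 100 GB: data movement is amortized, scaling is good
print(distributed_speedup(3600, 100e9, 16))   # ~11.8x on 16 nodes

# One second of compute over the same 100 GB: shipping the data dominates
print(distributed_speedup(1, 100e9, 16))      # ~0.01x: slower than one machine
```

The ratio of compute time to data volume is what decides the question; a cluster helps compute-heavy jobs and can actively hurt data-heavy, compute-light ones.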
2. Partial Failures¶
Single Machine:
Either works or doesn't
Cluster:
┌──────────┐  ┌──────────┐  ┌──────────┐
│  Node 1  │  │  Node 2  │  │  Node 3  │
│    OK    │  │  FAILED  │  │    OK    │
└──────────┘  └──────────┘  └──────────┘
What happens to Node 2's work?
Need: Retry, redundancy, checkpointing
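A toy version of the retry-and-reassign strategy, with failures simulated by a seeded random generator. Real schedulers such as Spark's re-run lost tasks on healthy nodes, often restarting from a checkpoint rather than from scratch.

```python
import random

random.seed(0)

def run_on_node(task, node):
    """Simulate a node that fails ~30% of the time."""
    if random.random() < 0.3:
        raise RuntimeError(f"node {node} lost task {task}")
    return task * 2

def run_with_retries(task, nodes, max_attempts=10):
    for attempt in range(max_attempts):
        node = nodes[attempt % len(nodes)]  # reassign to another node on retry
        try:
            return run_on_node(task, node)
        except RuntimeError:
            continue  # the lost work is simply redone elsewhere
    raise RuntimeError(f"task {task} failed everywhere")

results = [run_with_retries(t, nodes=[1, 2, 3]) for t in range(6)]
print(results)  # every task completes despite simulated node failures
```

Note that this only works if tasks are safe to re-run, which is why idempotence (below) matters so much in distributed systems.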
3. Coordination¶
Problems that arise:
- Which node handles which data?
- How to synchronize results?
- What if nodes disagree?
- How to handle stragglers?
Solutions:
- Consensus protocols (Paxos, Raft)
- Distributed coordination (ZooKeeper)
- Idempotent operations
- Speculative execution
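Idempotent operations, one of the listed solutions, are the cheapest of these tools: an operation that leaves the same state however many times it is applied, so retries and duplicate messages are harmless. A minimal sketch using a deduplication set (the operation ids and amounts are made up):

```python
balance = 0
processed = set()

def apply_credit_once(op_id, amount):
    """Apply a credit at most once, keyed by a unique operation id."""
    global balance
    if op_id in processed:
        return  # duplicate delivery or retry: already applied, do nothing
    processed.add(op_id)
    balance += amount

apply_credit_once("txn-1", 50)
apply_credit_once("txn-1", 50)  # a network retry replays the message
apply_credit_once("txn-2", 25)
print(balance)  # 75, not 125: the replay had no effect
```

In a real system the `processed` set would itself live in durable, replicated storage; the principle is the same.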
Comparison Summary¶
| Aspect | Single Machine | Cluster |
|---|---|---|
| Setup | Simple | Complex |
| Scaling | Limited | Nearly unlimited |
| Latency | Nanoseconds | Microseconds-milliseconds |
| Fault tolerance | None | Built-in |
| Cost | Lower initially | Higher, but scales better |
| Code complexity | Simple | Distributed algorithms |
| Debugging | Easy | Hard |
Decision Checklist¶
Use Single Machine when:
□ Data fits in memory (with headroom)
□ Computation completes in acceptable time
□ Downtime is acceptable
□ Simpler is better
Use Cluster when:
□ Data exceeds single machine capacity
□ Need faster results (parallel speedup)
□ Require high availability
□ Workload is embarrassingly parallel
□ Already using distributed frameworks
Starting Point Recommendations¶
| Data Size | Recommended Approach |
|---|---|
| < 10 GB | Single machine, pandas |
| 10-100 GB | Single machine, chunked processing |
| 100 GB - 1 TB | Consider Dask (single machine or cluster) |
| 1-10 TB | Spark cluster |
| > 10 TB | Dedicated big data infrastructure |
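The "chunked processing" approach can be sketched with pandas' `chunksize` option, which streams a file in pieces instead of loading it whole. The tiny in-memory CSV here stands in for a file larger than RAM.

```python
import io
import pandas as pd

# Stand-in for a file too large to read at once
csv = io.StringIO("category,value\na,1\nb,2\na,3\nb,4\na,5\n")

totals = None
for chunk in pd.read_csv(csv, chunksize=2):  # only 2 rows in memory at a time
    part = chunk.groupby("category")["value"].sum()
    totals = part if totals is None else totals.add(part, fill_value=0)

print(totals.to_dict())  # per-category totals: a → 9, b → 6
```

This works for any aggregation whose partial results can be merged, which is exactly the property that later lets the same computation move to Dask or Spark unchanged.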