Single Machine vs Cluster¶
When One Computer Isn't Enough¶
A single machine has hard limits. When you exceed them, you need multiple machines working together.
Single Machine Limits:
┌─────────────────────────────────────────────┐
│                One Computer                 │
│                                             │
│  CPU Cores:    Limited (4-128)              │
│  RAM:          Limited (8 GB - 2 TB)        │
│  Storage:      Limited (256 GB - 100 TB)    │
│  GPU Memory:   Limited (8-80 GB per GPU)    │
│  Availability: Single point of failure      │
│                                             │
└─────────────────────────────────────────────┘
Cluster of Machines:
┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐
│  Node 1  │  │  Node 2  │  │  Node 3  │  │  Node N  │
│ 64 cores │  │ 64 cores │  │ 64 cores │  │ 64 cores │
│  256 GB  │  │  256 GB  │  │  256 GB  │  │  256 GB  │
└────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘
     └─────────────┴─────────────┴─────────────┘
                  Network Fabric

Combined: 64N cores, 256N GB RAM, fault tolerant
Scaling Strategies¶
Vertical Scaling (Scale Up)¶
Get a bigger machine:
Vertical Scaling:

Before:            After:
┌──────────┐       ┌────────────────┐
│ 4 cores  │   →   │    64 cores    │
│  16 GB   │       │     512 GB     │
│  1 GPU   │       │     8 GPUs     │
└──────────┘       └────────────────┘
Pros:
+ Simple (no code changes)
+ No network overhead
+ Easy to manage
Cons:
- Hard limits exist
- Expensive at high end
- Single point of failure
Horizontal Scaling (Scale Out)¶
Add more machines:
Horizontal Scaling:

Before:            After:
┌──────────┐       ┌──────────┐  ┌──────────┐  ┌──────────┐
│ 4 cores  │   →   │ 4 cores  │  │ 4 cores  │  │ 4 cores  │
│  16 GB   │       │  16 GB   │  │  16 GB   │  │  16 GB   │
└──────────┘       └──────────┘  └──────────┘  └──────────┘
Pros:
+ Nearly unlimited scale
+ Fault tolerant
+ Cost-effective (commodity hardware)
Cons:
- Code complexity
- Network overhead
- Coordination challenges
When to Use a Cluster¶
Decision Framework¶
def need_cluster(
    data_size_gb,
    memory_required_gb,
    compute_hours_single,
    availability_requirement,
    local_machine_specs,
):
    """Determine whether a cluster is needed for a given workload."""
    reasons = []

    # Memory constraint: leave ~20% headroom for the OS and other processes
    if memory_required_gb > local_machine_specs['ram_gb'] * 0.8:
        reasons.append(f"Working set ({memory_required_gb} GB) exceeds RAM")

    # Time constraint: more than a day on one machine is usually too slow
    if compute_hours_single > 24:
        reasons.append(f"Computation too slow ({compute_hours_single}h)")

    # Availability constraint: four nines requires redundant machines
    if availability_requirement == '99.99%':
        reasons.append("High availability requires redundancy")

    # Storage constraint: same 20% headroom rule
    if data_size_gb > local_machine_specs['storage_gb'] * 0.8:
        reasons.append(f"Data ({data_size_gb} GB) exceeds storage capacity")

    if reasons:
        print("Cluster recommended:")
        for r in reasons:
            print(f"  - {r}")
        return True
    print("Single machine sufficient")
    return False

# Example: 500 GB working set, 40 h of compute, on a 256 GB RAM workstation
specs = {'ram_gb': 256, 'storage_gb': 4000}
need_cluster(500, 500, 40, '99.99%', specs)  # True: memory, time, availability
Common Thresholds¶
| Constraint | Single Machine Limit | Cluster Benefit |
|---|---|---|
| RAM | ~2 TB max | Aggregate memory |
| Compute | Hours/days | Parallel speedup |
| Storage | ~100 TB | Distributed storage |
| Availability | ~99% | Redundancy → 99.99%+ |
| Throughput | Limited I/O | Parallel I/O |
Cluster Architectures¶
Shared-Nothing Architecture¶
Each node is independent:
Shared-Nothing:
┌─────────────┐   ┌─────────────┐   ┌─────────────┐
│   Node 1    │   │   Node 2    │   │   Node 3    │
│ ┌─────────┐ │   │ ┌─────────┐ │   │ ┌─────────┐ │
│ │   CPU   │ │   │ │   CPU   │ │   │ │   CPU   │ │
│ │   RAM   │ │   │ │   RAM   │ │   │ │   RAM   │ │
│ │  Disk   │ │   │ │  Disk   │ │   │ │  Disk   │ │
│ └─────────┘ │   │ └─────────┘ │   │ └─────────┘ │
└──────┬──────┘   └──────┬──────┘   └──────┬──────┘
       └─────────────────┴─────────────────┘
                   Network Only
Examples: Hadoop, Spark, Cassandra
Pros: Scales well, fault tolerant
Cons: Data must be partitioned
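The "data must be partitioned" requirement can be sketched with simple hash partitioning. This is a toy illustration; production shared-nothing systems such as Cassandra use consistent hashing so that adding a node does not reshuffle every key.

```python
import zlib

def partition(key, num_nodes):
    """Route a record to a node by hashing its key.

    zlib.crc32 is used instead of the built-in hash(), which is
    randomized per process and so not stable across machines.
    """
    return zlib.crc32(key.encode()) % num_nodes

keys = ["user:1", "user:2", "user:3", "user:42"]
nodes = {n: [] for n in range(3)}
for key in keys:
    nodes[partition(key, 3)].append(key)

# Each key lives on exactly one node; a read for "user:42" goes
# straight to node partition("user:42", 3) with no cross-node search.
print(nodes)
```

Because every node can answer for its own partition without consulting the others, the only cross-node traffic is for queries that span partitions.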
Shared-Storage Architecture¶
Nodes share a storage layer:
Shared-Storage:
┌─────────────┐   ┌─────────────┐   ┌─────────────┐
│   Node 1    │   │   Node 2    │   │   Node 3    │
│ ┌─────────┐ │   │ ┌─────────┐ │   │ ┌─────────┐ │
│ │   CPU   │ │   │ │   CPU   │ │   │ │   CPU   │ │
│ │   RAM   │ │   │ │   RAM   │ │   │ │   RAM   │ │
│ └─────────┘ │   │ └─────────┘ │   │ └─────────┘ │
└──────┬──────┘   └──────┬──────┘   └──────┬──────┘
       └─────────────────┼─────────────────┘
                         │
               ┌─────────┴─────────┐
               │  Shared Storage   │
               │    (SAN / NFS)    │
               └───────────────────┘
Examples: Traditional databases, HPC
Pros: Simpler data management
Cons: Storage can be bottleneck
Data Parallelism¶
Partition Data Across Nodes¶
Data Parallelism (MapReduce pattern):

Original data:  [A B C D E F G H I J K L]

Partition:
    Node 1: [A B C D]
    Node 2: [E F G H]
    Node 3: [I J K L]

Process in parallel:
    Node 1: process([A B C D]) → result1
    Node 2: process([E F G H]) → result2
    Node 3: process([I J K L]) → result3

Combine:
    final_result = combine(result1, result2, result3)
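The same partition → process → combine flow in plain Python. This is a single-process sketch; on a real cluster each `process` call would run on a different node.

```python
data = list(range(12))  # stands in for [A B C D E F G H I J K L]

def process(chunk):
    # The per-node work: here, a partial sum over one partition
    return sum(chunk)

# Partition: contiguous chunks, one per node
chunks = [data[0:4], data[4:8], data[8:12]]

# "Parallel" phase: each call would execute on a separate node
partials = [process(c) for c in chunks]

# Combine: merge the partial results into the final answer
final_result = sum(partials)
print(final_result == sum(data))  # the split changes nothing: True
```

The pattern works whenever the combine step is associative (sums, counts, maxima); operations like medians need more care because partial results cannot simply be merged.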
Python with Dask¶
import dask.dataframe as dd
# Single machine pandas - limited by RAM
# import pandas as pd
# df = pd.read_csv('huge_file.csv') # Fails if > RAM
# Dask - works across cluster
df = dd.read_csv('huge_file.csv') # Lazy, partitioned
# Same pandas API, distributed execution
result = df.groupby('category').sum().compute()
Python with PySpark¶
from pyspark.sql import SparkSession

# Create a Spark session (connects to the cluster)
spark = SparkSession.builder \
    .appName("MyApp") \
    .master("spark://cluster:7077") \
    .getOrCreate()

# Load a distributed dataset; without header/inferSchema every column
# is read as a string and sum() would have nothing to aggregate
df = spark.read.csv("hdfs://cluster/huge_file.csv",
                    header=True, inferSchema=True)

# Operations are distributed across the cluster
result = df.groupBy("category").sum().collect()
Challenges of Distributed Computing¶
1. Network Overhead¶
Single machine:
    Memory access:  ~60 ns

Cluster:
    Network access: ~100,000 ns (100 μs)
                    ~1,000,000 ns (1 ms) across a datacenter

The network is roughly 1,700-17,000x slower than local memory, so the
first rule of distributed computing is to minimize data movement.
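A back-of-envelope model of when that overhead matters. The figures here are illustrative assumptions (100 μs round-trip latency, 10 Gbit/s links), and the model ignores stragglers and coordination cost.

```python
def distributed_speedup(compute_s, transfer_bytes, nodes,
                        bandwidth_bps=10e9, latency_s=100e-6):
    """Ideal speedup: perfect parallelism plus the cost of moving the data once."""
    network_s = latency_s + transfer_bytes * 8 / bandwidth_bps
    return compute_s / (compute_s / nodes + network_s)

# An hour of compute over 100 GB: data movement is amortized, scaling is good
print(distributed_speedup(3600, 100e9, 16))   # ~11.8x on 16 nodes

# One second of compute over the same 100 GB: shipping the data dominates
print(distributed_speedup(1, 100e9, 16))      # ~0.01x: slower than one machine
```

The ratio of compute time to data volume is what decides the question; a cluster helps compute-heavy jobs and can actively hurt data-heavy, compute-light ones.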
2. Partial Failures¶
Single Machine:
Either works or doesn't
Cluster:
┌──────────┐  ┌──────────┐  ┌──────────┐
│  Node 1  │  │  Node 2  │  │  Node 3  │
│    OK    │  │  FAILED  │  │    OK    │
└──────────┘  └──────────┘  └──────────┘
What happens to Node 2's work?
Need: Retry, redundancy, checkpointing
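A toy version of the retry-and-reassign strategy, with failures simulated by a seeded random generator. Real schedulers such as Spark's re-run lost tasks on healthy nodes, often restarting from a checkpoint rather than from scratch.

```python
import random

random.seed(0)

def run_on_node(task, node):
    """Simulate a node that fails ~30% of the time."""
    if random.random() < 0.3:
        raise RuntimeError(f"node {node} lost task {task}")
    return task * 2

def run_with_retries(task, nodes, max_attempts=10):
    for attempt in range(max_attempts):
        node = nodes[attempt % len(nodes)]  # reassign to another node on retry
        try:
            return run_on_node(task, node)
        except RuntimeError:
            continue  # the lost work is simply redone elsewhere
    raise RuntimeError(f"task {task} failed everywhere")

results = [run_with_retries(t, nodes=[1, 2, 3]) for t in range(6)]
print(results)  # every task completes despite simulated node failures
```

Note that this only works if tasks are safe to re-run, which is why idempotence (below) matters so much in distributed systems.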
3. Coordination¶
Problems that arise:
- Which node handles which data?
- How to synchronize results?
- What if nodes disagree?
- How to handle stragglers?
Solutions:
- Consensus protocols (Paxos, Raft)
- Distributed coordination (ZooKeeper)
- Idempotent operations
- Speculative execution
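Idempotent operations, one of the listed solutions, are the cheapest of these tools: an operation that leaves the same state however many times it is applied, so retries and duplicate messages are harmless. A minimal sketch using a deduplication set (the operation ids and amounts are made up):

```python
balance = 0
processed = set()

def apply_credit_once(op_id, amount):
    """Apply a credit at most once, keyed by a unique operation id."""
    global balance
    if op_id in processed:
        return  # duplicate delivery or retry: already applied, do nothing
    processed.add(op_id)
    balance += amount

apply_credit_once("txn-1", 50)
apply_credit_once("txn-1", 50)  # a network retry replays the message
apply_credit_once("txn-2", 25)
print(balance)  # 75, not 125: the replay had no effect
```

In a real system the `processed` set would itself live in durable, replicated storage; the principle is the same.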
Comparison Summary¶
| Aspect | Single Machine | Cluster |
|---|---|---|
| Setup | Simple | Complex |
| Scaling | Limited | Nearly unlimited |
| Latency | Nanoseconds | Microseconds-milliseconds |
| Fault tolerance | None | Built-in |
| Cost | Lower initially | Higher, but scales better |
| Code complexity | Simple | Distributed algorithms |
| Debugging | Easy | Hard |
Decision Checklist¶
Use Single Machine when:
□ Data fits in memory (with headroom)
□ Computation completes in acceptable time
□ Downtime is acceptable
□ Simpler is better
Use Cluster when:
□ Data exceeds single machine capacity
□ Need faster results (parallel speedup)
□ Require high availability
□ Workload is embarrassingly parallel
□ Already using distributed frameworks
Starting Point Recommendations¶
| Data Size | Recommended Approach |
|---|---|
| < 10 GB | Single machine, pandas |
| 10-100 GB | Single machine, chunked processing |
| 100 GB - 1 TB | Consider Dask (single machine or cluster) |
| 1-10 TB | Spark cluster |
| > 10 TB | Dedicated big data infrastructure |
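The "chunked processing" approach can be sketched with pandas' `chunksize` option, which streams a file in pieces instead of loading it whole. The tiny in-memory CSV here stands in for a file larger than RAM.

```python
import io
import pandas as pd

# Stand-in for a file too large to read at once
csv = io.StringIO("category,value\na,1\nb,2\na,3\nb,4\na,5\n")

totals = None
for chunk in pd.read_csv(csv, chunksize=2):  # only 2 rows in memory at a time
    part = chunk.groupby("category")["value"].sum()
    totals = part if totals is None else totals.add(part, fill_value=0)

print(totals.to_dict())  # per-category totals: a → 9, b → 6
```

This works for any aggregation whose partial results can be merged, which is exactly the property that later lets the same computation move to Dask or Spark unchanged.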