Storage (SSD and HDD)

Storage devices provide persistent data storage for computers. Unlike RAM, which loses its contents when power is removed, storage retains data without power.

However, storage is much slower than main memory. Accessing data from disk may take thousands to hundreds of thousands of times longer than accessing RAM.

Because of this large performance gap, the choice of storage technology and data format can significantly affect the performance of data-intensive programs.

For many Python workloads, especially in data science and machine learning, loading data from storage can dominate total runtime.

1. Persistent Storage

Storage devices retain data even when power is turned off. They store:

operating systems
application programs
databases
documents and datasets
backups and archives

When a program starts, its code and data must be loaded from storage into RAM before execution.

Data movement in a program

flowchart LR
    Storage[SSD / HDD] --> RAM
    RAM --> CPU

The CPU cannot execute programs directly from disk; data must first be loaded into memory.

2. Hard Disk Drives (HDD)

Hard disk drives store data using magnetic recording on spinning disks.

Inside an HDD are:

rotating magnetic platters
a spindle motor
read/write heads
an actuator arm

How HDDs work

Data is stored magnetically on circular tracks on each platter.

To read data:

the actuator moves the read/write head to the correct track (the seek)
the platter rotates until the target sector passes under the head
the head reads the magnetic pattern as data

HDD structure

flowchart TD
    A[Spinning platter] --> B[Track]
    B --> C[Sector]
    D[Read/write head] --> B

HDD performance characteristics

Because HDDs rely on mechanical movement, they have relatively high latency.

Typical performance:

Metric	Value
Latency	5–15 ms
Sequential bandwidth	100–200 MB/s
Random IOPS	50–200

(IOPS = input/output operations per second)

Random reads are slow because each one requires the head to seek to a new track and then wait for the platter to rotate the target sector under it.
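The IOPS range in the table can be roughly reproduced from the mechanics. The sketch below assumes a 7,200 RPM drive with an 8 ms average seek time; both values are illustrative assumptions, not figures from the table.

```python
def hdd_random_iops(rpm: float, avg_seek_ms: float) -> float:
    """Estimate random IOPS from seek time plus average rotational latency."""
    ms_per_rotation = 60_000 / rpm
    # On average the target sector is half a rotation away from the head.
    avg_rotational_ms = ms_per_rotation / 2
    access_ms = avg_seek_ms + avg_rotational_ms
    return 1000 / access_ms


# Assumed example drive: 7,200 RPM, 8 ms average seek.
iops = hdd_random_iops(rpm=7200, avg_seek_ms=8.0)
print(f"{iops:.0f} random IOPS")
```

With these assumptions each access costs about 12 ms, which gives roughly 80 operations per second, inside the 50–200 range above.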

3. Solid-State Drives (SSD)

Solid-state drives store data using NAND flash memory rather than magnetic disks.

Because SSDs have no moving parts, they avoid seek and rotational delays and are much faster than HDDs, especially for random access.

How SSDs store data

Flash memory cells store data as electrical charge trapped in a transistor (a floating gate or, in many newer designs, a charge-trap layer).

Each cell can represent one or more bits depending on the technology:

Type	Bits per cell
SLC	1
MLC	2
TLC	3
QLC	4

Higher density increases capacity but generally reduces write performance and endurance.
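One way to see the trade-off: a cell that stores n bits must distinguish 2^n charge levels, so each extra bit halves the margin between neighbouring levels. A small sketch using the cell types from the table:

```python
cell_types = {"SLC": 1, "MLC": 2, "TLC": 3, "QLC": 4}

# A cell holding n bits must distinguish 2**n distinct charge levels.
levels = {name: 2 ** bits for name, bits in cell_types.items()}

for name, n in levels.items():
    print(f"{name}: {cell_types[name]} bit(s) per cell -> {n} charge levels")
```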

SSD memory structure

flowchart LR
    A[Flash cell] --> B[Flash page]
    B --> C[Flash block]
    C --> D[SSD controller]

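Flash memory is written in pages but erased in blocks. A page cannot be overwritten in place; to reuse it, the whole block containing it must be erased first, which complicates memory management.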

4. Flash Translation Layer (FTL)

SSDs use a firmware layer, running on the SSD controller, called the Flash Translation Layer (FTL).

The FTL maps logical block addresses (LBAs) used by the operating system to physical flash memory locations.

The FTL also handles:

wear leveling
garbage collection
bad block management
error correction

FTL mapping process

flowchart LR
    OS --> Logical_Block
    Logical_Block --> FTL
    FTL --> Physical_Flash_Page

This translation layer allows SSDs to behave like traditional disks while hiding flash-specific complexity.
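The mapping idea can be illustrated with a toy model. The `ToyFTL` class below is hypothetical and greatly simplified: it only shows logical-to-physical mapping with out-of-place writes, and omits wear leveling, garbage collection, and error correction.

```python
class ToyFTL:
    """Toy logical-to-physical page map with out-of-place writes (illustrative only)."""

    def __init__(self, num_pages: int) -> None:
        self.flash = [None] * num_pages   # physical pages
        self.mapping = {}                 # LBA -> physical page index
        self.invalid = set()              # stale pages awaiting garbage collection
        self.next_free = 0                # next never-written physical page

    def write(self, lba: int, data: bytes) -> None:
        if self.next_free >= len(self.flash):
            raise RuntimeError("no free pages; a real FTL would garbage-collect here")
        # A flash page cannot be overwritten in place, so the old copy is
        # marked stale and the new data goes to a fresh page.
        if lba in self.mapping:
            self.invalid.add(self.mapping[lba])
        self.flash[self.next_free] = data
        self.mapping[lba] = self.next_free
        self.next_free += 1

    def read(self, lba: int) -> bytes:
        return self.flash[self.mapping[lba]]


ftl = ToyFTL(num_pages=8)
ftl.write(5, b"v1")
ftl.write(5, b"v2")  # same logical block, new physical page
print(ftl.read(5), ftl.mapping, ftl.invalid)
```

After the second write, logical block 5 points at physical page 1 and page 0 is marked stale. The operating system only ever sees block 5.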

5. SATA vs NVMe

SSDs can connect to the system using different interfaces.

SATA SSD

SATA SSDs use the same interface originally designed for hard drives.

Typical characteristics:

Metric	Value
Latency	80–200 µs
Bandwidth	~500 MB/s
Random IOPS	~90,000

SATA bandwidth is limited by the interface, not the flash: a SATA III link runs at 6 Gbit/s, which caps throughput at about 600 MB/s.
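The ceiling can be derived from the link rate. This sketch assumes SATA III (6 Gbit/s signalling with 8b/10b encoding, so 10 bits travel on the wire for every 8 bits of data):

```python
line_rate_bits_per_s = 6_000_000_000   # SATA III signalling rate

# 8b/10b encoding: only 8 of every 10 transmitted bits carry data.
payload_bits_per_s = line_rate_bits_per_s * 8 // 10
payload_bytes_per_s = payload_bits_per_s // 8

print(f"{payload_bytes_per_s // 1_000_000} MB/s theoretical maximum")
```

Command and framing overhead reduce the achievable rate further, which is consistent with the ~500 MB/s figure in the table.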

NVMe SSD

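NVMe (Non-Volatile Memory Express) is a protocol designed for flash storage. NVMe SSDs attach over PCI Express (PCIe) lanes, either directly to the CPU or through the chipset.

This removes the SATA link bottleneck, and NVMe's many deep command queues allow far more requests to be in flight at once.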

Typical characteristics:

Metric	Value
Latency	20–100 µs
Bandwidth	3–7 GB/s
Random IOPS	500,000–1,000,000

Storage interface comparison

flowchart LR
    CPU --> PCIe
    PCIe --> NVMe_SSD

    CPU --> SATA_Controller
    SATA_Controller --> SATA_SSD

NVMe SSDs provide dramatically higher throughput and lower latency than SATA SSDs.

6. Sequential vs Random Access

Storage performance depends heavily on access patterns.

Sequential access

Sequential access reads data in order.

Example:

read bytes 0 → 1 MB

This allows the device to stream data efficiently.

Random access

Random access reads data from many different locations.

Example:

read byte 0
read byte 1,000,000
read byte 42

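On HDDs, random access is slow because every request incurs a seek and a rotational delay. On SSDs the penalty is smaller, but random access still defeats read-ahead (prefetching) and issues many small requests instead of a few large ones.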
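A minimal way to compare the two patterns is to read the same blocks of one file in order and in shuffled order. This is a sketch, not a rigorous benchmark: the 8 MiB scratch file and 4 KiB block size are arbitrary choices, and because the file was just written it is likely to be served from RAM, so the measured gap may be small.

```python
import os
import random
import tempfile
import time

BLOCK = 4096
NUM_BLOCKS = 2048  # 8 MiB scratch file (arbitrary size for illustration)

with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(os.urandom(BLOCK * NUM_BLOCKS))
    path = f.name


def read_blocks(order):
    """Read one BLOCK-sized chunk at each block index in `order`."""
    total = 0
    start = time.perf_counter()
    with open(path, "rb", buffering=0) as fh:  # unbuffered at the Python level
        for i in order:
            fh.seek(i * BLOCK)
            total += len(fh.read(BLOCK))
    return total, time.perf_counter() - start


sequential = list(range(NUM_BLOCKS))
shuffled = random.sample(sequential, k=NUM_BLOCKS)

seq_bytes, seq_s = read_blocks(sequential)
rnd_bytes, rnd_s = read_blocks(shuffled)
os.remove(path)

print(f"sequential: {seq_s:.4f} s, random: {rnd_s:.4f} s")
```

Both runs read exactly the same bytes; only the order differs.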

Access pattern visualization

flowchart LR
    A[Sequential read] --> B[High throughput]
    C[Random read] --> D[Lower performance]

7. File Formats and Data Loading

File format strongly affects performance when loading data.

CSV (text format)

CSV files store data as plain text.

Example:

42,3.14,hello

CSV disadvantages:

large file size
expensive parsing
no type information

Parquet (binary columnar format)

Parquet stores data in a binary column-oriented format.

Advantages:

smaller file sizes
faster loading
efficient compression
column-based reading

File format comparison

Format	Type	Speed
CSV	text	slow
Parquet	binary	fast
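The size and parsing costs of text can be seen without any Parquet library. The sketch below uses only the standard library to encode the same numbers as text and as raw 8-byte doubles. It illustrates text versus binary only; Parquet adds columnar layout and compression on top of this.

```python
import array
import random

values = [random.random() for _ in range(100_000)]

# Text encoding: each number becomes a decimal string plus a separator.
text_bytes = "\n".join(repr(v) for v in values).encode("ascii")

# Binary encoding: each number is a fixed 8-byte IEEE 754 double.
binary_bytes = array.array("d", values).tobytes()

# Decoding text means parsing every number; decoding binary is a bulk copy.
from_text = [float(s) for s in text_bytes.decode("ascii").split("\n")]
from_binary = array.array("d")
from_binary.frombytes(binary_bytes)

print(f"text: {len(text_bytes):,} bytes, binary: {len(binary_bytes):,} bytes")
```

The binary form is always 800,000 bytes here, while the text form is typically more than twice that.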

8. Python Data Loading

Example comparing CSV and Parquet loading speeds.

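import time

import pandas as pd

# Both files are assumed to contain the same table.
start = time.perf_counter()
df_csv = pd.read_csv("large_data.csv")  # text: every field must be parsed
print(f"CSV:     {time.perf_counter() - start:.2f} s")

start = time.perf_counter()
df_parquet = pd.read_parquet("large_data.parquet")  # binary, typed columns
print(f"Parquet: {time.perf_counter() - start:.2f} s")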

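Note that pd.read_parquet requires a Parquet engine such as pyarrow or fastparquet. Binary formats often load several times faster than CSV (5–10× is common), though the exact ratio depends on the data types, compression, and hardware.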

9. OS Page Cache

Operating systems keep recently accessed disk data in otherwise unused RAM.

This is called the page cache.

When a program reads a file:

the OS copies the requested parts of the file from disk into RAM
subsequent reads may come directly from RAM

Page cache behavior

flowchart LR
    Disk --> Page_Cache
    Page_Cache --> Program

Because of this caching, repeated reads may appear much faster than the actual disk speed.

To benchmark the disk itself rather than the cache, use files larger than available RAM, drop the cache between runs, or bypass it with direct I/O.
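The effect can be observed by timing two consecutive reads of the same file. This is a hedged sketch: the 16 MiB scratch file is an arbitrary choice, and the `os.posix_fadvise` call is only a hint available on some Unix systems, so the first read may still come from the cache on a given machine.

```python
import os
import tempfile
import time

SIZE = 16 * 1024 * 1024  # 16 MiB scratch file (arbitrary size for illustration)

with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(os.urandom(SIZE))
    f.flush()
    os.fsync(f.fileno())  # make sure the data has reached the device
    if hasattr(os, "posix_fadvise"):
        # Unix only: ask the kernel to drop this file's cached pages (a hint).
        os.posix_fadvise(f.fileno(), 0, 0, os.POSIX_FADV_DONTNEED)
    path = f.name


def timed_read(p):
    """Read the whole file and return (bytes read, elapsed seconds)."""
    start = time.perf_counter()
    with open(p, "rb") as fh:
        n = len(fh.read())
    return n, time.perf_counter() - start


n1, first = timed_read(path)   # may have to go to the device
n2, second = timed_read(path)  # usually served from the page cache
os.remove(path)

print(f"first read: {first:.3f} s, second read: {second:.3f} s")
```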

10. Processing Data Larger Than RAM

Large datasets may exceed available memory.

Two common approaches allow programs to handle such data.

Chunked processing

Process the file in smaller pieces.

Example:

import pandas as pd

total = 0

for chunk in pd.read_csv("huge.csv", chunksize=100_000):
    total += chunk["value"].sum()

print(total)

This approach loads only a portion of the data at a time, so peak memory use depends on the chunk size rather than the file size.
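A sum can be accumulated chunk by chunk directly, but some aggregates need extra state. A mean, for example, requires carrying both a running sum and a running count, because averaging per-chunk means gives the wrong answer when chunks differ in size. The sketch below uses only the standard library. The `iter_chunks` helper is hypothetical, and an in-memory string stands in for a large file.

```python
import csv
import io
from itertools import islice


def iter_chunks(reader, chunk_size):
    """Yield lists of at most `chunk_size` rows from a csv reader."""
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


# In-memory stand-in for a large file; a real program would pass
# open("huge.csv", newline="") instead.
source = io.StringIO("value\n" + "\n".join(str(i) for i in range(1, 11)))
reader = csv.DictReader(source)

total, count = 0.0, 0
for chunk in iter_chunks(reader, chunk_size=3):
    total += sum(float(row["value"]) for row in chunk)
    count += len(chunk)

mean = total / count
print(mean)
```

The last chunk holds a single row, which is why the running count matters.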

Memory-mapped files

Memory mapping lets a program access a file on disk as if it were an array in memory.

Example:

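import numpy as np

# mode="w+" creates the file, or overwrites an existing one, at full size
# (100_000_000 float64 values = 800 MB). Use mode="r" or "r+" for existing data.
arr = np.memmap(
    "huge.dat",
    dtype="float64",
    mode="w+",
    shape=(100_000_000,),
)

arr[0] = 3.14
arr.flush()  # write modified pages back to the file
print(arr[0])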

The OS automatically loads pages of the file into RAM when accessed.
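The same mechanism is available without NumPy through the standard library's `mmap` module. The sketch below creates a scratch file of zeros, writes one value at an arbitrary position, and then reads just that value through a read-only mapping. The file size and the index 123,456 are illustrative choices.

```python
import mmap
import os
import struct
import tempfile

N = 1_000_000                  # number of float64 slots in the scratch file
ITEM = struct.calcsize("<d")   # 8 bytes per value
INDEX = 123_456                # arbitrary position to write and read back

with tempfile.NamedTemporaryFile(delete=False) as f:
    f.truncate(N * ITEM)       # zero-filled file of the full size
    f.seek(INDEX * ITEM)
    f.write(struct.pack("<d", 3.14))
    path = f.name

with open(path, "rb") as fh, mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mm:
    # Only the page(s) containing this offset need to be brought into RAM.
    (value,) = struct.unpack_from("<d", mm, INDEX * ITEM)

os.remove(path)
print(value)
```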

Memory mapping visualization

flowchart LR
    Disk_File --> OS
    OS --> RAM_Page
    RAM_Page --> Program

This technique allows programs to work with datasets larger than physical memory.

11. Worked Examples

Example 1

Compare latency:

Device	Latency
RAM	~100 ns
SSD	~100 µs
HDD	~10 ms

By these figures, an SSD access is about 1,000× slower than RAM, and an HDD access about 100,000× slower.
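The ratios can be checked directly from the table by expressing every latency in nanoseconds:

```python
# Latencies from the table above, all converted to nanoseconds.
latency_ns = {"RAM": 100, "SSD": 100_000, "HDD": 10_000_000}

slowdown = {dev: ns // latency_ns["RAM"] for dev, ns in latency_ns.items()}

for dev, factor in slowdown.items():
    print(f"{dev}: {factor:,}x RAM latency")
```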

Example 2

How long would it take to read 10 GB sequentially from a 5 GB/s NVMe SSD?

\[ \frac{10 \text{ GB}}{5 \text{ GB/s}} = 2 \text{ seconds} \]
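The same calculation can be repeated for the other device classes. The bandwidths below are assumptions picked from the ranges quoted earlier (150 MB/s for an HDD, 500 MB/s for a SATA SSD); real sustained rates vary by drive.

```python
def transfer_seconds(size_gb: float, bandwidth_gb_per_s: float) -> float:
    """Time to stream `size_gb` at a constant sequential bandwidth."""
    return size_gb / bandwidth_gb_per_s


# Assumed sequential bandwidths in GB/s, taken from the ranges quoted earlier.
devices = {"HDD": 0.15, "SATA SSD": 0.5, "NVMe SSD": 5.0}

times = {name: transfer_seconds(10, bw) for name, bw in devices.items()}

for name, seconds in times.items():
    print(f"{name}: {seconds:.1f} s to read 10 GB")
```

Under these assumptions the HDD needs over a minute, the SATA SSD 20 seconds, and the NVMe SSD 2 seconds.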

Example 3

Explain why Parquet loads faster than CSV.

Binary columnar storage reduces both file size and parsing overhead.

12. Exercises

What is the difference between volatile and non-volatile memory?
How do HDDs store data?
Why are SSDs faster than HDDs?
What is the Flash Translation Layer?
What is the difference between SATA and NVMe?
What is sequential access?
Why do binary file formats load faster than CSV?
What is the OS page cache?

13. Short Answers

Volatile memory loses its data without power; non-volatile memory retains it
Magnetic recording on spinning platters
No mechanical movement
Logical-to-physical flash mapping layer
NVMe uses PCIe; SATA uses an older interface designed for hard drives
Reading data in order
Less parsing and smaller files
RAM cache for recently accessed disk data

14. Summary

Storage devices provide persistent data storage.
HDDs use spinning magnetic disks and have high latency.
SSDs use flash memory and are much faster.
NVMe SSDs connect via PCIe and provide the highest bandwidth.
Storage performance depends heavily on access patterns.
Binary file formats such as Parquet load much faster than text formats like CSV.
The OS page cache stores frequently accessed disk data in RAM.
Techniques such as chunked reading and memory mapping allow programs to process datasets larger than available memory.

Understanding storage performance is crucial for designing efficient data pipelines and large-scale numerical workflows.