The von Neumann Architecture¶

The von Neumann architecture describes a model of a stored-program computer in which program instructions and data share the same memory system.

In this design, programs are stored in memory as binary instructions. The processor repeatedly fetches these instructions, interprets them, and executes them.

Although modern processors contain far more sophisticated microarchitectures—including pipelines, caches, vector units, and out-of-order execution—the von Neumann model remains the conceptual foundation of general-purpose computing.

A computer following this architecture consists of four main components:

Central Processing Unit (CPU)
Main memory
Input/Output devices
Communication system (bus)

flowchart TD
    CPU[CPU]
    BUS[System Bus]
    MEM[Memory]
    IO[I/O Devices]

    CPU <--> BUS
    BUS <--> MEM
    BUS <--> IO

The defining feature of this architecture is the stored-program model: program instructions and data reside in the same memory as binary values. The processor's instruction set architecture (ISA) defines how instruction words are encoded and how the data types it operates on are represented.

1. The Stored-Program Concept¶

Before stored-program computers existed, early machines were programmed by physically rewiring circuits or configuring plugboards.

The stored-program concept introduced a major innovation:

Programs are stored as data in memory
The CPU reads instructions from memory
Programs can be modified dynamically

This allows software to be loaded, changed, and executed without altering hardware.
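The idea that a program is just data in memory can be sketched in a few lines. The 16-bit instruction format, the opcode numbers, and the names `encode` and `decode` below are hypothetical and chosen only for illustration; real ISAs define their own encodings.

```python
# Hypothetical 16-bit format: high 8 bits = opcode, low 8 bits = operand.
OPCODES = {"HALT": 0, "LOAD": 1, "ADD": 2, "STORE": 3}
NAMES = {code: name for name, code in OPCODES.items()}


def encode(op, operand=0):
    """Pack an opcode name and an 8-bit operand into one integer word."""
    return (OPCODES[op] << 8) | (operand & 0xFF)


def decode(word):
    """Unpack an integer word into (opcode name, operand)."""
    return NAMES[word >> 8], word & 0xFF


# One flat memory holds the program (addresses 0-3) and its data (addresses 4-6).
memory = [
    encode("LOAD", 4),
    encode("ADD", 5),
    encode("STORE", 6),
    encode("HALT"),
    20, 22, 0,
]

print(memory)                           # instructions and data are all plain integers
print([decode(w) for w in memory[:4]])  # the same words, interpreted as instructions
```

Nothing in the memory list distinguishes an instruction from a data value; only the way the processor uses a word decides how it is interpreted.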

2. The Central Processing Unit¶

The CPU executes instructions and performs computations.

A simplified conceptual model of the CPU includes three components.

Component	Function
Control Unit (CU)	Fetches and decodes instructions
Arithmetic Logic Unit (ALU)	Performs arithmetic and logical operations
Registers	Small, fast storage inside the CPU

Although real CPUs include many more components, this model captures the essential structure.

Registers¶

Registers are the fastest storage locations accessible to the CPU.

They hold:

temporary values
instruction operands
addresses
intermediate results

Two particularly important registers are:

Register	Purpose
Program Counter (PC)	Address of the next instruction
Instruction Register (IR)	Instruction currently being executed

The Program Counter normally advances to the next sequential instruction; jumps and taken branches overwrite it with a target address.

CPU operation visualization¶

flowchart LR
    A[Registers] --> B[ALU]
    B --> C[Registers]

Operations occur primarily on data stored in registers.

3. Memory¶

Main memory stores both:

program instructions
program data

Each location in memory has a unique numeric address.

Example layout (each instruction occupies 4 bytes here):

Address      Content
0x1000       instruction
0x1004       instruction
0x1008       instruction

When executing a program, the CPU repeatedly reads instructions from memory.
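A byte-addressed memory like the layout above can be modeled with a `bytearray`. This is a minimal sketch that assumes 4-byte little-endian words and a region starting at address `0x1000`; the helper names `store_word` and `load_word` and the stored values are hypothetical.

```python
BASE = 0x1000            # assumed address of the first byte in this region
memory = bytearray(64)   # 64 bytes covering addresses 0x1000 to 0x103F


def store_word(addr, value):
    """Write a 4-byte little-endian word at the given address."""
    offset = addr - BASE
    memory[offset:offset + 4] = value.to_bytes(4, "little")


def load_word(addr):
    """Read the 4-byte little-endian word stored at the given address."""
    offset = addr - BASE
    return int.from_bytes(memory[offset:offset + 4], "little")


# Place three placeholder instruction words at consecutive 4-byte addresses.
for i, word in enumerate([0xAAAA0001, 0xAAAA0002, 0xAAAA0003]):
    store_word(BASE + 4 * i, word)

for addr in range(BASE, BASE + 12, 4):
    print(hex(addr), hex(load_word(addr)))
```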

4. The System Bus¶

The system bus connects the CPU, memory, and I/O devices.

Although modern systems use more advanced interconnects, the bus model remains a useful abstraction.

The bus consists of three groups of signals.

Address Bus¶

Specifies which memory location is accessed.

CPU ───── Address Bus ─────▶ Memory

In this simplified model the address bus is unidirectional: the CPU drives it and memory responds.

Data Bus¶

Transfers actual data values between components.

CPU ◀──── Data Bus ────▶ Memory

This bus is bidirectional.

Control Bus¶

Coordinates system operations.

Control signals include:

read
write
interrupt
clock
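The interplay of the three signal groups can be sketched as a single function call: the control signal selects the operation, the address selects the location, and the data value travels in one direction or the other. The class `ToyBus` and its method names are hypothetical, and the model ignores timing, interrupts, and the clock.

```python
class ToyBus:
    """Hypothetical model of one shared bus in front of a small memory."""

    def __init__(self, size):
        self.memory = [0] * size
        self.log = []  # record of (control, address) pairs seen on the bus

    def transfer(self, control, address, data=None):
        """Perform one bus transaction selected by the control signal."""
        if not 0 <= address < len(self.memory):
            raise IndexError(f"address {address} out of range")
        self.log.append((control, address))
        if control == "read":
            return self.memory[address]   # data travels memory -> CPU
        if control == "write":
            self.memory[address] = data   # data travels CPU -> memory
            return None
        raise ValueError(f"unknown control signal {control!r}")


bus = ToyBus(size=16)
bus.transfer("write", 3, data=99)
value = bus.transfer("read", 3)
print(value, bus.log)
```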

5. The Instruction Cycle¶

Processors execute programs through a repeating loop called the instruction cycle.

Fetch → Decode → Execute → Writeback

1. Fetch¶

The CPU reads the instruction located at the address stored in the Program Counter.

The address is placed on the address bus and the instruction is returned through the data bus.

2. Decode¶

The Control Unit interprets the instruction and determines which operation must be performed.

3. Execute¶

The CPU performs the operation.

Examples include:

arithmetic operations
logical operations
memory loads or stores
branch instructions

4. Writeback¶

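The result of the operation is written to its destination, which is either:

a register
a memory location

5. Repeat¶

The Program Counter now holds the address of the next instruction (the following one, or the branch target if a branch was taken), and the cycle begins again.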

Instruction cycle visualization¶

flowchart LR
    A[Fetch] --> B[Decode]
    B --> C[Execute]
    C --> D[Writeback]
    D --> A

This loop continues until the program terminates.
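The cycle can be sketched as a small interpreter loop. This is a toy accumulator machine, not a model of any real processor: the opcodes, the single `acc` register, and the tuple-based instruction format are assumptions made for readability.

```python
def run(memory):
    """Run a toy accumulator machine until it reaches HALT."""
    pc = 0    # Program Counter: address of the next instruction
    acc = 0   # a single accumulator register
    while True:
        ir = memory[pc]        # Fetch: read the instruction at PC into the IR
        pc += 1                # PC now points at the next sequential instruction
        op, operand = ir       # Decode: split into operation and operand
        if op == "LOAD":       # Execute, then write the result to the register
            acc = memory[operand]
        elif op == "ADD":
            acc = acc + memory[operand]
        elif op == "STORE":    # Writeback to memory
            memory[operand] = acc
        elif op == "JUMP":     # A jump overwrites the PC
            pc = operand
        elif op == "HALT":
            return acc
        else:
            raise ValueError(f"unknown opcode {op!r}")


# Program at addresses 0-3, data at addresses 4-6, all in one memory.
memory = [
    ("LOAD", 4), ("ADD", 5), ("STORE", 6), ("HALT", 0),
    20, 22, 0,
]
result = run(memory)
print(result, memory[6])
```

The program loads 20, adds 22, and stores the sum at address 6, so both `result` and `memory[6]` end up as 42.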

6. Instruction Pipelining¶

Modern processors rarely execute instructions strictly one at a time.

Instead, they use pipelining, which overlaps the execution of multiple instructions: while one instruction executes, the next is being decoded and a third is being fetched.

Example pipeline stages:

Fetch → Decode → Execute → Writeback

Pipeline behavior:

Cycle   Stage1  Stage2  Stage3  Stage4
1       I1
2       I2      I1
3       I3      I2      I1
4       I4      I3      I2      I1

Once the pipeline fills, the processor can complete roughly one instruction per cycle, provided no hazards stall it.
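The table above can be generated programmatically. The sketch below assumes an ideal pipeline with no stalls, in which instruction i enters stage s in cycle i + s; the function name `pipeline_schedule` is hypothetical.

```python
def pipeline_schedule(n_instructions, n_stages):
    """Return one row per cycle listing which instruction occupies each stage."""
    total_cycles = n_instructions + n_stages - 1
    rows = []
    for cycle in range(1, total_cycles + 1):
        row = []
        for stage in range(n_stages):
            instr = cycle - stage  # 1-based index of the instruction in this stage
            row.append(f"I{instr}" if 1 <= instr <= n_instructions else "")
        rows.append(row)
    return rows


rows = pipeline_schedule(n_instructions=4, n_stages=4)
for cycle, row in enumerate(rows, start=1):
    print(cycle, *(f"{cell:<4}" for cell in row))

pipelined_cycles = len(rows)   # n + k - 1 cycles with overlap
unpipelined_cycles = 4 * 4     # n * k cycles if each instruction runs alone
print(pipelined_cycles, unpipelined_cycles)
```

With four instructions and four stages the ideal pipeline needs 7 cycles instead of 16. As the number of instructions grows, the cost approaches one cycle per instruction.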

Pipeline hazards¶

Pipelining introduces several types of hazards.

Hazard	Description
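Data hazard	an instruction needs a result that an earlier instruction has not yet produced
Control hazard	the next instruction to fetch is unknown until a branch resolves
Structural hazard	two instructions need the same hardware resource in the same cycle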

Modern processors mitigate these issues using:

branch prediction
out-of-order execution
speculative execution
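As an illustration of a data hazard, the following sketch detects read-after-write dependences between nearby instructions. The instruction representation (a destination register plus a tuple of source registers) and the `window` parameter are simplifying assumptions; real hardware performs this check in the pipeline's hazard-detection logic.

```python
# Each instruction is (destination register, source registers).
program = [
    ("r1", ("r2", "r3")),  # r1 = r2 + r3
    ("r4", ("r1", "r5")),  # r4 = r1 + r5, reads r1 right after it is written
    ("r6", ("r7", "r8")),  # independent of the others
]


def raw_hazards(program, window=1):
    """Find (producer, consumer, register) triples within `window` instructions."""
    hazards = []
    for i, (dest, _sources) in enumerate(program):
        last = min(i + 1 + window, len(program))
        for j in range(i + 1, last):
            if dest in program[j][1]:
                hazards.append((i, j, dest))
    return hazards


print(raw_hazards(program))
```

Here instruction 1 reads `r1` immediately after instruction 0 writes it, so the pair is reported as a hazard.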

7. Memory Hierarchy¶

Memory systems are organized as a hierarchy of storage layers.

Smaller, faster memories are located close to the CPU, while larger memories are farther away.

flowchart TD
    CPU
    R[Registers]
    L1[L1 Cache]
    L2[L2 Cache]
    L3[L3 Cache]
    RAM[RAM]
    SSD[SSD]

    CPU <--> R
    R <--> L1
    L1 <--> L2
    L2 <--> L3
    L3 <--> RAM
    RAM <--> SSD

Typical latencies (order-of-magnitude figures that vary by processor):

Layer	Approximate Latency
Register	~1 cycle
L1 cache	~4 cycles
L2 cache	~12 cycles
L3 cache	~40–70 cycles
RAM	~100–300 cycles

Because main memory is slow relative to the CPU, programs run fast only when most of their accesses are served by the caches, which requires exploiting memory locality.
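The effect of the hierarchy can be estimated with the standard average memory access time calculation, in which each level contributes its hit latency plus its miss rate times the cost of going one level further out. The latencies below are taken from the ranges in the table; the miss rates are made-up values for illustration only.

```python
def amat(levels, memory_latency):
    """Average memory access time in cycles.

    levels: list of (hit_latency_cycles, miss_rate), ordered from L1 outward.
    """
    cost = memory_latency
    for hit_latency, miss_rate in reversed(levels):
        cost = hit_latency + miss_rate * cost
    return cost


levels = [
    (4, 0.05),   # L1: 4 cycles, assumed 5% miss rate
    (12, 0.20),  # L2: 12 cycles, assumed 20% of L1 misses also miss here
    (40, 0.50),  # L3: 40 cycles, assumed 50% of L2 misses also miss here
]
average = amat(levels, memory_latency=200)
print(average)  # about 6 cycles under these assumed miss rates
```

Even though RAM costs 200 cycles in this sketch, the average access costs only about 6 cycles because most accesses hit in L1.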

8. Memory Locality¶

Programs typically access memory in predictable patterns.

Two forms of locality occur frequently.

Locality Type	Meaning
Temporal locality	recently used data is reused
Spatial locality	nearby memory locations are accessed

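Caches exploit both patterns: they keep recently used data close to the CPU (temporal locality) and load memory in fixed-size blocks called cache lines, so that neighboring values arrive together (spatial locality).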
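A small simulation makes the effect of locality visible. This sketch models a direct-mapped cache with 64 lines of 64 bytes each; those sizes, the function name `hit_rate`, and the three access patterns are assumptions chosen for illustration.

```python
def hit_rate(addresses, n_lines=64, line_size=64):
    """Simulate a direct-mapped cache and return the fraction of hits."""
    cache = [None] * n_lines  # each slot remembers which memory block it holds
    hits = 0
    for addr in addresses:
        block = addr // line_size   # which cache-line-sized block is touched
        slot = block % n_lines      # direct mapping: one possible slot per block
        if cache[slot] == block:
            hits += 1
        else:
            cache[slot] = block     # miss: the whole line is loaded
    return hits / len(addresses)


sequential = [i * 8 for i in range(4096)]       # consecutive 8-byte values
strided = [i * 64 for i in range(4096)]         # one value per cache line
repeated = [(i % 16) * 8 for i in range(4096)]  # the same 16 values, reused

print(hit_rate(sequential))  # spatial locality: 7 of every 8 accesses hit
print(hit_rate(strided))     # no locality: every access misses
print(hit_rate(repeated))    # temporal locality: nearly every access hits
```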

9. The von Neumann Bottleneck¶

A key limitation of this architecture is the von Neumann bottleneck.

Both instructions and data must travel between the CPU and memory over the same communication path.

CPU  ◀══════════════════▶  Memory

This creates two major performance constraints:

limited memory bandwidth
high memory latency

As processors became faster, the gap between CPU speed and memory speed grew, producing what is sometimes called the memory wall.

Modern systems reduce this bottleneck using:

cache hierarchies
prefetching
pipelining

10. Harvard Architecture¶

The Harvard architecture separates instruction memory and data memory.

Instruction Memory → CPU ← Data Memory

This allows instruction fetches and data accesses to occur simultaneously.

Architecture comparison¶

Architecture	Memory Model	Advantage
von Neumann	shared instruction/data memory	simple programming model
Harvard	separate memories	higher throughput
Modified Harvard	unified memory + split caches	practical compromise

Most modern processors implement a modified Harvard architecture.

They maintain a unified memory model but use separate instruction and data caches.

Modified Harvard cache structure¶

flowchart TD
    CPU
    L1I[L1 Instruction Cache]
    L1D[L1 Data Cache]
    L2[L2 Cache]
    L3[L3 Cache]
    RAM

    CPU <--> L1I
    CPU <--> L1D
    L1I <--> L2
    L1D <--> L2
    L2 <--> L3
    L3 <--> RAM

11. Implications for Python Performance¶

Understanding the von Neumann architecture helps explain many Python performance characteristics.

Python lists and pointer chasing¶

Python lists store pointers to objects, not raw values.

Python list

[ptr] → PyObject
[ptr] → PyObject
[ptr] → PyObject

The referenced objects can be scattered throughout memory.

Each element access requires following a pointer to another memory location.

This pattern, called pointer chasing, reduces cache efficiency.
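That a list holds references rather than values can be observed directly. The sketch below relies only on object identity (`is` and `id`); the variable names are illustrative.

```python
# Multiplying a list copies the reference, not the object it points to.
shared = [0.0] * 5
all_same_object = all(item is shared[0] for item in shared)

# Building the values one by one creates a separate object per element.
distinct = [float(i) for i in range(5)]
n_objects = len({id(x) for x in distinct})

print(all_same_object)  # every slot refers to one shared float object
print(n_objects)        # five slots, five separately allocated objects
```

In the second list, reading each element means following a reference to a different heap object, which is the access pattern described above.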

NumPy arrays¶

NumPy arrays store raw, fixed-size values in a single memory buffer; a newly created array is laid out contiguously.

[value][value][value][value]

Advantages:

better spatial locality
efficient CPU cache usage
SIMD vectorization
reduced object overhead

Example¶

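import numpy as np

# One contiguous buffer of one million float64 values (8 MB of raw data).
arr = np.zeros(1_000_000)
# One million references; here they all point to the same 0.0 object.
lst = [0.0] * 1_000_000

For numerical computation on large arrays, vectorized NumPy operations generally outperform equivalent loops over Python lists.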
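The difference in memory footprint can be measured with the standard library alone. This sketch uses `array.array("d", ...)` as a stand-in for a one-dimensional float64 NumPy array, since both store raw 8-byte values in one buffer; the exact byte counts for the list depend on the Python build.

```python
import sys
from array import array

n = 100_000
lst = [float(i) for i in range(n)]
arr = array("d", lst)  # raw C doubles in one buffer

# List: the pointer table plus one separate float object per element.
list_total = sys.getsizeof(lst) + sum(sys.getsizeof(x) for x in lst)
# Array: just the raw values, itemsize bytes each.
array_payload = arr.itemsize * len(arr)

print(list_total, array_payload)
```

The list needs several times more memory than the raw values, and that extra memory must also pass through the caches whenever the data is traversed.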

12. Arithmetic Intensity and the Roofline Model¶

Performance is often limited not by computation but by memory bandwidth.

Arithmetic intensity¶

Arithmetic intensity measures how much computation is performed per byte of memory accessed.

\[ \text{Arithmetic intensity} = \frac{\text{operations}}{\text{bytes transferred}} \]

Programs with low arithmetic intensity are memory-bound.

Programs with high arithmetic intensity are compute-bound.

Examples¶

Operation	Likely limit
Vector addition	memory bandwidth
Matrix multiplication	compute capability
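The two examples in the table can be worked through numerically. The sketch assumes 8-byte float64 values and an idealized traffic model in which every input is read once and every output written once; real traffic depends on cache behavior. The function names are hypothetical.

```python
BYTES_PER_VALUE = 8  # float64


def vector_add_intensity(n):
    """c = a + b on vectors of length n."""
    flops = n                                  # one addition per element
    bytes_moved = 3 * n * BYTES_PER_VALUE      # read a, read b, write c
    return flops / bytes_moved


def matmul_intensity(n):
    """C = A @ B on n x n matrices, each matrix moved exactly once."""
    flops = 2 * n**3                           # n multiplies and n adds per output
    bytes_moved = 3 * n**2 * BYTES_PER_VALUE   # read A, read B, write C
    return flops / bytes_moved


print(vector_add_intensity(1_000_000))  # 1/24 flop per byte, regardless of n
print(matmul_intensity(1200))           # n/12 flops per byte, grows with n
```

Vector addition stays at 1/24 operation per byte no matter how large the vectors are, while the intensity of matrix multiplication grows linearly with the matrix size, which is why the first tends to be memory-bound and the second compute-bound.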

Roofline model¶

The Roofline model visualizes performance limits.

Performance
     │
     │            ___________  Compute limit
     │           /
     │          /
     │         /  Memory-bandwidth limit
     │        /
     │       /
     │      /
     └────────────────────────
        Arithmetic Intensity

Two regions appear:

memory-bound region
compute-bound region

In the memory-bound region, improving performance often requires increasing arithmetic intensity, for example by reusing data while it is still in cache.
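The roofline itself is the minimum of two limits: the peak compute rate and the memory bandwidth multiplied by the arithmetic intensity. The machine numbers below (100 GFLOP/s peak, 50 GB/s bandwidth) are hypothetical and describe no particular processor.

```python
def attainable_gflops(intensity, peak_gflops, bandwidth_gb_s):
    """Upper bound on performance for a given arithmetic intensity (flops/byte)."""
    return min(peak_gflops, bandwidth_gb_s * intensity)


PEAK_GFLOPS = 100.0    # assumed peak compute rate
BANDWIDTH_GB_S = 50.0  # assumed memory bandwidth

# Intensity at which the two limits meet (the "ridge point").
ridge = PEAK_GFLOPS / BANDWIDTH_GB_S

for intensity in (1 / 24, 0.5, ridge, 10.0, 100.0):
    bound = attainable_gflops(intensity, PEAK_GFLOPS, BANDWIDTH_GB_S)
    region = "memory-bound" if intensity < ridge else "compute-bound"
    print(f"{intensity:8.3f} flops/byte -> {bound:7.2f} GFLOP/s ({region})")
```

Below the ridge point the bound rises linearly with intensity; above it, the bound stays at the peak compute rate.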

13. Historical Context¶

The architecture is named after John von Neumann, who described the stored-program model in the 1945 document "First Draft of a Report on the EDVAC".

Other key contributors included:

J. Presper Eckert
John Mauchly
Alan Turing
Konrad Zuse

Early stored-program computers such as EDSAC (1949) implemented these ideas.

Modern computers still follow the same basic model.

14. Summary¶

Concept	Description
Stored Program	instructions and data share memory
CPU Components	control unit, ALU, registers
Bus System	connects CPU, memory, and I/O
Instruction Cycle	fetch → decode → execute → writeback
Pipelining	overlapping instruction execution
Memory Hierarchy	registers → cache → RAM → storage
Virtual Memory	programs use virtual addresses
von Neumann Bottleneck	shared memory channel limits speed
Modified Harvard	separate instruction and data caches
Arithmetic Intensity	operations per byte of memory
Roofline Model	compute vs memory performance limits

The von Neumann architecture remains the conceptual framework underlying modern computing. Understanding it provides the foundation for reasoning about program execution, memory behavior, and performance optimization.