How NumPy Bridges the Gap¶
Mental Model
NumPy lets you write one line of Python that triggers millions of operations in optimized C. The trick is contiguous, typed memory: instead of a Python list of boxed objects scattered across the heap, NumPy stores raw numbers side by side, enabling cache-friendly access and SIMD vectorization. You describe what to compute; NumPy handles how.
The NumPy Strategy¶
NumPy achieves C-like performance while staying in Python by:
- Storing data in contiguous, typed arrays (like C)
- Implementing operations in compiled C code
- Minimizing Python overhead by operating on whole arrays
``` The NumPy Bridge:
Python World C World ┌─────────────────────┐ ┌─────────────────────┐ │ │ │ │ │ arr = np.array() │───creates──▶│ Raw memory block │ │ arr * 2 │───calls────▶│ Compiled C loop │ │ np.sum(arr) │───calls────▶│ Optimized routine │ │ │ │ │ │ Easy syntax │ │ Fast execution │ └─────────────────────┘ └─────────────────────┘ ```
Memory Layout Comparison¶
Python List¶
```python
Python list of integers¶
py_list = [1, 2, 3, 4, 5] ```
``` Memory Layout (scattered):
py_list object ──▶ [ref1, ref2, ref3, ref4, ref5] │ │ │ │ │ ▼ ▼ ▼ ▼ ▼ PyInt PyInt PyInt PyInt PyInt (28B) (28B) (28B) (28B) (28B) at at at at at 0x100 0x250 0x180 0x300 0x220
Total: 80 bytes (list) + 5×28 bytes (ints) = 220 bytes Plus: pointer chasing, poor cache locality ```
NumPy Array¶
```python
NumPy array of integers¶
np_arr = np.array([1, 2, 3, 4, 5], dtype=np.int64) ```
``` Memory Layout (contiguous):
np_arr object ──▶ metadata (dtype, shape, strides) │ ▼ ┌───┬───┬───┬───┬───┐ │ 1 │ 2 │ 3 │ 4 │ 5 │ Raw values └───┴───┴───┴───┴───┘ 8B 8B 8B 8B 8B
Contiguous block at 0x100
Total: ~100 bytes (metadata) + 5×8 bytes (data) = 140 bytes Plus: Sequential access, excellent cache locality ```
How NumPy Operations Work¶
Element-wise Operations¶
```python import numpy as np
a = np.array([1, 2, 3, 4, 5]) b = a * 2 # Element-wise multiplication ```
What happens:
``` Python level: b = a * 2 │ ▼ a.mul(2) │ ▼ numpy.multiply(a, 2)
C level: ┌────────────────────────────────────────────────────────┐ │ for (i = 0; i < n; i++) { │ │ result[i] = data[i] * 2; │ │ } │ │ │ │ - No type checking per element │ │ - No object creation per element │ │ - Contiguous memory access │ │ - SIMD vectorization possible │ └────────────────────────────────────────────────────────┘ ```
The BLAS/LAPACK Backend¶
For linear algebra, NumPy uses highly optimized libraries:
```python import numpy as np
A = np.random.rand(1000, 1000) B = np.random.rand(1000, 1000) C = A @ B # Matrix multiplication ```
``` Execution path:
np.matmul(A, B) │ ▼ NumPy wrapper │ ▼ BLAS library (OpenBLAS, MKL, or ATLAS) │ ▼ ┌────────────────────────────────────────────────────────┐ │ Highly optimized matrix multiplication: │ │ - Cache-blocked algorithms │ │ - SIMD instructions (AVX2, AVX512) │ │ - Multi-threaded (parallel) │ │ - CPU-specific tuning │ └────────────────────────────────────────────────────────┘ ```
Vectorization¶
NumPy's key concept: operate on arrays, not elements.
Slow: Python Loop¶
```python
Element-by-element in Python¶
def slow_magnitude(x, y): result = [] for i in range(len(x)): result.append(math.sqrt(x[i]2 + y[i]2)) return result ```
Fast: NumPy Vectorized¶
```python
Whole-array operation in C¶
def fast_magnitude(x, y): return np.sqrt(x2 + y2) ```
Performance Comparison¶
```python import numpy as np import time
n = 1_000_000 x = np.random.rand(n) y = np.random.rand(n)
Python loop¶
x_list, y_list = x.tolist(), y.tolist() start = time.perf_counter() result = [math.sqrt(x_list[i]2 + y_list[i]2) for i in range(n)] python_time = time.perf_counter() - start
NumPy vectorized¶
start = time.perf_counter() result = np.sqrt(x2 + y2) numpy_time = time.perf_counter() - start
print(f"Python: {python_time:.3f}s") print(f"NumPy: {numpy_time:.3f}s") print(f"Speedup: {python_time/numpy_time:.0f}x") ```
Typical output:
Python: 0.450s
NumPy: 0.005s
Speedup: 90x
Universal Functions (ufuncs)¶
NumPy's ufuncs are C-compiled functions that operate element-wise:
```python
These are all ufuncs - implemented in C¶
np.add(a, b) # Addition np.multiply(a, b) # Multiplication np.sin(a) # Trigonometry np.exp(a) # Exponential np.log(a) # Logarithm np.maximum(a, b) # Element-wise max ```
ufunc Features¶
```python
Broadcasting - different shapes work together¶
a = np.array([[1], [2], [3]]) # Shape (3, 1) b = np.array([10, 20, 30]) # Shape (3,) c = a + b # Shape (3, 3)!
Output array - avoid allocation¶
out = np.empty_like(a) np.multiply(a, 2, out=out) # Result stored in pre-allocated array
Reduction - aggregate operations¶
np.add.reduce([1, 2, 3, 4]) # 10 (sum) np.multiply.reduce([1, 2, 3, 4]) # 24 (product) ```
View vs Copy¶
NumPy avoids copying data when possible:
```python a = np.arange(10)
View - shares memory (fast)¶
b = a[::2] # Every other element b[0] = 99 # Modifies 'a' too!
Copy - new memory (slower but independent)¶
c = a[::2].copy() c[0] = 99 # Doesn't affect 'a' ```
``` View (no copy): a: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] ↑ ↑ ↑ ↑ ↑ b: └─────┴─────┴─────┴─────┘ (same memory, different stride)
Copy (new memory): a: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
c: [0, 2, 4, 6, 8] (separate memory) ```
NumPy + Cache Efficiency¶
```python import numpy as np import time
n = 10000
Row-major array (default)¶
a = np.random.rand(n, n)
Row-wise sum (cache-friendly)¶
start = time.perf_counter() row_sums = np.sum(a, axis=1) row_time = time.perf_counter() - start
Column-wise sum (cache-unfriendly for C-order)¶
start = time.perf_counter() col_sums = np.sum(a, axis=0) col_time = time.perf_counter() - start
print(f"Row-wise: {row_time:.4f}s") print(f"Col-wise: {col_time:.4f}s") ```
When NumPy Isn't Enough¶
Numba: JIT Compilation¶
```python from numba import jit import numpy as np
@jit(nopython=True) def custom_operation(arr): """JIT-compiled to machine code.""" result = 0.0 for i in range(len(arr)): result += arr[i] ** 2 + np.sin(arr[i]) return result
First call: compiles¶
Subsequent calls: fast as C¶
```
Cython: Static Typing¶
```python
In .pyx file¶
cimport numpy as np import numpy as np
def fast_operation(np.ndarray[np.float64_t, ndim=1] arr): cdef int i cdef double result = 0.0 for i in range(arr.shape[0]): result += arr[i] ** 2 return result ```
Summary¶
| Technique | How It Helps |
|---|---|
| Contiguous memory | Cache efficiency, SIMD possible |
| Typed arrays | No per-element type checking |
| C implementation | No interpreter overhead |
| Vectorization | One Python call → millions of C operations |
| BLAS/LAPACK | Decades of optimization for linear algebra |
| Views | Avoid copying data |
The NumPy pattern:
``` ┌─────────────────────────────────────────────────────────────┐ │ │ │ Python: Describe WHAT to compute │ │ (high-level, readable, flexible) │ │ │ │ NumPy: Handle HOW to compute │ │ (low-level, optimized, parallel) │ │ │ └─────────────────────────────────────────────────────────────┘
One line of NumPy = thousands of optimized C operations ```
Exercises¶
Exercise 1. Explain how NumPy achieves near-C performance despite being called from Python. What role do compiled C libraries play?
Solution to Exercise 1
```python
Conceptual solution - see page content for details¶
import sys import platform
print(f"Python version: {sys.version}") print(f"Platform: {platform.platform()}") print(f"Architecture: {platform.machine()}") ```
Exercise 2. Write Python code that compares the performance of a pure Python loop versus a NumPy vectorized operation for computing the sum of squares of 1 million numbers.
Solution to Exercise 2
See the main content for the detailed explanation. The key concept involves understanding the hardware-software interaction and how it affects Python performance.
Exercise 3. Explain what the 'Python overhead' means when calling NumPy functions on very small arrays. Why might a Python loop be faster for arrays of size 5?
Solution to Exercise 3
```python import time
Simple benchmark¶
n = 10_000_000 start = time.perf_counter() total = sum(range(n)) elapsed = time.perf_counter() - start print(f"Sum of {n} integers: {total}") print(f"Time: {elapsed:.4f} seconds") ```
Exercise 4. Write code that demonstrates the memory layout difference between a Python list and a NumPy array using sys.getsizeof().
Solution to Exercise 4
```python import numpy as np import time
n = 1_000_000
Python loop¶
start = time.perf_counter() result_py = sum(i * i for i in range(n)) time_py = time.perf_counter() - start
NumPy vectorized¶
arr = np.arange(n) start = time.perf_counter() result_np = np.sum(arr * arr) time_np = time.perf_counter() - start
print(f"Python: {time_py:.4f}s, NumPy: {time_np:.4f}s") print(f"Speedup: {time_py / time_np:.1f}x") ```