Memory Preallocation¶
Preallocate arrays before iterative population to avoid dynamic allocation overhead.
Mental Model
Growing an array inside a loop (via np.append or np.concatenate) copies the entire array on every iteration, turning O(n) work into O(n^2). Preallocating with np.empty or np.zeros and filling by index keeps work at O(n). Always allocate first, then fill.
The Problem¶
Dynamic allocation in loops causes catastrophic inefficiency.
1. Bad Pattern¶
```python import numpy as np
def build_array_bad(n): result = [] for i in range(n): result.append(i ** 2) return np.array(result) ```
2. Why It's Slow¶
- Each
appendmay trigger reallocation - Memory fragmentation increases
- Type conversion at the end adds overhead
3. Hidden Cost¶
Python lists store pointers, not contiguous data.
Preallocation Solution¶
Allocate the full array before the loop.
1. Good Pattern¶
```python import numpy as np
def build_array_good(n): result = np.empty(n, dtype=np.float64) for i in range(n): result[i] = i ** 2 return result ```
2. Why It's Fast¶
- Single allocation upfront
- Contiguous memory access
- No type conversion needed
3. Best Pattern¶
```python import numpy as np
def build_array_best(n): return np.arange(n) ** 2 # Fully vectorized ```
Timing Comparison¶
Measure the performance difference.
1. Benchmark Code¶
```python import numpy as np import time
def with_append(n): result = [] for i in range(n): result.append(i ** 2) return np.array(result)
def with_prealloc(n): result = np.empty(n, dtype=np.int64) for i in range(n): result[i] = i ** 2 return result
def main(): n = 100_000
start = time.perf_counter()
_ = with_append(n)
append_time = time.perf_counter() - start
start = time.perf_counter()
_ = with_prealloc(n)
prealloc_time = time.perf_counter() - start
print(f"Append time: {append_time:.4f} sec")
print(f"Prealloc time: {prealloc_time:.4f} sec")
print(f"Speedup: {append_time / prealloc_time:.1f}x")
if name == "main": main() ```
2. Typical Results¶
Append time: 0.0450 sec
Prealloc time: 0.0120 sec
Speedup: 3.8x
Allocation Functions¶
Choose the right function for preallocation.
1. np.empty¶
```python import numpy as np
arr = np.empty((1000, 1000)) # Fastest, uninitialized ```
No initialization; use when you'll overwrite all values.
2. np.zeros¶
```python import numpy as np
arr = np.zeros((1000, 1000)) # Initialized to zero ```
Safe default when partial filling is possible.
3. np.empty_like¶
```python import numpy as np
template = np.random.randn(100, 100) arr = np.empty_like(template) # Match shape and dtype ```
2D Preallocation¶
Apply preallocation to multi-dimensional arrays.
1. Matrix Building¶
```python import numpy as np
def build_matrix(rows, cols): result = np.empty((rows, cols), dtype=np.float64) for i in range(rows): for j in range(cols): result[i, j] = i * cols + j return result ```
2. Row-wise Building¶
```python import numpy as np
def build_by_rows(rows, cols): result = np.empty((rows, cols), dtype=np.float64) for i in range(rows): result[i, :] = np.arange(cols) + i * cols return result ```
3. Fully Vectorized¶
```python import numpy as np
def build_vectorized(rows, cols): return np.arange(rows * cols).reshape(rows, cols) ```
Best Practices¶
Guidelines for effective preallocation.
1. Know Final Size¶
Preallocation requires knowing dimensions upfront.
2. Specify dtype¶
Always specify dtype to avoid implicit conversions.
3. Prefer Vectorization¶
Preallocation helps loops, but vectorization eliminates them.
Exercises¶
Exercise 1. Write a vectorized NumPy solution and a pure Python loop solution for the same computation. Measure and compare their performance using time.perf_counter().
Solution to Exercise 1
```python import numpy as np import time
n = 1_000_000 data = np.random.default_rng(42).random(n)
Python loop¶
start = time.perf_counter() result_py = [x ** 2 for x in data] py_time = time.perf_counter() - start
NumPy vectorized¶
start = time.perf_counter() result_np = data ** 2 np_time = time.perf_counter() - start
print(f"Python: {py_time:.4f}s, NumPy: {np_time:.6f}s") print(f"Speedup: {py_time / np_time:.0f}x") ```
Exercise 2. Identify a potential performance pitfall in the following code and rewrite it using NumPy vectorization:
python
result = []
for i in range(len(data)):
result.append(data[i] ** 2 + 2 * data[i] + 1)
Solution to Exercise 2
```python import numpy as np
data = np.random.default_rng(42).random(100000)
Vectorized (fast)¶
result = data ** 2 + 2 * data + 1 ```
The loop version creates Python objects for each element and calls append repeatedly. The vectorized version computes everything in compiled C code on contiguous memory.
Exercise 3. Explain why NumPy vectorized operations are faster than Python loops. Reference memory layout, type checking overhead, and SIMD instructions in your answer.
Solution to Exercise 3
NumPy vectorized operations are faster because:
- Contiguous memory: NumPy arrays store elements in a contiguous block, enabling efficient CPU cache usage.
- No type checking: Python loops check types at each iteration; NumPy knows the dtype in advance.
- Compiled C loops: The actual computation runs in compiled C/Fortran code, not interpreted Python.
- SIMD instructions: Modern CPUs can process multiple array elements simultaneously using SIMD (Single Instruction, Multiple Data).
Exercise 4. Apply the concepts from this page to a practical problem: given a large array of temperatures in Celsius, convert them all to Fahrenheit and find the maximum. Compare vectorized and loop approaches.
Solution to Exercise 4
```python import numpy as np import time
rng = np.random.default_rng(42) celsius = rng.uniform(-40, 50, 1_000_000)
Vectorized¶
start = time.perf_counter() fahrenheit = celsius * 9/5 + 32 max_f = fahrenheit.max() vec_time = time.perf_counter() - start
Loop¶
start = time.perf_counter() max_f_loop = max(c * 9/5 + 32 for c in celsius) loop_time = time.perf_counter() - start
print(f"Vectorized: {vec_time:.6f}s, max={max_f:.1f}F") print(f"Loop: {loop_time:.4f}s, max={max_f_loop:.1f}F") ```