Acceleration Libraries¶

External libraries provide additional speedups beyond pure NumPy.

Mental Model

When NumPy alone is not fast enough, acceleration libraries extend the optimization hierarchy with a scaling layer that pushes computation closer to hardware. Each library targets a different bottleneck in the execution pipeline:

text Python loop → NumPy (C loops) → Acceleration │ ┌───────┼───────┐ Numba numexpr CuPy / Dask compile fuse offload loops expressions to GPU/cluster

Decision Framework — When to Use Which

Bottleneck	Tool	What it does
Python loops that resist vectorization	Numba	JIT-compiles Python to machine code
Large expressions with temporaries	numexpr	Fuses element-wise ops, avoids intermediate arrays
Data too large for one CPU	Dask	Parallelizes across cores or a cluster
Need GPU acceleration	CuPy	Drop-in NumPy replacement on NVIDIA GPUs
Already vectorized, still slow	Profile first	The bottleneck may be I/O or algorithm, not execution

numexpr¶

Optimized evaluation of array expressions.

1. Installation¶

bash pip install numexpr

2. Basic Usage¶

```python import numpy as np import numexpr as ne

def main(): a = np.random.randn(1_000_000) b = np.random.randn(1_000_000)

# NumPy (creates temporaries)
result_np = 2 * a + 3 * b ** 2

# numexpr (avoids temporaries)
result_ne = ne.evaluate("2 * a + 3 * b ** 2")

if name == "main": main() ```

3. Why It's Faster¶

Avoids intermediate array allocations
Uses multiple CPU cores automatically
Optimizes memory access patterns

numexpr Benchmark¶

Compare numexpr vs pure NumPy.

1. Timing Code¶

```python import numpy as np import numexpr as ne import time

def main(): n = 10_000_000 a = np.random.randn(n) b = np.random.randn(n)

# NumPy timing
start = time.perf_counter()
for _ in range(10):
    result = 2 * a + 3 * b ** 2
np_time = time.perf_counter() - start

# numexpr timing
start = time.perf_counter()
for _ in range(10):
    result = ne.evaluate("2 * a + 3 * b ** 2")
ne_time = time.perf_counter() - start

print(f"NumPy time:   {np_time:.4f} sec")
print(f"numexpr time: {ne_time:.4f} sec")
print(f"Speedup:      {np_time/ne_time:.1f}x")

if name == "main": main() ```

2. Typical Speedup¶

2-10x faster for complex expressions with large arrays.

Numba¶

Just-in-time compilation for Python functions.

1. Installation¶

bash pip install numba

2. Basic JIT¶

```python import numpy as np from numba import jit

@jit(nopython=True) def sum_squares(arr): total = 0.0 for i in range(len(arr)): total += arr[i] ** 2 return total

def main(): arr = np.random.randn(1_000_000) result = sum_squares(arr) # First call compiles result = sum_squares(arr) # Subsequent calls fast

if name == "main": main() ```

3. nopython Mode¶

nopython=True ensures full compilation without Python fallback.

Numba Parallel¶

Automatic parallelization with Numba.

1. Parallel Loop¶

```python import numpy as np from numba import jit, prange

@jit(nopython=True, parallel=True) def parallel_sum(arr): total = 0.0 for i in prange(len(arr)): total += arr[i] ** 2 return total

def main(): arr = np.random.randn(10_000_000) result = parallel_sum(arr)

if name == "main": main() ```

2. prange¶

Use prange instead of range for parallel iteration.

3. Multi-core Speedup¶

Automatically distributes work across CPU cores.

Cython¶

Static typing and C compilation for Python code.

1. Installation¶

bash pip install cython

2. Cython File (.pyx)¶

```cython

sum_squares.pyx¶

import numpy as np cimport numpy as np

def sum_squares_cy(np.ndarray[np.float64_t, ndim=1] arr): cdef double total = 0.0 cdef int i cdef int n = arr.shape[0]

for i in range(n):
    total += arr[i] ** 2

return total

```

3. Compilation¶

Requires setup.py or build configuration to compile.

Dask¶

Parallel computing for larger-than-memory datasets.

1. Installation¶

bash pip install dask[array]

2. Basic Usage¶

```python import numpy as np import dask.array as da

def main(): # Create Dask array from NumPy x_np = np.random.randn(10000, 10000) x_dask = da.from_array(x_np, chunks=(1000, 1000))

# Operations are lazy
result = (x_dask ** 2).sum()

# Compute triggers execution
print(f"Result: {result.compute():.4f}")

if name == "main": main() ```

3. Chunked Processing¶

```python import dask.array as da

def main(): # Create large array directly x = da.random.random((50000, 50000), chunks=(5000, 5000))

print(f"Array shape: {x.shape}")
print(f"Chunk shape: {x.chunksize}")

# Compute mean (processes in chunks)
mean = x.mean().compute()
print(f"Mean: {mean:.6f}")

if name == "main": main() ```

Dask Benefits¶

1. Out-of-Core Computing¶

Process data larger than RAM by working in chunks.

```python import dask.array as da

def main(): # 100GB array (doesn't fit in memory) x = da.random.random((100000, 100000), chunks=(10000, 10000))

# Still computable via chunking
result = x.mean().compute()
print(f"Mean of 100GB array: {result:.6f}")

if name == "main": main() ```

2. Parallel Execution¶

```python import dask.array as da from dask.distributed import Client

def main(): # Start local cluster client = Client() print(f"Dashboard: {client.dashboard_link}")

x = da.random.random((20000, 20000), chunks=(2000, 2000))
result = (x @ x.T).mean().compute()

print(f"Result: {result:.4f}")

if name == "main": main() ```

3. NumPy API Compatibility¶

Most NumPy operations work seamlessly with Dask arrays.

CuPy¶

GPU-accelerated NumPy-compatible arrays.

1. Installation¶

bash pip install cupy-cuda11x # Match your CUDA version

2. Basic Usage¶

```python import numpy as np import cupy as cp

def main(): # Create array on GPU x_gpu = cp.random.randn(10000, 10000)

# Operations run on GPU
result_gpu = x_gpu @ x_gpu.T

# Transfer back to CPU if needed
result_cpu = cp.asnumpy(result_gpu)

print(f"Result shape: {result_cpu.shape}")

if name == "main": main() ```

3. NumPy to CuPy¶

```python import numpy as np import cupy as cp

def main(): # NumPy array on CPU x_np = np.random.randn(5000, 5000)

# Transfer to GPU
x_gpu = cp.asarray(x_np)

# Compute on GPU
result_gpu = cp.linalg.svd(x_gpu, full_matrices=False)

# Transfer result back
U, s, Vt = [cp.asnumpy(arr) for arr in result_gpu]

print(f"Singular values shape: {s.shape}")

if name == "main": main() ```

CuPy Benchmark¶

1. Matrix Multiply¶

```python import numpy as np import cupy as cp import time

def main(): n = 5000

# CPU (NumPy)
a_np = np.random.randn(n, n).astype(np.float32)
b_np = np.random.randn(n, n).astype(np.float32)

start = time.perf_counter()
c_np = a_np @ b_np
cpu_time = time.perf_counter() - start

# GPU (CuPy)
a_gpu = cp.asarray(a_np)
b_gpu = cp.asarray(b_np)

# Warm-up
_ = a_gpu @ b_gpu
cp.cuda.Stream.null.synchronize()

start = time.perf_counter()
c_gpu = a_gpu @ b_gpu
cp.cuda.Stream.null.synchronize()
gpu_time = time.perf_counter() - start

print(f"CPU time: {cpu_time:.4f} sec")
print(f"GPU time: {gpu_time:.4f} sec")
print(f"Speedup:  {cpu_time/gpu_time:.1f}x")

if name == "main": main() ```

2. Typical Speedup¶

10-100x faster for large matrix operations on modern GPUs.

3. Memory Considerations¶

GPU memory is limited; check available memory with cp.cuda.runtime.memGetInfo().

Comparison Table¶

Choose the right tool for your needs.

1. Summary¶

Library	Use Case	Speedup	Hardware
numexpr	Array expressions	2-10x	CPU
Numba	Numerical loops	10-100x	CPU
Cython	General Python	10-100x	CPU
Dask	Large datasets	Parallel	CPU cluster
CuPy	Matrix operations	10-100x	GPU

2. Decision Guide¶

Small data, complex expressions: numexpr
Loops that can't vectorize: Numba
Data larger than RAM: Dask
Large matrices, have GPU: CuPy

3. Combinations¶

Libraries can be combined (e.g., Dask + CuPy for distributed GPU).

Best Practices¶

Guidelines for using acceleration libraries.

1. Profile First¶

Identify bottlenecks before adding dependencies.

2. Start Simple¶

Try numexpr before Numba; try Numba before Cython.

3. Test Correctness¶

Verify results match original implementation.

4. Consider Trade-offs¶

CuPy: Requires NVIDIA GPU
Dask: Adds complexity for small data
Numba: First call has compilation overhead

Exercises¶

Exercise 1. Write a vectorized NumPy solution and a pure Python loop solution for the same computation. Measure and compare their performance using time.perf_counter().

Solution to Exercise 1

```python import numpy as np import time

n = 1_000_000 data = np.random.default_rng(42).random(n)

Python loop¶

start = time.perf_counter() result_py = [x ** 2 for x in data] py_time = time.perf_counter() - start

NumPy vectorized¶

start = time.perf_counter() result_np = data ** 2 np_time = time.perf_counter() - start

print(f"Python: {py_time:.4f}s, NumPy: {np_time:.6f}s") print(f"Speedup: {py_time / np_time:.0f}x") ```

Exercise 2. Identify a potential performance pitfall in the following code and rewrite it using NumPy vectorization:

python result = [] for i in range(len(data)): result.append(data[i] ** 2 + 2 * data[i] + 1)

Solution to Exercise 2

```python import numpy as np

data = np.random.default_rng(42).random(100000)

Vectorized (fast)¶

result = data ** 2 + 2 * data + 1 ```

The loop version creates Python objects for each element and calls append repeatedly. The vectorized version computes everything in compiled C code on contiguous memory.

Exercise 3. Explain why NumPy vectorized operations are faster than Python loops. Reference memory layout, type checking overhead, and SIMD instructions in your answer.

Solution to Exercise 3

NumPy vectorized operations are faster because:

Contiguous memory: NumPy arrays store elements in a contiguous block, enabling efficient CPU cache usage.
No type checking: Python loops check types at each iteration; NumPy knows the dtype in advance.
Compiled C loops: The actual computation runs in compiled C/Fortran code, not interpreted Python.
SIMD instructions: Modern CPUs can process multiple array elements simultaneously using SIMD (Single Instruction, Multiple Data).

Exercise 4. Apply the concepts from this page to a practical problem: given a large array of temperatures in Celsius, convert them all to Fahrenheit and find the maximum. Compare vectorized and loop approaches.

Solution to Exercise 4

```python import numpy as np import time

rng = np.random.default_rng(42) celsius = rng.uniform(-40, 50, 1_000_000)

Vectorized¶

start = time.perf_counter() fahrenheit = celsius * 9/5 + 32 max_f = fahrenheit.max() vec_time = time.perf_counter() - start

Loop¶

start = time.perf_counter() max_f_loop = max(c * 9/5 + 32 for c in celsius) loop_time = time.perf_counter() - start

print(f"Vectorized: {vec_time:.6f}s, max={max_f:.1f}F") print(f"Loop: {loop_time:.4f}s, max={max_f_loop:.1f}F") ```