Profiling Tools¶

Empirical performance analysis guides optimization efforts.

Mental Model

Never guess where the bottleneck is -- measure it. %timeit gives quick micro-benchmarks in Jupyter, cProfile identifies which functions consume the most time, and line_profiler pinpoints the expensive lines within a function. Profile first, then optimize only the hot path.

Profiling is the decision engine of the optimization hierarchy: it tells you which level of the hierarchy to apply and where. Without measurement, you risk optimizing code that is not the bottleneck — or worse, making it slower through premature complexity. The workflow is always: profile → identify hot path → apply the highest applicable optimization → re-profile.

IPython %timeit¶

Quick timing in interactive environments.

1. Basic Usage¶

python %timeit arr ** 2

Output:

1.23 µs ± 45.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

2. Multi-line Timing¶

python %%timeit result = np.zeros(1000) for i in range(1000): result[i] = i ** 2

3. Control Runs¶

python %timeit -n 100 -r 5 arr ** 2 # 100 loops, 5 runs

timeit Module¶

Scripted benchmarks outside IPython.

1. Basic Script¶

```python import timeit import numpy as np

def main(): setup = "import numpy as np; arr = np.random.randn(10000)" stmt = "arr ** 2"

time = timeit.timeit(stmt, setup, number=1000)
print(f"Total time: {time:.4f} sec")
print(f"Per call:   {time/1000*1000:.4f} ms")

if name == "main": main() ```

2. Compare Functions¶

```python import timeit import numpy as np

def method_loop(arr): result = np.empty_like(arr) for i in range(len(arr)): result[i] = arr[i] ** 2 return result

def method_vectorized(arr): return arr ** 2

def main(): arr = np.random.randn(10000)

loop_time = timeit.timeit(
    lambda: method_loop(arr), number=100
)
vec_time = timeit.timeit(
    lambda: method_vectorized(arr), number=100
)

print(f"Loop time:       {loop_time:.4f} sec")
print(f"Vectorized time: {vec_time:.4f} sec")
print(f"Speedup:         {loop_time/vec_time:.0f}x")

if name == "main": main() ```

time.perf_counter¶

Manual timing with high precision.

1. Basic Pattern¶

```python import time import numpy as np

def main(): arr = np.random.randn(1_000_000)

start = time.perf_counter()
result = arr ** 2
elapsed = time.perf_counter() - start

print(f"Elapsed: {elapsed:.6f} sec")

if name == "main": main() ```

2. Multiple Runs¶

```python import time import numpy as np

def main(): arr = np.random.randn(1_000_000) times = []

for _ in range(10):
    start = time.perf_counter()
    result = arr ** 2
    times.append(time.perf_counter() - start)

print(f"Mean: {np.mean(times):.6f} sec")
print(f"Std:  {np.std(times):.6f} sec")

if name == "main": main() ```

cProfile Module¶

Function-level profiling for entire programs.

1. Command Line¶

bash python -m cProfile -s cumtime my_script.py

2. In Script¶

```python import cProfile import numpy as np

def my_function(): arr = np.random.randn(100000) for _ in range(100): result = arr ** 2

cProfile.run('my_function()') ```

3. Output Example¶

ncalls tottime percall cumtime percall filename:lineno(function) 100 0.050 0.001 0.050 0.001 {method 'random' ...} 100 0.030 0.000 0.030 0.000 {built-in method numpy...}

line_profiler¶

Line-by-line timing for detailed analysis.

1. Installation¶

bash pip install line_profiler

2. Decorator Usage¶

python @profile def my_function(): arr = np.random.randn(100000) # Line 1 result = arr ** 2 # Line 2 total = np.sum(result) # Line 3 return total

3. Run Command¶

bash kernprof -l -v my_script.py

memory_profiler¶

Track memory usage during execution.

1. Installation¶

bash pip install memory_profiler

2. Usage¶

```python from memory_profiler import profile

@profile def my_function(): arr = np.random.randn(1_000_000) result = arr ** 2 return result ```

3. Output Example¶

Line # Mem usage Increment Line Contents 3 50.0 MiB 50.0 MiB arr = np.random.randn(1_000_000) 4 57.6 MiB 7.6 MiB result = arr ** 2

Best Practices¶

Guidelines for effective profiling.

1. Profile First¶

Identify bottlenecks before optimizing.

2. Representative Data¶

Use realistic data sizes for meaningful results.

3. Multiple Runs¶

Average over multiple runs to reduce variance.

Exercises¶

Exercise 1. Write a vectorized NumPy solution and a pure Python loop solution for the same computation. Measure and compare their performance using time.perf_counter().

Solution to Exercise 1

```python import numpy as np import time

n = 1_000_000 data = np.random.default_rng(42).random(n)

Python loop¶

start = time.perf_counter() result_py = [x ** 2 for x in data] py_time = time.perf_counter() - start

NumPy vectorized¶

start = time.perf_counter() result_np = data ** 2 np_time = time.perf_counter() - start

print(f"Python: {py_time:.4f}s, NumPy: {np_time:.6f}s") print(f"Speedup: {py_time / np_time:.0f}x") ```

Exercise 2. Identify a potential performance pitfall in the following code and rewrite it using NumPy vectorization:

python result = [] for i in range(len(data)): result.append(data[i] ** 2 + 2 * data[i] + 1)

Solution to Exercise 2

```python import numpy as np

data = np.random.default_rng(42).random(100000)

Vectorized (fast)¶

result = data ** 2 + 2 * data + 1 ```

The loop version creates Python objects for each element and calls append repeatedly. The vectorized version computes everything in compiled C code on contiguous memory.

Exercise 3. Explain why NumPy vectorized operations are faster than Python loops. Reference memory layout, type checking overhead, and SIMD instructions in your answer.

Solution to Exercise 3

NumPy vectorized operations are faster because:

Contiguous memory: NumPy arrays store elements in a contiguous block, enabling efficient CPU cache usage.
No type checking: Python loops check types at each iteration; NumPy knows the dtype in advance.
Compiled C loops: The actual computation runs in compiled C/Fortran code, not interpreted Python.
SIMD instructions: Modern CPUs can process multiple array elements simultaneously using SIMD (Single Instruction, Multiple Data).

Exercise 4. Apply the concepts from this page to a practical problem: given a large array of temperatures in Celsius, convert them all to Fahrenheit and find the maximum. Compare vectorized and loop approaches.

Solution to Exercise 4

```python import numpy as np import time

rng = np.random.default_rng(42) celsius = rng.uniform(-40, 50, 1_000_000)

Vectorized¶

start = time.perf_counter() fahrenheit = celsius * 9/5 + 32 max_f = fahrenheit.max() vec_time = time.perf_counter() - start

Loop¶

start = time.perf_counter() max_f_loop = max(c * 9/5 + 32 for c in celsius) loop_time = time.perf_counter() - start

print(f"Vectorized: {vec_time:.6f}s, max={max_f:.1f}F") print(f"Loop: {loop_time:.4f}s, max={max_f_loop:.1f}F") ```