NumPy Arrays¶
Overview¶
NumPy (Numerical Python) is the foundational library for numerical computing in Python. At its core is the ndarray—a fast, memory-efficient, multidimensional array that supports vectorized arithmetic and broadcasting. Nearly every scientific Python library (Pandas, SciPy, Scikit-Learn, Matplotlib) is built on top of NumPy arrays.
import numpy as np
Creating Arrays¶
From Python Lists¶
# 1-D array
a = np.array([1, 2, 3, 4, 5])
print(a) # [1 2 3 4 5]
print(a.shape) # (5,)
print(a.dtype) # int64
# 2-D array (matrix)
M = np.array([[1, 2, 3],
[4, 5, 6]])
print(M.shape) # (2, 3)
With Built-in Constructors¶
np.zeros((3, 4)) # 3×4 matrix of zeros
np.ones((2, 2)) # 2×2 matrix of ones
np.full((3, 3), 7) # 3×3 matrix filled with 7
np.eye(4) # 4×4 identity matrix
np.arange(0, 10, 2) # array([0, 2, 4, 6, 8])
np.linspace(0, 1, 5) # array([0. , 0.25, 0.5 , 0.75, 1. ])
Random Arrays¶
rng = np.random.default_rng(42) # reproducible generator
rng.standard_normal((3, 3)) # 3×3 standard normal draws
rng.uniform(0, 1, size=(2, 5)) # 2×5 Uniform(0,1) draws
rng.integers(0, 10, size=6) # 6 random ints in [0, 10)
Array Attributes¶
| Attribute | Description | Example |
|---|---|---|
a.shape |
Dimensions | (2, 3) |
a.ndim |
Number of dimensions | 2 |
a.size |
Total number of elements | 6 |
a.dtype |
Element data type | float64 |
a.nbytes |
Memory consumption in bytes | 48 |
Indexing and Slicing¶
One-Dimensional¶
a = np.array([10, 20, 30, 40, 50])
a[0] # 10 — first element
a[-1] # 50 — last element
a[1:4] # [20 30 40] — slice (start inclusive, stop exclusive)
a[::2] # [10 30 50] — every other element
Two-Dimensional¶
M = np.array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
M[0, 1] # 2 — row 0, col 1
M[1, :] # [4 5 6] — entire row 1
M[:, 2] # [3 6 9] — entire col 2
M[:2, :2] # [[1 2] — upper-left 2×2 sub-matrix
# [4 5]]
Boolean (Fancy) Indexing¶
a = np.array([3, 1, 4, 1, 5, 9])
mask = a > 3
print(mask) # [False False True False True True]
print(a[mask]) # [4 5 9]
Vectorized Operations¶
NumPy performs element-wise arithmetic without explicit loops, which is both faster and more readable than pure Python.
a = np.array([1, 2, 3, 4, 5])
# Element-wise operations
print(a + 10) # [11 12 13 14 15]
print(a * 2) # [ 2 4 6 8 10]
print(a ** 2) # [ 1 4 9 16 25]
# Element-wise between two arrays
b = np.array([10, 20, 30, 40, 50])
print(a + b) # [11 22 33 44 55]
print(a * b) # [ 10 40 90 160 250]
Comparison with Pure Python¶
import time
size = 1_000_000
py_list = list(range(size))
np_arr = np.arange(size)
# Pure Python
start = time.time()
result_py = [x ** 2 for x in py_list]
print(f"Python list: {time.time() - start:.4f}s")
# NumPy
start = time.time()
result_np = np_arr ** 2
print(f"NumPy array: {time.time() - start:.4f}s")
NumPy is typically 10–100× faster for array operations because it delegates computation to optimized C and Fortran routines.
Broadcasting¶
Broadcasting is NumPy's mechanism for performing arithmetic on arrays of different shapes.
Rules (simplified):
- If the arrays differ in the number of dimensions, the shape of the smaller array is padded with ones on the left.
- Arrays with size 1 along a particular dimension act as if they had the size of the array with the largest shape along that dimension.
# Scalar broadcast
a = np.array([1, 2, 3])
print(a + 100) # [101 102 103]
# Column vector + row vector → matrix
col = np.array([[1], [2], [3]]) # shape (3, 1)
row = np.array([10, 20, 30]) # shape (3,)
print(col + row)
# [[11 21 31]
# [12 22 32]
# [13 23 33]]
Statistical Application: Standardization¶
Broadcasting makes it natural to standardize a data matrix (subtract column means, divide by column standard deviations):
# X has shape (n, p) — n observations, p features
X = rng.standard_normal((100, 5))
X_standardized = (X - X.mean(axis=0)) / X.std(axis=0)
# Verify: each column now has mean ≈ 0 and std ≈ 1
print(X_standardized.mean(axis=0)) # ≈ [0, 0, 0, 0, 0]
print(X_standardized.std(axis=0)) # ≈ [1, 1, 1, 1, 1]
Aggregation Functions¶
a = np.array([4, 1, 7, 3, 9, 2])
a.sum() # 26
a.mean() # 4.333...
a.std() # 2.687...
a.var() # 7.222...
a.min() # 1
a.max() # 9
a.argmin() # 1 — index of min
a.argmax() # 4 — index of max
np.median(a) # 3.5
Aggregation Along an Axis¶
For a 2-D array, axis=0 aggregates down the rows (column-wise) and axis=1 aggregates across the columns (row-wise).
M = np.array([[1, 2, 3],
[4, 5, 6]])
M.sum(axis=0) # [5 7 9] — column sums
M.sum(axis=1) # [6 15] — row sums
M.mean(axis=0) # [2.5 3.5 4.5]
Linear Algebra¶
NumPy provides a comprehensive set of linear algebra operations essential for statistics.
Matrix Multiplication¶
A = np.array([[1, 2],
[3, 4]])
B = np.array([[2, 3],
[0, 1]])
# Three equivalent ways
C = A @ B # preferred syntax
C = np.matmul(A, B)
C = np.dot(A, B) # identical for 2-D arrays
print(C)
# [[ 2 5]
# [ 6 13]]
Other Operations¶
# Transpose
print(A.T)
# Determinant
print(np.linalg.det(A)) # -2.0
# Inverse
print(np.linalg.inv(A))
# Eigenvalues and eigenvectors
eigenvalues, eigenvectors = np.linalg.eig(A)
# Solve Ax = b
b = np.array([5, 11])
x = np.linalg.solve(A, b)
print(x) # [1. 2.]
Statistical Application: OLS Estimator¶
The ordinary-least-squares estimator \(\hat{\beta} = (X^T X)^{-1} X^T y\) can be computed directly:
n, p = 50, 3
rng = np.random.default_rng(0)
X = rng.standard_normal((n, p))
X = np.column_stack([np.ones(n), X]) # add intercept column
beta_true = np.array([2, -1, 0.5, 3])
y = X @ beta_true + rng.standard_normal(n) * 0.5
# OLS estimate
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat) # ≈ [2, -1, 0.5, 3]
Reshaping and Stacking¶
a = np.arange(12)
# Reshape into 3×4
M = a.reshape(3, 4)
# Flatten back
flat = M.ravel()
# Stack arrays
v1 = np.array([1, 2, 3])
v2 = np.array([4, 5, 6])
np.vstack([v1, v2]) # [[1 2 3], [4 5 6]]
np.hstack([v1, v2]) # [1 2 3 4 5 6]
np.column_stack([v1, v2]) # [[1 4], [2 5], [3 6]]
Random Number Generation for Statistics¶
rng = np.random.default_rng(seed=42)
# Standard distributions
rng.normal(loc=0, scale=1, size=1000) # Normal(μ=0, σ=1)
rng.uniform(low=0, high=1, size=1000) # Uniform(0, 1)
rng.binomial(n=10, p=0.3, size=1000) # Binomial(10, 0.3)
rng.poisson(lam=5, size=1000) # Poisson(λ=5)
rng.exponential(scale=2, size=1000) # Exponential(β=2)
# Sampling
population = np.arange(100)
sample = rng.choice(population, size=10, replace=False)
# Permutation
rng.shuffle(population) # in-place shuffle
perm = rng.permutation(100) # returns shuffled copy
Summary¶
| Concept | Key Takeaway |
|---|---|
ndarray |
Core data structure; homogeneous, fixed-size, N-dimensional |
| Vectorization | Element-wise operations replace explicit loops for speed |
| Broadcasting | Automatic shape expansion enables concise array arithmetic |
| Aggregation | sum, mean, std, var with optional axis parameter |
| Linear algebra | @ for matrix multiplication; np.linalg for decompositions and solvers |
| Random generation | Use default_rng(seed) for reproducible random sampling |