Basic Visualization with Matplotlib¶
Overview¶
Matplotlib is Python's foundational plotting library. It can produce publication-quality static figures in a wide variety of formats and integrates tightly with NumPy and pandas. This section covers the core plotting patterns used throughout the book for exploratory data analysis and statistical visualization.
import matplotlib.pyplot as plt
import numpy as np
The Figure–Axes Model¶
Every Matplotlib plot lives inside a Figure, which contains one or more Axes (individual plots). The recommended way to create figures is with plt.subplots().
fig, ax = plt.subplots() # single plot
fig, axes = plt.subplots(1, 2) # 1 row, 2 columns
fig, axes = plt.subplots(2, 3, figsize=(12, 6)) # 2×3 grid
Axes vs Axis
In Matplotlib terminology, an Axes object is an entire plot (with its own title, labels, and data). An axis (lowercase) refers to the x-axis or y-axis within that plot.
Line Plots¶
Line plots are the most basic visualization type and are commonly used for time series and function curves.
x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 7, 11]
fig, ax = plt.subplots()
ax.plot(x, y, marker="o", linestyle="-", color="b", label="primes")
ax.set_title("Simple Line Plot")
ax.set_xlabel("X-axis")
ax.set_ylabel("Y-axis")
ax.legend()
plt.show()
Plotting Mathematical Functions¶
x = np.linspace(-2 * np.pi, 2 * np.pi, 201)
fig, ax = plt.subplots(figsize=(10, 3))
ax.plot(x, np.sin(x), label="sin(x)")
ax.plot(x, np.cos(x), label="cos(x)", linestyle="--")
# Custom x-ticks with π labels
ax.set_xticks([-2*np.pi, -np.pi, 0, np.pi, 2*np.pi])
ax.set_xticklabels([r"$-2\pi$", r"$-\pi$", "0", r"$\pi$", r"$2\pi$"])
# Move spines to origin
ax.spines["right"].set_visible(False)
ax.spines["top"].set_visible(False)
ax.spines["bottom"].set_position("zero")
ax.spines["left"].set_position("zero")
ax.legend()
plt.show()
Scatter Plots¶
Scatter plots visualize the relationship between two continuous variables—central to correlation analysis and regression diagnostics.
rng = np.random.default_rng(42)
x = rng.normal(0, 1, 100)
y = 2 * x + rng.normal(0, 0.5, 100)
fig, ax = plt.subplots()
ax.scatter(x, y, alpha=0.6, edgecolors="k", linewidths=0.5)
ax.set_xlabel("X")
ax.set_ylabel("Y")
ax.set_title("Scatter Plot")
plt.show()
Histograms¶
Histograms display the distribution of a single variable—the visual counterpart to density estimation.
data = rng.normal(loc=50, scale=10, size=500)
fig, ax = plt.subplots()
ax.hist(data, bins=25, edgecolor="black", alpha=0.7)
ax.set_xlabel("Value")
ax.set_ylabel("Frequency")
ax.set_title("Histogram")
plt.show()
Normalized Histogram with Density Overlay¶
from scipy.stats import norm
fig, ax = plt.subplots()
ax.hist(data, bins=30, density=True, alpha=0.6, edgecolor="black", label="data")
# Overlay theoretical density
x_grid = np.linspace(data.min(), data.max(), 200)
ax.plot(x_grid, norm.pdf(x_grid, loc=50, scale=10), "r-", lw=2, label="N(50, 10²)")
ax.set_xlabel("Value")
ax.set_ylabel("Density")
ax.legend()
plt.show()
Bar Charts¶
Bar charts compare categorical quantities—useful for frequency tables and group comparisons.
categories = ["A", "B", "C", "D"]
values = [23, 45, 12, 37]
fig, ax = plt.subplots()
ax.bar(categories, values, color="steelblue", edgecolor="black")
ax.set_xlabel("Category")
ax.set_ylabel("Count")
ax.set_title("Bar Chart")
plt.show()
Box Plots¶
Box plots summarize the five-number summary (min, Q1, median, Q3, max) and flag outliers.
groups = [rng.normal(0, 1, 100),
rng.normal(1, 1.5, 100),
rng.normal(-0.5, 0.8, 100)]
fig, ax = plt.subplots()
ax.boxplot(groups, labels=["Group A", "Group B", "Group C"])
ax.set_ylabel("Value")
ax.set_title("Box Plot Comparison")
plt.show()
Subplots¶
Subplots allow multiple plots to share a single figure, enabling side-by-side comparisons.
fig, axes = plt.subplots(1, 3, figsize=(14, 4))
# Plot 1: Histogram
axes[0].hist(rng.normal(0, 1, 500), bins=25, edgecolor="black")
axes[0].set_title("Histogram")
# Plot 2: Scatter
x = rng.normal(0, 1, 100)
axes[1].scatter(x, x + rng.normal(0, 0.3, 100), alpha=0.6)
axes[1].set_title("Scatter")
# Plot 3: Line
t = np.linspace(0, 4 * np.pi, 200)
axes[2].plot(t, np.sin(t))
axes[2].set_title("Sine Wave")
fig.suptitle("Three Subplots", fontsize=14)
fig.tight_layout()
plt.show()
Customization Reference¶
Colors, Markers, and Line Styles¶
# Named colors: "steelblue", "coral", "seagreen", "slategray"
# Hex colors: "#1f77b4"
# Markers: "o", "s", "^", "D", "x", "+"
# Line styles: "-", "--", "-.", ":"
Labels, Titles, and Legends¶
ax.set_title("Title", fontsize=14)
ax.set_xlabel("X Label", fontsize=12)
ax.set_ylabel("Y Label", fontsize=12)
ax.legend(loc="upper right", fontsize=10)
ax.set_xlim(0, 10)
ax.set_ylim(-1, 1)
Grid and Ticks¶
ax.grid(True, alpha=0.3)
ax.tick_params(axis="both", labelsize=10)
Plotting Directly from pandas¶
pandas DataFrames and Series have a built-in .plot() method that wraps Matplotlib, making quick exploratory plots convenient.
import pandas as pd
df = pd.DataFrame({
"A": rng.normal(0, 1, 200),
"B": rng.normal(1, 2, 200)
})
# Histogram of all columns
df.plot.hist(bins=30, alpha=0.5, edgecolor="black")
# Scatter plot
df.plot.scatter(x="A", y="B", alpha=0.5)
# Box plot
df.plot.box()
plt.show()
Saving Figures¶
fig.savefig("figure.png", dpi=150, bbox_inches="tight")
fig.savefig("figure.pdf", bbox_inches="tight")
fig.savefig("figure.svg", bbox_inches="tight")
The bbox_inches="tight" argument trims excess whitespace around the figure.
Statistical Plot Recipes¶
The following patterns recur throughout the book.
Empirical CDF¶
def plot_ecdf(data, ax, **kwargs):
"""Plot the empirical CDF of a 1-D array."""
sorted_data = np.sort(data)
ecdf = np.arange(1, len(sorted_data) + 1) / len(sorted_data)
ax.step(sorted_data, ecdf, where="post", **kwargs)
ax.set_ylabel("ECDF")
fig, ax = plt.subplots()
sample = rng.normal(0, 1, 300)
plot_ecdf(sample, ax, label="sample")
ax.legend()
plt.show()
Q-Q Plot (Manual)¶
from scipy.stats import norm
sample = np.sort(rng.normal(0, 1, 200))
theoretical = norm.ppf(np.linspace(0.005, 0.995, len(sample)))
fig, ax = plt.subplots()
ax.scatter(theoretical, sample, s=10, alpha=0.6)
lims = [min(theoretical.min(), sample.min()), max(theoretical.max(), sample.max())]
ax.plot(lims, lims, "r--", lw=1)
ax.set_xlabel("Theoretical Quantiles")
ax.set_ylabel("Sample Quantiles")
ax.set_title("Q-Q Plot")
plt.show()
Confidence Interval Visualization¶
means = [2.3, 3.1, 4.5]
ci_lower = [1.8, 2.5, 3.9]
ci_upper = [2.8, 3.7, 5.1]
labels = ["A", "B", "C"]
fig, ax = plt.subplots()
y_pos = range(len(means))
ax.errorbar(means, y_pos,
xerr=[[m - lo for m, lo in zip(means, ci_lower)],
[hi - m for m, hi in zip(means, ci_upper)]],
fmt="o", capsize=4)
ax.set_yticks(y_pos)
ax.set_yticklabels(labels)
ax.set_xlabel("Estimate")
ax.set_title("Confidence Intervals")
plt.show()
Summary¶
| Plot Type | When to Use | Key Function |
|---|---|---|
| Line plot | Trends, time series, function curves | ax.plot() |
| Scatter plot | Bivariate relationships | ax.scatter() |
| Histogram | Distribution of a single variable | ax.hist() |
| Bar chart | Categorical comparisons | ax.bar() |
| Box plot | Five-number summary and outliers | ax.boxplot() |
| Subplots | Side-by-side comparisons | plt.subplots(nrows, ncols) |
| ECDF | Non-parametric distribution view | Custom step function |
| Q-Q plot | Normality assessment | scatter + reference line |