Distribution Fitting

A common use case for histograms is visualizing empirical data alongside theoretical probability distributions. This involves estimating distribution parameters from data and overlaying the fitted PDF.

Mental Model

Distribution fitting overlays a smooth theoretical curve on your histogram to see how well a known distribution (e.g., Normal, Exponential) explains your data. Plot the histogram with density=True so the y-axis shows probability density, then draw the fitted PDF on top. A good fit means the curve hugs the bars closely.

Distribution fitting is a form of modeling: you are asking "what mathematical distribution could have generated this data?" The histogram shows empirical behavior; the PDF shows theoretical behavior. Fitting connects the two — but a visually good fit does not guarantee a statistically correct model. Always validate with statistical tests.

A histogram is only a rough density estimate

A histogram's shape depends heavily on the number of bins and their placement. Two different bin counts on the same data can suggest different distributions. Visual alignment between bars and a PDF curve is suggestive but not proof of a good fit. For rigorous assessment, use a goodness-of-fit test like the Kolmogorov-Smirnov test (scipy.stats.kstest) or compare log-likelihoods. Also consider that with small samples, almost any distribution can appear to fit — more data sharpens the comparison.
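As a minimal sketch of the goodness-of-fit check mentioned above, the snippet below runs `scipy.stats.kstest` on the same sample against two candidate models. The seed, sample size, and the choice of an exponential as the "poor" candidate are illustrative assumptions. Note that when parameters are estimated from the same data being tested, the reported p-values tend to be optimistic, so treat them as a rough guide rather than an exact significance level.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
data = rng.normal(loc=1.0, scale=2.0, size=2_000)

# Candidate 1: normal distribution fitted by maximum likelihood
mu_hat, sigma_hat = stats.norm.fit(data)
ks_normal = stats.kstest(data, "norm", args=(mu_hat, sigma_hat))

# Candidate 2 (deliberately poor): exponential fitted to the same data
loc_hat, scale_hat = stats.expon.fit(data)
ks_expon = stats.kstest(data, "expon", args=(loc_hat, scale_hat))

# A smaller KS statistic D means the empirical CDF is closer to the model CDF
print(f"normal:      D={ks_normal.statistic:.4f}, p={ks_normal.pvalue:.3g}")
print(f"exponential: D={ks_expon.statistic:.4f}, p={ks_expon.pvalue:.3g}")
```

The normal candidate should produce a much smaller statistic than the exponential one, which matches what the overlaid curves would suggest visually.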

Fitting Normal Distribution

Manual PDF Formula

```python
import matplotlib.pyplot as plt
import numpy as np


def main():
    # data generation
    n_samples = 10_000
    data = np.random.randn(n_samples)  # shape: (10_000,)

    # plot histogram with theoretical PDF (standard normal)
    fig, ax = plt.subplots()
    _, bins, _ = ax.hist(data, bins=100, density=True)
    ax.plot(bins, np.exp(-bins**2 / 2) / np.sqrt(2 * np.pi),
            '--r', alpha=0.9, lw=5)
    plt.show()


if __name__ == "__main__":
    main()
```

With Parameter Estimation

When data may not be standard normal, estimate parameters from the sample:

```python
import matplotlib.pyplot as plt
import numpy as np


def main():
    # data generation
    n_samples = 10_000
    data = np.random.randn(n_samples)  # shape: (10_000,)

    # parameter estimation
    mu = data.mean()
    sigma = data.std()

    # plot histogram with fitted PDF
    fig, ax = plt.subplots()
    _, bins, _ = ax.hist(data, bins=100, density=True)
    pdf = np.exp(-(bins - mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
    ax.plot(bins, pdf, '--r', alpha=0.9, lw=5)
    plt.show()


if __name__ == "__main__":
    main()
```

Using scipy.stats

The scipy.stats module provides a cleaner interface for distribution fitting.

```python
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats


def main():
    # data generation from non-standard normal
    n_samples = 10_000
    data = stats.norm(loc=1, scale=2).rvs(n_samples)  # mean=1, std=2

    # parameter estimation
    mu = data.mean()
    sigma = data.std()

    # plot histogram with fitted PDF
    fig, ax = plt.subplots()
    _, bins, _ = ax.hist(data, bins=100, density=True)
    ax.plot(bins, stats.norm(loc=mu, scale=sigma).pdf(bins),
            '--r', alpha=0.9, lw=5)
    plt.show()


if __name__ == "__main__":
    main()
```
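The examples above estimate `mu` and `sigma` by hand with `data.mean()` and `data.std()`. As a small sketch, the snippet below checks that `scipy.stats.norm.fit` gives the same maximum-likelihood estimates. The seed is an arbitrary assumption for reproducibility.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)  # arbitrary seed
data = stats.norm(loc=1, scale=2).rvs(10_000, random_state=rng)

# norm.fit returns the maximum-likelihood estimates (loc, scale)
mu_fit, sigma_fit = stats.norm.fit(data)

print(mu_fit, data.mean())    # MLE of loc equals the sample mean
print(sigma_fit, data.std())  # MLE of scale equals std with ddof=0 (NumPy default)
print(data.std(ddof=1))       # the unbiased-variance version is slightly larger
```

For large samples the difference between `ddof=0` and `ddof=1` is negligible, but it is worth knowing which one a given estimate corresponds to.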

General Workflow

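1. Generate or load data: obtain the empirical dataset.
2. Estimate parameters: use sample statistics (mean, std) or maximum likelihood estimation (MLE).
3. Plot histogram: use density=True so the bar areas sum to 1.
4. Overlay PDF: evaluate the theoretical PDF at the bin edges (or on a finer grid) and plot it.
5. Assess fit: compare the histogram and fitted curve visually, then confirm with a goodness-of-fit test.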
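The workflow above can be sketched as a single helper. `fit_and_overlay` is a hypothetical name introduced here for illustration, not part of Matplotlib or SciPy. The sketch assumes a `scipy.stats` continuous distribution with a `fit` method and selects the non-interactive `Agg` backend so it runs without a display; drop that line for interactive use.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats


def fit_and_overlay(data, dist=stats.norm, bins=50):
    """Hypothetical helper: fit `dist` by MLE and overlay its PDF on a density histogram."""
    params = dist.fit(data)                             # step 2: estimate parameters
    fig, ax = plt.subplots()
    ax.hist(data, bins=bins, density=True, alpha=0.5)   # step 3: density histogram
    x = np.linspace(np.min(data), np.max(data), 400)    # finer grid than the bin edges
    ax.plot(x, dist.pdf(x, *params), "--r", lw=2)       # step 4: overlay fitted PDF
    ks = stats.kstest(data, dist.cdf, args=params)      # step 5: goodness-of-fit check
    return fig, ax, params, ks


rng = np.random.default_rng(2)                          # arbitrary seed
data = rng.normal(loc=1.0, scale=2.0, size=5_000)       # step 1: generate data
fig, ax, params, ks = fit_and_overlay(data)
print(params, ks.statistic)
```

Evaluating the PDF on a dense grid instead of the bin edges keeps the curve smooth even when the histogram uses few bins.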

Fitting Other Distributions

The same pattern applies to other distributions:

```python
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats

# Exponential distribution
data = stats.expon(scale=2).rvs(10_000)
scale_hat = data.mean()  # MLE of the scale when loc is fixed at 0

fig, ax = plt.subplots()
_, bins, _ = ax.hist(data, bins=100, density=True, alpha=0.5)
ax.plot(bins, stats.expon(scale=scale_hat).pdf(bins), '--r', lw=2)
plt.show()
```
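As a hedged sketch of how to choose between candidate distributions numerically, the snippet below fits both an exponential and a normal to the same exponential sample and compares their total log-likelihoods. Fixing `floc=0` is an assumption that the data starts at zero; the seed is arbitrary.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)  # arbitrary seed
data = stats.expon(scale=2).rvs(10_000, random_state=rng)

# Fix loc at 0 so only the scale is estimated; the MLE is then the sample mean
loc_hat, scale_hat = stats.expon.fit(data, floc=0)

# Fit a normal to the same data as a competing model
mu_hat, sigma_hat = stats.norm.fit(data)

# Total log-likelihood of the data under each fitted model (higher is better)
ll_expon = stats.expon.logpdf(data, loc=loc_hat, scale=scale_hat).sum()
ll_norm = stats.norm.logpdf(data, loc=mu_hat, scale=sigma_hat).sum()

print(f"scale_hat={scale_hat:.3f}, sample mean={data.mean():.3f}")
print(f"log-likelihood: expon={ll_expon:.1f}, norm={ll_norm:.1f}")
```

Here the exponential model should score higher even though it has fewer free parameters, which agrees with how the data was generated.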

Exercises

Exercise 1. Write code that generates 2000 samples from a normal distribution, plots a histogram with density=True, and overlays the theoretical PDF curve using scipy.stats.norm.pdf().

Solution to Exercise 1

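```python
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

np.random.seed(42)
data = np.random.normal(5, 2, 2000)

fig, ax = plt.subplots()
ax.hist(data, bins=30, density=True, alpha=0.7,
        color='steelblue', edgecolor='black')

# overlay the theoretical PDF of N(5, 2)
x = np.linspace(data.min(), data.max(), 200)
ax.plot(x, stats.norm.pdf(x, loc=5, scale=2), 'r-', lw=2, label='N(5, 2) PDF')

ax.set_xlabel('Value')
ax.set_ylabel('Density')
ax.set_title('Histogram with Normal PDF')
ax.legend()
plt.show()
```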

Exercise 2. Explain why density=True is necessary when overlaying a PDF curve on a histogram. What units does the y-axis represent?

Solution to Exercise 2

Without density=True, the y-axis shows raw counts (number of data points per bin). A PDF curve, however, is defined so that the total area under it equals 1, and its y-axis represents probability density (probability per unit of x). Overlaying a PDF on raw counts would produce a scale mismatch: the curve would lie almost flat along the x-axis next to bars with heights in the hundreds or thousands.

Setting density=True normalizes the histogram so that the total area of all bars equals 1, making the y-axis units probability density — the same units as the PDF. This allows a meaningful visual comparison: if the bars and curve align, the distribution is a plausible model for the data.
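The normalization can be checked numerically. The sketch below uses `np.histogram`, which performs the same binning as `ax.hist`, to confirm that the bar areas sum to 1 when `density=True`. The seed and sample parameters are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)  # arbitrary seed
data = rng.normal(loc=5.0, scale=2.0, size=2_000)

counts, edges = np.histogram(data, bins=30)
density, _ = np.histogram(data, bins=30, density=True)
widths = np.diff(edges)

print(counts.sum())                # raw counts add up to the sample size
print((density * widths).sum())    # bar areas add up to 1 (up to rounding)

# density is the raw count rescaled by 1 / (n * bin width)
manual_density = counts / (counts.sum() * widths)
print(np.allclose(density, manual_density))
```

Because each bar height is divided by both the sample size and the bin width, the heights are in units of probability per unit of x, the same units as a PDF.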

Exercise 3. Create a figure that fits and overlays an exponential distribution PDF on histogram data generated from np.random.exponential().

Solution to Exercise 3

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

np.random.seed(42)
exp_data = np.random.exponential(1, 1000)

# with loc fixed at 0, the MLE of the exponential scale is the sample mean
scale_hat = exp_data.mean()

fig, ax = plt.subplots()
ax.hist(exp_data, bins=30, density=True, alpha=0.7, color='coral')

# overlay the fitted exponential PDF
x = np.linspace(0, exp_data.max(), 200)
ax.plot(x, stats.expon(scale=scale_hat).pdf(x), 'k--', lw=2,
        label=f'Fitted Exponential (scale={scale_hat:.2f})')
ax.set_title('Exponential Distribution')
ax.legend()
plt.show()
```

Exercise 4. Write code that generates data from a mixture of two normal distributions and overlays both individual component PDFs and the combined PDF on the histogram.

Solution to Exercise 4

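```python
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

np.random.seed(42)
data1 = np.random.normal(0, 1, 1000)
data2 = np.random.normal(3, 1, 1000)
data = np.concatenate([data1, data2])  # equal-weight mixture

fig, ax = plt.subplots()
ax.hist(data, bins=50, density=True, alpha=0.5, color='gray', label='Mixture data')

# component PDFs scaled by their mixture weight (0.5 each), plus their sum
x = np.linspace(data.min(), data.max(), 400)
pdf1 = 0.5 * stats.norm.pdf(x, 0, 1)
pdf2 = 0.5 * stats.norm.pdf(x, 3, 1)
ax.plot(x, pdf1, 'b--', label='0.5 * N(0, 1)')
ax.plot(x, pdf2, 'r--', label='0.5 * N(3, 1)')
ax.plot(x, pdf1 + pdf2, 'k-', lw=2, label='Combined PDF')
ax.legend()
ax.set_title('Mixture of Two Normals')
plt.show()
```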