rv_continuous and rv_discrete¶
All probability distributions in scipy.stats inherit from one of two base classes: rv_continuous for continuous distributions and rv_discrete for discrete distributions. Understanding these base classes reveals the unified design behind the entire scipy.stats distribution system.
The Distribution Class Hierarchy¶
Every distribution in scipy.stats is an instance of either rv_continuous or rv_discrete, both of which inherit from rv_generic:
rv_generic
├── rv_continuous → norm, expon, gamma, chi2, t, f, beta, ...
└── rv_discrete → binom, poisson, geom, hypergeom, nbinom, ...
When you write stats.norm(loc=3.0), you are calling rv_continuous.__call__() which returns a frozen distribution — an object with fixed parameters.
rv_continuous¶
Continuous distributions are defined on intervals of the real line and provide the .pdf() method for the probability density function:
import scipy.stats as stats
import numpy as np
# stats.norm is an instance of rv_continuous
a = stats.norm(loc=3.0) # frozen: mean=3, std=1
samples = a.rvs(size=(2, 3), random_state=1)
print(samples)
print(type(samples)) # <class 'numpy.ndarray'>
print(samples.shape) # (2, 3)
print(samples.dtype) # float64
# Key methods: pdf, cdf, sf, ppf, isf, rvs, fit, mean, var, std, entropy
rv_discrete¶
Discrete distributions are defined on countable sets (typically non-negative integers) and provide the .pmf() method for the probability mass function:
# stats.poisson is an instance of rv_discrete
b = stats.poisson(mu=3.0) # frozen: mean=3
print(b.pmf(3)) # P(X = 3) — exact probability
print(b.cdf(5)) # P(X ≤ 5)
The key difference: discrete distributions use .pmf(k) where continuous distributions use .pdf(x). All other methods (.cdf(), .sf(), .ppf(), .rvs(), etc.) work identically.
Common Interface¶
Both rv_continuous and rv_discrete share a consistent interface through rv_generic:
| Method | Continuous | Discrete | Description |
|---|---|---|---|
| Density/Mass | .pdf(x) |
.pmf(k) |
Density or mass at a point |
| Log density | .logpdf(x) |
.logpmf(k) |
Log of density/mass (numerically stable) |
| CDF | .cdf(x) |
.cdf(k) |
\(P(X \le x)\) |
| Survival | .sf(x) |
.sf(k) |
\(P(X > x)\) |
| Quantile | .ppf(q) |
.ppf(q) |
Inverse CDF |
| Sampling | .rvs(size) |
.rvs(size) |
Random variates |
| Moments | .mean(), .var() |
.mean(), .var() |
Theoretical moments |
| Fit | .fit(data) |
— | MLE parameter estimation |
Frozen vs Unfrozen¶
The base classes support two usage patterns. In the unfrozen pattern, parameters are passed to each method call: stats.norm.pdf(0, loc=3, scale=1). In the frozen pattern, a distribution object is created once and methods are called without parameters: a = stats.norm(loc=3, scale=1); a.pdf(0). The frozen pattern is preferred for clarity and efficiency.
Summary¶
The rv_continuous and rv_discrete base classes define the unified API that makes scipy.stats distributions interchangeable. By understanding this class hierarchy, you can write generic code that works with any distribution and leverage the full suite of methods consistently.
Runnable Example: basics_distributions.py¶
"""
Tutorial 01: Introduction to scipy.stats - Basics and Probability Distributions
===============================================================================
Level: Beginner
Topics: Installing scipy, basic imports, understanding distributions,
probability density functions (PDF), cumulative distribution functions (CDF)
This module introduces the scipy.stats package and fundamental concepts
of working with probability distributions in Python.
"""
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
# =============================================================================
# SECTION 1: Introduction to scipy.stats
# =============================================================================
if __name__ == "__main__":
"""
scipy.stats is a sub-package of SciPy that provides:
- A large collection of probability distributions
- Statistical functions for descriptive statistics
- Statistical tests (hypothesis testing)
- Correlation and regression analysis
The package contains two main types of distribution objects:
1. Continuous distributions (e.g., normal, exponential, uniform)
2. Discrete distributions (e.g., binomial, Poisson, geometric)
"""
# =============================================================================
# SECTION 2: Understanding Distribution Objects
# =============================================================================
# Creating a normal distribution object
# ---------------------------------------
# The normal distribution is characterized by two parameters:
# - loc: mean (μ) - center of the distribution
# - scale: standard deviation (σ) - spread of the distribution
# Standard normal distribution (mean=0, std=1)
standard_normal = stats.norm(loc=0, scale=1)
print("Standard Normal Distribution:")
print(f" Mean: {standard_normal.mean()}") # Expected value
print(f" Variance: {standard_normal.var()}") # Variance
print(f" Standard Deviation: {standard_normal.std()}") # Standard deviation
print()
# Custom normal distribution (mean=10, std=2)
custom_normal = stats.norm(loc=10, scale=2)
print("Custom Normal Distribution (μ=10, σ=2):")
print(f" Mean: {custom_normal.mean()}")
print(f" Variance: {custom_normal.var()}")
print(f" Standard Deviation: {custom_normal.std()}")
print()
# =============================================================================
# SECTION 3: Probability Density Function (PDF)
# =============================================================================
"""
The PDF gives the relative likelihood of a continuous random variable
taking on a specific value. For a normal distribution, the PDF is:
f(x) = (1 / (σ√(2π))) * exp(-(x-μ)²/(2σ²))
Key point: PDF values are NOT probabilities (they can exceed 1)!
Probabilities for continuous distributions are computed over intervals.
"""
# Evaluate PDF at specific points
x_value = 0.0
pdf_value = standard_normal.pdf(x_value)
print(f"PDF of standard normal at x={x_value}: {pdf_value:.4f}")
# This tells us the height of the distribution curve at x=0
# Evaluate PDF at multiple points
x_values = np.array([-2, -1, 0, 1, 2])
pdf_values = standard_normal.pdf(x_values)
print(f"PDF values at x={x_values}: {pdf_values}")
print()
# Visualizing the PDF
# --------------------
# Generate 1000 points between -4 and 4
x_range = np.linspace(-4, 4, 1000)
# Calculate PDF for each point
pdf_standard = standard_normal.pdf(x_range)
pdf_custom = custom_normal.pdf(x_range)
# Create visualization
plt.figure(figsize=(12, 5))
# Plot 1: Standard Normal PDF
plt.subplot(1, 2, 1)
plt.plot(x_range, pdf_standard, 'b-', linewidth=2, label='N(0,1)')
plt.fill_between(x_range, pdf_standard, alpha=0.3)
plt.xlabel('x')
plt.ylabel('Probability Density')
plt.title('Standard Normal Distribution PDF')
plt.grid(True, alpha=0.3)
plt.legend()
# Plot 2: Comparing PDFs
plt.subplot(1, 2, 2)
x_range2 = np.linspace(0, 20, 1000)
plt.plot(x_range, pdf_standard, 'b-', linewidth=2, label='N(0,1)')
plt.plot(x_range2, custom_normal.pdf(x_range2), 'r-', linewidth=2, label='N(10,2)')
plt.xlabel('x')
plt.ylabel('Probability Density')
plt.title('Comparing Normal Distribution PDFs')
plt.grid(True, alpha=0.3)
plt.legend()
plt.tight_layout()
plt.savefig('/home/claude/scipy_stats_course/01_pdf_visualization.png', dpi=300, bbox_inches='tight')
print("Saved: 01_pdf_visualization.png")
plt.close()
# =============================================================================
# SECTION 4: Cumulative Distribution Function (CDF)
# =============================================================================
"""
The CDF gives the probability that a random variable X is less than or
equal to a value x:
F(x) = P(X ≤ x)
Properties of CDF:
- Always increases (non-decreasing)
- Ranges from 0 to 1
- F(-∞) = 0 and F(∞) = 1
"""
# Calculate CDF values
x_test = 0.0
cdf_value = standard_normal.cdf(x_test)
print(f"CDF of standard normal at x={x_test}: {cdf_value:.4f}")
# This means P(X ≤ 0) = 0.5 (50% of values are below 0)
# CDF at multiple points
x_test_values = np.array([-2, -1, 0, 1, 2])
cdf_values = standard_normal.cdf(x_test_values)
print(f"CDF values at x={x_test_values}:")
for x, cdf in zip(x_test_values, cdf_values):
print(f" P(X ≤ {x:2.0f}) = {cdf:.4f} ({cdf*100:.2f}%)")
print()
# Visualizing the CDF
# --------------------
cdf_standard = standard_normal.cdf(x_range)
cdf_custom = custom_normal.cdf(x_range2)
plt.figure(figsize=(12, 5))
# Plot 1: Standard Normal CDF
plt.subplot(1, 2, 1)
plt.plot(x_range, cdf_standard, 'b-', linewidth=2)
plt.axhline(y=0.5, color='r', linestyle='--', alpha=0.5, label='Median (CDF=0.5)')
plt.axvline(x=0, color='r', linestyle='--', alpha=0.5)
plt.xlabel('x')
plt.ylabel('Cumulative Probability')
plt.title('Standard Normal Distribution CDF')
plt.grid(True, alpha=0.3)
plt.legend()
# Plot 2: Comparing CDFs
plt.subplot(1, 2, 2)
plt.plot(x_range, cdf_standard, 'b-', linewidth=2, label='N(0,1)')
plt.plot(x_range2, cdf_custom, 'r-', linewidth=2, label='N(10,2)')
plt.xlabel('x')
plt.ylabel('Cumulative Probability')
plt.title('Comparing Normal Distribution CDFs')
plt.grid(True, alpha=0.3)
plt.legend()
plt.tight_layout()
plt.savefig('/home/claude/scipy_stats_course/01_cdf_visualization.png', dpi=300, bbox_inches='tight')
print("Saved: 01_cdf_visualization.png")
plt.close()
# =============================================================================
# SECTION 5: Computing Probabilities Over Intervals
# =============================================================================
"""
For continuous distributions, we compute probabilities over intervals:
P(a < X ≤ b) = F(b) - F(a) = CDF(b) - CDF(a)
"""
# Example: P(-1 < X ≤ 1) for standard normal
a, b = -1, 1
prob_interval = standard_normal.cdf(b) - standard_normal.cdf(a)
print(f"P({a} < X ≤ {b}) = {prob_interval:.4f} ({prob_interval*100:.2f}%)")
# This is approximately 68% (the 68-95-99.7 rule!)
# About 68% of data falls within 1 standard deviation of the mean
# Example: P(X > 2) for standard normal
x_threshold = 2
prob_above = 1 - standard_normal.cdf(x_threshold)
print(f"P(X > {x_threshold}) = {prob_above:.4f} ({prob_above*100:.2f}%)")
# Alternative using survival function (sf)
# The survival function is defined as sf(x) = 1 - cdf(x) = P(X > x)
prob_above_sf = standard_normal.sf(x_threshold)
print(f"P(X > {x_threshold}) using sf = {prob_above_sf:.4f}")
print()
# =============================================================================
# SECTION 6: Percent Point Function (Inverse CDF/Quantiles)
# =============================================================================
"""
The percent point function (PPF) is the inverse of the CDF.
Given a probability p, it returns the value x such that P(X ≤ x) = p
ppf(p) = CDF^(-1)(p)
This is used to find quantiles and percentiles.
"""
# Find the median (50th percentile)
median = standard_normal.ppf(0.5)
print(f"Median (50th percentile): {median:.4f}")
# Find the 95th percentile
percentile_95 = standard_normal.ppf(0.95)
print(f"95th percentile: {percentile_95:.4f}")
# This means 95% of values are below this value
# Find the quartiles
q1 = standard_normal.ppf(0.25) # 25th percentile
q2 = standard_normal.ppf(0.50) # 50th percentile (median)
q3 = standard_normal.ppf(0.75) # 75th percentile
print(f"Quartiles: Q1={q1:.4f}, Q2={q2:.4f}, Q3={q3:.4f}")
# Verify: CDF and PPF are inverses
p_test = 0.75
x_from_ppf = standard_normal.ppf(p_test)
p_from_cdf = standard_normal.cdf(x_from_ppf)
print(f"\nVerification: ppf({p_test}) = {x_from_ppf:.4f}")
print(f" cdf({x_from_ppf:.4f}) = {p_from_cdf:.4f}")
print()
# =============================================================================
# SECTION 7: Random Variate Generation
# =============================================================================
"""
Generate random samples from a distribution using the rvs() method.
This is useful for simulations and Monte Carlo methods.
"""
# Generate random samples
np.random.seed(42) # Set seed for reproducibility
samples = standard_normal.rvs(size=1000) # Generate 1000 random samples
print(f"Generated {len(samples)} samples from N(0,1)")
print(f"Sample mean: {np.mean(samples):.4f} (theoretical: 0.0)")
print(f"Sample std: {np.std(samples, ddof=1):.4f} (theoretical: 1.0)")
print()
# Visualize the samples
plt.figure(figsize=(14, 5))
# Plot 1: Histogram of samples vs. theoretical PDF
plt.subplot(1, 3, 1)
plt.hist(samples, bins=30, density=True, alpha=0.7, color='skyblue', edgecolor='black')
plt.plot(x_range, pdf_standard, 'r-', linewidth=2, label='Theoretical PDF')
plt.xlabel('x')
plt.ylabel('Density')
plt.title('Histogram vs. Theoretical PDF')
plt.legend()
plt.grid(True, alpha=0.3)
# Plot 2: Empirical CDF vs. Theoretical CDF
plt.subplot(1, 3, 2)
sorted_samples = np.sort(samples)
empirical_cdf = np.arange(1, len(sorted_samples) + 1) / len(sorted_samples)
plt.plot(sorted_samples, empirical_cdf, 'b-', linewidth=1, alpha=0.7, label='Empirical CDF')
plt.plot(x_range, cdf_standard, 'r-', linewidth=2, label='Theoretical CDF')
plt.xlabel('x')
plt.ylabel('Cumulative Probability')
plt.title('Empirical vs. Theoretical CDF')
plt.legend()
plt.grid(True, alpha=0.3)
# Plot 3: Q-Q plot (Quantile-Quantile plot)
plt.subplot(1, 3, 3)
stats.probplot(samples, dist="norm", plot=plt)
plt.title('Q-Q Plot (Normal)')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('/home/claude/scipy_stats_course/01_random_samples.png', dpi=300, bbox_inches='tight')
print("Saved: 01_random_samples.png")
plt.close()
# =============================================================================
# SECTION 8: Other Common Continuous Distributions
# =============================================================================
# Uniform Distribution
# ---------------------
# All values in [a, b] are equally likely
# Parameters: loc=a, scale=(b-a)
uniform = stats.uniform(loc=0, scale=10) # Uniform on [0, 10]
print("Uniform Distribution [0, 10]:")
print(f" Mean: {uniform.mean():.4f}")
print(f" Variance: {uniform.var():.4f}")
print(f" P(3 < X ≤ 7) = {uniform.cdf(7) - uniform.cdf(3):.4f}")
print()
# Exponential Distribution
# -------------------------
# Models time between events in a Poisson process
# Parameter: scale = 1/λ (where λ is the rate parameter)
exponential = stats.expon(scale=2) # Mean time = 2
print("Exponential Distribution (λ=0.5):")
print(f" Mean: {exponential.mean():.4f}")
print(f" Variance: {exponential.var():.4f}")
print(f" P(X > 3) = {exponential.sf(3):.4f}")
print()
# Student's t-Distribution
# --------------------------
# Similar to normal but with heavier tails
# Parameter: df (degrees of freedom)
t_dist = stats.t(df=5) # 5 degrees of freedom
print("Student's t-Distribution (df=5):")
print(f" Mean: {t_dist.mean():.4f}")
print(f" Variance: {t_dist.var():.4f}")
print()
# Visualize various distributions
plt.figure(figsize=(14, 5))
x_unif = np.linspace(-1, 11, 1000)
x_exp = np.linspace(0, 10, 1000)
x_t = np.linspace(-4, 4, 1000)
plt.subplot(1, 3, 1)
plt.plot(x_unif, uniform.pdf(x_unif), 'b-', linewidth=2)
plt.fill_between(x_unif, uniform.pdf(x_unif), alpha=0.3)
plt.xlabel('x')
plt.ylabel('Probability Density')
plt.title('Uniform Distribution [0, 10]')
plt.grid(True, alpha=0.3)
plt.subplot(1, 3, 2)
plt.plot(x_exp, exponential.pdf(x_exp), 'r-', linewidth=2)
plt.fill_between(x_exp, exponential.pdf(x_exp), alpha=0.3)
plt.xlabel('x')
plt.ylabel('Probability Density')
plt.title('Exponential Distribution (scale=2)')
plt.grid(True, alpha=0.3)
plt.subplot(1, 3, 3)
plt.plot(x_t, t_dist.pdf(x_t), 'g-', linewidth=2, label='t(df=5)')
plt.plot(x_range, standard_normal.pdf(x_range), 'b--', linewidth=2, label='N(0,1)')
plt.xlabel('x')
plt.ylabel('Probability Density')
plt.title("Student's t vs. Normal")
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('/home/claude/scipy_stats_course/01_various_distributions.png', dpi=300, bbox_inches='tight')
print("Saved: 01_various_distributions.png")
plt.close()
# =============================================================================
# SECTION 9: Summary Statistics from Distributions
# =============================================================================
"""
Distribution objects provide methods to compute theoretical moments
and statistics without needing to generate samples.
"""
print("Summary Statistics:")
print("-" * 50)
distributions = [
("Normal(0,1)", standard_normal),
("Normal(10,2)", custom_normal),
("Uniform[0,10]", uniform),
("Exponential(scale=2)", exponential),
]
for name, dist in distributions:
print(f"\n{name}:")
print(f" Mean: {dist.mean():.4f}")
print(f" Variance: {dist.var():.4f}")
print(f" Std Dev: {dist.std():.4f}")
# Some distributions have median and entropy methods
if hasattr(dist, 'median'):
print(f" Median: {dist.median():.4f}")
if hasattr(dist, 'entropy'):
print(f" Entropy: {dist.entropy():.4f}")
print("\n" + "="*80)
print("Tutorial 01 Complete!")
print("="*80)
print("\nKey Takeaways:")
print("1. scipy.stats provides distribution objects with consistent interfaces")
print("2. PDF shows the shape of the distribution (not probability values)")
print("3. CDF gives cumulative probabilities: P(X ≤ x)")
print("4. PPF is the inverse of CDF, used for finding quantiles")
print("5. rvs() generates random samples from the distribution")
print("6. Distribution objects have methods for mean, variance, and other statistics")