Statistical Methods¶
DataFrame methods for computing statistical measures.
Central Tendency¶
Measures of central location.
1. mean¶
import pandas as pd
df = pd.read_csv('housing.csv')
# Single column
print(df['median_income'].mean())
# All numeric columns
print(df.mean())
2. median¶
print(df['median_income'].median())
print(df.median())
3. mode¶
print(df['ocean_proximity'].mode()) # Most frequent value
Dispersion¶
Measures of spread.
1. std (Standard Deviation)¶
print(df['median_income'].std())
print(df.std())
2. var (Variance)¶
print(df['median_income'].var())
print(df.var())
3. Range¶
# Calculate range manually
range_val = df['median_income'].max() - df['median_income'].min()
Correlation and Covariance¶
Relationships between columns.
1. corr (Correlation)¶
# Correlation matrix
print(df.corr())
housing_median_age total_rooms median_income
housing_median_age 1.000000 -0.361262 -0.119034
total_rooms -0.361262 1.000000 0.198050
median_income -0.119034 0.198050 1.000000
2. Two Columns¶
print(df['median_income'].corr(df['median_house_value']))
3. cov (Covariance)¶
print(df.cov())
Quantiles¶
Distribution percentiles.
1. quantile¶
# Single quantile
q1 = df['median_income'].quantile(0.25)
q2 = df['median_income'].quantile(0.50) # Median
q3 = df['median_income'].quantile(0.75)
print(f"Q1: {q1}, Q2: {q2}, Q3: {q3}")
2. Multiple Quantiles¶
quantiles = df['median_income'].quantile([0.25, 0.5, 0.75])
print(quantiles)
3. IQR (Interquartile Range)¶
iqr = q3 - q1
Cumulative Methods¶
Running totals and products.
1. cumsum¶
df['cumulative_sales'] = df['sales'].cumsum()
2. cumprod¶
# Cumulative returns
df['cumulative_return'] = (1 + df['daily_return']).cumprod()
3. cummax and cummin¶
df['running_max'] = df['price'].cummax()
df['running_min'] = df['price'].cummin()
Summary Statistics¶
1. describe¶
print(df.describe())
2. Custom Percentiles¶
print(df.describe(percentiles=[0.1, 0.25, 0.5, 0.75, 0.9]))
3. Include All Types¶
print(df.describe(include='all'))
Financial Example¶
Income distribution analysis.
1. Load Data¶
url = 'https://raw.githubusercontent.com/gedeck/practical-statistics-for-data-scientists/master/data/loans_income.csv'
loans = pd.read_csv(url)
2. Calculate Statistics¶
mean_income = loans['x'].mean()
median_income = loans['x'].median()
std_income = loans['x'].std()
print(f"Mean: {mean_income:.2f}")
print(f"Median: {median_income:.2f}")
print(f"Std: {std_income:.2f}")
3. Outlier Detection¶
# Values beyond 3 standard deviations
lower = mean_income - 3 * std_income
upper = mean_income + 3 * std_income
outliers = loans[(loans['x'] < lower) | (loans['x'] > upper)]
print(f"Outliers: {len(outliers)}")
Normal Distribution Check¶
Verify 68-95-99.7 rule.
1. One Standard Deviation¶
n = len(df['x'])
within_1std = len(df['x'][(mean - std < df['x']) & (df['x'] < mean + std)])
print(f"Within 1 std: {within_1std/n*100:.1f}%") # ~68%
2. Two Standard Deviations¶
within_2std = len(df['x'][(mean - 2*std < df['x']) & (df['x'] < mean + 2*std)])
print(f"Within 2 std: {within_2std/n*100:.1f}%") # ~95%
3. Three Standard Deviations¶
within_3std = len(df['x'][(mean - 3*std < df['x']) & (df['x'] < mean + 3*std)])
print(f"Within 3 std: {within_3std/n*100:.1f}%") # ~99.7%
Skewness and Kurtosis¶
Distribution shape.
1. skew¶
print(df['median_income'].skew())
# Positive: right-skewed
# Negative: left-skewed
# Zero: symmetric
2. kurtosis¶
print(df['median_income'].kurtosis())
# High: heavy tails
# Low: light tails
3. Interpretation¶
skewness = df['x'].skew()
if abs(skewness) < 0.5:
print("Approximately symmetric")
elif skewness > 0:
print("Right-skewed")
else:
print("Left-skewed")