Statistical Methods¶

DataFrame methods for computing statistical measures.

Mental Model

Statistical methods reduce a column to a single number: mean() for center, std() for spread, median() for a robust center, corr() for pairwise relationships. describe() bundles the most common statistics into one call. All of these skip NaN by default.

Central Tendency¶

Measures of central location.

1. mean¶

```python import pandas as pd

df = pd.read_csv('housing.csv')

Single column¶

print(df['median_income'].mean())

All numeric columns¶

print(df.mean()) ```

2. median¶

python print(df['median_income'].median()) print(df.median())

3. mode¶

python print(df['ocean_proximity'].mode()) # Most frequent value

Dispersion¶

Measures of spread.

1. std (Standard Deviation)¶

python print(df['median_income'].std()) print(df.std())

2. var (Variance)¶

python print(df['median_income'].var()) print(df.var())

3. Range¶

```python

Calculate range manually¶

range_val = df['median_income'].max() - df['median_income'].min() ```

Correlation and Covariance¶

Relationships between columns.

1. corr (Correlation)¶

```python

Correlation matrix¶

print(df.corr()) ```

housing_median_age total_rooms median_income housing_median_age 1.000000 -0.361262 -0.119034 total_rooms -0.361262 1.000000 0.198050 median_income -0.119034 0.198050 1.000000

2. Two Columns¶

python print(df['median_income'].corr(df['median_house_value']))

3. cov (Covariance)¶

python print(df.cov())

Quantiles¶

Distribution percentiles.

1. quantile¶

```python

Single quantile¶

q1 = df['median_income'].quantile(0.25) q2 = df['median_income'].quantile(0.50) # Median q3 = df['median_income'].quantile(0.75)

print(f"Q1: {q1}, Q2: {q2}, Q3: {q3}") ```

2. Multiple Quantiles¶

python quantiles = df['median_income'].quantile([0.25, 0.5, 0.75]) print(quantiles)

3. IQR (Interquartile Range)¶

python iqr = q3 - q1

Cumulative Methods¶

Running totals and products.

1. cumsum¶

python df['cumulative_sales'] = df['sales'].cumsum()

2. cumprod¶

```python

Cumulative returns¶

df['cumulative_return'] = (1 + df['daily_return']).cumprod() ```

3. cummax and cummin¶

python df['running_max'] = df['price'].cummax() df['running_min'] = df['price'].cummin()

Summary Statistics¶

1. describe¶

python print(df.describe())

2. Custom Percentiles¶

python print(df.describe(percentiles=[0.1, 0.25, 0.5, 0.75, 0.9]))

3. Include All Types¶

python print(df.describe(include='all'))

Financial Example¶

Income distribution analysis.

1. Load Data¶

python url = 'https://raw.githubusercontent.com/gedeck/practical-statistics-for-data-scientists/master/data/loans_income.csv' loans = pd.read_csv(url)

2. Calculate Statistics¶

```python mean_income = loans['x'].mean() median_income = loans['x'].median() std_income = loans['x'].std()

print(f"Mean: {mean_income:.2f}") print(f"Median: {median_income:.2f}") print(f"Std: {std_income:.2f}") ```

3. Outlier Detection¶

```python

Values beyond 3 standard deviations¶

lower = mean_income - 3 * std_income upper = mean_income + 3 * std_income

outliers = loans[(loans['x'] < lower) | (loans['x'] > upper)] print(f"Outliers: {len(outliers)}") ```

Normal Distribution Check¶

Verify 68-95-99.7 rule.

1. One Standard Deviation¶

python n = len(df['x']) within_1std = len(df['x'][(mean - std < df['x']) & (df['x'] < mean + std)]) print(f"Within 1 std: {within_1std/n*100:.1f}%") # ~68%

2. Two Standard Deviations¶

python within_2std = len(df['x'][(mean - 2*std < df['x']) & (df['x'] < mean + 2*std)]) print(f"Within 2 std: {within_2std/n*100:.1f}%") # ~95%

3. Three Standard Deviations¶

python within_3std = len(df['x'][(mean - 3*std < df['x']) & (df['x'] < mean + 3*std)]) print(f"Within 3 std: {within_3std/n*100:.1f}%") # ~99.7%

Skewness and Kurtosis¶

Distribution shape.

1. skew¶

```python print(df['median_income'].skew())

Positive: right-skewed¶

Negative: left-skewed¶

Zero: symmetric¶

```

2. kurtosis¶

```python print(df['median_income'].kurtosis())

High: heavy tails¶

Low: light tails¶

```

3. Interpretation¶

python skewness = df['x'].skew() if abs(skewness) < 0.5: print("Approximately symmetric") elif skewness > 0: print("Right-skewed") else: print("Left-skewed")

Exercises¶

Exercise 1. Create a DataFrame with columns 'A' and 'B' containing 1000 random values. Compute the mean, median, and standard deviation for each column. Verify that the median equals the 0.5 quantile using .quantile(0.5).

Solution to Exercise 1

Compute central tendency measures and verify quantile.

import pandas as pd
import numpy as np

np.random.seed(42)
df = pd.DataFrame({
    'A': np.random.randn(1000),
    'B': np.random.randn(1000) * 2 + 5
})
print("Mean:\n", df.mean())
print("Median:\n", df.median())
print("Std:\n", df.std())
assert df['A'].median() == df['A'].quantile(0.5)
print("Median equals 0.5 quantile: True")

Exercise 2. Given a numeric Series, use .cumsum() to compute the cumulative sum and .cummax() to compute the running maximum. Create a new column that shows the difference between the running maximum and the current value (a "drawdown" measure).

Solution to Exercise 2

Compute cumulative sum, running max, and drawdown.

import pandas as pd
import numpy as np

np.random.seed(42)
s = pd.Series(np.random.randn(20).cumsum(), name='price')
df = pd.DataFrame({'price': s})
df['cumsum'] = df['price'].cumsum()
df['running_max'] = df['price'].cummax()
df['drawdown'] = df['running_max'] - df['price']
print(df)

Exercise 3. Create a DataFrame with two numeric columns. Compute the correlation between them using .corr(). Then compute skewness and kurtosis for one column and interpret whether the distribution is approximately symmetric.

Solution to Exercise 3

Compute correlation, skewness, and kurtosis.

import pandas as pd
import numpy as np

np.random.seed(42)
df = pd.DataFrame({
    'x': np.random.randn(500),
    'y': np.random.randn(500) + 1
})
print("Correlation:\n", df.corr())
skew = df['x'].skew()
kurt = df['x'].kurtosis()
print(f"Skewness of x: {skew:.4f}")
print(f"Kurtosis of x: {kurt:.4f}")
if abs(skew) < 0.5:
    print("Distribution is approximately symmetric")
else:
    print("Distribution is skewed")