Statistical Methods

DataFrame methods for computing statistical measures.

Mental Model

Most statistical methods reduce a column to a single number: mean() for center, std() for spread, median() for a robust center, corr() for the relationship between two columns. describe() bundles the most common statistics into one call. Cumulative methods such as cumsum() are the exception: they return one value per row. All of these skip NaN by default.
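To make the NaN behavior concrete, here is a minimal sketch on a made-up Series (the values and the name `toy` are illustrative, not from the housing data). It compares the default with `skipna=False`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one missing value
s = pd.Series([1.0, 2.0, np.nan, 5.0], name='toy')

mean_skip = s.mean()                 # NaN ignored: (1 + 2 + 5) / 3
mean_strict = s.mean(skipna=False)   # NaN propagates into the result

print(mean_skip)
print(mean_strict)
```

Passing `skipna=False` is a reasonable choice when a missing value should invalidate the statistic rather than be silently dropped.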

Central Tendency

Measures of central location.

1. mean

```python
import pandas as pd

df = pd.read_csv('housing.csv')

# Single column
print(df['median_income'].mean())

# All numeric columns (numeric_only=True skips text columns such as ocean_proximity)
print(df.mean(numeric_only=True))
```

2. median

```python
print(df['median_income'].median())
print(df.median(numeric_only=True))
```

3. mode

```python
print(df['ocean_proximity'].mode())  # Most frequent value
```
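Unlike mean() and median(), mode() returns a Series, because several values can tie for most frequent. A small sketch with made-up category labels (assumed for illustration only):

```python
import pandas as pd

# Hypothetical labels with a two-way tie
s = pd.Series(['INLAND', 'NEAR BAY', 'INLAND', 'NEAR BAY', 'ISLAND'])

modes = s.mode()        # Series holding every tied value, sorted
print(modes.tolist())   # ['INLAND', 'NEAR BAY']

top = modes.iloc[0]     # pick the first one if a single value is needed
print(top)
```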

Dispersion

Measures of spread.

1. std (Standard Deviation)

```python
print(df['median_income'].std())
print(df.std(numeric_only=True))
```

2. var (Variance)

```python
print(df['median_income'].var())
print(df.var(numeric_only=True))
```

3. Range

```python
# Calculate range manually
range_val = df['median_income'].max() - df['median_income'].min()
```
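One detail worth knowing about std() and var(): pandas defaults to the sample formulas (`ddof=1`, dividing by n - 1), while NumPy's `np.std` defaults to the population formula (`ddof=0`). The sketch below uses a made-up Series to show the difference.

```python
import numpy as np
import pandas as pd

# Hypothetical values: mean is 5, squared deviations sum to 32
s = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

sample_std = s.std()              # ddof=1: sqrt(32 / 7)
population_std = s.std(ddof=0)    # ddof=0: sqrt(32 / 8) = 2.0
numpy_std = np.std(s.to_numpy())  # NumPy default is ddof=0

print(sample_std, population_std, numpy_std)
```

For large samples the two versions are nearly identical; the choice matters mostly for small ones.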

Correlation and Covariance

Relationships between columns.

1. corr (Correlation)

```python
# Correlation matrix for a subset of numeric columns
cols = ['housing_median_age', 'total_rooms', 'median_income']
print(df[cols].corr())
```

```
                    housing_median_age  total_rooms  median_income
housing_median_age            1.000000    -0.361262      -0.119034
total_rooms                  -0.361262     1.000000       0.198050
median_income                -0.119034     0.198050       1.000000
```

2. Two Columns

```python
print(df['median_income'].corr(df['median_house_value']))
```

3. cov (Covariance)

```python
print(df.cov(numeric_only=True))
```
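Correlation and covariance are linked: the Pearson correlation is the covariance divided by the product of the two standard deviations. The following sketch checks that identity on synthetic data (the series `x` and `y` are generated here purely for illustration).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = pd.Series(rng.normal(size=200))
y = 2 * x + pd.Series(rng.normal(size=200))  # y depends linearly on x, plus noise

pearson = x.corr(y)                        # default method is Pearson
from_cov = x.cov(y) / (x.std() * y.std())  # same quantity, built by hand

print(pearson, from_cov)
```

Because correlation is scale-free, it is usually easier to compare across column pairs than raw covariance.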

Quantiles

Distribution percentiles.

1. quantile

```python
# Single quantile
q1 = df['median_income'].quantile(0.25)
q2 = df['median_income'].quantile(0.50)  # Median
q3 = df['median_income'].quantile(0.75)

print(f"Q1: {q1}, Q2: {q2}, Q3: {q3}")
```

2. Multiple Quantiles

```python
quantiles = df['median_income'].quantile([0.25, 0.5, 0.75])
print(quantiles)
```

3. IQR (Interquartile Range)

```python
# q1 and q3 come from the quantile() calls above
iqr = q3 - q1
```
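A common use of the IQR is flagging outliers. The sketch below applies the conventional 1.5 × IQR fences (Tukey's rule, an assumption added here rather than something the text above prescribes) to a made-up Series.

```python
import pandas as pd

# Hypothetical values with one obvious extreme
s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

q1, q3 = s.quantile(0.25), s.quantile(0.75)  # 3.25 and 7.75
iqr = q3 - q1                                # 4.5

lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = s[(s < lower) | (s > upper)]

print(outliers.tolist())  # [100]
```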

Cumulative Methods

Running totals and products.

1. cumsum

```python
df['cumulative_sales'] = df['sales'].cumsum()
```

2. cumprod

```python
# Cumulative returns
df['cumulative_return'] = (1 + df['daily_return']).cumprod()
```

3. cummax and cummin

```python
df['running_max'] = df['price'].cummax()
df['running_min'] = df['price'].cummin()
```
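The cumulative methods are easiest to follow on a tiny input. This sketch uses made-up prices and returns (not the `sales`, `daily_return`, or `price` columns above) so each result can be checked by hand.

```python
import pandas as pd

prices = pd.Series([10, 12, 9, 15, 11])  # hypothetical prices

print(prices.cumsum().tolist())  # [10, 22, 31, 46, 57]
print(prices.cummax().tolist())  # [10, 12, 12, 15, 15]
print(prices.cummin().tolist())  # [10, 10, 9, 9, 9]

returns = pd.Series([0.10, -0.05, 0.02])  # hypothetical daily returns
growth = (1 + returns).cumprod()          # value of 1 unit invested at the start
print(growth.round(4).tolist())
```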

Summary Statistics

1. describe

```python
print(df.describe())
```

2. Custom Percentiles

```python
print(df.describe(percentiles=[0.1, 0.25, 0.5, 0.75, 0.9]))
```

3. Include All Types

```python
print(df.describe(include='all'))
```
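To see what describe() actually returns, here is a sketch on a small made-up DataFrame (the `income` and `region` columns are hypothetical). The result is itself a DataFrame, so individual statistics can be pulled out with `.loc`.

```python
import pandas as pd

toy = pd.DataFrame({
    'income': [1.0, 2.0, 3.0, 4.0],
    'region': ['a', 'b', 'a', 'a'],
})

numeric_summary = toy.describe()    # numeric columns only by default
print(list(numeric_summary.index))  # ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']
print(numeric_summary.loc['mean', 'income'])

full = toy.describe(include='all')  # adds unique/top/freq rows for text columns
print(full.loc['top', 'region'], full.loc['freq', 'region'])
```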

Financial Example

Income distribution analysis.

1. Load Data

```python
url = 'https://raw.githubusercontent.com/gedeck/practical-statistics-for-data-scientists/master/data/loans_income.csv'
loans = pd.read_csv(url)
```

2. Calculate Statistics

```python
mean_income = loans['x'].mean()
median_income = loans['x'].median()
std_income = loans['x'].std()

print(f"Mean: {mean_income:.2f}")
print(f"Median: {median_income:.2f}")
print(f"Std: {std_income:.2f}")
```

3. Outlier Detection

```python
# Values beyond 3 standard deviations
lower = mean_income - 3 * std_income
upper = mean_income + 3 * std_income

outliers = loans[(loans['x'] < lower) | (loans['x'] > upper)]
print(f"Outliers: {len(outliers)}")
```
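The same three-standard-deviation rule can be written with z-scores. This sketch runs offline on synthetic incomes with one planted extreme value (the numbers are assumptions for illustration, not the loans data).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
income = pd.Series(rng.normal(50_000, 5_000, size=1_000))  # synthetic incomes
income.iloc[0] = 200_000                                   # planted extreme value

# z-score: distance from the mean in units of standard deviation
z = (income - income.mean()) / income.std()
outliers = income[z.abs() > 3]

print(f"Outliers: {len(outliers)}")
```

Note that the mean and standard deviation are themselves pulled by extreme values, so this rule can miss outliers in heavily contaminated data; quantile-based fences are a more robust alternative.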

Normal Distribution Check

Verify 68-95-99.7 rule.

1. One Standard Deviation

```python
# Assumes df has a numeric column 'x'
mean, std = df['x'].mean(), df['x'].std()
n = len(df['x'])
within_1std = len(df['x'][(mean - std < df['x']) & (df['x'] < mean + std)])
print(f"Within 1 std: {within_1std/n*100:.1f}%")  # ~68%
```

2. Two Standard Deviations

```python
within_2std = len(df['x'][(mean - 2*std < df['x']) & (df['x'] < mean + 2*std)])
print(f"Within 2 std: {within_2std/n*100:.1f}%")  # ~95%
```

3. Three Standard Deviations

```python
within_3std = len(df['x'][(mean - 3*std < df['x']) & (df['x'] < mean + 3*std)])
print(f"Within 3 std: {within_3std/n*100:.1f}%")  # ~99.7%
```
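The three checks can be folded into one loop. The sketch below runs them on a synthetic normal sample (generated here as an assumption, so the shares should land close to 68%, 95%, and 99.7%); real data will only match if it is roughly normal.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
x = pd.Series(rng.normal(loc=0.0, scale=1.0, size=100_000))  # synthetic sample

mean, std = x.mean(), x.std()

shares = {}
for k in (1, 2, 3):
    inside = (x > mean - k * std) & (x < mean + k * std)
    shares[k] = inside.mean()  # mean of a boolean mask is the fraction of True values
    print(f"Within {k} std: {shares[k] * 100:.1f}%")
```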

Skewness and Kurtosis

Distribution shape.

1. skew

```python
print(df['median_income'].skew())

# Positive: right-skewed
# Negative: left-skewed
# Zero: symmetric
```

2. kurtosis

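```python
print(df['median_income'].kurtosis())

# pandas reports excess kurtosis (a normal distribution scores 0)
# Positive: heavier tails than a normal distribution
# Negative: lighter tails than a normal distribution
```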

3. Interpretation

```python
skewness = df['x'].skew()
if abs(skewness) < 0.5:
    print("Approximately symmetric")
elif skewness > 0:
    print("Right-skewed")
else:
    print("Left-skewed")
```
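To see how these shape measures behave, the sketch below compares two synthetic samples: one normal, one exponential (both generated here for illustration). The 0.5 threshold is the same rule of thumb used above, not a formal test.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
symmetric = pd.Series(rng.normal(size=5_000))            # skew and excess kurtosis near 0
right_skewed = pd.Series(rng.exponential(size=5_000))    # long right tail

for name, s in [('normal', symmetric), ('exponential', right_skewed)]:
    label = 'approximately symmetric' if abs(s.skew()) < 0.5 else 'skewed'
    print(f"{name}: skew={s.skew():.2f}, kurtosis={s.kurtosis():.2f} ({label})")
```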


Exercises

Exercise 1. Create a DataFrame with columns 'A' and 'B' containing 1000 random values. Compute the mean, median, and standard deviation for each column. Verify that the median equals the 0.5 quantile using .quantile(0.5).

Solution to Exercise 1

Compute central tendency measures and verify quantile.

```python
import pandas as pd
import numpy as np

np.random.seed(42)
df = pd.DataFrame({
    'A': np.random.randn(1000),
    'B': np.random.randn(1000) * 2 + 5
})
print("Mean:\n", df.mean())
print("Median:\n", df.median())
print("Std:\n", df.std())
# Compare with a tolerance: the two code paths can differ by floating-point rounding
matches = np.allclose(df.median(), df.quantile(0.5))
print("Median equals 0.5 quantile:", matches)
```

Exercise 2. Given a numeric Series, use .cumsum() to compute the cumulative sum and .cummax() to compute the running maximum. Create a new column that shows the difference between the running maximum and the current value (a "drawdown" measure).

Solution to Exercise 2

Compute cumulative sum, running max, and drawdown.

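```python
import pandas as pd
import numpy as np

np.random.seed(42)
# Simulated price path: a random walk built from cumulative random steps
s = pd.Series(np.random.randn(20).cumsum(), name='price')
df = pd.DataFrame({'price': s})
df['cumsum'] = df['price'].cumsum()
df['running_max'] = df['price'].cummax()
# Drawdown: how far the current value sits below the running maximum
df['drawdown'] = df['running_max'] - df['price']
print(df)
```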

Exercise 3. Create a DataFrame with two numeric columns. Compute the correlation between them using .corr(). Then compute skewness and kurtosis for one column and interpret whether the distribution is approximately symmetric.

Solution to Exercise 3

Compute correlation, skewness, and kurtosis.

```python
import pandas as pd
import numpy as np

np.random.seed(42)
df = pd.DataFrame({
    'x': np.random.randn(500),
    'y': np.random.randn(500) + 1
})
print("Correlation:\n", df.corr())
skew = df['x'].skew()
kurt = df['x'].kurtosis()
print(f"Skewness of x: {skew:.4f}")
print(f"Kurtosis of x: {kurt:.4f}")
if abs(skew) < 0.5:
    print("Distribution is approximately symmetric")
else:
    print("Distribution is skewed")
```