boxplot() Method¶
The boxplot() method creates box-and-whisker plots that summarize the distribution of numeric data, showing median, quartiles, and outliers.
Mental Model
A box plot is a five-number summary in visual form: minimum, Q1, median, Q3, and maximum. By default the whiskers stop at the most extreme data points within 1.5 × IQR of the box, and any points beyond that range are drawn individually as outliers. The plot reveals center, spread, skew, and outliers at a glance. Use the by parameter to compare distributions across categories side by side.
Anatomy of a Box Plot¶
          ○            ← Outliers (beyond 1.5 × IQR from the box)
       ───┬───         ← Maximum (or largest value within the upper fence)
          │
     ┌────┴────┐       ← Q3 (75th percentile)
     │         │
     │         │       ← IQR (interquartile range)
     │         │
     ├─────────┤       ← Median (Q2, 50th percentile)
     │         │
     │         │
     └────┬────┘       ← Q1 (25th percentile)
          │
       ───┴───         ← Minimum (or smallest value within the lower fence)
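To make the difference between a fence and a whisker end concrete, here is a minimal sketch using `matplotlib.cbook.boxplot_stats`, the helper that computes the statistics a Matplotlib box plot draws. The sample values are made up for illustration and are not from the Titanic data.

```python
import numpy as np
from matplotlib.cbook import boxplot_stats

# Small made-up sample with one obvious extreme value
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])

# boxplot_stats returns one dict of statistics per dataset
stats = boxplot_stats(data, whis=1.5)[0]

# The fence is a threshold; the whisker ends at the last data point inside it
upper_fence = stats["q3"] + 1.5 * stats["iqr"]

print("Q1, median, Q3:", stats["q1"], stats["med"], stats["q3"])
print("Whisker ends:  ", stats["whislo"], stats["whishi"])
print("Upper fence:   ", upper_fence)
print("Outliers:      ", stats["fliers"])
```

Here the upper fence is 14.5, but the upper whisker ends at 9 because that is the largest observation inside the fence; the value 50 is reported as an outlier.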
Basic Usage¶
Single Column¶
```python
import pandas as pd
import matplotlib.pyplot as plt

url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
df = pd.read_csv(url)

fig, ax = plt.subplots(figsize=(5, 4))
df.boxplot(column='Age', ax=ax)
ax.set_title('Age Distribution')
plt.show()
```
Multiple Columns¶
```python
fig, ax = plt.subplots(figsize=(10, 5))
df.boxplot(column=['Age', 'Fare'], ax=ax)
ax.set_title('Age and Fare Distributions')
plt.show()
```
Key Parameters¶
column - Select Columns¶
```python
# Single column
df.boxplot(column='Age')

# Multiple columns
df.boxplot(column=['Age', 'Fare', 'SibSp'])
```
by - Group by Category¶
Create separate box plots for each category:
```python
fig, ax = plt.subplots(figsize=(10, 5))
df.boxplot(column='Age', by='Pclass', ax=ax)
ax.set_title('Age Distribution by Passenger Class')
plt.suptitle('')  # Remove the automatic "Boxplot grouped by ..." title
plt.show()
```
vert - Orientation¶
```python
# Vertical (default)
df.boxplot(column='Age', vert=True)

# Horizontal
fig, ax = plt.subplots(figsize=(8, 3))
df['Age'].plot(kind='box', vert=False, ax=ax)
ax.set_title('Horizontal Box Plot')
plt.show()
```
figsize - Figure Size¶
```python
df.boxplot(column='Age', figsize=(8, 5))
```
Practical Example: Titanic Passenger Ages¶
```python
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
df = pd.read_csv(url, index_col='PassengerId')

fig, ax = plt.subplots(figsize=(8, 4))
df['Age'].plot(kind='box', ax=ax, vert=False)
ax.set_title('Horizontal Boxplot of Passenger Ages on Titanic')
ax.set_xlabel('Age')
ax.spines[['top', 'left', 'right']].set_visible(False)
plt.show()
```
Grouped Box Plots¶
By Single Category¶
```python
fig, ax = plt.subplots(figsize=(10, 5))
df.boxplot(column='Fare', by='Pclass', ax=ax)
ax.set_ylabel('Fare')
plt.suptitle('')
ax.set_title('Fare Distribution by Class')
plt.show()
```
By Multiple Categories¶
```python
fig, ax = plt.subplots(figsize=(12, 5))
df.boxplot(column='Age', by=['Pclass', 'Survived'], ax=ax)
plt.suptitle('')
ax.set_title('Age by Class and Survival')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
```
Comparing Multiple Variables¶
```python
url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/drinks.csv'
# Use a separate name so the Titanic df remains available to later examples
drinks = pd.read_csv(url)

fig, ax = plt.subplots(figsize=(10, 5))
drinks[['beer_servings', 'spirit_servings', 'wine_servings']].boxplot(ax=ax)
ax.set_title('Alcohol Servings by Type')
ax.set_ylabel('Servings')
plt.show()
```
Customization¶
Using Return Value¶
```python
fig, ax = plt.subplots(figsize=(8, 5))
bp = df.boxplot(column='Age', ax=ax, return_type='dict')

# Customize colors
for box in bp['boxes']:
    box.set_color('steelblue')
for whisker in bp['whiskers']:
    whisker.set_color('gray')
for cap in bp['caps']:
    cap.set_color('gray')
for median in bp['medians']:
    median.set_color('red')
    median.set_linewidth(2)

plt.show()
```
Grid and Appearance¶
```python
fig, ax = plt.subplots(figsize=(8, 5))
df.boxplot(
    column='Age',
    ax=ax,
    grid=False,
    notch=True,         # Confidence interval notch around the median
    patch_artist=True   # Draw boxes as filled patches (enables fill color)
)
plt.show()
```
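As a rough sketch of what patch_artist=True changes: the boxes are then drawn as filled patches instead of lines, so a fill color can be set on them through the dict returned with return_type='dict'. The `demo` DataFrame and its random values are hypothetical stand-ins for real data.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical data, used only to have something to draw
rng = np.random.default_rng(0)
demo = pd.DataFrame({"value": rng.normal(size=100)})

fig, ax = plt.subplots(figsize=(4, 4))
bp = demo.boxplot(
    column="value",
    ax=ax,
    grid=False,
    patch_artist=True,    # boxes become filled patches
    return_type="dict",   # gives access to the drawn artists
)

# With patch_artist=True each box has separate face and edge colors
for box in bp["boxes"]:
    box.set_facecolor("lightsteelblue")
    box.set_edgecolor("steelblue")
```

Call plt.show() afterwards to display the figure.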
boxplot() vs plot(kind='box')¶
Both methods create box plots, but with slightly different interfaces:
```python
# Using boxplot()
df.boxplot(column='Age')
df.boxplot(column='Age', by='Pclass')

# Using plot(kind='box')
df['Age'].plot(kind='box')
df[['Age', 'Fare']].plot(kind='box')
```
| Feature | boxplot() | plot(kind='box') |
|---|---|---|
| by parameter | ✅ Direct support | ✅ Supported since pandas 1.4 (silently ignored in earlier versions) |
| Multiple columns | column=list | Select columns first |
| return_type | ✅ Supported | ✅ Accepted as a keyword argument |
| Method location | DataFrame only | Series and DataFrame |
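Two points from the comparison can be checked with a short sketch: that boxplot() exists on DataFrame but not on Series, and how groups can be laid out by hand (one column per group) when a by argument is not available. The `demo` data and its column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical long-format data: one row per observation
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "group": np.repeat(["A", "B", "C"], 20),
    "value": rng.normal(size=60),
})

# boxplot() is defined on DataFrame but not on Series
print(hasattr(pd.DataFrame, "boxplot"), hasattr(pd.Series, "boxplot"))

# Manual grouping: reshape to one column per group, then draw a plain box plot.
# Each column is mostly NaN after the pivot; missing values are dropped per column.
wide = demo.pivot(columns="group", values="value")
ax = wide.plot(kind="box", title="value by group (manual grouping)")
```

The pivot keeps the original row index, so no aggregation takes place; it only spreads the values across one column per group.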
Interpreting Box Plots¶
```python
# Get the statistics shown in the box plot
stats = df['Age'].describe()
print(stats)
```
count 714.000000
mean 29.699118
std 14.526497
min 0.420000
25% 20.125000 ← Q1 (box bottom)
50% 28.000000 ← Median (line in box)
75% 38.000000 ← Q3 (box top)
max 80.000000
Identifying Outliers¶
Outliers are points beyond:
- Upper fence: Q3 + 1.5 × IQR
- Lower fence: Q1 - 1.5 × IQR
```python
Q1 = df['Age'].quantile(0.25)
Q3 = df['Age'].quantile(0.75)
IQR = Q3 - Q1

lower_fence = Q1 - 1.5 * IQR
upper_fence = Q3 + 1.5 * IQR

outliers = df[(df['Age'] < lower_fence) | (df['Age'] > upper_fence)]
print(f"Number of outliers: {len(outliers)}")
```
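As a worked example, the fence formulas can be evaluated by hand using the quartiles from the describe() output shown earlier (Q1 = 20.125, Q3 = 38.0). This sketch only repeats that arithmetic and does not load the dataset.

```python
# Quartiles copied from the describe() output for Age
q1, q3 = 20.125, 38.0

iqr = q3 - q1                  # interquartile range
lower_fence = q1 - 1.5 * iqr   # values below this would be outliers
upper_fence = q3 + 1.5 * iqr   # values above this are outliers

print(iqr, lower_fence, upper_fence)  # 17.875 -6.6875 64.8125
```

Because the lower fence is negative and the minimum age is 0.42, no low outliers are possible here; only ages above 64.8125 (the maximum is 80) are drawn as outlier points.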
Method Signature¶
```python
DataFrame.boxplot(
    column=None,       # Column name or list of names to plot
    by=None,           # Column(s) to group by
    ax=None,           # Matplotlib axes to draw on
    fontsize=None,     # Tick label font size
    rot=0,             # Tick label rotation in degrees
    grid=True,         # Show grid
    figsize=None,      # Figure size (width, height) in inches
    layout=None,       # Subplot layout (rows, columns)
    return_type=None,  # Return type ('axes', 'dict', 'both')
    backend=None,      # Plotting backend (defaults to matplotlib)
    **kwargs           # Passed on to matplotlib's boxplot
)
```
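The three return_type values can be compared with a small sketch. The `demo` DataFrame is hypothetical, and the example assumes no by argument is used (grouped plots return different structures).

```python
import numpy as np
import pandas as pd
from matplotlib.axes import Axes

# Hypothetical data with two numeric columns
rng = np.random.default_rng(2)
demo = pd.DataFrame({"A": rng.normal(size=50), "B": rng.normal(size=50)})

as_axes = demo.boxplot(return_type="axes")  # the Matplotlib Axes
as_dict = demo.boxplot(return_type="dict")  # dict of drawn artists
as_both = demo.boxplot(return_type="both")  # namedtuple with .ax and .lines

print(type(as_axes).__name__)
print(sorted(as_dict))        # keys such as 'boxes', 'medians', 'whiskers'
print(type(as_both.ax).__name__, type(as_both.lines).__name__)
```

Use 'axes' when only labels and titles need adjusting, and 'dict' or 'both' when individual lines (boxes, medians, whiskers) need restyling.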
Summary¶
```python
# Single column
df.boxplot(column='Age')

# Multiple columns
df.boxplot(column=['Age', 'Fare'])

# Grouped by category
df.boxplot(column='Age', by='Pclass')

# Horizontal (via plot)
df['Age'].plot(kind='box', vert=False)

# Customized
df.boxplot(column='Age', grid=False, notch=True)
```
Exercises¶
Exercise 1. Write code that creates a box plot from a DataFrame using df.plot.box() or df.boxplot().
Solution to Exercise 1
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Two columns of random data, drawn as one box per column
np.random.seed(42)
df = pd.DataFrame({'A': np.random.randn(10), 'B': np.random.randn(10)})

df.boxplot()  # equivalent: df.plot.box()
plt.show()
```
Exercise 2. Explain what the box, whiskers, and outlier points represent in a box plot.
Solution to Exercise 2
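The box spans Q1 to Q3, so its length is the IQR and it contains the middle 50% of the data; the line inside the box marks the median. The whiskers extend from the box to the most extreme data points that lie within 1.5 × IQR of Q1 and Q3. Points beyond the whiskers are drawn individually as outliers.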
Exercise 3. Create a grouped box plot using df.boxplot(by='group_column') to compare distributions across groups.
Solution to Exercise 3
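```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)
df = pd.DataFrame({'value': np.random.randn(60), 'group': np.random.choice(['X', 'Y', 'Z'], 60)})

# One box per category in 'group'
df.boxplot(column='value', by='group')
plt.suptitle('')  # Remove the automatic grouped-by title
plt.show()
```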
Exercise 4. Write code that customizes a box plot by changing colors, whisker style, and outlier markers.
Solution to Exercise 4
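```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)
df = pd.DataFrame({'A': np.random.randn(50), 'group': np.random.choice(['X', 'Y'], 50)})

# Style keywords are passed through to matplotlib's boxplot
df.boxplot(
    column='A', grid=False,
    boxprops=dict(color='steelblue'),                    # box color
    whiskerprops=dict(linestyle='--', color='gray'),     # whisker style
    flierprops=dict(marker='x', markeredgecolor='red'),  # outlier markers
)
plt.show()
```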