Skip to content

boxplot() Method

The boxplot() method creates box-and-whisker plots that summarize the distribution of numeric data, showing median, quartiles, and outliers.

Mental Model

A box plot is a five-number summary in visual form: minimum, Q1, median, Q3, maximum, with dots for outliers beyond 1.5 x IQR. It reveals center, spread, skew, and outliers at a glance. Use the by parameter to compare distributions across categories side-by-side.

Anatomy of a Box Plot

┌─────────┐ │ │ ← Maximum (or upper fence) │ ○ │ ← Outliers (beyond 1.5×IQR) │ │ │ ────┼────┼────┼── ← Q3 (75th percentile) │ │ │ │ │ │ ← IQR (Interquartile Range) │ │ │ ────┼────┼────┼── ← Median (Q2, 50th percentile) │ │ │ │ │ │ ────┼────┼────┼── ← Q1 (25th percentile) │ │ │ │ │ │ └────┴────┘ ← Minimum (or lower fence)

Basic Usage

Single Column

```python import pandas as pd import matplotlib.pyplot as plt

url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv" df = pd.read_csv(url)

fig, ax = plt.subplots(figsize=(5, 4)) df.boxplot(column='Age', ax=ax) ax.set_title('Age Distribution') plt.show() ```

Multiple Columns

python fig, ax = plt.subplots(figsize=(10, 5)) df.boxplot(column=['Age', 'Fare'], ax=ax) ax.set_title('Age and Fare Distributions') plt.show()

Key Parameters

column - Select Columns

```python

Single column

df.boxplot(column='Age')

Multiple columns

df.boxplot(column=['Age', 'Fare', 'SibSp']) ```

by - Group by Category

Create separate box plots for each category:

python fig, ax = plt.subplots(figsize=(10, 5)) df.boxplot(column='Age', by='Pclass', ax=ax) ax.set_title('Age Distribution by Passenger Class') plt.suptitle('') # Remove automatic title plt.show()

vert - Orientation

```python

Vertical (default)

df.boxplot(column='Age', vert=True)

Horizontal

fig, ax = plt.subplots(figsize=(8, 3)) df['Age'].plot(kind='box', vert=False, ax=ax) ax.set_title('Horizontal Box Plot') plt.show() ```

figsize - Figure Size

python df.boxplot(column='Age', figsize=(8, 5))

Practical Example: Titanic Passenger Ages

```python url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv" df = pd.read_csv(url, index_col='PassengerId')

fig, ax = plt.subplots(figsize=(8, 4)) df['Age'].plot(kind='box', ax=ax, vert=False) ax.set_title('Horizontal Boxplot of Passenger Ages on Titanic') ax.set_xlabel('Age') ax.spines[['top', 'left', 'right']].set_visible(False) plt.show() ```

Grouped Box Plots

By Single Category

python fig, ax = plt.subplots(figsize=(10, 5)) df.boxplot(column='Fare', by='Pclass', ax=ax) ax.set_ylabel('Fare') plt.suptitle('') ax.set_title('Fare Distribution by Class') plt.show()

By Multiple Categories

python fig, ax = plt.subplots(figsize=(12, 5)) df.boxplot(column='Age', by=['Pclass', 'Survived'], ax=ax) plt.suptitle('') ax.set_title('Age by Class and Survival') plt.xticks(rotation=45) plt.tight_layout() plt.show()

Comparing Multiple Variables

```python url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/drinks.csv' df = pd.read_csv(url)

fig, ax = plt.subplots(figsize=(10, 5)) df[['beer_servings', 'spirit_servings', 'wine_servings']].boxplot(ax=ax) ax.set_title('Alcohol Servings by Type') ax.set_ylabel('Servings') plt.show() ```

Customization

Using Return Value

```python fig, ax = plt.subplots(figsize=(8, 5)) bp = df.boxplot(column='Age', ax=ax, return_type='dict')

Customize colors

for box in bp['boxes']: box.set_color('steelblue') for whisker in bp['whiskers']: whisker.set_color('gray') for cap in bp['caps']: cap.set_color('gray') for median in bp['medians']: median.set_color('red') median.set_linewidth(2)

plt.show() ```

Grid and Appearance

python fig, ax = plt.subplots(figsize=(8, 5)) df.boxplot( column='Age', ax=ax, grid=False, notch=True, # Confidence interval notch patch_artist=True # Enable fill color ) plt.show()

boxplot() vs plot(kind='box')

Both methods create box plots, but with slightly different interfaces:

```python

Using boxplot()

df.boxplot(column='Age') df.boxplot(column='Age', by='Pclass')

Using plot(kind='box')

df['Age'].plot(kind='box') df[['Age', 'Fare']].plot(kind='box') ```

Feature boxplot() plot(kind='box')
by parameter ✅ Direct support ❌ Manual grouping
Multiple columns column=list Select columns first
return_type ✅ Supported ❌ Not supported
Method location DataFrame only Series and DataFrame

Interpreting Box Plots

```python

Get the statistics shown in box plot

stats = df['Age'].describe() print(stats) ```

count 714.000000 mean 29.699118 std 14.526497 min 0.420000 25% 20.125000 ← Q1 (box bottom) 50% 28.000000 ← Median (line in box) 75% 38.000000 ← Q3 (box top) max 80.000000

Identifying Outliers

Outliers are points beyond:

  • Upper fence: Q3 + 1.5 × IQR
  • Lower fence: Q1 - 1.5 × IQR

```python Q1 = df['Age'].quantile(0.25) Q3 = df['Age'].quantile(0.75) IQR = Q3 - Q1

lower_fence = Q1 - 1.5 * IQR upper_fence = Q3 + 1.5 * IQR

outliers = df[(df['Age'] < lower_fence) | (df['Age'] > upper_fence)] print(f"Number of outliers: {len(outliers)}") ```

Method Signature

python DataFrame.boxplot( column=None, # Column(s) to plot by=None, # Group by column ax=None, # Matplotlib axes fontsize=None, # Tick label font size rot=0, # Tick label rotation grid=True, # Show grid figsize=None, # Figure size layout=None, # Subplot layout return_type=None, # Return type ('axes', 'dict', 'both') **kwargs # Additional kwargs )

Summary

```python

Single column

df.boxplot(column='Age')

Multiple columns

df.boxplot(column=['Age', 'Fare'])

Grouped by category

df.boxplot(column='Age', by='Pclass')

Horizontal (via plot)

df['Age'].plot(kind='box', vert=False)

Customized

df.boxplot(column='Age', grid=False, notch=True) ```


Exercises

Exercise 1. Write code that creates a box plot from a DataFrame using df.plot.box() or df.boxplot().

Solution to Exercise 1

```python import pandas as pd import numpy as np

Solution for the specific exercise

np.random.seed(42) df = pd.DataFrame({'A': np.random.randn(10), 'B': np.random.randn(10)}) print(df.head()) ```


Exercise 2. Explain what the box, whiskers, and outlier points represent in a box plot.

Solution to Exercise 2

See the main content for the detailed explanation. The key concept involves understanding the Pandas API and its behavior for this specific operation.


Exercise 3. Create a grouped box plot using df.boxplot(by='group_column') to compare distributions across groups.

Solution to Exercise 3

```python import pandas as pd import numpy as np

np.random.seed(42) df = pd.DataFrame({'A': np.random.randn(20), 'B': np.random.randn(20)}) result = df.describe() print(result) ```


Exercise 4. Write code that customizes a box plot by changing colors, whisker style, and outlier markers.

Solution to Exercise 4

```python import pandas as pd import numpy as np

np.random.seed(42) df = pd.DataFrame({'A': np.random.randn(50), 'group': np.random.choice(['X', 'Y'], 50)}) result = df.groupby('group').mean() print(result) ```