hist() Method
The hist() method draws histograms for the numeric columns of a DataFrame. Unlike plot(kind='hist'), which overlays every column on a single axes by default, hist() draws one subplot per column and arranges the subplots in a grid.
Basic Usage
DataFrame.hist()
Creates histograms for all numeric columns:
import pandas as pd
import matplotlib.pyplot as plt
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
df = pd.read_csv(url)
# Histogram for all numeric columns
df.hist(figsize=(12, 8))
plt.tight_layout()
plt.show()
Series.hist()
df['Age'].hist(bins=20)
plt.show()
Key Parameters
bins - Number of Bins
fig, axes = plt.subplots(1, 3, figsize=(12, 3))
df['Age'].hist(bins=10, ax=axes[0])
axes[0].set_title('10 bins')
df['Age'].hist(bins=30, ax=axes[1])
axes[1].set_title('30 bins')
df['Age'].hist(bins=50, ax=axes[2])
axes[2].set_title('50 bins')
plt.tight_layout()
plt.show()
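bins also accepts a sequence of explicit bin edges instead of a count, which helps when the intervals should fall on round values. The following is a minimal sketch, assuming synthetic ages (not the Titanic data) and the non-interactive Agg backend so that it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic ages in [0, 80), used only for illustration
rng = np.random.default_rng(0)
ages = pd.Series(rng.integers(0, 80, size=500), name="Age")

# Nine edges define eight decade-wide bins: 0-10, 10-20, ..., 70-80
edges = np.arange(0, 90, 10)

fig, ax = plt.subplots(figsize=(6, 3))
ages.hist(bins=edges, edgecolor="black", ax=ax)
ax.set_xlabel("Age")

n_bars = len(ax.patches)  # one bar per bin
```

A sequence of n edges always yields n - 1 bars, so the bin boundaries stay fixed no matter how the data are distributed.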
column - Specific Column
df.hist(column='Age', bins=20)
plt.show()
by - Group by Category
Create separate histograms for each category:
url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/drinks.csv'
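# This file encodes North America as "NA"; keep_default_na=False stops read_csv from parsing it as missing
df = pd.read_csv(url, keep_default_na=False)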
# Histogram of beer servings by continent
df.hist(column='beer_servings', by='continent', figsize=(12, 8), sharex=True, sharey=True)
plt.tight_layout()
plt.show()
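Rows whose grouping value is missing get no panel when by is used. That matters for this dataset because it encodes North America as NA, a string that read_csv parses as a missing value unless keep_default_na=False is passed. The sketch below illustrates the effect with a small inline CSV; the rows and the panel_titles helper are hypothetical and exist only for this example.

```python
import io
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up rows that mimic the layout of the drinks data
csv_text = """country,beer_servings,continent
c1,25,AF
c2,217,AF
c3,240,NA
c4,238,NA
c5,127,EU
c6,346,EU
"""

default_df = pd.read_csv(io.StringIO(csv_text))                         # "NA" becomes NaN
literal_df = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)  # "NA" stays a string

def panel_titles(frame):
    """Return the titles of the visible panels drawn by a grouped hist()."""
    axes = frame.hist(column="beer_servings", by="continent")
    titles = [ax.get_title() for ax in np.ravel(axes) if ax.get_visible()]
    plt.close("all")
    return titles

default_titles = panel_titles(default_df)
literal_titles = panel_titles(literal_df)
```

With default parsing the NA group is dropped, so default_titles has one entry fewer than literal_titles.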
layout - Subplot Arrangement
# Arrange histograms in 2 rows, 3 columns
df.hist(column='beer_servings', by='continent', layout=(2, 3), figsize=(12, 6))
plt.tight_layout()
plt.show()
sharex and sharey - Share Axes
df.hist(
    column='beer_servings',
    by='continent',
    sharex=True,  # Same x-axis scale
    sharey=True   # Same y-axis scale
)
figsize - Figure Size
df.hist(figsize=(15, 10))
ax - Specify Axes
# Pass exactly one Axes per plotted column: the drinks data has four numeric columns
fig, axes = plt.subplots(2, 2, figsize=(10, 8))
df.hist(ax=axes)
plt.tight_layout()
plt.show()
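When ax is an array, pandas expects exactly one Axes per plotted column and raises a ValueError otherwise. One way to handle an arbitrary number of numeric columns is to size the grid from the data, hand hist() only the Axes it needs, and hide the rest. The sketch below assumes a synthetic frame and a two-column grid; both are illustrative choices.

```python
import math
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic frame: three numeric columns plus one text column
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "a": rng.normal(size=100),
    "b": rng.uniform(size=100),
    "c": rng.exponential(size=100),
    "label": ["x", "y"] * 50,
})

numeric = data.select_dtypes(include="number")
n_plots = numeric.shape[1]
n_cols = 2
n_rows = math.ceil(n_plots / n_cols)

# squeeze=False keeps axes two-dimensional even for a single row or column
fig, axes = plt.subplots(n_rows, n_cols, figsize=(8, 6), squeeze=False)
flat = axes.ravel()
used, unused = flat[:n_plots], flat[n_plots:]

numeric.hist(ax=used)          # exactly one Axes per numeric column
for extra in unused:
    extra.set_visible(False)   # hide the leftover grid cell

titles = [a.get_title() for a in used]
```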
Practical Example: Housing Data
import os
import tarfile
import urllib.request
import pandas as pd
import matplotlib.pyplot as plt
# Download and load housing data
DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml2/master/"
HOUSING_PATH = os.path.join("datasets", "housing")
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"
def fetch_housing_data():
    """Download housing.tgz and extract housing.csv into HOUSING_PATH."""
    os.makedirs(HOUSING_PATH, exist_ok=True)
    tgz_path = os.path.join(HOUSING_PATH, "housing.tgz")
    urllib.request.urlretrieve(HOUSING_URL, tgz_path)
    with tarfile.open(tgz_path) as tgz:
        tgz.extractall(path=HOUSING_PATH)

def load_housing_data():
    """Read the extracted CSV into a DataFrame."""
    return pd.read_csv(os.path.join(HOUSING_PATH, "housing.csv"))
fetch_housing_data()
df = load_housing_data()
# Create histogram grid
fig, axes = plt.subplots(3, 3, figsize=(12, 9))
df.hist(bins=50, ax=axes)
# Clean up appearance
for ax in axes.flatten():
    ax.grid(False)
    ax.spines[['top', 'right']].set_visible(False)
plt.tight_layout()
plt.show()
Practical Example: Titanic Data
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
df = pd.read_csv(url)
# Create histograms for specific columns
fig, axes = plt.subplots(1, 5, figsize=(15, 3))
columns = ['Survived', 'Pclass', 'Age', 'SibSp', 'Fare']
for ax, col in zip(axes, columns):
    df[col].hist(ax=ax, density=True, edgecolor='black', alpha=0.7)
    ax.set_title(col)
    ax.set_ylabel('Density')
plt.tight_layout()
plt.show()
Customization Options
density - Normalize to Probability Density
df['Age'].hist(density=True) # Y-axis shows probability density
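With density=True the bar heights are scaled so that the total bar area is 1; the heights themselves are densities, not probabilities, and can exceed 1 when the bins are narrow. A minimal check, assuming synthetic ages in place of the Titanic column:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic ages, used only for illustration
rng = np.random.default_rng(0)
ages = pd.Series(rng.normal(loc=30, scale=12, size=500), name="Age")

fig, ax = plt.subplots()
ages.hist(bins=20, density=True, ax=ax)

# Each bar contributes height * width; the contributions sum to 1
area = sum(p.get_height() * p.get_width() for p in ax.patches)
```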
edgecolor - Bar Borders
df['Age'].hist(edgecolor='black')
alpha - Transparency
df['Age'].hist(alpha=0.7)
color - Bar Color
df['Age'].hist(color='steelblue')
Comparing Distributions
Overlapping Histograms
fig, ax = plt.subplots(figsize=(10, 5))
# Survived vs not survived
df[df['Survived'] == 1]['Age'].hist(alpha=0.5, label='Survived', ax=ax)
df[df['Survived'] == 0]['Age'].hist(alpha=0.5, label='Did not survive', ax=ax)
ax.legend()
ax.set_xlabel('Age')
ax.set_title('Age Distribution by Survival')
plt.show()
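Each hist() call picks its own bin edges from the data it receives, so two overlaid histograms can end up with bars of different widths. One possible refinement is to compute the edges once from the full column and pass them to both calls. The sketch below assumes a synthetic frame with Age and Survived columns standing in for the Titanic data.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the Titanic columns used above
rng = np.random.default_rng(0)
people = pd.DataFrame({
    "Age": rng.normal(loc=30, scale=12, size=400).clip(0, 80),
    "Survived": rng.integers(0, 2, size=400),
})

# Compute one set of edges from the whole column and reuse it for both groups
edges = np.histogram_bin_edges(people["Age"].dropna(), bins=20)

fig, ax = plt.subplots(figsize=(10, 5))
for flag, label in [(1, "Survived"), (0, "Did not survive")]:
    people.loc[people["Survived"] == flag, "Age"].hist(
        bins=edges, alpha=0.5, label=label, ax=ax
    )
ax.legend()
ax.set_xlabel("Age")

# Left edge of every bar, in drawing order (first group, then second)
lefts = [p.get_x() for p in ax.patches]
```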
Side-by-Side with by Parameter
df.hist(column='Age', by='Survived', figsize=(10, 4), sharex=True)
plt.tight_layout()
plt.show()
hist() vs plot(kind='hist')
| Feature | df.hist() | df.plot(kind='hist') |
|---|---|---|
| Multiple columns | One subplot per column, arranged in a grid | Overlaid on one axes; pass subplots=True for separate axes |
| by parameter | ✅ Supported | ✅ Supported since pandas 1.4 (silently ignored before) |
| layout control | ✅ Built-in | Available with subplots=True |
| Single column | Use Series | Use Series |
# Both produce similar output for single column:
df['Age'].hist(bins=20)
df['Age'].plot(kind='hist', bins=20)
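For a DataFrame with several columns the two calls differ in what they draw and what they return. The sketch below, which assumes a synthetic two-column frame, shows that hist() returns an array of Axes with one panel per column, while plot(kind='hist') returns a single Axes with the columns overlaid and labeled in a legend.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic two-column frame, used only for illustration
rng = np.random.default_rng(0)
frame = pd.DataFrame({
    "a": rng.normal(size=200),
    "b": rng.normal(loc=3, size=200),
})

grid = frame.hist(bins=20)                            # array of Axes, one per column
single = frame.plot(kind="hist", bins=20, alpha=0.5)  # one Axes, columns overlaid

panel_titles = [ax.get_title() for ax in grid.ravel()]
legend_labels = [t.get_text() for t in single.get_legend().get_texts()]
```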
Method Signature
DataFrame.hist(
    column=None,      # Column(s) to plot
    by=None,          # Group by this column
    grid=True,        # Show grid
    xlabelsize=None,  # X label font size
    xrot=None,        # X label rotation
    ylabelsize=None,  # Y label font size
    yrot=None,        # Y label rotation
    ax=None,          # Matplotlib axes
    sharex=False,     # Share x-axis
    sharey=False,     # Share y-axis
    figsize=None,     # Figure size
    layout=None,      # Subplot layout (rows, cols)
    bins=10,          # Number of bins, or a sequence of bin edges
    backend=None,     # Plotting backend to use
    legend=False,     # Show legend
    **kwargs          # Additional hist kwargs
)
Summary
# All numeric columns
df.hist()
# Specific column
df.hist(column='Age')
df['Age'].hist()
# Grouped by category
df.hist(column='Age', by='Survived')
# Customized
df.hist(bins=30, figsize=(12, 8), edgecolor='black', alpha=0.7)
# With layout
df.hist(column='beer_servings', by='continent', layout=(2, 3), sharex=True)