Binning with cut and qcut¶
Binning (discretization) converts continuous data into discrete categories. pandas provides two primary functions: pd.cut() for equal-width bins and pd.qcut() for equal-frequency bins.
pd.cut - Equal-Width Bins¶
Divides the range of the data into bins of equal width.
Basic Usage¶
import pandas as pd
import numpy as np
ages = pd.Series([22, 35, 45, 28, 65, 52, 19, 38, 72, 33])
# Create 3 equal-width bins
binned = pd.cut(ages, bins=3)
print(binned)
0 (18.947, 36.667]
1 (18.947, 36.667]
2 (36.667, 54.333]
3 (18.947, 36.667]
4 (54.333, 72.0]
5 (36.667, 54.333]
6 (18.947, 36.667]
7 (36.667, 54.333]
8 (54.333, 72.0]
9 (18.947, 36.667]
dtype: category
Categories (3, interval[float64, right]): [(18.947, 36.667] < (36.667, 54.333] < (54.333, 72.0]]
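The lower edge 18.947 is slightly below the minimum age of 19. When bins is an integer, pandas splits the range from min to max evenly and then extends the lowest edge by 0.1% of the range so that the minimum value is included. The sketch below rebuilds the edges by hand to illustrate this; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

ages = pd.Series([22, 35, 45, 28, 65, 52, 19, 38, 72, 33])

# Edges as pd.cut reports them
_, pandas_edges = pd.cut(ages, bins=3, retbins=True)

# Manual reconstruction: evenly spaced edges between min and max ...
lo, hi = ages.min(), ages.max()
manual_edges = np.linspace(lo, hi, 4)
# ... with the lowest edge extended by 0.1% of the range (right=True case)
manual_edges[0] -= 0.001 * (hi - lo)

print(pandas_edges)
print(np.allclose(pandas_edges, manual_edges))
```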
Custom Bin Edges¶
# Define explicit bin boundaries
bins = [0, 18, 35, 50, 65, 100]
labels = ['Child', 'Young Adult', 'Middle Age', 'Senior', 'Elderly']
age_groups = pd.cut(ages, bins=bins, labels=labels)
print(age_groups)
0 Young Adult
1 Young Adult
2 Middle Age
3 Young Adult
4 Senior
5 Senior
6 Young Adult
7 Middle Age
8 Elderly
9 Young Adult
dtype: category
Categories (5, object): ['Child' < 'Young Adult' < 'Middle Age' < 'Senior' < 'Elderly']
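A quick way to sanity-check custom edges is to place the raw intervals next to the labels. The sketch below does that for the same ages; the check DataFrame is only an illustration, not part of the pandas API.

```python
import pandas as pd

ages = pd.Series([22, 35, 45, 28, 65, 52, 19, 38, 72, 33])
bins = [0, 18, 35, 50, 65, 100]
labels = ['Child', 'Young Adult', 'Middle Age', 'Senior', 'Elderly']

# One label per interval: len(labels) must equal len(bins) - 1
assert len(labels) == len(bins) - 1

check = pd.DataFrame({
    'age': ages,
    'interval': pd.cut(ages, bins=bins),               # raw (a, b] intervals
    'label': pd.cut(ages, bins=bins, labels=labels),   # same bins, named
})
print(check)
```

Because intervals are closed on the right by default, an age of exactly 18 would be labelled Child, while 19 already counts as Young Adult.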
Include Lowest Value¶
With the default right=True, every interval is open on the left, so a value equal to the lowest bin edge falls outside all bins and becomes NaN. Use include_lowest=True to include it.
values = pd.Series([0, 10, 20, 30])
bins = [0, 10, 20, 30]
# Without include_lowest: 0 becomes NaN
print(pd.cut(values, bins=bins))
# With include_lowest: 0 is included
print(pd.cut(values, bins=bins, include_lowest=True))
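To make the difference visible without reading interval labels, the following sketch checks which values end up unbinned in each case. It reuses the values above; default and inclusive are illustrative names.

```python
import pandas as pd

values = pd.Series([0, 10, 20, 30])
bins = [0, 10, 20, 30]

default = pd.cut(values, bins=bins)
inclusive = pd.cut(values, bins=bins, include_lowest=True)

print(default.isna().tolist())      # which values fell outside every bin
print(inclusive.isna().tolist())
print(inclusive.cat.categories[0])  # first interval, widened slightly to the left
```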
Right vs Left Closed¶
values = pd.Series([1, 5, 10, 15, 20])
bins = [0, 10, 20]
# right=True (default): intervals are (a, b]
print(pd.cut(values, bins=bins, right=True))
# 10 goes into (0, 10], 20 goes into (10, 20]
# right=False: intervals are [a, b)
print(pd.cut(values, bins=bins, right=False))
# 10 goes into [10, 20), 20 becomes NaN
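The effect of right is easiest to compare side by side. This sketch passes labels=False, which makes pd.cut return the integer position of each bin instead of an interval, and puts both results in one table.

```python
import pandas as pd

values = pd.Series([1, 5, 10, 15, 20])
bins = [0, 10, 20]

# labels=False returns bin positions (0, 1, ...) rather than intervals
right_closed = pd.cut(values, bins=bins, right=True, labels=False)
left_closed = pd.cut(values, bins=bins, right=False, labels=False)

print(pd.DataFrame({
    'value': values,
    'right=True': right_closed,
    'right=False': left_closed,
}))
```

The boundary value 10 moves from bin 0 to bin 1, and 20 has no bin at all when the intervals are closed on the left.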
Return Bin Information¶
ages = pd.Series([22, 35, 45, 28, 65, 52])
# Get both the binned values and the computed bin edges
binned, bin_edges = pd.cut(ages, bins=3, retbins=True)
print(f"Bin edges: {bin_edges}")
Bin edges: [21.957 36.333 50.667 65. ]
pd.qcut - Quantile-Based Bins¶
Divides data into bins with approximately equal numbers of observations.
Basic Usage¶
# Highly skewed data
salaries = pd.Series([30000, 35000, 40000, 45000, 50000,
100000, 150000, 200000, 500000, 1000000])
# Equal-width bins (pd.cut) - most values land in the first bin
print("cut (equal-width):")
print(pd.cut(salaries, bins=4).value_counts().sort_index())
# Equal-frequency bins (pd.qcut) - roughly equal counts per bin
print("\nqcut (equal-frequency):")
print(pd.qcut(salaries, q=4).value_counts().sort_index())
cut (equal-width):
(29030.0, 272500.0] 8
(272500.0, 515000.0] 1
(515000.0, 757500.0] 0
(757500.0, 1000000.0] 1
dtype: int64
qcut (equal-frequency):
(29999.999, 41250.0] 3
(41250.0, 75000.0] 2
(75000.0, 187500.0] 2
(187500.0, 1000000.0] 3
dtype: int64
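The qcut edges are simply quantiles of the data. As a minimal check, assuming the default linear interpolation of Series.quantile, the sketch below compares the edges returned by retbins with quantiles computed directly.

```python
import numpy as np
import pandas as pd

salaries = pd.Series([30000, 35000, 40000, 45000, 50000,
                      100000, 150000, 200000, 500000, 1000000])

# Edges chosen by qcut for four quantile bins
_, qcut_edges = pd.qcut(salaries, q=4, retbins=True)

# The same cut points computed directly as quantiles
quantile_edges = salaries.quantile([0, 0.25, 0.5, 0.75, 1]).to_numpy()

print(qcut_edges)
print(np.allclose(qcut_edges, quantile_edges))
```

With only ten observations the four bins cannot all hold exactly 2.5 values, which is why the counts are only approximately equal.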
Custom Quantiles¶
# Create quartiles
quartiles = pd.qcut(salaries, q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])
print(quartiles)
# Custom percentiles
percentiles = pd.qcut(salaries, q=[0, 0.25, 0.5, 0.75, 0.9, 1.0],
labels=['Bottom 25%', '25-50%', '50-75%', '75-90%', 'Top 10%'])
print(percentiles)
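When q is a list of quantile edges, the number of labels must be one less than the number of edges. The sketch below counts how many salaries fall into each custom band; bands and counts are illustrative names.

```python
import pandas as pd

salaries = pd.Series([30000, 35000, 40000, 45000, 50000,
                      100000, 150000, 200000, 500000, 1000000])

q = [0, 0.25, 0.5, 0.75, 0.9, 1.0]
labels = ['Bottom 25%', '25-50%', '50-75%', '75-90%', 'Top 10%']
assert len(labels) == len(q) - 1  # one label per quantile band

bands = pd.qcut(salaries, q=q, labels=labels)
counts = bands.value_counts().sort_index()
print(counts)
```

On a sample this small the band sizes only roughly follow the nominal percentages.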
Handling Duplicates¶
When data contains many repeated values, several quantiles can fall on the same number. The resulting bin edges are not unique, so qcut raises a ValueError by default.
# Data with many duplicates
values = pd.Series([1, 1, 1, 2, 2, 2, 3, 3, 3])
# This will raise an error
# pd.qcut(values, q=4) # ValueError: Bin edges must be unique
# Solution 1: Use duplicates='drop'
print(pd.qcut(values, q=4, duplicates='drop'))
# Solution 2: Use pd.cut instead
print(pd.cut(values, bins=4))
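Note that duplicates='drop' returns fewer bins than requested. If exactly four groups are needed, one common workaround (not mentioned above, shown here as an option rather than a recommendation) is to bin the ranks instead of the raw values. The trade-off is that identical values can end up in different bins.

```python
import pandas as pd

values = pd.Series([1, 1, 1, 2, 2, 2, 3, 3, 3])

# duplicates='drop' merges the repeated edges, so fewer bins come back
dropped = pd.qcut(values, q=4, duplicates='drop')
print(len(dropped.cat.categories))  # fewer than the 4 requested

# Ranking with method='first' makes every value unique, so four bins can be formed
ranked = pd.qcut(values.rank(method='first'), q=4,
                 labels=['Q1', 'Q2', 'Q3', 'Q4'])
print(ranked.value_counts().sort_index())
```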
cut vs qcut Comparison¶
# Normally distributed data
np.random.seed(42)
normal_data = pd.Series(np.random.randn(1000) * 10 + 50)
# Skewed data
skewed_data = pd.Series(np.random.exponential(scale=10, size=1000))
print("=== Normal Data ===")
print("cut bins:", pd.cut(normal_data, bins=5).value_counts().sort_index())
print("qcut bins:", pd.qcut(normal_data, q=5).value_counts().sort_index())
print("\n=== Skewed Data ===")
print("cut bins:", pd.cut(skewed_data, bins=5).value_counts().sort_index())
print("qcut bins:", pd.qcut(skewed_data, q=5).value_counts().sort_index())
| Scenario | Use cut | Use qcut |
|---|---|---|
| Equal-width ranges needed | ✅ | |
| Equal sample sizes per bin | | ✅ |
| Predefined boundaries | ✅ | |
| Percentile-based analysis | | ✅ |
| Skewed distributions | | ✅ (usually) |
| Custom business rules | ✅ | |
Practical Examples¶
1. Customer Segmentation by Spending¶
# Customer purchase data
customers = pd.DataFrame({
'customer_id': range(1, 101),
'total_spend': np.random.exponential(scale=500, size=100)
})
# Segment into spending tiers using qcut (equal customer count)
customers['spend_tier'] = pd.qcut(
customers['total_spend'],
q=4,
labels=['Bronze', 'Silver', 'Gold', 'Platinum']
)
# Check distribution
print(customers['spend_tier'].value_counts())
print(customers.groupby('spend_tier', observed=True)['total_spend'].agg(['min', 'max', 'mean']))
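Quantile tiers are tied to the data they were computed on. To assign the same tiers to customers who arrive later, one option is to keep the edges from retbins=True and reuse them with pd.cut. The sketch below assumes synthetic spend data and hypothetical new customers; the outer edges are opened to infinity so that unseen extremes still receive a tier.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
history = pd.Series(rng.exponential(scale=500, size=100))  # synthetic spend
tiers = ['Bronze', 'Silver', 'Gold', 'Platinum']

# Learn the quartile edges once from the historical data
_, edges = pd.qcut(history, q=4, retbins=True)

# Keep the inner cut points, but open the outer edges to +/- infinity
open_edges = np.concatenate(([-np.inf], edges[1:-1], [np.inf]))

# Hypothetical new customers, binned with the stored edges
new_spend = pd.Series([5.0, 400.0, 10000.0])
new_tiers = pd.cut(new_spend, bins=open_edges, labels=tiers)
print(new_tiers.tolist())
```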
2. Grade Assignment¶
# Student scores
scores = pd.Series([95, 87, 76, 65, 58, 92, 73, 81, 45, 88])
# Fixed grade boundaries; the top edge is 101 so that, with right=False, a score of 100 falls in [90, 101)
bins = [0, 60, 70, 80, 90, 101]
labels = ['F', 'D', 'C', 'B', 'A']
grades = pd.cut(scores, bins=bins, labels=labels, right=False)
print(grades)
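With right=False each interval has the form [a, b), so a score on a boundary moves up to the next grade and a score equal to the last edge is left unbinned. The sketch below checks this with a few hypothetical boundary scores and compares a top edge of 100 with a top edge of 101.

```python
import pandas as pd

labels = ['F', 'D', 'C', 'B', 'A']
boundary_scores = pd.Series([59, 60, 89, 90, 100])  # hypothetical edge cases

# Last edge at 100: the top bin is [90, 100), so 100 is left out
strict = pd.cut(boundary_scores, bins=[0, 60, 70, 80, 90, 100],
                labels=labels, right=False)

# Last edge just above 100: the top bin [90, 101) covers a perfect score
padded = pd.cut(boundary_scores, bins=[0, 60, 70, 80, 90, 101],
                labels=labels, right=False)

print(pd.DataFrame({'score': boundary_scores,
                    'strict': strict,
                    'padded': padded}))
```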
3. Age Groups for Analysis¶
df = pd.DataFrame({
'name': ['Alice', 'Bob', 'Carol', 'David', 'Eve'],
'age': [25, 35, 45, 55, 65],
'income': [50000, 75000, 90000, 120000, 80000]
})
# Create age groups
df['age_group'] = pd.cut(
df['age'],
bins=[18, 30, 45, 60, 100],
labels=['Young', 'Middle', 'Senior', 'Retired']
)
# Analyze income by age group
print(df.groupby('age_group', observed=True)['income'].mean())
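Because pd.cut returns a categorical column, groupby can report either every defined age group or only the groups that actually occur, depending on the observed argument. The sketch below uses a smaller hypothetical DataFrame in which nobody falls into the two oldest groups.

```python
import pandas as pd

df = pd.DataFrame({'age': [25, 35, 45],
                   'income': [50000, 75000, 90000]})
df['age_group'] = pd.cut(df['age'], bins=[18, 30, 45, 60, 100],
                         labels=['Young', 'Middle', 'Senior', 'Retired'])

# observed=False keeps empty categories (their mean is NaN)
all_groups = df.groupby('age_group', observed=False)['income'].mean()
# observed=True reports only the groups that contain rows
seen_groups = df.groupby('age_group', observed=True)['income'].mean()

print(all_groups)
print(seen_groups)
```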
4. Stock Price Volatility Buckets¶
import yfinance as yf
# Get stock data
stock = yf.Ticker('AAPL').history(period='1y')
stock['daily_return'] = stock['Close'].pct_change()
# Bin daily returns by direction and size (the first row's return is NaN and stays unbinned)
stock['return_category'] = pd.cut(
stock['daily_return'],
bins=[-np.inf, -0.02, -0.01, 0, 0.01, 0.02, np.inf],
labels=['Large Loss', 'Loss', 'Slight Loss', 'Slight Gain', 'Gain', 'Large Gain']
)
print(stock['return_category'].value_counts())
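The example above needs network access and the yfinance package. As an offline sketch of the same binning, the code below substitutes a synthetic random-walk price series (an assumption for illustration, not real AAPL data) and applies identical bin edges and labels.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Synthetic closing prices: a simple random walk, not real market data
close = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 0.015, size=250))))
daily_return = close.pct_change()

return_category = pd.cut(
    daily_return,
    bins=[-np.inf, -0.02, -0.01, 0, 0.01, 0.02, np.inf],
    labels=['Large Loss', 'Loss', 'Slight Loss',
            'Slight Gain', 'Gain', 'Large Gain'],
)

counts = return_category.value_counts().sort_index()
print(counts)
print(return_category.isna().sum())  # the first return is NaN, so it stays unbinned
```

The infinite outer edges guarantee that every finite return lands in some category.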
Key Parameters Summary¶
pd.cut Parameters¶
| Parameter | Description | Default |
|---|---|---|
| x | Input array to bin | Required |
| bins | Number of bins or bin edges | Required |
| labels | Labels for bins | None (interval notation) |
| right | Include right edge | True |
| include_lowest | Include lowest edge | False |
| retbins | Return bin edges | False |
| precision | Decimal precision | 3 |
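The precision parameter controls how the interval labels are rounded for display. As a small sketch using the ages from the first example, the code below lowers precision to 1 and compares the labels with the edges returned by retbins.

```python
import pandas as pd

ages = pd.Series([22, 35, 45, 28, 65, 52, 19, 38, 72, 33])

coarse, edges = pd.cut(ages, bins=3, precision=1, retbins=True)

print(coarse.cat.categories)  # interval labels rounded to 1 decimal place
print(edges)                  # edges as returned by retbins
```

The labels are rounded, while the edges from retbins keep their full floating-point values.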
pd.qcut Parameters¶
| Parameter | Description | Default |
|---|---|---|
| x | Input array to bin | Required |
| q | Number of quantiles or quantile edges | Required |
| labels | Labels for bins | None |
| retbins | Return bin edges | False |
| precision | Decimal precision | 3 |
| duplicates | Handle duplicate edges | 'raise' |
Common Pitfalls¶
1. NaN Values in Input¶
values = pd.Series([1, 2, np.nan, 4, 5])
binned = pd.cut(values, bins=3)
print(binned) # NaN remains NaN
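If missing values should show up in summaries rather than silently disappear, one option is to count them explicitly or to give them a category of their own. The sketch below uses hypothetical labels low, mid and high plus an added missing category.

```python
import numpy as np
import pandas as pd

values = pd.Series([1, 2, np.nan, 4, 5])
binned = pd.cut(values, bins=3, labels=['low', 'mid', 'high'])

# value_counts skips NaN unless asked to keep it
print(binned.value_counts(dropna=False))

# Optional: turn missing values into an explicit category
labelled = binned.cat.add_categories('missing').fillna('missing')
print(labelled.tolist())
```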
2. Out-of-Range Values¶
values = pd.Series([5, 15, 25, 35])
bins = [10, 20, 30]
result = pd.cut(values, bins=bins)
print(result)
# 5 and 35 become NaN (outside bin range)
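Two simple safeguards, sketched below under the assumption that out-of-range values should be kept rather than dropped: count the NaN results to detect the problem, and add open-ended outer bins with -np.inf and np.inf. The label strings are illustrative.

```python
import numpy as np
import pandas as pd

values = pd.Series([5, 15, 25, 35])

# Count how many values the finite bins would drop
inside = pd.cut(values, bins=[10, 20, 30])
print(inside.isna().sum())

# Open-ended outer bins catch everything at or below 10 and above 30
open_ended = pd.cut(values, bins=[-np.inf, 10, 20, 30, np.inf],
                    labels=['<=10', '10-20', '20-30', '>30'])
print(open_ended.tolist())
```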
3. Too Few Unique Values for qcut¶
# When data has fewer unique values than requested quantiles
values = pd.Series([1, 1, 2, 2])
# pd.qcut(values, q=4) # ValueError: Bin edges must be unique
pd.qcut(values, q=4, duplicates='drop') # Works, but returns only 2 bins