Binning with cut and qcut

Binning (discretization) converts continuous data into discrete categories. pandas provides two primary functions: pd.cut() for equal-width bins and pd.qcut() for equal-frequency bins.

Mental Model

Imagine a number line chopped into buckets. cut chops by equal width (every bucket spans the same range), while qcut chops by equal count (every bucket holds roughly the same number of data points). Choose cut when the bin boundaries matter, and qcut when balanced group sizes matter.

pd.cut - Equal-Width Bins

Divides data into bins of equal width (range).

Basic Usage

```python
import pandas as pd
import numpy as np

ages = pd.Series([22, 35, 45, 28, 65, 52, 19, 38, 72, 33])

# Create 3 equal-width bins
binned = pd.cut(ages, bins=3)
print(binned)
```

```
0    (18.947, 36.667]
1    (18.947, 36.667]
2    (36.667, 54.333]
3    (18.947, 36.667]
4      (54.333, 72.0]
5    (36.667, 54.333]
6    (18.947, 36.667]
7    (36.667, 54.333]
8      (54.333, 72.0]
9    (18.947, 36.667]
dtype: category
Categories (3, interval[float64, right]): [(18.947, 36.667] < (36.667, 54.333] < (54.333, 72.0]]
```
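The edges in the output above can be reproduced by hand. The sketch below assumes that pd.cut splits the range from the minimum to the maximum into equal parts and then lowers the first edge by 0.1% of the range, so that the minimum falls inside the first left-open interval. That assumption matches the printed edges, but treat it as an illustration rather than a specification of pandas internals. The names `manual_edges` and `n_bins` are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

ages = pd.Series([22, 35, 45, 28, 65, 52, 19, 38, 72, 33])
n_bins = 3

# Equal-width edges over [min, max]
lo, hi = ages.min(), ages.max()
manual_edges = np.linspace(lo, hi, n_bins + 1)

# Assumed adjustment: lower the first edge by 0.1% of the range
manual_edges[0] -= 0.001 * (hi - lo)

# Edges as reported by pandas itself
_, pandas_edges = pd.cut(ages, bins=n_bins, retbins=True)

print(np.round(manual_edges, 3))  # approximately 18.947, 36.667, 54.333, 72.0
print(np.round(pandas_edges, 3))
```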

Custom Bin Edges

```python
# Define explicit bin boundaries
bins = [0, 18, 35, 50, 65, 100]
labels = ['Child', 'Young Adult', 'Middle Age', 'Senior', 'Elderly']

age_groups = pd.cut(ages, bins=bins, labels=labels)
print(age_groups)
```

```
0    Young Adult
1    Young Adult
2     Middle Age
3    Young Adult
4         Senior
5         Senior
6    Young Adult
7     Middle Age
8        Elderly
9    Young Adult
dtype: category
Categories (5, object): ['Child' < 'Young Adult' < 'Middle Age' < 'Senior' < 'Elderly']
```

Include Lowest Value

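With the default right=True, every interval is open on the left, so a value exactly equal to the first bin edge falls outside all bins and becomes NaN. Pass include_lowest=True to close the first interval on the left.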

```python
values = pd.Series([0, 10, 20, 30])
bins = [0, 10, 20, 30]

# Without include_lowest: 0 becomes NaN
print(pd.cut(values, bins=bins))

# With include_lowest: 0 is included
print(pd.cut(values, bins=bins, include_lowest=True))
```
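To make the effect easy to check, the following sketch counts the missing values produced by each call and prints the first interval when include_lowest=True. The variable names `default` and `inclusive` are chosen here for illustration.

```python
import pandas as pd

values = pd.Series([0, 10, 20, 30])
bins = [0, 10, 20, 30]

default = pd.cut(values, bins=bins)
inclusive = pd.cut(values, bins=bins, include_lowest=True)

# Only the value sitting on the first edge is lost by default
print(default.isna().tolist())    # [True, False, False, False]
print(inclusive.isna().tolist())  # [False, False, False, False]

# The first interval is widened slightly to the left so that it contains 0
print(inclusive.cat.categories[0])
```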

Right vs Left Closed

```python
values = pd.Series([1, 5, 10, 15, 20])
bins = [0, 10, 20]

# right=True (default): intervals are (a, b]
print(pd.cut(values, bins=bins, right=True))
# 10 goes into (0, 10], 20 goes into (10, 20]

# right=False: intervals are [a, b)
print(pd.cut(values, bins=bins, right=False))
# 10 goes into [10, 20), 20 becomes NaN
```
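A side-by-side view makes the boundary behaviour easier to see. This sketch puts both results in one DataFrame; the column names and the `comparison` variable are illustrative choices, not part of the pandas API.

```python
import pandas as pd

values = pd.Series([1, 5, 10, 15, 20])
bins = [0, 10, 20]

right_closed = pd.cut(values, bins=bins, right=True)
left_closed = pd.cut(values, bins=bins, right=False)

# One row per value, one column per closing rule
comparison = pd.DataFrame({
    "value": values,
    "right=True": right_closed,
    "right=False": left_closed,
})
print(comparison)

# The boundary value 10 is the right end of one interval
# and the left end of the other
print(right_closed[2], left_closed[2])
```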

Return Bin Information

```python
ages = pd.Series([22, 35, 45, 28, 65, 52])

# Get the binned values and the bin edges
binned, bin_edges = pd.cut(ages, bins=3, retbins=True)

# The returned edges are not rounded, so round them for display
print(f"Bin edges: {bin_edges.round(3)}")
```

```
Bin edges: [21.957 36.333 50.667 65.   ]
```
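A common reason to keep the edges is to apply the same bins to another dataset. The sketch below reuses edges computed on one series to bin a second one; `train` and `new_ages` are hypothetical names and values used only for this example.

```python
import pandas as pd

train = pd.Series([22, 35, 45, 28, 65, 52])

# Learn the edges once
_, edges = pd.cut(train, bins=3, retbins=True)

# Apply exactly the same edges to new data
new_ages = pd.Series([25, 40, 60, 80])
new_binned = pd.cut(new_ages, bins=edges)

print(new_binned)
# 80 lies above the last edge (65), so it becomes NaN
```

Values outside the original range are not assigned to any bin, so the new data may need wider outer edges if that matters for the analysis.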

pd.qcut - Quantile-Based Bins

Divides data into bins with approximately equal numbers of observations.

Basic Usage

```python
# Highly skewed data
salaries = pd.Series([30000, 35000, 40000, 45000, 50000,
                      100000, 150000, 200000, 500000, 1000000])

# Equal-width bins (pd.cut) - most values in one bin
print("cut (equal-width):")
print(pd.cut(salaries, bins=4).value_counts().sort_index())

# Equal-frequency bins (pd.qcut) - roughly the same count per bin
print("\nqcut (equal-frequency):")
print(pd.qcut(salaries, q=4).value_counts().sort_index())
```

```
cut (equal-width):
(29030.0, 272500.0]      8
(272500.0, 515000.0]     1
(515000.0, 757500.0]     0
(757500.0, 1000000.0]    1
dtype: int64

qcut (equal-frequency):
(29999.999, 41250.0]     3
(41250.0, 75000.0]       2
(75000.0, 187500.0]      2
(187500.0, 1000000.0]    3
dtype: int64
```
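The qcut edges are the quantiles of the data. As a rough check, the sketch below compares the edges returned by pd.qcut with the values from Series.quantile at the same probabilities; it assumes both use the default linear interpolation. The names `probs`, `qcut_edges`, and `quantile_edges` are illustrative.

```python
import numpy as np
import pandas as pd

salaries = pd.Series([30000, 35000, 40000, 45000, 50000,
                      100000, 150000, 200000, 500000, 1000000])

# Probabilities that correspond to q=4
probs = [0, 0.25, 0.5, 0.75, 1.0]

_, qcut_edges = pd.qcut(salaries, q=4, retbins=True)
quantile_edges = salaries.quantile(probs).to_numpy()

print(qcut_edges)      # 30000, 41250, 75000, 187500, 1000000
print(quantile_edges)  # the same values
```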

Custom Quantiles

```python
# Create quartiles
quartiles = pd.qcut(salaries, q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])
print(quartiles)

# Custom percentiles
percentiles = pd.qcut(
    salaries,
    q=[0, 0.25, 0.5, 0.75, 0.9, 1.0],
    labels=['Bottom 25%', '25-50%', '50-75%', '75-90%', 'Top 10%']
)
print(percentiles)
```
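When only the bin number is needed, for example as a decile feature, labels=False returns integer codes instead of intervals. The sketch below uses hypothetical random data; `scores` and `decile` are names chosen for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 100 distinct random values
rng = np.random.default_rng(0)
scores = pd.Series(rng.random(100))

# labels=False yields integer bin codes 0..9 instead of Interval labels
decile = pd.qcut(scores, q=10, labels=False)

print(decile.head())
print(decile.value_counts().sort_index())  # 10 values per decile
```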

Handling Duplicates

When the data contains many repeated values, several quantiles can fall on the same value. The resulting bin edges are not unique, and qcut raises a ValueError by default.

```python
# Data with many duplicates
values = pd.Series([1, 1, 1, 2, 2, 2, 3, 3, 3])

# This raises an error: the quartile edges are 1, 1, 2, 3, 3
try:
    pd.qcut(values, q=4)
except ValueError as err:
    print(err)  # Bin edges must be unique ...

# Solution 1: Use duplicates='drop' (returns fewer bins than requested)
print(pd.qcut(values, q=4, duplicates='drop'))

# Solution 2: Use pd.cut instead
print(pd.cut(values, bins=4))
```
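If the number of bins must stay fixed, one commonly used workaround (not mentioned in the text above, so treat it as an optional alternative) is to bin the ranks instead of the raw values. The sketch below compares it with duplicates='drop'. Note the trade-off: with method='first', tied values are split across bins in order of appearance.

```python
import pandas as pd

values = pd.Series([1, 1, 1, 2, 2, 2, 3, 3, 3])

# duplicates='drop' keeps only the unique edges, so fewer bins come back
dropped = pd.qcut(values, q=4, duplicates='drop')
print(len(dropped.cat.categories))  # 2

# Ranking with method='first' makes every value unique before binning
ranked = pd.qcut(values.rank(method='first'), q=4, labels=False)
print(ranked.value_counts().sort_index())  # four bins
```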

cut vs qcut Comparison

```python
# Normally distributed data
np.random.seed(42)
normal_data = pd.Series(np.random.randn(1000) * 10 + 50)

# Skewed data
skewed_data = pd.Series(np.random.exponential(scale=10, size=1000))

print("=== Normal Data ===")
print("cut bins:", pd.cut(normal_data, bins=5).value_counts().sort_index())
print("qcut bins:", pd.qcut(normal_data, q=5).value_counts().sort_index())

print("\n=== Skewed Data ===")
print("cut bins:", pd.cut(skewed_data, bins=5).value_counts().sort_index())
print("qcut bins:", pd.qcut(skewed_data, q=5).value_counts().sort_index())
```

| Scenario | Use cut | Use qcut |
|---|---|---|
| Equal-width ranges needed | ✅ | |
| Equal sample sizes per bin | | ✅ |
| Predefined boundaries | ✅ | |
| Percentile-based analysis | | ✅ |
| Skewed distributions | | ✅ (usually) |
| Custom business rules | ✅ | |
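The trade-off in the table can be measured directly: cut keeps the widths equal and lets the counts vary, while qcut does the opposite. The sketch below tabulates both for hypothetical skewed data; `summaries` and the column names are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical skewed sample
rng = np.random.default_rng(42)
skewed = pd.Series(rng.exponential(scale=10, size=1000))

summaries = {}
for name, binned in [("cut", pd.cut(skewed, bins=5)),
                     ("qcut", pd.qcut(skewed, q=5))]:
    intervals = binned.cat.categories
    summaries[name] = pd.DataFrame(
        {
            "width": np.asarray(intervals.length),  # interval widths
            "count": binned.value_counts().sort_index().to_numpy(),
        },
        index=intervals,
    )
    print(name)
    print(summaries[name], "\n")
```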

Practical Examples

1. Customer Segmentation by Spending

```python
# Customer purchase data
customers = pd.DataFrame({
    'customer_id': range(1, 101),
    'total_spend': np.random.exponential(scale=500, size=100)
})

# Segment into spending tiers using qcut (equal customer count)
customers['spend_tier'] = pd.qcut(
    customers['total_spend'],
    q=4,
    labels=['Bronze', 'Silver', 'Gold', 'Platinum']
)

# Check distribution
print(customers['spend_tier'].value_counts())

# observed=True avoids the pandas warning about grouping on categoricals
print(customers.groupby('spend_tier', observed=True)['total_spend']
      .agg(['min', 'max', 'mean']))
```

2. Grade Assignment

```python
# Student scores
scores = pd.Series([95, 87, 76, 65, 58, 92, 73, 81, 45, 88])

# Fixed grade boundaries
# With right=False the last bin is [90, 101), so a score of 100 still gets an 'A'
bins = [0, 60, 70, 80, 90, 101]
labels = ['F', 'D', 'C', 'B', 'A']

grades = pd.cut(scores, bins=bins, labels=labels, right=False)
print(grades)
```
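Because pd.cut returns an ordered categorical, the grades can be compared against a label directly. The sketch below filters passing scores this way; treating 'C' as the passing threshold is an assumption made only for this example, and the last bin edge is set to 101 so that a score of 100 would land in the top bin.

```python
import pandas as pd

scores = pd.Series([95, 87, 76, 65, 58, 92, 73, 81, 45, 88])
bins = [0, 60, 70, 80, 90, 101]
labels = ['F', 'D', 'C', 'B', 'A']

grades = pd.cut(scores, bins=bins, labels=labels, right=False)

# Ordered categories support comparisons: F < D < C < B < A
passed = grades >= 'C'  # assumed passing threshold
print(scores[passed].tolist())  # [95, 87, 76, 92, 73, 81, 88]
print(grades.max())             # A
```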

3. Age Groups for Analysis

```python
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Carol', 'David', 'Eve'],
    'age': [25, 35, 45, 55, 65],
    'income': [50000, 75000, 90000, 120000, 80000]
})

# Create age groups
df['age_group'] = pd.cut(
    df['age'],
    bins=[18, 30, 45, 60, 100],
    labels=['Young', 'Middle', 'Senior', 'Retired']
)

# Analyze income by age group
# observed=True avoids the pandas warning about grouping on categoricals
print(df.groupby('age_group', observed=True)['income'].mean())
```

4. Stock Price Volatility Buckets

```python
import numpy as np
import pandas as pd
import yfinance as yf

# Get stock data
stock = yf.Ticker('AAPL').history(period='1y')
stock['daily_return'] = stock['Close'].pct_change()

# Bin returns into volatility categories
stock['return_category'] = pd.cut(
    stock['daily_return'],
    bins=[-np.inf, -0.02, -0.01, 0, 0.01, 0.02, np.inf],
    labels=['Large Loss', 'Loss', 'Slight Loss', 'Slight Gain', 'Gain', 'Large Gain']
)

print(stock['return_category'].value_counts())
```

Key Parameters Summary

pd.cut Parameters

| Parameter | Description | Default |
|---|---|---|
| x | Input array to bin | Required |
| bins | Number of bins or bin edges | Required |
| labels | Labels for bins | None (interval notation) |
| right | Include right edge | True |
| include_lowest | Include lowest edge | False |
| retbins | Return bin edges | False |
| precision | Decimal precision | 3 |

pd.qcut Parameters

| Parameter | Description | Default |
|---|---|---|
| x | Input array to bin | Required |
| q | Number of quantiles or quantile edges | Required |
| labels | Labels for bins | None |
| retbins | Return bin edges | False |
| precision | Decimal precision | 3 |
| duplicates | Handle duplicate edges | 'raise' |

Common Pitfalls

1. NaN Values in Input

```python
values = pd.Series([1, 2, np.nan, 4, 5])
binned = pd.cut(values, bins=3)
print(binned)  # NaN remains NaN
```

2. Out-of-Range Values

```python
values = pd.Series([5, 15, 25, 35])
bins = [10, 20, 30]

result = pd.cut(values, bins=bins)
print(result)
# 5 and 35 become NaN (outside bin range)
```
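One way to avoid losing out-of-range values is to add open-ended outer bins with infinite edges, as the stock example above does. A minimal sketch, where `open_bins` and `caught` are illustrative names:

```python
import numpy as np
import pandas as pd

values = pd.Series([5, 15, 25, 35])

# Infinite outer edges catch anything below 10 or above 30
open_bins = [-np.inf, 10, 20, 30, np.inf]
caught = pd.cut(values, bins=open_bins)

print(caught)
print(caught.isna().sum())  # 0
```

Whether a catch-all bin is appropriate depends on the analysis; in some cases a NaN is the more honest signal that a value was unexpected.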

3. Too Few Unique Values for qcut

```python
# When data has fewer unique values than requested quantiles
# (here q=4 gives the edges 1, 1, 1.5, 2, 2)
values = pd.Series([1, 1, 2, 2])

try:
    pd.qcut(values, q=4)  # Error!
except ValueError as err:
    print(err)

print(pd.qcut(values, q=4, duplicates='drop'))  # Works
```


Exercises

Exercise 1. Write code that uses pd.cut() to bin a list of ages [5, 15, 25, 35, 45, 55, 65, 75] into categories: 'Child', 'Young Adult', 'Adult', 'Senior' with boundaries [0, 18, 35, 60, 100].

Solution to Exercise 1

```python
import pandas as pd

ages = [5, 15, 25, 35, 45, 55, 65, 75]
bins = [0, 18, 35, 60, 100]
labels = ['Child', 'Young Adult', 'Adult', 'Senior']

categories = pd.cut(ages, bins=bins, labels=labels)
print(categories)
print(categories.value_counts())
```


Exercise 2. Explain the difference between pd.cut() and pd.qcut(). When would you use each?

Solution to Exercise 2

pd.cut() bins data into intervals of equal width (or custom boundaries). pd.qcut() bins data into intervals of equal frequency (each bin has approximately the same number of observations). Use cut() when you have meaningful boundary values (e.g., age groups). Use qcut() when you want to split data into quantiles (e.g., quartiles, deciles).


Exercise 3. Write code that uses pd.qcut() to divide 100 random values into 4 equal-frequency bins. Print the value counts of each bin.

Solution to Exercise 3

```python
import pandas as pd
import numpy as np

np.random.seed(42)
values = np.random.randn(100)

quartiles = pd.qcut(values, q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])
print(quartiles.value_counts())
```


Exercise 4. Create a DataFrame with a 'score' column of random integers from 0 to 100. Use pd.cut() to add a 'grade' column with categories A (90-100), B (80-89), C (70-79), D (60-69), F (0-59).

Solution to Exercise 4

```python
import pandas as pd
import numpy as np

np.random.seed(42)
df = pd.DataFrame({'score': np.random.randint(0, 101, 50)})

# right=False gives [0, 60), [60, 70), ..., [90, 101), so 100 is an 'A'
bins = [0, 60, 70, 80, 90, 101]
labels = ['F', 'D', 'C', 'B', 'A']

df['grade'] = pd.cut(df['score'], bins=bins, labels=labels, right=False)
print(df.head(10))
print(df['grade'].value_counts())
```