Categorical Operations¶
This document covers common operations on categorical data, including groupby performance, merging, and manipulation patterns.
GroupBy Performance¶
Categorical columns provide significant speedup for groupby operations.
import pandas as pd
import numpy as np
import time
# Create test data
np.random.seed(42)
n = 2_000_000
sectors = ['Technology', 'Finance', 'Healthcare', 'Energy', 'Consumer']
df = pd.DataFrame({
'sector_string': np.random.choice(sectors, n),
'returns': np.random.randn(n) * 0.02
})
df['sector_cat'] = df['sector_string'].astype('category')
# Benchmark string groupby
start = time.time()
result_string = df.groupby('sector_string')['returns'].mean()
time_string = time.time() - start
# Benchmark categorical groupby
start = time.time()
result_cat = df.groupby('sector_cat')['returns'].mean()
time_cat = time.time() - start
print(f"String groupby: {time_string:.3f}s")
print(f"Categorical groupby: {time_cat:.3f}s")
print(f"Speedup: {time_string/time_cat:.1f}x")
Why Categorical is Faster¶
- Integer-based hashing: Codes are integers, faster to hash than strings
- Pre-computed categories: No need to discover unique values
- Efficient memory access: Contiguous integer array
Value Counts¶
s = pd.Series(['a', 'b', 'a', 'c', 'a', 'b'], dtype='category')
# Value counts respects category order
counts = s.value_counts()
print(counts)
Including Empty Categories¶
# Add category 'd' that doesn't appear in data
s = s.cat.add_categories(['d'])
# Default: only counts existing values
print(s.value_counts())
# Include empty categories
print(s.value_counts(dropna=False))
Merging with Categoricals¶
Same Categories - Fast Merge¶
# Both DataFrames have same categorical type
sector_dtype = pd.CategoricalDtype(categories=['Tech', 'Finance', 'Health'])
df1 = pd.DataFrame({
'sector': pd.Categorical(['Tech', 'Finance'], dtype=sector_dtype),
'value1': [100, 200]
})
df2 = pd.DataFrame({
'sector': pd.Categorical(['Tech', 'Health'], dtype=sector_dtype),
'value2': [10, 30]
})
# Merge preserves categorical type
result = df1.merge(df2, on='sector', how='outer')
print(result)
print(result['sector'].dtype) # category
Different Categories - May Convert to Object¶
df1 = pd.DataFrame({
'sector': pd.Categorical(['Tech', 'Finance']),
'value1': [100, 200]
})
df2 = pd.DataFrame({
'sector': pd.Categorical(['Tech', 'Health']), # Different categories
'value2': [10, 30]
})
result = df1.merge(df2, on='sector', how='outer')
print(result['sector'].dtype) # May be object or unified category
Best Practice: Unify Categories Before Merge¶
# Define common dtype
common_dtype = pd.CategoricalDtype(categories=['Tech', 'Finance', 'Health'])
df1['sector'] = df1['sector'].astype(common_dtype)
df2['sector'] = df2['sector'].astype(common_dtype)
result = df1.merge(df2, on='sector', how='outer')
print(result['sector'].dtype) # category (preserved)
Concatenation¶
s1 = pd.Series(['a', 'b'], dtype='category')
s2 = pd.Series(['b', 'c'], dtype='category')
# Concatenation unifies categories
result = pd.concat([s1, s2], ignore_index=True)
print(result)
print(result.cat.categories) # ['a', 'b', 'c']
Pivot Tables with Categoricals¶
df = pd.DataFrame({
'date': pd.date_range('2024-01-01', periods=6),
'sector': pd.Categorical(['Tech', 'Finance', 'Tech', 'Finance', 'Tech', 'Finance']),
'region': pd.Categorical(['East', 'East', 'West', 'West', 'East', 'West']),
'sales': [100, 200, 150, 250, 120, 180]
})
# Pivot preserves categorical in result
pivot = df.pivot_table(
values='sales',
index='sector',
columns='region',
aggfunc='sum'
)
print(pivot)
String Operations on Categoricals¶
To use string methods, convert to string first:
s = pd.Series(['apple', 'banana', 'cherry'], dtype='category')
# Direct string methods don't work on categories
# s.str.upper() # Works but operates on categories
# Convert to string for string operations
s_upper = s.astype(str).str.upper()
print(s_upper)
# Or rename categories
s_upper = s.cat.rename_categories(str.upper)
print(s_upper)
Arithmetic Not Supported¶
Categorical data doesn't support arithmetic operations:
s = pd.Series([1, 2, 3], dtype='category')
# These will fail or produce unexpected results
# s + 1 # TypeError
# s * 2 # TypeError
# Convert to numeric first
s_numeric = s.astype(int)
print(s_numeric + 1) # Works
Replacing Values¶
Using cat.rename_categories()¶
s = pd.Series(['old_a', 'old_b', 'old_a'], dtype='category')
s = s.cat.rename_categories({'old_a': 'new_a', 'old_b': 'new_b'})
print(s)
Using replace() - Converts to Object¶
s = pd.Series(['a', 'b', 'a'], dtype='category')
s_replaced = s.replace({'a': 'x'})
print(s_replaced.dtype) # object (no longer categorical)
# To preserve categorical, use rename_categories instead
Filtering Patterns¶
Using isin()¶
s = pd.Series(['Tech', 'Finance', 'Health', 'Energy', 'Tech'], dtype='category')
# Filter multiple categories
tech_finance = s[s.isin(['Tech', 'Finance'])]
print(tech_finance)
Using Boolean Masks¶
df = pd.DataFrame({
'sector': pd.Categorical(['Tech', 'Finance', 'Health', 'Tech']),
'value': [100, 200, 150, 120]
})
# Single category
tech = df[df['sector'] == 'Tech']
# Multiple categories
selected = df[df['sector'].isin(['Tech', 'Health'])]
Handling Missing Categories After Filter¶
After filtering, unused categories remain:
s = pd.Series(['a', 'b', 'c', 'a', 'b'], dtype='category')
filtered = s[s != 'c']
print(filtered.cat.categories) # Still ['a', 'b', 'c']
# Remove unused categories
filtered = filtered.cat.remove_unused_categories()
print(filtered.cat.categories) # ['a', 'b']
Aggregation Functions¶
Most aggregation functions work with categoricals:
df = pd.DataFrame({
'category': pd.Categorical(['A', 'B', 'A', 'B', 'A']),
'value': [10, 20, 15, 25, 12]
})
# These work
print(df.groupby('category')['value'].sum())
print(df.groupby('category')['value'].mean())
print(df.groupby('category')['value'].std())
print(df.groupby('category')['value'].min())
print(df.groupby('category')['value'].max())
print(df.groupby('category')['value'].count())
Transform with Categoricals¶
df = pd.DataFrame({
'category': pd.Categorical(['A', 'B', 'A', 'B', 'A']),
'value': [10, 20, 15, 25, 12]
})
# Transform preserves original index
df['category_mean'] = df.groupby('category')['value'].transform('mean')
print(df)
Categorical in MultiIndex¶
# Create MultiIndex with categorical level
arrays = [
pd.Categorical(['A', 'A', 'B', 'B']),
[1, 2, 1, 2]
]
index = pd.MultiIndex.from_arrays(arrays, names=['category', 'number'])
df = pd.DataFrame({'value': [10, 20, 30, 40]}, index=index)
print(df)
# Groupby on categorical level is fast
print(df.groupby(level='category').sum())
Summary of Operations¶
| Operation | Categorical Support | Notes |
|---|---|---|
| GroupBy | ✅ Excellent | Faster than string |
| Value Counts | ✅ Good | Respects category order |
| Merge | ✅ Good | Best with same categories |
| Concat | ✅ Good | Categories unified |
| Pivot | ✅ Good | Preserves categorical |
| String methods | ⚠️ Limited | Use rename_categories |
| Arithmetic | ❌ No | Convert to numeric |
| Replace | ⚠️ Limited | Use rename_categories |
| Filtering | ✅ Good | Remember to clean unused |
| Aggregation | ✅ Excellent | All standard functions |