Categorical Operations¶
This document covers common operations on categorical data, including groupby performance, merging, and manipulation patterns.
Mental Model
Because categoricals store integer codes internally, operations like groupby and merge can work on integers instead of strings, which is dramatically faster. Think of categoricals as pre-indexed data: the expensive string-comparison work is done once at creation time, and every subsequent operation benefits.
GroupBy Performance¶
Categorical columns provide significant speedup for groupby operations.
```python import pandas as pd import numpy as np import time
Create test data¶
np.random.seed(42) n = 2_000_000 sectors = ['Technology', 'Finance', 'Healthcare', 'Energy', 'Consumer']
df = pd.DataFrame({ 'sector_string': np.random.choice(sectors, n), 'returns': np.random.randn(n) * 0.02 })
df['sector_cat'] = df['sector_string'].astype('category')
Benchmark string groupby¶
start = time.time() result_string = df.groupby('sector_string')['returns'].mean() time_string = time.time() - start
Benchmark categorical groupby¶
start = time.time() result_cat = df.groupby('sector_cat')['returns'].mean() time_cat = time.time() - start
print(f"String groupby: {time_string:.3f}s") print(f"Categorical groupby: {time_cat:.3f}s") print(f"Speedup: {time_string/time_cat:.1f}x") ```
Why Categorical is Faster¶
- Integer-based hashing: Codes are integers, faster to hash than strings
- Pre-computed categories: No need to discover unique values
- Efficient memory access: Contiguous integer array
Value Counts¶
```python s = pd.Series(['a', 'b', 'a', 'c', 'a', 'b'], dtype='category')
Value counts respects category order¶
counts = s.value_counts() print(counts) ```
Including Empty Categories¶
```python
Add category 'd' that doesn't appear in data¶
s = s.cat.add_categories(['d'])
Default: only counts existing values¶
print(s.value_counts())
Include empty categories¶
print(s.value_counts(dropna=False)) ```
Merging with Categoricals¶
Same Categories - Fast Merge¶
```python
Both DataFrames have same categorical type¶
sector_dtype = pd.CategoricalDtype(categories=['Tech', 'Finance', 'Health'])
df1 = pd.DataFrame({ 'sector': pd.Categorical(['Tech', 'Finance'], dtype=sector_dtype), 'value1': [100, 200] })
df2 = pd.DataFrame({ 'sector': pd.Categorical(['Tech', 'Health'], dtype=sector_dtype), 'value2': [10, 30] })
Merge preserves categorical type¶
result = df1.merge(df2, on='sector', how='outer') print(result) print(result['sector'].dtype) # category ```
Different Categories - May Convert to Object¶
```python df1 = pd.DataFrame({ 'sector': pd.Categorical(['Tech', 'Finance']), 'value1': [100, 200] })
df2 = pd.DataFrame({ 'sector': pd.Categorical(['Tech', 'Health']), # Different categories 'value2': [10, 30] })
result = df1.merge(df2, on='sector', how='outer') print(result['sector'].dtype) # May be object or unified category ```
Best Practice: Unify Categories Before Merge¶
```python
Define common dtype¶
common_dtype = pd.CategoricalDtype(categories=['Tech', 'Finance', 'Health'])
df1['sector'] = df1['sector'].astype(common_dtype) df2['sector'] = df2['sector'].astype(common_dtype)
result = df1.merge(df2, on='sector', how='outer') print(result['sector'].dtype) # category (preserved) ```
Concatenation¶
```python s1 = pd.Series(['a', 'b'], dtype='category') s2 = pd.Series(['b', 'c'], dtype='category')
Concatenation unifies categories¶
result = pd.concat([s1, s2], ignore_index=True) print(result) print(result.cat.categories) # ['a', 'b', 'c'] ```
Pivot Tables with Categoricals¶
```python df = pd.DataFrame({ 'date': pd.date_range('2024-01-01', periods=6), 'sector': pd.Categorical(['Tech', 'Finance', 'Tech', 'Finance', 'Tech', 'Finance']), 'region': pd.Categorical(['East', 'East', 'West', 'West', 'East', 'West']), 'sales': [100, 200, 150, 250, 120, 180] })
Pivot preserves categorical in result¶
pivot = df.pivot_table( values='sales', index='sector', columns='region', aggfunc='sum' ) print(pivot) ```
String Operations on Categoricals¶
To use string methods, convert to string first:
```python s = pd.Series(['apple', 'banana', 'cherry'], dtype='category')
Direct string methods don't work on categories¶
s.str.upper() # Works but operates on categories¶
Convert to string for string operations¶
s_upper = s.astype(str).str.upper() print(s_upper)
Or rename categories¶
s_upper = s.cat.rename_categories(str.upper) print(s_upper) ```
Arithmetic Not Supported¶
Categorical data doesn't support arithmetic operations:
```python s = pd.Series([1, 2, 3], dtype='category')
These will fail or produce unexpected results¶
s + 1 # TypeError¶
s * 2 # TypeError¶
Convert to numeric first¶
s_numeric = s.astype(int) print(s_numeric + 1) # Works ```
Replacing Values¶
Using cat.rename_categories()¶
python
s = pd.Series(['old_a', 'old_b', 'old_a'], dtype='category')
s = s.cat.rename_categories({'old_a': 'new_a', 'old_b': 'new_b'})
print(s)
Using replace() - Converts to Object¶
```python s = pd.Series(['a', 'b', 'a'], dtype='category') s_replaced = s.replace({'a': 'x'}) print(s_replaced.dtype) # object (no longer categorical)
To preserve categorical, use rename_categories instead¶
```
Filtering Patterns¶
Using isin()¶
```python s = pd.Series(['Tech', 'Finance', 'Health', 'Energy', 'Tech'], dtype='category')
Filter multiple categories¶
tech_finance = s[s.isin(['Tech', 'Finance'])] print(tech_finance) ```
Using Boolean Masks¶
```python df = pd.DataFrame({ 'sector': pd.Categorical(['Tech', 'Finance', 'Health', 'Tech']), 'value': [100, 200, 150, 120] })
Single category¶
tech = df[df['sector'] == 'Tech']
Multiple categories¶
selected = df[df['sector'].isin(['Tech', 'Health'])] ```
Handling Missing Categories After Filter¶
After filtering, unused categories remain:
```python s = pd.Series(['a', 'b', 'c', 'a', 'b'], dtype='category') filtered = s[s != 'c'] print(filtered.cat.categories) # Still ['a', 'b', 'c']
Remove unused categories¶
filtered = filtered.cat.remove_unused_categories() print(filtered.cat.categories) # ['a', 'b'] ```
Aggregation Functions¶
Most aggregation functions work with categoricals:
```python df = pd.DataFrame({ 'category': pd.Categorical(['A', 'B', 'A', 'B', 'A']), 'value': [10, 20, 15, 25, 12] })
These work¶
print(df.groupby('category')['value'].sum()) print(df.groupby('category')['value'].mean()) print(df.groupby('category')['value'].std()) print(df.groupby('category')['value'].min()) print(df.groupby('category')['value'].max()) print(df.groupby('category')['value'].count()) ```
Transform with Categoricals¶
```python df = pd.DataFrame({ 'category': pd.Categorical(['A', 'B', 'A', 'B', 'A']), 'value': [10, 20, 15, 25, 12] })
Transform preserves original index¶
df['category_mean'] = df.groupby('category')['value'].transform('mean') print(df) ```
Categorical in MultiIndex¶
```python
Create MultiIndex with categorical level¶
arrays = [ pd.Categorical(['A', 'A', 'B', 'B']), [1, 2, 1, 2] ] index = pd.MultiIndex.from_arrays(arrays, names=['category', 'number'])
df = pd.DataFrame({'value': [10, 20, 30, 40]}, index=index) print(df)
Groupby on categorical level is fast¶
print(df.groupby(level='category').sum()) ```
Summary of Operations¶
| Operation | Categorical Support | Notes |
|---|---|---|
| GroupBy | ✅ Excellent | Faster than string |
| Value Counts | ✅ Good | Respects category order |
| Merge | ✅ Good | Best with same categories |
| Concat | ✅ Good | Categories unified |
| Pivot | ✅ Good | Preserves categorical |
| String methods | ⚠️ Limited | Use rename_categories |
| Arithmetic | ❌ No | Convert to numeric |
| Replace | ⚠️ Limited | Use rename_categories |
| Filtering | ✅ Good | Remember to clean unused |
| Aggregation | ✅ Excellent | All standard functions |
Exercises¶
Exercise 1. Write code that adds and removes categories from a categorical Series using .cat.add_categories() and .cat.remove_categories().
Solution to Exercise 1
```python import pandas as pd
See page content for relevant API details¶
s = pd.Series(['a', 'b', 'c', 'a', 'b'], dtype='category') print(s) print(s.cat.categories) print(s.cat.codes) ```
Exercise 2. Explain what happens when you try to assign a value to a categorical Series that is not in the current categories.
Solution to Exercise 2
See the explanation in the main content of this page. The key concept involves understanding the categorical data type and its internal representation in Pandas.
Exercise 3. Write code that renames categories using .cat.rename_categories(). Change ['a', 'b', 'c'] to ['Alpha', 'Beta', 'Gamma'].
Solution to Exercise 3
```python import pandas as pd import numpy as np
np.random.seed(42) df = pd.DataFrame({'col': np.random.choice(['A', 'B', 'C'], 1000)}) df['col'] = df['col'].astype('category') print(df.dtypes) print(df['col'].value_counts()) ```
Exercise 4. Create a categorical Series and use .cat.reorder_categories() to change the category ordering. Show that comparison operators reflect the new order.
Solution to Exercise 4
```python import pandas as pd
s = pd.Categorical(['low', 'medium', 'high', 'low'], categories=['low', 'medium', 'high'], ordered=True) print(s) print(s > 'low') ```