Categorical Accessor (cat)
The cat accessor in pandas provides methods and properties for working with categorical data. It allows you to inspect, modify, and manipulate the categories of a Categorical Series.
Mental Model
Think of .cat as a control panel attached to any categorical Series. Just as a TV remote only works with a TV, .cat only activates on categorical dtype -- but once you have it, you can rename, reorder, add, or remove the category labels that define the column's vocabulary.
Overview
```python
import pandas as pd

s = pd.Series(['low', 'medium', 'high', 'low'], dtype='category')

# Access categorical methods via the .cat accessor
print(s.cat.categories)
```
```
Index(['high', 'low', 'medium'], dtype='object')
```
Prerequisites
The cat accessor only works with categorical dtype:
```python
# String column: the cat accessor is NOT available
s = pd.Series(['a', 'b', 'c'])
# s.cat.categories  # AttributeError

# Convert to categorical first
s = s.astype('category')
print(s.cat.categories)  # Now works
```
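When a column's dtype is not known in advance, one way to avoid the AttributeError is to check the dtype before touching .cat. The following is a minimal sketch; `category_labels` is a hypothetical helper written for illustration and is not part of pandas.

```python
import pandas as pd

def category_labels(s: pd.Series) -> list:
    """Return the category labels of s, converting to categorical if needed.

    Hypothetical helper for illustration only; not a pandas function.
    """
    if not isinstance(s.dtype, pd.CategoricalDtype):
        s = s.astype('category')
    return s.cat.categories.tolist()

plain = pd.Series(['a', 'b', 'c'])

try:
    plain.cat  # accessing .cat on a non-categorical Series raises
except AttributeError as exc:
    print("AttributeError:", exc)

print(category_labels(plain))
```

The helper works on a converted copy, so the caller's Series keeps its original dtype.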
Properties
categories
Returns the categories of the categorical.
```python
s = pd.Series(['apple', 'banana', 'apple', 'cherry'], dtype='category')
print(s.cat.categories)
```
```
Index(['apple', 'banana', 'cherry'], dtype='object')
```
codes
Returns the integer codes representing each category.
```python
s = pd.Series(['apple', 'banana', 'apple', 'cherry'], dtype='category')
print(s.cat.codes)
```
```
0    0
1    1
2    0
3    2
dtype: int8
```
Each code is an integer position in the categories index, and missing values get the code -1. Storing small integers instead of repeated strings is how categorical data achieves memory efficiency.
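To make the relationship between codes and categories concrete, the sketch below rebuilds the original values by looking up each code in the label array. The variable names (`codes`, `labels`, `rebuilt`) are illustrative, and the example assumes string categories with one missing value.

```python
import numpy as np
import pandas as pd

s = pd.Series(['apple', 'banana', None, 'cherry', 'apple'], dtype='category')

codes = s.cat.codes.to_numpy()
labels = s.cat.categories.to_numpy()
print(codes.tolist())  # the missing value is stored as code -1

# Rebuild the original values: look up each valid code in the label array
valid = codes >= 0
rebuilt = np.full(len(codes), None, dtype=object)
rebuilt[valid] = labels[codes[valid]]
print(rebuilt.tolist())
```

The `valid` mask matters because -1 would otherwise index the last label instead of marking a missing value.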
ordered
Returns whether the categorical has an order.
```python
s = pd.Series(['low', 'medium', 'high'], dtype='category')
print(s.cat.ordered)  # False

# Create ordered categorical
s_ordered = pd.Categorical(
    ['low', 'medium', 'high'],
    categories=['low', 'medium', 'high'],
    ordered=True
)
print(pd.Series(s_ordered).cat.ordered)  # True
```
Category Management Methods
add_categories()
Add new categories.
```python
s = pd.Series(['a', 'b', 'a'], dtype='category')
print(s.cat.categories)  # ['a', 'b']

s = s.cat.add_categories(['c', 'd'])
print(s.cat.categories)  # ['a', 'b', 'c', 'd']
```
Note: Adding categories doesn't add data values—it just expands the allowed categories.
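A short sketch of what the note means in practice: after add_categories() the data is unchanged, the new category shows up in value_counts() with a count of zero, and assigning the new label becomes legal. The variable name `counts` is illustrative.

```python
import pandas as pd

s = pd.Series(['a', 'b', 'a'], dtype='category')
s = s.cat.add_categories(['c'])

print(s.tolist())          # the data itself is unchanged

counts = s.value_counts()  # unused categories are listed with a count of 0
print(counts)

s.iloc[1] = 'c'            # 'c' is now an allowed value
print(s.tolist())
```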
remove_categories()
Remove categories (values become NaN).
```python
s = pd.Series(['a', 'b', 'c', 'a'], dtype='category')
s = s.cat.remove_categories(['c'])
print(s)
```
```
0      a
1      b
2    NaN
3      a
dtype: category
Categories (2, object): ['a', 'b']
```
⚠️ Warning: Removing a category doesn't remove rows—it converts those values to NaN.
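If the goal is to drop the rows as well, two common patterns are sketched below. The names `dropped` and `filtered` are illustrative. Note that the first option also drops rows that were already NaN before the category was removed.

```python
import pandas as pd

s = pd.Series(['a', 'b', 'c', 'a'], dtype='category')

# Option 1: remove the category, then drop the resulting NaN rows
dropped = s.cat.remove_categories(['c']).dropna()

# Option 2: filter the rows first, then discard the now-unused category
filtered = s[s != 'c'].cat.remove_unused_categories()

print(dropped.tolist())
print(filtered.cat.categories.tolist())
```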
remove_unused_categories()
Remove categories that don't appear in the data.
```python
s = pd.Series(['a', 'b', 'a'], dtype='category')
s = s.cat.add_categories(['c', 'd'])  # Add unused categories
print(s.cat.categories)  # ['a', 'b', 'c', 'd']

s = s.cat.remove_unused_categories()
print(s.cat.categories)  # ['a', 'b']
```
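set_categories()
Set categories to a new list, replacing the existing ones. Categories in the new list that are not in the data are added as unused; values whose category is left out of the new list become NaN.
```python
s = pd.Series(['a', 'b', 'c'], dtype='category')
s = s.cat.set_categories(['a', 'b', 'c', 'd', 'e'])
print(s.cat.categories)
```
```
Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
```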
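Because set_categories() replaces the whole list, it can add and remove categories in one call. The sketch below leaves out 'c' and introduces an unused 'z'; the name `s2` is illustrative.

```python
import pandas as pd

s = pd.Series(['a', 'b', 'c'], dtype='category')

# 'c' is left out of the new list; 'z' is new and unused
s2 = s.cat.set_categories(['a', 'b', 'z'])

print(s2.isna().tolist())          # the former 'c' value is now missing
print(s2.cat.categories.tolist())
```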
rename_categories()
Rename existing categories.
```python
s = pd.Series(['a', 'b', 'c', 'a'], dtype='category')

# Using a dictionary
s = s.cat.rename_categories({'a': 'alpha', 'b': 'beta', 'c': 'gamma'})
print(s)
```
```
0    alpha
1     beta
2    gamma
3    alpha
dtype: category
Categories (3, object): ['alpha', 'beta', 'gamma']
```
```python
# Using a function
s = pd.Series(['a', 'b', 'c'], dtype='category')
s = s.cat.rename_categories(lambda x: x.upper())
print(s.cat.categories)  # ['A', 'B', 'C']
```
reorder_categories()
Reorder the existing categories. The new list must contain exactly the same categories as before; pass ordered=True to also mark the categorical as ordered.
```python
s = pd.Series(['low', 'medium', 'high'], dtype='category')
s = s.cat.reorder_categories(['low', 'medium', 'high'], ordered=True)
print(s.cat.categories)
```
```
Index(['low', 'medium', 'high'], dtype='object')
```
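One visible effect of reordering is that sorting follows the category order instead of alphabetical order. The sketch below also shows that reorder_categories() rejects a list that drops a category; use set_categories() or remove_categories() for that.

```python
import pandas as pd

s = pd.Series(['medium', 'high', 'low'], dtype='category')
s = s.cat.reorder_categories(['low', 'medium', 'high'], ordered=True)

# Sorting follows the category order, not alphabetical order
print(s.sort_values().tolist())

# The new list must contain exactly the existing categories
try:
    s.cat.reorder_categories(['low', 'high'])
except ValueError as exc:
    print("ValueError:", exc)
```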
Ordering Methods
as_ordered()
Make the categorical ordered.
```python
s = pd.Series(['low', 'medium', 'high'], dtype='category')
print(s.cat.ordered)  # False

s = s.cat.as_ordered()
print(s.cat.ordered)  # True
```
as_unordered()
Make the categorical unordered.
```python
# s is the ordered Series from the as_ordered() example
s = s.cat.as_unordered()
print(s.cat.ordered)  # False
```
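The ordered flag decides which operations are allowed. The sketch below shows that an unordered categorical rejects inequality comparisons, and that as_ordered() keeps whatever category order already exists. Here that order is alphabetical, which is probably not the intended low < medium < high.

```python
import pandas as pd

s = pd.Series(['low', 'medium', 'high'], dtype='category')

# Unordered: only equality comparisons are defined
try:
    s < 'medium'
except TypeError as exc:
    print("TypeError:", exc)

# as_ordered() keeps the current category order (alphabetical here)
ordered = s.cat.as_ordered()
print(ordered.cat.categories.tolist())
print(ordered.min(), ordered.max())
```

When the order has a meaning, set it explicitly with reorder_categories(..., ordered=True) rather than relying on as_ordered() alone.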
Practical Examples
Stock Sector Analysis
```python
import pandas as pd
import numpy as np

# Create stock data
np.random.seed(42)
sectors = ['Technology', 'Finance', 'Healthcare', 'Energy', 'Consumer']
df = pd.DataFrame({
    'ticker': [f'STOCK_{i}' for i in range(1000)],
    'sector': np.random.choice(sectors, 1000),
    'returns': np.random.randn(1000) * 0.02
})

# Convert to categorical
df['sector'] = df['sector'].astype('category')

# Check categories
print(df['sector'].cat.categories)

# Reorder for logical grouping
df['sector'] = df['sector'].cat.reorder_categories(
    ['Technology', 'Healthcare', 'Finance', 'Consumer', 'Energy']
)

# Grouping on a categorical column is typically faster than on strings.
# observed=True is passed explicitly because its default differs
# between pandas versions.
sector_returns = df.groupby('sector', observed=True)['returns'].mean()
print(sector_returns)
```
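When grouping on a categorical column, the observed parameter of groupby() controls whether categories with no rows appear in the result. A minimal sketch with a small, made-up frame in which the 'Energy' category is declared but never used:

```python
import pandas as pd

# Illustrative data: 'Energy' is a declared category with no rows
df_small = pd.DataFrame({
    'sector': pd.Categorical(
        ['Technology', 'Finance', 'Technology'],
        categories=['Technology', 'Finance', 'Energy'],
    ),
    'returns': [0.01, -0.02, 0.03],
})

# observed=False keeps every category, including ones with no rows
all_sectors = df_small.groupby('sector', observed=False)['returns'].size()

# observed=True keeps only categories that occur in the data
seen_sectors = df_small.groupby('sector', observed=True)['returns'].size()

print(all_sectors)
print(seen_sectors)
```

Passing the argument explicitly keeps the result the same across pandas versions.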
Credit Rating Analysis
```python
# Credit ratings have natural order
ratings = pd.Series(['BBB', 'AA', 'AAA', 'BB', 'A', 'BBB', 'AA'])

# Convert to ordered categorical
rating_order = ['BB', 'BBB', 'A', 'AA', 'AAA']
ratings = pd.Categorical(ratings, categories=rating_order, ordered=True)
ratings = pd.Series(ratings)

# Now comparisons work
print(ratings > 'BBB')
```
```
0    False
1     True
2     True
3    False
4     True
5    False
6     True
dtype: bool
```
```python
# Filter investment grade (BBB and above)
investment_grade = ratings[ratings >= 'BBB']
print(investment_grade)
```
Survey Response Analysis
```python
# Survey responses with natural order
responses = pd.Series([
    'Strongly Disagree', 'Disagree', 'Neutral',
    'Agree', 'Strongly Agree', 'Agree', 'Neutral'
])

# Define order
response_order = [
    'Strongly Disagree', 'Disagree', 'Neutral', 'Agree', 'Strongly Agree'
]

responses = pd.Categorical(responses, categories=response_order, ordered=True)
responses = pd.Series(responses)

# Find positive responses
positive = responses[responses > 'Neutral']
print(positive)
```
Memory Comparison
```python
import pandas as pd
import numpy as np

# Create large dataset
n = 1_000_000
categories = ['Cat_A', 'Cat_B', 'Cat_C', 'Cat_D', 'Cat_E']
data = np.random.choice(categories, n)

# As string (object)
s_string = pd.Series(data)
print(f"String memory: {s_string.memory_usage(deep=True) / 1e6:.2f} MB")

# As categorical
s_cat = pd.Series(data, dtype='category')
print(f"Categorical memory: {s_cat.memory_usage(deep=True) / 1e6:.2f} MB")

# Ratio
ratio = s_string.memory_usage(deep=True) / s_cat.memory_usage(deep=True)
print(f"Memory reduction: {ratio:.1f}x")
```
Example output (exact figures vary with the pandas and Python versions):
```
String memory: 57.00 MB
Categorical memory: 1.00 MB
Memory reduction: 57.0x
```
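Part of the saving comes from the width of the codes: pandas picks the smallest signed integer type that can hold every code. A brief sketch with two made-up Series (`few` and `many` are illustrative names):

```python
import pandas as pd

few = pd.Series(['a', 'b', 'c'], dtype='category')
many = pd.Series([f'label_{i}' for i in range(200)], dtype='category')

print(few.cat.codes.dtype)   # 3 categories fit in one byte per value
print(many.cat.codes.dtype)  # 200 categories need a wider integer
```

So the benefit is largest when a column has few distinct values relative to its length.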
Summary of cat Methods
| Method/Property | Description |
|---|---|
| cat.categories | Get the categories |
| cat.codes | Integer codes for values |
| cat.ordered | Check if ordered |
| cat.add_categories() | Add new categories |
| cat.remove_categories() | Remove categories (affected values become NaN) |
| cat.remove_unused_categories() | Remove unused categories |
| cat.set_categories() | Replace all categories |
| cat.rename_categories() | Rename categories |
| cat.reorder_categories() | Reorder categories |
| cat.as_ordered() | Make ordered |
| cat.as_unordered() | Make unordered |
Exercises
Exercise 1.
Create a Series with values ['small', 'medium', 'large', 'medium', 'small'] and convert it to a categorical type. Use .cat.codes to print the integer codes and .cat.categories to print the category labels.
Solution to Exercise 1
Convert to categorical and inspect codes and categories.
```python
import pandas as pd

s = pd.Series(['small', 'medium', 'large', 'medium', 'small'], dtype='category')
print("Codes:", s.cat.codes.tolist())
print("Categories:", s.cat.categories.tolist())
```
Exercise 2.
Create an ordered categorical Series with the custom order ['bronze', 'silver', 'gold']. Add a new category 'platinum' using .cat.add_categories(). Then verify that 'platinum' appears in the categories even though no element has that value.
Solution to Exercise 2
Create ordered categorical and add a new category.
```python
import pandas as pd
from pandas.api.types import CategoricalDtype

cat_type = CategoricalDtype(
    categories=['bronze', 'silver', 'gold'],
    ordered=True
)
s = pd.Series(['bronze', 'silver', 'gold', 'silver'], dtype=cat_type)
s = s.cat.add_categories('platinum')
print(s.cat.categories)  # ['bronze', 'silver', 'gold', 'platinum']
print('platinum' in s.cat.categories)  # True
```
Exercise 3.
Given a categorical Series with categories ['A', 'B', 'C', 'D'] where category 'D' is never used, use .cat.remove_unused_categories() to clean it up. Then rename the remaining categories to ['Alpha', 'Beta', 'Gamma'] using .cat.rename_categories().
Solution to Exercise 3
Remove unused categories and rename the rest.
```python
import pandas as pd

s = pd.Series(
    pd.Categorical(['A', 'B', 'C', 'A', 'B'],
                   categories=['A', 'B', 'C', 'D'])
)
s = s.cat.remove_unused_categories()
print("After removing unused:", s.cat.categories.tolist())  # ['A', 'B', 'C']

s = s.cat.rename_categories({'A': 'Alpha', 'B': 'Beta', 'C': 'Gamma'})
print(s)
```