Creating Categoricals¶
There are several ways to create categorical data in pandas, from simple type conversion to explicit construction with custom categories and ordering.
Mental Model
Creating a categorical is like defining an enum: you declare the allowed values (categories) and then each element is just a pointer into that list. The simplest path is astype('category') which infers the categories from the data, but CategoricalDtype gives you full control over which values are allowed and their order.
Method 1: Using astype('category')¶
The simplest way to convert existing data to categorical.
```python import pandas as pd
From a Series¶
s = pd.Series(['apple', 'banana', 'apple', 'cherry']) s_cat = s.astype('category') print(s_cat) ```
0 apple
1 banana
2 apple
3 cherry
dtype: category
Categories (3, object): ['apple', 'banana', 'cherry']
```python
From a DataFrame column¶
df = pd.DataFrame({ 'product': ['A', 'B', 'A', 'C', 'B'], 'price': [100, 200, 100, 300, 200] })
df['product'] = df['product'].astype('category') print(df['product'].dtype) # category ```
Categories are Automatically Inferred¶
When using astype('category'), pandas automatically:
- Identifies unique values
- Creates categories in sorted order (alphabetical for strings)
- Assigns integer codes to each value
python
s = pd.Series(['zebra', 'apple', 'mango', 'apple'])
s_cat = s.astype('category')
print(s_cat.cat.categories) # ['apple', 'mango', 'zebra'] (sorted)
Method 2: Using pd.Categorical()¶
For explicit control over categories and ordering.
Basic Construction¶
python
cat = pd.Categorical(['a', 'b', 'c', 'a', 'b'])
print(cat)
['a', 'b', 'c', 'a', 'b']
Categories (3, object): ['a', 'b', 'c']
Specifying Categories¶
Define the allowed categories explicitly:
```python
Only these categories are valid¶
cat = pd.Categorical( ['small', 'medium', 'large', 'small'], categories=['small', 'medium', 'large', 'extra-large'] ) print(cat) ```
['small', 'medium', 'large', 'small']
Categories (4, object): ['small', 'medium', 'large', 'extra-large']
Note: 'extra-large' is a valid category even though it doesn't appear in the data.
Ordered Categories¶
Create categories with logical ordering:
python
cat = pd.Categorical(
['medium', 'small', 'large', 'small'],
categories=['small', 'medium', 'large'],
ordered=True
)
print(cat)
['medium', 'small', 'large', 'small']
Categories (3, object): ['small' < 'medium' < 'large']
The < symbols indicate ordering.
Handling Invalid Values¶
Values not in categories become NaN:
python
cat = pd.Categorical(
['a', 'b', 'c', 'd'], # 'd' not in categories
categories=['a', 'b', 'c']
)
print(cat)
['a', 'b', 'c', NaN]
Categories (3, object): ['a', 'b', 'c']
Method 3: Using pd.CategoricalDtype¶
Define a categorical type for reuse across multiple columns.
```python
Define the dtype once¶
size_dtype = pd.CategoricalDtype( categories=['S', 'M', 'L', 'XL'], ordered=True )
Apply to multiple columns¶
df = pd.DataFrame({ 'shirt_size': ['M', 'L', 'S', 'XL'], 'pants_size': ['L', 'M', 'M', 'L'] })
df['shirt_size'] = df['shirt_size'].astype(size_dtype) df['pants_size'] = df['pants_size'].astype(size_dtype)
print(df.dtypes) ```
shirt_size category
pants_size category
dtype: object
CategoricalDtype with read_csv¶
```python
Define dtype before reading¶
rating_dtype = pd.CategoricalDtype( categories=['AAA', 'AA', 'A', 'BBB', 'BB', 'B'], ordered=True )
Apply during read¶
df = pd.read_csv('bonds.csv', dtype={'rating': rating_dtype}) ```
Method 4: During DataFrame Creation¶
Specify categorical dtype when creating a DataFrame:
```python df = pd.DataFrame({ 'name': ['Alice', 'Bob', 'Charlie'], 'grade': pd.Categorical(['A', 'B', 'A'], ordered=True) })
print(df['grade'].dtype) # category ```
Method 5: Using Series Constructor¶
python
s = pd.Series(
pd.Categorical(['low', 'medium', 'high', 'low']),
name='priority'
)
print(s)
Practical Examples¶
Stock Sectors¶
```python import numpy as np
Define valid sectors¶
sector_dtype = pd.CategoricalDtype(categories=[ 'Technology', 'Healthcare', 'Finance', 'Energy', 'Consumer Discretionary', 'Consumer Staples', 'Industrials', 'Materials', 'Utilities', 'Real Estate', 'Communication Services' ])
Create stock data¶
stocks = pd.DataFrame({ 'ticker': ['AAPL', 'JNJ', 'JPM', 'XOM', 'AMZN'], 'sector': ['Technology', 'Healthcare', 'Finance', 'Energy', 'Consumer Discretionary'] })
stocks['sector'] = stocks['sector'].astype(sector_dtype) print(stocks['sector'].cat.categories) ```
Credit Ratings¶
```python
Ratings with natural order (AAA is best)¶
rating_dtype = pd.CategoricalDtype( categories=['D', 'C', 'CC', 'CCC', 'B', 'BB', 'BBB', 'A', 'AA', 'AAA'], ordered=True )
bonds = pd.DataFrame({ 'issuer': ['Company A', 'Company B', 'Company C'], 'rating': ['AA', 'BBB', 'A'] })
bonds['rating'] = bonds['rating'].astype(rating_dtype)
Now we can compare ratings¶
print(bonds[bonds['rating'] >= 'A']) # Investment grade ```
Survey Responses¶
```python
Likert scale with order¶
likert_dtype = pd.CategoricalDtype( categories=[ 'Strongly Disagree', 'Disagree', 'Neutral', 'Agree', 'Strongly Agree' ], ordered=True )
survey = pd.DataFrame({ 'respondent_id': [1, 2, 3, 4, 5], 'satisfaction': ['Agree', 'Neutral', 'Strongly Agree', 'Disagree', 'Agree'] })
survey['satisfaction'] = survey['satisfaction'].astype(likert_dtype)
Find positive responses¶
positive = survey[survey['satisfaction'] > 'Neutral'] print(positive) ```
Summary of Creation Methods¶
| Method | Use Case | Example |
|---|---|---|
astype('category') |
Quick conversion | s.astype('category') |
pd.Categorical() |
Custom categories/order | pd.Categorical(data, categories=[...]) |
pd.CategoricalDtype |
Reusable type definition | dtype = pd.CategoricalDtype(...) |
| In DataFrame creation | Direct specification | pd.DataFrame({'col': pd.Categorical(...)}) |
Best Practices¶
- Define categories explicitly when you know the valid values
- Use ordered=True when categories have logical order
- Use CategoricalDtype for consistency across multiple columns
- Specify dtype in read_csv to save memory on large files
- Handle invalid values - they become NaN silently
Exercises¶
Exercise 1. Write code that creates a Pandas Categorical from the list ['low', 'medium', 'high', 'medium', 'low'] with categories ['low', 'medium', 'high'] and ordered=True.
Solution to Exercise 1
```python import pandas as pd
cat = pd.Categorical( ['low', 'medium', 'high', 'medium', 'low'], categories=['low', 'medium', 'high'], ordered=True ) print(cat) print(f'Ordered: {cat.ordered}') ```
Exercise 2. Explain the difference between pd.Categorical() and astype('category'). When would you use each?
Solution to Exercise 2
pd.Categorical() creates a standalone categorical array where you can specify the categories and ordering explicitly at creation time. .astype('category') converts an existing Series to a categorical dtype, inferring the categories from the data. Use pd.Categorical() when you need to set custom categories or ordering upfront. Use .astype('category') for quick conversion of an existing column.
Exercise 3. Create a DataFrame with a string column and convert it to categorical using .astype('category'). Print the category codes and the memory usage before and after conversion.
Solution to Exercise 3
```python import pandas as pd
df = pd.DataFrame({'color': ['red', 'blue', 'green', 'red', 'blue'] * 100}) print(f'Before: {df["color"].memory_usage()} bytes') df['color'] = df['color'].astype('category') print(f'After: {df["color"].memory_usage()} bytes') print(df['color'].cat.codes[:5]) ```
Exercise 4. Write code that creates a categorical with custom categories that include a category not present in the data (e.g., 'very_high'). Show that this category appears in .cat.categories.
Solution to Exercise 4
```python import pandas as pd
cat = pd.Categorical( ['low', 'medium', 'high'], categories=['low', 'medium', 'high', 'very_high'] ) print(cat) print(f'Categories: {cat.categories.tolist()}') ```