Creating Categoricals¶
There are several ways to create categorical data in pandas, from simple type conversion to explicit construction with custom categories and ordering.
Method 1: Using astype('category')¶
The simplest way to convert existing data to categorical.
import pandas as pd
# From a Series
s = pd.Series(['apple', 'banana', 'apple', 'cherry'])
s_cat = s.astype('category')
print(s_cat)
0     apple
1    banana
2     apple
3    cherry
dtype: category
Categories (3, object): ['apple', 'banana', 'cherry']
# From a DataFrame column
df = pd.DataFrame({
    'product': ['A', 'B', 'A', 'C', 'B'],
    'price': [100, 200, 100, 300, 200]
})
df['product'] = df['product'].astype('category')
print(df['product'].dtype) # category
Categories are Automatically Inferred¶
When using astype('category'), pandas automatically:
- Identifies unique values
- Creates categories in sorted order (lexical for strings), falling back to order of appearance if the values cannot be sorted
- Assigns integer codes to each value
s = pd.Series(['zebra', 'apple', 'mango', 'apple'])
s_cat = s.astype('category')
print(s_cat.cat.categories) # ['apple', 'mango', 'zebra'] (sorted)
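The inference steps above can be checked through the .cat accessor. This is a minimal sketch that reuses the same sample values; the variable names categories and codes are only for illustration.

```python
import pandas as pd

s = pd.Series(['zebra', 'apple', 'mango', 'apple'])
s_cat = s.astype('category')

# Unique values are stored once, in sorted order
categories = list(s_cat.cat.categories)

# Each element is stored as an integer position into the categories
codes = s_cat.cat.codes.tolist()

print(categories)  # ['apple', 'mango', 'zebra']
print(codes)       # [2, 0, 1, 0]
```

The codes are what make categoricals compact: each repeated string is replaced by a small integer.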
Method 2: Using pd.Categorical()¶
For explicit control over categories and ordering.
Basic Construction¶
cat = pd.Categorical(['a', 'b', 'c', 'a', 'b'])
print(cat)
['a', 'b', 'c', 'a', 'b']
Categories (3, object): ['a', 'b', 'c']
Specifying Categories¶
Define the allowed categories explicitly:
# Only these categories are valid
cat = pd.Categorical(
    ['small', 'medium', 'large', 'small'],
    categories=['small', 'medium', 'large', 'extra-large']
)
print(cat)
['small', 'medium', 'large', 'small']
Categories (4, object): ['small', 'medium', 'large', 'extra-large']
Note: 'extra-large' is a valid category even though it doesn't appear in the data.
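One place where an unused category becomes visible is in counts. The sketch below wraps the same values in a Series (the name sizes is illustrative) and shows that value_counts() reports the declared but unobserved category with a count of zero.

```python
import pandas as pd

sizes = pd.Series(pd.Categorical(
    ['small', 'medium', 'large', 'small'],
    categories=['small', 'medium', 'large', 'extra-large']
))

# Every declared category appears in the result, observed or not
counts = sizes.value_counts()

print(counts['small'])        # 2
print(counts['extra-large'])  # 0
```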
Ordered Categories¶
Create categories with logical ordering:
cat = pd.Categorical(
    ['medium', 'small', 'large', 'small'],
    categories=['small', 'medium', 'large'],
    ordered=True
)
print(cat)
['medium', 'small', 'large', 'small']
Categories (3, object): ['small' < 'medium' < 'large']
The < symbols in the output show the ordering, from lowest to highest.
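As a rough illustration of what the ordering enables, the sketch below compares, sorts, and takes the minimum and maximum of the same ordered categorical. The result variable names are illustrative.

```python
import pandas as pd

cat = pd.Categorical(
    ['medium', 'small', 'large', 'small'],
    categories=['small', 'medium', 'large'],
    ordered=True
)

# Comparisons follow the declared order, not alphabetical order
bigger_than_small = (cat > 'small').tolist()

# min() and max() are defined for ordered categoricals
smallest, largest = cat.min(), cat.max()

# Sorting also follows the declared order
sorted_values = pd.Series(cat).sort_values().tolist()

print(bigger_than_small)  # [True, False, True, False]
print(smallest, largest)  # small large
print(sorted_values)      # ['small', 'small', 'medium', 'large']
```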
Handling Invalid Values¶
Values not in categories become NaN:
cat = pd.Categorical(
    ['a', 'b', 'c', 'd'],  # 'd' not in categories
    categories=['a', 'b', 'c']
)
print(cat)
['a', 'b', 'c', NaN]
Categories (3, object): ['a', 'b', 'c']
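Because the conversion raises no error, it can help to look for unexpected values explicitly. The following is one possible check, not the only one; the names allowed, raw, and unexpected are illustrative.

```python
import pandas as pd

allowed = ['a', 'b', 'c']
raw = pd.Series(['a', 'b', 'c', 'd'])

# Before converting: which values fall outside the allowed set?
unexpected = raw[~raw.isin(allowed)].tolist()

# After converting: invalid entries are missing and carry the code -1
cat = pd.Categorical(raw, categories=allowed)
codes = cat.codes.tolist()
n_missing = int(cat.isna().sum())

print(unexpected)  # ['d']
print(codes)       # [0, 1, 2, -1]
print(n_missing)   # 1
```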
Method 3: Using pd.CategoricalDtype¶
Define a categorical type for reuse across multiple columns.
# Define the dtype once
size_dtype = pd.CategoricalDtype(
    categories=['S', 'M', 'L', 'XL'],
    ordered=True
)
# Apply to multiple columns
df = pd.DataFrame({
    'shirt_size': ['M', 'L', 'S', 'XL'],
    'pants_size': ['L', 'M', 'M', 'L']
})
df['shirt_size'] = df['shirt_size'].astype(size_dtype)
df['pants_size'] = df['pants_size'].astype(size_dtype)
print(df.dtypes)
shirt_size    category
pants_size    category
dtype: object
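One benefit of sharing a single dtype is that the columns can be compared with each other element by element. A minimal sketch using the same data follows; same_dtype and shirt_larger are illustrative names.

```python
import pandas as pd

size_dtype = pd.CategoricalDtype(
    categories=['S', 'M', 'L', 'XL'],
    ordered=True
)

df = pd.DataFrame({
    'shirt_size': ['M', 'L', 'S', 'XL'],
    'pants_size': ['L', 'M', 'M', 'L']
})
for col in ['shirt_size', 'pants_size']:
    df[col] = df[col].astype(size_dtype)

# Both columns now carry an identical dtype
same_dtype = df['shirt_size'].dtype == df['pants_size'].dtype

# Identical ordered dtypes allow row-wise comparison between columns
shirt_larger = (df['shirt_size'] > df['pants_size']).tolist()

print(same_dtype)    # True
print(shirt_larger)  # [False, True, False, True]
```

Comparing two categorical columns whose categories differ raises an error, which is another reason to define the dtype once and reuse it.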
CategoricalDtype with read_csv¶
# Define dtype before reading
rating_dtype = pd.CategoricalDtype(
    # Ordered categories are listed from lowest to highest
    categories=['B', 'BB', 'BBB', 'A', 'AA', 'AAA'],
    ordered=True
)
# Apply during read
df = pd.read_csv('bonds.csv', dtype={'rating': rating_dtype})
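To try this without a file on disk, the sketch below reads from an in-memory string instead. The CSV contents, including the unlisted rating 'NR', are made up for illustration and do not come from any real bonds.csv; the categories are listed from lowest to highest.

```python
import io
import pandas as pd

rating_dtype = pd.CategoricalDtype(
    categories=['B', 'BB', 'BBB', 'A', 'AA', 'AAA'],
    ordered=True
)

# Hypothetical file contents standing in for bonds.csv
csv_text = "issuer,rating\nCompany A,AA\nCompany B,BBB\nCompany C,NR\n"

df = pd.read_csv(io.StringIO(csv_text), dtype={'rating': rating_dtype})

print(df['rating'].dtype)            # category
print(df['rating'].isna().tolist())  # [False, False, True]
```

The 'NR' entry is not among the declared categories, so it is read as a missing value rather than raising an error.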
Method 4: During DataFrame Creation¶
Specify categorical dtype when creating a DataFrame:
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'grade': pd.Categorical(['A', 'B', 'A'], ordered=True)
})
print(df['grade'].dtype) # category
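When ordered=True is passed without categories, pandas infers them in sorted order, so the ordering here is 'A' < 'B'. If 'A' is meant to rank highest, one option is to list the categories explicitly, as in this sketch (the variable top is illustrative).

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    # Listed from lowest to highest so that 'A' outranks 'B'
    'grade': pd.Categorical(
        ['A', 'B', 'A'],
        categories=['B', 'A'],
        ordered=True
    )
})

# Students whose grade is better than 'B'
top = df[df['grade'] > 'B']['name'].tolist()

print(top)  # ['Alice', 'Charlie']
```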
Method 5: Using Series Constructor¶
s = pd.Series(
    pd.Categorical(['low', 'medium', 'high', 'low']),
    name='priority'
)
print(s)
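The Series constructor also accepts a dtype argument, which avoids building the Categorical separately. The sketch below shows both the 'category' shorthand and an explicit ordered dtype; priority_dtype and s_ordered are illustrative names.

```python
import pandas as pd

values = ['low', 'medium', 'high', 'low']

# Shorthand: categories are inferred in sorted order
s = pd.Series(values, dtype='category', name='priority')

# Explicit dtype: categories and order are under your control
priority_dtype = pd.CategoricalDtype(['low', 'medium', 'high'], ordered=True)
s_ordered = pd.Series(values, dtype=priority_dtype, name='priority')

print(list(s.cat.categories))          # ['high', 'low', 'medium']
print(list(s_ordered.cat.categories))  # ['low', 'medium', 'high']
print(s_ordered.max())                 # high
```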
Practical Examples¶
Stock Sectors¶
import pandas as pd
# Define valid sectors
sector_dtype = pd.CategoricalDtype(categories=[
    'Technology', 'Healthcare', 'Finance', 'Energy',
    'Consumer Discretionary', 'Consumer Staples',
    'Industrials', 'Materials', 'Utilities',
    'Real Estate', 'Communication Services'
])
# Create stock data
stocks = pd.DataFrame({
    'ticker': ['AAPL', 'JNJ', 'JPM', 'XOM', 'AMZN'],
    'sector': ['Technology', 'Healthcare', 'Finance', 'Energy', 'Consumer Discretionary']
})
stocks['sector'] = stocks['sector'].astype(sector_dtype)
print(stocks['sector'].cat.categories)
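The column keeps all eleven declared sectors even though only five appear in the data. If only the observed sectors are wanted later, for example before plotting, remove_unused_categories() is one way to trim them. This sketch repeats the setup above so it runs on its own; trimmed is an illustrative name.

```python
import pandas as pd

sector_dtype = pd.CategoricalDtype(categories=[
    'Technology', 'Healthcare', 'Finance', 'Energy',
    'Consumer Discretionary', 'Consumer Staples',
    'Industrials', 'Materials', 'Utilities',
    'Real Estate', 'Communication Services'
])

stocks = pd.DataFrame({
    'ticker': ['AAPL', 'JNJ', 'JPM', 'XOM', 'AMZN'],
    'sector': ['Technology', 'Healthcare', 'Finance', 'Energy', 'Consumer Discretionary']
})
stocks['sector'] = stocks['sector'].astype(sector_dtype)

# Drop categories that never occur in the data
trimmed = stocks['sector'].cat.remove_unused_categories()

print(len(stocks['sector'].cat.categories))  # 11
print(len(trimmed.cat.categories))           # 5
```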
Credit Ratings¶
# Ratings with natural order (AAA is best)
rating_dtype = pd.CategoricalDtype(
    categories=['D', 'C', 'CC', 'CCC', 'B', 'BB', 'BBB', 'A', 'AA', 'AAA'],
    ordered=True
)
bonds = pd.DataFrame({
    'issuer': ['Company A', 'Company B', 'Company C'],
    'rating': ['AA', 'BBB', 'A']
})
bonds['rating'] = bonds['rating'].astype(rating_dtype)
# Now we can compare ratings
print(bonds[bonds['rating'] >= 'BBB'])  # Investment grade (BBB or better)
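The same ordering drives sorting and aggregation. As a sketch, the block below ranks the issuers from best to worst rating and picks out the extremes; by_quality, best, and worst are illustrative names.

```python
import pandas as pd

rating_dtype = pd.CategoricalDtype(
    categories=['D', 'C', 'CC', 'CCC', 'B', 'BB', 'BBB', 'A', 'AA', 'AAA'],
    ordered=True
)

bonds = pd.DataFrame({
    'issuer': ['Company A', 'Company B', 'Company C'],
    'rating': ['AA', 'BBB', 'A']
})
bonds['rating'] = bonds['rating'].astype(rating_dtype)

# Sort by credit quality rather than alphabetically
by_quality = bonds.sort_values('rating', ascending=False)['issuer'].tolist()

best = bonds['rating'].max()
worst = bonds['rating'].min()

print(by_quality)   # ['Company A', 'Company C', 'Company B']
print(best, worst)  # AA BBB
```

A plain string column would sort 'A' before 'AA' before 'BBB', which does not reflect credit quality.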
Survey Responses¶
# Likert scale with order
likert_dtype = pd.CategoricalDtype(
    categories=[
        'Strongly Disagree',
        'Disagree',
        'Neutral',
        'Agree',
        'Strongly Agree'
    ],
    ordered=True
)
survey = pd.DataFrame({
    'respondent_id': [1, 2, 3, 4, 5],
    'satisfaction': ['Agree', 'Neutral', 'Strongly Agree', 'Disagree', 'Agree']
})
survey['satisfaction'] = survey['satisfaction'].astype(likert_dtype)
# Find positive responses
positive = survey[survey['satisfaction'] > 'Neutral']
print(positive)
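For reporting, it is often useful to tabulate responses in scale order, including options nobody chose. One way to do that, sketched here with the same survey data, is value_counts() followed by sort_index(); counts is an illustrative name.

```python
import pandas as pd

likert_dtype = pd.CategoricalDtype(
    categories=[
        'Strongly Disagree', 'Disagree', 'Neutral', 'Agree', 'Strongly Agree'
    ],
    ordered=True
)

survey = pd.DataFrame({
    'respondent_id': [1, 2, 3, 4, 5],
    'satisfaction': ['Agree', 'Neutral', 'Strongly Agree', 'Disagree', 'Agree']
})
survey['satisfaction'] = survey['satisfaction'].astype(likert_dtype)

# Count every option, then arrange the rows in scale order
counts = survey['satisfaction'].value_counts().sort_index()

print(list(counts.index))
# ['Strongly Disagree', 'Disagree', 'Neutral', 'Agree', 'Strongly Agree']
print(counts.tolist())  # [0, 1, 1, 2, 1]
```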
Summary of Creation Methods¶
| Method | Use Case | Example |
|---|---|---|
| astype('category') | Quick conversion | s.astype('category') |
| pd.Categorical() | Custom categories/order | pd.Categorical(data, categories=[...]) |
| pd.CategoricalDtype | Reusable type definition | dtype = pd.CategoricalDtype(...) |
| In DataFrame creation | Direct specification | pd.DataFrame({'col': pd.Categorical(...)}) |
Best Practices¶
- Define categories explicitly when you know the valid values
- Use ordered=True when categories have logical order
- Use CategoricalDtype for consistency across multiple columns
- Specify dtype in read_csv to save memory on large files
- Check for invalid values before converting: anything outside the declared categories becomes NaN without a warning
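The memory point can be checked directly with memory_usage(deep=True). The sketch below uses a made-up column of three repeated labels; actual savings depend on how many distinct values the data has relative to its length.

```python
import pandas as pd

# Hypothetical column: 300,000 rows drawn from only three distinct labels
products = pd.Series(['A', 'B', 'C'] * 100_000)
as_category = products.astype('category')

plain_bytes = products.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(category_bytes < plain_bytes)  # True
```

Columns with mostly unique values, such as IDs, gain little from conversion and may even use more memory.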