Introduction to Categoricals
In real-world datasets, many columns take on only a limited number of unique values, even if stored as strings or numbers. Such columns are ideal candidates for categorical encoding.
What is Categorical Data?
Categorical data represents variables that can take on a limited, fixed number of possible values. Examples include:
| Domain | Example | Possible Values |
|---|---|---|
| Finance | Stock sector | Technology, Finance, Healthcare, Energy, ... |
| Surveys | Agreement scale | Strongly Disagree, Disagree, Neutral, Agree, Strongly Agree |
| Credit | Rating | AAA, AA, A, BBB, BB, B, CCC, ... |
| Retail | Size | Small, Medium, Large, XL |
| Demographics | Education | High School, Bachelor's, Master's, PhD |
How Pandas Stores Categoricals
A Categorical column in pandas is:
- Internally stored as integers pointing to a category lookup table
- Memory efficient, especially for repeated values
- Often faster for comparisons, groupby, and filtering operations, since these can work on the integer codes
┌─────────────────────────────────────────────────────────┐
│ String Storage │
├─────────────────────────────────────────────────────────┤
│ Row 0: "Technology" (10 bytes) │
│ Row 1: "Finance" (7 bytes) │
│ Row 2: "Technology" (10 bytes) ← Duplicate stored │
│ Row 3: "Healthcare" (10 bytes) │
│ Row 4: "Technology" (10 bytes) ← Duplicate stored │
│ ... │
│ Total: N × avg_string_length bytes │
└─────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────┐
│ Categorical Storage │
├─────────────────────────────────────────────────────────┤
│ Categories: ["Technology", "Finance", "Healthcare"] │
│ (stored once) │
├─────────────────────────────────────────────────────────┤
│ Codes (integers): │
│ Row 0: 0 (1 byte) → "Technology" │
│ Row 1: 1 (1 byte) → "Finance" │
│ Row 2: 0 (1 byte) → "Technology" │
│ Row 3: 2 (1 byte) → "Healthcare" │
│ Row 4: 0 (1 byte) → "Technology" │
│ ... │
│ Total: N bytes + category_table │
└─────────────────────────────────────────────────────────┘
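The one-byte codes in the diagram hold only while the category table is small. A minimal sketch, assuming pandas picks a wider integer type for the codes as the number of categories grows (the series names and `ticker_` labels are made up for illustration):

```python
import pandas as pd

# Three categories: codes fit in a single byte per row.
few = pd.Series(["Technology", "Finance", "Healthcare"] * 4, dtype="category")

# One thousand categories: codes need a wider integer type.
many = pd.Series([f"ticker_{i}" for i in range(1000)], dtype="category")

print(few.cat.codes.dtype)   # int8
print(many.cat.codes.dtype)  # int16
```

So "N bytes" is the best case; columns with many categories pay two or more bytes per row, which is still far less than storing each string.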
Quick Example
import pandas as pd
# Regular string column
s_string = pd.Series(['apple', 'banana', 'apple', 'cherry', 'apple'])
print(f"String dtype: {s_string.dtype}") # object
# Categorical column
s_cat = s_string.astype('category')
print(f"Categorical dtype: {s_cat.dtype}") # category
# View internal structure
print(f"Categories: {s_cat.cat.categories.tolist()}")
print(f"Codes: {s_cat.cat.codes.tolist()}")
String dtype: object
Categorical dtype: category
Categories: ['apple', 'banana', 'cherry']
Codes: [0, 1, 0, 2, 0]
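To make the lookup-table idea concrete, the original values can be rebuilt by indexing the category table with the codes. This is an illustrative sketch rather than a description of pandas internals; the second part relies on the pandas convention that missing values get the code -1.

```python
import pandas as pd

s_cat = pd.Series(["apple", "banana", "apple", "cherry", "apple"]).astype("category")

categories = s_cat.cat.categories    # the lookup table
codes = s_cat.cat.codes.to_numpy()   # one small integer per row

# Each code is a position in the category table.
rebuilt = categories[codes].tolist()
print(rebuilt)  # ['apple', 'banana', 'apple', 'cherry', 'apple']

# Missing values are not categories; they get the sentinel code -1.
with_gap = pd.Series(["apple", None, "banana"]).astype("category")
gap_codes = with_gap.cat.codes.tolist()
print(gap_codes)  # [0, -1, 1]
```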
Benefits of Categorical Data
1. Memory Efficiency
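Categorical storage uses far less memory when a column has few distinct values relative to its length: each distinct value is stored once, and each row holds only a small integer code.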
import numpy as np
# 1 million rows with 10 sectors
sectors = ['Tech', 'Finance', 'Healthcare', 'Retail', 'Energy',
           'Utilities', 'Media', 'Aerospace', 'Banks', 'Insurance']
data = np.random.choice(sectors, size=1_000_000)
df = pd.DataFrame({'Sector': data})
# Before: string storage
print(f"String: {df['Sector'].memory_usage(deep=True) / 1e6:.1f} MB")
# After: categorical storage
df['Sector'] = df['Sector'].astype('category')
print(f"Categorical: {df['Sector'].memory_usage(deep=True) / 1e6:.1f} MB")
String: 57.0 MB
Categorical: 1.0 MB
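The savings depend on repetition. As a contrast to the example above, this sketch (with made-up order IDs) converts a column in which every value is unique. The categorical version has to keep every string in the category table and also store a code per row, so it should come out larger rather than smaller.

```python
import pandas as pd

# Hypothetical column where no value repeats.
ids = pd.Series([f"order_{i:06d}" for i in range(100_000)])

string_bytes = ids.memory_usage(deep=True, index=False)
cat_bytes = ids.astype("category").memory_usage(deep=True, index=False)

print(f"String:      {string_bytes / 1e6:.1f} MB")
print(f"Categorical: {cat_bytes / 1e6:.1f} MB")
print(f"Categorical is larger: {cat_bytes > string_bytes}")
```

Exact figures vary with the pandas and Python versions, so no output is shown here.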
2. Faster Operations
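GroupBy and similar operations are often faster on categorical columns, because pandas can work with the integer codes instead of hashing every string. The size of the speedup depends on the data and the pandas version.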
import time
df['Returns'] = np.random.randn(1_000_000)
# Time groupby with string column
df_string = df.copy()
df_string['Sector'] = df_string['Sector'].astype(str)
start = time.perf_counter()  # perf_counter is meant for timing short intervals
_ = df_string.groupby('Sector')['Returns'].mean()
string_time = time.perf_counter() - start
# Time groupby with categorical column
# observed=True: only return categories that occur in the data
start = time.perf_counter()
_ = df.groupby('Sector', observed=True)['Returns'].mean()
cat_time = time.perf_counter() - start
print(f"String groupby: {string_time:.3f}s")
print(f"Categorical groupby: {cat_time:.3f}s")
print(f"Speedup: {string_time/cat_time:.1f}x")
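One behavior to keep in mind when grouping on a categorical column: the result can include categories that never occur in the data. The sketch below uses a tiny made-up frame and passes `observed` explicitly, since its default has differed between pandas versions.

```python
import pandas as pd

df_small = pd.DataFrame({
    "Sector": pd.Categorical(
        ["Tech", "Finance", "Tech"],
        categories=["Tech", "Finance", "Energy"],  # 'Energy' never appears
    ),
    "Returns": [0.02, -0.01, 0.04],
})

# observed=False keeps a row for every category, even unused ones.
all_groups = df_small.groupby("Sector", observed=False)["Returns"].mean()

# observed=True keeps only the categories present in the data.
seen_groups = df_small.groupby("Sector", observed=True)["Returns"].mean()

print(all_groups.index.tolist())   # ['Tech', 'Finance', 'Energy']
print(seen_groups.index.tolist())  # ['Tech', 'Finance']
```

With `observed=False`, the unused `Energy` group has a mean of NaN.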
3. Logical Ordering
Ordered categoricals enable meaningful comparisons.
# Plain strings: the comparison runs, but alphabetically rather than by size
sizes = pd.Series(['medium', 'small', 'large'])
# sizes > 'small'  # all False: 'medium' and 'large' sort before 'small'
# Ordered categorical: the comparison follows the declared order
sizes = pd.Categorical(
    ['medium', 'small', 'large'],
    categories=['small', 'medium', 'large'],
    ordered=True
)
sizes = pd.Series(sizes)
print(sizes > 'small')  # True, False, True
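Beyond element-wise comparison, an ordered categorical also sorts and aggregates by category order rather than alphabetically. A short sketch reusing the size labels from above; the unordered case is included to show that pandas rejects `>` when no order has been declared.

```python
import pandas as pd

labels = ["medium", "small", "large"]
order = ["small", "medium", "large"]

ordered = pd.Series(pd.Categorical(labels, categories=order, ordered=True))

sorted_labels = ordered.sort_values().tolist()
print(sorted_labels)                 # ['small', 'medium', 'large']
print(ordered.min(), ordered.max())  # small large

# Same categories without an order: inequality comparisons are rejected.
unordered = pd.Series(pd.Categorical(labels, categories=order))
try:
    unordered > "small"
    comparison_allowed = True
except TypeError as exc:
    comparison_allowed = False
    print(f"TypeError: {exc}")
```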
4. Data Validation
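A fixed category list restricts a column to known values. Anything outside the list becomes NaN when the categorical is constructed, and assigning an unknown value to an existing categorical raises an error, so invalid data surfaces early.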
valid_ratings = ['AAA', 'AA', 'A', 'BBB', 'BB', 'B']
ratings = pd.Categorical(['AA', 'BBB', 'A'], categories=valid_ratings)
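# Values outside the category list become NaN on construction
# ratings = pd.Categorical(['AA', 'INVALID'], categories=valid_ratings)
# -> ['AA', NaN]
# Assigning an unknown value to an existing categorical raises an error
# ratings[0] = 'INVALID'  # TypeError in recent pandas versions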
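The two cases can be checked directly. This is a sketch assuming the behavior of current pandas releases; the exception type raised on assignment has changed across versions, so the code catches both `TypeError` and `ValueError`.

```python
import pandas as pd

valid_ratings = ["AAA", "AA", "A", "BBB", "BB", "B"]

# Construction: the unknown label is replaced by NaN.
ratings = pd.Categorical(["AA", "INVALID"], categories=valid_ratings)
missing_mask = pd.isna(ratings).tolist()
print(missing_mask)  # [False, True]

# Assignment: writing an unknown label into an existing categorical fails.
try:
    ratings[0] = "CCC"
    assignment_rejected = False
except (TypeError, ValueError) as exc:  # exception type has varied by version
    assignment_rejected = True
    print(f"{type(exc).__name__}: {exc}")

# A label from the category list is accepted.
ratings[0] = "BBB"
print(ratings[0])  # BBB
```

Because construction substitutes NaN silently, counting missing values after conversion is a simple way to detect labels that were not in the list.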
When to Use Categorical
| Scenario | Use Categorical? | Reason |
|---|---|---|
| Limited unique values (< 50% of rows) | ✅ Yes | Memory savings |
| Frequent repeated values | ✅ Yes | Memory + speed |
| Natural order exists | ✅ Yes | Enable comparisons |
| Heavy groupby/filtering | ✅ Yes | Performance boost |
| All unique values | ❌ No | No benefit |
| Need string operations | ⚠️ Maybe | `.str` methods work, but results come back as plain strings; convert back if needed |
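The first rows of the table can be turned into a quick screening step. The helper below is hypothetical (`categorical_candidates` and its `max_ratio` parameter are not part of pandas); it flags non-numeric columns whose share of unique values falls under the 50% rule of thumb from the table.

```python
import pandas as pd

def categorical_candidates(df: pd.DataFrame, max_ratio: float = 0.5) -> list[str]:
    """Return non-numeric, non-categorical columns with few unique values."""
    candidates = []
    for name in df.columns:
        col = df[name]
        # Skip numeric columns and columns that are already categorical.
        if pd.api.types.is_numeric_dtype(col) or isinstance(col.dtype, pd.CategoricalDtype):
            continue
        if len(col) > 0 and col.nunique() / len(col) < max_ratio:
            candidates.append(name)
    return candidates

frame = pd.DataFrame({
    "sector": ["Tech", "Finance", "Tech", "Tech", "Finance", "Energy", "Tech", "Finance"],
    "ticker": ["T1", "T2", "T3", "T4", "T5", "T6", "T7", "T8"],  # all unique
    "price": [10.0, 11.5, 9.8, 10.2, 11.1, 30.4, 10.9, 12.0],    # numeric
})

found = categorical_candidates(frame)
print(found)  # ['sector']
```

The threshold is only a starting point; a column well under 50% unique can still be a poor fit if it needs frequent string manipulation.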
Categorical vs Other Types
| Type | Use Case | Memory | Ordered Comparison |
|---|---|---|---|
| object (string) | Free-form text | High | Alphabetical only |
| category | Fixed set of values | Low | Optional |
| int / float | Numeric codes | Medium | Yes (numeric) |
Real-World Applications
- Stock Sectors: Technology, Healthcare, Finance, ...
- Credit Ratings: AAA, AA, A, BBB, BB, B, ...
- Survey Responses: Strongly Agree, Agree, Neutral, ...
- Product Categories: Electronics, Clothing, Food, ...
- Geographic Regions: North, South, East, West
- Time Periods: Q1, Q2, Q3, Q4
- Risk Levels: Low, Medium, High, Critical
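Several of these, such as credit ratings and risk levels, have a natural order. One way to declare that order once and reuse it across columns is a `CategoricalDtype`. The sketch below uses made-up risk data and assumes the ordering Low < Medium < High < Critical.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Reusable dtype: category list plus an explicit order.
risk_dtype = CategoricalDtype(["Low", "Medium", "High", "Critical"], ordered=True)

risk = pd.Series(["High", "Low", "Critical", "Medium", "Low"]).astype(risk_dtype)

# Comparisons follow the declared order, not the alphabet.
elevated = risk[risk >= "High"].tolist()
print(elevated)    # ['High', 'Critical']
print(risk.max())  # Critical
```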