Skip to content

Introduction to Categoricals

In real-world datasets, many columns take on only a limited number of unique values, even if stored as strings or numbers. Such columns are ideal candidates for categorical encoding.

What is Categorical Data?

Categorical data represents variables that can take on a limited, fixed number of possible values. Examples include:

Domain Example Possible Values
Finance Stock sector Technology, Finance, Healthcare, Energy, ...
Surveys Agreement scale Strongly Disagree, Disagree, Neutral, Agree, Strongly Agree
Credit Rating AAA, AA, A, BBB, BB, B, CCC, ...
Retail Size Small, Medium, Large, XL
Demographics Education High School, Bachelor's, Master's, PhD

How Pandas Stores Categoricals

A Categorical column in pandas is:

  1. Internally stored as integers pointing to a category lookup table
  2. Memory efficient, especially for repeated values
  3. Faster for comparisons, groupby, and filtering operations
┌─────────────────────────────────────────────────────────┐
│                    String Storage                        │
├─────────────────────────────────────────────────────────┤
│ Row 0: "Technology"  (10 bytes)                         │
│ Row 1: "Finance"     (7 bytes)                          │
│ Row 2: "Technology"  (10 bytes)  ← Duplicate stored     │
│ Row 3: "Healthcare"  (10 bytes)                         │
│ Row 4: "Technology"  (10 bytes)  ← Duplicate stored     │
│ ...                                                      │
│ Total: N × avg_string_length bytes                      │
└─────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────┐
│                  Categorical Storage                     │
├─────────────────────────────────────────────────────────┤
│ Categories: ["Technology", "Finance", "Healthcare"]     │
│             (stored once)                                │
├─────────────────────────────────────────────────────────┤
│ Codes (integers):                                        │
│ Row 0: 0  (1 byte)  → "Technology"                      │
│ Row 1: 1  (1 byte)  → "Finance"                         │
│ Row 2: 0  (1 byte)  → "Technology"                      │
│ Row 3: 2  (1 byte)  → "Healthcare"                      │
│ Row 4: 0  (1 byte)  → "Technology"                      │
│ ...                                                      │
│ Total: N bytes + category_table                         │
└─────────────────────────────────────────────────────────┘

Quick Example

import pandas as pd

# Regular string column
s_string = pd.Series(['apple', 'banana', 'apple', 'cherry', 'apple'])
print(f"String dtype: {s_string.dtype}")  # object

# Categorical column
s_cat = s_string.astype('category')
print(f"Categorical dtype: {s_cat.dtype}")  # category

# View internal structure
print(f"Categories: {s_cat.cat.categories.tolist()}")
print(f"Codes: {s_cat.cat.codes.tolist()}")
String dtype: object
Categorical dtype: category
Categories: ['apple', 'banana', 'cherry']
Codes: [0, 1, 0, 2, 0]

Benefits of Categorical Data

1. Memory Efficiency

Categorical data uses dramatically less memory for columns with repeated values.

import numpy as np

# 1 million rows with 10 sectors
sectors = ['Tech', 'Finance', 'Healthcare', 'Retail', 'Energy', 
           'Utilities', 'Media', 'Aerospace', 'Banks', 'Insurance']
data = np.random.choice(sectors, size=1_000_000)

df = pd.DataFrame({'Sector': data})

# Before: string storage
print(f"String: {df['Sector'].memory_usage(deep=True) / 1e6:.1f} MB")

# After: categorical storage
df['Sector'] = df['Sector'].astype('category')
print(f"Categorical: {df['Sector'].memory_usage(deep=True) / 1e6:.1f} MB")
String: 57.0 MB
Categorical: 1.0 MB

2. Faster Operations

GroupBy and other operations are significantly faster on categorical columns.

import time

df['Returns'] = np.random.randn(1_000_000)

# Time groupby with string column
df_string = df.copy()
df_string['Sector'] = df_string['Sector'].astype(str)

start = time.time()
_ = df_string.groupby('Sector')['Returns'].mean()
string_time = time.time() - start

# Time groupby with categorical column
start = time.time()
_ = df.groupby('Sector')['Returns'].mean()
cat_time = time.time() - start

print(f"String groupby: {string_time:.3f}s")
print(f"Categorical groupby: {cat_time:.3f}s")
print(f"Speedup: {string_time/cat_time:.1f}x")

3. Logical Ordering

Ordered categoricals enable meaningful comparisons.

# Without order: comparison fails
sizes = pd.Series(['medium', 'small', 'large'])
# sizes > 'small'  # TypeError or meaningless result

# With order: comparison works
sizes = pd.Categorical(
    ['medium', 'small', 'large'],
    categories=['small', 'medium', 'large'],
    ordered=True
)
sizes = pd.Series(sizes)

print(sizes > 'small')  # False, False, True

4. Data Validation

Categories enforce valid values—invalid data is caught early.

valid_ratings = ['AAA', 'AA', 'A', 'BBB', 'BB', 'B']
ratings = pd.Categorical(['AA', 'BBB', 'A'], categories=valid_ratings)

# Adding invalid value raises error or becomes NaN
# ratings = pd.Categorical(['AA', 'INVALID'], categories=valid_ratings)
# ValueError or NaN depending on method

When to Use Categorical

Scenario Use Categorical? Reason
Limited unique values (< 50% of rows) ✅ Yes Memory savings
Frequent repeated values ✅ Yes Memory + speed
Natural order exists ✅ Yes Enable comparisons
Heavy groupby/filtering ✅ Yes Performance boost
All unique values ❌ No No benefit
Need string operations ⚠️ Maybe Convert back if needed

Categorical vs Other Types

Type Use Case Memory Ordered Comparison
object (string) Free-form text High No
category Fixed set of values Low Optional
int / float Numeric codes Medium Yes (numeric)

Real-World Applications

  1. Stock Sectors: Technology, Healthcare, Finance, ...
  2. Credit Ratings: AAA, AA, A, BBB, BB, B, ...
  3. Survey Responses: Strongly Agree, Agree, Neutral, ...
  4. Product Categories: Electronics, Clothing, Food, ...
  5. Geographic Regions: North, South, East, West
  6. Time Periods: Q1, Q2, Q3, Q4
  7. Risk Levels: Low, Medium, High, Critical