Introduction to Categoricals¶

In real-world datasets, many columns take on only a limited number of unique values, even if stored as strings or numbers. Such columns are ideal candidates for categorical encoding.

Mental Model

A categorical column is like a dropdown menu: the data can only take values from a fixed list. Pandas stores an integer code for each row and a separate table of unique labels, so repeated strings are stored once instead of millions of times. This insight drives both memory savings and faster groupby operations.

What is Categorical Data?¶

Categorical data represents variables that can take on a limited, fixed number of possible values. Examples include:

Domain	Example	Possible Values
Finance	Stock sector	Technology, Finance, Healthcare, Energy, ...
Surveys	Agreement scale	Strongly Disagree, Disagree, Neutral, Agree, Strongly Agree
Credit	Rating	AAA, AA, A, BBB, BB, B, CCC, ...
Retail	Size	Small, Medium, Large, XL
Demographics	Education	High School, Bachelor's, Master's, PhD

How Pandas Stores Categoricals¶

A Categorical column in pandas is:

Internally stored as integers pointing to a category lookup table
Memory efficient, especially for repeated values
Faster for comparisons, groupby, and filtering operations

``` ┌─────────────────────────────────────────────────────────┐ │ String Storage │ ├─────────────────────────────────────────────────────────┤ │ Row 0: "Technology" (10 bytes) │ │ Row 1: "Finance" (7 bytes) │ │ Row 2: "Technology" (10 bytes) ← Duplicate stored │ │ Row 3: "Healthcare" (10 bytes) │ │ Row 4: "Technology" (10 bytes) ← Duplicate stored │ │ ... │ │ Total: N × avg_string_length bytes │ └─────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────┐ │ Categorical Storage │ ├─────────────────────────────────────────────────────────┤ │ Categories: ["Technology", "Finance", "Healthcare"] │ │ (stored once) │ ├─────────────────────────────────────────────────────────┤ │ Codes (integers): │ │ Row 0: 0 (1 byte) → "Technology" │ │ Row 1: 1 (1 byte) → "Finance" │ │ Row 2: 0 (1 byte) → "Technology" │ │ Row 3: 2 (1 byte) → "Healthcare" │ │ Row 4: 0 (1 byte) → "Technology" │ │ ... │ │ Total: N bytes + category_table │ └─────────────────────────────────────────────────────────┘ ```

Quick Example¶

```python import pandas as pd

Regular string column¶

s_string = pd.Series(['apple', 'banana', 'apple', 'cherry', 'apple']) print(f"String dtype: {s_string.dtype}") # object

Categorical column¶

s_cat = s_string.astype('category') print(f"Categorical dtype: {s_cat.dtype}") # category

View internal structure¶

print(f"Categories: {s_cat.cat.categories.tolist()}") print(f"Codes: {s_cat.cat.codes.tolist()}") ```

String dtype: object Categorical dtype: category Categories: ['apple', 'banana', 'cherry'] Codes: [0, 1, 0, 2, 0]

Benefits of Categorical Data¶

1. Memory Efficiency¶

Categorical data uses dramatically less memory for columns with repeated values.

```python import numpy as np

1 million rows with 10 sectors¶

sectors = ['Tech', 'Finance', 'Healthcare', 'Retail', 'Energy', 'Utilities', 'Media', 'Aerospace', 'Banks', 'Insurance'] data = np.random.choice(sectors, size=1_000_000)

df = pd.DataFrame({'Sector': data})

Before: string storage¶

print(f"String: {df['Sector'].memory_usage(deep=True) / 1e6:.1f} MB")

After: categorical storage¶

df['Sector'] = df['Sector'].astype('category') print(f"Categorical: {df['Sector'].memory_usage(deep=True) / 1e6:.1f} MB") ```

String: 57.0 MB Categorical: 1.0 MB

2. Faster Operations¶

GroupBy and other operations are significantly faster on categorical columns.

```python import time

df['Returns'] = np.random.randn(1_000_000)

Time groupby with string column¶

df_string = df.copy() df_string['Sector'] = df_string['Sector'].astype(str)

start = time.time() _ = df_string.groupby('Sector')['Returns'].mean() string_time = time.time() - start

Time groupby with categorical column¶

start = time.time() _ = df.groupby('Sector')['Returns'].mean() cat_time = time.time() - start

print(f"String groupby: {string_time:.3f}s") print(f"Categorical groupby: {cat_time:.3f}s") print(f"Speedup: {string_time/cat_time:.1f}x") ```

3. Logical Ordering¶

Ordered categoricals enable meaningful comparisons.

```python

Without order: comparison fails¶

sizes = pd.Series(['medium', 'small', 'large'])

sizes > 'small' # TypeError or meaningless result¶

With order: comparison works¶

sizes = pd.Categorical( ['medium', 'small', 'large'], categories=['small', 'medium', 'large'], ordered=True ) sizes = pd.Series(sizes)

print(sizes > 'small') # False, False, True ```

4. Data Validation¶

Categories enforce valid values—invalid data is caught early.

```python valid_ratings = ['AAA', 'AA', 'A', 'BBB', 'BB', 'B'] ratings = pd.Categorical(['AA', 'BBB', 'A'], categories=valid_ratings)

Adding invalid value raises error or becomes NaN¶

ratings = pd.Categorical(['AA', 'INVALID'], categories=valid_ratings)¶

ValueError or NaN depending on method¶

```

When to Use Categorical¶

Scenario	Use Categorical?	Reason
Limited unique values (< 50% of rows)	✅ Yes	Memory savings
Frequent repeated values	✅ Yes	Memory + speed
Natural order exists	✅ Yes	Enable comparisons
Heavy groupby/filtering	✅ Yes	Performance boost
All unique values	❌ No	No benefit
Need string operations	⚠️ Maybe	Convert back if needed

Categorical vs Other Types¶

Type	Use Case	Memory	Ordered Comparison
`object` (string)	Free-form text	High	No
`category`	Fixed set of values	Low	Optional
`int` / `float`	Numeric codes	Medium	Yes (numeric)

Real-World Applications¶

Stock Sectors: Technology, Healthcare, Finance, ...
Credit Ratings: AAA, AA, A, BBB, BB, B, ...
Survey Responses: Strongly Agree, Agree, Neutral, ...
Product Categories: Electronics, Clothing, Food, ...
Geographic Regions: North, South, East, West
Time Periods: Q1, Q2, Q3, Q4
Risk Levels: Low, Medium, High, Critical

Exercises¶

Exercise 1. Create a Series with categorical dtype and print its dtype, cat.categories, and cat.codes.

Solution to Exercise 1

```python import pandas as pd

See page content for relevant API details¶

s = pd.Series(['a', 'b', 'c', 'a', 'b'], dtype='category') print(s) print(s.cat.categories) print(s.cat.codes) ```

Exercise 2. Explain why categorical data types are useful in Pandas. Name two benefits.

Solution to Exercise 2

See the explanation in the main content of this page. The key concept involves understanding the categorical data type and its internal representation in Pandas.

Exercise 3. Write code that creates a DataFrame with 10000 rows of a column with only 3 unique values. Compare the memory usage before and after converting to categorical.

Solution to Exercise 3

```python import pandas as pd import numpy as np

np.random.seed(42) df = pd.DataFrame({'col': np.random.choice(['A', 'B', 'C'], 1000)}) df['col'] = df['col'].astype('category') print(df.dtypes) print(df['col'].value_counts()) ```

Exercise 4. Convert a column of repeated strings to categorical type and demonstrate that comparisons still work correctly (e.g., filtering with ==).

Solution to Exercise 4

```python import pandas as pd

s = pd.Categorical(['low', 'medium', 'high', 'low'], categories=['low', 'medium', 'high'], ordered=True) print(s) print(s > 'low') ```