Memory Efficiency with Categoricals

One of the primary benefits of categorical data is dramatic memory savings. This document demonstrates the memory characteristics and optimization strategies.

Mental Model

Object-dtype columns store a full Python string per cell -- a million rows of "Technology" means a million copies. Categorical dtype stores the string once in a lookup table and keeps only a small integer code per row. The fewer unique values relative to total rows, the larger the memory savings.
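As a minimal sketch of this layout, the lookup table and the per-row integer codes can be inspected through the `.cat` accessor. The `labels` Series is an illustrative example, not data from this page.

```python
import pandas as pd

# Illustrative data: five rows, two distinct values
labels = pd.Series(
    ["Technology", "Finance", "Technology", "Technology", "Finance"],
    dtype="category",
)

# The lookup table: each distinct string appears once (sorted when inferred)
print(list(labels.cat.categories))   # ['Finance', 'Technology']

# One small integer per row, pointing into the lookup table
print(labels.cat.codes.tolist())     # [1, 0, 1, 1, 0]

# Rebuilding the original values from codes + lookup table
rebuilt = labels.cat.categories.take(labels.cat.codes.to_numpy())
print(list(rebuilt) == labels.tolist())  # True
```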

How Memory is Saved

String Storage (object dtype)

Each string value is stored separately in memory, even if repeated:

```python
import pandas as pd
import numpy as np

# 1 million rows with 10 unique sectors
sectors = ['Technology', 'Finance', 'Healthcare', 'Retail', 'Energy',
           'Utilities', 'Media', 'Aerospace', 'Banks', 'Insurance']

np.random.seed(42)
data = np.random.choice(sectors, size=1_000_000)

# String storage
s_string = pd.Series(data)
string_memory = s_string.memory_usage(deep=True)
print(f"String storage: {string_memory / 1e6:.2f} MB")
```

String storage: 57.89 MB

Categorical Storage

Categories are stored once; data stores only integer codes:

```python
# Categorical storage
s_cat = s_string.astype('category')
cat_memory = s_cat.memory_usage(deep=True)
print(f"Categorical storage: {cat_memory / 1e6:.2f} MB")
print(f"Memory reduction: {string_memory / cat_memory:.1f}x")
```

Categorical storage: 1.00 MB
Memory reduction: 57.9x

Memory Breakdown

```python
def analyze_categorical_memory(s_cat):
    """Analyze memory components of a categorical Series."""
    # Category table size
    categories = s_cat.cat.categories
    cat_memory = categories.memory_usage(deep=True)

    # Codes array size (integer array)
    codes_memory = s_cat.cat.codes.nbytes

    # Total (codes + category table + index)
    total = s_cat.memory_usage(deep=True)

    print(f"Categories ({len(categories)} unique): {cat_memory:,} bytes")
    print(f"Codes ({len(s_cat):,} values): {codes_memory:,} bytes")
    print(f"Total: {total:,} bytes")

    return cat_memory, codes_memory, total


s_cat = pd.Series(np.random.choice(['A', 'B', 'C'], 1_000_000), dtype='category')
analyze_categorical_memory(s_cat)
```

Categories (3 unique): 248 bytes
Codes (1,000,000 values): 1,000,000 bytes
Total: 1,000,376 bytes
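The breakdown suggests a simple estimate: total bytes is roughly the number of rows times the bytes per code, plus the category table and the index. The sketch below checks that arithmetic. The names `sample`, `estimate`, and `actual` are illustrative, and exact byte counts can differ between pandas versions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
sample = pd.Series(rng.choice(["A", "B", "C"], 1_000_000), dtype="category")

codes = sample.cat.codes
bytes_per_code = codes.dtype.itemsize          # width of one integer code
codes_bytes = len(sample) * bytes_per_code     # rows x bytes per code
table_bytes = sample.cat.categories.memory_usage(deep=True)
index_bytes = sample.index.memory_usage()      # RangeIndex overhead

estimate = codes_bytes + table_bytes + index_bytes
actual = sample.memory_usage(deep=True)

print(f"Codes: {codes_bytes:,} bytes ({bytes_per_code} per row)")
print(f"Estimate: {estimate:,} bytes, actual: {actual:,} bytes")
```

With only three categories the codes dominate, so the total stays close to one byte per row regardless of how long the category strings are.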

Memory Comparison Table

```python
def compare_memory(n_rows, n_categories, avg_string_length=10):
    """Compare string vs categorical memory usage."""
    # Generate data
    categories = [f'Cat_{i:0{len(str(n_categories))}d}' for i in range(n_categories)]
    data = np.random.choice(categories, n_rows)

    # String
    s_string = pd.Series(data)
    string_mem = s_string.memory_usage(deep=True)

    # Categorical
    s_cat = s_string.astype('category')
    cat_mem = s_cat.memory_usage(deep=True)

    return string_mem, cat_mem, string_mem / cat_mem


# Test different scenarios
scenarios = [
    (100_000, 5),
    (100_000, 50),
    (100_000, 500),
    (1_000_000, 10),
    (1_000_000, 100),
    (1_000_000, 1000),
]

print(f"{'Rows':>12} {'Categories':>12} {'String MB':>12} {'Cat MB':>12} {'Ratio':>8}")
print("-" * 60)

for n_rows, n_cats in scenarios:
    str_mem, cat_mem, ratio = compare_memory(n_rows, n_cats)
    print(f"{n_rows:>12,} {n_cats:>12} {str_mem/1e6:>12.2f} {cat_mem/1e6:>12.2f} {ratio:>8.1f}x")
```

```
        Rows   Categories    String MB       Cat MB    Ratio
------------------------------------------------------------
     100,000            5         5.79         0.10    57.9x
     100,000           50         5.79         0.11    52.6x
     100,000          500         6.30         0.15    42.0x
   1,000,000           10        57.89         1.00    57.9x
   1,000,000          100        57.89         1.01    57.3x
   1,000,000         1000        62.89         1.10    57.2x
```

When Categoricals Save Memory

High Savings (Use Categorical)

Few unique values relative to total rows
Long string values
Many repeated values

```python
# Ideal case: 1M rows, 10 categories, long strings
countries = ['United States of America', 'United Kingdom', 'Germany',
             'France', 'Japan', 'China', 'India', 'Brazil', 'Canada',
             'Australia']

data = np.random.choice(countries, 1_000_000)
s_string = pd.Series(data)
s_cat = s_string.astype('category')

print(f"String: {s_string.memory_usage(deep=True) / 1e6:.1f} MB")
print(f"Categorical: {s_cat.memory_usage(deep=True) / 1e6:.1f} MB")
```

Low Savings (May Not Be Worth It)

Many unique values (high cardinality)
Short strings
Few rows

```python
# Poor case: many unique values
unique_ids = [f'ID_{i}' for i in range(100_000)]  # All unique
s_string = pd.Series(unique_ids)
s_cat = s_string.astype('category')

print(f"String: {s_string.memory_usage(deep=True) / 1e6:.1f} MB")
print(f"Categorical: {s_cat.memory_usage(deep=True) / 1e6:.1f} MB")
# Similar or worse for high cardinality
```

Integer Code Sizes

Pandas automatically chooses the smallest integer type for codes:

Number of Categories	Code Type	Bytes per Value
≤ 127	int8	1
≤ 32,767	int16	2
≤ 2,147,483,647	int32	4
> 2,147,483,647	int64	8

```python
# Few categories -> int8
s = pd.Series(['a', 'b', 'c'] * 1000, dtype='category')
print(f"3 categories: {s.cat.codes.dtype}")  # int8

# Many categories -> int16
cats = [f'cat_{i}' for i in range(200)]
s = pd.Series(np.random.choice(cats, 1000), dtype='category')
print(f"200 categories: {s.cat.codes.dtype}")  # int16
```
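To see what the code width means per row, the sketch below builds categoricals with different category counts and reports the bytes each code occupies. The helper `codes_bytes_per_row` is hypothetical and exists only for this illustration.

```python
import numpy as np
import pandas as pd

def codes_bytes_per_row(n_categories, n_rows=10_000):
    """Return the bytes used per row by a categorical's integer codes.

    Hypothetical helper for illustration. The category labels are cycled
    so that every category occurs at least once.
    """
    values = [f"cat_{i}" for i in range(n_categories)]
    data = np.resize(values, n_rows)  # repeat the labels to fill n_rows
    s = pd.Series(data, dtype="category")
    return s.cat.codes.dtype.itemsize

for n in (3, 100, 200, 1_000):
    print(f"{n:>5} categories -> {codes_bytes_per_row(n)} byte(s) per row")
```

Going from 100 to 200 categories doubles the cost of the codes array, which is still far smaller than storing a Python string per row.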

DataFrame Memory Optimization

```python
def optimize_dataframe(df, verbose=True):
    """Convert low-cardinality string columns to categorical.

    Note: columns are converted in place on the DataFrame passed in.
    """
    original_memory = df.memory_usage(deep=True).sum()

    for col in df.select_dtypes(include=['object']).columns:
        n_unique = df[col].nunique()
        n_total = len(df)

        # Convert if less than 50% unique values
        if n_unique / n_total < 0.5:
            df[col] = df[col].astype('category')
            if verbose:
                print(f"Converted '{col}': {n_unique} unique values")

    new_memory = df.memory_usage(deep=True).sum()

    if verbose:
        print(f"\nMemory: {original_memory/1e6:.1f} MB → {new_memory/1e6:.1f} MB")
        print(f"Reduction: {(1 - new_memory/original_memory)*100:.1f}%")

    return df


# Example usage
df = pd.DataFrame({
    'sector': np.random.choice(['Tech', 'Finance', 'Health'], 100_000),
    'rating': np.random.choice(['A', 'B', 'C', 'D'], 100_000),
    'id': [f'ID_{i}' for i in range(100_000)],  # High cardinality - won't convert
    'value': np.random.randn(100_000)
})

df = optimize_dataframe(df)
```

```
Converted 'sector': 3 unique values
Converted 'rating': 4 unique values

Memory: 14.2 MB → 2.1 MB
Reduction: 85.2%
```

Real-World Example: S&P 500 Data

```python
# Simulate S&P 500 historical data
np.random.seed(42)

sectors = ['Technology', 'Healthcare', 'Finance', 'Energy',
           'Consumer Discretionary', 'Consumer Staples', 'Industrials',
           'Materials', 'Utilities', 'Real Estate', 'Communication Services']

tickers = [f'STOCK_{i:03d}' for i in range(500)]
dates = pd.date_range('2020-01-01', '2024-01-01', freq='B')

# Create large dataset
n_rows = len(tickers) * len(dates)
df = pd.DataFrame({
    'date': np.tile(dates, len(tickers)),
    'ticker': np.repeat(tickers, len(dates)),
    'sector': np.repeat(np.random.choice(sectors, len(tickers)), len(dates)),
    'close': np.random.randn(n_rows).cumsum() + 100,
    'volume': np.random.randint(1000, 1000000, n_rows)
})

print(f"Dataset size: {len(df):,} rows")
print("\nBefore optimization:")
print(df.memory_usage(deep=True))
print(f"Total: {df.memory_usage(deep=True).sum() / 1e6:.1f} MB")

# Optimize
df['ticker'] = df['ticker'].astype('category')
df['sector'] = df['sector'].astype('category')

print("\nAfter optimization:")
print(df.memory_usage(deep=True))
print(f"Total: {df.memory_usage(deep=True).sum() / 1e6:.1f} MB")
```

Guidelines

Unique Values (% of rows)	Recommendation
< 1%	✅ Definitely use categorical
1-10%	✅ Use categorical
10-50%	⚠️ Test both options
> 50%	❌ Probably not beneficial
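The table can be turned into a quick check on any column. The helper below, `recommend_categorical`, is a hypothetical sketch: its cutoffs mirror the guideline rows and are rules of thumb, so measuring with `memory_usage(deep=True)` remains the deciding test.

```python
import pandas as pd

def recommend_categorical(s):
    """Map a Series' unique-value ratio to the guideline categories.

    Hypothetical helper; thresholds follow the guideline table.
    """
    if len(s) == 0:
        return "no data"
    ratio = s.nunique() / len(s)
    if ratio < 0.01:
        return "definitely use categorical"
    if ratio <= 0.10:
        return "use categorical"
    if ratio <= 0.50:
        return "test both options"
    return "probably not beneficial"

low = pd.Series(["Tech", "Finance"] * 500)           # 2 unique in 1,000 rows
high = pd.Series([f"ID_{i}" for i in range(1000)])   # every value unique

print(recommend_categorical(low))    # definitely use categorical
print(recommend_categorical(high))   # probably not beneficial
```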

Performance vs Memory Trade-off

Converting to categorical has a small upfront cost, but it saves memory and can speed up operations such as groupby:

```python
import time

# Large dataset
n = 5_000_000
sectors = ['A', 'B', 'C', 'D', 'E']
data = np.random.choice(sectors, n)

# Conversion time (perf_counter is a monotonic, high-resolution timer)
start = time.perf_counter()
s_cat = pd.Series(data, dtype='category')
conv_time = time.perf_counter() - start
print(f"Conversion time: {conv_time:.2f}s")

# GroupBy comparison
s_string = pd.Series(data)
values = np.random.randn(n)

start = time.perf_counter()
pd.Series(values).groupby(s_string).mean()
string_groupby = time.perf_counter() - start

start = time.perf_counter()
# observed=True restricts the result to categories present in the data
pd.Series(values).groupby(s_cat, observed=True).mean()
cat_groupby = time.perf_counter() - start

print(f"String groupby: {string_groupby:.3f}s")
print(f"Categorical groupby: {cat_groupby:.3f}s")
print(f"Speedup: {string_groupby/cat_groupby:.1f}x")
```

Exercises

Exercise 1. Create a DataFrame with a string column containing 100000 rows but only 5 unique values. Compare memory usage before and after converting to categorical.

Solution to Exercise 1

```python
import pandas as pd

# See page content for relevant API details
s = pd.Series(['a', 'b', 'c', 'a', 'b'], dtype='category')
print(s)
print(s.cat.categories)
print(s.cat.codes)
```
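The snippet above only shows the basic categorical API. A sketch that addresses the exercise directly might look like the following. The column name `grade` and the five labels are arbitrary choices, and the column is created with an explicit object dtype so the comparison matches the object-dtype discussion on this page.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
labels = ["alpha", "beta", "gamma", "delta", "epsilon"]  # 5 unique values

# 100,000 rows drawn from the five labels, stored as Python strings
df = pd.DataFrame(
    {"grade": pd.Series(rng.choice(labels, 100_000), dtype=object)}
)
before = df["grade"].memory_usage(deep=True)

df["grade"] = df["grade"].astype("category")
after = df["grade"].memory_usage(deep=True)

print(f"Before: {before / 1e6:.2f} MB")
print(f"After:  {after / 1e6:.2f} MB")
print(f"Reduction: {before / after:.1f}x")
```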

Exercise 2. Explain how categorical data is stored internally (codes + categories). Why is this more memory-efficient than storing repeated strings?

Solution to Exercise 2

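A categorical stores two pieces: a categories lookup table that holds each distinct value once, and a codes array that holds one small integer per row pointing into that table. With object dtype, every cell holds its own Python string, so a value repeated a million times is stored a million times. With categorical dtype the string is stored once, and each row costs only 1, 2, 4, or 8 bytes for its code, depending on the number of categories.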

Exercise 3. Write code that reads memory usage of each column in a DataFrame using df.memory_usage(deep=True) and identifies which columns would benefit from categorical conversion.

Solution to Exercise 3

```python
import pandas as pd
import numpy as np

np.random.seed(42)
df = pd.DataFrame({'col': np.random.choice(['A', 'B', 'C'], 1000)})
df['col'] = df['col'].astype('category')
print(df.dtypes)
print(df['col'].value_counts())
```
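The snippet above converts a single column but does not inspect memory. One possible sketch of the exercise reads per-column usage with `df.memory_usage(deep=True)` and flags object columns whose share of unique values falls under a cutoff. The 50% cutoff and the column names are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 10_000
df = pd.DataFrame({
    "sector": pd.Series(rng.choice(["Tech", "Finance", "Health"], n), dtype=object),
    "id": pd.Series([f"ID_{i}" for i in range(n)], dtype=object),
    "value": rng.standard_normal(n),
})

# Per-column memory in bytes (index excluded so only columns are listed)
usage = df.memory_usage(deep=True, index=False)

# Flag object columns whose unique ratio is below the assumed 50% cutoff
candidates = [
    col
    for col in df.select_dtypes(include=["object"]).columns
    if df[col].nunique() / len(df) < 0.5
]

print(usage)
print("Candidates for categorical:", candidates)
```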

Exercise 4. Create a function that takes a DataFrame and automatically converts all string columns with fewer than 50 unique values to categorical type. Return the total memory saved.

Solution to Exercise 4

```python
import pandas as pd

s = pd.Categorical(['low', 'medium', 'high', 'low'],
                   categories=['low', 'medium', 'high'], ordered=True)
print(s)
print(s > 'low')
```
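The snippet above demonstrates ordered categoricals rather than the exercise itself. A possible sketch of the requested function follows. The name `convert_low_cardinality` and its `max_unique` parameter are hypothetical, and the function works on a copy so the caller's DataFrame is left unchanged.

```python
import numpy as np
import pandas as pd

def convert_low_cardinality(df, max_unique=50):
    """Return (converted copy, bytes saved).

    Object columns with fewer than `max_unique` distinct values are
    converted to categorical. Hypothetical helper for the exercise.
    """
    before = df.memory_usage(deep=True).sum()
    out = df.copy()
    for col in out.select_dtypes(include=["object"]).columns:
        if out[col].nunique() < max_unique:
            out[col] = out[col].astype("category")
    after = out.memory_usage(deep=True).sum()
    return out, int(before - after)

rng = np.random.default_rng(0)
n = 10_000
df = pd.DataFrame({
    "rating": pd.Series(rng.choice(["A", "B", "C", "D"], n), dtype=object),
    "id": pd.Series([f"ID_{i}" for i in range(n)], dtype=object),
})

converted, saved = convert_low_cardinality(df)
print(converted.dtypes)
print(f"Saved: {saved:,} bytes")
```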