Skip to content

Memory Efficiency with Categoricals

One of the primary benefits of categorical data is dramatic memory savings. This document demonstrates the memory characteristics and optimization strategies.

Mental Model

Object-dtype columns store a full Python string per cell -- a million rows of "Technology" means a million copies. Categorical dtype stores the string once in a lookup table and keeps only a small integer code per row. The fewer unique values relative to total rows, the larger the memory savings.

How Memory is Saved

String Storage (object dtype)

Each string value is stored separately in memory, even if repeated:

```python import pandas as pd import numpy as np

1 million rows with 10 unique sectors

sectors = ['Technology', 'Finance', 'Healthcare', 'Retail', 'Energy', 'Utilities', 'Media', 'Aerospace', 'Banks', 'Insurance']

np.random.seed(42) data = np.random.choice(sectors, size=1_000_000)

String storage

s_string = pd.Series(data) string_memory = s_string.memory_usage(deep=True) print(f"String storage: {string_memory / 1e6:.2f} MB") ```

String storage: 57.89 MB

Categorical Storage

Categories are stored once; data stores only integer codes:

```python

Categorical storage

s_cat = s_string.astype('category') cat_memory = s_cat.memory_usage(deep=True) print(f"Categorical storage: {cat_memory / 1e6:.2f} MB") print(f"Memory reduction: {string_memory / cat_memory:.1f}x") ```

Categorical storage: 1.00 MB Memory reduction: 57.9x

Memory Breakdown

```python def analyze_categorical_memory(s_cat): """Analyze memory components of a categorical Series.""" # Category table size categories = s_cat.cat.categories cat_memory = categories.memory_usage(deep=True)

# Codes array size (integer array)
codes_memory = s_cat.cat.codes.nbytes

# Total
total = s_cat.memory_usage(deep=True)

print(f"Categories ({len(categories)} unique): {cat_memory:,} bytes")
print(f"Codes ({len(s_cat):,} values): {codes_memory:,} bytes")
print(f"Total: {total:,} bytes")

return cat_memory, codes_memory, total

s_cat = pd.Series(np.random.choice(['A', 'B', 'C'], 1_000_000), dtype='category') analyze_categorical_memory(s_cat) ```

Categories (3 unique): 248 bytes Codes (1,000,000 values): 1,000,000 bytes Total: 1,000,376 bytes

Memory Comparison Table

```python def compare_memory(n_rows, n_categories, avg_string_length=10): """Compare string vs categorical memory usage.""" # Generate data categories = [f'Cat_{i:0{len(str(n_categories))}d}' for i in range(n_categories)] data = np.random.choice(categories, n_rows)

# String
s_string = pd.Series(data)
string_mem = s_string.memory_usage(deep=True)

# Categorical
s_cat = s_string.astype('category')
cat_mem = s_cat.memory_usage(deep=True)

return string_mem, cat_mem, string_mem / cat_mem

Test different scenarios

scenarios = [ (100_000, 5), (100_000, 50), (100_000, 500), (1_000_000, 10), (1_000_000, 100), (1_000_000, 1000), ]

print(f"{'Rows':>12} {'Categories':>12} {'String MB':>12} {'Cat MB':>12} {'Ratio':>8}") print("-" * 60)

for n_rows, n_cats in scenarios: str_mem, cat_mem, ratio = compare_memory(n_rows, n_cats) print(f"{n_rows:>12,} {n_cats:>12} {str_mem/1e6:>12.2f} {cat_mem/1e6:>12.2f} {ratio:>8.1f}x") ```

``` Rows Categories String MB Cat MB Ratio


 100,000            5         5.79         0.10    57.9x
 100,000           50         5.79         0.11    52.6x
 100,000          500         6.30         0.15    42.0x

1,000,000 10 57.89 1.00 57.9x 1,000,000 100 57.89 1.01 57.3x 1,000,000 1000 62.89 1.10 57.2x ```

When Categoricals Save Memory

High Savings (Use Categorical)

  • Few unique values relative to total rows
  • Long string values
  • Many repeated values

```python

Ideal case: 1M rows, 10 categories, long strings

countries = ['United States of America', 'United Kingdom', 'Germany', 'France', 'Japan', 'China', 'India', 'Brazil', 'Canada', 'Australia']

data = np.random.choice(countries, 1_000_000) s_string = pd.Series(data) s_cat = s_string.astype('category')

print(f"String: {s_string.memory_usage(deep=True) / 1e6:.1f} MB") print(f"Categorical: {s_cat.memory_usage(deep=True) / 1e6:.1f} MB") ```

Low Savings (May Not Be Worth It)

  • Many unique values (high cardinality)
  • Short strings
  • Few rows

```python

Poor case: many unique values

unique_ids = [f'ID_{i}' for i in range(100_000)] # All unique s_string = pd.Series(unique_ids) s_cat = s_string.astype('category')

print(f"String: {s_string.memory_usage(deep=True) / 1e6:.1f} MB") print(f"Categorical: {s_cat.memory_usage(deep=True) / 1e6:.1f} MB")

Similar or worse for high cardinality

```

Integer Code Sizes

Pandas automatically chooses the smallest integer type for codes:

Number of Categories Code Type Bytes per Value
≤ 127 int8 1
≤ 32,767 int16 2
≤ 2,147,483,647 int32 4
> 2,147,483,647 int64 8

```python

Few categories -> int8

s = pd.Series(['a', 'b', 'c'] * 1000, dtype='category') print(f"3 categories: {s.cat.codes.dtype}") # int8

Many categories -> int16

cats = [f'cat_{i}' for i in range(200)] s = pd.Series(np.random.choice(cats, 1000), dtype='category') print(f"200 categories: {s.cat.codes.dtype}") # int16 ```

DataFrame Memory Optimization

```python def optimize_dataframe(df, verbose=True): """Convert low-cardinality string columns to categorical.""" original_memory = df.memory_usage(deep=True).sum()

for col in df.select_dtypes(include=['object']).columns:
    n_unique = df[col].nunique()
    n_total = len(df)

    # Convert if less than 50% unique values
    if n_unique / n_total < 0.5:
        df[col] = df[col].astype('category')
        if verbose:
            print(f"Converted '{col}': {n_unique} unique values")

new_memory = df.memory_usage(deep=True).sum()

if verbose:
    print(f"\nMemory: {original_memory/1e6:.1f} MB → {new_memory/1e6:.1f} MB")
    print(f"Reduction: {(1 - new_memory/original_memory)*100:.1f}%")

return df

Example usage

df = pd.DataFrame({ 'sector': np.random.choice(['Tech', 'Finance', 'Health'], 100_000), 'rating': np.random.choice(['A', 'B', 'C', 'D'], 100_000), 'id': [f'ID_{i}' for i in range(100_000)], # High cardinality - won't convert 'value': np.random.randn(100_000) })

df = optimize_dataframe(df) ```

``` Converted 'sector': 3 unique values Converted 'rating': 4 unique values

Memory: 14.2 MB → 2.1 MB Reduction: 85.2% ```

Real-World Example: S&P 500 Data

```python

Simulate S&P 500 historical data

np.random.seed(42)

sectors = ['Technology', 'Healthcare', 'Finance', 'Energy', 'Consumer Discretionary', 'Consumer Staples', 'Industrials', 'Materials', 'Utilities', 'Real Estate', 'Communication Services']

tickers = [f'STOCK_{i:03d}' for i in range(500)] dates = pd.date_range('2020-01-01', '2024-01-01', freq='B')

Create large dataset

n_rows = len(tickers) * len(dates) df = pd.DataFrame({ 'date': np.tile(dates, len(tickers)), 'ticker': np.repeat(tickers, len(dates)), 'sector': np.repeat(np.random.choice(sectors, len(tickers)), len(dates)), 'close': np.random.randn(n_rows).cumsum() + 100, 'volume': np.random.randint(1000, 1000000, n_rows) })

print(f"Dataset size: {len(df):,} rows") print(f"\nBefore optimization:") print(df.memory_usage(deep=True)) print(f"Total: {df.memory_usage(deep=True).sum() / 1e6:.1f} MB")

Optimize

df['ticker'] = df['ticker'].astype('category') df['sector'] = df['sector'].astype('category')

print(f"\nAfter optimization:") print(df.memory_usage(deep=True)) print(f"Total: {df.memory_usage(deep=True).sum() / 1e6:.1f} MB") ```

Guidelines

Unique Values (% of rows) Recommendation
< 1% ✅ Definitely use categorical
1-10% ✅ Use categorical
10-50% ⚠️ Test both options
> 50% ❌ Probably not beneficial

Performance vs Memory Trade-off

Converting to categorical has a small upfront cost but saves memory and speeds up operations:

```python import time

Large dataset

n = 5_000_000 sectors = ['A', 'B', 'C', 'D', 'E'] data = np.random.choice(sectors, n)

Conversion time

start = time.time() s_cat = pd.Series(data, dtype='category') conv_time = time.time() - start print(f"Conversion time: {conv_time:.2f}s")

GroupBy comparison

s_string = pd.Series(data) values = np.random.randn(n)

start = time.time() pd.Series(values).groupby(s_string).mean() string_groupby = time.time() - start

start = time.time() pd.Series(values).groupby(s_cat).mean() cat_groupby = time.time() - start

print(f"String groupby: {string_groupby:.3f}s") print(f"Categorical groupby: {cat_groupby:.3f}s") print(f"Speedup: {string_groupby/cat_groupby:.1f}x") ```


Exercises

Exercise 1. Create a DataFrame with a string column containing 100000 rows but only 5 unique values. Compare memory usage before and after converting to categorical.

Solution to Exercise 1

```python import pandas as pd

See page content for relevant API details

s = pd.Series(['a', 'b', 'c', 'a', 'b'], dtype='category') print(s) print(s.cat.categories) print(s.cat.codes) ```


Exercise 2. Explain how categorical data is stored internally (codes + categories). Why is this more memory-efficient than storing repeated strings?

Solution to Exercise 2

See the explanation in the main content of this page. The key concept involves understanding the categorical data type and its internal representation in Pandas.


Exercise 3. Write code that reads memory usage of each column in a DataFrame using df.memory_usage(deep=True) and identifies which columns would benefit from categorical conversion.

Solution to Exercise 3

```python import pandas as pd import numpy as np

np.random.seed(42) df = pd.DataFrame({'col': np.random.choice(['A', 'B', 'C'], 1000)}) df['col'] = df['col'].astype('category') print(df.dtypes) print(df['col'].value_counts()) ```


Exercise 4. Create a function that takes a DataFrame and automatically converts all string columns with fewer than 50 unique values to categorical type. Return the total memory saved.

Solution to Exercise 4

```python import pandas as pd

s = pd.Categorical(['low', 'medium', 'high', 'low'], categories=['low', 'medium', 'high'], ordered=True) print(s) print(s > 'low') ```