Memory Usage¶

Understanding memory consumption is essential when working with large datasets. Pandas provides memory_usage() and info() methods for profiling DataFrame memory.

memory_usage() Method¶

Returns a Series giving the memory consumption, in bytes, of the index and of each column.

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'int_col': np.random.randint(0, 100, 100000),
    'float_col': np.random.randn(100000),
    'str_col': ['category_' + str(i % 10) for i in range(100000)],
    'bool_col': np.random.choice([True, False], 100000)
})

print(df.memory_usage())

Index          128
int_col     800000
float_col   800000
str_col     800000
bool_col    100000
dtype: int64
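For fixed-width dtypes, each figure is simply the row count multiplied by the item size of the dtype. The sketch below checks that relationship on a small frame; the name `small` and the explicit dtypes are chosen here purely for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical small frame with explicit fixed-width dtypes
small = pd.DataFrame({
    'int_col': np.arange(1000, dtype='int64'),
    'bool_col': np.zeros(1000, dtype='bool'),
})

usage = small.memory_usage(index=False)

# Bytes per column = number of rows * bytes per value
expected_int = len(small) * small['int_col'].dtype.itemsize    # 1000 * 8
expected_bool = len(small) * small['bool_col'].dtype.itemsize  # 1000 * 1

print(usage['int_col'], expected_int)
print(usage['bool_col'], expected_bool)
```

This arithmetic does not hold for object columns, where the shallow figure only covers the pointers.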

The deep Parameter¶

By default, memory_usage() underestimates memory for object columns (strings). Use deep=True for accurate measurement:

# Without deep: underestimates string memory
print(f"Shallow: {df.memory_usage().sum() / 1e6:.2f} MB")

# With deep: accurate measurement
print(f"Deep: {df.memory_usage(deep=True).sum() / 1e6:.2f} MB")

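Shallow: 2.50 MB
Deep: 6.89 MB

The deep figure is indicative: the exact value depends on the Python and pandas versions in use, because the per-string overhead differs between releases.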

Why deep=True Matters¶

Object dtype columns store pointers to Python objects. Without deep=True, only pointer size is counted:

# String column memory comparison
s_string = pd.Series(['hello world'] * 100000)

print(f"Shallow: {s_string.memory_usage() / 1e6:.2f} MB")      # Just pointers
print(f"Deep: {s_string.memory_usage(deep=True) / 1e6:.2f} MB") # Actual strings
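To make the pointer-versus-object distinction concrete, the following sketch breaks both figures down into their parts. It assumes CPython and forces `dtype=object` so the result does not depend on how a given pandas version infers string columns; the variable names are illustrative.

```python
import sys
import numpy as np
import pandas as pd

n = 1000
s = pd.Series(['hello world'] * n, dtype=object)  # force object dtype

shallow = s.memory_usage(index=False)             # pointers only
deep = s.memory_usage(index=False, deep=True)     # pointers + string objects

pointer_size = np.dtype(object).itemsize          # 8 on a 64-bit build
string_size = sys.getsizeof('hello world')        # size of one str object

print(f"Shallow: {shallow} bytes = {n} x {pointer_size}")
print(f"Deep:    {deep} bytes = shallow + {n} x {string_size}")
```

Note that the deep figure counts the string once per row even though the list holds `n` references to a single object, so `deep=True` can overstate memory for heavily shared values.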

index Parameter¶

Control whether to include index memory:

# Include index (default)
print(df.memory_usage(index=True))

# Exclude index
print(df.memory_usage(index=False))
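How much the index contributes depends on its type. As a rough sketch (the frame names `df_range` and `df_labeled` are made up for this example), a default RangeIndex costs almost nothing, while an index that stores one label per row is as expensive as a regular column:

```python
import numpy as np
import pandas as pd

values = np.arange(1000, dtype='int64')

# Default RangeIndex: stored as start/stop/step, so it stays tiny
df_range = pd.DataFrame({'a': values})

# Materialized integer index: one 8-byte label per row
df_labeled = pd.DataFrame({'a': values}, index=values * 10)

range_cost = df_range.memory_usage(index=True)['Index']
labeled_cost = df_labeled.memory_usage(index=True)['Index']

print(f"RangeIndex:  {range_cost} bytes")
print(f"Int64 index: {labeled_cost} bytes")

# index=False drops the 'Index' entry from the result entirely
has_index_entry = 'Index' in df_range.memory_usage(index=False).index
print(has_index_entry)
```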

info() Method¶

Prints a concise summary of the DataFrame: index range, column names, non-null counts, dtypes, and memory usage.

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   int_col    100000 non-null  int64  
 1   float_col  100000 non-null  float64
 2   str_col    100000 non-null  object 
 3   bool_col   100000 non-null  bool   
dtypes: bool(1), float64(1), int64(1), object(1)
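memory usage: 2.4+ MB

The trailing + marks the figure as a lower bound: the frame contains object columns, and their contents are not measured unless a deep calculation is requested.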

memory_usage Parameter¶

# Accurate memory with deep calculation
df.info(memory_usage='deep')

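memory usage: 6.6 MB

info() formats sizes in units of 1024 bytes, so this figure reads slightly lower than the 6.89 MB (decimal) obtained from memory_usage(deep=True).sum().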
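Because info() prints rather than returns its summary, comparing the shallow and deep figures programmatically requires capturing the output. One possible approach, sketched here with a hypothetical helper `info_memory_line`, is to pass a text buffer through the `buf` parameter:

```python
import io
import pandas as pd

frame = pd.DataFrame({
    'label': pd.Series(['a', 'bb', 'ccc'] * 100, dtype=object)
})

def info_memory_line(df, **kwargs):
    """Return the 'memory usage' line that df.info() would print."""
    buffer = io.StringIO()
    df.info(buf=buffer, **kwargs)
    lines = buffer.getvalue().splitlines()
    return next(line for line in lines if line.startswith('memory usage'))

shallow_line = info_memory_line(frame)                    # shallow estimate
deep_line = info_memory_line(frame, memory_usage='deep')  # deep measurement

print(shallow_line)
print(deep_line)
```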

verbose Parameter¶

Control column detail display:

# Suppress column details for wide DataFrames
df.info(verbose=False)

Memory by Data Type¶

Different dtypes consume different amounts of memory:

Dtype	Bytes per Value	Notes
`bool`	1	Most efficient for True/False
`int8`	1	Range: -128 to 127
`int16`	2	Range: -32,768 to 32,767
`int32`	4	Range: ±2.1 billion
`int64`	8	Default integer type
`float16`	2	Limited precision
`float32`	4	Single precision
`float64`	8	Default float type
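`object`	8+	8-byte pointer plus the Python object it references
`category`	1, 2, 4, or 8	Integer code per value (width grows with the category count), plus one copy of each category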
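The sizes and ranges in the table can be checked directly with NumPy rather than memorized. A minimal sketch using `np.dtype`, `np.iinfo`, and `np.finfo`:

```python
import numpy as np

# Integer dtypes: item size and representable range
for name in ['int8', 'int16', 'int32', 'int64']:
    dtype = np.dtype(name)
    limits = np.iinfo(dtype)
    print(f"{name:<8} {dtype.itemsize} bytes  {limits.min} to {limits.max}")

# Float dtypes: item size and approximate decimal precision
for name in ['float16', 'float32', 'float64']:
    dtype = np.dtype(name)
    limits = np.finfo(dtype)
    print(f"{name:<8} {dtype.itemsize} bytes  ~{limits.precision} decimal digits")
```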

# Compare memory for same data, different types
n = 1_000_000

df_compare = pd.DataFrame({
    'int64': np.array([1, 2, 3, 4, 5] * (n // 5), dtype='int64'),
    'int32': np.array([1, 2, 3, 4, 5] * (n // 5), dtype='int32'),
    'int16': np.array([1, 2, 3, 4, 5] * (n // 5), dtype='int16'),
    'int8': np.array([1, 2, 3, 4, 5] * (n // 5), dtype='int8'),
})

print(df_compare.memory_usage(deep=True))

Index        128
int64    8000000
int32    4000000
int16    2000000
int8     1000000
dtype: int64
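The category dtype follows the same logic as the integer widths compared here: each value is stored as an integer code, and pandas picks the narrowest code width that can address every category. A small sketch, with series names invented for the example:

```python
import pandas as pd

# 10 distinct categories: codes fit in a single byte
few = pd.Series([f"cat_{i % 10}" for i in range(10_000)]).astype('category')

# 1000 distinct categories: codes need two bytes
many = pd.Series([f"cat_{i % 1000}" for i in range(10_000)]).astype('category')

print(few.cat.codes.dtype)   # int8
print(many.cat.codes.dtype)  # int16
```

The categories themselves are stored once, so the saving shrinks as the number of distinct values approaches the number of rows.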

Analyzing Memory Distribution¶

def memory_report(df):
    """Generate a detailed memory report."""
    mem = df.memory_usage(deep=True)
    total = mem.sum()

    print(f"Total Memory: {total / 1e6:.2f} MB\n")
    print(f"{'Column':<20} {'Type':<12} {'Memory':>12} {'Percent':>8}")
    print("-" * 54)

    for col in df.columns:
        col_mem = mem[col]
        pct = col_mem / total * 100
        dtype = str(df[col].dtype)
        print(f"{col:<20} {dtype:<12} {col_mem/1e6:>10.2f} MB {pct:>7.1f}%")

memory_report(df)

Total Memory: 6.89 MB

Column               Type              Memory  Percent
------------------------------------------------------
int_col              int64             0.80 MB   11.6%
float_col            float64           0.80 MB   11.6%
str_col              object            5.19 MB   75.3%
bool_col             bool              0.10 MB    1.5%
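If the report needs to be sorted, filtered, or exported, it may be more convenient to return it as a DataFrame instead of printing it. The variant below is one possible sketch; `memory_table` and `sample` are hypothetical names, and the index is excluded so the percentages sum to 100.

```python
import numpy as np
import pandas as pd

def memory_table(df):
    """Return per-column deep memory as a DataFrame, largest first."""
    mem = df.memory_usage(deep=True, index=False)
    table = pd.DataFrame({
        'dtype': df.dtypes.astype(str),
        'bytes': mem,
        'percent': mem / mem.sum() * 100,
    })
    return table.sort_values('bytes', ascending=False)

sample = pd.DataFrame({
    'ints': np.arange(1000, dtype='int64'),
    'flags': np.zeros(1000, dtype='bool'),
})

report = memory_table(sample)
print(report)
```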

Monitoring Memory Growth¶

Track memory during transformations:

def track_memory(df, operation_name):
    """Print memory after an operation."""
    mem_mb = df.memory_usage(deep=True).sum() / 1e6
    print(f"{operation_name}: {mem_mb:.2f} MB")

# Initial
track_memory(df, "Initial")

# After adding column
df['new_col'] = df['int_col'] * 2
track_memory(df, "After adding column")

# After type conversion
df['str_col'] = df['str_col'].astype('category')
track_memory(df, "After categorical conversion")
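When several steps are chained, it can help to record each measurement and report the change relative to the previous step. The following is a minimal sketch of that idea on a made-up `events` frame; the helper `deep_bytes` is not part of pandas.

```python
import pandas as pd

def deep_bytes(df):
    """Total deep memory of a DataFrame in bytes."""
    return int(df.memory_usage(deep=True).sum())

events = pd.DataFrame({
    'status': pd.Series(['pending', 'shipped', 'delivered'] * 10_000,
                        dtype=object)
})

# Record (step name, bytes) after each transformation
history = [('initial', deep_bytes(events))]

events['status'] = events['status'].astype('category')
history.append(('after categorical conversion', deep_bytes(events)))

previous = None
for step, size in history:
    delta = '' if previous is None else f" ({size - previous:+,d} bytes)"
    print(f"{step}: {size / 1e6:.2f} MB{delta}")
    previous = size
```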

Practical Example: Optimizing a DataFrame¶

def optimize_dtypes(df, verbose=True):
    """Downcast column dtypes to reduce memory; modifies df in place and returns it."""
    initial_mem = df.memory_usage(deep=True).sum()

    for col in df.columns:
        col_type = df[col].dtype

        # Optimize integers
        if col_type == 'int64':
            c_min, c_max = df[col].min(), df[col].max()
            if c_min >= -128 and c_max <= 127:
                df[col] = df[col].astype('int8')
            elif c_min >= -32768 and c_max <= 32767:
                df[col] = df[col].astype('int16')
            elif c_min >= -2147483648 and c_max <= 2147483647:
                df[col] = df[col].astype('int32')

        # Optimize floats (float32 keeps only about 7 significant digits)
        elif col_type == 'float64':
            df[col] = df[col].astype('float32')

        # Convert low-cardinality strings to categorical
        elif col_type == 'object':
            n_unique = df[col].nunique()
            # Guard against empty frames before computing the unique ratio
            if len(df) > 0 and n_unique / len(df) < 0.5:  # Less than 50% unique
                df[col] = df[col].astype('category')

    final_mem = df.memory_usage(deep=True).sum()

    if verbose:
        print(f"Memory: {initial_mem/1e6:.1f} MB → {final_mem/1e6:.1f} MB")
        print(f"Reduction: {(1 - final_mem/initial_mem)*100:.1f}%")

    return df

df_optimized = optimize_dtypes(df.copy())
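As an alternative to hand-written range checks, pandas can choose the narrowest numeric dtype itself through the `downcast` argument of `pd.to_numeric`. The sketch below applies it to a small illustrative frame (`raw` and its column names are assumptions for the example, not part of the function above).

```python
import numpy as np
import pandas as pd

raw = pd.DataFrame({
    'small_ints': np.arange(100, dtype='int64'),              # fits in int8
    'big_ints': np.arange(100, dtype='int64') * 1_000_000,    # needs int32
    'floats': np.linspace(0.0, 1.0, 100),                     # float64
})

downcast = raw.copy()
downcast['small_ints'] = pd.to_numeric(raw['small_ints'], downcast='integer')
downcast['big_ints'] = pd.to_numeric(raw['big_ints'], downcast='integer')
downcast['floats'] = pd.to_numeric(raw['floats'], downcast='float')

print(downcast.dtypes)
print(raw.memory_usage(index=False).sum(), "->",
      downcast.memory_usage(index=False).sum(), "bytes")
```

The float downcast stops at float32, so the same precision caveat applies as with an explicit `astype('float32')`.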

Summary¶

Method	Purpose	Key Parameter
`memory_usage()`	Per-column memory	`deep=True` for accuracy
`info()`	Overall summary	`memory_usage='deep'`

Best practices:

- Always use deep=True for object columns
- Monitor memory after each major transformation
- Optimize dtypes for large datasets
- Convert string columns to categorical when appropriate