Memory Usage
Understanding memory consumption is essential when working with large datasets. Pandas provides memory_usage() and info() methods for profiling DataFrame memory.
Mental Model
Each column occupies a contiguous block of memory whose size depends on dtype and row count. memory_usage(deep=True) measures the true footprint, including the actual string contents of object columns. The key insight: object columns often dominate memory because each cell is a separate Python object, while numeric columns are compact arrays.
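As a rough check of this model, the sketch below compares the reported size of a numeric column with item size × row count, and the shallow size of an object column with pointer size × row count. The variable names and the 1,000-row size are illustrative. The object dtype is requested explicitly because newer pandas versions may infer a dedicated string dtype instead.

```python
import numpy as np
import pandas as pd

n_rows = 1_000

# Numeric column: one fixed-width value per row.
numeric = pd.Series(np.arange(n_rows, dtype="int64"))
numeric_bytes = numeric.memory_usage(index=False)
print(numeric_bytes == numeric.dtype.itemsize * n_rows)

# Object column: one pointer per row; the string objects live elsewhere.
labels = pd.Series(["row_" + str(i) for i in range(n_rows)], dtype=object)
pointer_bytes = labels.memory_usage(index=False)
deep_bytes = labels.memory_usage(index=False, deep=True)
print(pointer_bytes == np.dtype(object).itemsize * n_rows)
print(deep_bytes > pointer_bytes)
```

All three comparisons should print True if the model holds on your platform.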
memory_usage() Method
Returns the memory consumption of each column in bytes.
```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'int_col': np.random.randint(0, 100, 100000),
    'float_col': np.random.randn(100000),
    'str_col': ['category_' + str(i % 10) for i in range(100000)],
    'bool_col': np.random.choice([True, False], 100000)
})

print(df.memory_usage())
```
```
Index           128
int_col      800000
float_col    800000
str_col      800000
bool_col     100000
dtype: int64
```
The deep Parameter
By default, memory_usage() underestimates memory for object columns (strings). Use deep=True for accurate measurement:
```python
# Without deep: underestimates string memory
print(f"Shallow: {df.memory_usage().sum() / 1e6:.2f} MB")

# With deep: accurate measurement
print(f"Deep: {df.memory_usage(deep=True).sum() / 1e6:.2f} MB")
```
```
Shallow: 2.50 MB
Deep: 6.89 MB
```
Why deep=True Matters
Object dtype columns store pointers to Python objects. Without deep=True, only pointer size is counted:
```python
# String column memory comparison
s_string = pd.Series(['hello world'] * 100000)

print(f"Shallow: {s_string.memory_usage() / 1e6:.2f} MB")  # Just pointers
print(f"Deep: {s_string.memory_usage(deep=True) / 1e6:.2f} MB")  # Actual strings
```
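To see where the extra bytes come from, the sketch below compares the per-row gap between the deep and shallow figures with the size the interpreter reports for one string. It assumes CPython and an explicitly requested object dtype. The per-string size varies by Python version, so no expected numbers are shown.

```python
import sys
import pandas as pd

n = 1_000
s = pd.Series(["hello world"] * n, dtype=object)

shallow = s.memory_usage(index=False)
deep = s.memory_usage(index=False, deep=True)

# Size of one string object, as reported by the interpreter.
per_string = sys.getsizeof("hello world")

print(f"Pointer bytes per row: {shallow // n}")
print(f"Extra bytes per row with deep=True: {(deep - shallow) // n}")
print(f"sys.getsizeof of one string: {per_string}")
```

If the last two figures match, the deep measurement is sizing every element individually, even when many rows point at the same string object.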
index Parameter
Control whether to include index memory:
```python
# Include index (default)
print(df.memory_usage(index=True))

# Exclude index
print(df.memory_usage(index=False))
```
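The index only matters when it is large. As an illustration, the sketch below measures the index's share for a default RangeIndex and for a materialized integer index. The series names and the doubled labels are made up for the example.

```python
import numpy as np
import pandas as pd

n = 100_000
values = np.zeros(n, dtype="int64")

# Default RangeIndex: described by start/stop/step, so it stays tiny.
s_range = pd.Series(values)
range_index_bytes = s_range.memory_usage(index=True) - s_range.memory_usage(index=False)

# Materialized integer index: one 8-byte label per row.
s_int = pd.Series(values, index=np.arange(n, dtype="int64") * 2)
int_index_bytes = s_int.memory_usage(index=True) - s_int.memory_usage(index=False)

print(f"RangeIndex: {range_index_bytes} bytes")
print(f"Integer index: {int_index_bytes} bytes")
```

With a non-default index, excluding it via index=False can noticeably understate the total footprint.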
info() Method
Provides a comprehensive summary including memory usage.
```python
df.info()
```
```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype
---  ------     --------------   -----
 0   int_col    100000 non-null  int64
 1   float_col  100000 non-null  float64
 2   str_col    100000 non-null  object
 3   bool_col   100000 non-null  bool
dtypes: bool(1), float64(1), int64(1), object(1)
memory usage: 2.4 MB
```
memory_usage Parameter
```python
# Accurate memory with deep calculation
df.info(memory_usage='deep')
```
```
memory usage: 6.8 MB
```
verbose Parameter
Control column detail display:
```python
# Suppress column details for wide DataFrames
df.info(verbose=False)
```
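info() prints its summary instead of returning it. When the memory line is needed in a log or a test, one option is to redirect the output through the buf parameter, as in this minimal sketch. The 50-column frame and the variable names are illustrative.

```python
import io
import numpy as np
import pandas as pd

df_wide = pd.DataFrame(np.zeros((10, 50)))

# Write the summary into an in-memory text buffer instead of stdout.
buffer = io.StringIO()
df_wide.info(verbose=False, memory_usage="deep", buf=buffer)
summary = buffer.getvalue()

# Pull out just the memory line from the captured text.
memory_line = [line for line in summary.splitlines() if line.startswith("memory usage")][0]
print(memory_line)
```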
Memory by Data Type
Different dtypes consume different amounts of memory:

| Dtype | Bytes per Value | Notes |
|---|---|---|
| `bool` | 1 | Most efficient for True/False |
| `int8` | 1 | Range: -128 to 127 |
| `int16` | 2 | Range: -32,768 to 32,767 |
| `int32` | 4 | Range: ±2.1 billion |
| `int64` | 8 | Default integer type |
| `float16` | 2 | Limited precision |
| `float32` | 4 | Single precision |
| `float64` | 8 | Default float type |
| `object` | 8+ | Pointer + actual object |
| `category` | 1-8 | Depends on category count |
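The sizes and ranges in the table can be checked directly against NumPy, which pandas uses for these dtypes. The sketch below is one way to do that. The dictionary names are illustrative.

```python
import numpy as np

# Integer dtypes: item size in bytes and representable range.
int_ranges = {}
for name in ["int8", "int16", "int32", "int64"]:
    dtype = np.dtype(name)
    info = np.iinfo(dtype)
    int_ranges[name] = (dtype.itemsize, info.min, info.max)
    print(f"{name:<8} {dtype.itemsize} bytes  {info.min} to {info.max}")

# Float dtypes: item size and approximate number of reliable decimal digits.
float_digits = {}
for name in ["float16", "float32", "float64"]:
    dtype = np.dtype(name)
    info = np.finfo(dtype)
    float_digits[name] = (dtype.itemsize, info.precision)
    print(f"{name:<8} {dtype.itemsize} bytes  ~{info.precision} decimal digits")
```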
```python
# Compare memory for same data, different types
n = 1_000_000

df_compare = pd.DataFrame({
    'int64': np.array([1, 2, 3, 4, 5] * (n // 5), dtype='int64'),
    'int32': np.array([1, 2, 3, 4, 5] * (n // 5), dtype='int32'),
    'int16': np.array([1, 2, 3, 4, 5] * (n // 5), dtype='int16'),
    'int8': np.array([1, 2, 3, 4, 5] * (n // 5), dtype='int8'),
})

print(df_compare.memory_usage(deep=True))
```
```
Index        128
int64    8000000
int32    4000000
int16    2000000
int8     1000000
dtype: int64
```
Analyzing Memory Distribution
```python
def memory_report(df):
    """Generate a detailed memory report."""
    mem = df.memory_usage(deep=True)
    total = mem.sum()

    print(f"Total Memory: {total / 1e6:.2f} MB\n")
    print(f"{'Column':<20} {'Type':<12} {'Memory':>12} {'Percent':>8}")
    print("-" * 54)

    for col in df.columns:
        col_mem = mem[col]
        pct = col_mem / total * 100
        dtype = str(df[col].dtype)
        print(f"{col:<20} {dtype:<12} {col_mem/1e6:>10.2f} MB {pct:>7.1f}%")

memory_report(df)
```
```
Total Memory: 6.89 MB

Column               Type               Memory  Percent
------------------------------------------------------
int_col              int64              0.80 MB    11.6%
float_col            float64            0.80 MB    11.6%
str_col              object             5.19 MB    75.3%
bool_col             bool               0.10 MB     1.5%
```
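When the report needs to be sorted, filtered, or saved, returning a DataFrame can be more convenient than printing. The following is a sketch of that variant, not a replacement for the function above. memory_table and the sample frame are hypothetical names, and the percentages here are relative to the column total, excluding the index.

```python
import numpy as np
import pandas as pd

def memory_table(df):
    """Return per-column memory as a DataFrame, largest column first."""
    mem = df.memory_usage(deep=True, index=False)
    table = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "bytes": mem,
        "percent": (mem / mem.sum() * 100).round(1),
    })
    return table.sort_values("bytes", ascending=False)

sample = pd.DataFrame({
    "ids": np.arange(1_000, dtype="int64"),
    "labels": pd.Series(["label_" + str(i % 10) for i in range(1_000)], dtype=object),
})

table = memory_table(sample)
print(table)
```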
Monitoring Memory Growth
Track memory during transformations:
```python
def track_memory(df, operation_name):
    """Print memory after an operation."""
    mem_mb = df.memory_usage(deep=True).sum() / 1e6
    print(f"{operation_name}: {mem_mb:.2f} MB")

# Initial
track_memory(df, "Initial")

# After adding column
df['new_col'] = df['int_col'] * 2
track_memory(df, "After adding column")

# After type conversion
df['str_col'] = df['str_col'].astype('category')
track_memory(df, "After categorical conversion")
```
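The categorical conversion saves memory because each row stores a small integer code instead of a pointer to a string, and the width of that code grows with the number of categories. The sketch below inspects the code dtype for a low-cardinality and a higher-cardinality column. The series names are illustrative.

```python
import pandas as pd

# Three distinct values: the codes fit in the narrowest integer type.
few = pd.Series(["a", "b", "c"] * 1_000, dtype=object).astype("category")

# One thousand distinct values: the codes need a wider integer type.
many = pd.Series([f"item_{i}" for i in range(1_000)], dtype=object).astype("category")

print(f"few:  {few.cat.categories.size} categories, codes stored as {few.cat.codes.dtype}")
print(f"many: {many.cat.categories.size} categories, codes stored as {many.cat.codes.dtype}")
```

Because the category values themselves are also stored once, a column where nearly every value is unique may gain little from the conversion.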
Practical Example: Optimizing a DataFrame
```python
def optimize_dtypes(df, verbose=True):
    """Optimize DataFrame memory by downcasting types."""
    initial_mem = df.memory_usage(deep=True).sum()

    for col in df.columns:
        col_type = df[col].dtype

        # Optimize integers: pick the narrowest type that holds the observed range
        if col_type == 'int64':
            c_min, c_max = df[col].min(), df[col].max()
            if c_min >= -128 and c_max <= 127:
                df[col] = df[col].astype('int8')
            elif c_min >= -32768 and c_max <= 32767:
                df[col] = df[col].astype('int16')
            elif c_min >= -2147483648 and c_max <= 2147483647:
                df[col] = df[col].astype('int32')

        # Optimize floats
        # Note: float32 keeps only about 7 significant digits
        elif col_type == 'float64':
            df[col] = df[col].astype('float32')

        # Convert low-cardinality strings to categorical
        elif col_type == 'object':
            n_unique = df[col].nunique()
            if n_unique / len(df) < 0.5:  # Less than 50% unique
                df[col] = df[col].astype('category')

    final_mem = df.memory_usage(deep=True).sum()

    if verbose:
        print(f"Memory: {initial_mem/1e6:.1f} MB → {final_mem/1e6:.1f} MB")
        print(f"Reduction: {(1 - final_mem/initial_mem)*100:.1f}%")

    return df

df_optimized = optimize_dtypes(df.copy())
```
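As an alternative to hand-written range checks, pandas can choose the narrowest numeric type itself through the downcast argument of pd.to_numeric. The sketch below shows it on two small illustrative series and measures the rounding error introduced by the float downcast.

```python
import numpy as np
import pandas as pd

ints = pd.Series([1, 50, 100], dtype="int64")
floats = pd.Series([0.1, 0.2, 0.3], dtype="float64")

# Let pandas pick the smallest dtype that can hold the values.
small_ints = pd.to_numeric(ints, downcast="integer")
small_floats = pd.to_numeric(floats, downcast="float")

print(small_ints.dtype, small_floats.dtype)

# float32 cannot represent every float64 value exactly, so measure the drift.
max_error = np.abs(small_floats.astype("float64") - floats).max()
print(f"Largest rounding error after downcast: {max_error:.2e}")
```

Whether that rounding error is acceptable depends on the data, so it is worth checking before downcasting float columns in bulk.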
Summary

| Method | Purpose | Key Parameter |
|---|---|---|
| `memory_usage()` | Per-column memory | `deep=True` for accuracy |
| `info()` | Overall summary | `memory_usage='deep'` |

Best practices:
- Always use `deep=True` for object columns
- Monitor memory after each major transformation
- Optimize dtypes for large datasets
- Convert string columns to categorical when appropriate
Exercises
Exercise 1.
Create a DataFrame with 100,000 rows containing an int64 column, a float64 column, and a string (object) column. Use .memory_usage(deep=True) to find the total memory. Compare it with .memory_usage() (without deep=True) to see the difference for object columns.
Solution to Exercise 1
Compare shallow vs deep memory measurement.
```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'ints': np.random.randint(0, 100, 100000),
    'floats': np.random.randn(100000),
    'strings': ['item_' + str(i % 10) for i in range(100000)]
})

shallow = df.memory_usage().sum()
deep = df.memory_usage(deep=True).sum()

print(f"Shallow: {shallow / 1e6:.2f} MB")
print(f"Deep: {deep / 1e6:.2f} MB")
print(f"Difference: {(deep - shallow) / 1e6:.2f} MB")
```
Exercise 2.
Take a DataFrame with a string column that has low cardinality (few unique values). Convert it to 'category' dtype using .astype('category'). Compare memory usage before and after the conversion using .memory_usage(deep=True).
Solution to Exercise 2
Convert low-cardinality string column to categorical.
```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'status': np.random.choice(['active', 'inactive', 'pending'], 100000)
})

before = df.memory_usage(deep=True).sum()
df['status'] = df['status'].astype('category')
after = df.memory_usage(deep=True).sum()

print(f"Before: {before / 1e6:.2f} MB")
print(f"After: {after / 1e6:.2f} MB")
print(f"Reduction: {(1 - after / before) * 100:.1f}%")
```
Exercise 3.
Create a DataFrame with an int64 column where all values fit in int8 range (-128 to 127). Downcast it to int8 using .astype('int8'). Measure the memory reduction and verify the values are unchanged.
Solution to Exercise 3
Downcast int64 to int8 and verify values.
```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'values': np.random.randint(-128, 128, 100000).astype('int64')})

before = df.memory_usage(deep=True).sum()
original_values = df['values'].copy()
df['values'] = df['values'].astype('int8')
after = df.memory_usage(deep=True).sum()

print(f"Before: {before / 1e6:.2f} MB")
print(f"After: {after / 1e6:.2f} MB")
print(f"Reduction: {(1 - after / before) * 100:.1f}%")

assert (df['values'] == original_values).all()
print("Values unchanged after downcast: True")
```