explode Method¶
The explode() method transforms list-like elements in a column into separate rows, duplicating the index for each element. This is essential for normalizing nested data structures.
Mental Model
explode is the inverse of groupby(...).agg(list). If a cell contains a list like [a, b, c], explode creates three rows -- one per element -- duplicating the values in all other columns. It normalizes nested data into a flat, one-value-per-row format suitable for standard pandas operations.
Basic Usage¶
Exploding a Single Column¶
```python import pandas as pd
df = pd.DataFrame({ 'name': ['Alice', 'Bob', 'Carol'], 'hobbies': [['reading', 'chess'], ['hiking'], ['painting', 'music', 'cooking']] })
print("Original:") print(df)
exploded = df.explode('hobbies') print("\nExploded:") print(exploded) ```
``` Original: name hobbies 0 Alice [reading, chess] 1 Bob [hiking] 2 Carol [painting, music, cooking]
Exploded: name hobbies 0 Alice reading 0 Alice chess 1 Bob hiking 2 Carol painting 2 Carol music 2 Carol cooking ```
Index Preservation¶
Notice that the index is preserved and duplicated. This maintains the relationship to the original row.
```python
To get a fresh sequential index¶
exploded_reset = df.explode('hobbies').reset_index(drop=True) print(exploded_reset) ```
Handling Empty Lists and NaN¶
Empty Lists Become NaN¶
```python df = pd.DataFrame({ 'name': ['Alice', 'Bob', 'Carol'], 'items': [['a', 'b'], [], ['c']] })
exploded = df.explode('items') print(exploded) ```
name items
0 Alice a
0 Alice b
1 Bob NaN
2 Carol c
NaN Values Stay NaN¶
```python df = pd.DataFrame({ 'name': ['Alice', 'Bob'], 'items': [['a', 'b'], None] })
exploded = df.explode('items') print(exploded) ```
name items
0 Alice a
0 Alice b
1 Bob None
Exploding Multiple Columns¶
Synchronized Explosion (pandas 1.3+)¶
When two columns have lists of the same length, you can explode them together.
```python df = pd.DataFrame({ 'id': [1, 2], 'keys': [['a', 'b'], ['c', 'd', 'e']], 'values': [[10, 20], [30, 40, 50]] })
Explode both columns together¶
exploded = df.explode(['keys', 'values']) print(exploded) ```
id keys values
0 1 a 10
0 1 b 20
1 2 c 30
1 2 d 40
1 2 e 50
Mismatched Lengths Cause Errors¶
```python df = pd.DataFrame({ 'id': [1], 'keys': [['a', 'b']], 'values': [[10, 20, 30]] # Length mismatch! })
This will raise ValueError¶
df.explode(['keys', 'values'])¶
```
Practical Examples¶
1. Normalizing JSON Data¶
```python
Data from an API with nested structure¶
data = pd.DataFrame({ 'user_id': [1, 2, 3], 'username': ['alice', 'bob', 'carol'], 'tags': [ ['python', 'data'], ['javascript', 'react', 'node'], ['python', 'ml', 'deep-learning'] ] })
Normalize to one tag per row¶
normalized = data.explode('tags')
Count tag frequency¶
tag_counts = normalized['tags'].value_counts() print(tag_counts) ```
2. Stock Holdings Breakdown¶
```python
Portfolio with multiple holdings per account¶
portfolio = pd.DataFrame({ 'account': ['IRA', 'Brokerage'], 'tickers': [['AAPL', 'GOOGL', 'MSFT'], ['VTI', 'BND']], 'shares': [[100, 50, 75], [500, 200]] })
Explode to individual holdings¶
holdings = portfolio.explode(['tickers', 'shares']) print(holdings) ```
account tickers shares
0 IRA AAPL 100
0 IRA GOOGL 50
0 IRA MSFT 75
1 Brokerage VTI 500
1 Brokerage BND 200
3. Email Recipients¶
```python emails = pd.DataFrame({ 'subject': ['Meeting', 'Report', 'Update'], 'recipients': [ ['alice@company.com', 'bob@company.com'], ['carol@company.com'], ['alice@company.com', 'bob@company.com', 'carol@company.com'] ], 'sent_date': ['2024-01-15', '2024-01-16', '2024-01-17'] })
One row per recipient¶
expanded = emails.explode('recipients')
Count emails per recipient¶
print(expanded['recipients'].value_counts()) ```
4. Processing Survey Responses¶
```python survey = pd.DataFrame({ 'respondent_id': [1, 2, 3], 'selected_options': [ ['Option A', 'Option C'], ['Option B'], ['Option A', 'Option B', 'Option C', 'Option D'] ] })
Expand to analyze individual selections¶
expanded = survey.explode('selected_options')
Cross-tabulation¶
print(pd.crosstab(expanded['respondent_id'], expanded['selected_options'])) ```
5. Time Series with Multiple Events¶
```python events = pd.DataFrame({ 'date': pd.to_datetime(['2024-01-01', '2024-01-02']), 'events': [ ['login', 'purchase', 'logout'], ['login', 'browse'] ], 'timestamps': [ ['09:00', '10:30', '11:00'], ['14:00', '14:15'] ] })
Expand events with their timestamps¶
expanded = events.explode(['events', 'timestamps']) print(expanded) ```
Working with String Data¶
If your "list" is actually a string representation, you need to convert it first.
String to List Conversion¶
```python df = pd.DataFrame({ 'id': [1, 2], 'tags': ['["python", "data"]', '["ml", "ai"]'] # String, not list! })
Check type¶
print(type(df['tags'].iloc[0])) #
Convert string to list using ast.literal_eval¶
import ast df['tags'] = df['tags'].apply(ast.literal_eval)
Now explode works¶
exploded = df.explode('tags') print(exploded) ```
Comma-Separated Strings¶
```python df = pd.DataFrame({ 'id': [1, 2], 'tags': ['python,data,ml', 'javascript,react'] })
Split string into list, then explode¶
df['tags'] = df['tags'].str.split(',') exploded = df.explode('tags') print(exploded) ```
Inverse Operation: Aggregating Back¶
To reverse an explode operation, use groupby with a list aggregation.
```python
Exploded data¶
exploded = pd.DataFrame({ 'name': ['Alice', 'Alice', 'Bob', 'Bob', 'Bob'], 'hobby': ['reading', 'chess', 'hiking', 'camping', 'fishing'] })
Aggregate back to lists¶
aggregated = exploded.groupby('name')['hobby'].agg(list).reset_index() print(aggregated) ```
name hobby
0 Alice [reading, chess]
1 Bob [hiking, camping, fishing]
Performance Considerations¶
Memory Impact¶
Exploding creates many new rows, which can significantly increase memory usage.
```python
Original: 1000 rows, average 10 items per list = 10,000 rows after explode¶
df = pd.DataFrame({ 'id': range(1000), 'items': [list(range(10)) for _ in range(1000)] })
print(f"Original rows: {len(df)}") exploded = df.explode('items') print(f"Exploded rows: {len(exploded)}") ```
Processing Large Datasets¶
For very large datasets, consider processing in chunks:
python
def explode_in_chunks(df, column, chunk_size=10000):
"""Explode a large DataFrame in chunks."""
chunks = []
for start in range(0, len(df), chunk_size):
chunk = df.iloc[start:start + chunk_size]
chunks.append(chunk.explode(column))
return pd.concat(chunks, ignore_index=True)
Key Parameters¶
| Parameter | Description | Default |
|---|---|---|
column |
Column(s) to explode | Required |
ignore_index |
Reset index to sequential integers | False |
Common Pitfalls¶
1. Expecting Original Index After reset_index¶
```python exploded = df.explode('items')
Index is duplicated (0, 0, 1, 1, 2, 2, ...)¶
Use reset_index for sequential index¶
exploded = df.explode('items').reset_index(drop=True)
Now index is (0, 1, 2, 3, 4, 5, ...)¶
```
2. Scalar Values Don't Explode¶
```python df = pd.DataFrame({ 'id': [1, 2], 'value': [10, 20] # Scalars, not lists })
exploded = df.explode('value') print(exploded) # No change - scalars stay as is ```
3. Nested Lists Stay Nested¶
```python df = pd.DataFrame({ 'id': [1], 'nested': [[['a', 'b'], ['c', 'd']]] # List of lists })
exploded = df.explode('nested') print(exploded) ```
id nested
0 1 [a, b]
0 1 [c, d]
To fully flatten, you need multiple explode calls or a custom approach.
Exercises¶
Exercise 1. Write code that uses df.explode('col') to convert a column containing lists into separate rows.
Solution to Exercise 1
```python import pandas as pd import numpy as np
Solution for the specific exercise¶
np.random.seed(42) df = pd.DataFrame({'A': np.random.randn(10), 'B': np.random.randn(10)}) print(df.head()) ```
Exercise 2. Explain when explode() is useful. Give a real-world example of data that would benefit from it.
Solution to Exercise 2
See the main content for the detailed explanation. The key concept involves understanding the Pandas API and its behavior for this specific operation.
Exercise 3. Create a DataFrame where one column contains comma-separated strings. Split them into lists with .str.split(',') and then use explode().
Solution to Exercise 3
```python import pandas as pd import numpy as np
np.random.seed(42) df = pd.DataFrame({'A': np.random.randn(20), 'B': np.random.randn(20)}) result = df.describe() print(result) ```
Exercise 4. Write code showing that explode() preserves the original index. How can you reset it?
Solution to Exercise 4
```python import pandas as pd import numpy as np
np.random.seed(42) df = pd.DataFrame({'A': np.random.randn(50), 'group': np.random.choice(['X', 'Y'], 50)}) result = df.groupby('group').mean() print(result) ```