explode Method¶

The explode() method transforms list-like elements in a column into separate rows, duplicating the index for each element. This is essential for normalizing nested data structures.

Basic Usage¶

Exploding a Single Column¶

import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Carol'],
    'hobbies': [['reading', 'chess'], ['hiking'], ['painting', 'music', 'cooking']]
})

print("Original:")
print(df)

exploded = df.explode('hobbies')
print("\nExploded:")
print(exploded)

Original:
    name                      hobbies
0  Alice             [reading, chess]
1    Bob                     [hiking]
2  Carol  [painting, music, cooking]

Exploded:
    name   hobbies
0  Alice   reading
0  Alice     chess
1    Bob    hiking
2  Carol  painting
2  Carol     music
2  Carol   cooking

Index Preservation¶

Notice that the index is preserved and duplicated. This maintains the relationship to the original row.

# To get a fresh sequential index
exploded_reset = df.explode('hobbies').reset_index(drop=True)
print(exploded_reset)

Handling Empty Lists and NaN¶

Empty Lists Become NaN¶

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Carol'],
    'items': [['a', 'b'], [], ['c']]
})

exploded = df.explode('items')
print(exploded)

    name items
0  Alice     a
0  Alice     b
1    Bob   NaN
2  Carol     c

NaN Values Stay NaN¶

df = pd.DataFrame({
    'name': ['Alice', 'Bob'],
    'items': [['a', 'b'], None]
})

exploded = df.explode('items')
print(exploded)

    name items
0  Alice     a
0  Alice     b
1    Bob  None

Exploding Multiple Columns¶

Synchronized Explosion (pandas 1.3+)¶

When two columns have lists of the same length, you can explode them together.

df = pd.DataFrame({
    'id': [1, 2],
    'keys': [['a', 'b'], ['c', 'd', 'e']],
    'values': [[10, 20], [30, 40, 50]]
})

# Explode both columns together
exploded = df.explode(['keys', 'values'])
print(exploded)

   id keys  values
0   1    a      10
0   1    b      20
1   2    c      30
1   2    d      40
1   2    e      50

Mismatched Lengths Cause Errors¶

df = pd.DataFrame({
    'id': [1],
    'keys': [['a', 'b']],
    'values': [[10, 20, 30]]  # Length mismatch!
})

# This will raise ValueError
# df.explode(['keys', 'values'])

Practical Examples¶

1. Normalizing JSON Data¶

# Data from an API with nested structure
data = pd.DataFrame({
    'user_id': [1, 2, 3],
    'username': ['alice', 'bob', 'carol'],
    'tags': [
        ['python', 'data'],
        ['javascript', 'react', 'node'],
        ['python', 'ml', 'deep-learning']
    ]
})

# Normalize to one tag per row
normalized = data.explode('tags')

# Count tag frequency
tag_counts = normalized['tags'].value_counts()
print(tag_counts)

2. Stock Holdings Breakdown¶

# Portfolio with multiple holdings per account
portfolio = pd.DataFrame({
    'account': ['IRA', 'Brokerage'],
    'tickers': [['AAPL', 'GOOGL', 'MSFT'], ['VTI', 'BND']],
    'shares': [[100, 50, 75], [500, 200]]
})

# Explode to individual holdings
holdings = portfolio.explode(['tickers', 'shares'])
print(holdings)

     account tickers shares
0        IRA    AAPL    100
0        IRA   GOOGL     50
0        IRA    MSFT     75
1  Brokerage     VTI    500
1  Brokerage     BND    200

3. Email Recipients¶

emails = pd.DataFrame({
    'subject': ['Meeting', 'Report', 'Update'],
    'recipients': [
        ['alice@company.com', 'bob@company.com'],
        ['carol@company.com'],
        ['alice@company.com', 'bob@company.com', 'carol@company.com']
    ],
    'sent_date': ['2024-01-15', '2024-01-16', '2024-01-17']
})

# One row per recipient
expanded = emails.explode('recipients')

# Count emails per recipient
print(expanded['recipients'].value_counts())

4. Processing Survey Responses¶

survey = pd.DataFrame({
    'respondent_id': [1, 2, 3],
    'selected_options': [
        ['Option A', 'Option C'],
        ['Option B'],
        ['Option A', 'Option B', 'Option C', 'Option D']
    ]
})

# Expand to analyze individual selections
expanded = survey.explode('selected_options')

# Cross-tabulation
print(pd.crosstab(expanded['respondent_id'], expanded['selected_options']))

5. Time Series with Multiple Events¶

events = pd.DataFrame({
    'date': pd.to_datetime(['2024-01-01', '2024-01-02']),
    'events': [
        ['login', 'purchase', 'logout'],
        ['login', 'browse']
    ],
    'timestamps': [
        ['09:00', '10:30', '11:00'],
        ['14:00', '14:15']
    ]
})

# Expand events with their timestamps
expanded = events.explode(['events', 'timestamps'])
print(expanded)

Working with String Data¶

If your "list" is actually a string representation, you need to convert it first.

String to List Conversion¶

df = pd.DataFrame({
    'id': [1, 2],
    'tags': ['["python", "data"]', '["ml", "ai"]']  # String, not list!
})

# Check type
print(type(df['tags'].iloc[0]))  # <class 'str'>

# Convert string to list using ast.literal_eval
import ast
df['tags'] = df['tags'].apply(ast.literal_eval)

# Now explode works
exploded = df.explode('tags')
print(exploded)

Comma-Separated Strings¶

df = pd.DataFrame({
    'id': [1, 2],
    'tags': ['python,data,ml', 'javascript,react']
})

# Split string into list, then explode
df['tags'] = df['tags'].str.split(',')
exploded = df.explode('tags')
print(exploded)

Inverse Operation: Aggregating Back¶

To reverse an explode operation, use groupby with a list aggregation.

# Exploded data
exploded = pd.DataFrame({
    'name': ['Alice', 'Alice', 'Bob', 'Bob', 'Bob'],
    'hobby': ['reading', 'chess', 'hiking', 'camping', 'fishing']
})

# Aggregate back to lists
aggregated = exploded.groupby('name')['hobby'].agg(list).reset_index()
print(aggregated)

    name                        hobby
0  Alice              [reading, chess]
1    Bob  [hiking, camping, fishing]

Performance Considerations¶

Memory Impact¶

Exploding creates many new rows, which can significantly increase memory usage.

# Original: 1000 rows, average 10 items per list = 10,000 rows after explode
df = pd.DataFrame({
    'id': range(1000),
    'items': [list(range(10)) for _ in range(1000)]
})

print(f"Original rows: {len(df)}")
exploded = df.explode('items')
print(f"Exploded rows: {len(exploded)}")

Processing Large Datasets¶

For very large datasets, consider processing in chunks:

def explode_in_chunks(df, column, chunk_size=10000):
    """Explode a large DataFrame in chunks."""
    chunks = []
    for start in range(0, len(df), chunk_size):
        chunk = df.iloc[start:start + chunk_size]
        chunks.append(chunk.explode(column))
    return pd.concat(chunks, ignore_index=True)

Key Parameters¶

Parameter	Description	Default
`column`	Column(s) to explode	Required
`ignore_index`	Reset index to sequential integers	False

Common Pitfalls¶

1. Expecting Original Index After reset_index¶

exploded = df.explode('items')
# Index is duplicated (0, 0, 1, 1, 2, 2, ...)

# Use reset_index for sequential index
exploded = df.explode('items').reset_index(drop=True)
# Now index is (0, 1, 2, 3, 4, 5, ...)

2. Scalar Values Don't Explode¶

df = pd.DataFrame({
    'id': [1, 2],
    'value': [10, 20]  # Scalars, not lists
})

exploded = df.explode('value')
print(exploded)  # No change - scalars stay as is

3. Nested Lists Stay Nested¶

df = pd.DataFrame({
    'id': [1],
    'nested': [[['a', 'b'], ['c', 'd']]]  # List of lists
})

exploded = df.explode('nested')
print(exploded)

   id  nested
0   1  [a, b]
0   1  [c, d]

To fully flatten, you need multiple explode calls or a custom approach.