Hierarchical Indexing¶
Hierarchical indexing (also called MultiIndex) enables you to store and manipulate data with multiple levels of indexing. This is essential for working with higher-dimensional data in a 2D DataFrame structure.
Conceptual Overview¶
temperature humidity
country city
india mumbai 32 70
delhi 45 60
us new york 21 68
chicago 14 65
↑ ↑
Level 0 Level 1
Hierarchical indexing allows you to: - Represent 3D+ data in a 2D structure - Perform grouped operations efficiently - Select data at different levels of granularity
Creating Hierarchical Indexes¶
Using pd.concat with keys¶
import pandas as pd
# Create separate DataFrames
india = pd.DataFrame({
"city": ["mumbai", "delhi"],
"temperature": [32, 45],
"humidity": [70, 60],
}).set_index('city')
us = pd.DataFrame({
"city": ["new york", "chicago"],
"temperature": [21, 14],
"humidity": [68, 65],
}).set_index('city')
# Concatenate with hierarchical keys
df = pd.concat([india, us], keys=["india", "us"])
print(df)
temperature humidity
india mumbai 32 70
delhi 45 60
us new york 21 68
chicago 14 65
Using MultiIndex.from_tuples¶
index = pd.MultiIndex.from_tuples([
('india', 'mumbai'),
('india', 'delhi'),
('us', 'new york'),
('us', 'chicago')
], names=['country', 'city'])
df = pd.DataFrame({
'temperature': [32, 45, 21, 14],
'humidity': [70, 60, 68, 65]
}, index=index)
Using MultiIndex.from_product¶
Creates a MultiIndex from the Cartesian product of iterables.
countries = ['india', 'us']
metrics = ['temperature', 'humidity']
index = pd.MultiIndex.from_product(
[countries, metrics],
names=['country', 'metric']
)
print(index)
MultiIndex([('india', 'temperature'),
('india', 'humidity'),
('us', 'temperature'),
('us', 'humidity')],
names=['country', 'metric'])
Using MultiIndex.from_arrays¶
arrays = [
['india', 'india', 'us', 'us'],
['mumbai', 'delhi', 'new york', 'chicago']
]
index = pd.MultiIndex.from_arrays(arrays, names=['country', 'city'])
Selecting Data with MultiIndex¶
Using loc with Tuples (Recommended)¶
df = pd.concat([india, us], keys=["india", "us"])
# Select specific country and city
print(df.loc[("us", "new york")])
temperature 21
humidity 68
Name: (us, new york), dtype: int64
Selecting an Entire Level¶
# Select all cities in 'us'
print(df.loc["us"])
temperature humidity
city
new york 21 68
chicago 14 65
Avoid Chained Indexing¶
# NOT RECOMMENDED - Chained indexing
df.loc["us"].loc["new york"]
# RECOMMENDED - Single tuple access
df.loc[("us", "new york")]
Why avoid chained indexing?
- df.loc["us"] creates an intermediate DataFrame
- pandas treats these as separate operations
- Can lead to unpredictable behavior with assignments
- Single tuple access is faster and clearer
Hierarchical Columns¶
MultiIndex can also be applied to columns.
df = pd.DataFrame(
[[1, 2, 3, 4],
[5, 6, 7, 8],
[9, 10, 11, 12]],
columns=pd.MultiIndex.from_product(
[['one', 'two'], ['first', 'second']]
)
)
print(df)
one two
first second first second
0 1 2 3 4
1 5 6 7 8
2 9 10 11 12
Accessing Hierarchical Columns¶
# Select top-level column group
print(df['one'])
first second
0 1 2
1 5 6
2 9 10
# Select specific nested column - use tuple
print(df[('one', 'second')])
0 2
1 6
2 10
Name: (one, second), dtype: int64
Avoid Chained Column Access¶
# NOT RECOMMENDED
df['one']['second']
# RECOMMENDED
df.loc[:, ('one', 'second')]
MultiIndex Operations¶
Swapping Levels¶
df = pd.concat([india, us], keys=["india", "us"])
# Swap the order of index levels
df_swapped = df.swaplevel()
print(df_swapped)
temperature humidity
mumbai india 32 70
delhi india 45 60
new york us 21 68
chicago us 14 65
Reordering Levels¶
For MultiIndex with more than 2 levels, use reorder_levels().
# Create 3-level MultiIndex
index = pd.MultiIndex.from_tuples([
('2024', 'Q1', 'Jan'),
('2024', 'Q1', 'Feb'),
('2024', 'Q2', 'Apr'),
('2024', 'Q2', 'May')
], names=['year', 'quarter', 'month'])
df = pd.DataFrame({'value': [10, 20, 30, 40]}, index=index)
print(df)
value
year quarter month
2024 Q1 Jan 10
Feb 20
Q2 Apr 30
May 40
# Reorder to: month, year, quarter
df_reordered = df.reorder_levels(['month', 'year', 'quarter'])
print(df_reordered)
value
month year quarter
Jan 2024 Q1 10
Feb 2024 Q1 20
Apr 2024 Q2 30
May 2024 Q2 40
# Use integer positions
df_reordered = df.reorder_levels([2, 0, 1]) # Same result
# Sort after reordering for proper order
df_reordered = df.reorder_levels(['month', 'year', 'quarter']).sort_index()
Sorting by Index¶
# Sort by all levels
df_sorted = df.sort_index()
# Sort by specific level
df_sorted = df.sort_index(level='city')
Resetting Index¶
# Convert MultiIndex to columns
df_reset = df.reset_index()
print(df_reset)
level_0 level_1 temperature humidity
0 india mumbai 32 70
1 india delhi 45 60
2 us new york 21 68
3 us chicago 14 65
Setting MultiIndex from Columns¶
df_reset.set_index(['level_0', 'level_1'], inplace=True)
df_reset.index.names = ['country', 'city']
Cross-Section Selection with xs¶
The xs method provides a convenient way to select data at a particular level.
df = pd.concat([india, us], keys=["india", "us"])
# Select all rows where level 0 is 'india'
print(df.xs('india', level=0))
temperature humidity
city
mumbai 32 70
delhi 45 60
# Select all rows where city is 'delhi' (level 1)
print(df.xs('delhi', level=1))
temperature humidity
country
india 45 60
Aggregation with MultiIndex¶
GroupBy on Index Levels¶
# Aggregate by country (level 0)
print(df.groupby(level=0).mean())
temperature humidity
country
india 38.5 65.0
us 17.5 66.5
Using level parameter in aggregations¶
# Sum across level 0
print(df.sum(level=0)) # Deprecated in newer pandas
# Modern approach
print(df.groupby(level=0).sum())
Financial Example: Multi-Asset Time Series¶
import pandas as pd
import numpy as np
# Create multi-asset price data
dates = pd.date_range('2024-01-01', periods=5)
assets = ['AAPL', 'MSFT']
metrics = ['open', 'high', 'low', 'close']
# Create MultiIndex columns
columns = pd.MultiIndex.from_product([assets, metrics])
# Random price data
np.random.seed(42)
data = np.random.randn(5, 8).cumsum(axis=0) + 100
df = pd.DataFrame(data, index=dates, columns=columns)
print(df.round(2))
AAPL MSFT
open high low close open high low close
2024-01-01 100.50 100.86 101.51 103.02 100.31 99.85 99.38 100.95
2024-01-02 101.27 100.58 101.99 103.81 100.82 100.00 99.97 101.12
...
Selecting Asset Data¶
# Get all AAPL data
aapl_data = df['AAPL']
# Get close prices for all assets
close_prices = df.xs('close', level=1, axis=1)
print(close_prices)
Index Attributes¶
names¶
df = pd.concat([india, us], keys=["india", "us"])
print(df.index.names) # [None, 'city']
# Set names
df.index.names = ['country', 'city']
levels¶
print(df.index.levels)
# FrozenList([['india', 'us'], ['chicago', 'delhi', 'mumbai', 'new york']])
nlevels¶
print(df.index.nlevels) # 2
Best Practices¶
- Use tuple indexing instead of chained indexing for clarity and performance
- Name your index levels for self-documenting code
- Sort the index after creation for better performance with
loc - Consider alternatives - sometimes separate columns are clearer than MultiIndex
- Use xs() for cleaner cross-section selection
# Good: Clear, named, sorted MultiIndex
df.index.names = ['country', 'city']
df = df.sort_index()
result = df.loc[('us', 'chicago'), 'temperature']
# Avoid: Unnamed, unsorted, chained access
result = df.loc['us'].loc['chicago']['temperature']
Runnable Example: multiindex_tutorial.py¶
"""
Pandas Tutorial: Multi-Index (Hierarchical Indexing).
Covers creating and working with multi-level indices.
"""
import pandas as pd
import numpy as np
# =============================================================================
# Main
# =============================================================================
if __name__ == "__main__":
print("="*70)
print("MULTI-INDEX (HIERARCHICAL INDEXING)")
print("="*70)
# Create multi-index DataFrame
np.random.seed(42)
index = pd.MultiIndex.from_product([
['Store1', 'Store2', 'Store3'],
['Product A', 'Product B']
], names=['Store', 'Product'])
df = pd.DataFrame({
'Sales': np.random.randint(100, 1000, 6),
'Quantity': np.random.randint(10, 100, 6)
}, index=index)
print("\nMulti-Index DataFrame:")
print(df)
# Selecting with multi-index
print("\n1. Select by outer index (Store1):")
print(df.loc['Store1'])
print("\n2. Select by both indices (Store1, Product A):")
print(df.loc[('Store1', 'Product A')])
print("\n3. Select using slice:")
print(df.loc[('Store1', slice(None)), :])
# Stack/Unstack
print("\n4. Unstack (inner index to columns):")
unstacked = df.unstack()
print(unstacked)
print("\n5. Stack back:")
stacked = unstacked.stack()
print(stacked)
# Swap levels
print("\n6. Swap index levels:")
swapped = df.swaplevel()
print(swapped)
# Sort by index
print("\n7. Sort by index:")
sorted_df = swapped.sort_index()
print(sorted_df)
# Reset index
print("\n8. Reset multi-index to columns:")
reset = df.reset_index()
print(reset)
# Set multi-index from columns
print("\n9. Create multi-index from columns:")
df_flat = pd.DataFrame({
'Store': ['A', 'A', 'B', 'B'],
'Product': ['X', 'Y', 'X', 'Y'],
'Sales': [100, 200, 150, 250]
})
print("Flat DataFrame:")
print(df_flat)
df_multi = df_flat.set_index(['Store', 'Product'])
print("\nWith Multi-Index:")
print(df_multi)
# Aggregation with multi-index
print("\n10. Aggregation by level:")
print("Sum by Store:")
print(df.sum(level='Store'))
print("\nMean by Product:")
print(df.mean(level='Product'))
print("\nKEY TAKEAWAYS:")
print("- MultiIndex: Hierarchical row/column indices")
print("- Create with from_product(), from_tuples(), from_arrays()")
print("- Select with loc[] using tuples")
print("- unstack(): Move index level to columns")
print("- stack(): Move column level to index")
print("- swaplevel(): Swap index levels")
print("- Aggregate by specific levels")