Hierarchical Indexing¶

Hierarchical indexing (also called MultiIndex) enables you to store and manipulate data with multiple levels of indexing. This is essential for working with higher-dimensional data in a 2D DataFrame structure.

Mental Model

A MultiIndex is a tree of labels on the row (or column) axis. Instead of a flat list like [0, 1, 2], you get nested levels like (country, city). This lets a 2D DataFrame represent 3D or higher-dimensional data by encoding extra dimensions into the index, without needing separate tables.

Conceptual Overview¶

temperature humidity country city india mumbai 32 70 delhi 45 60 us new york 21 68 chicago 14 65 ↑ ↑ Level 0 Level 1

Hierarchical indexing allows you to:

Represent 3D+ data in a 2D structure
Perform grouped operations efficiently
Select data at different levels of granularity

Creating Hierarchical Indexes¶

Using pd.concat with keys¶

```python import pandas as pd

Create separate DataFrames¶

india = pd.DataFrame({ "city": ["mumbai", "delhi"], "temperature": [32, 45], "humidity": [70, 60], }).set_index('city')

us = pd.DataFrame({ "city": ["new york", "chicago"], "temperature": [21, 14], "humidity": [68, 65], }).set_index('city')

Concatenate with hierarchical keys¶

df = pd.concat([india, us], keys=["india", "us"]) print(df) ```

temperature humidity india mumbai 32 70 delhi 45 60 us new york 21 68 chicago 14 65

Using MultiIndex.from_tuples¶

```python index = pd.MultiIndex.from_tuples([ ('india', 'mumbai'), ('india', 'delhi'), ('us', 'new york'), ('us', 'chicago') ], names=['country', 'city'])

df = pd.DataFrame({ 'temperature': [32, 45, 21, 14], 'humidity': [70, 60, 68, 65] }, index=index) ```

Using MultiIndex.from_product¶

Creates a MultiIndex from the Cartesian product of iterables.

```python countries = ['india', 'us'] metrics = ['temperature', 'humidity']

index = pd.MultiIndex.from_product( [countries, metrics], names=['country', 'metric'] ) print(index) ```

MultiIndex([('india', 'temperature'), ('india', 'humidity'), ('us', 'temperature'), ('us', 'humidity')], names=['country', 'metric'])

Using MultiIndex.from_arrays¶

```python arrays = [ ['india', 'india', 'us', 'us'], ['mumbai', 'delhi', 'new york', 'chicago'] ]

index = pd.MultiIndex.from_arrays(arrays, names=['country', 'city']) ```

Selecting Data with MultiIndex¶

Using loc with Tuples (Recommended)¶

```python df = pd.concat([india, us], keys=["india", "us"])

Select specific country and city¶

print(df.loc[("us", "new york")]) ```

temperature 21 humidity 68 Name: (us, new york), dtype: int64

Selecting an Entire Level¶

```python

Select all cities in 'us'¶

print(df.loc["us"]) ```

temperature humidity city new york 21 68 chicago 14 65

Avoid Chained Indexing¶

```python

NOT RECOMMENDED - Chained indexing¶

df.loc["us"].loc["new york"]

RECOMMENDED - Single tuple access¶

df.loc[("us", "new york")] ```

Why avoid chained indexing?

df.loc["us"] creates an intermediate DataFrame
pandas treats these as separate operations
Can lead to unpredictable behavior with assignments
Single tuple access is faster and clearer

Hierarchical Columns¶

MultiIndex can also be applied to columns.

python df = pd.DataFrame( [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]], columns=pd.MultiIndex.from_product( [['one', 'two'], ['first', 'second']] ) ) print(df)

one two first second first second 0 1 2 3 4 1 5 6 7 8 2 9 10 11 12

Accessing Hierarchical Columns¶

```python

Select top-level column group¶

print(df['one']) ```

first second 0 1 2 1 5 6 2 9 10

```python

Select specific nested column - use tuple¶

print(df[('one', 'second')]) ```

0 2 1 6 2 10 Name: (one, second), dtype: int64

Avoid Chained Column Access¶

```python

NOT RECOMMENDED¶

df['one']['second']

RECOMMENDED¶

df.loc[:, ('one', 'second')] ```

MultiIndex Operations¶

Swapping Levels¶

```python df = pd.concat([india, us], keys=["india", "us"])

Swap the order of index levels¶

df_swapped = df.swaplevel() print(df_swapped) ```

temperature humidity mumbai india 32 70 delhi india 45 60 new york us 21 68 chicago us 14 65

Reordering Levels¶

For MultiIndex with more than 2 levels, use reorder_levels().

```python

Create 3-level MultiIndex¶

index = pd.MultiIndex.from_tuples([ ('2024', 'Q1', 'Jan'), ('2024', 'Q1', 'Feb'), ('2024', 'Q2', 'Apr'), ('2024', 'Q2', 'May') ], names=['year', 'quarter', 'month'])

df = pd.DataFrame({'value': [10, 20, 30, 40]}, index=index) print(df) ```

value year quarter month 2024 Q1 Jan 10 Feb 20 Q2 Apr 30 May 40

```python

Reorder to: month, year, quarter¶

df_reordered = df.reorder_levels(['month', 'year', 'quarter']) print(df_reordered) ```

value month year quarter Jan 2024 Q1 10 Feb 2024 Q1 20 Apr 2024 Q2 30 May 2024 Q2 40

```python

Use integer positions¶

df_reordered = df.reorder_levels([2, 0, 1]) # Same result

Sort after reordering for proper order¶

df_reordered = df.reorder_levels(['month', 'year', 'quarter']).sort_index() ```

Sorting by Index¶

```python

Sort by all levels¶

df_sorted = df.sort_index()

Sort by specific level¶

df_sorted = df.sort_index(level='city') ```

Resetting Index¶

```python

Convert MultiIndex to columns¶

df_reset = df.reset_index() print(df_reset) ```

level_0 level_1 temperature humidity 0 india mumbai 32 70 1 india delhi 45 60 2 us new york 21 68 3 us chicago 14 65

Setting MultiIndex from Columns¶

python df_reset.set_index(['level_0', 'level_1'], inplace=True) df_reset.index.names = ['country', 'city']

Cross-Section Selection with xs¶

The xs method provides a convenient way to select data at a particular level.

```python df = pd.concat([india, us], keys=["india", "us"])

Select all rows where level 0 is 'india'¶

print(df.xs('india', level=0)) ```

temperature humidity city mumbai 32 70 delhi 45 60

```python

Select all rows where city is 'delhi' (level 1)¶

print(df.xs('delhi', level=1)) ```

temperature humidity country india 45 60

Aggregation with MultiIndex¶

GroupBy on Index Levels¶

```python

Aggregate by country (level 0)¶

print(df.groupby(level=0).mean()) ```

temperature humidity country india 38.5 65.0 us 17.5 66.5

Using level parameter in aggregations¶

```python

Sum across level 0¶

print(df.sum(level=0)) # Deprecated in newer pandas

Modern approach¶

print(df.groupby(level=0).sum()) ```

Financial Example: Multi-Asset Time Series¶

```python import pandas as pd import numpy as np

Create multi-asset price data¶

dates = pd.date_range('2024-01-01', periods=5) assets = ['AAPL', 'MSFT'] metrics = ['open', 'high', 'low', 'close']

Create MultiIndex columns¶

columns = pd.MultiIndex.from_product([assets, metrics])

Random price data¶

np.random.seed(42) data = np.random.randn(5, 8).cumsum(axis=0) + 100

df = pd.DataFrame(data, index=dates, columns=columns) print(df.round(2)) ```

AAPL MSFT open high low close open high low close 2024-01-01 100.50 100.86 101.51 103.02 100.31 99.85 99.38 100.95 2024-01-02 101.27 100.58 101.99 103.81 100.82 100.00 99.97 101.12 ...

Selecting Asset Data¶

```python

Get all AAPL data¶

aapl_data = df['AAPL']

Get close prices for all assets¶

close_prices = df.xs('close', level=1, axis=1) print(close_prices) ```

Index Attributes¶

names¶

```python df = pd.concat([india, us], keys=["india", "us"]) print(df.index.names) # [None, 'city']

Set names¶

df.index.names = ['country', 'city'] ```

levels¶

```python print(df.index.levels)

FrozenList([['india', 'us'], ['chicago', 'delhi', 'mumbai', 'new york']])¶

```

nlevels¶

python print(df.index.nlevels) # 2

Best Practices¶

Use tuple indexing instead of chained indexing for clarity and performance
Name your index levels for self-documenting code
Sort the index after creation for better performance with loc
Consider alternatives - sometimes separate columns are clearer than MultiIndex
Use xs() for cleaner cross-section selection

```python

Good: Clear, named, sorted MultiIndex¶

df.index.names = ['country', 'city'] df = df.sort_index() result = df.loc[('us', 'chicago'), 'temperature']

Avoid: Unnamed, unsorted, chained access¶

result = df.loc['us'].loc['chicago']['temperature'] ```

Runnable Example: `multiindex_tutorial.py`¶

```python """ Pandas Tutorial: Multi-Index (Hierarchical Indexing).

Covers creating and working with multi-level indices. """

import pandas as pd import numpy as np

=============================================================================¶

Main¶

=============================================================================¶

if name == "main":

print("="*70)
print("MULTI-INDEX (HIERARCHICAL INDEXING)")
print("="*70)

# Create multi-index DataFrame
np.random.seed(42)
index = pd.MultiIndex.from_product([
    ['Store1', 'Store2', 'Store3'],
    ['Product A', 'Product B']
], names=['Store', 'Product'])

df = pd.DataFrame({
    'Sales': np.random.randint(100, 1000, 6),
    'Quantity': np.random.randint(10, 100, 6)
}, index=index)

print("\nMulti-Index DataFrame:")
print(df)

# Selecting with multi-index
print("\n1. Select by outer index (Store1):")
print(df.loc['Store1'])

print("\n2. Select by both indices (Store1, Product A):")
print(df.loc[('Store1', 'Product A')])

print("\n3. Select using slice:")
print(df.loc[('Store1', slice(None)), :])

# Stack/Unstack
print("\n4. Unstack (inner index to columns):")
unstacked = df.unstack()
print(unstacked)

print("\n5. Stack back:")
stacked = unstacked.stack()
print(stacked)

# Swap levels
print("\n6. Swap index levels:")
swapped = df.swaplevel()
print(swapped)

# Sort by index
print("\n7. Sort by index:")
sorted_df = swapped.sort_index()
print(sorted_df)

# Reset index
print("\n8. Reset multi-index to columns:")
reset = df.reset_index()
print(reset)

# Set multi-index from columns
print("\n9. Create multi-index from columns:")
df_flat = pd.DataFrame({
    'Store': ['A', 'A', 'B', 'B'],
    'Product': ['X', 'Y', 'X', 'Y'],
    'Sales': [100, 200, 150, 250]
})
print("Flat DataFrame:")
print(df_flat)

df_multi = df_flat.set_index(['Store', 'Product'])
print("\nWith Multi-Index:")
print(df_multi)

# Aggregation with multi-index
print("\n10. Aggregation by level:")
print("Sum by Store:")
print(df.sum(level='Store'))

print("\nMean by Product:")
print(df.mean(level='Product'))

print("\nKEY TAKEAWAYS:")
print("- MultiIndex: Hierarchical row/column indices")
print("- Create with from_product(), from_tuples(), from_arrays()")
print("- Select with loc[] using tuples")
print("- unstack(): Move index level to columns")
print("- stack(): Move column level to index")
print("- swaplevel(): Swap index levels")
print("- Aggregate by specific levels")

```

Exercises¶

Exercise 1. Create a DataFrame with a MultiIndex from tuples (e.g., ('US', 'NYC'), ('US', 'LA'), ('UK', 'London')). Use .loc[] to select all rows for a specific level-0 value (e.g., 'US').

Solution to Exercise 1

Select from a MultiIndex using loc.

import pandas as pd

idx = pd.MultiIndex.from_tuples([
    ('US', 'NYC'), ('US', 'LA'), ('UK', 'London'), ('UK', 'Manchester')
], names=['country', 'city'])
df = pd.DataFrame({'pop': [8, 4, 9, 3]}, index=idx)
print(df.loc['US'])

Exercise 2. Create a MultiIndex DataFrame and use .xs() (cross-section) to select data at a specific level. Compare .xs('US', level=0) with .loc['US'].

Solution to Exercise 2

Use xs for cross-section selection.

import pandas as pd

idx = pd.MultiIndex.from_tuples([
    ('US', 'NYC'), ('US', 'LA'), ('UK', 'London')
], names=['country', 'city'])
df = pd.DataFrame({'val': [100, 200, 300]}, index=idx)
print(df.xs('US', level='country'))
print(df.loc['US'])

Exercise 3. Create a DataFrame with a MultiIndex. Use .unstack() to pivot one level of the index into columns, then .stack() to reverse the operation. Verify you get back the original structure.

Solution to Exercise 3

Stack and unstack a MultiIndex DataFrame.

import pandas as pd

idx = pd.MultiIndex.from_tuples([
    ('A', 'x'), ('A', 'y'), ('B', 'x'), ('B', 'y')
], names=['group', 'sub'])
df = pd.DataFrame({'val': [1, 2, 3, 4]}, index=idx)
unstacked = df.unstack()
print("Unstacked:\n", unstacked)
restacked = unstacked.stack()
print("\nRestacked:\n", restacked)

Hierarchical Indexing¶

Conceptual Overview¶

Creating Hierarchical Indexes¶

Using pd.concat with keys¶

Create separate DataFrames¶

Concatenate with hierarchical keys¶

Using MultiIndex.from_tuples¶

Using MultiIndex.from_product¶

Using MultiIndex.from_arrays¶

Selecting Data with MultiIndex¶

Using loc with Tuples (Recommended)¶

Select specific country and city¶

Selecting an Entire Level¶

Select all cities in 'us'¶

Avoid Chained Indexing¶

NOT RECOMMENDED - Chained indexing¶

RECOMMENDED - Single tuple access¶

Hierarchical Columns¶

Accessing Hierarchical Columns¶

Select top-level column group¶

Select specific nested column - use tuple¶

Avoid Chained Column Access¶

NOT RECOMMENDED¶

RECOMMENDED¶

MultiIndex Operations¶

Swapping Levels¶

Swap the order of index levels¶

Reordering Levels¶

Create 3-level MultiIndex¶

Reorder to: month, year, quarter¶

Use integer positions¶

Sort after reordering for proper order¶

Sorting by Index¶

Sort by all levels¶

Sort by specific level¶

Resetting Index¶

Convert MultiIndex to columns¶

Setting MultiIndex from Columns¶

Cross-Section Selection with xs¶

Select all rows where level 0 is 'india'¶

Select all rows where city is 'delhi' (level 1)¶

Aggregation with MultiIndex¶

GroupBy on Index Levels¶

Aggregate by country (level 0)¶

Using level parameter in aggregations¶

Sum across level 0¶

Modern approach¶

Financial Example: Multi-Asset Time Series¶

Create multi-asset price data¶

Create MultiIndex columns¶

Random price data¶

Selecting Asset Data¶

Get all AAPL data¶

Get close prices for all assets¶

Index Attributes¶

names¶

Set names¶

levels¶

FrozenList([['india', 'us'], ['chicago', 'delhi', 'mumbai', 'new york']])¶

nlevels¶

Best Practices¶

Good: Clear, named, sorted MultiIndex¶

Avoid: Unnamed, unsorted, chained access¶

Runnable Example: multiindex_tutorial.py¶

=============================================================================¶

Main¶

=============================================================================¶

Exercises¶

Runnable Example: `multiindex_tutorial.py`¶