Hierarchical Indexing¶

Hierarchical indexing (also called MultiIndex) enables you to store and manipulate data with multiple levels of indexing. This is essential for working with higher-dimensional data in a 2D DataFrame structure.

Conceptual Overview¶

                        temperature  humidity
country  city                                
india    mumbai               32        70
         delhi                45        60
us       new york             21        68
         chicago              14        65
         ↑          ↑
      Level 0   Level 1

Hierarchical indexing allows you to: - Represent 3D+ data in a 2D structure - Perform grouped operations efficiently - Select data at different levels of granularity

Creating Hierarchical Indexes¶

Using pd.concat with keys¶

import pandas as pd

# Create separate DataFrames
india = pd.DataFrame({
    "city": ["mumbai", "delhi"],
    "temperature": [32, 45],
    "humidity": [70, 60],
}).set_index('city')

us = pd.DataFrame({
    "city": ["new york", "chicago"],
    "temperature": [21, 14],
    "humidity": [68, 65],
}).set_index('city')

# Concatenate with hierarchical keys
df = pd.concat([india, us], keys=["india", "us"])
print(df)

                  temperature  humidity
india mumbai              32        70
      delhi               45        60
us    new york            21        68
      chicago             14        65

Using MultiIndex.from_tuples¶

index = pd.MultiIndex.from_tuples([
    ('india', 'mumbai'),
    ('india', 'delhi'),
    ('us', 'new york'),
    ('us', 'chicago')
], names=['country', 'city'])

df = pd.DataFrame({
    'temperature': [32, 45, 21, 14],
    'humidity': [70, 60, 68, 65]
}, index=index)

Using MultiIndex.from_product¶

Creates a MultiIndex from the Cartesian product of iterables.

countries = ['india', 'us']
metrics = ['temperature', 'humidity']

index = pd.MultiIndex.from_product(
    [countries, metrics],
    names=['country', 'metric']
)
print(index)

MultiIndex([('india', 'temperature'),
            ('india', 'humidity'),
            ('us', 'temperature'),
            ('us', 'humidity')],
           names=['country', 'metric'])

Using MultiIndex.from_arrays¶

arrays = [
    ['india', 'india', 'us', 'us'],
    ['mumbai', 'delhi', 'new york', 'chicago']
]

index = pd.MultiIndex.from_arrays(arrays, names=['country', 'city'])

Selecting Data with MultiIndex¶

Using loc with Tuples (Recommended)¶

df = pd.concat([india, us], keys=["india", "us"])

# Select specific country and city
print(df.loc[("us", "new york")])

temperature    21
humidity       68
Name: (us, new york), dtype: int64

Selecting an Entire Level¶

# Select all cities in 'us'
print(df.loc["us"])

          temperature  humidity
city                           
new york           21        68
chicago            14        65

Avoid Chained Indexing¶

# NOT RECOMMENDED - Chained indexing
df.loc["us"].loc["new york"]

# RECOMMENDED - Single tuple access
df.loc[("us", "new york")]

Why avoid chained indexing? - df.loc["us"] creates an intermediate DataFrame - pandas treats these as separate operations - Can lead to unpredictable behavior with assignments - Single tuple access is faster and clearer

Hierarchical Columns¶

MultiIndex can also be applied to columns.

df = pd.DataFrame(
    [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12]],
    columns=pd.MultiIndex.from_product(
        [['one', 'two'], ['first', 'second']]
    )
)
print(df)

  one        two       
  first second first second
0     1      2     3      4
1     5      6     7      8
2     9     10    11     12

Accessing Hierarchical Columns¶

# Select top-level column group
print(df['one'])

   first  second
0      1       2
1      5       6
2      9      10

# Select specific nested column - use tuple
print(df[('one', 'second')])

0     2
1     6
2    10
Name: (one, second), dtype: int64

Avoid Chained Column Access¶

# NOT RECOMMENDED
df['one']['second']

# RECOMMENDED
df.loc[:, ('one', 'second')]

MultiIndex Operations¶

Swapping Levels¶

df = pd.concat([india, us], keys=["india", "us"])

# Swap the order of index levels
df_swapped = df.swaplevel()
print(df_swapped)

                  temperature  humidity
mumbai   india             32        70
delhi    india             45        60
new york us                21        68
chicago  us                14        65

Reordering Levels¶

For MultiIndex with more than 2 levels, use reorder_levels().

# Create 3-level MultiIndex
index = pd.MultiIndex.from_tuples([
    ('2024', 'Q1', 'Jan'),
    ('2024', 'Q1', 'Feb'),
    ('2024', 'Q2', 'Apr'),
    ('2024', 'Q2', 'May')
], names=['year', 'quarter', 'month'])

df = pd.DataFrame({'value': [10, 20, 30, 40]}, index=index)
print(df)

                        value
year quarter month           
2024 Q1      Jan          10
             Feb          20
     Q2      Apr          30
             May          40

# Reorder to: month, year, quarter
df_reordered = df.reorder_levels(['month', 'year', 'quarter'])
print(df_reordered)

                        value
month year quarter           
Jan   2024 Q1             10
Feb   2024 Q1             20
Apr   2024 Q2             30
May   2024 Q2             40

# Use integer positions
df_reordered = df.reorder_levels([2, 0, 1])  # Same result

# Sort after reordering for proper order
df_reordered = df.reorder_levels(['month', 'year', 'quarter']).sort_index()

Sorting by Index¶

# Sort by all levels
df_sorted = df.sort_index()

# Sort by specific level
df_sorted = df.sort_index(level='city')

Resetting Index¶

# Convert MultiIndex to columns
df_reset = df.reset_index()
print(df_reset)

  level_0    level_1  temperature  humidity
0   india     mumbai           32        70
1   india      delhi           45        60
2      us   new york           21        68
3      us    chicago           14        65

Setting MultiIndex from Columns¶

df_reset.set_index(['level_0', 'level_1'], inplace=True)
df_reset.index.names = ['country', 'city']

Cross-Section Selection with xs¶

The xs method provides a convenient way to select data at a particular level.

df = pd.concat([india, us], keys=["india", "us"])

# Select all rows where level 0 is 'india'
print(df.xs('india', level=0))

        temperature  humidity
city                         
mumbai           32        70
delhi            45        60

# Select all rows where city is 'delhi' (level 1)
print(df.xs('delhi', level=1))

         temperature  humidity
country                       
india             45        60

Aggregation with MultiIndex¶

GroupBy on Index Levels¶

# Aggregate by country (level 0)
print(df.groupby(level=0).mean())

         temperature  humidity
country                       
india           38.5      65.0
us              17.5      66.5

Using level parameter in aggregations¶

# Sum across level 0
print(df.sum(level=0))  # Deprecated in newer pandas

# Modern approach
print(df.groupby(level=0).sum())

Financial Example: Multi-Asset Time Series¶

import pandas as pd
import numpy as np

# Create multi-asset price data
dates = pd.date_range('2024-01-01', periods=5)
assets = ['AAPL', 'MSFT']
metrics = ['open', 'high', 'low', 'close']

# Create MultiIndex columns
columns = pd.MultiIndex.from_product([assets, metrics])

# Random price data
np.random.seed(42)
data = np.random.randn(5, 8).cumsum(axis=0) + 100

df = pd.DataFrame(data, index=dates, columns=columns)
print(df.round(2))

            AAPL                          MSFT                        
            open   high    low  close     open   high    low  close
2024-01-01 100.50 100.86 101.51 103.02  100.31  99.85  99.38 100.95
2024-01-02 101.27 100.58 101.99 103.81  100.82 100.00  99.97 101.12
...

Selecting Asset Data¶

# Get all AAPL data
aapl_data = df['AAPL']

# Get close prices for all assets
close_prices = df.xs('close', level=1, axis=1)
print(close_prices)

Index Attributes¶

names¶

df = pd.concat([india, us], keys=["india", "us"])
print(df.index.names)  # [None, 'city']

# Set names
df.index.names = ['country', 'city']

levels¶

print(df.index.levels)
# FrozenList([['india', 'us'], ['chicago', 'delhi', 'mumbai', 'new york']])

nlevels¶

print(df.index.nlevels)  # 2

Best Practices¶

Use tuple indexing instead of chained indexing for clarity and performance
Name your index levels for self-documenting code
Sort the index after creation for better performance with loc
Consider alternatives - sometimes separate columns are clearer than MultiIndex
Use xs() for cleaner cross-section selection

# Good: Clear, named, sorted MultiIndex
df.index.names = ['country', 'city']
df = df.sort_index()
result = df.loc[('us', 'chicago'), 'temperature']

# Avoid: Unnamed, unsorted, chained access
result = df.loc['us'].loc['chicago']['temperature']

Runnable Example: `multiindex_tutorial.py`¶

"""
Pandas Tutorial: Multi-Index (Hierarchical Indexing).

Covers creating and working with multi-level indices.
"""

import pandas as pd
import numpy as np

# =============================================================================
# Main
# =============================================================================

if __name__ == "__main__":

    print("="*70)
    print("MULTI-INDEX (HIERARCHICAL INDEXING)")
    print("="*70)

    # Create multi-index DataFrame
    np.random.seed(42)
    index = pd.MultiIndex.from_product([
        ['Store1', 'Store2', 'Store3'],
        ['Product A', 'Product B']
    ], names=['Store', 'Product'])

    df = pd.DataFrame({
        'Sales': np.random.randint(100, 1000, 6),
        'Quantity': np.random.randint(10, 100, 6)
    }, index=index)

    print("\nMulti-Index DataFrame:")
    print(df)

    # Selecting with multi-index
    print("\n1. Select by outer index (Store1):")
    print(df.loc['Store1'])

    print("\n2. Select by both indices (Store1, Product A):")
    print(df.loc[('Store1', 'Product A')])

    print("\n3. Select using slice:")
    print(df.loc[('Store1', slice(None)), :])

    # Stack/Unstack
    print("\n4. Unstack (inner index to columns):")
    unstacked = df.unstack()
    print(unstacked)

    print("\n5. Stack back:")
    stacked = unstacked.stack()
    print(stacked)

    # Swap levels
    print("\n6. Swap index levels:")
    swapped = df.swaplevel()
    print(swapped)

    # Sort by index
    print("\n7. Sort by index:")
    sorted_df = swapped.sort_index()
    print(sorted_df)

    # Reset index
    print("\n8. Reset multi-index to columns:")
    reset = df.reset_index()
    print(reset)

    # Set multi-index from columns
    print("\n9. Create multi-index from columns:")
    df_flat = pd.DataFrame({
        'Store': ['A', 'A', 'B', 'B'],
        'Product': ['X', 'Y', 'X', 'Y'],
        'Sales': [100, 200, 150, 250]
    })
    print("Flat DataFrame:")
    print(df_flat)

    df_multi = df_flat.set_index(['Store', 'Product'])
    print("\nWith Multi-Index:")
    print(df_multi)

    # Aggregation with multi-index
    print("\n10. Aggregation by level:")
    print("Sum by Store:")
    print(df.sum(level='Store'))

    print("\nMean by Product:")
    print(df.mean(level='Product'))

    print("\nKEY TAKEAWAYS:")
    print("- MultiIndex: Hierarchical row/column indices")
    print("- Create with from_product(), from_tuples(), from_arrays()")
    print("- Select with loc[] using tuples")
    print("- unstack(): Move index level to columns")
    print("- stack(): Move column level to index")
    print("- swaplevel(): Swap index levels")
    print("- Aggregate by specific levels")