Building Panel DataFrames¶
This document covers how to construct panel data structures in pandas using MultiIndex.
Mental Model
Building a panel means creating a DataFrame with a MultiIndex of (entity, time) pairs. MultiIndex.from_product generates a balanced panel (all entities at all times); from_tuples or from_arrays handle unbalanced panels where some entity-time combinations are missing. The construction method determines whether you start with a complete grid or a sparse one.
Using MultiIndex.from_product¶
Creates a panel with all combinations of entities and time periods (balanced panel).
```python import pandas as pd import numpy as np
Define dimensions¶
tickers = ['AAPL', 'MSFT', 'GOOGL'] dates = pd.date_range('2024-01-01', periods=5)
Create MultiIndex (Cartesian product)¶
index = pd.MultiIndex.from_product( [tickers, dates], names=['ticker', 'date'] )
Create panel DataFrame¶
np.random.seed(42) n = len(index)
df = pd.DataFrame({ 'return': np.random.randn(n) * 0.02, 'volume': np.random.randint(1000, 10000, n), 'close': 100 + np.random.randn(n).cumsum() }, index=index)
print(df.head(10)) ```
return volume close
ticker date
AAPL 2024-01-01 0.009934 5765 100.496714
2024-01-02 -0.002765 6274 99.354071
2024-01-03 0.012936 2627 100.001572
2024-01-04 0.030486 8019 102.283551
2024-01-05 -0.004675 3927 101.396356
MSFT 2024-01-01 -0.003129 3109 101.542560
2024-01-02 0.015335 2239 100.533340
2024-01-03 0.000296 9596 100.070105
2024-01-04 -0.002013 8394 99.634291
2024-01-05 0.002826 8259 100.421411
Using MultiIndex.from_tuples¶
Useful when you have specific entity-time combinations (unbalanced panel).
```python
Specific combinations only¶
tuples = [ ('AAPL', '2024-01-01'), ('AAPL', '2024-01-02'), ('AAPL', '2024-01-03'), ('MSFT', '2024-01-02'), # MSFT starts later ('MSFT', '2024-01-03'), ]
index = pd.MultiIndex.from_tuples(tuples, names=['ticker', 'date'])
df = pd.DataFrame({ 'return': [0.01, 0.02, -0.01, 0.015, 0.008] }, index=index)
print(df) ```
Using MultiIndex.from_arrays¶
When you have separate arrays for each level:
```python tickers = ['AAPL', 'AAPL', 'AAPL', 'MSFT', 'MSFT', 'MSFT'] dates = ['2024-01-01', '2024-01-02', '2024-01-03'] * 2
index = pd.MultiIndex.from_arrays( [tickers, dates], names=['ticker', 'date'] )
df = pd.DataFrame({'return': [0.01, 0.02, -0.01, 0.015, 0.008, 0.012]}, index=index) print(df) ```
From Long-Format DataFrame¶
Convert a regular DataFrame to panel format:
```python
Long format data (common from databases)¶
data = pd.DataFrame({ 'ticker': ['AAPL', 'AAPL', 'MSFT', 'MSFT', 'GOOGL', 'GOOGL'], 'date': pd.to_datetime(['2024-01-01', '2024-01-02'] * 3), 'return': [0.01, 0.02, 0.015, 0.018, 0.008, 0.012], 'volume': [1000, 1100, 2000, 2100, 1500, 1600] })
print("Long format:") print(data) print()
Convert to panel (MultiIndex)¶
panel = data.set_index(['ticker', 'date']) print("Panel format:") print(panel) ```
From Wide-Format DataFrame¶
Convert wide format (tickers as columns) to panel:
```python
Wide format¶
wide = pd.DataFrame({ 'AAPL': [0.01, 0.02, -0.01], 'MSFT': [0.015, 0.018, 0.012], 'GOOGL': [0.008, 0.012, 0.009] }, index=pd.date_range('2024-01-01', periods=3)) wide.index.name = 'date' wide.columns.name = 'ticker'
print("Wide format:") print(wide) print()
Convert to panel using stack¶
panel = wide.stack() panel.name = 'return' print("Panel format (Series):") print(panel) print()
Or swap levels to have ticker first¶
panel = panel.swaplevel().sort_index() print("Panel format (ticker first):") print(panel) ```
From Real Financial Data¶
```python import yfinance as yf
Download multiple stocks¶
tickers = ['AAPL', 'MSFT', 'GOOGL'] start = '2024-01-01' end = '2024-01-31'
Download returns long-form data¶
dfs = [] for ticker in tickers: data = yf.Ticker(ticker).history(start=start, end=end) data['ticker'] = ticker data = data.reset_index()[['Date', 'ticker', 'Close', 'Volume']] dfs.append(data)
Combine and set MultiIndex¶
panel = pd.concat(dfs, ignore_index=True) panel = panel.rename(columns={'Date': 'date', 'Close': 'close', 'Volume': 'volume'}) panel = panel.set_index(['ticker', 'date']).sort_index()
print(panel.head(15)) ```
Building with Specific Structure¶
Sector-Stock-Date Hierarchy¶
```python
Three-level panel¶
sectors = ['Tech', 'Finance'] tickers = { 'Tech': ['AAPL', 'MSFT'], 'Finance': ['JPM', 'BAC'] } dates = pd.date_range('2024-01-01', periods=3)
Build tuples¶
tuples = [] for sector, stocks in tickers.items(): for stock in stocks: for date in dates: tuples.append((sector, stock, date))
index = pd.MultiIndex.from_tuples(tuples, names=['sector', 'ticker', 'date'])
df = pd.DataFrame({ 'return': np.random.randn(len(index)) * 0.02 }, index=index)
print(df) ```
Verifying Panel Structure¶
```python
Check panel dimensions¶
print(f"Index levels: {df.index.names}") print(f"Number of levels: {df.index.nlevels}") print(f"Shape: {df.shape}")
Check balance¶
print(f"\nEntities: {df.index.get_level_values('ticker').unique()}") print(f"Time periods: {df.index.get_level_values('date').unique()}")
Check for balanced panel¶
def is_balanced(df, entity_level='ticker', time_level='date'): n_entities = df.index.get_level_values(entity_level).nunique() n_times = df.index.get_level_values(time_level).nunique() expected = n_entities * n_times actual = len(df) return actual == expected
print(f"\nBalanced panel: {is_balanced(df)}") ```
Best Practices¶
- Name your index levels for clarity
- Sort the index after creation for performance
- Use appropriate dtypes (datetime for dates, category for tickers)
- Verify balance if your analysis requires it
```python
Good practice example¶
index = pd.MultiIndex.from_product( [tickers, dates], names=['ticker', 'date'] # Named levels )
df = pd.DataFrame({'return': data}, index=index) df = df.sort_index() # Sorted for performance df.index = df.index.set_levels( df.index.levels[0].astype('category'), # Categorical ticker level=0 ) ```
Exercises¶
Exercise 1. Write code that creates panel data by concatenating DataFrames for three entities with a keys parameter identifying each entity.
Solution to Exercise 1
```python import pandas as pd import numpy as np
Solution for the specific exercise¶
np.random.seed(42) df = pd.DataFrame({'A': np.random.randn(10), 'B': np.random.randn(10)}) print(df.head()) ```
Exercise 2. Explain two common formats for panel data: wide format and long format. Show how to convert between them.
Solution to Exercise 2
See the main content for the detailed explanation. The key concept involves understanding the Pandas API and its behavior for this specific operation.
Exercise 3. Write code that creates a MultiIndex DataFrame from a dictionary of DataFrames representing different entities.
Solution to Exercise 3
```python import pandas as pd import numpy as np
np.random.seed(42) df = pd.DataFrame({'A': np.random.randn(20), 'B': np.random.randn(20)}) result = df.describe() print(result) ```
Exercise 4. Create a long-format panel DataFrame with columns ['entity', 'date', 'value'] and convert it to a MultiIndex structure using set_index().
Solution to Exercise 4
```python import pandas as pd import numpy as np
np.random.seed(42) df = pd.DataFrame({'A': np.random.randn(50), 'group': np.random.choice(['X', 'Y'], 50)}) result = df.groupby('group').mean() print(result) ```