Accessing Panel Data¶
This document covers techniques for selecting and slicing panel data with MultiIndex.
Mental Model
Accessing panel data is indexing with two keys: entity and time. Use loc[(entity, time)] for a specific observation, loc[entity] to get all time periods for one entity, or xs(time, level='date') to get all entities at one time point. The MultiIndex makes these two-dimensional queries natural on a flat DataFrame.
Setup¶
```python import pandas as pd import numpy as np
Create sample panel¶
tickers = ['AAPL', 'MSFT', 'GOOGL'] dates = pd.date_range('2024-01-01', periods=5) index = pd.MultiIndex.from_product([tickers, dates], names=['ticker', 'date'])
np.random.seed(42) df = pd.DataFrame({ 'return': np.random.randn(15) * 0.02, 'volume': np.random.randint(1000, 10000, 15) }, index=index)
print(df) ```
Selecting a Single Entity¶
Using .loc with Label¶
```python
All data for AAPL¶
aapl_data = df.loc['AAPL'] print(aapl_data) ```
return volume
date
2024-01-01 0.009934 5765
2024-01-02 -0.002765 6274
2024-01-03 0.012936 2627
2024-01-04 0.030486 8019
2024-01-05 -0.004675 3927
Note: Result has only the date index level.
Using .xs (Cross-Section)¶
```python
Same result using xs¶
aapl_data = df.xs('AAPL', level='ticker') print(aapl_data) ```
Selecting a Single Time Period¶
All Entities at One Time¶
```python
All stocks on 2024-01-01¶
day_data = df.xs('2024-01-01', level='date') print(day_data) ```
return volume
ticker
AAPL 0.009934 5765
MSFT -0.003129 3109
GOOGL -0.018867 3046
Selecting Entity-Time Combination¶
Using Tuple¶
```python
AAPL on 2024-01-01¶
single_obs = df.loc[('AAPL', '2024-01-01')] print(single_obs) ```
return 0.009934
volume 5765.000000
Name: (AAPL, 2024-01-01), dtype: float64
Specific Column¶
```python
AAPL return on 2024-01-01¶
value = df.loc[('AAPL', '2024-01-01'), 'return'] print(value) # 0.009934... ```
Slicing Multiple Entities¶
```python
Multiple specific tickers¶
selected = df.loc[['AAPL', 'MSFT']] print(selected) ```
Slicing Time Ranges¶
For One Entity¶
```python
AAPL from Jan 2 to Jan 4¶
aapl_slice = df.loc['AAPL'].loc['2024-01-02':'2024-01-04'] print(aapl_slice) ```
For All Entities¶
```python
All stocks, Jan 2 to Jan 4¶
time_slice = df.loc[(slice(None), slice('2024-01-02', '2024-01-04')), :] print(time_slice) ```
Using pd.IndexSlice for cleaner syntax:
```python idx = pd.IndexSlice
All tickers, specific date range¶
result = df.loc[idx[:, '2024-01-02':'2024-01-04'], :] print(result) ```
Complex Selections with IndexSlice¶
```python idx = pd.IndexSlice
Specific tickers, all dates¶
result = df.loc[idx[['AAPL', 'GOOGL'], :], :] print("AAPL and GOOGL, all dates:") print(result) print()
All tickers, specific dates¶
result = df.loc[idx[:, ['2024-01-01', '2024-01-03']], :] print("All tickers, specific dates:") print(result) print()
Specific ticker, date range¶
result = df.loc[idx['MSFT', '2024-01-02':'2024-01-04'], :] print("MSFT, Jan 2-4:") print(result) ```
Boolean Selection¶
Based on Values¶
```python
High volume observations¶
high_volume = df[df['volume'] > 7000] print(high_volume) ```
Based on Index Level¶
```python
Get the ticker level values¶
tickers_in_index = df.index.get_level_values('ticker')
Filter to specific tickers¶
tech_stocks = df[tickers_in_index.isin(['AAPL', 'MSFT'])] print(tech_stocks) ```
Combined Conditions¶
```python
AAPL with positive returns¶
aapl_positive = df[(df.index.get_level_values('ticker') == 'AAPL') & (df['return'] > 0)] print(aapl_positive) ```
Accessing Index Levels¶
```python
Get all unique tickers¶
tickers = df.index.get_level_values('ticker').unique() print(f"Tickers: {tickers.tolist()}")
Get all unique dates¶
dates = df.index.get_level_values('date').unique() print(f"Dates: {dates.tolist()}")
Get index as DataFrame¶
index_df = df.index.to_frame(index=False) print(index_df.head()) ```
Reset vs Preserve Index¶
Keep MultiIndex¶
```python
Selection preserves index structure¶
subset = df.loc['AAPL'] print(f"Index: {subset.index.name}") # 'date' ```
Reset to Columns¶
```python
Convert index levels to columns¶
flat = df.reset_index() print(flat.head()) ```
ticker date return volume
0 AAPL 2024-01-01 0.009934 5765
1 AAPL 2024-01-02 -0.002765 6274
...
Query Method for Panel¶
```python
Reset index first, then query¶
flat = df.reset_index()
Query syntax¶
result = flat.query("ticker == 'AAPL' and return > 0") print(result) ```
Accessing Panel Series¶
```python
Get returns as panel Series¶
returns = df['return'] print(type(returns)) # Series with MultiIndex
Access specific return¶
aapl_jan1_return = returns[('AAPL', '2024-01-01')] print(aapl_jan1_return) ```
Summary of Access Methods¶
| Goal | Method |
|---|---|
| Single entity | df.loc['AAPL'] or df.xs('AAPL', level='ticker') |
| Single time | df.xs('2024-01-01', level='date') |
| Entity + time | df.loc[('AAPL', '2024-01-01')] |
| Multiple entities | df.loc[['AAPL', 'MSFT']] |
| Time range | df.loc[idx[:, 'start':'end'], :] |
| Complex slice | pd.IndexSlice with .loc |
| Boolean filter | df[condition] |
Exercises¶
Exercise 1. Write code that creates a dictionary of DataFrames (simulating panel data) and accesses data for a specific entity using dictionary indexing.
Solution to Exercise 1
```python import pandas as pd import numpy as np
Solution for the specific exercise¶
np.random.seed(42) df = pd.DataFrame({'A': np.random.randn(10), 'B': np.random.randn(10)}) print(df.head()) ```
Exercise 2. Explain how to use pd.concat() with keys to create a panel-like structure with MultiIndex.
Solution to Exercise 2
See the main content for the detailed explanation. The key concept involves understanding the Pandas API and its behavior for this specific operation.
Exercise 3. Write code that selects all rows for a specific time period across all entities in a MultiIndex DataFrame using .loc.
Solution to Exercise 3
```python import pandas as pd import numpy as np
np.random.seed(42) df = pd.DataFrame({'A': np.random.randn(20), 'B': np.random.randn(20)}) result = df.describe() print(result) ```
Exercise 4. Create panel data with MultiIndex and use xs() to extract a cross-section for a specific entity.
Solution to Exercise 4
```python import pandas as pd import numpy as np
np.random.seed(42) df = pd.DataFrame({'A': np.random.randn(50), 'group': np.random.choice(['X', 'Y'], 50)}) result = df.groupby('group').mean() print(result) ```