Introduction to Panel Data
Panel data (also called longitudinal data) combines cross-sectional and time-series dimensions. Each observation is identified by two keys: an entity (individual, firm, country) and a time point.
Mental Model
Panel data is a 3D cube flattened into a 2D table. The three dimensions are entity, time, and variable. In pandas, entity and time go into a MultiIndex and variables become columns. This lets you ask both "how does entity X change over time?" and "how do all entities compare at time T?" from a single DataFrame.
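The two questions above can be sketched as follows. This is a minimal illustration, assuming a tiny two-ticker panel with a single `price` column; the variable names (`panel`, `aapl_history`, `snapshot`) are hypothetical.

```python
import pandas as pd

# Illustrative panel: two entities observed on two dates
index = pd.MultiIndex.from_product(
    [['AAPL', 'MSFT'], pd.to_datetime(['2024-01-01', '2024-01-02'])],
    names=['ticker', 'date'],
)
panel = pd.DataFrame({'price': [150.00, 151.50, 300.00, 302.00]}, index=index)

# "How does entity X change over time?" -> select one entity
aapl_history = panel.loc['AAPL']

# "How do all entities compare at time T?" -> cross-section at one date
snapshot = panel.xs(pd.Timestamp('2024-01-01'), level='date')

print(aapl_history)
print(snapshot)
```

Selecting on the first index level returns a frame indexed by date, while `xs` on the `date` level returns a frame indexed by ticker.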
What is Panel Data?
Panel data has two dimensions:
- Cross-sectional: Different entities (stocks, companies, individuals)
- Time-series: Repeated observations over time
| Entity (i) | Time (t) | Value |
|---|---|---|
| AAPL | 2024-01-01 | \$150.00 |
| AAPL | 2024-01-02 | \$151.50 |
| AAPL | 2024-01-03 | \$149.80 |
| MSFT | 2024-01-01 | \$300.00 |
| MSFT | 2024-01-02 | \$302.00 |
| MSFT | 2024-01-03 | \$301.50 |
| GOOGL | 2024-01-01 | \$140.00 |
| GOOGL | 2024-01-02 | \$141.20 |
| GOOGL | 2024-01-03 | \$142.00 |
Panel Data vs Other Data Types
| Data Type | Entities | Time Points | Example |
|---|---|---|---|
| Cross-sectional | Multiple | Single | Survey at one point |
| Time-series | Single | Multiple | One stock's history |
| Panel | Multiple | Multiple | Multiple stocks over time |
Why Use Panel Data?
1. More Information
A panel contains more observations than a single cross-section or a single time series drawn from the same entities and periods:
Cross-sectional: 100 stocks × 1 day = 100 observations
Time-series: 1 stock × 252 days = 252 observations
Panel: 100 stocks × 252 days = 25,200 observations
2. Control for Unobserved Heterogeneity
Panel data allows fixed effects models that absorb unobserved factors which are constant along one dimension:
- Entity fixed effects: stock-specific characteristics that are stable over time (management quality, brand value)
- Time fixed effects: shocks common to all entities in a period (market conditions, regulations)
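One way to see how entity fixed effects work is the within transformation: subtracting each entity's own mean removes anything that is constant within that entity. The following is a minimal sketch on simulated data, assuming hypothetical entity intercepts (`alpha`) and a true slope of 2.0 with no noise. It illustrates the idea only and is not a full fixed effects estimator.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
index = pd.MultiIndex.from_product(
    [['A', 'B', 'C'], range(5)], names=['entity', 'time']
)
df = pd.DataFrame({'x': rng.normal(size=15)}, index=index)

# Hypothetical entity-specific intercepts that never change over time
alpha = pd.Series({'A': 1.0, 'B': -2.0, 'C': 0.5})
entity_effect = df.index.get_level_values('entity').map(alpha).to_numpy()
df['y'] = 2.0 * df['x'] + entity_effect

# Within transformation: subtract each entity's mean from its own rows
demeaned = df - df.groupby(level='entity').transform('mean')

# OLS slope on the demeaned data; the entity intercepts have dropped out
slope = (demeaned['x'] * demeaned['y']).sum() / (demeaned['x'] ** 2).sum()
print(slope)
```

Because the intercepts are constant within each entity, demeaning removes them and the slope on the demeaned data recovers the simulated coefficient.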
3. Study Dynamics
Track how each entity changes over time, for example by computing lagged values or period-over-period changes within each entity.
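A short sketch of within-entity dynamics, assuming a price panel indexed by `ticker` and `date` (the names `prices`, `lagged`, and `returns` are illustrative). Grouping by entity before shifting keeps one entity's last observation from leaking into the next entity's first.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['AAPL', 'MSFT'],
     pd.to_datetime(['2024-01-01', '2024-01-02', '2024-01-03'])],
    names=['ticker', 'date'],
)
prices = pd.Series(
    [150.00, 151.50, 149.80, 300.00, 302.00, 301.50],
    index=index, name='price',
)

# Previous observation within each ticker (NaN on each ticker's first date)
lagged = prices.groupby(level='ticker').shift(1)

# Simple period-over-period return per ticker
returns = prices / lagged - 1
print(returns)
```

A plain `prices.shift(1)` without the `groupby` would pair MSFT's first price with AAPL's last one, which is rarely what is intended.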
Examples of Panel Data
| Domain | Entities | Time | Variables |
|---|---|---|---|
| Finance | Stocks | Days | Returns, Volume |
| Economics | Countries | Years | GDP, Inflation |
| Healthcare | Patients | Visits | Vitals, Tests |
| Education | Students | Grades | Scores, Attendance |
Balanced vs Unbalanced Panels
Balanced Panel
Every entity has an observation for every time period.
Unbalanced Panel
Some entity-time combinations are missing (common in real data).
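As a rough illustration, an unbalanced panel can be detected and padded to a balanced one by reindexing against the full entity-time grid. The sketch below assumes a small made-up panel with no duplicate entity-time pairs; `full_index` and `balanced` are hypothetical names.

```python
import pandas as pd

# Unbalanced panel: entity B has no observation for period 2
df = pd.DataFrame(
    {'value': [1.0, 2.0, 3.0, 4.0, 5.0]},
    index=pd.MultiIndex.from_tuples(
        [('A', 1), ('A', 2), ('A', 3), ('B', 1), ('B', 3)],
        names=['entity', 'time'],
    ),
)

# The full grid holds every entity-time combination
entities = df.index.get_level_values('entity').unique()
times = df.index.get_level_values('time').unique()
full_index = pd.MultiIndex.from_product(
    [entities, times], names=['entity', 'time']
)

# Balanced only if every grid cell is present (assumes unique index pairs)
is_balanced = len(df) == len(full_index)

# Reindexing inserts the missing combinations as NaN rows
balanced = df.reindex(full_index)
print(is_balanced)
print(balanced)
```

Whether to fill, drop, or keep the resulting NaN rows depends on the analysis; the reindex only makes the gaps explicit.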
Long vs Wide Format
Long Format (Standard)
ticker date return
AAPL 2024-01-01 0.01
AAPL 2024-01-02 0.02
MSFT 2024-01-01 0.015
MSFT 2024-01-02 0.018
Wide Format
date AAPL MSFT
2024-01-01 0.01 0.015
2024-01-02 0.02 0.018
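Converting between the two layouts can be sketched with `pivot` and `melt`. This is a minimal example assuming the column names `ticker`, `date`, and `return` from the tables above; `long_df`, `wide_df`, and `back_to_long` are illustrative names.

```python
import pandas as pd

long_df = pd.DataFrame({
    'ticker': ['AAPL', 'AAPL', 'MSFT', 'MSFT'],
    'date': pd.to_datetime(
        ['2024-01-01', '2024-01-02', '2024-01-01', '2024-01-02']
    ),
    'return': [0.01, 0.02, 0.015, 0.018],
})

# Long -> wide: one row per date, one column per ticker
wide_df = long_df.pivot(index='date', columns='ticker', values='return')

# Wide -> long: stack the ticker columns back into rows
back_to_long = wide_df.reset_index().melt(
    id_vars='date', var_name='ticker', value_name='return'
)

print(wide_df)
print(back_to_long)
```

`pivot` raises an error if an entity-time pair appears more than once; in that case `pivot_table` with an explicit aggregation is one alternative.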
Panel Data in Pandas
Pandas represents panel data with a MultiIndex:

```python
import pandas as pd
import numpy as np

# Entities (tickers) and time points (dates)
tickers = ['AAPL', 'MSFT', 'GOOGL']
dates = pd.date_range('2024-01-01', periods=5)

# Cartesian product: one (ticker, date) pair per observation
index = pd.MultiIndex.from_product(
    [tickers, dates],
    names=['ticker', 'date']
)

# Simulated daily returns: 3 tickers x 5 dates = 15 values
returns = pd.Series(np.random.randn(15) * 0.02, index=index, name='return')
print(returns)
```
Exercises
Exercise 1. Explain what panel data is. Give an example of data that has both cross-sectional and time-series dimensions.
Solution to Exercise 1
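Panel data records the same set of entities repeatedly over time, so each observation is identified by an (entity, time) pair. Example: daily prices for several stocks, where the tickers form the cross-sectional dimension and the dates form the time-series dimension.

```python
import pandas as pd

# Two tickers (cross-section) observed on two dates (time series);
# prices are taken from the table in "What is Panel Data?"
df = pd.DataFrame({
    'ticker': ['AAPL', 'AAPL', 'MSFT', 'MSFT'],
    'date': pd.to_datetime(['2024-01-01', '2024-01-02'] * 2),
    'price': [150.00, 151.50, 300.00, 302.00],
}).set_index(['ticker', 'date'])
print(df)
```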
Exercise 2. Write code that creates a simple panel dataset with 3 entities observed over 4 time periods.
Solution to Exercise 2
One approach follows the Panel Data in Pandas section: build the index with `pd.MultiIndex.from_product` from a list of 3 entity labels and 4 dates, then attach one value per entity-time pair, which gives 3 × 4 = 12 rows.
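One possible solution, sketched under assumed entity labels (`A`, `B`, `C`), four consecutive daily dates, and simulated values:

```python
import numpy as np
import pandas as pd

entities = ['A', 'B', 'C']                      # 3 entities (hypothetical labels)
dates = pd.date_range('2024-01-01', periods=4)  # 4 time periods

# One row per (entity, date) combination: 3 x 4 = 12 rows
index = pd.MultiIndex.from_product(
    [entities, dates], names=['entity', 'date']
)

rng = np.random.default_rng(42)
panel = pd.DataFrame({'value': rng.normal(size=len(index))}, index=index)
print(panel)
```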
Exercise 3. Explain the difference between balanced and unbalanced panel data. Write code that checks if a panel is balanced.
Solution to Exercise 3
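A balanced panel has an observation for every entity in every time period; an unbalanced panel is missing at least one entity-time combination. The check below assumes no duplicate entity-time pairs.

```python
import pandas as pd
import numpy as np

np.random.seed(42)
index = pd.MultiIndex.from_product([['A', 'B', 'C'], range(4)], names=['entity', 'time'])
df = pd.DataFrame({'value': np.random.randn(12)}, index=index)

def is_balanced(panel):
    # Balanced: each entity's row count equals the number of distinct periods
    counts = panel.groupby(level='entity').size()
    n_times = panel.index.get_level_values('time').nunique()
    return bool((counts == n_times).all())

print(is_balanced(df))            # True
print(is_balanced(df.iloc[:-1]))  # False: the last row of entity C is dropped
```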
Exercise 4. Create a panel dataset and use groupby() to compute summary statistics for each entity.
Solution to Exercise 4
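```python
import pandas as pd
import numpy as np

np.random.seed(42)

# Panel: 2 entities observed over 25 periods each (50 rows)
index = pd.MultiIndex.from_product([['X', 'Y'], range(25)], names=['entity', 'time'])
df = pd.DataFrame({'A': np.random.randn(50)}, index=index)

# Summary statistics for each entity
result = df.groupby(level='entity')['A'].agg(['mean', 'std', 'min', 'max'])
print(result)
```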