Introduction to Panel Data

Panel data (also called longitudinal data) combines cross-sectional and time-series dimensions. Each observation is identified by two keys: an entity (individual, firm, country) and a time point.

Mental Model

Panel data is a 3D cube flattened into a 2D table. The three dimensions are entity, time, and variable. In pandas, entity and time go into a MultiIndex and variables become columns. This lets you ask both "how does entity X change over time?" and "how do all entities compare at time T?" from a single DataFrame.
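A minimal sketch of those two questions, assuming a small made-up price table (the tickers, dates, and values are illustrative, not real data). It uses `DataFrame.xs` to fix one level of the MultiIndex and keep the other:

```python
import pandas as pd

# Illustrative values only: 2 entities x 3 dates, one variable
index = pd.MultiIndex.from_product(
    [['AAPL', 'MSFT'], pd.date_range('2024-01-01', periods=3)],
    names=['ticker', 'date'],
)
prices = pd.DataFrame(
    {'price': [150.0, 151.5, 149.8, 300.0, 302.0, 301.5]}, index=index
)

# "How does entity X change over time?" -> fix the entity, keep all dates
aapl_history = prices.xs('AAPL', level='ticker')

# "How do all entities compare at time T?" -> fix the date, keep all entities
snapshot = prices.xs(pd.Timestamp('2024-01-02'), level='date')

print(aapl_history)
print(snapshot)
```

`aapl_history` is indexed by date only and `snapshot` by ticker only, because `xs` drops the level that was fixed.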

What is Panel Data?

Panel data is indexed along two dimensions:

Cross-sectional: Different entities (stocks, companies, individuals)
Time-series: Repeated observations over time

┌─────────────────────────────────────────────────┐
│                   Panel Data                    │
├─────────────────────────────────────────────────┤
│  Entity (i)      Time (t)        Value          │
│  ──────────      ────────        ─────          │
│  AAPL            2024-01-01      $150.00        │
│  AAPL            2024-01-02      $151.50        │
│  AAPL            2024-01-03      $149.80        │
│  MSFT            2024-01-01      $300.00        │
│  MSFT            2024-01-02      $302.00        │
│  MSFT            2024-01-03      $301.50        │
│  GOOGL           2024-01-01      $140.00        │
│  GOOGL           2024-01-02      $141.20        │
│  GOOGL           2024-01-03      $142.00        │
└─────────────────────────────────────────────────┘

Panel Data vs Other Data Types

Data Type	Entities	Time Points	Example
Cross-sectional	Multiple	Single	Survey at one point
Time-series	Single	Multiple	One stock's history
Panel	Multiple	Multiple	Multiple stocks over time

Why Use Panel Data?

1. More Information

A panel contains more observations than a single cross-section or a single time series covering the same entities and periods:

Cross-sectional: 100 stocks × 1 day = 100 observations
Time-series: 1 stock × 252 days = 252 observations
Panel: 100 stocks × 252 days = 25,200 observations

2. Control for Unobserved Heterogeneity

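Panel data allows fixed effects models, which control for unobserved factors that are constant along one dimension:

Entity fixed effects: stock-specific characteristics that are roughly constant over time (management quality, brand value)
Time fixed effects: conditions shared by all entities in a given period (market conditions, regulations)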
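One common way to remove entity-specific factors is the within transformation: subtract each entity's own time average from its observations, so anything constant within an entity drops out. The sketch below uses made-up numbers and a hypothetical column name `y`; it illustrates only the demeaning step, not a full fixed effects regression.

```python
import pandas as pd

# Made-up data: each entity has its own constant level plus a shared trend
df = pd.DataFrame({
    'entity': ['A', 'A', 'A', 'B', 'B', 'B'],
    'time':   [1, 2, 3, 1, 2, 3],
    'y':      [10.0, 11.0, 12.0, 100.0, 101.0, 102.0],
}).set_index(['entity', 'time'])

# Within transformation: subtract each entity's mean over time
entity_mean = df.groupby(level='entity')['y'].transform('mean')
df['y_within'] = df['y'] - entity_mean

print(df)
```

After demeaning, A and B have identical `y_within` values even though their levels differ by 90, which is the sense in which entity-specific constants are controlled for.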

3. Study Dynamics

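Because each entity is observed repeatedly, you can track how it changes over time (for example, a stock's day-over-day price change), which a single cross-section cannot show.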
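As a small illustration of tracking change within each entity, the sketch below computes day-over-day price differences per ticker. The prices are made-up illustrative values, and the rows are assumed to be sorted by date within each ticker.

```python
import pandas as pd

# Illustrative prices: 2 tickers x 3 dates, sorted by date within ticker
index = pd.MultiIndex.from_product(
    [['AAPL', 'MSFT'], pd.date_range('2024-01-01', periods=3)],
    names=['ticker', 'date'],
)
prices = pd.Series(
    [150.0, 151.5, 149.8, 300.0, 302.0, 301.5], index=index, name='price'
)

# Difference from the previous date, computed separately for each ticker
# so that MSFT's first row is not compared with AAPL's last row
change = prices.groupby(level='ticker').diff()
print(change)
```

The first row of each ticker is NaN because there is no earlier observation to compare against. Calling `prices.diff()` without the `groupby` would instead subtract across the ticker boundary.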

Examples of Panel Data

Domain	Entities	Time	Variables
Finance	Stocks	Days	Returns, Volume
Economics	Countries	Years	GDP, Inflation
Healthcare	Patients	Visits	Vitals, Tests
Education	Students	Grades	Scores, Attendance

Balanced vs Unbalanced Panels

Balanced Panel

Every entity has observations for every time period.

Unbalanced Panel

Some entity-time combinations are missing (common in real data).
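A rough sketch of how one might spot an unbalanced panel and expand it onto the full entity × time grid. The data and the column name `ret` are made up for illustration. Reindexing only marks the gaps with NaN; it does not supply the missing observations.

```python
import pandas as pd

# Made-up unbalanced panel: MSFT has no observation on 2024-01-02
df = pd.DataFrame({
    'ticker': ['AAPL', 'AAPL', 'AAPL', 'MSFT', 'MSFT'],
    'date': pd.to_datetime(
        ['2024-01-01', '2024-01-02', '2024-01-03', '2024-01-01', '2024-01-03']
    ),
    'ret': [0.01, 0.02, -0.01, 0.015, 0.005],
}).set_index(['ticker', 'date'])

# Observations per entity: unequal counts indicate an unbalanced panel
counts = df.groupby(level='ticker').size()
print(counts)

# Build the full entity x time grid and reindex; missing cells become NaN
tickers = df.index.get_level_values('ticker').unique()
dates = df.index.get_level_values('date').unique().sort_values()
full_index = pd.MultiIndex.from_product([tickers, dates], names=['ticker', 'date'])
balanced = df.reindex(full_index)
print(balanced)
```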

Long vs Wide Format

Long Format (Standard)

ticker  date        return
AAPL    2024-01-01  0.01
AAPL    2024-01-02  0.02
MSFT    2024-01-01  0.015

Wide Format

date        AAPL  MSFT
2024-01-01  0.01  0.015
2024-01-02  0.02  0.018
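Converting between the two layouts can be sketched with `pivot` (long to wide) and `melt` (wide to long). The values reuse the small return figures from the tables above; the variable names are arbitrary.

```python
import pandas as pd

# Long format: one row per (ticker, date) pair
long_df = pd.DataFrame({
    'ticker': ['AAPL', 'AAPL', 'MSFT', 'MSFT'],
    'date': pd.to_datetime(['2024-01-01', '2024-01-02'] * 2),
    'return': [0.01, 0.02, 0.015, 0.018],
})

# Long -> wide: dates as rows, one column per ticker
wide_df = long_df.pivot(index='date', columns='ticker', values='return')
print(wide_df)

# Wide -> long: turn the ticker columns back into rows
back_to_long = wide_df.reset_index().melt(
    id_vars='date', var_name='ticker', value_name='return'
)
print(back_to_long)
```

Long format tends to suit `groupby` operations and unbalanced panels. Wide format is convenient for column-wise operations across entities, such as a correlation matrix of returns.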

Panel Data in Pandas

Pandas represents panel data with a MultiIndex whose levels are the entity and the time point:

```python
import pandas as pd
import numpy as np

tickers = ['AAPL', 'MSFT', 'GOOGL']
dates = pd.date_range('2024-01-01', periods=5)

# Every (ticker, date) combination: 3 tickers x 5 dates = 15 rows
index = pd.MultiIndex.from_product(
    [tickers, dates],
    names=['ticker', 'date']
)

# Simulated daily returns, for illustration only
returns = pd.Series(np.random.randn(15) * 0.02, index=index, name='return')
print(returns)
```

Exercises

Exercise 1. Explain what panel data is. Give an example of data that has both cross-sectional and time-series dimensions.

Solution to Exercise 1

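Panel data records the same set of entities repeatedly over time, so each observation is identified by an (entity, time) pair. One example is annual GDP for several countries: the countries form the cross-sectional dimension and the years form the time-series dimension.

```python
import pandas as pd

# Placeholder values for illustration only, not real GDP figures
df = pd.DataFrame({
    'country': ['US', 'US', 'JP', 'JP'],
    'year': [2022, 2023, 2022, 2023],
    'gdp': [1.0, 1.1, 2.0, 2.1],
}).set_index(['country', 'year'])
print(df)
```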

Exercise 2. Write code that creates a simple panel dataset with 3 entities observed over 4 time periods.

Solution to Exercise 2

Build the index with pd.MultiIndex.from_product, passing a list of 3 entity labels and a list of 4 time periods, as in the main content. This produces all 3 × 4 = 12 entity-time pairs. Then attach the values as a Series or DataFrame on that index.
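One possible solution, sketched with `pd.MultiIndex.from_product` as in the main content. The entity names, years, and random values are placeholders chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical entities and yearly periods, chosen only for illustration
entities = ['Firm_A', 'Firm_B', 'Firm_C']
years = [2020, 2021, 2022, 2023]

# All 3 x 4 entity-year combinations
index = pd.MultiIndex.from_product([entities, years], names=['entity', 'year'])

# Seeded generator so the made-up values are reproducible
rng = np.random.default_rng(42)
panel = pd.DataFrame({'value': rng.normal(size=len(index))}, index=index)

print(panel)
print(panel.shape)  # (12, 1): 3 entities x 4 periods
```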

Exercise 3. Explain the difference between balanced and unbalanced panel data. Write code that checks if a panel is balanced.

Solution to Exercise 3

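In a balanced panel every entity is observed in every time period, so the number of rows equals entities × periods. In an unbalanced panel at least one entity-time combination is missing.

```python
import pandas as pd

# Balanced by construction: 2 tickers x 3 dates (illustrative values)
index = pd.MultiIndex.from_product(
    [['AAPL', 'MSFT'], pd.date_range('2024-01-01', periods=3)],
    names=['ticker', 'date'])
panel = pd.DataFrame({'return': [0.01, 0.02, -0.01, 0.015, 0.018, 0.0]}, index=index)

def is_balanced(df):
    # Unique (entity, time) rows must number entities x periods
    n_entities, n_periods = (df.index.get_level_values(k).nunique() for k in ('ticker', 'date'))
    return df.index.is_unique and len(df) == n_entities * n_periods

print(is_balanced(panel))            # True
print(is_balanced(panel.iloc[:-1]))  # False: MSFT is missing one date
```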

Exercise 4. Create a panel dataset and use groupby() to compute summary statistics for each entity.

Solution to Exercise 4

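```python
import pandas as pd
import numpy as np

np.random.seed(42)

# Panel of 3 tickers x 50 dates with simulated returns
index = pd.MultiIndex.from_product(
    [['AAPL', 'MSFT', 'GOOGL'], pd.date_range('2024-01-01', periods=50)],
    names=['ticker', 'date'],
)
df = pd.DataFrame({'return': np.random.randn(len(index)) * 0.02}, index=index)

# Summary statistics for each entity
result = df.groupby(level='ticker')['return'].agg(['mean', 'std', 'min', 'max'])
print(result)
```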