join Method¶
The join() method combines DataFrames based on their index. It provides a simpler syntax than merge for index-based operations.
Mental Model
join() is merge() with a narrower scope: it matches on the index by default and uses a left join. If your DataFrames already share a meaningful index (like dates or ticker symbols), join is the shortest path to combining them -- one call, no key specification needed.
Basic Usage¶
Join two DataFrames by index.
1. Simple Join¶
```python import pandas as pd
df1 = pd.DataFrame({ 'A': [1, 2, 3] }, index=['a', 'b', 'c'])
df2 = pd.DataFrame({ 'B': [4, 5, 6] }, index=['a', 'b', 'd'])
result = df1.join(df2) print(result) ```
A B
a 1.0 4.0
b 2.0 5.0
c 3.0 NaN
2. Default Left Join¶
join defaults to left join, keeping all left DataFrame rows.
3. Method vs Function¶
```python df1.join(df2) # Method syntax
No pd.join() function # Unlike merge¶
```
Financial Example¶
Combine stock price data by date index.
1. Download Data¶
```python import yfinance as yf
def download(ticker): return yf.Ticker(ticker).history(period="max")
tickers = ['META', 'AAPL', 'AMZN', 'NFLX', 'GOOG'] ```
2. Iterative Join¶
```python for i, ticker in enumerate(tickers): if i == 0: df = download(ticker).rename(columns={'Close': ticker})[[ticker]] else: dg = download(ticker).rename(columns={'Close': ticker})[[ticker]] df = df.join(dg, how="inner")
print(df.head(3)) print(df.tail(3)) ```
3. Result¶
FAANG closing prices aligned by date.
Joining Multiple DataFrames¶
Join several DataFrames at once.
1. List of DataFrames¶
python
df1.join([df2, df3, df4])
2. Sequential Join¶
python
result = df1.join(df2).join(df3).join(df4)
3. With Different How¶
```python
All must use same how parameter¶
df1.join([df2, df3], how='outer') ```
Suffix Handling¶
Handle overlapping column names.
1. lsuffix and rsuffix¶
```python df1 = pd.DataFrame({'A': [1, 2]}, index=['x', 'y']) df2 = pd.DataFrame({'A': [3, 4]}, index=['x', 'y'])
df1.join(df2, lsuffix='_left', rsuffix='_right') ```
A_left A_right
x 1 3
y 2 4
2. Required for Overlaps¶
```python
Without suffixes, overlapping columns raise error¶
df1.join(df2) # ValueError: columns overlap¶
```
3. Clear Naming¶
python
df1.join(df2, lsuffix='_2023', rsuffix='_2024')
Index Alignment¶
join aligns on index, not columns.
1. Different Index Types¶
```python
Works with any compatible index types¶
DatetimeIndex, RangeIndex, string index¶
```
2. Partial Overlap¶
```python
Unmatched indices get NaN (or excluded in inner join)¶
```
3. on Parameter¶
```python
Join left's column to right's index¶
df1.join(df2, on='key_column') ```
Exercises¶
Exercise 1.
Create two DataFrames with overlapping string indices. Use .join() with the default left join to combine them. Observe which rows have NaN and explain why.
Solution to Exercise 1
Join two DataFrames with partial index overlap.
import pandas as pd
df1 = pd.DataFrame({'A': [10, 20, 30]}, index=['x', 'y', 'z'])
df2 = pd.DataFrame({'B': [40, 50, 60]}, index=['x', 'y', 'w'])
result = df1.join(df2)
print(result)
# Row 'z' has NaN for column B because 'z' is not in df2's index
Exercise 2.
Create two DataFrames that share a column name 'value'. Use .join() with lsuffix='_left' and rsuffix='_right' to resolve the name collision. Verify both columns appear in the result.
Solution to Exercise 2
Use lsuffix and rsuffix for overlapping column names.
import pandas as pd
df1 = pd.DataFrame({'value': [1, 2, 3]}, index=['a', 'b', 'c'])
df2 = pd.DataFrame({'value': [4, 5, 6]}, index=['a', 'b', 'c'])
result = df1.join(df2, lsuffix='_left', rsuffix='_right')
print(result)
assert 'value_left' in result.columns
assert 'value_right' in result.columns
Exercise 3.
Create three DataFrames with different index ranges. Use .join() with a list of DataFrames and how='outer' to combine all three. Count the total number of NaN values in the result.
Solution to Exercise 3
Outer join a list of DataFrames and count NaN values.
import pandas as pd
df1 = pd.DataFrame({'A': [1, 2]}, index=[0, 1])
df2 = pd.DataFrame({'B': [3, 4]}, index=[1, 2])
df3 = pd.DataFrame({'C': [5, 6]}, index=[2, 3])
result = df1.join([df2, df3], how='outer')
print(result)
print(f"Total NaN values: {result.isna().sum().sum()}")