Skip to content

join Method

The join() method combines DataFrames based on their index. It provides a simpler syntax than merge for index-based operations.

Mental Model

join() is merge() with a narrower scope: it matches on the index by default and uses a left join. If your DataFrames already share a meaningful index (like dates or ticker symbols), join is the shortest path to combining them -- one call, no key specification needed.

Basic Usage

Join two DataFrames by index.

1. Simple Join

```python import pandas as pd

df1 = pd.DataFrame({ 'A': [1, 2, 3] }, index=['a', 'b', 'c'])

df2 = pd.DataFrame({ 'B': [4, 5, 6] }, index=['a', 'b', 'd'])

result = df1.join(df2) print(result) ```

A B a 1.0 4.0 b 2.0 5.0 c 3.0 NaN

2. Default Left Join

join defaults to left join, keeping all left DataFrame rows.

3. Method vs Function

```python df1.join(df2) # Method syntax

No pd.join() function # Unlike merge

```

Financial Example

Combine stock price data by date index.

1. Download Data

```python import yfinance as yf

def download(ticker): return yf.Ticker(ticker).history(period="max")

tickers = ['META', 'AAPL', 'AMZN', 'NFLX', 'GOOG'] ```

2. Iterative Join

```python for i, ticker in enumerate(tickers): if i == 0: df = download(ticker).rename(columns={'Close': ticker})[[ticker]] else: dg = download(ticker).rename(columns={'Close': ticker})[[ticker]] df = df.join(dg, how="inner")

print(df.head(3)) print(df.tail(3)) ```

3. Result

FAANG closing prices aligned by date.

Joining Multiple DataFrames

Join several DataFrames at once.

1. List of DataFrames

python df1.join([df2, df3, df4])

2. Sequential Join

python result = df1.join(df2).join(df3).join(df4)

3. With Different How

```python

All must use same how parameter

df1.join([df2, df3], how='outer') ```

Suffix Handling

Handle overlapping column names.

1. lsuffix and rsuffix

```python df1 = pd.DataFrame({'A': [1, 2]}, index=['x', 'y']) df2 = pd.DataFrame({'A': [3, 4]}, index=['x', 'y'])

df1.join(df2, lsuffix='_left', rsuffix='_right') ```

A_left A_right x 1 3 y 2 4

2. Required for Overlaps

```python

Without suffixes, overlapping columns raise error

df1.join(df2) # ValueError: columns overlap

```

3. Clear Naming

python df1.join(df2, lsuffix='_2023', rsuffix='_2024')

Index Alignment

join aligns on index, not columns.

1. Different Index Types

```python

Works with any compatible index types

DatetimeIndex, RangeIndex, string index

```

2. Partial Overlap

```python

Unmatched indices get NaN (or excluded in inner join)

```

3. on Parameter

```python

Join left's column to right's index

df1.join(df2, on='key_column') ```


Exercises

Exercise 1. Create two DataFrames with overlapping string indices. Use .join() with the default left join to combine them. Observe which rows have NaN and explain why.

Solution to Exercise 1

Join two DataFrames with partial index overlap.

import pandas as pd

df1 = pd.DataFrame({'A': [10, 20, 30]}, index=['x', 'y', 'z'])
df2 = pd.DataFrame({'B': [40, 50, 60]}, index=['x', 'y', 'w'])
result = df1.join(df2)
print(result)
# Row 'z' has NaN for column B because 'z' is not in df2's index

Exercise 2. Create two DataFrames that share a column name 'value'. Use .join() with lsuffix='_left' and rsuffix='_right' to resolve the name collision. Verify both columns appear in the result.

Solution to Exercise 2

Use lsuffix and rsuffix for overlapping column names.

import pandas as pd

df1 = pd.DataFrame({'value': [1, 2, 3]}, index=['a', 'b', 'c'])
df2 = pd.DataFrame({'value': [4, 5, 6]}, index=['a', 'b', 'c'])
result = df1.join(df2, lsuffix='_left', rsuffix='_right')
print(result)
assert 'value_left' in result.columns
assert 'value_right' in result.columns

Exercise 3. Create three DataFrames with different index ranges. Use .join() with a list of DataFrames and how='outer' to combine all three. Count the total number of NaN values in the result.

Solution to Exercise 3

Outer join a list of DataFrames and count NaN values.

import pandas as pd

df1 = pd.DataFrame({'A': [1, 2]}, index=[0, 1])
df2 = pd.DataFrame({'B': [3, 4]}, index=[1, 2])
df3 = pd.DataFrame({'C': [5, 6]}, index=[2, 3])
result = df1.join([df2, df3], how='outer')
print(result)
print(f"Total NaN values: {result.isna().sum().sum()}")