merge vs join¶

pandas provides two methods for combining DataFrames: merge() and join(). Understanding their differences is essential for data wrangling.

Mental Model

merge and join do the same thing under the hood -- they differ in defaults. merge matches on columns (inner join). join matches on the index (left join). Pick whichever matches where your keys live: columns for merge, index for join.

Key Differences¶

merge and join serve different purposes.

1. merge Uses Columns¶

```python import pandas as pd

df1 = pd.DataFrame({ 'key': ['A', 'B', 'C'], 'value1': [1, 2, 3] })

df2 = pd.DataFrame({ 'key': ['A', 'B', 'D'], 'value2': [4, 5, 6] })

merge on column¶

pd.merge(df1, df2, on='key') ```

2. join Uses Index¶

```python df1 = pd.DataFrame({ 'value1': [1, 2, 3] }, index=['A', 'B', 'C'])

df2 = pd.DataFrame({ 'value2': [4, 5, 6] }, index=['A', 'B', 'D'])

join on index¶

df1.join(df2) ```

3. Comparison Table¶

Feature	merge	join
Join key	Column values	Index labels
Syntax	`pd.merge(df1, df2)`	`df1.join(df2)`
Default	Inner join	Left join
Flexibility	More options	Simpler API

When to Use merge¶

Choose merge for column-based joins.

1. Foreign Key Relationships¶

```python orders = pd.DataFrame({ 'order_id': [1, 2, 3], 'customer_id': [101, 102, 101], 'amount': [100, 200, 150] })

customers = pd.DataFrame({ 'customer_id': [101, 102, 103], 'name': ['Alice', 'Bob', 'Carol'] })

pd.merge(orders, customers, on='customer_id') ```

2. Different Column Names¶

python pd.merge(df1, df2, left_on='id', right_on='customer_id')

3. Multiple Join Keys¶

python pd.merge(df1, df2, on=['key1', 'key2'])

When to Use join¶

Choose join for index-based operations.

1. Time Series Alignment¶

```python import yfinance as yf

aapl = yf.Ticker('AAPL').history(period='1y')[['Close']].rename( columns={'Close': 'AAPL'} ) msft = yf.Ticker('MSFT').history(period='1y')[['Close']].rename( columns={'Close': 'MSFT'} )

portfolio = aapl.join(msft, how='inner') ```

2. Multiple DataFrames¶

```python

Join multiple DataFrames by index¶

df1.join([df2, df3, df4]) ```

3. Suffix Handling¶

python df1.join(df2, lsuffix='_left', rsuffix='_right')

Four Join Types¶

Both methods support the same join types.

1. Inner Join¶

python pd.merge(df1, df2, on='key', how='inner') # Only matching df1.join(df2, how='inner')

2. Left Join¶

python pd.merge(df1, df2, on='key', how='left') # All from left df1.join(df2, how='left') # Default for join

3. Right Join¶

python pd.merge(df1, df2, on='key', how='right') # All from right df1.join(df2, how='right')

4. Outer Join¶

python pd.merge(df1, df2, on='key', how='outer') # All rows df1.join(df2, how='outer')

Visual Comparison¶

Four ways of combining two DataFrames.

1. Sample DataFrames¶

python df1 = pd.DataFrame({'city': ['NY', 'SF', 'LA'], 'temp': [21, 14, 35]}) df2 = pd.DataFrame({'city': ['SF', 'NY', 'ICN'], 'humidity': [65, 68, 75]})

2. Inner (Intersection)¶

```python pd.merge(df1, df2, on='city', how='inner')

Only NY and SF¶

```

3. Outer (Union)¶

```python pd.merge(df1, df2, on='city', how='outer')

NY, SF, LA, ICN (with NaN for missing)¶

```

Practical Guidelines¶

Choose the right method for your use case.

1. Use merge When¶

Joining on column values (foreign keys)
Column names differ between DataFrames
Need fine-grained control over join

2. Use join When¶

DataFrames share a meaningful index
Combining time series data
Simple index-based combination

3. Performance¶

Both methods have similar performance; choose based on data structure.

Exercises¶

Exercise 1. Create two DataFrames with a shared column 'key'. Combine them using pd.merge(). Then set 'key' as the index of both and use .join() to achieve the same result. Compare the outputs.

Solution to Exercise 1

Achieve the same result with merge and join.

import pandas as pd

df1 = pd.DataFrame({'key': ['a', 'b', 'c'], 'val1': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['a', 'b', 'c'], 'val2': [4, 5, 6]})
result_merge = pd.merge(df1, df2, on='key')
result_join = df1.set_index('key').join(df2.set_index('key'))
print("merge:\n", result_merge)
print("join:\n", result_join)

Exercise 2. Demonstrate the different default behaviors: show that pd.merge() defaults to an inner join while .join() defaults to a left join, using the same pair of DataFrames.

Solution to Exercise 2

Show different default join types.

import pandas as pd

df1 = pd.DataFrame({'key': ['a', 'b', 'c'], 'v1': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['a', 'b', 'd'], 'v2': [4, 5, 6]})
print("merge (default inner):")
print(pd.merge(df1, df2, on='key'))
print("\njoin (default left):")
print(df1.set_index('key').join(df2.set_index('key')))

Exercise 3. Identify a scenario where pd.merge() is clearly better than .join() (e.g., joining on columns with different names) and another where .join() is clearly better (e.g., joining multiple DataFrames by index). Implement both.

Solution to Exercise 3

Show when merge is better and when join is better.

import pandas as pd

# merge is better: different key column names
df1 = pd.DataFrame({'emp_id': [1, 2], 'name': ['Alice', 'Bob']})
df2 = pd.DataFrame({'employee_id': [1, 2], 'dept': ['HR', 'IT']})
print("merge with different key names:")
print(pd.merge(df1, df2, left_on='emp_id', right_on='employee_id'))

# join is better: multiple DataFrames by index
a = pd.DataFrame({'A': [1, 2]}, index=['x', 'y'])
b = pd.DataFrame({'B': [3, 4]}, index=['x', 'y'])
c = pd.DataFrame({'C': [5, 6]}, index=['x', 'y'])
print("\njoin multiple DataFrames:")
print(a.join([b, c]))