Skip to content

merge vs join

pandas provides two methods for combining DataFrames: merge() and join(). Understanding their differences is essential for data wrangling.

Mental Model

merge and join do the same thing under the hood -- they differ in defaults. merge matches on columns (inner join). join matches on the index (left join). Pick whichever matches where your keys live: columns for merge, index for join.

Key Differences

merge and join serve different purposes.

1. merge Uses Columns

```python import pandas as pd

df1 = pd.DataFrame({ 'key': ['A', 'B', 'C'], 'value1': [1, 2, 3] })

df2 = pd.DataFrame({ 'key': ['A', 'B', 'D'], 'value2': [4, 5, 6] })

merge on column

pd.merge(df1, df2, on='key') ```

2. join Uses Index

```python df1 = pd.DataFrame({ 'value1': [1, 2, 3] }, index=['A', 'B', 'C'])

df2 = pd.DataFrame({ 'value2': [4, 5, 6] }, index=['A', 'B', 'D'])

join on index

df1.join(df2) ```

3. Comparison Table

Feature merge join
Join key Column values Index labels
Syntax pd.merge(df1, df2) df1.join(df2)
Default Inner join Left join
Flexibility More options Simpler API

When to Use merge

Choose merge for column-based joins.

1. Foreign Key Relationships

```python orders = pd.DataFrame({ 'order_id': [1, 2, 3], 'customer_id': [101, 102, 101], 'amount': [100, 200, 150] })

customers = pd.DataFrame({ 'customer_id': [101, 102, 103], 'name': ['Alice', 'Bob', 'Carol'] })

pd.merge(orders, customers, on='customer_id') ```

2. Different Column Names

python pd.merge(df1, df2, left_on='id', right_on='customer_id')

3. Multiple Join Keys

python pd.merge(df1, df2, on=['key1', 'key2'])

When to Use join

Choose join for index-based operations.

1. Time Series Alignment

```python import yfinance as yf

aapl = yf.Ticker('AAPL').history(period='1y')[['Close']].rename( columns={'Close': 'AAPL'} ) msft = yf.Ticker('MSFT').history(period='1y')[['Close']].rename( columns={'Close': 'MSFT'} )

portfolio = aapl.join(msft, how='inner') ```

2. Multiple DataFrames

```python

Join multiple DataFrames by index

df1.join([df2, df3, df4]) ```

3. Suffix Handling

python df1.join(df2, lsuffix='_left', rsuffix='_right')

Four Join Types

Both methods support the same join types.

1. Inner Join

python pd.merge(df1, df2, on='key', how='inner') # Only matching df1.join(df2, how='inner')

2. Left Join

python pd.merge(df1, df2, on='key', how='left') # All from left df1.join(df2, how='left') # Default for join

3. Right Join

python pd.merge(df1, df2, on='key', how='right') # All from right df1.join(df2, how='right')

4. Outer Join

python pd.merge(df1, df2, on='key', how='outer') # All rows df1.join(df2, how='outer')

Visual Comparison

Four ways of combining two DataFrames.

1. Sample DataFrames

python df1 = pd.DataFrame({'city': ['NY', 'SF', 'LA'], 'temp': [21, 14, 35]}) df2 = pd.DataFrame({'city': ['SF', 'NY', 'ICN'], 'humidity': [65, 68, 75]})

2. Inner (Intersection)

```python pd.merge(df1, df2, on='city', how='inner')

Only NY and SF

```

3. Outer (Union)

```python pd.merge(df1, df2, on='city', how='outer')

NY, SF, LA, ICN (with NaN for missing)

```

Practical Guidelines

Choose the right method for your use case.

1. Use merge When

  • Joining on column values (foreign keys)
  • Column names differ between DataFrames
  • Need fine-grained control over join

2. Use join When

  • DataFrames share a meaningful index
  • Combining time series data
  • Simple index-based combination

3. Performance

Both methods have similar performance; choose based on data structure.


Exercises

Exercise 1. Create two DataFrames with a shared column 'key'. Combine them using pd.merge(). Then set 'key' as the index of both and use .join() to achieve the same result. Compare the outputs.

Solution to Exercise 1

Achieve the same result with merge and join.

import pandas as pd

df1 = pd.DataFrame({'key': ['a', 'b', 'c'], 'val1': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['a', 'b', 'c'], 'val2': [4, 5, 6]})
result_merge = pd.merge(df1, df2, on='key')
result_join = df1.set_index('key').join(df2.set_index('key'))
print("merge:\n", result_merge)
print("join:\n", result_join)

Exercise 2. Demonstrate the different default behaviors: show that pd.merge() defaults to an inner join while .join() defaults to a left join, using the same pair of DataFrames.

Solution to Exercise 2

Show different default join types.

import pandas as pd

df1 = pd.DataFrame({'key': ['a', 'b', 'c'], 'v1': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['a', 'b', 'd'], 'v2': [4, 5, 6]})
print("merge (default inner):")
print(pd.merge(df1, df2, on='key'))
print("\njoin (default left):")
print(df1.set_index('key').join(df2.set_index('key')))

Exercise 3. Identify a scenario where pd.merge() is clearly better than .join() (e.g., joining on columns with different names) and another where .join() is clearly better (e.g., joining multiple DataFrames by index). Implement both.

Solution to Exercise 3

Show when merge is better and when join is better.

import pandas as pd

# merge is better: different key column names
df1 = pd.DataFrame({'emp_id': [1, 2], 'name': ['Alice', 'Bob']})
df2 = pd.DataFrame({'employee_id': [1, 2], 'dept': ['HR', 'IT']})
print("merge with different key names:")
print(pd.merge(df1, df2, left_on='emp_id', right_on='employee_id'))

# join is better: multiple DataFrames by index
a = pd.DataFrame({'A': [1, 2]}, index=['x', 'y'])
b = pd.DataFrame({'B': [3, 4]}, index=['x', 'y'])
c = pd.DataFrame({'C': [5, 6]}, index=['x', 'y'])
print("\njoin multiple DataFrames:")
print(a.join([b, c]))