merge vs join¶
pandas provides two methods for combining DataFrames: merge() and join(). Understanding their differences is essential for data wrangling.
Key Differences¶
merge and join serve different purposes.
1. merge Uses Columns¶
import pandas as pd
df1 = pd.DataFrame({
'key': ['A', 'B', 'C'],
'value1': [1, 2, 3]
})
df2 = pd.DataFrame({
'key': ['A', 'B', 'D'],
'value2': [4, 5, 6]
})
# merge on column
pd.merge(df1, df2, on='key')
2. join Uses Index¶
df1 = pd.DataFrame({
'value1': [1, 2, 3]
}, index=['A', 'B', 'C'])
df2 = pd.DataFrame({
'value2': [4, 5, 6]
}, index=['A', 'B', 'D'])
# join on index
df1.join(df2)
3. Comparison Table¶
| Feature | merge | join |
|---|---|---|
| Join key | Column values | Index labels |
| Syntax | pd.merge(df1, df2) |
df1.join(df2) |
| Default | Inner join | Left join |
| Flexibility | More options | Simpler API |
When to Use merge¶
Choose merge for column-based joins.
1. Foreign Key Relationships¶
orders = pd.DataFrame({
'order_id': [1, 2, 3],
'customer_id': [101, 102, 101],
'amount': [100, 200, 150]
})
customers = pd.DataFrame({
'customer_id': [101, 102, 103],
'name': ['Alice', 'Bob', 'Carol']
})
pd.merge(orders, customers, on='customer_id')
2. Different Column Names¶
pd.merge(df1, df2, left_on='id', right_on='customer_id')
3. Multiple Join Keys¶
pd.merge(df1, df2, on=['key1', 'key2'])
When to Use join¶
Choose join for index-based operations.
1. Time Series Alignment¶
import yfinance as yf
aapl = yf.Ticker('AAPL').history(period='1y')[['Close']].rename(
columns={'Close': 'AAPL'}
)
msft = yf.Ticker('MSFT').history(period='1y')[['Close']].rename(
columns={'Close': 'MSFT'}
)
portfolio = aapl.join(msft, how='inner')
2. Multiple DataFrames¶
# Join multiple DataFrames by index
df1.join([df2, df3, df4])
3. Suffix Handling¶
df1.join(df2, lsuffix='_left', rsuffix='_right')
Four Join Types¶
Both methods support the same join types.
1. Inner Join¶
pd.merge(df1, df2, on='key', how='inner') # Only matching
df1.join(df2, how='inner')
2. Left Join¶
pd.merge(df1, df2, on='key', how='left') # All from left
df1.join(df2, how='left') # Default for join
3. Right Join¶
pd.merge(df1, df2, on='key', how='right') # All from right
df1.join(df2, how='right')
4. Outer Join¶
pd.merge(df1, df2, on='key', how='outer') # All rows
df1.join(df2, how='outer')
Visual Comparison¶
Four ways of combining two DataFrames.
1. Sample DataFrames¶
df1 = pd.DataFrame({'city': ['NY', 'SF', 'LA'], 'temp': [21, 14, 35]})
df2 = pd.DataFrame({'city': ['SF', 'NY', 'ICN'], 'humidity': [65, 68, 75]})
2. Inner (Intersection)¶
pd.merge(df1, df2, on='city', how='inner')
# Only NY and SF
3. Outer (Union)¶
pd.merge(df1, df2, on='city', how='outer')
# NY, SF, LA, ICN (with NaN for missing)
Practical Guidelines¶
Choose the right method for your use case.
1. Use merge When¶
- Joining on column values (foreign keys)
- Column names differ between DataFrames
- Need fine-grained control over join
2. Use join When¶
- DataFrames share a meaningful index
- Combining time series data
- Simple index-based combination
3. Performance¶
Both methods have similar performance; choose based on data structure.