Keyword - how¶

The how parameter specifies the type of join to perform. It determines which rows are included in the result based on key matching.

Mental Model

how is the Venn-diagram selector. inner keeps only the intersection (keys in both tables). left keeps all keys from the left table. right keeps all keys from the right. outer keeps the union of all keys. Unmatched cells are filled with NaN.

Inner Join¶

Keep only rows with matching keys in both DataFrames.

1. Default Behavior¶

```python import pandas as pd

df1 = pd.DataFrame({ 'city': ['NY', 'SF', 'LA'], 'temperature': [21, 14, 35] })

df2 = pd.DataFrame({ 'city': ['SF', 'NY', 'ICN'], 'humidity': [65, 68, 75] })

df = pd.merge(df1, df2, on='city', how='inner') print(df) ```

city temperature humidity 0 NY 21 68 1 SF 14 65

2. Intersection of Keys¶

Only SF and NY exist in both DataFrames.

3. Most Restrictive¶

Inner join produces the smallest result set.

Left Join¶

Keep all rows from the left DataFrame.

1. Left Join Example¶

python df = pd.merge(df1, df2, on='city', how='left') print(df)

city temperature humidity 0 NY 21 68.0 1 SF 14 65.0 2 LA 35 NaN

2. NaN for Missing¶

LA has no match in df2, so humidity is NaN.

3. Preserve Left Rows¶

All rows from df1 are preserved.

Right Join¶

Keep all rows from the right DataFrame.

1. Right Join Example¶

python df = pd.merge(df1, df2, on='city', how='right') print(df)

city temperature humidity 0 SF 14.0 65 1 NY 21.0 68 2 ICN NaN 75

2. NaN for Missing¶

ICN has no match in df1, so temperature is NaN.

3. Preserve Right Rows¶

All rows from df2 are preserved.

Outer Join¶

Keep all rows from both DataFrames.

1. Outer Join Example¶

python df = pd.merge(df1, df2, on='city', how='outer') print(df)

city temperature humidity 0 NY 21.0 68.0 1 SF 14.0 65.0 2 LA 35.0 NaN 3 ICN NaN 75.0

2. Union of Keys¶

All unique cities from both DataFrames.

3. Most Inclusive¶

Outer join produces the largest result set.

Visual Comparison¶

Side-by-side comparison of join types.

1. Sample Data¶

python left = pd.DataFrame({'key': ['A', 'B', 'C'], 'left_val': [1, 2, 3]}) right = pd.DataFrame({'key': ['B', 'C', 'D'], 'right_val': [4, 5, 6]})

2. Results Summary¶

how	Keys in Result	Missing Values
inner	B, C	None
left	A, B, C	right_val for A
right	B, C, D	left_val for D
outer	A, B, C, D	Both sides

3. Row Counts¶

python print(len(pd.merge(left, right, how='inner'))) # 2 print(len(pd.merge(left, right, how='left'))) # 3 print(len(pd.merge(left, right, how='right'))) # 3 print(len(pd.merge(left, right, how='outer'))) # 4

Practical Guidelines¶

Choosing the right join type.

1. Use Inner When¶

Only need matched records
Data quality requires both sides present
Computing ratios or comparisons

2. Use Left When¶

Preserving primary table structure
Optional enrichment data
Most common in practice

3. Use Outer When¶

Need complete picture of both sources
Finding mismatches
Data reconciliation

Exercises¶

Exercise 1. Create two DataFrames with a shared 'id' column (partial overlap). Perform all four merge types (inner, left, right, outer) and print the shape of each result.

Solution to Exercise 1

Compare shapes across all four join types.

import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3, 4], 'val': ['a', 'b', 'c', 'd']})
df2 = pd.DataFrame({'id': [3, 4, 5, 6], 'score': [90, 80, 70, 60]})
for how in ['inner', 'left', 'right', 'outer']:
    result = pd.merge(df1, df2, on='id', how=how)
    print(f"{how}: {result.shape}")

Exercise 2. Create a customers DataFrame and an orders DataFrame. Perform a left merge on customer ID to find customers who have no orders (those with NaN in order columns).

Solution to Exercise 2

Find customers without orders using a left merge.

import pandas as pd

customers = pd.DataFrame({'cust_id': [1, 2, 3, 4], 'name': ['Alice', 'Bob', 'Carol', 'Dave']})
orders = pd.DataFrame({'cust_id': [1, 1, 3], 'amount': [100, 150, 200]})
result = pd.merge(customers, orders, on='cust_id', how='left')
no_orders = result[result['amount'].isna()]
print("Customers without orders:")
print(no_orders[['cust_id', 'name']])

Exercise 3. Using the same two DataFrames, perform an outer merge with indicator=True. Use the _merge column to count how many rows came from 'left_only', 'right_only', and 'both'.

Solution to Exercise 3

Use the indicator column to analyze merge sources.

import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3], 'val': ['a', 'b', 'c']})
df2 = pd.DataFrame({'id': [2, 3, 4], 'score': [80, 90, 70]})
result = pd.merge(df1, df2, on='id', how='outer', indicator=True)
print(result['_merge'].value_counts())