Skip to content

Keyword - how

The how parameter specifies the type of join to perform. It determines which rows are included in the result based on key matching.

Mental Model

how is the Venn-diagram selector. inner keeps only the intersection (keys in both tables). left keeps all keys from the left table. right keeps all keys from the right. outer keeps the union of all keys. Unmatched cells are filled with NaN.

Inner Join

Keep only rows with matching keys in both DataFrames.

1. Default Behavior

```python import pandas as pd

df1 = pd.DataFrame({ 'city': ['NY', 'SF', 'LA'], 'temperature': [21, 14, 35] })

df2 = pd.DataFrame({ 'city': ['SF', 'NY', 'ICN'], 'humidity': [65, 68, 75] })

df = pd.merge(df1, df2, on='city', how='inner') print(df) ```

city temperature humidity 0 NY 21 68 1 SF 14 65

2. Intersection of Keys

Only SF and NY exist in both DataFrames.

3. Most Restrictive

Inner join produces the smallest result set.

Left Join

Keep all rows from the left DataFrame.

1. Left Join Example

python df = pd.merge(df1, df2, on='city', how='left') print(df)

city temperature humidity 0 NY 21 68.0 1 SF 14 65.0 2 LA 35 NaN

2. NaN for Missing

LA has no match in df2, so humidity is NaN.

3. Preserve Left Rows

All rows from df1 are preserved.

Right Join

Keep all rows from the right DataFrame.

1. Right Join Example

python df = pd.merge(df1, df2, on='city', how='right') print(df)

city temperature humidity 0 SF 14.0 65 1 NY 21.0 68 2 ICN NaN 75

2. NaN for Missing

ICN has no match in df1, so temperature is NaN.

3. Preserve Right Rows

All rows from df2 are preserved.

Outer Join

Keep all rows from both DataFrames.

1. Outer Join Example

python df = pd.merge(df1, df2, on='city', how='outer') print(df)

city temperature humidity 0 NY 21.0 68.0 1 SF 14.0 65.0 2 LA 35.0 NaN 3 ICN NaN 75.0

2. Union of Keys

All unique cities from both DataFrames.

3. Most Inclusive

Outer join produces the largest result set.

Visual Comparison

Side-by-side comparison of join types.

1. Sample Data

python left = pd.DataFrame({'key': ['A', 'B', 'C'], 'left_val': [1, 2, 3]}) right = pd.DataFrame({'key': ['B', 'C', 'D'], 'right_val': [4, 5, 6]})

2. Results Summary

how Keys in Result Missing Values
inner B, C None
left A, B, C right_val for A
right B, C, D left_val for D
outer A, B, C, D Both sides

3. Row Counts

python print(len(pd.merge(left, right, how='inner'))) # 2 print(len(pd.merge(left, right, how='left'))) # 3 print(len(pd.merge(left, right, how='right'))) # 3 print(len(pd.merge(left, right, how='outer'))) # 4

Practical Guidelines

Choosing the right join type.

1. Use Inner When

  • Only need matched records
  • Data quality requires both sides present
  • Computing ratios or comparisons

2. Use Left When

  • Preserving primary table structure
  • Optional enrichment data
  • Most common in practice

3. Use Outer When

  • Need complete picture of both sources
  • Finding mismatches
  • Data reconciliation

Exercises

Exercise 1. Create two DataFrames with a shared 'id' column (partial overlap). Perform all four merge types (inner, left, right, outer) and print the shape of each result.

Solution to Exercise 1

Compare shapes across all four join types.

import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3, 4], 'val': ['a', 'b', 'c', 'd']})
df2 = pd.DataFrame({'id': [3, 4, 5, 6], 'score': [90, 80, 70, 60]})
for how in ['inner', 'left', 'right', 'outer']:
    result = pd.merge(df1, df2, on='id', how=how)
    print(f"{how}: {result.shape}")

Exercise 2. Create a customers DataFrame and an orders DataFrame. Perform a left merge on customer ID to find customers who have no orders (those with NaN in order columns).

Solution to Exercise 2

Find customers without orders using a left merge.

import pandas as pd

customers = pd.DataFrame({'cust_id': [1, 2, 3, 4], 'name': ['Alice', 'Bob', 'Carol', 'Dave']})
orders = pd.DataFrame({'cust_id': [1, 1, 3], 'amount': [100, 150, 200]})
result = pd.merge(customers, orders, on='cust_id', how='left')
no_orders = result[result['amount'].isna()]
print("Customers without orders:")
print(no_orders[['cust_id', 'name']])

Exercise 3. Using the same two DataFrames, perform an outer merge with indicator=True. Use the _merge column to count how many rows came from 'left_only', 'right_only', and 'both'.

Solution to Exercise 3

Use the indicator column to analyze merge sources.

import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3], 'val': ['a', 'b', 'c']})
df2 = pd.DataFrame({'id': [2, 3, 4], 'score': [80, 90, 70]})
result = pd.merge(df1, df2, on='id', how='outer', indicator=True)
print(result['_merge'].value_counts())