Skip to content

merge Method

The merge() function combines DataFrames based on column values, similar to SQL joins. It is the most flexible method for combining datasets.

Mental Model

merge() is SQL JOIN for DataFrames. You specify which columns to match on (on, left_on, right_on), how to handle mismatches (how), and what to do with name collisions (suffixes). It defaults to an inner join on all shared column names -- always set on explicitly to avoid surprises.

Basic Syntax

Merge two DataFrames on a common column.

1. Simple Merge

```python import pandas as pd

df1 = pd.DataFrame({ 'city': ['NY', 'SF', 'LA'], 'temperature': [21, 14, 35] })

df2 = pd.DataFrame({ 'city': ['SF', 'NY', 'ICN'], 'humidity': [65, 68, 75] })

df = pd.merge(df1, df2, on='city') print(df) ```

city temperature humidity 0 NY 21 68 1 SF 14 65

2. Default Inner Join

By default, merge performs an inner join, keeping only matching rows.

3. Method Syntax

```python

Function syntax

pd.merge(df1, df2, on='key')

Method syntax (equivalent)

df1.merge(df2, on='key') ```

LeetCode Example: Person and Address

Combine person information with address.

1. Sample Data

```python person = pd.DataFrame({ 'personId': [1, 2, 3], 'firstName': ['John', 'Jane', 'Jake'], 'lastName': ['Doe', 'Smith', 'Brown'] })

address = pd.DataFrame({ 'personId': [1, 3], 'city': ['New York', 'Los Angeles'], 'state': ['NY', 'CA'] }) ```

2. Left Merge

python merged_df = person.merge(address, on='personId', how='left') print(merged_df)

3. Result

personId firstName lastName city state 0 1 John Doe New York NY 1 2 Jane Smith NaN NaN 2 3 Jake Brown Los Angeles CA

LeetCode Example: Project Employees

Join projects with employee experience.

1. Sample Data

```python project = pd.DataFrame({ 'project_id': [101, 102, 103], 'project_name': ['Project A', 'Project B', 'Project C'] })

employee = pd.DataFrame({ 'employee_id': [1, 2, 3, 4, 5], 'project_id': [101, 101, 102, 102, 103], 'experience_years': [5, 7, 3, 4, 6] }) ```

2. Merge with Default

python merged_df = pd.merge(project, employee, how='left') print(merged_df)

3. Auto-detected Key

When on=None, merge uses common column names automatically.

Common Column Detection

Merge automatically finds common columns when on is not specified.

1. Auto Detection

```python

Both DataFrames have 'project_id' column

pd.merge(project, employee) # on='project_id' inferred ```

2. Multiple Common Columns

```python

If both have ['key1', 'key2'], merges on both

pd.merge(df1, df2) ```

3. No Common Columns

```python

Raises ValueError if no common columns exist

Must specify left_on and right_on

```

Merge Result Structure

Understanding the output DataFrame.

1. Column Order

```python

Left DataFrame columns first, then right

Common column appears once (unless using suffixes)

```

2. Row Order

```python

Order depends on join type and matching

```

3. Index Reset

```python

Merge resets index to RangeIndex

Original indices are discarded

```


Runnable Example: merge_join_tutorial.py

```python """ Pandas Tutorial: Merging, Joining, and Concatenating DataFrames.

Covers different ways to combine DataFrames. """

import pandas as pd import numpy as np

=============================================================================

Main

=============================================================================

if name == "main":

print("="*70)
print("MERGING, JOINING, AND CONCATENATING")
print("="*70)

# Create sample DataFrames
df1 = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 28]
})

df2 = pd.DataFrame({
    'ID': [1, 2, 3, 5],
    'Department': ['HR', 'IT', 'Finance', 'Marketing'],
    'Salary': [50000, 60000, 75000, 55000]
})

print("\nDataFrame 1 (Employees):")
print(df1)
print("\nDataFrame 2 (Departments):")
print(df2)

# Inner join (default)
print("\n1. Inner Join (intersection):")
inner_merged = pd.merge(df1, df2, on='ID', how='inner')
print(inner_merged)

# Left join
print("\n2. Left Join (keep all from left):")
left_merged = pd.merge(df1, df2, on='ID', how='left')
print(left_merged)

# Right join
print("\n3. Right Join (keep all from right):")
right_merged = pd.merge(df1, df2, on='ID', how='right')
print(right_merged)

# Outer join
print("\n4. Outer Join (keep all from both):")
outer_merged = pd.merge(df1, df2, on='ID', how='outer')
print(outer_merged)

# Merge on different column names
df3 = pd.DataFrame({
    'EmpID': [1, 2, 3],
    'Project': ['A', 'B', 'C']
})

print("\n5. Merge on different column names:")
merged_diff = pd.merge(df1, df3, left_on='ID', right_on='EmpID')
print(merged_diff)

# Concatenate DataFrames vertically
df_top = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df_bottom = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})

print("\n6. Concatenate vertically (stack rows):")
print("Top:")
print(df_top)
print("Bottom:")
print(df_bottom)
print("Result:")
vertical_concat = pd.concat([df_top, df_bottom], ignore_index=True)
print(vertical_concat)

# Concatenate horizontally
df_left = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df_right = pd.DataFrame({'C': [7, 8, 9], 'D': [10, 11, 12]})

print("\n7. Concatenate horizontally (add columns):")
horizontal_concat = pd.concat([df_left, df_right], axis=1)
print(horizontal_concat)

# Join (using index)
df_indexed1 = df1.set_index('ID')
df_indexed2 = df2.set_index('ID')

print("\n8. Join using index:")
joined = df_indexed1.join(df_indexed2, how='inner')
print(joined)

print("\nKEY TAKEAWAYS:")
print("- merge(): SQL-style joins on columns")
print("- concat(): Stack DataFrames vertically or horizontally")
print("- join(): Merge on index")
print("- Join types: inner, left, right, outer")
print("- Use on= for same column names, left_on=/right_on= for different names")

```


Exercises

Exercise 1. Create two DataFrames sharing a 'key' column with partial overlap. Perform a default merge (inner join) and count the resulting rows. Then perform an outer merge and compare.

Solution to Exercise 1

Compare inner and outer merge row counts.

import pandas as pd

df1 = pd.DataFrame({'key': ['a', 'b', 'c'], 'val1': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['b', 'c', 'd'], 'val2': [4, 5, 6]})
inner = pd.merge(df1, df2, on='key')
outer = pd.merge(df1, df2, on='key', how='outer')
print(f"Inner: {len(inner)} rows")
print(f"Outer: {len(outer)} rows")

Exercise 2. Create an orders DataFrame with a 'product_id' column and a products DataFrame with a 'product_id' and 'product_name' column. Use pd.merge() to look up the product name for each order.

Solution to Exercise 2

Look up product names for orders.

import pandas as pd

orders = pd.DataFrame({'order_id': [1, 2, 3], 'product_id': [101, 102, 101]})
products = pd.DataFrame({'product_id': [101, 102, 103], 'product_name': ['Widget', 'Gadget', 'Doohickey']})
result = pd.merge(orders, products, on='product_id')
print(result)

Exercise 3. Create two DataFrames and merge them using both the function syntax pd.merge(df1, df2) and the method syntax df1.merge(df2). Verify both produce identical results.

Solution to Exercise 3

Verify function syntax and method syntax produce the same result.

import pandas as pd

df1 = pd.DataFrame({'key': ['a', 'b'], 'x': [1, 2]})
df2 = pd.DataFrame({'key': ['a', 'b'], 'y': [3, 4]})
result_func = pd.merge(df1, df2, on='key')
result_method = df1.merge(df2, on='key')
assert result_func.equals(result_method)
print("Both approaches produce identical results.")