merge Method¶
The merge() function combines DataFrames based on column values, similar to SQL joins. It is the most flexible method for combining datasets.
Basic Syntax¶
Merge two DataFrames on a common column.
1. Simple Merge¶
import pandas as pd
df1 = pd.DataFrame({
'city': ['NY', 'SF', 'LA'],
'temperature': [21, 14, 35]
})
df2 = pd.DataFrame({
'city': ['SF', 'NY', 'ICN'],
'humidity': [65, 68, 75]
})
df = pd.merge(df1, df2, on='city')
print(df)
city temperature humidity
0 NY 21 68
1 SF 14 65
2. Default Inner Join¶
By default, merge performs an inner join, keeping only matching rows.
3. Method Syntax¶
# Function syntax
pd.merge(df1, df2, on='key')
# Method syntax (equivalent)
df1.merge(df2, on='key')
LeetCode Example: Person and Address¶
Combine person information with address.
1. Sample Data¶
person = pd.DataFrame({
'personId': [1, 2, 3],
'firstName': ['John', 'Jane', 'Jake'],
'lastName': ['Doe', 'Smith', 'Brown']
})
address = pd.DataFrame({
'personId': [1, 3],
'city': ['New York', 'Los Angeles'],
'state': ['NY', 'CA']
})
2. Left Merge¶
merged_df = person.merge(address, on='personId', how='left')
print(merged_df)
3. Result¶
personId firstName lastName city state
0 1 John Doe New York NY
1 2 Jane Smith NaN NaN
2 3 Jake Brown Los Angeles CA
LeetCode Example: Project Employees¶
Join projects with employee experience.
1. Sample Data¶
project = pd.DataFrame({
'project_id': [101, 102, 103],
'project_name': ['Project A', 'Project B', 'Project C']
})
employee = pd.DataFrame({
'employee_id': [1, 2, 3, 4, 5],
'project_id': [101, 101, 102, 102, 103],
'experience_years': [5, 7, 3, 4, 6]
})
2. Merge with Default¶
merged_df = pd.merge(project, employee, how='left')
print(merged_df)
3. Auto-detected Key¶
When on=None, merge uses common column names automatically.
Common Column Detection¶
Merge automatically finds common columns when on is not specified.
1. Auto Detection¶
# Both DataFrames have 'project_id' column
pd.merge(project, employee) # on='project_id' inferred
2. Multiple Common Columns¶
# If both have ['key1', 'key2'], merges on both
pd.merge(df1, df2)
3. No Common Columns¶
# Raises ValueError if no common columns exist
# Must specify left_on and right_on
Merge Result Structure¶
Understanding the output DataFrame.
1. Column Order¶
# Left DataFrame columns first, then right
# Common column appears once (unless using suffixes)
2. Row Order¶
# Order depends on join type and matching
3. Index Reset¶
# Merge resets index to RangeIndex
# Original indices are discarded
Runnable Example: merge_join_tutorial.py¶
"""
Pandas Tutorial: Merging, Joining, and Concatenating DataFrames.
Covers different ways to combine DataFrames.
"""
import pandas as pd
import numpy as np
# =============================================================================
# Main
# =============================================================================
if __name__ == "__main__":
print("="*70)
print("MERGING, JOINING, AND CONCATENATING")
print("="*70)
# Create sample DataFrames
df1 = pd.DataFrame({
'ID': [1, 2, 3, 4],
'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Age': [25, 30, 35, 28]
})
df2 = pd.DataFrame({
'ID': [1, 2, 3, 5],
'Department': ['HR', 'IT', 'Finance', 'Marketing'],
'Salary': [50000, 60000, 75000, 55000]
})
print("\nDataFrame 1 (Employees):")
print(df1)
print("\nDataFrame 2 (Departments):")
print(df2)
# Inner join (default)
print("\n1. Inner Join (intersection):")
inner_merged = pd.merge(df1, df2, on='ID', how='inner')
print(inner_merged)
# Left join
print("\n2. Left Join (keep all from left):")
left_merged = pd.merge(df1, df2, on='ID', how='left')
print(left_merged)
# Right join
print("\n3. Right Join (keep all from right):")
right_merged = pd.merge(df1, df2, on='ID', how='right')
print(right_merged)
# Outer join
print("\n4. Outer Join (keep all from both):")
outer_merged = pd.merge(df1, df2, on='ID', how='outer')
print(outer_merged)
# Merge on different column names
df3 = pd.DataFrame({
'EmpID': [1, 2, 3],
'Project': ['A', 'B', 'C']
})
print("\n5. Merge on different column names:")
merged_diff = pd.merge(df1, df3, left_on='ID', right_on='EmpID')
print(merged_diff)
# Concatenate DataFrames vertically
df_top = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df_bottom = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})
print("\n6. Concatenate vertically (stack rows):")
print("Top:")
print(df_top)
print("Bottom:")
print(df_bottom)
print("Result:")
vertical_concat = pd.concat([df_top, df_bottom], ignore_index=True)
print(vertical_concat)
# Concatenate horizontally
df_left = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df_right = pd.DataFrame({'C': [7, 8, 9], 'D': [10, 11, 12]})
print("\n7. Concatenate horizontally (add columns):")
horizontal_concat = pd.concat([df_left, df_right], axis=1)
print(horizontal_concat)
# Join (using index)
df_indexed1 = df1.set_index('ID')
df_indexed2 = df2.set_index('ID')
print("\n8. Join using index:")
joined = df_indexed1.join(df_indexed2, how='inner')
print(joined)
print("\nKEY TAKEAWAYS:")
print("- merge(): SQL-style joins on columns")
print("- concat(): Stack DataFrames vertically or horizontally")
print("- join(): Merge on index")
print("- Join types: inner, left, right, outer")
print("- Use on= for same column names, left_on=/right_on= for different names")