Keyword - indicator¶
The indicator parameter adds a column showing the source of each row, useful for debugging and understanding merge results.
Basic Usage¶
Add a column indicating merge source.
1. Enable Indicator¶
import pandas as pd
df1 = pd.DataFrame({
'city': ['NY', 'SF', 'LA'],
'temperature': [21, 14, 35]
})
df2 = pd.DataFrame({
'city': ['SF', 'NY', 'ICN'],
'humidity': [65, 68, 75]
})
df = pd.merge(df1, df2, on='city', how='outer', indicator=True)
print(df)
city temperature humidity _merge
0 NY 21.0 68.0 both
1 SF 14.0 65.0 both
2 LA 35.0 NaN left_only
3 ICN NaN 75.0 right_only
2. Indicator Values¶
both: Key exists in both DataFramesleft_only: Key exists only in left DataFrameright_only: Key exists only in right DataFrame
3. Column Name¶
Default column name is _merge.
Custom Column Name¶
Specify indicator column name.
1. String Value¶
df = pd.merge(
df1, df2,
on='city',
how='outer',
indicator='source'
)
print(df['source'])
0 both
1 both
2 left_only
3 right_only
Name: source, dtype: object
2. Descriptive Name¶
df = pd.merge(
df1, df2,
on='city',
how='outer',
indicator='merge_status'
)
3. Avoid Conflicts¶
# Name must not conflict with existing columns
Filtering by Indicator¶
Use indicator to filter results.
1. Find Unmatched Left¶
df = pd.merge(df1, df2, on='city', how='outer', indicator=True)
unmatched_left = df[df['_merge'] == 'left_only']
print(unmatched_left)
city temperature humidity _merge
2 LA 35.0 NaN left_only
2. Find Unmatched Right¶
unmatched_right = df[df['_merge'] == 'right_only']
3. Find Matched Only¶
matched = df[df['_merge'] == 'both']
Data Quality Checks¶
Use indicator for merge validation.
1. Count Match Types¶
df = pd.merge(df1, df2, on='key', how='outer', indicator=True)
print(df['_merge'].value_counts())
both 100
left_only 15
right_only 20
Name: _merge, dtype: int64
2. Validate Complete Match¶
unmatched = df[df['_merge'] != 'both']
if len(unmatched) > 0:
print(f"Warning: {len(unmatched)} unmatched rows")
3. Data Reconciliation¶
# Identify missing records in each source
missing_in_right = df[df['_merge'] == 'left_only']['key']
missing_in_left = df[df['_merge'] == 'right_only']['key']
Drop Indicator Column¶
Remove indicator after use.
1. Drop Column¶
df = pd.merge(df1, df2, on='city', how='outer', indicator=True)
# Use indicator for filtering
matched_df = df[df['_merge'] == 'both'].drop('_merge', axis=1)
2. Chain Operations¶
matched_df = (
pd.merge(df1, df2, on='city', how='outer', indicator=True)
.query("_merge == 'both'")
.drop('_merge', axis=1)
)
3. Select Columns¶
result = df[['city', 'temperature', 'humidity']]
Practical Applications¶
Real-world uses for indicator.
1. Find New Records¶
# Find records added since last sync
new_records = merged[merged['_merge'] == 'right_only']
2. Find Deleted Records¶
# Find records removed since last sync
deleted_records = merged[merged['_merge'] == 'left_only']
3. Debug Merge Issues¶
# Understand why rows are missing
if merged['_merge'].value_counts().get('left_only', 0) > expected:
print("Investigate: too many unmatched left rows")