Keyword - indicator¶
The indicator parameter adds a column showing the source of each row, useful for debugging and understanding merge results.
Mental Model
indicator=True adds a _merge column with values left_only, right_only, or both, revealing exactly which table contributed each row. It is the best debugging tool for merges -- use it to verify that an inner join did not silently drop rows or that an outer join introduced expected NaN rows.
Basic Usage¶
Add a column indicating merge source.
1. Enable Indicator¶
```python import pandas as pd
df1 = pd.DataFrame({ 'city': ['NY', 'SF', 'LA'], 'temperature': [21, 14, 35] })
df2 = pd.DataFrame({ 'city': ['SF', 'NY', 'ICN'], 'humidity': [65, 68, 75] })
df = pd.merge(df1, df2, on='city', how='outer', indicator=True) print(df) ```
city temperature humidity _merge
0 NY 21.0 68.0 both
1 SF 14.0 65.0 both
2 LA 35.0 NaN left_only
3 ICN NaN 75.0 right_only
2. Indicator Values¶
both: Key exists in both DataFramesleft_only: Key exists only in left DataFrameright_only: Key exists only in right DataFrame
3. Column Name¶
Default column name is _merge.
Custom Column Name¶
Specify indicator column name.
1. String Value¶
python
df = pd.merge(
df1, df2,
on='city',
how='outer',
indicator='source'
)
print(df['source'])
0 both
1 both
2 left_only
3 right_only
Name: source, dtype: object
2. Descriptive Name¶
python
df = pd.merge(
df1, df2,
on='city',
how='outer',
indicator='merge_status'
)
3. Avoid Conflicts¶
```python
Name must not conflict with existing columns¶
```
Filtering by Indicator¶
Use indicator to filter results.
1. Find Unmatched Left¶
python
df = pd.merge(df1, df2, on='city', how='outer', indicator=True)
unmatched_left = df[df['_merge'] == 'left_only']
print(unmatched_left)
city temperature humidity _merge
2 LA 35.0 NaN left_only
2. Find Unmatched Right¶
python
unmatched_right = df[df['_merge'] == 'right_only']
3. Find Matched Only¶
python
matched = df[df['_merge'] == 'both']
Data Quality Checks¶
Use indicator for merge validation.
1. Count Match Types¶
python
df = pd.merge(df1, df2, on='key', how='outer', indicator=True)
print(df['_merge'].value_counts())
both 100
left_only 15
right_only 20
Name: _merge, dtype: int64
2. Validate Complete Match¶
python
unmatched = df[df['_merge'] != 'both']
if len(unmatched) > 0:
print(f"Warning: {len(unmatched)} unmatched rows")
3. Data Reconciliation¶
```python
Identify missing records in each source¶
missing_in_right = df[df['_merge'] == 'left_only']['key'] missing_in_left = df[df['_merge'] == 'right_only']['key'] ```
Drop Indicator Column¶
Remove indicator after use.
1. Drop Column¶
```python df = pd.merge(df1, df2, on='city', how='outer', indicator=True)
Use indicator for filtering¶
matched_df = df[df['_merge'] == 'both'].drop('_merge', axis=1) ```
2. Chain Operations¶
python
matched_df = (
pd.merge(df1, df2, on='city', how='outer', indicator=True)
.query("_merge == 'both'")
.drop('_merge', axis=1)
)
3. Select Columns¶
python
result = df[['city', 'temperature', 'humidity']]
Practical Applications¶
Real-world uses for indicator.
1. Find New Records¶
```python
Find records added since last sync¶
new_records = merged[merged['_merge'] == 'right_only'] ```
2. Find Deleted Records¶
```python
Find records removed since last sync¶
deleted_records = merged[merged['_merge'] == 'left_only'] ```
3. Debug Merge Issues¶
```python
Understand why rows are missing¶
if merged['_merge'].value_counts().get('left_only', 0) > expected: print("Investigate: too many unmatched left rows") ```
Exercises¶
Exercise 1.
Create two DataFrames and perform an outer merge with indicator=True. Filter the result to show only rows that appear in the left DataFrame but not the right (i.e., _merge == 'left_only').
Solution to Exercise 1
Find left-only rows using the indicator column.
import pandas as pd
df1 = pd.DataFrame({'key': ['a', 'b', 'c'], 'val1': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['b', 'c', 'd'], 'val2': [4, 5, 6]})
result = pd.merge(df1, df2, on='key', how='outer', indicator=True)
left_only = result[result['_merge'] == 'left_only']
print(left_only)
Exercise 2.
Use indicator=True with an inner merge. Verify that all rows in the result have _merge == 'both' (since inner join only keeps matches).
Solution to Exercise 2
Verify inner merge indicator values are all 'both'.
import pandas as pd
df1 = pd.DataFrame({'key': ['a', 'b', 'c'], 'val1': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['b', 'c', 'd'], 'val2': [4, 5, 6]})
result = pd.merge(df1, df2, on='key', how='inner', indicator=True)
assert (result['_merge'] == 'both').all()
print("All inner join rows are 'both'.")
Exercise 3.
Use indicator='source' (custom column name) in an outer merge. Use value_counts() on the indicator column to summarize how many rows came from each source.
Solution to Exercise 3
Use a custom indicator column name and summarize.
import pandas as pd
df1 = pd.DataFrame({'id': [1, 2, 3], 'name': ['A', 'B', 'C']})
df2 = pd.DataFrame({'id': [2, 3, 4], 'score': [80, 90, 70]})
result = pd.merge(df1, df2, on='id', how='outer', indicator='source')
print(result['source'].value_counts())