Skip to content

Self Merge

A self merge joins a DataFrame with itself, useful for comparing rows within the same table or representing hierarchical relationships.

Mental Model

A self merge treats the same table as both left and right, matching different columns against each other. The classic example is an employee table where manager_id references another row's id. By merging the table on manager_id = id, each employee row gains its manager's details -- a pattern that models any tree or graph stored in a flat table.

Basic Concept

Merge a DataFrame with itself using different columns.

1. Employee-Manager Example

```python import pandas as pd

employee = pd.DataFrame({ 'id': [1, 2, 3, 4], 'name': ['John', 'Doe', 'Jane', 'Smith'], 'salary': [50000, 40000, 60000, 30000], 'managerId': [None, 1, 1, 2] }) print(employee) ```

id name salary managerId 0 1 John 50000 NaN 1 2 Doe 40000 1.0 2 3 Jane 60000 1.0 3 4 Smith 30000 2.0

2. Self Merge

python merged = pd.merge( left=employee, right=employee, left_on='managerId', right_on='id', how='inner', suffixes=('_employee', '_manager') ) print(merged)

3. Result Interpretation

Each row pairs an employee with their manager.

LeetCode Example: Higher Salary than Manager

Find employees earning more than their managers.

1. Self Merge

python merged = pd.merge( left=employee, right=employee, left_on='managerId', right_on='id', how='inner', suffixes=('_employee', '_manager') )

2. Filter Condition

python higher_earners = merged[ merged['salary_employee'] > merged['salary_manager'] ]

3. Select Result

python result = higher_earners[['name_employee']].rename( columns={'name_employee': 'Employee'} )

Hierarchical Relationships

Self merge for parent-child relationships.

1. Category Hierarchy

python categories = pd.DataFrame({ 'id': [1, 2, 3, 4, 5], 'name': ['Electronics', 'Phones', 'Laptops', 'iPhone', 'MacBook'], 'parent_id': [None, 1, 1, 2, 3] })

2. Join Parent Info

python with_parent = pd.merge( categories, categories[['id', 'name']], left_on='parent_id', right_on='id', how='left', suffixes=('', '_parent') )

3. Result

```python

Shows each category with its parent category name

```

Comparing Consecutive Rows

Self merge to compare rows.

1. Stock Data

python stock = pd.DataFrame({ 'date': pd.date_range('2024-01-01', periods=5), 'price': [100, 102, 101, 105, 103] }) stock['row_num'] = range(len(stock))

2. Join Consecutive Rows

python stock_comparison = pd.merge( stock, stock, left_on='row_num', right_on=stock['row_num'] + 1, suffixes=('_today', '_yesterday') )

3. Calculate Change

python stock_comparison['change'] = ( stock_comparison['price_today'] - stock_comparison['price_yesterday'] )

Join Types in Self Merge

Choose appropriate join type.

1. Inner Join

```python

Only rows with valid relationship

pd.merge(df, df, left_on='parent_id', right_on='id', how='inner')

Excludes rows without parent (root nodes)

```

2. Left Join

```python

Keep all original rows

pd.merge(df, df, left_on='parent_id', right_on='id', how='left')

Root nodes have NaN for parent info

```

3. Use Case Selection

```python

Inner: When relationship is required

Left: When preserving all original rows matters

```

Performance Considerations

Self merge creates more rows.

1. Cartesian Product Warning

```python

Without proper keys, self merge creates N×N rows

Always specify join columns carefully

```

2. Large DataFrames

```python

Self merge can be memory-intensive

Consider filtering before merge

```

3. Alternative Approaches

```python

For simple comparisons, consider:

df['prev_value'] = df['value'].shift(1) ```


Exercises

Exercise 1. Create an employees DataFrame with 'id', 'name', and 'manager_id' columns. Perform a self merge to add each employee's manager name as a new column.

Solution to Exercise 1

Self merge to find each employee's manager name.

import pandas as pd

employees = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Carol', 'Dave'],
    'manager_id': [None, 1, 1, 2]
})
result = pd.merge(
    employees, employees[['id', 'name']],
    left_on='manager_id', right_on='id',
    suffixes=('', '_manager'), how='left'
)
print(result[['name', 'name_manager']])

Exercise 2. Create a flights DataFrame with 'origin' and 'destination' airport codes. Perform a self merge to find all pairs of flights that can be connected (where the destination of one flight matches the origin of another).

Solution to Exercise 2

Find connecting flights via self merge.

import pandas as pd

flights = pd.DataFrame({
    'flight': ['F1', 'F2', 'F3'],
    'origin': ['NYC', 'CHI', 'LAX'],
    'destination': ['CHI', 'LAX', 'NYC']
})
connections = pd.merge(
    flights, flights,
    left_on='destination', right_on='origin',
    suffixes=('_first', '_second')
)
print(connections[['flight_first', 'origin_first', 'destination_first', 'flight_second', 'destination_second']])

Exercise 3. Create a table of parent-child relationships. Use a self merge to find all grandparent-grandchild pairs (merge twice: child to parent, then parent to grandparent).

Solution to Exercise 3

Find grandparent-grandchild pairs with two self merges.

import pandas as pd

family = pd.DataFrame({
    'person': ['Alice', 'Bob', 'Carol', 'Dave'],
    'parent': ['Bob', 'Carol', None, 'Carol']
})
# First merge: person -> parent
step1 = pd.merge(family, family[['person', 'parent']], left_on='parent', right_on='person', suffixes=('', '_gp'))
# parent_gp is the grandparent
grandparents = step1[step1['parent_gp'].notna()][['person', 'parent_gp']]
grandparents.columns = ['grandchild', 'grandparent']
print(grandparents)