
drop_duplicates Method

The drop_duplicates() method removes duplicate rows from a DataFrame.

Mental Model

drop_duplicates() scans for rows with identical values (across all columns or only the specified ones) and keeps only the first (or last) occurrence. The subset parameter narrows the uniqueness check to specific columns, and keep controls which duplicate survives. With default arguments it behaves like SQL's SELECT DISTINCT for DataFrames.
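One way to make this model concrete is to compare drop_duplicates() with the boolean mask returned by DataFrame.duplicated(). The sketch below uses a small made-up frame and suggests that filtering with the inverted mask selects the same rows.

```python
import pandas as pd

# Hypothetical sample data for illustration
df = pd.DataFrame({
    'A': [1, 1, 2, 2, 3],
    'B': ['a', 'a', 'b', 'b', 'c']
})

# duplicated() marks every repeat of an earlier row as True
mask = df.duplicated(keep='first')

# Keeping the unmarked rows should match drop_duplicates()
via_mask = df[~mask]
via_method = df.drop_duplicates(keep='first')

print(mask.tolist())                 # [False, True, False, True, False]
print(via_mask.equals(via_method))   # True
```

Working with the mask directly can be handy when the goal is to inspect or count the duplicates rather than remove them.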

Basic Usage

Remove duplicate rows.

1. Remove All Duplicates

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 1, 2, 2, 3],
    'B': ['a', 'a', 'b', 'b', 'c']
})

result = df.drop_duplicates()
print(result)
```

```
   A  B
0  1  a
2  2  b
4  3  c
```

2. Keep First (Default)

By default (keep='first'), the first occurrence of each duplicated row is kept and later copies are removed.

3. Returns New DataFrame

```python
# Original unchanged
new_df = df.drop_duplicates()
```

subset Parameter

Check specific columns for duplicates.

1. Single Column

```python
df = pd.DataFrame({
    'id': [1, 2, 1, 3],
    'name': ['Alice', 'Bob', 'Alice', 'Charlie']
})

result = df.drop_duplicates(subset='id')
print(result)
```

```
   id     name
0   1    Alice
1   2      Bob
3   3  Charlie
```

2. Multiple Columns

```python
result = df.drop_duplicates(subset=['id', 'name'])
```

3. All Columns (Default)

```python
# subset=None checks all columns
result = df.drop_duplicates()  # Uses all columns
```
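The choice of subset can change how many rows survive. As a rough illustration with made-up data (the name 'Alicia' is hypothetical), the same id appears twice with different names, so checking only id removes a row while checking both columns removes nothing.

```python
import pandas as pd

# Hypothetical data: id 1 appears twice, but with different names
df = pd.DataFrame({
    'id': [1, 2, 1, 3],
    'name': ['Alice', 'Bob', 'Alicia', 'Charlie']
})

# Uniqueness judged on 'id' alone: the second id=1 row is dropped
by_id = df.drop_duplicates(subset='id')

# Uniqueness judged on the (id, name) pair: every pair is distinct
by_id_and_name = df.drop_duplicates(subset=['id', 'name'])

print(len(by_id))           # 3
print(len(by_id_and_name))  # 4
```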

keep Parameter

Control which duplicate to keep.

1. Keep First (Default)

```python
df.drop_duplicates(keep='first')
# Keeps first occurrence
```

2. Keep Last

```python
df.drop_duplicates(keep='last')
# Keeps last occurrence
```

3. Keep None (Remove All)

```python
df.drop_duplicates(keep=False)
# Removes every copy of a duplicated row; only rows that occur once remain
```
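To see the three keep options side by side, the following sketch applies each one to the same small, made-up column and prints the index labels that survive.

```python
import pandas as pd

# Hypothetical data: 1 and 2 each appear twice, 3 appears once
df = pd.DataFrame({'A': [1, 1, 2, 2, 3]})

first = df.drop_duplicates(keep='first').index.tolist()
last = df.drop_duplicates(keep='last').index.tolist()
none = df.drop_duplicates(keep=False).index.tolist()

print(first)  # [0, 2, 4]
print(last)   # [1, 3, 4]
print(none)   # [4]
```

Comparing index labels rather than values makes it clear which physical row was retained in each case.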

LeetCode Example: Delete Duplicate Emails

Keep first occurrence by email.

1. Sample Data

```python
person = pd.DataFrame({
    'id': [1, 2, 3],
    'email': ['a@example.com', 'b@example.com', 'a@example.com']
})
```

2. Remove Duplicates

```python
person.drop_duplicates(subset='email', inplace=True)
print(person)
```

```
   id          email
0   1  a@example.com
1   2  b@example.com
```

3. Sorted First

```python
# Sort to control which row is kept
person = person.sort_values('id')
person.drop_duplicates(subset='email', keep='first', inplace=True)
```
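Sorting matters only when the rows do not already arrive in the desired order. The sketch below uses a deliberately shuffled, made-up table to suggest how sorting by id first changes which row survives for a repeated email.

```python
import pandas as pd

# Hypothetical data: the larger id for a@example.com appears first
person = pd.DataFrame({
    'id': [3, 1, 2],
    'email': ['a@example.com', 'a@example.com', 'b@example.com']
})

# Without sorting, the first row in storage order wins (id 3)
unsorted_ids = person.drop_duplicates(subset='email')['id'].tolist()

# Sorting by id first means the smallest id per email wins (id 1)
sorted_ids = (
    person.sort_values('id')
          .drop_duplicates(subset='email', keep='first')['id']
          .tolist()
)

print(unsorted_ids)  # [3, 2]
print(sorted_ids)    # [1, 2]
```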

LeetCode Example: Second Highest Salary

Get unique sorted values.

1. Unique Salaries

```python
employee = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'salary': [100, 200, 200, 300]
})

unique_salaries = employee['salary'].drop_duplicates()
```

2. Sort Descending

```python
sorted_salaries = unique_salaries.sort_values(ascending=False)
print(sorted_salaries)
```

```
3    300
1    200
0    100
```

3. Get Second Highest

```python
if len(sorted_salaries) >= 2:
    second_highest = sorted_salaries.iloc[1]
else:
    second_highest = None
```
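The three steps above can be combined into one helper. This is a minimal sketch: the function name second_highest_salary and the second test frame are assumptions made for illustration, not part of the original problem statement.

```python
import pandas as pd

def second_highest_salary(employee: pd.DataFrame):
    """Return the second highest distinct salary, or None if there is none.

    Hypothetical helper wrapping the steps shown above.
    """
    distinct = (
        employee['salary']
        .drop_duplicates()
        .sort_values(ascending=False)
    )
    if len(distinct) >= 2:
        return distinct.iloc[1]
    return None

employee = pd.DataFrame({'id': [1, 2, 3, 4], 'salary': [100, 200, 200, 300]})
# Only one distinct salary, so there is no second highest
single = pd.DataFrame({'id': [1, 2], 'salary': [100, 100]})

print(second_highest_salary(employee))  # 200
print(second_highest_salary(single))    # None
```

Dropping duplicates before sorting is what makes the repeated 200 count once, so position 1 holds the second distinct value rather than a tie for first or second place.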

LeetCode Example: Consecutive Numbers

Drop duplicates after filtering.

1. Find Consecutive

```python
logs = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6],
    'num': [1, 1, 1, 2, 2, 2]
})
```

2. Filter and Drop

```python
# After filtering for consecutive numbers
consecutive = logs[
    (logs['num'] == logs['num'].shift(1))
    & (logs['num'] == logs['num'].shift(2))
]

# Drop duplicate numbers
result = consecutive.drop_duplicates('num')
```

3. Unique Values Only

```python
unique_nums = result[['num']].rename(columns={'num': 'ConsecutiveNums'})
```
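The reason drop_duplicates() is needed here becomes clearer with a run longer than three. In the sketch below (made-up data, and assuming rows are already ordered by consecutive id values), a run of four 1s matches the filter twice, and deduplication collapses those matches into one answer.

```python
import pandas as pd

# Hypothetical data: four 1s in a row, then only two 2s
logs = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6],
    'num': [1, 1, 1, 1, 2, 2]
})

num = logs['num']
# True where a value equals both of the two values before it
three_in_a_row = (num == num.shift(1)) & (num == num.shift(2))

matches = logs[three_in_a_row]
answer = (
    matches.drop_duplicates('num')[['num']]
           .rename(columns={'num': 'ConsecutiveNums'})
)

print(len(matches))                        # 2
print(answer['ConsecutiveNums'].tolist())  # [1]
```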

LeetCode Example: Investments in 2016

Use keep=False to drop every row whose location appears more than once.

1. Sample Data

```python
insurance = pd.DataFrame({
    'pid': [1, 2, 3, 4],
    'lat': [10.0, 10.0, 20.0, 20.0],
    'lon': [5.0, 5.0, 15.0, 25.0],
    'tiv_2016': [100, 200, 300, 400]
})
```

2. Remove All Duplicates

```python
# Keep only unique lat/lon combinations
unique_locations = insurance.drop_duplicates(
    subset=['lat', 'lon'],
    keep=False
)
```

3. Result

```python
print(unique_locations)
```

```
   pid   lat   lon  tiv_2016
2    3  20.0  15.0       300
3    4  20.0  25.0       400
```

inplace Parameter

Modify DataFrame directly.

1. Without inplace

```python
result = df.drop_duplicates()
# df unchanged
```

2. With inplace

```python
df.drop_duplicates(inplace=True)
# df modified directly
```

3. Reassignment Preferred

```python
df = df.drop_duplicates()
```
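One practical reason to prefer reassignment is that the inplace=True call returns None, so combining the two styles by accident discards the data. A minimal sketch with made-up values:

```python
import pandas as pd

# Hypothetical data with one repeated value
df = pd.DataFrame({'A': [1, 1, 2]})

# With inplace=True the method mutates df and returns None
returned = df.drop_duplicates(inplace=True)

print(returned)  # None
print(len(df))   # 2
```

Writing df = df.drop_duplicates(inplace=True) would therefore rebind df to None, which is a common source of confusing errors later in a script.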

ignore_index Parameter

Reset index after dropping.

1. Keep Original Index

```python
result = df.drop_duplicates()
# Keeps original index values
```

2. Reset Index

```python
result = df.drop_duplicates(ignore_index=True)
# Index is 0, 1, 2, ...
```

3. Equivalent To

```python
result = df.drop_duplicates().reset_index(drop=True)
```
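The equivalence can be checked directly. The sketch below uses made-up data and assumes a pandas version that supports ignore_index on drop_duplicates() (1.0 or later); it compares the index labels produced by each form.

```python
import pandas as pd

# Hypothetical data: duplicates at positions 1 and 3
df = pd.DataFrame({'A': [1, 1, 2, 2, 3]})

kept = df.drop_duplicates()                            # original labels
renumbered = df.drop_duplicates(ignore_index=True)     # fresh 0..n-1 labels
chained = df.drop_duplicates().reset_index(drop=True)  # two-step version

print(kept.index.tolist())         # [0, 2, 4]
print(renumbered.index.tolist())   # [0, 1, 2]
print(renumbered.equals(chained))  # True
```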


Exercises

Exercise 1. Create a DataFrame with duplicate rows. Use drop_duplicates() to remove them. Then use drop_duplicates(keep='last') and compare which rows are retained versus the default keep='first'.

Solution to Exercise 1

Compare keep='first' vs keep='last'.

import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 1, 2, 3],
    'B': ['x', 'y', 'x', 'y', 'z']
})
print("keep='first':")
print(df.drop_duplicates())
print("\nkeep='last':")
print(df.drop_duplicates(keep='last'))

Exercise 2. Create a DataFrame where some rows have the same value in one column but different values in another. Use drop_duplicates(subset=['column_name']) to remove duplicates based on a single column. Observe which rows are kept.

Solution to Exercise 2

Drop duplicates based on a subset of columns.

import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Alice', 'Carol'],
    'score': [85, 90, 92, 88]
})
result = df.drop_duplicates(subset=['name'])
print(result)
# Only first occurrence of each name is kept

Exercise 3. Create a DataFrame with email addresses where some are duplicates. Use drop_duplicates(subset=['email'], keep=False) to remove all rows that have any duplicate email (keeping none of them). Count how many rows remain.

Solution to Exercise 3

Remove all rows with duplicate emails.

import pandas as pd

df = pd.DataFrame({
    'user': ['Alice', 'Bob', 'Carol', 'Dave', 'Eve'],
    'email': ['a@x.com', 'b@x.com', 'a@x.com', 'c@x.com', 'b@x.com']
})
result = df.drop_duplicates(subset=['email'], keep=False)
print(result)
print(f"Rows remaining: {len(result)}")