filter Method

The filter() method keeps or drops entire groups according to a condition evaluated once per group. It returns the original rows (not an aggregate) of every group that satisfies the condition.

Basic Usage

Filter groups based on a condition.

1. Filter by Group Size

import pandas as pd

df = pd.DataFrame({
    'group': ['A', 'A', 'A', 'B', 'B', 'C'],
    'value': [1, 2, 3, 4, 5, 6]
})

# Keep only groups with more than 2 members
result = df.groupby('group').filter(lambda x: len(x) > 2)
print(result)
  group  value
0     A      1
1     A      2
2     A      3

2. Filter Function

The function receives each group as a DataFrame and must return a single True or False.

3. All or Nothing

If the condition is True, the entire group is kept; otherwise, the entire group is removed.
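The all-or-nothing behavior can be illustrated with a small sketch. It reuses the sample data from above; the helper name `mean_above_four` and the threshold of 4 are chosen only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'group': ['A', 'A', 'A', 'B', 'B', 'C'],
    'value': [1, 2, 3, 4, 5, 6]
})

def mean_above_four(g):
    # g is the sub-DataFrame holding the rows of one group;
    # the function returns one boolean for the whole group
    return g['value'].mean() > 4

kept = df.groupby('group').filter(mean_above_four)

# Group means: A = 2.0, B = 4.5, C = 6.0, so A is removed as a whole.
# The row with value 4 stays because its group B passes,
# even though 4 > 4 is False for that single row.
print(kept)
```

The result keeps the original index labels (3, 4, 5 here), which makes it easy to align the filtered rows with the source DataFrame.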

Common Filtering Patterns

Typical filter conditions.

1. Minimum Group Size

df.groupby('group').filter(lambda x: len(x) >= 5)

2. Value Threshold

# Keep groups where mean exceeds threshold
df.groupby('group').filter(lambda x: x['value'].mean() > 10)

3. All Values Meet Condition

# Keep groups where all values are positive
df.groupby('group').filter(lambda x: (x['value'] > 0).all())
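To see how the three patterns behave on the same data, here is a minimal sketch. The `sales` DataFrame and its values are made up for illustration and do not come from a real dataset.

```python
import pandas as pd

# Hypothetical data: group A has 5 rows, B has 2, C has 3
sales = pd.DataFrame({
    'group': ['A'] * 5 + ['B'] * 2 + ['C'] * 3,
    'value': [12, 15, 11, 14, 13, -1, 30, 2, 3, 4]
})

grouped = sales.groupby('group')

# Pattern 1: minimum group size (only A has at least 5 rows)
min_size = grouped.filter(lambda x: len(x) >= 5)

# Pattern 2: value threshold (means are A = 13.0, B = 14.5, C = 3.0)
high_mean = grouped.filter(lambda x: x['value'].mean() > 10)

# Pattern 3: all values meet a condition (B contains -1, so it is removed)
all_positive = grouped.filter(lambda x: (x['value'] > 0).all())

print(sorted(min_size['group'].unique()))
print(sorted(high_mean['group'].unique()))
print(sorted(all_positive['group'].unique()))
```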

LeetCode Example: Classes with Students

Find classes with at least 5 students.

1. Sample Data

courses = pd.DataFrame({
    'class': ['Math', 'Math', 'Math', 'Math', 'Math',
              'Art', 'Art', 'Music', 'Music', 'Music'],
    'student': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']
})

2. Filter Large Classes

large_classes = courses.groupby('class').filter(lambda x: len(x) >= 5)
print(large_classes)

3. Get Class Names

# Get unique class names
classes = large_classes['class'].unique()
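The three steps above can be combined into one function that returns a single-column DataFrame of class names. This is a sketch, not a reference solution: the function name `find_classes` and the `min_students` parameter are assumptions made for this example.

```python
import pandas as pd

def find_classes(courses: pd.DataFrame, min_students: int = 5) -> pd.DataFrame:
    # Keep only the rows of classes that have enough students
    large = courses.groupby('class').filter(lambda x: len(x) >= min_students)
    # Reduce to one row per class name
    return large[['class']].drop_duplicates().reset_index(drop=True)

courses = pd.DataFrame({
    'class': ['Math', 'Math', 'Math', 'Math', 'Math',
              'Art', 'Art', 'Music', 'Music', 'Music'],
    'student': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']
})

result = find_classes(courses)
print(result)
```

With the sample data, only Math reaches five students, so the result contains that single class.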

Filter vs Boolean Indexing

Compare approaches.

1. Boolean Indexing

# Filter individual rows
df[df['value'] > 5]

2. GroupBy Filter

# Filter entire groups
df.groupby('group').filter(lambda x: x['value'].mean() > 5)

3. Key Difference

Boolean indexing evaluates the condition per row; filter() evaluates it per group and keeps or removes all of that group's rows together.
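A side-by-side sketch on the sample data makes the difference concrete. The threshold of 2 is chosen here only so that the two approaches disagree on group A.

```python
import pandas as pd

df = pd.DataFrame({
    'group': ['A', 'A', 'A', 'B', 'B', 'C'],
    'value': [1, 2, 3, 4, 5, 6]
})

# Row-level: each row is tested on its own value
by_row = df[df['value'] > 2]

# Group-level: each group is tested on its mean (A = 2.0, B = 4.5, C = 6.0)
by_group = df.groupby('group').filter(lambda x: x['value'].mean() > 2)

print(by_row.index.tolist())    # the row with value 3 from group A is kept
print(by_group.index.tolist())  # group A is dropped entirely
```

Boolean indexing keeps the one row of group A whose value exceeds 2, while the group filter discards all of group A because its mean does not exceed 2.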

dropna Parameter

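Control how rows with a missing (NA) group key are handled. The dropna argument shown here belongs to groupby(), not to filter(). filter() also accepts its own dropna argument: when it is set to False, the rows of rejected groups are kept in the result but filled with NaN.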

1. Default Behavior

# dropna=True (default): rows with an NA group key are excluded before filtering
df.groupby('group').filter(lambda x: len(x) > 1)

2. Include NA Groups

df.groupby('group', dropna=False).filter(lambda x: len(x) > 1)

3. Drop Groups Containing Missing Values

# Keep only groups whose 'value' column has no missing entries
df.groupby('group').filter(lambda x: x['value'].notna().all())
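The effect of the groupby() dropna argument is easiest to see on group sizes. The sketch below uses a made-up `df_na` DataFrame with two rows whose group key is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two rows have no group key
df_na = pd.DataFrame({
    'group': ['A', 'A', np.nan, np.nan, 'B'],
    'value': [1, 2, 3, 4, 5]
})

# Default: rows with an NA key are not assigned to any group
sizes_default = df_na.groupby('group').size()

# dropna=False: the NA key forms its own group (sorted last)
sizes_with_na = df_na.groupby('group', dropna=False).size()

# With the default, the NA-keyed rows cannot pass any filter;
# B is also removed here because it has only one row
kept_default = df_na.groupby('group').filter(lambda x: len(x) > 1)

print(sizes_default)
print(sizes_with_na)
print(kept_default)
```

How filter() treats the NA group under dropna=False has varied between pandas releases, so it is worth checking the result on the version in use.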

Performance Considerations

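filter() calls the Python function once per group, which can become slow when there are many groups. The patterns below reduce that overhead.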

1. Simple Conditions First

# Compute the condition once per group, then select rows by group membership
valid_groups = df.groupby('group').size() >= 5
valid_group_names = valid_groups[valid_groups].index
df[df['group'].isin(valid_group_names)]

2. Avoid Lambda When Possible

# Faster alternative for size filter
group_sizes = df.groupby('group').size()
valid = group_sizes[group_sizes >= 5].index
df[df['group'].isin(valid)]

3. Use Built-in Methods

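Built-in aggregations such as size() and mean() run in optimized compiled code, whereas a lambda passed to filter() is called in Python once per group, so the built-ins are usually faster on data with many groups.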
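One way to apply this to a mean-based condition is transform('mean'), which broadcasts each group's mean back onto its rows so that ordinary boolean indexing can be used. This is a sketch of one possible rewrite on the sample data; any speedup depends on the data and the pandas version.

```python
import pandas as pd

df = pd.DataFrame({
    'group': ['A', 'A', 'A', 'B', 'B', 'C'],
    'value': [1, 2, 3, 4, 5, 6]
})

# Lambda version: the function is called once per group
with_lambda = df.groupby('group').filter(lambda x: x['value'].mean() > 4)

# Built-in version: one mean per row, aligned with df's index
group_means = df.groupby('group')['value'].transform('mean')
with_builtin = df[group_means > 4]

print(with_builtin)
```

Both versions should select the same rows (groups B and C here); the second avoids the per-group Python call.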