filter Method¶

The filter() method removes groups that do not satisfy a condition, keeping or excluding entire groups.

Mental Model

filter() is a group-level WHERE clause: it tests a condition on each group as a whole and keeps or drops the entire group. Unlike boolean indexing which tests row by row, filter asks "does this group meet the criterion?" and returns all rows of qualifying groups unchanged.

Basic Usage¶

Filter groups based on a condition.

1. Filter by Group Size¶

```python import pandas as pd

df = pd.DataFrame({ 'group': ['A', 'A', 'A', 'B', 'B', 'C'], 'value': [1, 2, 3, 4, 5, 6] })

Keep only groups with more than 2 members¶

result = df.groupby('group').filter(lambda x: len(x) > 2) print(result) ```

group value 0 A 1 1 A 2 2 A 3

2. Filter Function¶

The function receives each group DataFrame and returns True/False.

3. All or Nothing¶

If condition is True, entire group is kept; otherwise, entire group is removed.

Common Filtering Patterns¶

Typical filter conditions.

1. Minimum Group Size¶

python df.groupby('group').filter(lambda x: len(x) >= 5)

2. Value Threshold¶

```python

Keep groups where mean exceeds threshold¶

df.groupby('group').filter(lambda x: x['value'].mean() > 10) ```

3. All Values Meet Condition¶

```python

Keep groups where all values are positive¶

df.groupby('group').filter(lambda x: (x['value'] > 0).all()) ```

LeetCode Example: Classes with Students¶

Find classes with at least 5 students.

1. Sample Data¶

python courses = pd.DataFrame({ 'class': ['Math', 'Math', 'Math', 'Math', 'Math', 'Art', 'Art', 'Music', 'Music', 'Music'], 'student': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'] })

2. Filter Large Classes¶

python large_classes = courses.groupby('class').filter(lambda x: len(x) >= 5) print(large_classes)

3. Get Class Names¶

```python

Get unique class names¶

classes = large_classes['class'].unique() ```

Filter vs Boolean Indexing¶

Compare approaches.

1. Boolean Indexing¶

```python

Filter individual rows¶

df[df['value'] > 5] ```

2. GroupBy Filter¶

```python

Filter entire groups¶

df.groupby('group').filter(lambda x: x['value'].mean() > 5) ```

3. Key Difference¶

Boolean indexing filters rows; filter removes groups.

dropna Parameter¶

Handle groups with missing values.

1. Default Behavior¶

```python

dropna=True (default): ignore groups with NA¶

```

2. Include NA Groups¶

python df.groupby('group', dropna=False).filter(lambda x: len(x) > 1)

3. Filter NA Groups¶

python df.groupby('group').filter(lambda x: x['value'].notna().all())

Performance Considerations¶

Optimize filter operations.

1. Simple Conditions First¶

```python

Pre-filter when possible¶

valid_groups = df.groupby('group').size() >= 5 valid_group_names = valid_groups[valid_groups].index df[df['group'].isin(valid_group_names)] ```

2. Avoid Lambda When Possible¶

```python

Faster alternative for size filter¶

group_sizes = df.groupby('group').size() valid = group_sizes[group_sizes >= 5].index df[df['group'].isin(valid)] ```

3. Use Built-in Methods¶

Built-in aggregations are faster than custom lambdas.

Exercises¶

Exercise 1. Group a DataFrame by 'city' and use .filter() to keep only cities that have more than 3 records in the dataset.

Solution to Exercise 1

Filter groups by size using len(x).

import pandas as pd

df = pd.DataFrame({
    'city': ['NY', 'NY', 'NY', 'NY', 'LA', 'LA', 'SF'],
    'value': [10, 20, 30, 40, 50, 60, 70]
})
result = df.groupby('city').filter(lambda x: len(x) > 3)
print(result)

Exercise 2. Use .filter() to keep groups where the mean of the 'score' column exceeds 80. Print the entire filtered DataFrame (not just the group summaries).

Solution to Exercise 2

Filter groups by mean score threshold.

import pandas as pd

df = pd.DataFrame({
    'class': ['A', 'A', 'B', 'B', 'C', 'C'],
    'score': [85, 90, 70, 75, 95, 88]
})
result = df.groupby('class').filter(lambda x: x['score'].mean() > 80)
print(result)

Exercise 3. Compare the performance of using .filter(lambda x: len(x) >= 5) versus the manual approach of computing group sizes, finding valid groups, and using .isin(). Time both approaches on a larger DataFrame.

Solution to Exercise 3

Compare filter with manual isin approach.

import pandas as pd
import numpy as np
import time

np.random.seed(42)
df = pd.DataFrame({
    'group': np.random.choice(list('ABCDEFGHIJ'), 10000),
    'value': np.random.randn(10000)
})

start = time.time()
r1 = df.groupby('group').filter(lambda x: len(x) >= 900)
t1 = time.time() - start

start = time.time()
sizes = df.groupby('group').size()
valid = sizes[sizes >= 900].index
r2 = df[df['group'].isin(valid)]
t2 = time.time() - start

print(f"filter(): {t1:.4f}s, isin(): {t2:.4f}s")