Boolean Indexing¶
Filter DataFrame rows using boolean conditions and masks.
Mental Model
Boolean indexing is a two-step process: first create a True/False mask the same length as your DataFrame, then pass it inside [] to keep only the True rows. The mask is just a Series of booleans, and you combine masks with & (and), | (or), and ~ (not) -- never Python's and/or.
Basic Comparison¶
Filter using comparison operators.
1. Single Condition¶
```python import pandas as pd
df = pd.DataFrame({ 'name': ['Alice', 'Bob', 'Charlie', 'David'], 'age': [25, 30, 35, 40], 'salary': [50000, 60000, 70000, 80000] })
Filter age > 30¶
result = df[df['age'] > 30] print(result) ```
name age salary
2 Charlie 35 70000
3 David 40 80000
2. Boolean Mask¶
python
mask = df['age'] > 30
print(mask)
0 False
1 False
2 True
3 True
Name: age, dtype: bool
3. Apply Mask¶
python
result = df[mask]
Multiple Conditions¶
Combine conditions with logical operators.
1. AND Condition (&)¶
```python
Age > 30 AND salary > 60000¶
result = df[(df['age'] > 30) & (df['salary'] > 60000)] ```
2. OR Condition (|)¶
```python
Age > 35 OR salary < 55000¶
result = df[(df['age'] > 35) | (df['salary'] < 55000)] ```
3. NOT Condition (~)¶
```python
NOT age > 30¶
result = df[~(df['age'] > 30)] ```
Parentheses Required¶
Always use parentheses with multiple conditions.
1. Correct Syntax¶
```python
Correct - parentheses around each condition¶
df[(df['age'] > 30) & (df['salary'] > 60000)] ```
2. Incorrect Syntax¶
```python
Wrong - will raise error¶
df[df['age'] > 30 & df['salary'] > 60000]¶
```
3. Operator Precedence¶
& has higher precedence than comparison operators.
loc with Boolean¶
Use loc for boolean indexing with column selection.
1. Filter and Select¶
python
result = df.loc[df['age'] > 30, ['name', 'salary']]
print(result)
name salary
2 Charlie 70000
3 David 80000
2. Modify Values¶
python
df.loc[df['age'] > 30, 'bonus'] = 5000
3. Multiple Conditions¶
python
df.loc[(df['age'] > 30) & (df['salary'] > 60000), 'level'] = 'Senior'
LeetCode Example: Employees Earning More Than Manager¶
Self-merge and filter comparison.
1. Sample Data¶
python
employee = pd.DataFrame({
'id': [1, 2, 3, 4],
'name': ['John', 'Doe', 'Jane', 'Smith'],
'salary': [50000, 40000, 60000, 30000],
'managerId': [None, 1, 1, 2]
})
2. Self Merge¶
python
merged = pd.merge(
employee, employee,
left_on='managerId', right_on='id',
suffixes=('_emp', '_mgr')
)
3. Filter Comparison¶
python
higher_salary = merged[merged['salary_emp'] > merged['salary_mgr']]
result = higher_salary[['name_emp']].rename(columns={'name_emp': 'Employee'})
LeetCode Example: Not Boring Movies¶
Filter with multiple conditions.
1. Sample Data¶
python
cinema = pd.DataFrame({
'id': [1, 2, 3, 4, 5],
'movie': ['Movie A', 'Movie B', 'Movie C', 'Movie D', 'Movie E'],
'description': ['thrilling', 'boring', 'exciting', 'boring doc', 'great'],
'rating': [4.5, 3.0, 4.7, 2.1, 4.9]
})
2. Odd ID and Not Boring¶
python
filtered = cinema[
(cinema['id'] % 2 != 0) &
(~cinema['description'].str.contains('boring', case=False))
]
3. Result¶
python
print(filtered)
Missing Value Handling¶
Filter based on null values.
1. Filter Non-null¶
python
result = df[df['salary'].notna()]
2. Filter Null¶
python
result = df[df['salary'].isna()]
3. Any Row with Missing¶
```python
Rows with any missing value¶
result = df[df.isna().any(axis=1)] ```
Date Filtering¶
Filter by date conditions.
1. Date Range¶
```python df['date'] = pd.to_datetime(df['date'])
result = df[ (df['date'] >= '2019-06-28') & (df['date'] <= '2019-07-27') ] ```
2. Year Filter¶
python
result = df[df['date'].dt.year == 2020]
3. Month Filter¶
python
result = df[df['date'].dt.month.isin([1, 2, 3])] # Q1
Exercises¶
Exercise 1.
Create a DataFrame of employees with columns 'name', 'department', and 'salary'. Use boolean indexing to find all employees in the 'IT' department with a salary greater than 60000.
Solution to Exercise 1
Combine two conditions with the & operator.
import pandas as pd
df = pd.DataFrame({
'name': ['Alice', 'Bob', 'Carol', 'Dave'],
'department': ['IT', 'HR', 'IT', 'IT'],
'salary': [55000, 60000, 70000, 65000]
})
result = df[(df['department'] == 'IT') & (df['salary'] > 60000)]
print(result)
Exercise 2.
Given a stock prices DataFrame with columns 'ticker', 'date', and 'close', create a boolean mask that selects rows where the close price is between 100 and 200 (inclusive) using the & operator. Demonstrate why parentheses are required.
Solution to Exercise 2
Use parentheses around each condition before combining with &.
import pandas as pd
df = pd.DataFrame({
'ticker': ['AAPL', 'MSFT', 'GOOGL', 'AMZN'],
'close': [150.0, 350.0, 140.0, 180.0]
})
# Parentheses required because & has higher precedence than >=
mask = (df['close'] >= 100) & (df['close'] <= 200)
print(df[mask])
Exercise 3.
Use .loc with a boolean mask to update values in place. Given a DataFrame with a 'score' column, set all scores below 50 to 50 (a minimum floor). Verify the change by printing the minimum score before and after.
Solution to Exercise 3
Use .loc with a boolean mask to set values.
import pandas as pd
df = pd.DataFrame({'score': [85, 42, 67, 38, 91, 25]})
print(f"Min before: {df['score'].min()}")
df.loc[df['score'] < 50, 'score'] = 50
print(f"Min after: {df['score'].min()}")
print(df)