query Method¶
The query() method filters DataFrames using a string expression, providing cleaner syntax for complex conditions.
Basic Usage¶
Filter with string expressions.
1. Simple Condition¶
import pandas as pd
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Age': [25, 30, 35, 40],
'Salary': [50000, 60000, 70000, 80000]
})
# Equivalent operations
result1 = df[df['Age'] < 35] # Boolean indexing
result2 = df.query('Age < 35') # Query method
print(result2)
Name Age Salary
0 Alice 25 50000
1 Bob 30 60000
2. Cleaner Syntax¶
Query avoids repeated DataFrame references.
3. Column Names¶
Use column names directly in the query string.
Multiple Conditions¶
Combine conditions with and/or.
1. AND Condition¶
# Boolean indexing
df[(df['Age'] > 25) & (df['Salary'] > 50000)]
# Query method
df.query('Age > 25 and Salary > 50000')
2. OR Condition¶
df.query('Age > 35 or Salary < 55000')
3. Complex Conditions¶
df.query('(Age > 30 and Salary > 60000) or Name == "Alice"')
Column Comparisons¶
Compare columns against each other.
1. Column vs Column¶
df = pd.DataFrame({
'Subscribers': [760, 366, 1660, 171],
'Age': [40, 50, 25, 35]
})
# Subscribers > 10 * Age
df.query('Subscribers > 10 * Age')
2. Arithmetic Expressions¶
df.query('Salary / 1000 > Age')
3. Multiple Columns¶
df.query('High > Open and Close > Open')
External Variables¶
Reference variables with @ prefix.
1. Variable Reference¶
limit = 500
# Boolean indexing
df[df['Subscribers'] > limit]
# Query method
df.query('Subscribers > @limit')
2. Multiple Variables¶
min_age = 25
max_age = 40
df.query('@min_age <= Age <= @max_age')
3. List Variables¶
valid_names = ['Alice', 'Bob']
df.query('Name in @valid_names')
String Operations¶
Query with string conditions.
1. Equality¶
df.query('Name == "Alice"')
2. In List¶
df.query('Name in ["Alice", "Bob"]')
3. Not In¶
df.query('Name not in ["Charlie", "David"]')
Detecting NaN¶
Special handling for missing values.
1. NaN Detection¶
# NaN != NaN is True
df.query('Subscribers != Subscribers') # Finds NaN rows
2. Not NaN¶
df.query('Subscribers == Subscribers') # Non-NaN rows
3. Alternative¶
# More readable alternatives
df[df['Subscribers'].isna()]
df[df['Subscribers'].notna()]
Date Queries¶
Filter by date ranges.
1. Date Range¶
df.query('request_at >= "2013-10-01" and request_at <= "2013-10-03"')
2. Year Filter¶
df.query('sale_date.dt.year == 2019')
3. Month Filter¶
df.query('sale_date.dt.month in [1, 2, 3]')
LeetCode Example: Sales Analysis¶
Filter sales by date.
1. Sample Data¶
sales = pd.DataFrame({
'sale_date': pd.to_datetime(['2019-01-15', '2019-04-10', '2018-12-25']),
'product_id': [1, 2, 3]
})
2. Filter Q1 2019¶
q1_sales = sales.query('sale_date.dt.year == 2019 and sale_date.dt.month in [1, 2, 3]')
3. Exclude Q1 2019¶
not_q1 = sales.query('sale_date.dt.year != 2019 or sale_date.dt.month not in [1, 2, 3]')
Spaces in Column Names¶
Handle column names with spaces.
1. Backticks¶
df = pd.DataFrame({'First Name': ['Alice', 'Bob']})
df.query('`First Name` == "Alice"')
2. Rename Columns¶
# Better to avoid spaces in column names
df.columns = df.columns.str.replace(' ', '_')
3. Alternative¶
df[df['First Name'] == 'Alice'] # Boolean indexing works
Performance¶
When to use query.
1. Large DataFrames¶
Query can be faster for large DataFrames.
2. Complex Conditions¶
Query is cleaner for multiple conditions.
3. Simple Filters¶
Boolean indexing is fine for simple filters.
Query vs Boolean Indexing¶
Comparison of approaches.
1. Query Advantages¶
- Cleaner syntax for complex conditions
- No repeated DataFrame name
- More readable
2. Boolean Indexing Advantages¶
- More flexible
- Works with any expression
- No string parsing
3. Choose Based On¶
- Readability needs
- Condition complexity
- Personal preference