String Methods¶
The .str accessor exposes vectorized string methods on a Series, which makes it convenient for filtering and transforming text data.
str.contains¶
Check whether each string contains a pattern. By default the pattern is interpreted as a regular expression (regex=True).
1. Basic Contains¶
import pandas as pd
df = pd.DataFrame({
    'description': ['thrilling adventure', 'boring documentary',
                    'exciting drama', 'slow and boring', 'great comedy']
})
# Filter containing 'boring'
result = df[df['description'].str.contains('boring')]
print(result)
description
1 boring documentary
3 slow and boring
2. Case Insensitive¶
result = df[df['description'].str.contains('boring', case=False)]
# Matches 'boring', 'Boring', 'BORING'
3. NOT Contains¶
result = df[~df['description'].str.contains('boring', case=False)]
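Because str.contains treats its pattern as a regular expression by default, characters such as '.' act as wildcards. The sketch below, built on hypothetical sample values, contrasts the default behavior with regex=False, which matches a plain substring.

```python
import pandas as pd

# Hypothetical sample values, invented for illustration.
values = pd.Series(['1.5', '125', '1-5'])

# Default regex=True: '.' matches any character, so all three values match.
regex_mask = values.str.contains('1.5')

# regex=False: '1.5' is matched as a literal substring.
literal_mask = values.str.contains('1.5', regex=False)

print(regex_mask.tolist())    # [True, True, True]
print(literal_mask.tolist())  # [True, False, False]
```

When the search text comes from user input or contains punctuation, regex=False is usually the safer choice.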
str.startswith¶
Check whether each string starts with a literal prefix. Regular expressions are not accepted.
1. Basic startswith¶
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Mike', 'Molly', 'Eve']
})
# Names starting with 'M'
result = df[df['name'].str.startswith('M')]
print(result)
name
2 Mike
3 Molly
2. NOT startswith¶
result = df[~df['name'].str.startswith('M')]
3. Multiple Prefixes¶
# Use regex with str.contains
result = df[df['name'].str.contains('^[AM]', regex=True)]
# Names starting with A or M
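As an alternative to the regex approach, str.startswith can take a tuple of literal prefixes. This is a minimal sketch that assumes a pandas version whose str.startswith accepts tuples (recent releases do); it checks that both approaches select the same rows.

```python
import pandas as pd

names = pd.Series(['Alice', 'Bob', 'Mike', 'Molly', 'Eve'])

# Regex approach: anchor at the start, then a character class.
by_regex = names.str.contains('^[AM]', regex=True)

# Tuple approach: each element is a literal prefix, no regex involved.
by_tuple = names.str.startswith(('A', 'M'))

print(names[by_tuple].tolist())  # ['Alice', 'Mike', 'Molly']
```

The tuple form avoids regex escaping issues when prefixes contain special characters.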
str.endswith¶
Check whether each string ends with a literal suffix. Regular expressions are not accepted.
1. Basic endswith¶
df = pd.DataFrame({
    'email': ['user@gmail.com', 'admin@yahoo.com', 'test@gmail.com']
})
result = df[df['email'].str.endswith('@gmail.com')]
2. File Extensions¶
df = pd.DataFrame({'filename': ['report.pdf', 'data.csv', 'image.png']})
pdf_files = df[df['filename'].str.endswith('.pdf')]
3. Multiple Suffixes¶
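# Pass a tuple of suffixes (tuple support is documented in recent pandas, 1.5 and later)
# df now holds the 'filename' column defined above
result = df[df['filename'].str.endswith(('.pdf', '.csv'))]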
LeetCode Example: Not Boring Movies¶
Filter by description content.
1. Sample Data¶
cinema = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'description': ['thrilling', 'boring', 'exciting', 'boring doc', 'great']
})
2. Combined Filter¶
# Odd ID AND not boring
result = cinema[
    (cinema['id'] % 2 != 0) &
    (~cinema['description'].str.contains('boring', case=False))
]
3. Result¶
print(result)
# Rows with id 1, 3 and 5 remain ('thrilling', 'exciting', 'great')
LeetCode Example: Special Bonus¶
Filter by name prefix.
1. Sample Data¶
employees = pd.DataFrame({
    'employee_id': [1, 2, 3, 4, 5],
    'name': ['Alice', 'Bob', 'Mike', 'Molly', 'Eve'],
    'salary': [50000, 60000, 70000, 80000, 90000]
})
2. Filter Condition¶
# Odd ID and name doesn't start with 'M'
eligible = (employees['employee_id'] % 2 != 0) & (~employees['name'].str.startswith('M'))
3. Apply Bonus¶
employees['bonus'] = 0
employees.loc[eligible, 'bonus'] = employees.loc[eligible, 'salary']
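The two-step assignment above can also be written in one expression with Series.where, which keeps values where the condition is True and substitutes a fallback elsewhere. This is one possible alternative, not necessarily the intended solution.

```python
import pandas as pd

employees = pd.DataFrame({
    'employee_id': [1, 2, 3, 4, 5],
    'name': ['Alice', 'Bob', 'Mike', 'Molly', 'Eve'],
    'salary': [50000, 60000, 70000, 80000, 90000],
})

# Odd ID and name does not start with 'M'.
eligible = (employees['employee_id'] % 2 != 0) & (~employees['name'].str.startswith('M'))

# Keep the salary where eligible is True, otherwise use 0.
employees['bonus'] = employees['salary'].where(eligible, 0)

print(employees['bonus'].tolist())  # [50000, 0, 0, 0, 90000]
```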
str.match¶
Match a regular expression anchored at the start of each string.
1. Regex Match¶
df = pd.DataFrame({'code': ['ABC123', 'XYZ456', 'ABC789']})
result = df[df['code'].str.match('ABC')]  # match already anchors at the start, so a leading '^' is redundant
2. Pattern Groups¶
# Extract matched groups
df['code'].str.extract(r'([A-Z]+)(\d+)')
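str.extract returns a DataFrame with one column per capture group. The sketch below uses named groups so the columns get readable labels; the group names prefix and number are hypothetical choices made for this illustration.

```python
import pandas as pd

codes = pd.Series(['ABC123', 'XYZ456', 'ABC789'])

# Named groups become the column names of the resulting DataFrame.
parts = codes.str.extract(r'(?P<prefix>[A-Z]+)(?P<number>\d+)')

print(parts['prefix'].tolist())  # ['ABC', 'XYZ', 'ABC']
print(parts['number'].tolist())  # ['123', '456', '789']
```

The extracted digits are still strings; convert them with astype(int) if numeric values are needed.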
3. Full Match¶
result = df[df['code'].str.fullmatch(r'[A-Z]{3}\d{3}')]
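The difference between contains, match, and fullmatch is easiest to see by applying one pattern to the same data. The sample codes below are hypothetical and chosen so that each method gives a different answer.

```python
import pandas as pd

# Hypothetical codes: a clean one, one with a leading character, one with a trailing digit.
codes = pd.Series(['ABC123', 'xABC123', 'ABC1234'])
pattern = r'[A-Z]{3}\d{3}'

anywhere = codes.str.contains(pattern)   # pattern may occur at any position
at_start = codes.str.match(pattern)      # pattern must begin at position 0
entire = codes.str.fullmatch(pattern)    # pattern must cover the whole string

print(anywhere.tolist())  # [True, True, True]
print(at_start.tolist())  # [True, False, True]
print(entire.tolist())    # [True, False, False]
```

For validation tasks, fullmatch is usually the one that expresses the intent.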
str.len¶
Filter by string length.
1. Length Filter¶
df = pd.DataFrame({'name': ['Al', 'Bob', 'Charlie', 'Ed']})
result = df[df['name'].str.len() > 2]
2. Exact Length¶
result = df[df['name'].str.len() == 3]
3. Range¶
result = df[df['name'].str.len().between(3, 5)]
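Series.between includes both endpoints by default, so between(3, 5) keeps lengths 3, 4, and 5. A small sketch with hypothetical names makes the boundaries visible.

```python
import pandas as pd

# Hypothetical names with lengths 2, 3, 5 and 7.
names = pd.Series(['Al', 'Bob', 'Alice', 'Charlie'])

lengths = names.str.len()
in_range = lengths.between(3, 5)  # inclusive on both ends by default

print(lengths.tolist())          # [2, 3, 5, 7]
print(names[in_range].tolist())  # ['Bob', 'Alice']
```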
String Transformation¶
Transform before filtering.
1. Lower/Upper¶
df['name_lower'] = df['name'].str.lower()
result = df[df['name_lower'] == 'alice']
2. Strip Whitespace¶
df['name_clean'] = df['name'].str.strip()
3. Replace¶
df['name_clean'] = df['name'].str.replace('-', '', regex=False)  # literal replacement; the regex default differs across pandas versions
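These transformations chain naturally, since each .str method returns a new Series. The following sketch normalizes hypothetical messy input before comparing, so that differently formatted spellings of the same name all match.

```python
import pandas as pd

# Hypothetical messy input, invented for illustration.
raw = pd.Series(['  Alice ', 'ALICE', 'al-ice', 'Bob'])

normalized = (
    raw.str.strip()                        # drop leading/trailing whitespace
       .str.lower()                        # lowercase for comparison
       .str.replace('-', '', regex=False)  # remove literal hyphens
)

# Filter the original values using the normalized form.
matches = raw[normalized == 'alice']

print(normalized.tolist())  # ['alice', 'alice', 'alice', 'bob']
```

Filtering the original Series with a mask built from the normalized one keeps the source values untouched.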
Handling NaN¶
String methods and missing values.
1. na Parameter¶
# Default: NaN returns NaN
df['name'].str.contains('A') # NaN for missing
# Treat NaN as False
df['name'].str.contains('A', na=False)
2. fillna First¶
df['name'].fillna('').str.contains('A')
3. Check First¶
mask = df['name'].notna() & df['name'].str.contains('A')
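The three approaches above can be compared side by side. This sketch uses a hypothetical Series created with dtype=object, an assumption made here so that the default NaN propagation is reproducible regardless of the string dtype a given pandas version infers.

```python
import numpy as np
import pandas as pd

# Hypothetical data; dtype=object is set explicitly for illustration.
names = pd.Series(['Alice', np.nan, 'Bob', 'Anna'], dtype=object)

raw = names.str.contains('A')                   # missing entry stays missing
with_na = names.str.contains('A', na=False)     # missing entry becomes False
with_fill = names.fillna('').str.contains('A')  # '' never contains 'A'

print(raw.isna().tolist())      # [False, True, False, False]
print(with_na.tolist())         # [True, False, False, True]
print(names[with_na].tolist())  # ['Alice', 'Anna']
```

A mask that still contains NaN cannot be used directly for boolean indexing, which is why na=False or fillna is typically applied first.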
Common Patterns¶
Frequently used string filters.
1. Email Domain¶
# regex=False so the '.' is matched literally rather than as a wildcard
gmail_users = df[df['email'].str.contains('@gmail.com', regex=False)]
2. Phone Format¶
valid_phone = df[df['phone'].str.match(r'^\d{3}-\d{3}-\d{4}$')]
3. Name Pattern¶
# Names with Jr. or Sr.
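# Non-capturing group avoids the match-groups warning; no trailing \b, because a '.' followed by a space or the end of the string is not a word boundary
suffix = df[df['name'].str.contains(r'\b(?:Jr|Sr)\.', regex=True)]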
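A minimal runnable sketch of the three patterns follows, using hypothetical contact records. It matches the email domain literally, validates phone numbers with fullmatch, and uses a non-capturing group for the name suffix; these are choices made for this illustration.

```python
import pandas as pd

# Hypothetical contact records, invented for illustration.
contacts = pd.DataFrame({
    'name': ['John Smith Jr.', 'Ann Lee', 'Sam Ray Sr.'],
    'email': ['john@gmail.com', 'ann@gmailxcom.net', 'sam@yahoo.com'],
    'phone': ['555-123-4567', '5551234567', '555-987-6543'],
})

# Literal substring match, so '.' is not treated as a wildcard.
gmail_users = contacts[contacts['email'].str.contains('@gmail.com', regex=False)]

# fullmatch requires the whole string to fit the NNN-NNN-NNNN format.
valid_phone = contacts[contacts['phone'].str.fullmatch(r'\d{3}-\d{3}-\d{4}')]

# Word boundary before the suffix, literal dot after it.
suffix = contacts[contacts['name'].str.contains(r'\b(?:Jr|Sr)\.')]

print(gmail_users['name'].tolist())  # ['John Smith Jr.']
print(valid_phone['name'].tolist())  # ['John Smith Jr.', 'Sam Ray Sr.']
print(suffix['name'].tolist())       # ['John Smith Jr.', 'Sam Ray Sr.']
```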