String Methods¶
The .str accessor provides string operations for filtering and transforming text data.
Mental Model
String filtering combines two ideas: the .str accessor exposes vectorized string methods, and each method returns a boolean mask or transformed Series. Chain .str.contains() or .str.startswith() inside [] to filter rows by text patterns, just as you would chain comparison operators for numeric filtering.
str.contains¶
Check if string contains a pattern.
1. Basic Contains¶
```python import pandas as pd
df = pd.DataFrame({ 'description': ['thrilling adventure', 'boring documentary', 'exciting drama', 'slow and boring', 'great comedy'] })
Filter containing 'boring'¶
result = df[df['description'].str.contains('boring')] print(result) ```
description
1 boring documentary
3 slow and boring
2. Case Insensitive¶
```python result = df[df['description'].str.contains('boring', case=False)]
Matches 'boring', 'Boring', 'BORING'¶
```
3. NOT Contains¶
python
result = df[~df['description'].str.contains('boring', case=False)]
str.startswith¶
Check if string starts with pattern.
1. Basic startswith¶
```python df = pd.DataFrame({ 'name': ['Alice', 'Bob', 'Mike', 'Molly', 'Eve'] })
Names starting with 'M'¶
result = df[df['name'].str.startswith('M')] print(result) ```
name
2 Mike
3 Molly
2. NOT startswith¶
python
result = df[~df['name'].str.startswith('M')]
3. Multiple Prefixes¶
```python
Use regex with str.contains¶
result = df[df['name'].str.contains('^[AM]', regex=True)]
Names starting with A or M¶
```
str.endswith¶
Check if string ends with pattern.
1. Basic endswith¶
```python df = pd.DataFrame({ 'email': ['user@gmail.com', 'admin@yahoo.com', 'test@gmail.com'] })
result = df[df['email'].str.endswith('@gmail.com')] ```
2. File Extensions¶
python
df = pd.DataFrame({'filename': ['report.pdf', 'data.csv', 'image.png']})
pdf_files = df[df['filename'].str.endswith('.pdf')]
3. Multiple Suffixes¶
```python
Use tuple (pandas >= 1.0)¶
result = df[df['email'].str.endswith(('.com', '.org'))] ```
LeetCode Example: Not Boring Movies¶
Filter by description content.
1. Sample Data¶
python
cinema = pd.DataFrame({
'id': [1, 2, 3, 4, 5],
'description': ['thrilling', 'boring', 'exciting', 'boring doc', 'great']
})
2. Combined Filter¶
```python
Odd ID AND not boring¶
result = cinema[ (cinema['id'] % 2 != 0) & (~cinema['description'].str.contains('boring', case=False)) ] ```
3. Result¶
python
print(result)
LeetCode Example: Special Bonus¶
Filter by name prefix.
1. Sample Data¶
python
employees = pd.DataFrame({
'employee_id': [1, 2, 3, 4, 5],
'name': ['Alice', 'Bob', 'Mike', 'Molly', 'Eve'],
'salary': [50000, 60000, 70000, 80000, 90000]
})
2. Filter Condition¶
```python
Odd ID and name doesn't start with 'M'¶
eligible = (employees['employee_id'] % 2 != 0) & (~employees['name'].str.startswith('M')) ```
3. Apply Bonus¶
python
employees['bonus'] = 0
employees.loc[eligible, 'bonus'] = employees.loc[eligible, 'salary']
str.match¶
Match regular expression at start.
1. Regex Match¶
python
df = pd.DataFrame({'code': ['ABC123', 'XYZ456', 'ABC789']})
result = df[df['code'].str.match('^ABC')]
2. Pattern Groups¶
```python
Extract matched groups¶
df['code'].str.extract(r'([A-Z]+)(\d+)') ```
3. Full Match¶
python
result = df[df['code'].str.fullmatch(r'[A-Z]{3}\d{3}')]
str.len¶
Filter by string length.
1. Length Filter¶
python
df = pd.DataFrame({'name': ['Al', 'Bob', 'Charlie', 'Ed']})
result = df[df['name'].str.len() > 2]
2. Exact Length¶
python
result = df[df['name'].str.len() == 3]
3. Range¶
python
result = df[df['name'].str.len().between(3, 5)]
String Transformation¶
Transform before filtering.
1. Lower/Upper¶
python
df['name_lower'] = df['name'].str.lower()
result = df[df['name_lower'] == 'alice']
2. Strip Whitespace¶
python
df['name_clean'] = df['name'].str.strip()
3. Replace¶
python
df['name_clean'] = df['name'].str.replace('-', '')
Handling NaN¶
String methods and missing values.
1. na Parameter¶
```python
Default: NaN returns NaN¶
df['name'].str.contains('A') # NaN for missing
Treat NaN as False¶
df['name'].str.contains('A', na=False) ```
2. fillna First¶
python
df['name'].fillna('').str.contains('A')
3. Check First¶
python
mask = df['name'].notna() & df['name'].str.contains('A')
Common Patterns¶
Frequently used string filters.
1. Email Domain¶
python
gmail_users = df[df['email'].str.contains('@gmail.com')]
2. Phone Format¶
python
valid_phone = df[df['phone'].str.match(r'^\d{3}-\d{3}-\d{4}$')]
3. Name Pattern¶
```python
Names with Jr. or Sr.¶
suffix = df[df['name'].str.contains(r'\b(Jr.|Sr.)\b', regex=True)] ```
Exercises¶
Exercise 1.
Given a DataFrame with a 'name' column, use .str.contains() to find all names that contain the substring 'son' (case-insensitive). Print the matching rows.
Solution to Exercise 1
Use case=False for case-insensitive search.
import pandas as pd
df = pd.DataFrame({
'name': ['Johnson', 'Smith', 'Jackson', 'Wilson', 'Lee']
})
result = df[df['name'].str.contains('son', case=False)]
print(result)
Exercise 2.
Use .str.startswith() and .str.endswith() together to filter product codes that start with 'PRD' and end with a digit. Use a regex pattern with .str.match() as an alternative.
Solution to Exercise 2
Combine startswith and regex matching.
import pandas as pd
df = pd.DataFrame({
'code': ['PRD001', 'PRD002', 'SKU-10', 'PRD-AB', 'PRD999']
})
# Method 1: startswith + regex endswith
mask = df['code'].str.startswith('PRD') & df['code'].str.match(r'.*\d$')
print(df[mask])
Exercise 3.
Use .str.len() to filter a DataFrame of descriptions, keeping only rows where the description is between 10 and 100 characters long. Print the count of rows before and after filtering.
Solution to Exercise 3
Use .str.len() with .between() for length filtering.
import pandas as pd
df = pd.DataFrame({
'description': ['Short', 'A medium-length description here',
'X' * 150, 'Another valid description']
})
print(f"Before: {len(df)} rows")
mask = df['description'].str.len().between(10, 100)
result = df[mask]
print(f"After: {len(result)} rows")
print(result)