String Methods¶

The .str accessor exposes vectorized string methods on a Series, which makes it the usual way to filter and transform text columns.

str.contains¶

Check whether each string contains a pattern. By default the pattern is interpreted as a regular expression (regex=True).

1. Basic Contains¶

import pandas as pd

df = pd.DataFrame({
    'description': ['thrilling adventure', 'boring documentary', 
                   'exciting drama', 'slow and boring', 'great comedy']
})

# Filter containing 'boring'
result = df[df['description'].str.contains('boring')]
print(result)

          description
1  boring documentary
3     slow and boring

2. Case Insensitive¶

result = df[df['description'].str.contains('boring', case=False)]
# Matches 'boring', 'Boring', 'BORING'

3. NOT Contains¶

result = df[~df['description'].str.contains('boring', case=False)]
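Because the pattern is a regular expression by default, characters such as $ or ( carry special meaning. The sketch below, using a made-up prices Series, shows how regex=False switches to plain-text matching and what happens when a special character is left as a regex.

```python
import pandas as pd

# Hypothetical sample data for illustration
prices = pd.Series(['cost: $5', 'cost: 5 dollars', 'free (promo)'])

# regex=False treats the pattern as plain text, so '$' and '(' need no escaping
has_dollar = prices.str.contains('$', regex=False)
has_paren = prices.str.contains('(promo)', regex=False)

# With the default regex=True, '$' is an end-of-string anchor and matches every row
anchor_match = prices.str.contains('$')

print(has_dollar.tolist())    # [True, False, False]
print(has_paren.tolist())     # [False, False, True]
print(anchor_match.tolist())  # [True, True, True]
```

When the search text comes from user input, literal matching is usually the safer choice.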

str.startswith¶

Check whether each string starts with a given prefix. The prefix is matched literally; regular expressions are not accepted.

1. Basic startswith¶

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Mike', 'Molly', 'Eve']
})

# Names starting with 'M'
result = df[df['name'].str.startswith('M')]
print(result)

    name
2   Mike
3  Molly

2. NOT startswith¶

result = df[~df['name'].str.startswith('M')]

3. Multiple Prefixes¶

# Use regex with str.contains
result = df[df['name'].str.contains('^[AM]', regex=True)]
# Names starting with A or M
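As an alternative to the regex, recent pandas releases also accept a tuple of prefixes in str.startswith. The sketch below assumes such a version is installed and compares both forms on the same names.

```python
import pandas as pd

names = pd.Series(['Alice', 'Bob', 'Mike', 'Molly', 'Eve'])

# Tuple of literal prefixes (assumes a pandas version whose str.startswith accepts tuples)
by_tuple = names[names.str.startswith(('A', 'M'))]

# Equivalent regex form: '^' anchors the character class to the start
by_regex = names[names.str.contains('^[AM]')]

print(by_tuple.tolist())  # ['Alice', 'Mike', 'Molly']
print(by_regex.tolist())  # ['Alice', 'Mike', 'Molly']
```

The tuple form avoids regex escaping when the prefixes contain special characters.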

str.endswith¶

Check whether each string ends with a given suffix. As with startswith, the suffix is matched literally.

1. Basic endswith¶

df = pd.DataFrame({
    'email': ['user@gmail.com', 'admin@yahoo.com', 'test@gmail.com']
})

result = df[df['email'].str.endswith('@gmail.com')]

2. File Extensions¶

df = pd.DataFrame({'filename': ['report.pdf', 'data.csv', 'image.png']})
pdf_files = df[df['filename'].str.endswith('.pdf')]

3. Multiple Suffixes¶

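# Pass a tuple of suffixes (documented for pandas 1.5 and later)
# df here is the filename DataFrame defined in the previous example
result = df[df['filename'].str.endswith(('.pdf', '.csv'))]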

LeetCode Example: Not Boring Movies¶

Filter by description content.

1. Sample Data¶

cinema = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'description': ['thrilling', 'boring', 'exciting', 'boring doc', 'great']
})

2. Combined Filter¶

# Odd ID AND not boring
result = cinema[
    (cinema['id'] % 2 != 0) & 
    (~cinema['description'].str.contains('boring', case=False))
]

3. Result¶

print(result)
# Rows with id 1, 3, and 5 remain ('thrilling', 'exciting', 'great')

LeetCode Example: Special Bonus¶

Filter by name prefix.

1. Sample Data¶

employees = pd.DataFrame({
    'employee_id': [1, 2, 3, 4, 5],
    'name': ['Alice', 'Bob', 'Mike', 'Molly', 'Eve'],
    'salary': [50000, 60000, 70000, 80000, 90000]
})

2. Filter Condition¶

# Odd ID and name doesn't start with 'M'
eligible = (employees['employee_id'] % 2 != 0) & (~employees['name'].str.startswith('M'))

3. Apply Bonus¶

employees['bonus'] = 0
employees.loc[eligible, 'bonus'] = employees.loc[eligible, 'salary']
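The two-step assignment above can also be written as a single expression. The following is a minimal sketch using Series.where, which keeps the salary where the mask is True and substitutes 0 elsewhere; it is one possible alternative, not the only way to solve the problem.

```python
import pandas as pd

employees = pd.DataFrame({
    'employee_id': [1, 2, 3, 4, 5],
    'name': ['Alice', 'Bob', 'Mike', 'Molly', 'Eve'],
    'salary': [50000, 60000, 70000, 80000, 90000]
})

# Same eligibility mask as above: odd ID and name not starting with 'M'
eligible = (employees['employee_id'] % 2 != 0) & (~employees['name'].str.startswith('M'))

# Keep salary where eligible, otherwise use 0
employees['bonus'] = employees['salary'].where(eligible, 0)

print(employees['bonus'].tolist())  # [50000, 0, 0, 0, 90000]
```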

str.match¶

Match a regular expression anchored at the start of each string, so the leading ^ in the example below is optional.

1. Regex Match¶

df = pd.DataFrame({'code': ['ABC123', 'XYZ456', 'ABC789']})
result = df[df['code'].str.match('^ABC')]

2. Pattern Groups¶

# Extract capture groups into a DataFrame, one column per group
df['code'].str.extract(r'([A-Z]+)(\d+)')

3. Full Match¶

result = df[df['code'].str.fullmatch(r'[A-Z]{3}\d{3}')]
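The difference between contains, match, and fullmatch is easiest to see on the same pattern. The sketch below uses a small made-up Series of codes: contains searches anywhere, match anchors at the start only, and fullmatch requires the whole string to match.

```python
import pandas as pd

# Hypothetical codes: one exact, one with a trailing character, one with a leading character
codes = pd.Series(['ABC123', 'ABC123X', 'XABC123'])
pattern = r'[A-Z]{3}\d{3}'

anywhere = codes.str.contains(pattern)    # pattern may appear at any position
at_start = codes.str.match(pattern)       # pattern must begin at position 0
whole = codes.str.fullmatch(pattern)      # pattern must cover the entire string

print(anywhere.tolist())  # [True, True, True]
print(at_start.tolist())  # [True, True, False]
print(whole.tolist())     # [True, False, False]
```

For validation tasks such as checking a code format, fullmatch is usually the intended behavior.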

str.len¶

Filter by string length.

1. Length Filter¶

df = pd.DataFrame({'name': ['Al', 'Bob', 'Charlie', 'Ed']})
result = df[df['name'].str.len() > 2]

2. Exact Length¶

result = df[df['name'].str.len() == 3]

3. Range¶

result = df[df['name'].str.len().between(3, 5)]

String Transformation¶

Transform before filtering.

1. Lower/Upper¶

df['name_lower'] = df['name'].str.lower()
result = df[df['name_lower'] == 'alice']

2. Strip Whitespace¶

df['name_clean'] = df['name'].str.strip()

3. Replace¶

# regex=False makes the literal replacement explicit across pandas versions
df['name_clean'] = df['name'].str.replace('-', '', regex=False)
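These transformations are often chained so that a comparison ignores stray whitespace and letter case. A minimal sketch, with made-up input values, of normalizing first and then filtering the original Series:

```python
import pandas as pd

# Hypothetical messy input for illustration
raw = pd.Series(['  Alice', 'ALICE ', 'Bob', 'alice-smith'])

# Normalize: trim surrounding whitespace, then lowercase
clean = raw.str.strip().str.lower()

# Filter the original values using the normalized comparison
matches = raw[clean == 'alice']

print(clean.tolist())    # ['alice', 'alice', 'bob', 'alice-smith']
print(matches.tolist())  # ['  Alice', 'ALICE ']
```

Filtering with the normalized mask while keeping the original column leaves the source data untouched.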

Handling NaN¶

String methods and missing values.

1. na Parameter¶

# Default: NaN returns NaN
df['name'].str.contains('A')  # NaN for missing

# Treat NaN as False
df['name'].str.contains('A', na=False)

2. fillna First¶

df['name'].fillna('').str.contains('A')

3. Check First¶

mask = df['name'].notna() & df['name'].str.contains('A')
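The snippets above assume a name column that contains missing values. The sketch below builds a small hypothetical Series with one NaN and shows that the na=False and fillna('') approaches produce the same boolean mask, which can then be used directly for indexing.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value
names = pd.Series(['Alice', np.nan, 'Bob', 'Anna'])

# Option 1: tell str.contains to treat missing values as False
mask_na = names.str.contains('A', na=False)

# Option 2: replace missing values with an empty string first
mask_fill = names.fillna('').str.contains('A')

print(mask_na.tolist())         # [True, False, False, True]
print(names[mask_na].tolist())  # ['Alice', 'Anna']
```

A mask that still contains NaN cannot be used reliably for boolean indexing, which is why one of these steps is normally applied before filtering.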

Common Patterns¶

Frequently used string filters.

1. Email Domain¶

# regex=False so the '.' in the domain is matched literally
gmail_users = df[df['email'].str.contains('@gmail.com', regex=False)]

2. Phone Format¶

valid_phone = df[df['phone'].str.match(r'^\d{3}-\d{3}-\d{4}$')]

3. Name Pattern¶

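# Names with Jr. or Sr.
# Non-capturing group avoids the match-groups warning; no \b after the period,
# because a word boundary cannot follow '.' at the end of a string
suffix = df[df['name'].str.contains(r'\b(?:Jr|Sr)\.', regex=True)]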
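The patterns in this section refer to columns that are not defined above. As a self-contained sketch, the following builds a small hypothetical people DataFrame and applies a phone-format check and a name-suffix check to it; the column names and values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical data for illustration
people = pd.DataFrame({
    'name': ['John Smith Jr.', 'Ann Lee', 'Sam Ray Sr.', 'Jr Walker'],
    'phone': ['555-123-4567', '5551234567', '555-123-456', '555-987-6543'],
})

# Whole-string check for the NNN-NNN-NNNN format
phone_mask = people['phone'].str.fullmatch(r'\d{3}-\d{3}-\d{4}')

# 'Jr.' or 'Sr.' as a separate word, including at the end of the name
suffix_mask = people['name'].str.contains(r'\b(?:Jr|Sr)\.')

print(phone_mask.tolist())   # [True, False, False, True]
print(suffix_mask.tolist())  # [True, False, True, False]
```

Note that 'Jr Walker' is not matched because the pattern requires the trailing period.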