String Accessor (str)¶
The str accessor in pandas provides vectorized string operations on Series containing string data. This allows you to apply string methods to entire columns without explicit loops.
Overview¶
import pandas as pd
s = pd.Series(['Alice', 'Bob', 'Charlie'])
# Access string methods via .str accessor
print(s.str.upper())
0 ALICE
1 BOB
2 CHARLIE
dtype: object
Why Use the str Accessor?¶
- Vectorized Operations: Apply string methods to all elements at once
- Automatic NaN Handling: Missing values are handled gracefully
- Method Chaining: Chain multiple string operations together
- Performance: Optimized for pandas data structures
# Without str accessor (slow, verbose)
result = [x.upper() if pd.notna(x) else x for x in s]
# With str accessor (fast, concise)
result = s.str.upper()
Case Transformation¶
upper() and lower()¶
s = pd.Series(['Hello', 'World', 'Python'])
print(s.str.upper()) # HELLO, WORLD, PYTHON
print(s.str.lower()) # hello, world, python
capitalize()¶
Capitalize first character, lowercase the rest.
s = pd.Series(['alice', 'JOHN', 'mARY', 'bOB'])
print(s.str.capitalize())
0 Alice
1 John
2 Mary
3 Bob
dtype: object
Practical Example (LeetCode 1667): Fix names in a table.
users = pd.DataFrame({
'user_id': [1, 2, 3, 4],
'name': ['alice', 'john', 'MARY', 'bOB']
})
users['name'] = users['name'].str.capitalize()
print(users)
user_id name
0 1 Alice
1 2 John
2 3 Mary
3 4 Bob
title()¶
Capitalize first character of each word.
s = pd.Series(['hello world', 'PYTHON PANDAS'])
print(s.str.title())
0 Hello World
1 Python Pandas
dtype: object
swapcase()¶
s = pd.Series(['Hello World'])
print(s.str.swapcase()) # hELLO wORLD
String Length¶
len()¶
s = pd.Series(['Hello', 'World', 'Python'])
print(s.str.len())
0 5
1 5
2 6
dtype: int64
Practical Example (LeetCode 1683): Find tweets longer than 15 characters.
tweets = pd.DataFrame({
'tweet_id': [1, 2, 3, 4],
'content': ['Hello world!', 'This is a very long tweet!', 'Short', 'Another tweet']
})
# Filter tweets with content > 15 characters
long_tweets = tweets[tweets['content'].str.len() > 15]
print(long_tweets)
tweet_id content
1 2 This is a very long tweet!
Substring Operations¶
Indexing with []¶
s = pd.Series(['Alice', 'Bob', 'Charlie'])
print(s.str[0]) # First character: A, B, C
print(s.str[-1]) # Last character: e, b, e
print(s.str[:3]) # First 3 characters: Ali, Bob, Cha
slice()¶
s = pd.Series(['Apple', 'Banana', 'Cherry'])
print(s.str.slice(0, 3)) # App, Ban, Che
get()¶
Get character at position (like indexing but safer with NaN).
s = pd.Series(['Alice', None, 'Charlie'])
print(s.str.get(0)) # A, NaN, C
Searching and Matching¶
contains()¶
Check if pattern is contained in each string.
s = pd.Series(['apple', 'banana', 'cherry', 'date'])
print(s.str.contains('an'))
0 False
1 True
2 False
3 False
dtype: bool
Practical Example (LeetCode 620): Find movies without "boring" in description.
cinema = pd.DataFrame({
'id': [1, 2, 3, 4, 5],
'description': ['A thrilling adventure', 'A boring documentary',
'An exciting drama', 'A slow and boring tale', 'A great comedy']
})
# Filter out boring movies (case-insensitive)
not_boring = cinema[~cinema['description'].str.contains('boring', case=False)]
print(not_boring)
id description
0 1 A thrilling adventure
2 3 An exciting drama
4 5 A great comedy
startswith() and endswith()¶
s = pd.Series(['Apple', 'Apricot', 'Banana', 'Avocado'])
print(s.str.startswith('A')) # True, True, False, True
print(s.str.endswith('a')) # False, False, True, False
Practical Example (LeetCode 1873): Find employees whose names don't start with 'M'.
employees = pd.DataFrame({
'employee_id': [1, 2, 3, 4, 5],
'name': ['Alice', 'Bob', 'Mike', 'Molly', 'Eve']
})
# Employees with odd ID and name not starting with 'M'
bonus_eligible = employees[
(employees['employee_id'] % 2 != 0) &
(~employees['name'].str.startswith('M'))
]
print(bonus_eligible)
employee_id name
0 1 Alice
4 5 Eve
match()¶
Match regular expression pattern at the start of string.
s = pd.Series(['alice@example.com', 'bob@test.org', 'invalid-email'])
# Check valid email format
pattern = r'^[a-zA-Z][a-zA-Z0-9._-]*@[a-zA-Z]+\.[a-zA-Z]+$'
print(s.str.match(pattern))
0 True
1 True
2 False
dtype: bool
Practical Example (LeetCode 1517): Find users with valid emails.
users = pd.DataFrame({
'user_id': [1, 2, 3],
'mail': ['alice@leetcode.com', 'bob@leet?com.com', '123@example.com']
})
# Valid email pattern: starts with letter, domain is @leetcode.com
valid_pattern = r'^[A-Za-z][A-Za-z0-9_.\-]*@leetcode\.com$'
valid_users = users[users['mail'].str.match(valid_pattern)]
print(valid_users)
find() and rfind()¶
Find position of substring.
s = pd.Series(['hello world', 'world hello', 'no match'])
print(s.str.find('world')) # 6, 0, -1
String Replacement¶
replace()¶
Replace occurrences of pattern.
s = pd.Series(['apple-pie', 'banana-split', 'cherry-tart'])
print(s.str.replace('-', '_'))
0 apple_pie
1 banana_split
2 cherry_tart
dtype: object
With Regex¶
s = pd.Series(['price: $100', 'cost: $200', 'value: \$300'])
# Remove dollar amounts
print(s.str.replace(r'\$\d+', 'XXX', regex=True))
0 price: XXX
1 cost: XXX
2 value: XXX
dtype: object
Splitting and Joining¶
split()¶
s = pd.Series(['a-b-c', 'x-y-z'])
print(s.str.split('-'))
0 [a, b, c]
1 [x, y, z]
dtype: object
split() with expand¶
Create multiple columns from split.
s = pd.Series(['John Doe', 'Jane Smith', 'Bob Wilson'])
names = s.str.split(' ', expand=True)
names.columns = ['first', 'last']
print(names)
first last
0 John Doe
1 Jane Smith
2 Bob Wilson
join()¶
Join lists in each element.
s = pd.Series([['a', 'b', 'c'], ['x', 'y', 'z']])
print(s.str.join('-'))
0 a-b-c
1 x-y-z
dtype: object
cat()¶
Concatenate strings with separator.
s = pd.Series(['A', 'B', 'C'])
print(s.str.cat(sep='-')) # A-B-C
Stripping Whitespace¶
strip(), lstrip(), rstrip()¶
s = pd.Series([' hello ', ' world '])
print(s.str.strip()) # 'hello', 'world'
print(s.str.lstrip()) # 'hello ', 'world '
print(s.str.rstrip()) # ' hello', ' world'
Padding¶
pad(), ljust(), rjust(), center()¶
s = pd.Series(['a', 'bb', 'ccc'])
print(s.str.pad(5, fillchar='_')) # ____a, ___bb, __ccc
print(s.str.ljust(5, fillchar='_')) # a____, bb___, ccc__
print(s.str.center(5, fillchar='_')) # __a__, _bb__, _ccc_
zfill()¶
Pad with zeros.
s = pd.Series(['1', '12', '123'])
print(s.str.zfill(5)) # 00001, 00012, 00123
Extracting Data¶
extract()¶
Extract groups from regex pattern.
s = pd.Series(['A-123', 'B-456', 'C-789'])
# Extract letter and number
extracted = s.str.extract(r'([A-Z])-(\d+)')
extracted.columns = ['letter', 'number']
print(extracted)
letter number
0 A 123
1 B 456
2 C 789
extractall()¶
Extract all matches (returns MultiIndex).
s = pd.Series(['a1b2c3', 'x9y8z7'])
print(s.str.extractall(r'(\d)'))
Handling Missing Values¶
The str accessor handles NaN values gracefully.
s = pd.Series(['Hello', None, 'World'])
print(s.str.upper())
0 HELLO
1 NaN
2 WORLD
dtype: object
Explicit NA Handling¶
# Some methods have na parameter
s = pd.Series(['Apple', None, 'Banana'])
print(s.str.contains('a', na=False)) # False for NaN
print(s.str.contains('a', na=True)) # True for NaN
Method Chaining¶
String methods can be chained together.
s = pd.Series([' HELLO WORLD ', ' PYTHON PANDAS '])
result = (s
.str.strip()
.str.lower()
.str.replace(' ', '_')
)
print(result)
0 hello_world
1 python_pandas
dtype: object
Financial Example: Ticker Cleaning¶
# Raw ticker data with inconsistencies
tickers = pd.Series([' aapl ', 'MSFT', 'googl ', ' AMZN'])
# Clean and standardize
clean_tickers = (tickers
.str.strip()
.str.upper()
)
print(clean_tickers)
0 AAPL
1 MSFT
2 GOOGL
3 AMZN
dtype: object
Summary of Common Methods¶
| Method | Purpose | Example |
|---|---|---|
upper(), lower() |
Case conversion | s.str.upper() |
capitalize() |
Capitalize first char | s.str.capitalize() |
len() |
String length | s.str.len() |
contains() |
Pattern search | s.str.contains('pat') |
startswith() |
Prefix check | s.str.startswith('A') |
replace() |
String replacement | s.str.replace('a', 'b') |
split() |
Split strings | s.str.split('-') |
strip() |
Remove whitespace | s.str.strip() |
extract() |
Regex extraction | s.str.extract(r'(\d+)') |
match() |
Regex matching | s.str.match(r'^[A-Z]') |