String Accessor (str)
The str accessor in pandas provides vectorized string operations on Series containing string data. This allows you to apply string methods to entire columns without explicit loops.
Mental Model
Python's built-in string methods work on one string at a time. The .str accessor broadcasts those same methods over an entire column and propagates missing values (NaN) instead of raising an error. Think of it as lifting str.upper(), str.startswith(), and friends from scalar operations to column-wide operations.
Overview

```python
import pandas as pd

s = pd.Series(['Alice', 'Bob', 'Charlie'])

# Access string methods via the .str accessor
print(s.str.upper())
```

```
0      ALICE
1        BOB
2    CHARLIE
dtype: object
```
Why Use the str Accessor?
- Vectorized Operations: Apply string methods to all elements at once
- Automatic NaN Handling: Missing values are handled gracefully
- Method Chaining: Chain multiple string operations together
- Performance: Avoids a hand-written Python loop; the actual speedup depends on the dtype and the operation

```python
# Without the str accessor (verbose, manual NaN check)
result = [x.upper() if pd.notna(x) else x for x in s]

# With the str accessor (concise, NaN-aware)
result = s.str.upper()
```
Case Transformation
upper() and lower()

```python
s = pd.Series(['Hello', 'World', 'Python'])

print(s.str.upper())  # HELLO, WORLD, PYTHON
print(s.str.lower())  # hello, world, python
```
capitalize()
Capitalize the first character and lowercase the rest.

```python
s = pd.Series(['alice', 'JOHN', 'mARY', 'bOB'])
print(s.str.capitalize())
```

```
0    Alice
1     John
2     Mary
3      Bob
dtype: object
```
Practical Example (LeetCode 1667): Fix names in a table.
```python
users = pd.DataFrame({
    'user_id': [1, 2, 3, 4],
    'name': ['alice', 'john', 'MARY', 'bOB']
})

users['name'] = users['name'].str.capitalize()
print(users)
```

```
   user_id   name
0        1  Alice
1        2   John
2        3   Mary
3        4    Bob
```
title()
Capitalize the first character of each word.

```python
s = pd.Series(['hello world', 'PYTHON PANDAS'])
print(s.str.title())
```

```
0      Hello World
1    Python Pandas
dtype: object
```
swapcase()

```python
s = pd.Series(['Hello World'])
print(s.str.swapcase())  # hELLO wORLD
```
String Length
len()

```python
s = pd.Series(['Hello', 'World', 'Python'])
print(s.str.len())
```

```
0    5
1    5
2    6
dtype: int64
```
Practical Example (LeetCode 1683): Find tweets longer than 15 characters.
```python
tweets = pd.DataFrame({
    'tweet_id': [1, 2, 3, 4],
    'content': ['Hello world!', 'This is a very long tweet!', 'Short', 'Another tweet']
})

# Filter tweets with content > 15 characters
long_tweets = tweets[tweets['content'].str.len() > 15]
print(long_tweets)
```

```
   tweet_id                     content
1         2  This is a very long tweet!
```
Substring Operations
Indexing with []

```python
s = pd.Series(['Alice', 'Bob', 'Charlie'])

print(s.str[0])   # First character: A, B, C
print(s.str[-1])  # Last character: e, b, e
print(s.str[:3])  # First 3 characters: Ali, Bob, Cha
```
slice()

```python
s = pd.Series(['Apple', 'Banana', 'Cherry'])
print(s.str.slice(0, 3))  # App, Ban, Che
```
get()
Get the element at a given position. Like indexing with [], it returns NaN for missing values and for positions past the end of a string.

```python
s = pd.Series(['Alice', None, 'Charlie'])
print(s.str.get(0))  # A, NaN, C
```
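As a small sketch of the out-of-range behaviour described above (the sample names are illustrative, not from a real dataset), asking for a position that a shorter string does not have yields a missing value rather than an IndexError:

```python
import pandas as pd

# Illustrative data: one long name, one short name, one missing value
s = pd.Series(['Alice', 'Bo', None])

# Position 2 exists only in 'Alice'; 'Bo' is too short and None is missing
third = s.str.get(2)

print(third.iloc[0])         # i
print(third.isna().tolist())  # [False, True, True]
```

This makes get() convenient after split(), where rows may have different numbers of parts.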
Searching and Matching
contains()
Check if pattern is contained in each string.

```python
s = pd.Series(['apple', 'banana', 'cherry', 'date'])
print(s.str.contains('an'))
```

```
0    False
1     True
2    False
3    False
dtype: bool
```
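One detail worth keeping in mind is that contains() interprets the pattern as a regular expression by default. The following sketch, using made-up strings, contrasts the default with regex=False, which treats the pattern as a literal substring:

```python
import pandas as pd

# Illustrative strings: only the first one contains a literal 'a.b'
s = pd.Series(['a.b', 'axb', 'ab'])

as_regex = s.str.contains('a.b')                 # '.' matches any single character
as_literal = s.str.contains('a.b', regex=False)  # '.' is a literal dot

print(as_regex.tolist())    # [True, True, False]
print(as_literal.tolist())  # [True, False, False]
```

When searching for text that contains characters such as '.', '$' or '(', either pass regex=False or escape the pattern.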
Practical Example (LeetCode 620): Find movies without "boring" in description.
```python
cinema = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'description': ['A thrilling adventure', 'A boring documentary',
                    'An exciting drama', 'A slow and boring tale',
                    'A great comedy']
})

# Filter out boring movies (case-insensitive)
not_boring = cinema[~cinema['description'].str.contains('boring', case=False)]
print(not_boring)
```

```
   id            description
0   1  A thrilling adventure
2   3      An exciting drama
4   5         A great comedy
```
startswith() and endswith()

```python
s = pd.Series(['Apple', 'Apricot', 'Banana', 'Avocado'])

print(s.str.startswith('A'))  # True, True, False, True
print(s.str.endswith('a'))    # False, False, True, False
```
Practical Example (LeetCode 1873): Find employees whose names don't start with 'M'.
```python
employees = pd.DataFrame({
    'employee_id': [1, 2, 3, 4, 5],
    'name': ['Alice', 'Bob', 'Mike', 'Molly', 'Eve']
})

# Employees with odd ID and name not starting with 'M'
bonus_eligible = employees[
    (employees['employee_id'] % 2 != 0) &
    (~employees['name'].str.startswith('M'))
]
print(bonus_eligible)
```

```
   employee_id   name
0            1  Alice
4            5    Eve
```
match()
Match a regular expression pattern at the start of each string.

```python
s = pd.Series(['alice@example.com', 'bob@test.org', 'invalid-email'])

# Check valid email format (the dot before the TLD is escaped)
pattern = r'^[a-zA-Z][a-zA-Z0-9._-]*@[a-zA-Z]+\.[a-zA-Z]+$'
print(s.str.match(pattern))
```

```
0     True
1     True
2    False
dtype: bool
```
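Because match() anchors only at the start of the string, a pattern without a trailing $ also accepts strings with extra characters after the match. The sketch below, with illustrative strings, compares match() with fullmatch() (which requires the whole string to match) and contains() (which searches anywhere):

```python
import pandas as pd

# Illustrative strings: exact, with a suffix, with a prefix
s = pd.Series(['abc', 'abc123', 'xabc'])
pattern = r'abc'

print(s.str.match(pattern).tolist())      # [True, True, False]
print(s.str.fullmatch(pattern).tolist())  # [True, False, False]
print(s.str.contains(pattern).tolist())   # [True, True, True]
```

For validation tasks such as the email check, fullmatch() is an alternative to writing the ^ and $ anchors by hand.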
Practical Example (LeetCode 1517): Find users with valid emails.
```python
users = pd.DataFrame({
    'user_id': [1, 2, 3],
    'mail': ['alice@leetcode.com', 'bob@leet?com.com', '123@example.com']
})

# Valid email pattern: starts with a letter, domain is @leetcode.com
# (the dot is escaped so it matches only a literal '.')
valid_pattern = r'^[A-Za-z][A-Za-z0-9_.-]*@leetcode\.com$'
valid_users = users[users['mail'].str.match(valid_pattern)]
print(valid_users)
```
find() and rfind()
Find the position of a substring. Both return -1 when the substring is not found.

```python
s = pd.Series(['hello world', 'world hello', 'no match'])
print(s.str.find('world'))  # 6, 0, -1
```
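The section title mentions rfind() but the example only uses find(). A minimal sketch of the difference, with illustrative strings: find() reports the first occurrence, rfind() the last.

```python
import pandas as pd

# Illustrative strings with two, one, and zero hyphens
s = pd.Series(['a-b-c', 'x-y', 'none'])

first = s.str.find('-')   # index of the first '-'
last = s.str.rfind('-')   # index of the last '-'

print(first.tolist())  # [1, 1, -1]
print(last.tolist())   # [3, 1, -1]
```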
String Replacement
replace()
Replace occurrences of a pattern.

```python
s = pd.Series(['apple-pie', 'banana-split', 'cherry-tart'])
print(s.str.replace('-', '_'))
```

```
0       apple_pie
1    banana_split
2     cherry_tart
dtype: object
```
With Regex

```python
s = pd.Series(['price: $100', 'cost: $200', 'value: $300'])

# Mask dollar amounts ('$' is escaped; unescaped it would mean end-of-string)
print(s.str.replace(r'\$\d+', 'XXX', regex=True))
```

```
0    price: XXX
1     cost: XXX
2    value: XXX
dtype: object
```
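Regex replacement can also reuse captured groups in the replacement string. As a sketch with made-up ISO dates, the backreferences \1, \2 and \3 reorder year, month and day:

```python
import pandas as pd

# Illustrative ISO-formatted date strings
s = pd.Series(['2024-01-15', '2023-12-31'])

# Capture year, month, day and emit them as day/month/year
swapped = s.str.replace(r'(\d{4})-(\d{2})-(\d{2})', r'\3/\2/\1', regex=True)

print(swapped.tolist())  # ['15/01/2024', '31/12/2023']
```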
Splitting and Joining
split()

```python
s = pd.Series(['a-b-c', 'x-y-z'])
print(s.str.split('-'))
```

```
0    [a, b, c]
1    [x, y, z]
dtype: object
```
split() with expand
Create multiple columns from the split.

```python
s = pd.Series(['John Doe', 'Jane Smith', 'Bob Wilson'])
names = s.str.split(' ', expand=True)
names.columns = ['first', 'last']
print(names)
```

```
   first    last
0   John     Doe
1   Jane   Smith
2    Bob  Wilson
```
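The two-column assignment above works because every name has exactly one space. For data where that is not guaranteed, one option is the n parameter, which caps the number of splits. The names and column labels below are illustrative:

```python
import pandas as pd

# Illustrative names with one, two, and zero spaces
s = pd.Series(['John Doe', 'Mary Ann Smith', 'Cher'])

# n=1 splits only at the first space, so there are at most two columns
parts = s.str.split(' ', n=1, expand=True)
parts.columns = ['first', 'rest']

print(parts['first'].tolist())  # ['John', 'Mary', 'Cher']
print(parts['rest'].iloc[1])    # Ann Smith
```

Rows with no separator, such as 'Cher', get a missing value in the second column.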
join()
Join the lists in each element.

```python
s = pd.Series([['a', 'b', 'c'], ['x', 'y', 'z']])
print(s.str.join('-'))
```

```
0    a-b-c
1    x-y-z
dtype: object
```
cat()
Concatenate all strings in the Series into a single string, joined by a separator.

```python
s = pd.Series(['A', 'B', 'C'])
print(s.str.cat(sep='-'))  # A-B-C
```
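cat() can also combine two Series element by element when another Series is passed as the first argument. A short sketch with illustrative names:

```python
import pandas as pd

# Illustrative first and last names, aligned by index
first = pd.Series(['John', 'Jane'])
last = pd.Series(['Doe', 'Smith'])

# With another Series, cat() concatenates row by row instead of collapsing
full = first.str.cat(last, sep=' ')

print(full.tolist())  # ['John Doe', 'Jane Smith']
```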
Stripping Whitespace
strip(), lstrip(), rstrip()

```python
s = pd.Series(['  hello  ', '  world  '])

print(s.str.strip())   # 'hello', 'world'
print(s.str.lstrip())  # 'hello  ', 'world  '
print(s.str.rstrip())  # '  hello', '  world'
```
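These methods are not limited to whitespace: like their Python counterparts, they accept a set of characters to remove. A minimal sketch with made-up decorated strings:

```python
import pandas as pd

# Illustrative strings wrapped in punctuation
s = pd.Series(['**hello**', '--world--'])

both = s.str.strip('*-')        # remove '*' and '-' from both ends
left_only = s.str.lstrip('*-')  # remove them from the left end only

print(both.tolist())       # ['hello', 'world']
print(left_only.tolist())  # ['hello**', 'world--']
```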
Padding
pad(), ljust(), rjust(), center()

```python
s = pd.Series(['a', 'bb', 'ccc'])

print(s.str.pad(5, fillchar='_'))     # ____a, ___bb, __ccc (pads on the left by default)
print(s.str.ljust(5, fillchar='_'))   # a____, bb___, ccc__
print(s.str.center(5, fillchar='_'))  # __a__, __bb_, _ccc_
```
zfill()
Pad with zeros on the left.

```python
s = pd.Series(['1', '12', '123'])
print(s.str.zfill(5))  # 00001, 00012, 00123
```
Extracting Data
extract()
Extract capture groups from a regex pattern, one column per group.

```python
s = pd.Series(['A-123', 'B-456', 'C-789'])

# Extract letter and number
extracted = s.str.extract(r'([A-Z])-(\d+)')
extracted.columns = ['letter', 'number']
print(extracted)
```

```
   letter  number
0       A     123
1       B     456
2       C     789
```
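Instead of renaming the columns afterwards, the pattern itself can name them through named groups. The sketch below reuses the same kind of codes and adds one non-matching row (an illustrative addition) to show what happens when the pattern fails:

```python
import pandas as pd

# Illustrative codes; the last entry does not fit the pattern
s = pd.Series(['A-123', 'B-456', 'no match'])

# Named groups (?P<name>...) become the column names
extracted = s.str.extract(r'(?P<letter>[A-Z])-(?P<number>\d+)')

print(list(extracted.columns))         # ['letter', 'number']
print(extracted['number'].iloc[0])     # 123
print(extracted.iloc[2].isna().all())  # True
```

Note that the extracted numbers are still strings; convert them with astype() or pd.to_numeric() if numeric values are needed.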
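extractall()
Extract all matches. The result is a DataFrame with a MultiIndex: the original index plus a match level that numbers the matches within each string.

```python
s = pd.Series(['a1b2c3', 'x9y8z7'])
print(s.str.extractall(r'(\d)'))
```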
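To make the MultiIndex concrete, the following sketch inspects the shape of the result and then collapses the matches back to one value per original row. The regrouping step is one possible follow-up, not something the text above prescribes:

```python
import pandas as pd

s = pd.Series(['a1b2c3', 'x9y8z7'])
matches = s.str.extractall(r'(\d)')

# One row per match: 2 strings x 3 digits each, one capture group
print(matches.shape)           # (6, 1)
print(matches.index.names[1])  # match

# Group by the original row label (level 0) and join the digits
digits_per_row = matches[0].groupby(level=0).agg(lambda g: ''.join(g))
print(digits_per_row.tolist())  # ['123', '987']
```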
Handling Missing Values
The str accessor propagates missing values instead of raising an error.

```python
s = pd.Series(['Hello', None, 'World'])

print(s.str.upper())
```

```
0    HELLO
1      NaN
2    WORLD
dtype: object
```
Explicit NA Handling

```python
# Some methods have an na parameter
s = pd.Series(['Apple', None, 'Banana'])
print(s.str.contains('a', na=False))  # False for NaN
print(s.str.contains('a', na=True))   # True for NaN
```
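The na parameter matters most when the result is used as a filter. Depending on the dtype and pandas version, a mask that still contains missing values may not be usable for boolean indexing, so filling it explicitly is the safer habit. A minimal sketch with the same kind of data:

```python
import pandas as pd

s = pd.Series(['Apple', None, 'Banana'])

# na=False guarantees a pure boolean mask, so it is safe for filtering
mask = s.str.contains('an', na=False)

print(mask.tolist())     # [False, False, True]
print(s[mask].tolist())  # ['Banana']
```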
Method Chaining
String methods can be chained together. Each step returns a Series, so .str must be repeated before every string method.

```python
s = pd.Series(['  HELLO WORLD  ', '  PYTHON PANDAS  '])

result = (
    s
    .str.strip()
    .str.lower()
    .str.replace(' ', '_')
)
print(result)
```

```
0      hello_world
1    python_pandas
dtype: object
```
Financial Example: Ticker Cleaning

```python
# Raw ticker data with inconsistencies
tickers = pd.Series([' aapl ', 'MSFT', 'googl ', ' AMZN'])

# Clean and standardize
clean_tickers = (
    tickers
    .str.strip()
    .str.upper()
)
print(clean_tickers)
```

```
0     AAPL
1     MSFT
2    GOOGL
3     AMZN
dtype: object
```
Summary of Common Methods

| Method | Purpose | Example |
|---|---|---|
| upper(), lower() | Case conversion | s.str.upper() |
| capitalize() | Capitalize first char | s.str.capitalize() |
| len() | String length | s.str.len() |
| contains() | Pattern search | s.str.contains('pat') |
| startswith() | Prefix check | s.str.startswith('A') |
| replace() | String replacement | s.str.replace('a', 'b') |
| split() | Split strings | s.str.split('-') |
| strip() | Remove whitespace | s.str.strip() |
| extract() | Regex extraction | s.str.extract(r'(\d+)') |
| match() | Regex matching | s.str.match(r'^[A-Z]') |
Exercises
Exercise 1.
Given a Series of full names like ['Alice Smith', 'Bob Jones', 'Carol Lee'], use the .str accessor to extract only the first names (everything before the space) into a new Series.
Solution to Exercise 1
Use .str.split() and access the first element.
```python
import pandas as pd

names = pd.Series(['Alice Smith', 'Bob Jones', 'Carol Lee'])
first_names = names.str.split(' ').str[0]
print(first_names)
```
Exercise 2.
Given a Series of email addresses, use .str.contains() to create a boolean mask that identifies emails from the domain 'gmail.com'. Then filter the Series to show only Gmail addresses.
Solution to Exercise 2
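Use .str.contains() with case=False for safety, and regex=False so the dot in 'gmail.com' is treated as a literal character rather than a regex wildcard.

```python
import pandas as pd

emails = pd.Series(['alice@gmail.com', 'bob@yahoo.com', 'carol@gmail.com', 'dave@outlook.com'])
gmail_mask = emails.str.contains('gmail.com', case=False, regex=False)
gmail_only = emails[gmail_mask]
print(gmail_only)
```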
Exercise 3.
Given a Series of messy product codes like [' AB-123 ', 'cd-456', ' EF-789'], chain .str methods to (1) strip whitespace, (2) convert to uppercase, and (3) replace the hyphen with an underscore. The result should be ['AB_123', 'CD_456', 'EF_789'].
Solution to Exercise 3
Chain strip(), upper(), and replace().
```python
import pandas as pd

codes = pd.Series([' AB-123 ', 'cd-456', ' EF-789'])
cleaned = codes.str.strip().str.upper().str.replace('-', '_')
print(cleaned)
```