sub and split¶
re.sub() — Substitution¶
re.sub(pattern, repl, string, count=0, flags=0) replaces all occurrences of the pattern with a replacement string:
import re
text = "Hello World Hello Python"
re.sub(r'Hello', 'Hi', text)
# 'Hi World Hi Python'
Basic Replacement¶
import re
# Remove all digits
re.sub(r'\d', '', 'Room 404, Floor 3')
# 'Room , Floor '
# Replace whitespace sequences with a single space
re.sub(r'\s+', ' ', 'too many spaces')
# 'too many spaces'
# Remove HTML tags
re.sub(r'<[^>]+>', '', '<p>Hello <b>world</b></p>')
# 'Hello world'
Limiting Replacements with count¶
The count parameter limits the number of replacements:
import re
text = "aaa bbb ccc aaa bbb"
re.sub(r'aaa', 'xxx', text, count=1)
# 'xxx bbb ccc aaa bbb'
Backreferences in Replacement¶
The replacement string can reference captured groups using \1, \2, etc., or \g<name> for named groups:
import re
# Swap first and last name
text = "Smith, John"
re.sub(r'(\w+), (\w+)', r'\2 \1', text)
# 'John Smith'
# Using named groups
re.sub(r'(?P<last>\w+), (?P<first>\w+)', r'\g<first> \g<last>', text)
# 'John Smith'
# Surround numbers with brackets
re.sub(r'(\d+)', r'[\1]', 'error 404 at line 52')
# 'error [404] at line [52]'
Function as Replacement¶
When repl is a callable, it receives a Match object and returns the replacement string. This enables dynamic replacements:
import re
# Double all numbers
def double_match(match):
return str(int(match.group()) * 2)
re.sub(r'\d+', double_match, 'score: 42, bonus: 15')
# 'score: 84, bonus: 30'
# Lambda version
re.sub(r'\d+', lambda m: str(int(m.group()) * 2), 'score: 42, bonus: 15')
# 'score: 84, bonus: 30'
More complex transformations:
import re
# Convert snake_case to camelCase
def to_camel(match):
return match.group(1).upper()
text = "my_variable_name"
re.sub(r'_([a-z])', to_camel, text)
# 'myVariableName'
# Censor certain words
words_to_censor = {'bad', 'ugly'}
def censor(match):
word = match.group()
if word.lower() in words_to_censor:
return '*' * len(word)
return word
re.sub(r'\b\w+\b', censor, 'The bad and the ugly')
# 'The *** and the ****'
re.subn()¶
re.subn() works like re.sub() but returns a tuple of (new_string, count):
import re
result, count = re.subn(r'\d+', 'NUM', 'error 404 at line 52')
print(result) # 'error NUM at line NUM'
print(count) # 2
re.split() — Splitting¶
re.split(pattern, string, maxsplit=0, flags=0) splits a string by pattern matches. Unlike str.split(), it accepts regex patterns as delimiters.
Basic Splitting¶
import re
# Split on one or more whitespace characters
re.split(r'\s+', 'hello world python')
# ['hello', 'world', 'python']
# Split on commas with optional surrounding whitespace
re.split(r'\s*,\s*', 'a, b ,c , d')
# ['a', 'b', 'c', 'd']
# Split on multiple delimiter types
re.split(r'[;,\s]+', 'a,b; c d,e')
# ['a', 'b', 'c', 'd', 'e']
re.split() vs str.split()¶
# str.split() — fixed string delimiter
'a, b, c'.split(', ')
# ['a', 'b', 'c']
# But fails with inconsistent spacing
'a, b ,c , d'.split(', ')
# ['a', 'b ,c ', 'd'] — messy!
# re.split() — pattern delimiter handles variation
import re
re.split(r'\s*,\s*', 'a, b ,c , d')
# ['a', 'b', 'c', 'd'] — clean
Limiting Splits with maxsplit¶
import re
re.split(r'\s+', 'one two three four five', maxsplit=2)
# ['one', 'two', 'three four five']
Keeping the Delimiters¶
When the pattern contains a capturing group, the delimiter text is included in the result:
import re
# Without capturing group — delimiters dropped
re.split(r'\d+', 'abc123def456ghi')
# ['abc', 'def', 'ghi']
# With capturing group — delimiters kept
re.split(r'(\d+)', 'abc123def456ghi')
# ['abc', '123', 'def', '456', 'ghi']
This is useful when you need to reconstruct the original string or process delimiters:
import re
# Split sentences but keep the punctuation
text = "Hello! How are you? I'm fine."
parts = re.split(r'([.!?])\s*', text)
print(parts)
# ['Hello', '!', 'How are you', '?', "I'm fine", '.', '']
Edge Cases¶
import re
# Empty strings at boundaries
re.split(r',', ',a,,b,')
# ['', 'a', '', 'b', '']
# Pattern at start/end produces empty strings
re.split(r'\d+', '123abc456')
# ['', 'abc', '']
Practical Examples¶
Cleaning Text Data¶
import re
raw = " Hello, World! This is messy text. "
# Normalize whitespace
cleaned = re.sub(r'\s+', ' ', raw).strip()
print(cleaned) # 'Hello, World! This is messy text.'
Parsing CSV with Quoted Fields¶
import re
line = 'John,"New York, NY",42,"He said ""hello"""'
# Split on commas not inside quotes (simplified)
fields = re.findall(r'(?:"([^"]*(?:""[^"]*)*)"|([^,]+))', line)
print(fields)
Redacting Sensitive Information¶
import re
text = "SSN: 123-45-6789, Phone: 555-123-4567"
# Redact SSN (keep last 4)
redacted = re.sub(r'\b\d{3}-\d{2}-(\d{4})\b', r'***-**-\1', text)
print(redacted) # 'SSN: ***-**-6789, Phone: 555-123-4567'
Tokenizing Expressions¶
import re
expr = "3 + 42 * (x - 7)"
tokens = re.findall(r'\d+|[a-zA-Z_]\w*|[+\-*/()]', expr)
print(tokens)
# ['3', '+', '42', '*', '(', 'x', '-', '7', ')']
Summary¶
| Function | Purpose | Key Feature |
|---|---|---|
re.sub() |
Replace matches | Supports backreferences and callable replacement |
re.subn() |
Replace matches | Returns (result, count) tuple |
re.split() |
Split by pattern | More flexible than str.split() |
count param |
Limit replacements | re.sub(..., count=1) for first only |
| Capturing in split | Keep delimiters | re.split(r'(\d+)', ...) |