sub and split

re.sub() — Substitution

re.sub(pattern, repl, string, count=0, flags=0) replaces every non-overlapping occurrence of the pattern with a replacement and returns the new string; the original string is left unchanged:

import re

text = "Hello World Hello Python"
re.sub(r'Hello', 'Hi', text)
# 'Hi World Hi Python'
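
The flags argument is easiest to pass by keyword, because the fourth positional slot in the signature above is count, so a flag passed positionally would be read as a replacement limit. A minimal sketch, with a sample string made up for illustration:

```python
import re

text = "Hello World hello Python"

# Passing flags by keyword keeps it from being mistaken for count,
# which is the fourth positional parameter.
case_insensitive = re.sub(r'hello', 'Hi', text, flags=re.IGNORECASE)
print(case_insensitive)  # 'Hi World Hi Python'
```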

Basic Replacement

import re

# Remove all digits
re.sub(r'\d', '', 'Room 404, Floor 3')
# 'Room , Floor '

# Replace whitespace sequences with a single space
re.sub(r'\s+', ' ', 'too   many     spaces')
# 'too many spaces'

# Remove HTML tags
re.sub(r'<[^>]+>', '', '<p>Hello <b>world</b></p>')
# 'Hello world'

Limiting Replacements with count

The count parameter limits the number of replacements:

import re

text = "aaa bbb ccc aaa bbb"
re.sub(r'aaa', 'xxx', text, count=1)
# 'xxx bbb ccc aaa bbb'

Backreferences in Replacement

The replacement string can reference captured groups using \1, \2, etc., or \g<name> for named groups. Write the replacement as a raw string so the backslashes reach the regex engine intact:

import re

# Swap first and last name
text = "Smith, John"
re.sub(r'(\w+), (\w+)', r'\2 \1', text)
# 'John Smith'

# Using named groups
re.sub(r'(?P<last>\w+), (?P<first>\w+)', r'\g<first> \g<last>', text)
# 'John Smith'

# Surround numbers with brackets
re.sub(r'(\d+)', r'[\1]', 'error 404 at line 52')
# 'error [404] at line [52]'
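
One case the numbered form handles poorly is a group reference followed directly by a literal digit. The sketch below uses the \g<1> form to mark where the group number ends; the sample string is made up for illustration:

```python
import re

# r'\10' would be read as a reference to group 10, so \g<1> is used
# to mark where the group number ends before the literal '0'.
appended = re.sub(r'(\d+)', r'\g<1>0', 'item 5, item 12')
print(appended)  # 'item 50, item 120'
```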

Function as Replacement

When repl is a callable, it is called once per match with a Match object, and the string it returns is used as the replacement. This enables dynamic replacements:

import re

# Double all numbers
def double_match(match):
    return str(int(match.group()) * 2)

re.sub(r'\d+', double_match, 'score: 42, bonus: 15')
# 'score: 84, bonus: 30'

# Lambda version
re.sub(r'\d+', lambda m: str(int(m.group()) * 2), 'score: 42, bonus: 15')
# 'score: 84, bonus: 30'

More complex transformations:

import re

# Convert snake_case to camelCase
def to_camel(match):
    return match.group(1).upper()

text = "my_variable_name"
re.sub(r'_([a-z])', to_camel, text)
# 'myVariableName'

# Censor certain words
words_to_censor = {'bad', 'ugly'}
def censor(match):
    word = match.group()
    if word.lower() in words_to_censor:
        return '*' * len(word)
    return word

re.sub(r'\b\w+\b', censor, 'The bad and the ugly')
# 'The *** and the ****'
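
A callable also makes it possible to drive several replacements from a lookup table in a single pass. The following is a sketch only: the abbreviations dictionary and the expand helper are hypothetical names invented for this example.

```python
import re

# Hypothetical lookup table, made up for this example
abbreviations = {'btw': 'by the way', 'imo': 'in my opinion'}

# re.escape guards against keys that contain regex metacharacters
pattern = r'\b(' + '|'.join(map(re.escape, abbreviations)) + r')\b'

def expand(match):
    # Look up the matched abbreviation, ignoring case
    return abbreviations[match.group().lower()]

expanded = re.sub(pattern, expand, 'BTW, imo this works', flags=re.IGNORECASE)
print(expanded)  # 'by the way, in my opinion this works'
```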

re.subn()

re.subn() works like re.sub() but returns a tuple of (new_string, count):

import re

result, count = re.subn(r'\d+', 'NUM', 'error 404 at line 52')
print(result)  # 'error NUM at line NUM'
print(count)   # 2
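
The count is handy when the caller needs to know whether anything was replaced at all. A small sketch, where strip_comments is a hypothetical helper and the '#' comment syntax is an assumption chosen for the example:

```python
import re

def strip_comments(line):
    # Hypothetical helper: removes a trailing '#' comment and
    # reports whether one was found.
    cleaned, n = re.subn(r'\s*#.*$', '', line)
    return cleaned, n > 0

print(strip_comments('x = 1  # set x'))  # ('x = 1', True)
print(strip_comments('y = 2'))           # ('y = 2', False)
```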

re.split() — Splitting

re.split(pattern, string, maxsplit=0, flags=0) splits a string by pattern matches. Unlike str.split(), it accepts regex patterns as delimiters.

Basic Splitting

import re

# Split on one or more whitespace characters
re.split(r'\s+', 'hello   world   python')
# ['hello', 'world', 'python']

# Split on commas with optional surrounding whitespace
re.split(r'\s*,\s*', 'a, b ,c , d')
# ['a', 'b', 'c', 'd']

# Split on multiple delimiter types
re.split(r'[;,\s]+', 'a,b; c  d,e')
# ['a', 'b', 'c', 'd', 'e']

re.split() vs str.split()

# str.split() — fixed string delimiter
'a, b, c'.split(', ')
# ['a', 'b', 'c']

# But fails with inconsistent spacing
'a, b ,c , d'.split(', ')
# ['a', 'b ,c ', 'd']  — messy!

# re.split() — pattern delimiter handles variation
import re
re.split(r'\s*,\s*', 'a, b ,c , d')
# ['a', 'b', 'c', 'd']  — clean

Limiting Splits with maxsplit

import re

re.split(r'\s+', 'one two three four five', maxsplit=2)
# ['one', 'two', 'three four five']
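
A typical use of maxsplit is peeling a fixed number of leading fields off a line while leaving the remainder untouched. The log line below is made up for illustration:

```python
import re

# Made-up log line for illustration
line = "2024-01-15 ERROR Disk quota exceeded on /dev/sda1"

# maxsplit=2 splits off the first two fields and leaves the message intact
date, level, message = re.split(r'\s+', line, maxsplit=2)
print(level)    # 'ERROR'
print(message)  # 'Disk quota exceeded on /dev/sda1'
```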

Keeping the Delimiters

When the pattern contains a capturing group, the text matched by the group is included in the result list, between the pieces it separated:

import re

# Without capturing group — delimiters dropped
re.split(r'\d+', 'abc123def456ghi')
# ['abc', 'def', 'ghi']

# With capturing group — delimiters kept
re.split(r'(\d+)', 'abc123def456ghi')
# ['abc', '123', 'def', '456', 'ghi']

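This is useful when you need to process the delimiters or pair them back up with the text they ended. Only the captured part is kept: in the example below, the whitespace matched by \s* sits outside the group and is still dropped.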

import re

# Split sentences but keep the punctuation
text = "Hello! How are you? I'm fine."
parts = re.split(r'([.!?])\s*', text)
print(parts)
# ['Hello', '!', 'How are you', '?', "I'm fine", '.', '']
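
One way to put each sentence back together with its punctuation is to pair the even-indexed pieces with the odd-indexed delimiters. This is a sketch of one possible approach, not the only one:

```python
import re

text = "Hello! How are you? I'm fine."
parts = re.split(r'([.!?])\s*', text)

# Even indexes hold the text, odd indexes hold the captured punctuation.
# zip stops at the shorter sequence, so the trailing '' is dropped.
sentences = [body + mark for body, mark in zip(parts[0::2], parts[1::2])]
print(sentences)
# ['Hello!', 'How are you?', "I'm fine."]
```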

Edge Cases

import re

# Empty strings at boundaries
re.split(r',', ',a,,b,')
# ['', 'a', '', 'b', '']

# Pattern at start/end produces empty strings
re.split(r'\d+', '123abc456')
# ['', 'abc', '']
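
If the empty strings are not wanted, they can be filtered out after the split. Note that this sketch also discards fields that were genuinely empty in the input, which may or may not be acceptable:

```python
import re

parts = re.split(r',', ',a,,b,')

# Keep only the non-empty pieces
non_empty = [p for p in parts if p]
print(non_empty)  # ['a', 'b']
```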

Practical Examples

Cleaning Text Data

import re

raw = "  Hello,   World!  This  is   messy   text.  "

# Normalize whitespace
cleaned = re.sub(r'\s+', ' ', raw).strip()
print(cleaned)  # 'Hello, World! This is messy text.'

Parsing CSV with Quoted Fields

import re

line = 'John,"New York, NY",42,"He said ""hello"""'
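# Match each field: a quoted field (may contain commas and doubled quotes)
# or an unquoted run of non-comma characters. Simplified: empty fields are skipped.
fields = re.findall(r'(?:"([^"]*(?:""[^"]*)*)"|([^,]+))', line)
print(fields)
# [('', 'John'), ('New York, NY', ''), ('', '42'), ('He said ""hello""', '')]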
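
Because the pattern has two capturing groups, re.findall() returns one (quoted, unquoted) tuple per field. A sketch of collapsing those tuples into plain strings follows; the variable names are chosen for this example, and for real CSV data the standard library csv module is likely the safer choice.

```python
import re

line = 'John,"New York, NY",42,"He said ""hello"""'
pairs = re.findall(r'(?:"([^"]*(?:""[^"]*)*)"|([^,]+))', line)

# Each tuple is (quoted_content, unquoted_content); for the fields in this
# sample exactly one is non-empty. Doubled quotes become single quotes.
fields = [quoted.replace('""', '"') if quoted else unquoted
          for quoted, unquoted in pairs]
print(fields)
# ['John', 'New York, NY', '42', 'He said "hello"']
```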

Redacting Sensitive Information

import re

text = "SSN: 123-45-6789, Phone: 555-123-4567"

# Redact SSN (keep last 4)
redacted = re.sub(r'\b\d{3}-\d{2}-(\d{4})\b', r'***-**-\1', text)
print(redacted)  # 'SSN: ***-**-6789, Phone: 555-123-4567'

Tokenizing Expressions

import re

expr = "3 + 42 * (x - 7)"
tokens = re.findall(r'\d+|[a-zA-Z_]\w*|[+\-*/()]', expr)
print(tokens)
# ['3', '+', '42', '*', '(', 'x', '-', '7', ')']
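
re.findall() silently skips any character that matches none of the alternatives. A rough way to detect such characters is to delete every token and all whitespace with re.sub() and look at what is left. In this sketch TOKEN and unrecognized are hypothetical names:

```python
import re

TOKEN = r'\d+|[a-zA-Z_]\w*|[+\-*/()]'

def unrecognized(expr):
    # Delete every token and all whitespace; anything left was not matched
    return re.sub(TOKEN + r'|\s+', '', expr)

print(repr(unrecognized("3 + 42 * (x - 7)")))  # ''
print(repr(unrecognized("3 $ 4 @ y")))         # '$@'
```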

Summary

Function           | Purpose            | Key Feature
re.sub()           | Replace matches    | Supports backreferences and callable replacement
re.subn()          | Replace matches    | Returns (result, count) tuple
re.split()         | Split by pattern   | More flexible than str.split()
count param        | Limit replacements | re.sub(..., count=1) for first only
Capturing in split | Keep delimiters    | re.split(r'(\d+)', ...)