Skip to content

sub and split

Mental Model

re.sub is find-and-replace powered by regex — it locates every match and swaps it with a replacement string (or the return value of a callback function). re.split is str.split on steroids — it splits a string on a regex pattern instead of a fixed delimiter. Together they handle the two most common text-transformation tasks: replacing and splitting.

re.sub() — Substitution

re.sub(pattern, repl, string, count=0, flags=0) replaces all occurrences of the pattern with a replacement string:

```python import re

text = "Hello World Hello Python" re.sub(r'Hello', 'Hi', text)

'Hi World Hi Python'

```

Basic Replacement

```python import re

Remove all digits

re.sub(r'\d', '', 'Room 404, Floor 3')

'Room , Floor '

Replace whitespace sequences with a single space

re.sub(r'\s+', ' ', 'too many spaces')

'too many spaces'

Remove HTML tags

re.sub(r'<[^>]+>', '', '

Hello world

')

'Hello world'

```

Limiting Replacements with count

The count parameter limits the number of replacements:

```python import re

text = "aaa bbb ccc aaa bbb" re.sub(r'aaa', 'xxx', text, count=1)

'xxx bbb ccc aaa bbb'

```

Backreferences in Replacement

The replacement string can reference captured groups using \1, \2, etc., or \g<name> for named groups:

```python import re

Swap first and last name

text = "Smith, John" re.sub(r'(\w+), (\w+)', r'\2 \1', text)

'John Smith'

Using named groups

re.sub(r'(?P\w+), (?P\w+)', r'\g \g', text)

'John Smith'

Surround numbers with brackets

re.sub(r'(\d+)', r'[\1]', 'error 404 at line 52')

'error [404] at line [52]'

```

Function as Replacement

When repl is a callable, it receives a Match object and returns the replacement string. This enables dynamic replacements:

```python import re

Double all numbers

def double_match(match): return str(int(match.group()) * 2)

re.sub(r'\d+', double_match, 'score: 42, bonus: 15')

'score: 84, bonus: 30'

Lambda version

re.sub(r'\d+', lambda m: str(int(m.group()) * 2), 'score: 42, bonus: 15')

'score: 84, bonus: 30'

```

More complex transformations:

```python import re

Convert snake_case to camelCase

def to_camel(match): return match.group(1).upper()

text = "my_variable_name" re.sub(r'_([a-z])', to_camel, text)

'myVariableName'

Censor certain words

words_to_censor = {'bad', 'ugly'} def censor(match): word = match.group() if word.lower() in words_to_censor: return '*' * len(word) return word

re.sub(r'\b\w+\b', censor, 'The bad and the ugly')

'The *** and the ****'

```

re.subn()

re.subn() works like re.sub() but returns a tuple of (new_string, count):

```python import re

result, count = re.subn(r'\d+', 'NUM', 'error 404 at line 52') print(result) # 'error NUM at line NUM' print(count) # 2 ```

re.split() — Splitting

re.split(pattern, string, maxsplit=0, flags=0) splits a string by pattern matches. Unlike str.split(), it accepts regex patterns as delimiters.

Basic Splitting

```python import re

Split on one or more whitespace characters

re.split(r'\s+', 'hello world python')

['hello', 'world', 'python']

Split on commas with optional surrounding whitespace

re.split(r'\s,\s', 'a, b ,c , d')

['a', 'b', 'c', 'd']

Split on multiple delimiter types

re.split(r'[;,\s]+', 'a,b; c d,e')

['a', 'b', 'c', 'd', 'e']

```

re.split() vs str.split()

```python

str.split() — fixed string delimiter

'a, b, c'.split(', ')

['a', 'b', 'c']

But fails with inconsistent spacing

'a, b ,c , d'.split(', ')

['a', 'b ,c ', 'd'] — messy!

re.split() — pattern delimiter handles variation

import re re.split(r'\s,\s', 'a, b ,c , d')

['a', 'b', 'c', 'd'] — clean

```

Limiting Splits with maxsplit

```python import re

re.split(r'\s+', 'one two three four five', maxsplit=2)

['one', 'two', 'three four five']

```

Keeping the Delimiters

When the pattern contains a capturing group, the delimiter text is included in the result:

```python import re

Without capturing group — delimiters dropped

re.split(r'\d+', 'abc123def456ghi')

['abc', 'def', 'ghi']

With capturing group — delimiters kept

re.split(r'(\d+)', 'abc123def456ghi')

['abc', '123', 'def', '456', 'ghi']

```

This is useful when you need to reconstruct the original string or process delimiters:

```python import re

Split sentences but keep the punctuation

text = "Hello! How are you? I'm fine." parts = re.split(r'([.!?])\s*', text) print(parts)

['Hello', '!', 'How are you', '?', "I'm fine", '.', '']

```

Edge Cases

```python import re

Empty strings at boundaries

re.split(r',', ',a,,b,')

['', 'a', '', 'b', '']

Pattern at start/end produces empty strings

re.split(r'\d+', '123abc456')

['', 'abc', '']

```

Practical Examples

Cleaning Text Data

```python import re

raw = " Hello, World! This is messy text. "

Normalize whitespace

cleaned = re.sub(r'\s+', ' ', raw).strip() print(cleaned) # 'Hello, World! This is messy text.' ```

Parsing CSV with Quoted Fields

```python import re

line = 'John,"New York, NY",42,"He said ""hello"""'

Split on commas not inside quotes (simplified)

fields = re.findall(r'(?:"([^"](?:""[^"])*)"|([^,]+))', line) print(fields) ```

Redacting Sensitive Information

```python import re

text = "SSN: 123-45-6789, Phone: 555-123-4567"

Redact SSN (keep last 4)

redacted = re.sub(r'\b\d{3}-\d{2}-(\d{4})\b', r'-*-\1', text) print(redacted) # 'SSN: -*-6789, Phone: 555-123-4567' ```

Tokenizing Expressions

```python import re

expr = "3 + 42 * (x - 7)" tokens = re.findall(r'\d+|[a-zA-Z_]\w|[+-/()]', expr) print(tokens)

['3', '+', '42', '*', '(', 'x', '-', '7', ')']

```

Summary

Function Purpose Key Feature
re.sub() Replace matches Supports backreferences and callable replacement
re.subn() Replace matches Returns (result, count) tuple
re.split() Split by pattern More flexible than str.split()
count param Limit replacements re.sub(..., count=1) for first only
Capturing in split Keep delimiters re.split(r'(\d+)', ...)

Exercises

Exercise 1. Use re.sub to convert a string from camelCase to snake_case. For example, "myVariableName" should become "my_variable_name". Hint: insert an underscore before each uppercase letter and lowercase everything.

Solution to Exercise 1

```python import re

def camel_to_snake(name): result = re.sub(r'([A-Z])', r'\1', name) return result.lower().lstrip('')

Test

print(camel_to_snake("myVariableName")) # my_variable_name print(camel_to_snake("HTMLParser")) # h_t_m_l_parser print(camel_to_snake("simpleTest")) # simple_test ```


Exercise 2. Use re.split to split a mathematical expression string into tokens (numbers and operators). For example, "3+42*7-10/2" should return ["3", "+", "42", "*", "7", "-", "10", "/", "2"]. Use a capturing group in the split pattern to keep the delimiters.

Solution to Exercise 2

```python import re

expression = "3+427-10/2" tokens = re.split(r'([+-/])', expression) print(tokens)

['3', '+', '42', '*', '7', '-', '10', '/', '2']

```


Exercise 3. Use re.sub with a replacement function to censor specific words in a text by replacing them with asterisks of the same length. Write a function censor that takes a text and a list of words to censor. For example, censor("The password is secret123", ["password", "secret123"]) should return "The ******** is *********".

Solution to Exercise 3

```python import re

def censor(text, words): for word in words: pattern = re.compile(re.escape(word), re.IGNORECASE) text = pattern.sub(lambda m: '*' * len(m.group()), text) return text

Test

print(censor("The password is secret123", ["password", "secret123"]))

The * is **

```