sub and split¶

Mental Model

re.sub is find-and-replace powered by regex — it locates every match and swaps it with a replacement string (or the return value of a callback function). re.split is str.split on steroids — it splits a string on a regex pattern instead of a fixed delimiter. Together they handle the two most common text-transformation tasks: replacing and splitting.

`re.sub()` — Substitution¶

re.sub(pattern, repl, string, count=0, flags=0) replaces all occurrences of the pattern with a replacement string:

```python import re

text = "Hello World Hello Python" re.sub(r'Hello', 'Hi', text)

'Hi World Hi Python'¶

```

Basic Replacement¶

```python import re

Remove all digits¶

re.sub(r'\d', '', 'Room 404, Floor 3')

'Room , Floor '¶

Replace whitespace sequences with a single space¶

re.sub(r'\s+', ' ', 'too many spaces')

'too many spaces'¶

Remove HTML tags¶

re.sub(r'<[^>]+>', '', '

Hello world

')

'Hello world'¶

```

Limiting Replacements with `count`¶

The count parameter limits the number of replacements:

```python import re

text = "aaa bbb ccc aaa bbb" re.sub(r'aaa', 'xxx', text, count=1)

'xxx bbb ccc aaa bbb'¶

```

Backreferences in Replacement¶

The replacement string can reference captured groups using \1, \2, etc., or \g<name> for named groups:

```python import re

Swap first and last name¶

text = "Smith, John" re.sub(r'(\w+), (\w+)', r'\2 \1', text)

'John Smith'¶

Using named groups¶

re.sub(r'(?P\w+), (?P\w+)', r'\g \g', text)

'John Smith'¶

Surround numbers with brackets¶

re.sub(r'(\d+)', r'[\1]', 'error 404 at line 52')

'error [404] at line [52]'¶

```

Function as Replacement¶

When repl is a callable, it receives a Match object and returns the replacement string. This enables dynamic replacements:

```python import re

Double all numbers¶

def double_match(match): return str(int(match.group()) * 2)

re.sub(r'\d+', double_match, 'score: 42, bonus: 15')

'score: 84, bonus: 30'¶

Lambda version¶

re.sub(r'\d+', lambda m: str(int(m.group()) * 2), 'score: 42, bonus: 15')

'score: 84, bonus: 30'¶

```

More complex transformations:

```python import re

Convert snake_case to camelCase¶

def to_camel(match): return match.group(1).upper()

text = "my_variable_name" re.sub(r'_([a-z])', to_camel, text)

'myVariableName'¶

Censor certain words¶

words_to_censor = {'bad', 'ugly'} def censor(match): word = match.group() if word.lower() in words_to_censor: return '*' * len(word) return word

re.sub(r'\b\w+\b', censor, 'The bad and the ugly')

'The * and the **'¶

```

`re.subn()`¶

re.subn() works like re.sub() but returns a tuple of (new_string, count):

```python import re

result, count = re.subn(r'\d+', 'NUM', 'error 404 at line 52') print(result) # 'error NUM at line NUM' print(count) # 2 ```

`re.split()` — Splitting¶

re.split(pattern, string, maxsplit=0, flags=0) splits a string by pattern matches. Unlike str.split(), it accepts regex patterns as delimiters.

Basic Splitting¶

```python import re

Split on one or more whitespace characters¶

re.split(r'\s+', 'hello world python')

['hello', 'world', 'python']¶

Split on commas with optional surrounding whitespace¶

re.split(r'\s,\s', 'a, b ,c , d')

['a', 'b', 'c', 'd']¶

Split on multiple delimiter types¶

re.split(r'[;,\s]+', 'a,b; c d,e')

['a', 'b', 'c', 'd', 'e']¶

```

`re.split()` vs `str.split()`¶

```python

str.split() — fixed string delimiter¶

'a, b, c'.split(', ')

['a', 'b', 'c']¶

But fails with inconsistent spacing¶

'a, b ,c , d'.split(', ')

['a', 'b ,c ', 'd'] — messy!¶

re.split() — pattern delimiter handles variation¶

import re re.split(r'\s,\s', 'a, b ,c , d')

['a', 'b', 'c', 'd'] — clean¶

```

Limiting Splits with `maxsplit`¶

```python import re

re.split(r'\s+', 'one two three four five', maxsplit=2)

['one', 'two', 'three four five']¶

```

Keeping the Delimiters¶

When the pattern contains a capturing group, the delimiter text is included in the result:

```python import re

Without capturing group — delimiters dropped¶

re.split(r'\d+', 'abc123def456ghi')

['abc', 'def', 'ghi']¶

With capturing group — delimiters kept¶

re.split(r'(\d+)', 'abc123def456ghi')

['abc', '123', 'def', '456', 'ghi']¶

```

This is useful when you need to reconstruct the original string or process delimiters:

```python import re

Split sentences but keep the punctuation¶

text = "Hello! How are you? I'm fine." parts = re.split(r'([.!?])\s*', text) print(parts)

['Hello', '!', 'How are you', '?', "I'm fine", '.', '']¶

```

Edge Cases¶

```python import re

Empty strings at boundaries¶

re.split(r',', ',a,,b,')

['', 'a', '', 'b', '']¶

Pattern at start/end produces empty strings¶

re.split(r'\d+', '123abc456')

['', 'abc', '']¶

```

Practical Examples¶

Cleaning Text Data¶

```python import re

raw = " Hello, World! This is messy text. "

Normalize whitespace¶

cleaned = re.sub(r'\s+', ' ', raw).strip() print(cleaned) # 'Hello, World! This is messy text.' ```

Parsing CSV with Quoted Fields¶

```python import re

line = 'John,"New York, NY",42,"He said ""hello"""'

Split on commas not inside quotes (simplified)¶

fields = re.findall(r'(?:"([^"](?:""[^"])*)"|([^,]+))', line) print(fields) ```

Redacting Sensitive Information¶

```python import re

text = "SSN: 123-45-6789, Phone: 555-123-4567"

Redact SSN (keep last 4)¶

redacted = re.sub(r'\b\d{3}-\d{2}-(\d{4})\b', r'-*-\1', text) print(redacted) # 'SSN: -*-6789, Phone: 555-123-4567' ```

Tokenizing Expressions¶

```python import re

expr = "3 + 42 * (x - 7)" tokens = re.findall(r'\d+|[a-zA-Z_]\w|[+-/()]', expr) print(tokens)

['3', '+', '42', '*', '(', 'x', '-', '7', ')']¶

```

Summary¶

Function	Purpose	Key Feature
`re.sub()`	Replace matches	Supports backreferences and callable replacement
`re.subn()`	Replace matches	Returns `(result, count)` tuple
`re.split()`	Split by pattern	More flexible than `str.split()`
`count` param	Limit replacements	`re.sub(..., count=1)` for first only
Capturing in split	Keep delimiters	`re.split(r'(\d+)', ...)`

Exercises¶

Exercise 1. Use re.sub to convert a string from camelCase to snake_case. For example, "myVariableName" should become "my_variable_name". Hint: insert an underscore before each uppercase letter and lowercase everything.

Solution to Exercise 1

```python import re

def camel_to_snake(name): result = re.sub(r'([A-Z])', r'\1', name) return result.lower().lstrip('')

Test¶

print(camel_to_snake("myVariableName")) # my_variable_name print(camel_to_snake("HTMLParser")) # h_t_m_l_parser print(camel_to_snake("simpleTest")) # simple_test ```

Exercise 2. Use re.split to split a mathematical expression string into tokens (numbers and operators). For example, "3+42*7-10/2" should return ["3", "+", "42", "*", "7", "-", "10", "/", "2"]. Use a capturing group in the split pattern to keep the delimiters.

Solution to Exercise 2

```python import re

expression = "3+427-10/2" tokens = re.split(r'([+-/])', expression) print(tokens)

['3', '+', '42', '*', '7', '-', '10', '/', '2']¶

```

Exercise 3. Use re.sub with a replacement function to censor specific words in a text by replacing them with asterisks of the same length. Write a function censor that takes a text and a list of words to censor. For example, censor("The password is secret123", ["password", "secret123"]) should return "The ******** is *********".

Solution to Exercise 3

```python import re

def censor(text, words): for word in words: pattern = re.compile(re.escape(word), re.IGNORECASE) text = pattern.sub(lambda m: '*' * len(m.group()), text) return text

Test¶

print(censor("The password is secret123", ["password", "secret123"]))

The * is **¶

```

sub and split¶

re.sub() — Substitution¶

'Hi World Hi Python'¶

Basic Replacement¶

Remove all digits¶

'Room , Floor '¶

Replace whitespace sequences with a single space¶

'too many spaces'¶

Remove HTML tags¶

'Hello world'¶

Limiting Replacements with count¶

'xxx bbb ccc aaa bbb'¶

Backreferences in Replacement¶

Swap first and last name¶

'John Smith'¶

Using named groups¶

'John Smith'¶

Surround numbers with brackets¶

'error [404] at line [52]'¶

Function as Replacement¶

Double all numbers¶

'score: 84, bonus: 30'¶

Lambda version¶

'score: 84, bonus: 30'¶

Convert snake_case to camelCase¶

'myVariableName'¶

Censor certain words¶

'The *** and the ****'¶

re.subn()¶

re.split() — Splitting¶

Basic Splitting¶

Split on one or more whitespace characters¶

['hello', 'world', 'python']¶

Split on commas with optional surrounding whitespace¶

['a', 'b', 'c', 'd']¶

Split on multiple delimiter types¶

['a', 'b', 'c', 'd', 'e']¶

re.split() vs str.split()¶

str.split() — fixed string delimiter¶

['a', 'b', 'c']¶

But fails with inconsistent spacing¶

['a', 'b ,c ', 'd'] — messy!¶

re.split() — pattern delimiter handles variation¶

['a', 'b', 'c', 'd'] — clean¶

Limiting Splits with maxsplit¶

['one', 'two', 'three four five']¶

Keeping the Delimiters¶

Without capturing group — delimiters dropped¶

['abc', 'def', 'ghi']¶

With capturing group — delimiters kept¶

['abc', '123', 'def', '456', 'ghi']¶

Split sentences but keep the punctuation¶

['Hello', '!', 'How are you', '?', "I'm fine", '.', '']¶

Edge Cases¶

Empty strings at boundaries¶

['', 'a', '', 'b', '']¶

Pattern at start/end produces empty strings¶

['', 'abc', '']¶

Practical Examples¶

Cleaning Text Data¶

Normalize whitespace¶

Parsing CSV with Quoted Fields¶

Split on commas not inside quotes (simplified)¶

Redacting Sensitive Information¶

Redact SSN (keep last 4)¶

Tokenizing Expressions¶

['3', '+', '42', '*', '(', 'x', '-', '7', ')']¶

Summary¶

Exercises¶

Test¶

['3', '+', '42', '*', '7', '-', '10', '/', '2']¶

Test¶

The * is **¶

`re.sub()` — Substitution¶

Limiting Replacements with `count`¶

'The * and the **'¶

`re.subn()`¶

`re.split()` — Splitting¶

`re.split()` vs `str.split()`¶

Limiting Splits with `maxsplit`¶