Pattern Syntax Basics

Literal Characters

The simplest regex patterns are literal characters that match themselves exactly:

import re

re.search(r'cat', 'The cat sat on the mat')
# <re.Match object; span=(4, 7), match='cat'>

re.findall(r'at', 'The cat sat on the mat')
# ['at', 'at', 'at']

Most characters match themselves, but certain characters have special meaning in regex and must be escaped with a backslash to match literally.

Metacharacters

These characters have special meaning in regex:

.  ^  $  *  +  ?  {  }  [  ]  \  |  (  )

Metacharacter	Meaning
`.`	Match any character except newline
`^`	Match the start of the string
`$`	Match the end of the string, or just before a trailing newline
`*`	Zero or more repetitions
`+`	One or more repetitions
`?`	Zero or one repetition
`{m,n}`	Between m and n repetitions
`[]`	Character class (set of characters)
`\`	Escape character
`|`	Alternation (OR)
`()`	Grouping and capturing
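
The repetition metacharacters in the table are not demonstrated elsewhere in this section, so here is a minimal sketch of how `*`, `+`, `?`, and `{m,n}` differ. The sample text and variable names are illustrative only.

```python
import re

text = "ct cat caat caaat"

# * allows zero or more 'a' characters
star = re.findall(r'ca*t', text)
# + requires at least one 'a'
plus = re.findall(r'ca+t', text)
# ? allows zero or one 'a'
optional = re.findall(r'ca?t', text)
# {2,3} requires two or three 'a' characters
ranged = re.findall(r'ca{2,3}t', text)

print(star)      # ['ct', 'cat', 'caat', 'caaat']
print(plus)      # ['cat', 'caat', 'caaat']
print(optional)  # ['ct', 'cat']
print(ranged)    # ['caat', 'caaat']
```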

The Dot (`.`)

The dot matches any single character except a newline:

import re

re.findall(r'c.t', 'cat cot cut c\nt c9t')
# ['cat', 'cot', 'cut', 'c9t']

With re.DOTALL, the dot also matches newlines:

re.findall(r'c.t', 'cat c\nt', re.DOTALL)
# ['cat', 'c\nt']
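
The same behavior can also be switched on from inside the pattern. The sketch below compares the default dot with the inline `(?s)` flag and with `re.S`, the short alias for `re.DOTALL`; the variable names are illustrative.

```python
import re

text = "c\nt"

# Without DOTALL the dot refuses to match the newline
no_flag = re.findall(r'c.t', text)
# (?s) at the start of the pattern has the same effect as flags=re.DOTALL
inline_flag = re.findall(r'(?s)c.t', text)
# re.S is the short alias for re.DOTALL
alias = re.findall(r'c.t', text, re.S)

print(no_flag)      # []
print(inline_flag)  # ['c\nt']
print(alias)        # ['c\nt']
```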

Anchors (`^` and `$`)

Anchors match positions, not characters:

import re

text = "hello world"

re.search(r'^hello', text)   # Matches — 'hello' is at start
re.search(r'^world', text)   # None — 'world' is not at start
re.search(r'world$', text)   # Matches — 'world' is at end
re.search(r'hello$', text)   # None — 'hello' is not at end

With re.MULTILINE, ^ and $ match at the start/end of each line:

text = "first line\nsecond line\nthird line"

re.findall(r'^\w+', text)              # ['first']
re.findall(r'^\w+', text, re.M)        # ['first', 'second', 'third']
re.findall(r'\w+$', text, re.M)        # ['line', 'line', 'line']
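
One detail worth knowing when anchoring at the end: even without re.MULTILINE, `$` also matches just before a single trailing newline. The sketch below contrasts it with `\Z` and `re.fullmatch()`, which are stricter; the sample string is illustrative.

```python
import re

line = "done\n"

# $ matches at the very end and also just before a trailing newline
dollar = re.search(r'done$', line) is not None
# \Z matches only at the very end of the string
strict_end = re.search(r'done\Z', line) is not None
# fullmatch requires the entire string to match the pattern
full = re.fullmatch(r'done', line) is not None

print(dollar)      # True
print(strict_end)  # False
print(full)        # False
```

If a trailing newline must be rejected, for example when validating input, `re.fullmatch()` is usually the simpler choice.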

Alternation (`|`)

The pipe acts as a logical OR:

import re

re.findall(r'cat|dog', 'I have a cat and a dog')
# ['cat', 'dog']

# Alternation applies to the whole expression on each side
re.findall(r'gray|grey', 'gray and grey')
# ['gray', 'grey']

# Use parentheses to limit alternation scope
re.findall(r'gr(a|e)y', 'gray and grey')
# ['a', 'e']  — returns captured groups!

# Non-capturing group to get full match
re.findall(r'gr(?:a|e)y', 'gray and grey')
# ['gray', 'grey']
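
Because alternation has the lowest precedence, anchors bind to only one side unless the alternatives are grouped. The following sketch, using a made-up word list, shows the difference.

```python
import re

words = ["cat", "dog", "catalog", "bulldog"]

# ^cat|dog$ reads as (^cat) OR (dog$), so partial matches slip through
loose = [w for w in words if re.search(r'^cat|dog$', w)]
# Grouping makes both anchors apply to each alternative
strict = [w for w in words if re.search(r'^(?:cat|dog)$', w)]

print(loose)   # ['cat', 'dog', 'catalog', 'bulldog']
print(strict)  # ['cat', 'dog']
```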

Escaping Metacharacters

To match a metacharacter literally, precede it with a backslash:

import re

# Match a literal dot
re.findall(r'\.', 'version 3.14.1')
# ['.', '.']

# Match a literal dollar sign
re.search(r'\$\d+', 'Price: $25')
# <re.Match object; span=(7, 10), match='$25'>

# Match literal parentheses
re.search(r'\(.*?\)', 'func(arg)')
# <re.Match object; span=(4, 9), match='(arg)'>

You can also use re.escape() to escape all metacharacters in a string:

special = "price is $5.00 (USD)"
escaped = re.escape(special)
print(escaped)
# price\ is\ \$5\.00\ \(USD\)

This is useful when building patterns from user input.
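
As a rough illustration of the user-input case, the sketch below counts literal occurrences of a search term. `count_occurrences` is a hypothetical helper written for this example, not part of the `re` module.

```python
import re

def count_occurrences(needle, haystack):
    """Count literal occurrences of needle in haystack (hypothetical helper)."""
    # re.escape ensures characters like '.' are treated literally
    pattern = re.escape(needle)
    return len(re.findall(pattern, haystack))

text = "a.b axb a.b"

# Unescaped, the dot in "a.b" also matches the 'x' in "axb"
unescaped_count = len(re.findall("a.b", text))
escaped_count = count_occurrences("a.b", text)

print(unescaped_count)  # 3
print(escaped_count)    # 2
```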

Shorthand Character Classes

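These shortcuts match common character categories. The equivalent classes below are exact for bytes patterns or when re.ASCII is set; for str patterns, Python 3 also matches the corresponding Unicode characters by default: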

Shorthand	Meaning	Equivalent Class
`\d`	Any digit	`[0-9]`
`\D`	Any non-digit	`[^0-9]`
`\w`	Any word character	`[a-zA-Z0-9_]`
`\W`	Any non-word character	`[^a-zA-Z0-9_]`
`\s`	Any whitespace	`[ \t\n\r\f\v]`
`\S`	Any non-whitespace	`[^ \t\n\r\f\v]`

import re

text = "Agent 007 arrived at 14:30"

re.findall(r'\d+', text)    # ['007', '14', '30']
re.findall(r'\w+', text)    # ['Agent', '007', 'arrived', 'at', '14', '30']
re.findall(r'\s+', text)    # [' ', ' ', ' ', ' ']
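
For str patterns, the shorthand classes are Unicode-aware unless re.ASCII is passed. The sketch below shows the difference using Arabic-Indic digits and an accented letter; the sample strings are illustrative and written with escapes so the code stays ASCII-only.

```python
import re

# "42" followed by the same number in Arabic-Indic digits (U+0664, U+0662)
numbers = "42 and \u0664\u0662"
# "cafe" with an accented final letter (U+00E9)
word = "caf\u00e9"

unicode_digits = re.findall(r'\d+', numbers)          # both numbers match
ascii_digits = re.findall(r'\d+', numbers, re.ASCII)  # only '42' matches
unicode_word = re.findall(r'\w+', word)               # the whole word matches
ascii_word = re.findall(r'\w+', word, re.ASCII)       # stops before the accent

print(len(unicode_digits))  # 2
print(ascii_digits)         # ['42']
print(ascii_word)           # ['caf']
```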

Word Boundary (`\b`)

\b matches the boundary between a word character and a non-word character (or start/end of string). It matches a position, not a character:

import re

text = "cat concatenate category"

re.findall(r'cat', text)     # ['cat', 'cat', 'cat'] — matches inside words
re.findall(r'\bcat\b', text) # ['cat'] — only the standalone word
re.findall(r'\bcat', text)   # ['cat', 'cat'] — words starting with 'cat'
re.findall(r'cat\b', text)   # ['cat'] (only the standalone 'cat' ends at a word boundary)

Word Boundary and Raw Strings

Always use raw strings with \b. In a regular Python string, \b is the backspace character (ASCII 8), not a word boundary.
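
A short sketch makes the difference visible: the same pattern text written with and without the `r` prefix produces two different strings, and only the raw one behaves as a word boundary. The variable names are illustrative.

```python
import re

text = "a cat sat"

# Without the r prefix, each \b becomes a backspace character (\x08)
plain = '\bcat\b'
# With the r prefix, the backslashes reach the regex engine intact
raw = r'\bcat\b'

print(len(plain), len(raw))    # 5 7
print(re.search(plain, text))  # None
print(re.search(raw, text))    # <re.Match object; span=(2, 5), match='cat'>
```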

Building Patterns Incrementally

Complex patterns are best built step by step:

import re

# Goal: match a date like "2024-01-15"
# Step 1: match four digits
r'\d{4}'

# Step 2: add a hyphen and two digits
r'\d{4}-\d{2}'

# Step 3: complete the pattern
r'\d{4}-\d{2}-\d{2}'

text = "Date: 2024-01-15, updated: 2024-12-31"
re.findall(r'\d{4}-\d{2}-\d{2}', text)
# ['2024-01-15', '2024-12-31']
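
One possible way to keep the individual steps visible in the final pattern is re.VERBOSE, which ignores whitespace and allows comments inside the pattern. This is a sketch of that idea rather than the only approach; `date_pattern` is a name chosen for the example.

```python
import re

# Each step from the walkthrough gets its own commented line
date_pattern = re.compile(r"""
    \d{4}    # step 1: four-digit year
    -\d{2}   # step 2: hyphen and two-digit month
    -\d{2}   # step 3: hyphen and two-digit day
""", re.VERBOSE)

text = "Date: 2024-01-15, updated: 2024-12-31"
dates = date_pattern.findall(text)
print(dates)  # ['2024-01-15', '2024-12-31']

# The pattern checks shape only, so an impossible date still matches
shape_only = date_pattern.findall("2024-99-99")
print(shape_only)  # ['2024-99-99']
```

Range checks such as a month between 01 and 12 are better handled after matching, for example with `datetime.date.fromisoformat()`.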

Summary

Concept	Key Takeaway
Literal characters	Match themselves exactly
Metacharacters	`. ^ $ * + ? { } [ ] \ | ( )` have special meaning
`.` (dot)	Matches any character except newline
`^` / `$`	Match start/end of string (or line with `re.M`)
`|`	Alternation (logical OR)
`\` escape	Precede metacharacters with `\` to match literally
`\d \w \s`	Shorthand for digit, word, whitespace classes
`\b`	Word boundary (position, not character)