Skip to content

Pattern Syntax Basics

Literal Characters

The simplest regex patterns are literal characters that match themselves exactly:

import re

re.search(r'cat', 'The cat sat on the mat')
# <re.Match object; span=(4, 7), match='cat'>

re.findall(r'at', 'The cat sat on the mat')
# ['at', 'at', 'at']

Most characters match themselves, but certain characters have special meaning in regex and must be escaped with a backslash to match literally.

Metacharacters

These characters have special meaning in regex:

.  ^  $  *  +  ?  {  }  [  ]  \  |  (  )
Metacharacter Meaning
. Match any character except newline
^ Match the start of the string
$ Match the end of the string
* Zero or more repetitions
+ One or more repetitions
? Zero or one repetition
{m,n} Between m and n repetitions
[] Character class (set of characters)
\ Escape character
\| Alternation (OR)
() Grouping and capturing

The Dot (.)

The dot matches any single character except a newline:

import re

re.findall(r'c.t', 'cat cot cut c\nt c9t')
# ['cat', 'cot', 'cut', 'c9t']

With re.DOTALL, the dot also matches newlines:

re.findall(r'c.t', 'cat c\nt', re.DOTALL)
# ['cat', 'c\nt']

Anchors (^ and $)

Anchors match positions, not characters:

import re

text = "hello world"

re.search(r'^hello', text)   # Matches — 'hello' is at start
re.search(r'^world', text)   # None — 'world' is not at start
re.search(r'world$', text)   # Matches — 'world' is at end
re.search(r'hello$', text)   # None — 'hello' is not at end

With re.MULTILINE, ^ and $ match at the start/end of each line:

text = "first line\nsecond line\nthird line"

re.findall(r'^\w+', text)              # ['first']
re.findall(r'^\w+', text, re.M)        # ['first', 'second', 'third']
re.findall(r'\w+$', text, re.M)        # ['line', 'line', 'line']

Alternation (|)

The pipe acts as a logical OR:

import re

re.findall(r'cat|dog', 'I have a cat and a dog')
# ['cat', 'dog']

# Alternation applies to the whole expression on each side
re.findall(r'gray|grey', 'gray and grey')
# ['gray', 'grey']

# Use parentheses to limit alternation scope
re.findall(r'gr(a|e)y', 'gray and grey')
# ['a', 'e']  — returns captured groups!

# Non-capturing group to get full match
re.findall(r'gr(?:a|e)y', 'gray and grey')
# ['gray', 'grey']

Escaping Metacharacters

To match a metacharacter literally, precede it with a backslash:

import re

# Match a literal dot
re.findall(r'\.', 'version 3.14.1')
# ['.', '.']

# Match a literal dollar sign
re.search(r'\$\d+', 'Price: \$25')
# <re.Match object; span=(7, 10), match='\$25'>

# Match literal parentheses
re.search(r'\(.*?\)', 'func(arg)')
# <re.Match object; span=(4, 9), match='(arg)'>

You can also use re.escape() to escape all metacharacters in a string:

special = "price is \$5.00 (USD)"
escaped = re.escape(special)
print(escaped)
# price\ is\ \$5\.00\ \(USD\)

This is useful when building patterns from user input.

Shorthand Character Classes

These shortcuts match common character categories:

Shorthand Meaning Equivalent Class
\d Any digit [0-9]
\D Any non-digit [^0-9]
\w Any word character [a-zA-Z0-9_]
\W Any non-word character [^a-zA-Z0-9_]
\s Any whitespace [ \t\n\r\f\v]
\S Any non-whitespace [^ \t\n\r\f\v]
import re

text = "Agent 007 arrived at 14:30"

re.findall(r'\d+', text)    # ['007', '14', '30']
re.findall(r'\w+', text)    # ['Agent', '007', 'arrived', 'at', '14', '30']
re.findall(r'\s+', text)    # [' ', ' ', ' ', ' ']

Word Boundary (\b)

\b matches the boundary between a word character and a non-word character (or start/end of string). It matches a position, not a character:

import re

text = "cat concatenate category"

re.findall(r'cat', text)     # ['cat', 'cat', 'cat'] — matches inside words
re.findall(r'\bcat\b', text) # ['cat'] — only the standalone word
re.findall(r'\bcat', text)   # ['cat', 'cat'] — words starting with 'cat'
re.findall(r'cat\b', text)   # ['cat', 'cat'] — words ending with 'cat'

Word Boundary and Raw Strings

Always use raw strings with \b. In a regular Python string, \b is the backspace character (ASCII 8), not a word boundary.

Building Patterns Incrementally

Complex patterns are best built step by step:

import re

# Goal: match a date like "2024-01-15"
# Step 1: match four digits
r'\d{4}'

# Step 2: add a hyphen and two digits
r'\d{4}-\d{2}'

# Step 3: complete the pattern
r'\d{4}-\d{2}-\d{2}'

text = "Date: 2024-01-15, updated: 2024-12-31"
re.findall(r'\d{4}-\d{2}-\d{2}', text)
# ['2024-01-15', '2024-12-31']

Summary

Concept Key Takeaway
Literal characters Match themselves exactly
Metacharacters . ^ $ * + ? { } [ ] \ \| ( ) have special meaning
. (dot) Matches any character except newline
^ / $ Match start/end of string (or line with re.M)
\| Alternation — logical OR
\ escape Precede metacharacters with \ to match literally
\d \w \s Shorthand for digit, word, whitespace classes
\b Word boundary (position, not character)