Pattern Syntax Basics
Mental Model
A regex pattern is read left to right, one token at a time. Most characters match themselves literally; the special characters (. ^ $ * + ? { } [ ] \ | ( )) carry meta-meaning. To match a special character literally, escape it with a backslash. Once you know these few metacharacters, you can work out the structure of most patterns.
Literal Characters
The simplest regex patterns are literal characters that match themselves exactly:
```python
import re

re.search(r'cat', 'The cat sat on the mat')
# <re.Match object; span=(4, 7), match='cat'>

re.findall(r'at', 'The cat sat on the mat')
# ['at', 'at', 'at']
```
Most characters match themselves, but certain characters have special meaning in regex and must be escaped with a backslash to match literally.
Metacharacters
These characters have special meaning in regex:
. ^ $ * + ? { } [ ] \ | ( )
| Metacharacter | Meaning |
|---|---|
| `.` | Match any character except newline |
| `^` | Match the start of the string |
| `$` | Match the end of the string |
| `*` | Zero or more repetitions |
| `+` | One or more repetitions |
| `?` | Zero or one repetition |
| `{m,n}` | Between m and n repetitions |
| `[]` | Character class (set of characters) |
| `\` | Escape character |
| `\|` | Alternation (OR) |
| `()` | Grouping and capturing |
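
The quantifier rows in the table are easiest to compare side by side. The following is a minimal sketch; the sample string and variable names are chosen here for illustration and do not come from the original text.

```python
import re

# Illustrative sample string (an assumption for this sketch)
text = "ct cat caat caaat"

star = re.findall(r'ca*t', text)         # zero or more 'a'
plus = re.findall(r'ca+t', text)         # one or more 'a'
optional = re.findall(r'ca?t', text)     # zero or one 'a'
ranged = re.findall(r'ca{2,3}t', text)   # two or three 'a'

print(star)      # ['ct', 'cat', 'caat', 'caaat']
print(plus)      # ['cat', 'caat', 'caaat']
print(optional)  # ['ct', 'cat']
print(ranged)    # ['caat', 'caaat']
```

Each quantifier applies only to the token immediately before it, here the single character `a`.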
The Dot (.)
The dot matches any single character except a newline:
```python
import re

re.findall(r'c.t', 'cat cot cut c\nt c9t')
# ['cat', 'cot', 'cut', 'c9t']
```
With re.DOTALL, the dot also matches newlines:
```python
re.findall(r'c.t', 'cat c\nt', re.DOTALL)
# ['cat', 'c\nt']
```
Anchors (^ and $)
Anchors match positions, not characters:
```python
import re

text = "hello world"

re.search(r'^hello', text)   # match: 'hello' is at the start
re.search(r'^world', text)   # None: 'world' is not at the start
re.search(r'world$', text)   # match: 'world' is at the end
re.search(r'hello$', text)   # None: 'hello' is not at the end
```
With re.MULTILINE, ^ and $ match at the start/end of each line:
```python
text = "first line\nsecond line\nthird line"

re.findall(r'^\w+', text)         # ['first']
re.findall(r'^\w+', text, re.M)   # ['first', 'second', 'third']
re.findall(r'\w+$', text, re.M)   # ['line', 'line', 'line']
```
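
One way to see that anchors match positions rather than characters is to substitute at an anchor. This is a small sketch with an illustrative input string; nothing is removed from the text because `^` consumes no characters.

```python
import re

text = "first line\nsecond line"

# With re.MULTILINE, '^' matches the empty position at the start of each
# line, so the replacement is inserted there and no characters are consumed.
quoted = re.sub(r'^', '> ', text, flags=re.MULTILINE)

print(quoted)
# > first line
# > second line
```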
Alternation (|)
The pipe acts as a logical OR:
```python
import re

re.findall(r'cat|dog', 'I have a cat and a dog')
# ['cat', 'dog']

# Alternation applies to the whole expression on each side
re.findall(r'gray|grey', 'gray and grey')
# ['gray', 'grey']

# Use parentheses to limit alternation scope
re.findall(r'gr(a|e)y', 'gray and grey')
# ['a', 'e'] (findall returns the captured groups!)

# Non-capturing group to get the full match
re.findall(r'gr(?:a|e)y', 'gray and grey')
# ['gray', 'grey']
```
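
Because alternation splits the whole expression, anchors bind to only one branch unless the alternatives are grouped. The sketch below uses an illustrative input ('catalog') to show the difference; the variable names are assumptions made for this example.

```python
import re

# Parsed as (^cat) OR (dog$): each anchor belongs to just one branch.
loose = re.search(r'^cat|dog$', 'catalog')

# Grouping makes both anchors apply to every alternative.
strict = re.search(r'^(?:cat|dog)$', 'catalog')

print(loose.group())  # cat
print(strict)         # None
```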
Escaping Metacharacters
To match a metacharacter literally, precede it with a backslash:
```python
import re

# Match a literal dot
re.findall(r'\.', 'version 3.14.1')
# ['.', '.']

# Match a literal dollar sign
re.search(r'\$\d+', 'Price: $25')
# <re.Match object; span=(7, 10), match='$25'>

# Match literal parentheses
re.search(r'\(.*?\)', 'func(arg)')
# <re.Match object; span=(4, 9), match='(arg)'>
```
You can also use re.escape() to escape all metacharacters in a string:
```python
import re

special = "price is $5.00 (USD)"
escaped = re.escape(special)
print(escaped)
# price\ is\ \$5\.00\ \(USD\)
```
This is useful when building patterns from user input.
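
As a rough sketch of that use case, the hypothetical helper below (`count_occurrences` is a name invented for this example) treats user-supplied text as a literal search string. The sample input is illustrative.

```python
import re

def count_occurrences(needle: str, haystack: str) -> int:
    """Count non-overlapping literal occurrences of needle in haystack."""
    return len(re.findall(re.escape(needle), haystack))

user_input = "3.14"  # stands in for text typed by a user
text = "3.14 is close to pi; 3514 is not"

escaped_count = count_occurrences(user_input, text)
unescaped_count = len(re.findall(user_input, text))

print(escaped_count)    # 1
print(unescaped_count)  # 2 (the unescaped dot also matches the '5' in '3514')
```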
Shorthand Character Classes
These shortcuts match common character categories:
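| Shorthand | Meaning | Equivalent Class |
|---|---|---|
| `\d` | Any digit | `[0-9]` |
| `\D` | Any non-digit | `[^0-9]` |
| `\w` | Any word character | `[a-zA-Z0-9_]` |
| `\W` | Any non-word character | `[^a-zA-Z0-9_]` |
| `\s` | Any whitespace | `[ \t\n\r\f\v]` |
| `\S` | Any non-whitespace | `[^ \t\n\r\f\v]` |

For str patterns in Python 3, these equivalences are exact only with the re.ASCII flag; by default \d, \w, and \s also match Unicode digits, letters, and whitespace.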
```python
import re

text = "Agent 007 arrived at 14:30"

re.findall(r'\d+', text)   # ['007', '14', '30']
re.findall(r'\w+', text)   # ['Agent', '007', 'arrived', 'at', '14', '30']
re.findall(r'\s+', text)   # [' ', ' ', ' ', ' ']
```
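
The effect of re.ASCII on \w can be checked directly. In this brief sketch the sample word is illustrative, and the accented letter is written as an escape so the result does not depend on how the source file is encoded.

```python
import re

word = "caf\u00e9"  # 'café' with a precomposed e-acute

unicode_match = re.fullmatch(r'\w+', word)            # default Unicode matching
ascii_match = re.fullmatch(r'\w+', word, re.ASCII)    # restrict \w to [a-zA-Z0-9_]

print(unicode_match is not None)  # True
print(ascii_match is not None)    # False
```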
Word Boundary (\b)
\b matches the boundary between a word character and a non-word character (or start/end of string). It matches a position, not a character:
```python
import re

text = "cat concatenate category"

re.findall(r'cat', text)       # ['cat', 'cat', 'cat'] (also matches inside words)
re.findall(r'\bcat\b', text)   # ['cat'] (only the standalone word)
re.findall(r'\bcat', text)     # ['cat', 'cat'] (words starting with 'cat')
re.findall(r'cat\b', text)     # ['cat'] (words ending with 'cat')
```
Word Boundary and Raw Strings
Always use raw strings with \b. In a regular Python string, \b is the backspace character (ASCII 8), not a word boundary.
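
The difference is easy to demonstrate. In this short sketch (the sample text is illustrative) the non-raw pattern silently searches for backspace characters and finds nothing.

```python
import re

# In a regular string literal, '\b' is the backspace character.
is_backspace = '\b' == '\x08'

non_raw = re.findall('\bcat\b', 'a cat')   # pattern contains two backspaces
raw = re.findall(r'\bcat\b', 'a cat')      # pattern contains word boundaries

print(is_backspace)  # True
print(non_raw)       # []
print(raw)           # ['cat']
```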
Building Patterns Incrementally
Complex patterns are best built step by step:
```python
import re

# Goal: match a date like "2024-01-15"

# Step 1: match four digits
r'\d{4}'

# Step 2: add a hyphen and two digits
r'\d{4}-\d{2}'

# Step 3: complete the pattern
r'\d{4}-\d{2}-\d{2}'

text = "Date: 2024-01-15, updated: 2024-12-31"
re.findall(r'\d{4}-\d{2}-\d{2}', text)
# ['2024-01-15', '2024-12-31']
```
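
The same incremental approach can continue past step 3. As one possible next step (a sketch, not part of the original walkthrough), the pattern can be assembled from named pieces and tightened so that months are limited to 01-12 and days to 01-31. The variable names and sample text are assumptions made for this example.

```python
import re

# Each piece is built and checked separately, then combined.
year = r'\d{4}'
month = r'(?:0[1-9]|1[0-2])'          # 01 through 12
day = r'(?:0[1-9]|[12]\d|3[01])'      # 01 through 31
date_pattern = rf'\b{year}-{month}-{day}\b'

text = "2024-01-15, 2024-13-45, 2024-12-31"
dates = re.findall(date_pattern, text)

print(dates)  # ['2024-01-15', '2024-12-31']
```

This still accepts impossible dates such as 2024-02-31; checking calendar validity is better left to code that runs after the regex match.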
Summary
| Concept | Key Takeaway |
|---|---|
| Literal characters | Match themselves exactly |
| Metacharacters | . ^ $ * + ? { } [ ] \ \| ( ) have special meaning |
| `.` (dot) | Matches any character except newline |
| `^` / `$` | Match start/end of string (or line with `re.M`) |
| `\|` | Alternation (logical OR) |
| `\` escape | Precede metacharacters with `\` to match literally |
| `\d` `\w` `\s` | Shorthand for digit, word, whitespace classes |
| `\b` | Word boundary (position, not character) |
Exercises
Exercise 1.
Write a regex pattern that matches a simple IP address format: four groups of 1-3 digits separated by dots (e.g., 192.168.1.1). Use the dot metacharacter properly (escaped). Test against "192.168.1.1", "256.1.1.1", and "10.0.0.1".
Solution to Exercise 1
```python
import re

# Escape each dot so it matches a literal '.', not any character
pattern = r'^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$'

tests = ["192.168.1.1", "256.1.1.1", "10.0.0.1", "1.2.3"]
for t in tests:
    match = re.fullmatch(pattern, t)
    print(f"'{t}': {'Match' if match else 'No match'}")

# Note: this matches format only, not valid ranges (0-255)
```
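
If the 0-255 range also needs checking, one option is to keep the regex for the format and validate the numbers in ordinary code. The following is a sketch under that assumption; `IPV4_FORMAT` and `is_valid_ipv4` are hypothetical names for this example, and leading zeros such as "010" are accepted for simplicity.

```python
import re

# Format check only: four dot-separated groups of 1-3 digits.
IPV4_FORMAT = re.compile(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}')

def is_valid_ipv4(candidate: str) -> bool:
    """Return True if candidate has the dotted format and each part is <= 255."""
    if not IPV4_FORMAT.fullmatch(candidate):
        return False
    return all(int(part) <= 255 for part in candidate.split('.'))

for t in ["192.168.1.1", "256.1.1.1", "10.0.0.1", "1.2.3"]:
    print(f"'{t}': {is_valid_ipv4(t)}")
# '192.168.1.1': True
# '256.1.1.1': False
# '10.0.0.1': True
# '1.2.3': False
```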
Exercise 2.
Write a regex using alternation (|) that matches either "cat", "dog", or "bird" as whole words only. Test it against "I have a cat and a category" -- it should match "cat" but not the "cat" in "category".
Solution to Exercise 2
```python
import re

pattern = r'\b(?:cat|dog|bird)\b'
text = "I have a cat and a category and a dog"
matches = re.findall(pattern, text)
print(matches)  # ['cat', 'dog']
```
Exercise 3.
Write a regex that matches a simple Python variable name: starts with a letter or underscore, followed by zero or more letters, digits, or underscores. Test against "my_var", "_private", "2invalid", "class", and "hello_world_123".
Solution to Exercise 3
```python
import re

pattern = r'^[a-zA-Z_]\w*$'

tests = ["my_var", "_private", "2invalid", "class", "hello_world_123"]
for t in tests:
    match = re.fullmatch(pattern, t)
    print(f"'{t}': {'Valid name' if match else 'Invalid'}")

# 'my_var': Valid name
# '_private': Valid name
# '2invalid': Invalid
# 'class': Valid name (syntactically valid, even if keyword)
# 'hello_world_123': Valid name
```
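
To reject reserved words such as "class" as well, the regex check can be combined with the standard library's `keyword.iskeyword`. This is a sketch; `NAME_PATTERN` and `is_usable_name` are hypothetical names chosen for this example.

```python
import keyword
import re

NAME_PATTERN = re.compile(r'[a-zA-Z_]\w*')

def is_usable_name(name: str) -> bool:
    """True if name has the shape of an identifier and is not a Python keyword."""
    return bool(NAME_PATTERN.fullmatch(name)) and not keyword.iskeyword(name)

for t in ["my_var", "_private", "2invalid", "class", "hello_world_123"]:
    print(f"'{t}': {is_usable_name(t)}")
# 'my_var': True
# '_private': True
# '2invalid': False
# 'class': False
# 'hello_world_123': True
```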