Pattern Syntax Basics

Mental Model

A regex pattern is read left to right, one token at a time. Most characters match themselves literally; the special characters (`. ^ $ * + ? { } [ ] \ | ( )`) carry meta-meaning. To match a special character literally, escape it with a backslash. Once these few metacharacters are familiar, most everyday patterns become readable.

Literal Characters

The simplest regex patterns are literal characters that match themselves exactly:

```python
import re

re.search(r'cat', 'The cat sat on the mat')
# <re.Match object; span=(4, 7), match='cat'>

re.findall(r'at', 'The cat sat on the mat')
# ['at', 'at', 'at']
```
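
"Exactly" includes letter case. The following sketch reuses the sentence above with one word lowercased for illustration; the variable names are not from the original text.

```python
import re

sentence = 'The cat sat on the mat'

# Literal matching is case-sensitive by default
exact = re.findall(r'the', sentence)
print(exact)       # ['the']

# re.IGNORECASE relaxes that for letters
any_case = re.findall(r'the', sentence, re.IGNORECASE)
print(any_case)    # ['The', 'the']
```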

Most characters match themselves, but certain characters have special meaning in regex and must be escaped with a backslash to match literally.

Metacharacters

These characters have special meaning in regex:

. ^ $ * + ? { } [ ] \ | ( )

| Metacharacter | Meaning |
|---|---|
| `.` | Match any character except newline |
| `^` | Match the start of the string |
| `$` | Match the end of the string |
| `*` | Zero or more repetitions |
| `+` | One or more repetitions |
| `?` | Zero or one repetition |
| `{m,n}` | Between m and n repetitions |
| `[]` | Character class (set of characters) |
| `\` | Escape character |
| `\|` | Alternation (OR) |
| `()` | Grouping and capturing |
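
The quantifiers and character classes in the table are not demonstrated elsewhere in this section, so here is a small sketch of each. The sample strings and variable names are made up for illustration.

```python
import re

# '*' allows zero or more of the preceding token
star = re.findall(r'ab*', 'a ab abbb')
print(star)        # ['a', 'ab', 'abbb']

# '+' requires at least one
plus = re.findall(r'ab+', 'a ab abbb')
print(plus)        # ['ab', 'abbb']

# '?' makes the preceding token optional
optional = re.findall(r'colou?r', 'color colour')
print(optional)    # ['color', 'colour']

# '{m,n}' bounds the repetition count (here: runs of 2 or 3 digits)
bounded = re.findall(r'\b\d{2,3}\b', '1 22 333 4444')
print(bounded)     # ['22', '333']

# '[]' matches any one character from the set
vowels = re.findall(r'[aeiou]', 'regex')
print(vowels)      # ['e', 'e']
```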

The Dot (`.`)

The dot matches any single character except a newline:

```python
import re

re.findall(r'c.t', 'cat cot cut c\nt c9t')
# ['cat', 'cot', 'cut', 'c9t']
```

With re.DOTALL, the dot also matches newlines:

```python
import re

re.findall(r'c.t', 'cat c\nt', re.DOTALL)
# ['cat', 'c\nt']
```

Anchors (`^` and `$`)

Anchors match positions, not characters:

```python
import re

text = "hello world"

re.search(r'^hello', text)   # Matches: 'hello' is at start
re.search(r'^world', text)   # None: 'world' is not at start
re.search(r'world$', text)   # Matches: 'world' is at end
re.search(r'hello$', text)   # None: 'hello' is not at end
```

With re.MULTILINE, ^ and $ match at the start/end of each line:

```python
import re

text = "first line\nsecond line\nthird line"

re.findall(r'^\w+', text)         # ['first']
re.findall(r'^\w+', text, re.M)   # ['first', 'second', 'third']
re.findall(r'\w+$', text, re.M)   # ['line', 'line', 'line']
```
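
One detail worth knowing about `$`: even without re.MULTILINE, it also matches just before a single trailing newline. The sketch below illustrates this with a made-up string and contrasts it with re.fullmatch, which requires the pattern to cover the entire string.

```python
import re

line = "hello world\n"   # e.g. a line read from a file, newline included

# '$' matches at the very end and also just before a trailing newline
loose = re.search(r'world$', line)
print(loose is not None)    # True

# re.fullmatch must consume the whole string, so the '\n' makes it fail
strict = re.fullmatch(r'hello world', line)
print(strict is None)       # True
```

When a trailing newline should count as a mismatch, re.fullmatch is usually the simpler choice.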

Alternation (`|`)

The pipe acts as a logical OR:

```python
import re

re.findall(r'cat|dog', 'I have a cat and a dog')
# ['cat', 'dog']

# Alternation applies to the whole expression on each side
re.findall(r'gray|grey', 'gray and grey')
# ['gray', 'grey']

# Use parentheses to limit alternation scope
re.findall(r'gr(a|e)y', 'gray and grey')
# ['a', 'e'] (returns the captured groups, not the full matches)

# Non-capturing group to get full match
re.findall(r'gr(?:a|e)y', 'gray and grey')
# ['gray', 'grey']
```
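
Because alternation spans the whole expression on each side, anchors bind to only one alternative unless the alternatives are grouped. A minimal sketch, using an illustrative word list that is not part of the original text:

```python
import re

words = ['cat', 'catalog', 'hotdog', 'dog', 'bird']

# Without grouping, this reads as (^cat) OR (dog$)
loose = [w for w in words if re.search(r'^cat|dog$', w)]
print(loose)     # ['cat', 'catalog', 'hotdog', 'dog']

# Grouping applies both anchors to each alternative
strict = [w for w in words if re.search(r'^(?:cat|dog)$', w)]
print(strict)    # ['cat', 'dog']
```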

Escaping Metacharacters

To match a metacharacter literally, precede it with a backslash:

```python
import re

# Match a literal dot
re.findall(r'\.', 'version 3.14.1')
# ['.', '.']

# Match a literal dollar sign
re.search(r'\$\d+', 'Price: $25')
# <re.Match object; span=(7, 10), match='$25'>

# Match literal parentheses
re.search(r'\(.*?\)', 'func(arg)')
# <re.Match object; span=(4, 9), match='(arg)'>
```

You can also use re.escape() to escape all metacharacters in a string:

```python
import re

special = "price is $5.00 (USD)"
escaped = re.escape(special)
print(escaped)
# price\ is\ \$5\.00\ \(USD\)
```

This is useful when building patterns from user input.
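
As a rough sketch of why this matters, the example below searches for a user-supplied string with and without escaping. The helper `count_occurrences` and the sample values are hypothetical, introduced only for illustration.

```python
import re

def count_occurrences(needle: str, haystack: str) -> int:
    """Count literal occurrences of needle (hypothetical helper)."""
    return len(re.findall(re.escape(needle), haystack))

user_input = "3.14"   # pretend this came from a user
text = "3.14 is close to pi, 3514 is not"

# Unescaped, the '.' acts as a wildcard and also matches '3514'
unescaped = len(re.findall(user_input, text))
escaped = count_occurrences(user_input, text)
print(unescaped, escaped)   # 2 1
```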

Shorthand Character Classes

These shortcuts match common character categories:

| Shorthand | Meaning | Equivalent Class (ASCII) |
|---|---|---|
| `\d` | Any digit | `[0-9]` |
| `\D` | Any non-digit | `[^0-9]` |
| `\w` | Any word character | `[a-zA-Z0-9_]` |
| `\W` | Any non-word character | `[^a-zA-Z0-9_]` |
| `\s` | Any whitespace | `[ \t\n\r\f\v]` |
| `\S` | Any non-whitespace | `[^ \t\n\r\f\v]` |

```python
import re

text = "Agent 007 arrived at 14:30"

re.findall(r'\d+', text)   # ['007', '14', '30']
re.findall(r'\w+', text)   # ['Agent', '007', 'arrived', 'at', '14', '30']
re.findall(r'\s+', text)   # [' ', ' ', ' ', ' ']
```
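
The bracketed equivalents in the table hold exactly only for ASCII matching. For `str` patterns in Python 3, the shorthand classes are Unicode-aware by default, and the re.ASCII flag restricts them to the ASCII sets. A small sketch, using a made-up string that contains an accented letter and fullwidth digits:

```python
import re

sample = "café ４２ 42"   # 'é' is non-ASCII; '４２' are fullwidth digits

unicode_words = re.findall(r'\w+', sample)
ascii_words = re.findall(r'\w+', sample, re.ASCII)
unicode_digits = re.findall(r'\d+', sample)
ascii_digits = re.findall(r'\d+', sample, re.ASCII)

print(unicode_words)    # ['café', '４２', '42']
print(ascii_words)      # ['caf', '42']
print(unicode_digits)   # ['４２', '42']
print(ascii_digits)     # ['42']
```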

Word Boundary (`\b`)

\b matches the boundary between a word character and a non-word character (or start/end of string). It matches a position, not a character:

```python
import re

text = "cat concatenate category"

re.findall(r'cat', text)       # ['cat', 'cat', 'cat'] (also matches inside words)
re.findall(r'\bcat\b', text)   # ['cat'] (only the standalone word)
re.findall(r'\bcat', text)     # ['cat', 'cat'] (words starting with 'cat')
re.findall(r'cat\b', text)     # ['cat'] (only 'cat' itself ends with 'cat' here)
```

Word Boundary and Raw Strings

Always use raw strings with \b. In a regular Python string, \b is the backspace character (ASCII 8), not a word boundary.
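
A quick way to see the difference, sketched with the same sample text as above (the variable names are illustrative):

```python
import re

text = "cat concatenate category"

# Without the r prefix, '\b' is a single backspace character
print('\b' == '\x08')            # True
print(len('\b'), len(r'\b'))     # 1 2

# The non-raw pattern looks for backspace + 'cat' + backspace
non_raw = re.findall('\bcat\b', text)
raw = re.findall(r'\bcat\b', text)
print(non_raw)   # []
print(raw)       # ['cat']
```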

Building Patterns Incrementally

Complex patterns are best built step by step:

```python
import re

# Goal: match a date like "2024-01-15"

# Step 1: match four digits
r'\d{4}'

# Step 2: add a hyphen and two digits
r'\d{4}-\d{2}'

# Step 3: complete the pattern
r'\d{4}-\d{2}-\d{2}'

text = "Date: 2024-01-15, updated: 2024-12-31"
re.findall(r'\d{4}-\d{2}-\d{2}', text)
# ['2024-01-15', '2024-12-31']
```
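
One way to keep the individual steps visible in the finished pattern is the re.VERBOSE flag, which ignores whitespace in the pattern and allows `#` comments. This flag is not covered above; the sketch below is only one possible way to write the same date pattern, and the name `date_pattern` is illustrative.

```python
import re

date_pattern = re.compile(r'''
    \d{4}     # step 1: four-digit year
    -\d{2}    # step 2: hyphen and two-digit month
    -\d{2}    # step 3: hyphen and two-digit day
''', re.VERBOSE)

text = "Date: 2024-01-15, updated: 2024-12-31"
dates = date_pattern.findall(text)
print(dates)   # ['2024-01-15', '2024-12-31']
```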

Summary

| Concept | Key Takeaway |
|---|---|
| Literal characters | Match themselves exactly |
| Metacharacters | `. ^ $ * + ? { } [ ] \ \| ( )` have special meaning |
| `.` (dot) | Matches any character except newline |
| `^` / `$` | Match start/end of string (or line with `re.M`) |
| `\|` | Alternation: logical OR |
| `\` escape | Precede metacharacters with `\` to match literally |
| `\d \w \s` | Shorthand for digit, word, whitespace classes |
| `\b` | Word boundary (position, not character) |

Exercises

Exercise 1. Write a regex pattern that matches a simple IP address format: four groups of 1-3 digits separated by dots (e.g., 192.168.1.1). Use the dot metacharacter properly (escaped). Test against "192.168.1.1", "256.1.1.1", and "10.0.0.1".

Solution to Exercise 1

```python
import re

# The dots are escaped so they match literal '.' characters only
pattern = r'^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$'

tests = ["192.168.1.1", "256.1.1.1", "10.0.0.1", "1.2.3"]
for t in tests:
    match = re.fullmatch(pattern, t)
    print(f"'{t}': {'Match' if match else 'No match'}")

# Note: this matches format only, not valid ranges (0-255)
```
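
If the 0-255 range also needs checking, one option is to keep the regex for the format and do the range check in plain Python. The helper `is_valid_ipv4` below is a hypothetical sketch, not part of the exercise.

```python
import re

IP_FORMAT = re.compile(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}')

def is_valid_ipv4(candidate: str) -> bool:
    """Hypothetical helper: regex checks the format, int() checks the range."""
    if not IP_FORMAT.fullmatch(candidate):
        return False
    return all(0 <= int(part) <= 255 for part in candidate.split('.'))

for t in ["192.168.1.1", "256.1.1.1", "10.0.0.1", "1.2.3"]:
    print(f"'{t}': {is_valid_ipv4(t)}")
# '192.168.1.1': True
# '256.1.1.1': False
# '10.0.0.1': True
# '1.2.3': False
```

Splitting the work this way keeps the pattern short; encoding the numeric range in the regex itself is possible but much harder to read.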

Exercise 2. Write a regex using alternation (|) that matches either "cat", "dog", or "bird" as whole words only. Test it against "I have a cat and a category" -- it should match "cat" but not the "cat" in "category".

Solution to Exercise 2

```python
import re

pattern = r'\b(?:cat|dog|bird)\b'
text = "I have a cat and a category and a dog"
matches = re.findall(pattern, text)
print(matches)  # ['cat', 'dog']
```

Exercise 3. Write a regex that matches a simple Python variable name: starts with a letter or underscore, followed by zero or more letters, digits, or underscores. Test against "my_var", "_private", "2invalid", "class", and "hello_world_123".

Solution to Exercise 3

```python
import re

pattern = r'^[a-zA-Z_]\w*$'

tests = ["my_var", "_private", "2invalid", "class", "hello_world_123"]
for t in tests:
    match = re.fullmatch(pattern, t)
    print(f"'{t}': {'Valid name' if match else 'Invalid'}")

# 'my_var': Valid name
# '_private': Valid name
# '2invalid': Invalid
# 'class': Valid name (syntactically valid, even if keyword)
# 'hello_world_123': Valid name
```
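
Since the pattern accepts reserved words such as `class`, a follow-up check with the standard library's `keyword` module can filter those out. The helper `is_usable_name` is a hypothetical name used only for this sketch.

```python
import keyword
import re

NAME = re.compile(r'[a-zA-Z_]\w*')

def is_usable_name(candidate: str) -> bool:
    """Hypothetical helper: identifier syntax and not a reserved keyword."""
    return bool(NAME.fullmatch(candidate)) and not keyword.iskeyword(candidate)

print(is_usable_name("my_var"))     # True
print(is_usable_name("class"))      # False
print(is_usable_name("2invalid"))   # False
```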
