Character Classes

What Is a Character Class?

A character class (also called a character set) matches one character from a defined set. Character classes are enclosed in square brackets [...].

import re

# Match any vowel
re.findall(r'[aeiou]', 'Hello World')
# ['e', 'o', 'o']

# Match any digit
re.findall(r'[0123456789]', 'Room 404')
# ['4', '0', '4']
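
A class always consumes exactly one character, no matter how many characters are listed between the brackets. As a small illustration (the sample words are made up for this sketch), `gr[ae]y` accepts either spelling but never both letters at once:

```python
import re

# [ae] stands for exactly one character: either 'a' or 'e'
spellings = re.findall(r'gr[ae]y', 'grey gray griy graey')
print(spellings)
# ['grey', 'gray']
```

To allow several characters from the set in a row, a quantifier such as `+` has to follow the class.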

Ranges

Use a hyphen - inside brackets to specify a range of characters:

import re

text = "Agent 007 has clearance Level-A3"

# Digit range (equivalent to \d in ASCII mode)
re.findall(r'[0-9]', text)
# ['0', '0', '7', '3']

# Lowercase letters
re.findall(r'[a-z]+', text)
# ['gent', 'has', 'clearance', 'evel']

# Uppercase letters
re.findall(r'[A-Z]+', text)
# ['A', 'L', 'A']

# Letters and digits combined
re.findall(r'[a-zA-Z0-9]+', text)
# ['Agent', '007', 'has', 'clearance', 'Level', 'A3']

Multiple ranges can be combined in a single class:

# Hexadecimal digits
re.findall(r'[0-9a-fA-F]+', '0xFF 0x1A 255 0xGG')
# ['0', 'FF', '0', '1A', '255', '0']
# 'x' and 'G' are not hex digits, so '0xGG' contributes only its leading '0'
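
Ranges follow code point order, not alphabetical intuition. A common slip is writing `[A-z]` for "any letter". The sketch below (the sample string is only an illustration) shows that this range also picks up the punctuation that sits between `Z` and `a`:

```python
import re

# 'A' is code point 65 and 'z' is 122; the characters [ \ ] ^ _ `
# occupy 91-96, between 'Z' (90) and 'a' (97), so [A-z] accepts them too.
too_wide = re.findall(r'[A-z]', 'a_Z^[')
letters_only = re.findall(r'[A-Za-z]', 'a_Z^[')

print(too_wide)      # ['a', '_', 'Z', '^', '[']
print(letters_only)  # ['a', 'Z']
```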

Negated Character Classes

A caret ^ at the beginning of a character class negates it — matching any character not in the set:

import re

# Match non-digits
re.findall(r'[^0-9]+', 'Room 404 is on Floor 4')
# ['Room ', ' is on Floor ']

# Match non-vowels
re.findall(r'[^aeiouAEIOU]+', 'Hello World')
# ['H', 'll', ' W', 'rld']

# Match non-whitespace (similar to \S)
re.findall(r'[^ \t\n]+', 'hello   world')
# ['hello', 'world']
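
One consequence that is easy to miss: a negated class matches every character that is not listed, and that includes newlines. The following sketch, using a made-up comma-separated string, contrasts this with the dot, which stops at a newline unless `re.DOTALL` is set:

```python
import re

text = 'a,b\nc'

# A negated class matches any character not listed, newline included
fields = re.findall(r'[^,]+', text)

# The dot does not match '\n' by default, so the run is split there
dot_runs = re.findall(r'.+', text)

print(fields)    # ['a', 'b\nc']
print(dot_runs)  # ['a,b', 'c']
```

If the match should stay on one line, add `\n` to the excluded set, for example `[^,\n]+`.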

Caret Position Matters

The ^ only negates when it appears as the first character inside [...]. Elsewhere, it matches a literal caret: [a^b] matches a, ^, or b.
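
The three placements of the caret can be compared side by side. This is a minimal sketch on an arbitrary sample string:

```python
import re

text = 'a^b=c'

# Caret not in first position: a literal '^' in the set
literal_caret = re.findall(r'[a^b]', text)

# Leading caret negates; the second caret is just a member of the excluded set
negated = re.findall(r'[^a^b]', text)

# An escaped caret is literal even in first position
escaped_first = re.findall(r'[\^ab]', text)

print(literal_caret)  # ['a', '^', 'b']
print(negated)        # ['=', 'c']
print(escaped_first)  # ['a', '^', 'b']
```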

Special Characters Inside Classes

Most metacharacters lose their special meaning inside character classes. Only a few remain special:

Character	Special inside `[...]`?	How to use literally
`]`	Yes — closes the class	`\]` or place first: `[]abc]`
`\`	Yes — escape character	`\\`
`^`	Yes — negation (only if first)	Place after first position: `[a^b]`
`-`	Yes — range operator	`\-` or place first/last: `[-abc]` or `[abc-]`

import re

# Match literal special characters
re.findall(r'[\[\]]', 'array[0] = list[1]')
# ['[', ']', '[', ']']

# Hyphen at end — matches literal hyphen
re.findall(r'[a-z-]+', 'well-known self-driving')
# ['well-known', 'self-driving']

# Dot inside class — just a literal dot
re.findall(r'[.]', 'version 3.14')
# ['.']
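
When the characters for a class come from a variable rather than being typed by hand, escaping them one by one is error-prone. One possible approach is to let `re.escape()` do it. The helper name `char_class` below is hypothetical, and the sketch assumes a non-empty input string:

```python
import re

def char_class(chars):
    """Build a class matching any one character in `chars` (hypothetical helper)."""
    # re.escape() backslash-escapes ], \, ^ and -, so none of them can
    # close the class, negate it, or form a range by accident.
    return '[' + re.escape(chars) + ']'

pattern = char_class('^-]\\')
found = re.findall(pattern, r'a^b-c]d\e')
print(found)
# ['^', '-', ']', '\\']
```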

Shorthand Classes vs Bracket Notation

The shorthand classes \d, \w, \s and their negations can be used inside character classes:

import re

# Digits or hyphens (for phone numbers)
re.findall(r'[\d-]+', 'Call 555-123-4567 today')
# ['555-123-4567']

# Word characters or dots (for filenames)
re.findall(r'[\w.]+', 'file_v2.py and data.csv')
# ['file_v2.py', 'and', 'data.csv']

# Digits and whitespace
re.findall(r'[\d\s]+', 'score: 42 out of 50')
# [' 42 ', ' ', ' 50']
# The single space between 'out' and 'of' is whitespace too, so it matches on its own
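
Negated shorthands can also go inside a negated class, which allows a kind of set subtraction. A well-known idiom is `[^\W\d_]`, which leaves only letters. The sample string here is made up for illustration:

```python
import re

# [^\W\d_] = "not a non-word character, not a digit, not an underscore",
# which leaves only letters (Unicode letters for str patterns).
letter_runs = re.findall(r'[^\W\d_]+', 'file_v2 caf\u00e9 42')
print(letter_runs)
# ['file', 'v', 'café']
```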

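POSIX-like Classes (Unicode)

Python's re module has no POSIX bracket classes such as [[:digit:]]; the shorthand classes \d, \w, and \s fill that role. In str patterns they match Unicode characters by default. Use the re.ASCII flag to restrict them to ASCII: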

import re

# \d matches Unicode digits by default
re.findall(r'\d+', '123 ١٢٣ ୧୨୩')
# ['123', '١٢٣', '୧୨୩']

# Restrict to ASCII digits
re.findall(r'\d+', '123 ١٢٣ ୧୨୩', re.ASCII)
# ['123']
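
The same flag affects `\w`. The sketch below writes the accented letters as escapes so the code points are unambiguous, and also shows the inline form `(?a)`, which sets the flag from inside the pattern:

```python
import re

text = 'na\u00efve caf\u00e9'  # "naïve café", written with escapes

unicode_words = re.findall(r'\w+', text)
ascii_words = re.findall(r'\w+', text, flags=re.ASCII)
inline_flag = re.findall(r'(?a)\w+', text)  # same effect as flags=re.ASCII

print(unicode_words)  # ['naïve', 'café']
print(ascii_words)    # ['na', 've', 'caf']
```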

Practical Examples

Matching Identifiers

A valid Python identifier starts with a letter or underscore, followed by letters, digits, or underscores:

import re

text = "x = 42; _name = 'hello'; 3bad = True"
re.findall(r'[a-zA-Z_]\w*', text)
# ['x', '_name', 'hello', 'bad', 'True']
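
Note that the pattern above picks `bad` out of the invalid name `3bad`, because a match may start in the middle of a word. One way to tighten it, shown here as a sketch rather than a full tokenizer, is to require a word boundary `\b` before the first character:

```python
import re

text = "x = 42; _name = 'hello'; 3bad = True"

# \b requires a word boundary before the first character, so the
# pattern can no longer start matching in the middle of '3bad'.
names = re.findall(r'\b[a-zA-Z_]\w*', text)
print(names)
# ['x', '_name', 'hello', 'True']
```

This still reports string contents (`hello`) and keywords (`True`); telling those apart needs a real parser, not a character class.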

Extracting Vowels and Consonants

import re

word = "Mississippi"
vowels = re.findall(r'[aeiouAEIOU]', word)
consonants = re.findall(r'[^aeiouAEIOU]', word)

print(f"Vowels: {vowels}")       # ['i', 'i', 'i', 'i']
print(f"Consonants: {consonants}")  # ['M', 's', 's', 's', 's', 'p', 'p']

Matching Hex Color Codes

import re

css = "color: #FF5733; background: #0a0; border: #12ab"
# Any run of 3 to 6 hex digits (too permissive: also accepts 4 or 5 digits)
re.findall(r'#[0-9a-fA-F]{3,6}\b', css)
# ['#FF5733', '#0a0', '#12ab']

# Strictly 6-digit or 3-digit
re.findall(r'#(?:[0-9a-fA-F]{6}|[0-9a-fA-F]{3})\b', css)
# ['#FF5733', '#0a0']
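
For validating a single value rather than searching a longer string, `fullmatch()` removes the need for `\b`. The names `HEX_COLOR` and `is_hex_color` below are hypothetical, introduced only for this sketch:

```python
import re

# Hypothetical helper: accept exactly 3 or exactly 6 hex digits after '#'
HEX_COLOR = re.compile(r'#(?:[0-9a-fA-F]{3}|[0-9a-fA-F]{6})')

def is_hex_color(value):
    """Return True if the whole string is a 3- or 6-digit hex color."""
    return HEX_COLOR.fullmatch(value) is not None

results = {v: is_hex_color(v) for v in ['#FF5733', '#0a0', '#12ab', '#GGG']}
print(results)
# {'#FF5733': True, '#0a0': True, '#12ab': False, '#GGG': False}
```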

Summary

Concept	Key Takeaway
`[abc]`	Matches one character: `a`, `b`, or `c`
`[a-z]`	Matches one character in the range `a` to `z`
`[^abc]`	Matches one character not in the set
`-` in class	Range operator; literal if first or last
`^` in class	Negation only if first character
Metacharacters	Most lose special meaning inside `[...]`
`\d \w \s`	Can be used inside character classes

Runnable Example: `character_classes_tutorial.py`

r"""
Python Regular Expressions - Tutorial 02: Character Classes
===========================================================

LEARNING OBJECTIVES:
-------------------
1. Understand character classes and their syntax
2. Use predefined character classes (\d, \w, \s, etc.)
3. Create custom character classes with [...]
4. Use character ranges [a-z], [0-9]
5. Understand negated character classes [^...]
6. Combine character classes for complex matching

PREREQUISITES:
-------------
- Tutorial 01: Regex Basics
- Understanding of re.match(), re.search(), re.findall()

DIFFICULTY: BEGINNER
"""

import re

# ==============================================================================
# SECTION 1: INTRODUCTION TO CHARACTER CLASSES
# ==============================================================================

if __name__ == "__main__":

    """
    CHARACTER CLASSES allow you to match one character from a set of characters.
    Instead of matching exact text, you can match "any digit" or "any letter".

    Basic Syntax:
    - [abc]   : Matches 'a', 'b', OR 'c' (any single character from the set)
    - [^abc]  : Matches any character EXCEPT 'a', 'b', or 'c'
    - [a-z]   : Matches any lowercase letter from 'a' to 'z'
    - [0-9]   : Matches any digit from '0' to '9'

    This is much more powerful than literal matching!
    """

    print("="*70)
    print("SECTION 1: BASIC CHARACTER CLASSES")
    print("="*70)

    # Example 1: Simple character class
    # ---------------------------------
    # Match a single vowel
    pattern1 = r"[aeiou]"  # Matches any single vowel
    text1 = "hello world"

    vowels = re.findall(pattern1, text1)
    print(f"Text: '{text1}'")
    print(f"Pattern: '{pattern1}' (matches any vowel)")
    print(f"Matches found: {vowels}")
    print(f"Total vowels: {len(vowels)}")

    print()

    # Example 2: Matching specific characters
    # ---------------------------------------
    # Match only the letters 'c', 'a', 't'
    pattern2 = r"[cat]"
    text2 = "the cat sat on the mat"

    matches = re.findall(pattern2, text2)
    print(f"Text: '{text2}'")
    print(f"Pattern: '{pattern2}' (matches 'c', 'a', or 't')")
    print(f"Matches: {matches}")
    print(f"Count: {len(matches)}")

    print()

    # ==============================================================================
    # SECTION 2: CHARACTER RANGES
    # ==============================================================================

    """
    RANGES allow you to specify a sequence of characters without listing them all.

    Common ranges:
    - [a-z]   : All lowercase letters
    - [A-Z]   : All uppercase letters
    - [0-9]   : All digits
    - [a-zA-Z]: All letters (upper and lower)
    - [a-z0-9]: All lowercase letters and digits
    """

    print("="*70)
    print("SECTION 2: CHARACTER RANGES")
    print("="*70)

    # Example 3: Matching lowercase letters
    # -------------------------------------
    pattern3 = r"[a-z]"
    text3 = "Hello World 123"

    lowercase = re.findall(pattern3, text3)
    print(f"Text: '{text3}'")
    print(f"Pattern: '{pattern3}' (any lowercase letter)")
    print(f"Lowercase letters found: {lowercase}")

    print()

    # Example 4: Matching digits
    # --------------------------
    pattern4 = r"[0-9]"
    text4 = "Room 101, Floor 5, Building A"

    digits = re.findall(pattern4, text4)
    print(f"Text: '{text4}'")
    print(f"Pattern: '{pattern4}' (any digit)")
    print(f"Digits found: {digits}")

    print()

    # Example 5: Combining ranges
    # ---------------------------
    # Match any alphanumeric character (letter or digit)
    pattern5 = r"[a-zA-Z0-9]"
    text5 = "User123! #Password456"

    alphanum = re.findall(pattern5, text5)
    print(f"Text: '{text5}'")
    print(f"Pattern: '{pattern5}' (any letter or digit)")
    print(f"Alphanumeric characters: {alphanum}")
    print(f"Total: {len(alphanum)}")

    print()

    # ==============================================================================
    # SECTION 3: PREDEFINED CHARACTER CLASSES
    # ==============================================================================

    r"""
    Python regex provides SHORTHAND notations for common character classes.
    The bracket forms shown are exact in ASCII mode (re.ASCII); by default,
    str patterns also match the corresponding Unicode characters.

    \d  : Digit [0-9]
    \D  : Non-digit [^0-9]
    \w  : Word character [a-zA-Z0-9_] (letters, digits, underscore)
    \W  : Non-word character [^a-zA-Z0-9_]
    \s  : Whitespace [ \t\n\r\f\v] (space, tab, newline, etc.)
    \S  : Non-whitespace [^ \t\n\r\f\v]

    These are very commonly used and make patterns more readable.
    """

    print("="*70)
    print("SECTION 3: PREDEFINED CHARACTER CLASSES")
    print("="*70)

    # Example 6: Using \d for digits
    # ------------------------------
    pattern6 = r"\d"  # Like [0-9]; by default also matches other Unicode digits
    text6 = "I have 3 cats and 2 dogs"

    digits = re.findall(pattern6, text6)
    print(f"Text: '{text6}'")
    print(f"Pattern: '\\d' (any digit)")
    print(f"Digits: {digits}")

    print()

    # Example 7: Using \w for word characters
    # ---------------------------------------
    pattern7 = r"\w"  # Matches letters, digits, and underscore
    text7 = "hello_world123!@#"

    word_chars = re.findall(pattern7, text7)
    print(f"Text: '{text7}'")
    print(f"Pattern: '\\w' (word characters)")
    print(f"Word characters: {word_chars}")

    print()

    # Example 8: Using \s for whitespace
    # ----------------------------------
    pattern8 = r"\s"  # Matches spaces, tabs, newlines
    text8 = "hello\tworld\ntest"

    spaces = re.findall(pattern8, text8)
    print(f"Text: 'hello\\tworld\\ntest'")
    print(f"Pattern: '\\s' (whitespace)")
    print(f"Whitespace characters found: {len(spaces)}")
    print(f"Types: {repr(spaces)}")  # repr() shows special characters

    print()

    # Example 9: Using \D, \W, \S (negated versions)
    # ----------------------------------------------
    text9 = "hello123"

    # \D matches anything that's NOT a digit
    non_digits = re.findall(r"\D", text9)
    print(f"Text: '{text9}'")
    print(f"Pattern: '\\D' (non-digits)")
    print(f"Non-digit characters: {non_digits}")

    # \W matches anything that's NOT a word character
    text10 = "hello-world!"
    non_word = re.findall(r"\W", text10)
    print(f"\nText: '{text10}'")
    print(f"Pattern: '\\W' (non-word chars)")
    print(f"Non-word characters: {non_word}")

    # \S matches anything that's NOT whitespace
    text11 = "a b c"
    non_space = re.findall(r"\S", text11)
    print(f"\nText: '{text11}'")
    print(f"Pattern: '\\S' (non-whitespace)")
    print(f"Non-whitespace characters: {non_space}")

    print()

    # ==============================================================================
    # SECTION 4: NEGATED CHARACTER CLASSES
    # ==============================================================================

    """
    NEGATED CHARACTER CLASSES match any character EXCEPT those in the class.
    Syntax: [^characters]

    The ^ symbol at the START of a character class means "not".
    Note: This is different from ^ as an anchor (which we'll learn later).
    """

    print("="*70)
    print("SECTION 4: NEGATED CHARACTER CLASSES")
    print("="*70)

    # Example 10: Matching non-vowels
    # -------------------------------
    pattern10 = r"[^aeiou]"  # Matches anything that's NOT a vowel
    text10 = "hello"

    non_vowels = re.findall(pattern10, text10)
    print(f"Text: '{text10}'")
    print(f"Pattern: '[^aeiou]' (not a vowel)")
    print(f"Non-vowels: {non_vowels}")

    print()

    # Example 11: Matching non-digits
    # -------------------------------
    pattern11 = r"[^0-9]"  # Same as \D
    text11 = "Room 404"

    non_digits_custom = re.findall(pattern11, text11)
    print(f"Text: '{text11}'")
    print(f"Pattern: '[^0-9]' (not a digit)")
    print(f"Non-digits: {non_digits_custom}")

    print()

    # Example 12: Excluding specific characters
    # -----------------------------------------
    # Match any character except a period, a comma, or a space
    pattern12 = r"[^., ]"  # Not period, comma, or space
    text12 = "Hello, World. Test."

    chars = re.findall(pattern12, text12)
    print(f"Text: '{text12}'")
    print(f"Pattern: '[^., ]' (not period, comma, or space)")
    print(f"Characters: {chars}")

    print()

    # ==============================================================================
    # SECTION 5: THE DOT (.) METACHARACTER
    # ==============================================================================

    """
    The DOT (.) is a special metacharacter that matches ANY character except newline.
    It's like a wildcard - use it carefully!

    . : Matches any single character (except \n by default)
    """

    print("="*70)
    print("SECTION 5: THE DOT METACHARACTER")
    print("="*70)

    # Example 13: Using dot to match any character
    # --------------------------------------------
    pattern13 = r"c.t"  # 'c', followed by ANY character, followed by 't'
    text13 = "cat cut cot c t c9t"

    matches = re.findall(pattern13, text13)
    print(f"Text: '{text13}'")
    print(f"Pattern: 'c.t' ('c' + any char + 't')")
    print(f"Matches: {matches}")

    print()

    # Example 14: Dot doesn't match newline by default
    # ------------------------------------------------
    text14 = "hello\nworld"
    pattern14 = r"hello.world"

    match = re.search(pattern14, text14)
    print(f"Text: 'hello\\nworld'")
    print(f"Pattern: 'hello.world'")
    print(f"Match found: {match is not None}")
    print("(The dot doesn't match the newline by default)")

    print()

    # Example 15: Matching a literal dot
    # ----------------------------------
    # To match an actual period, escape it with backslash
    pattern15 = r"\."  # Matches a literal period
    text15 = "3.14 is pi"

    periods = re.findall(pattern15, text15)
    print(f"Text: '{text15}'")
    print(f"Pattern: '\\.' (literal period)")
    print(f"Periods found: {periods}")

    print()

    # ==============================================================================
    # SECTION 6: PRACTICAL EXAMPLES
    # ==============================================================================

    print("="*70)
    print("SECTION 6: PRACTICAL EXAMPLES")
    print("="*70)

    # Example 16: Extracting all words from text
    # ------------------------------------------
    text16 = "Hello, World! This is a test-123."

    # Match sequences of word characters
    words = re.findall(r"\w+", text16)  # \w+ means one or more word characters
    print(f"Text: '{text16}'")
    print(f"Words extracted: {words}")

    print()

    # Example 17: Finding phone number digits
    # ---------------------------------------
    text17 = "Call me at 555-1234 or 555-5678"

    # Extract all digit sequences
    numbers = re.findall(r"\d+", text17)  # \d+ means one or more digits
    print(f"Text: '{text17}'")
    print(f"Number sequences: {numbers}")

    print()

    # Example 18: Identifying non-alphanumeric characters
    # ---------------------------------------------------
    text18 = "email@domain.com"

    # Find all characters that are not letters or digits
    special_chars = re.findall(r"[^a-zA-Z0-9]", text18)
    print(f"Text: '{text18}'")
    print(f"Special characters: {special_chars}")

    print()

    # Example 19: Matching hex digits
    # -------------------------------
    # Hex digits are 0-9 and A-F (or a-f)
    pattern19 = r"[0-9A-Fa-f]"
    text19 = "Color: #FF5733, #00AAFF"

    hex_digits = re.findall(pattern19, text19)
    print(f"Text: '{text19}'")
    print(f"Hex digits: {hex_digits}")
    print(f"Total: {len(hex_digits)}")

    print()

    # Example 20: Validating single character input
    # ---------------------------------------------
    def validate_grade(grade):
        """
        Check if input is a valid letter grade (A, B, C, D, F).
        """
        # re.fullmatch() succeeds only if the ENTIRE string matches the
        # pattern, so no anchors or separate length check are needed.
        return re.fullmatch(r"[ABCDF]", grade) is not None

    # Test the function
    test_grades = ["A", "B", "C", "D", "F", "E", "Z", "AB"]
    print("Grade validation:")
    for grade in test_grades:
        result = "Valid" if validate_grade(grade) else "Invalid"
        print(f"  '{grade}': {result}")

    print()

    # ==============================================================================
    # SECTION 7: COMBINING CHARACTER CLASSES
    # ==============================================================================

    print("="*70)
    print("SECTION 7: COMBINING CHARACTER CLASSES")
    print("="*70)

    # Example 21: Complex character class
    # -----------------------------------
    # Match letters, digits, and specific symbols
    pattern21 = r"[a-zA-Z0-9_\-.]"  # Letters, digits, underscore, hyphen, period
    text21 = "user_name-123@domain.com"

    valid_chars = re.findall(pattern21, text21)
    print(f"Text: '{text21}'")
    print(f"Pattern: '[a-zA-Z0-9_\\-.]'")
    print(f"Valid characters: {valid_chars}")

    print()

    # Example 22: Using multiple character classes in one pattern
    # ----------------------------------------------------------
    # Match: digit, followed by any letter, followed by digit
    pattern22 = r"\d[a-zA-Z]\d"
    text22 = "Room 3A5, 4B7, and 2Z9"

    matches = re.findall(pattern22, text22)
    print(f"Text: '{text22}'")
    print(f"Pattern: '\\d[a-zA-Z]\\d' (digit-letter-digit)")
    print(f"Matches: {matches}")

    print()

    # ==============================================================================
    # SECTION 8: COMMON MISTAKES TO AVOID
    # ==============================================================================

    print("="*70)
    print("SECTION 8: COMMON MISTAKES TO AVOID")
    print("="*70)

    # Mistake 1: Forgetting to escape special characters in character class
    # --------------------------------------------------------------------
    print("Mistake 1: Special characters in character classes")

    # If you want to match a literal hyphen in a character class,
    # put it at the start or end, or escape it
    pattern_ambiguous = r"[a-z-0-9]"  # Python reads the middle '-' as a literal, but this is easy to misread
    pattern_right = r"[a-z0-9\-]"  # Escaped hyphen
    pattern_right2 = r"[-a-z0-9]"  # Hyphen at start

    text = "hello-world123"
    print(f"Text: '{text}'")
    print(f"Pattern '[a-z0-9\\-]': {re.findall(pattern_right, text)}")

    print()

    # Mistake 2: Confusing [^...] with \^
    # -----------------------------------
    print("Mistake 2: Understanding negation")
    print("  [^abc] means: match anything EXCEPT a, b, or c")
    print("  \\^ means: match a literal ^ character")

    text = "^hello"
    pattern_neg = r"[^h]"  # Matches anything except 'h'
    pattern_literal = r"\^"  # Matches literal ^

    print(f"Text: '{text}'")
    print(f"[^h]: {re.findall(pattern_neg, text)}")
    print(f"\\^: {re.findall(pattern_literal, text)}")

    print()

    # Mistake 3: Thinking . matches newline
    # -------------------------------------
    print("Mistake 3: The dot doesn't match newline by default")
    text_multiline = "line1\nline2"
    print(f"Text: 'line1\\nline2'")
    print(f"Pattern '.+': {re.findall(r'.+', text_multiline)}")
    print("(Each line matched separately, newline not included)")

    print()

    # ==============================================================================
    # SECTION 9: SUMMARY AND CHEAT SHEET
    # ==============================================================================

    print("="*70)
    print("SUMMARY: CHARACTER CLASS CHEAT SHEET")
    print("="*70)

    cheat_sheet = """
    BASIC CHARACTER CLASSES:
      [abc]         Match 'a', 'b', or 'c'
      [a-z]         Match any lowercase letter
      [A-Z]         Match any uppercase letter
      [0-9]         Match any digit
      [a-zA-Z]      Match any letter
      [a-zA-Z0-9]   Match any letter or digit

    NEGATED CLASSES:
      [^abc]        Match anything EXCEPT 'a', 'b', or 'c'
      [^0-9]        Match anything that's not a digit

    PREDEFINED CLASSES:
      \\d           Digit [0-9]
      \\D           Non-digit [^0-9]
      \\w           Word character [a-zA-Z0-9_]
      \\W           Non-word character
      \\s           Whitespace (space, tab, newline, etc.)
      \\S           Non-whitespace
      .            Any character except newline

    SPECIAL NOTES:
      - Always use raw strings: r"pattern"
      - Inside a class only ] \\ ^ - may need escaping; . * + ? are literal
      - Put hyphen at start or end to match literally: [-abc] or [abc-]
      - ^ at START of class means negation: [^abc]
      - ^ elsewhere is literal: [a^bc]
    """

    print(cheat_sheet)

    # ==============================================================================
    # PRACTICE EXERCISES
    # ==============================================================================

    print("="*70)
    print("PRACTICE CHALLENGES")
    print("="*70)

    """
    Try these exercises:

    1. Write a pattern to match all vowels (both upper and lowercase)
    2. Extract all punctuation marks from a string
    3. Find all hexadecimal numbers (0-9, A-F, a-f)
    4. Match all characters that are NOT spaces or punctuation
    5. Create a pattern to match DNA sequences (only A, T, G, C)
    6. Extract all word characters followed by a digit
    7. Match all characters except vowels

    Solutions in exercises_01_basics.py
    """

    # ==============================================================================
    # END OF TUTORIAL 02
    # ==============================================================================

    print("\n" + "="*70)
    print("END OF TUTORIAL - Character Classes mastered!")
    print("Next: Tutorial 03 - Quantifiers")
    print("="*70)