Lookahead and Lookbehind

What Are Lookarounds?

Lookarounds are zero-width assertions: they check whether a pattern matches immediately before or after the current position without consuming any characters. The text matched inside a lookaround is not included in the overall match.

Syntax	Name	Meaning
`(?=...)`	Positive lookahead	Followed by `...`
`(?!...)`	Negative lookahead	NOT followed by `...`
`(?<=...)`	Positive lookbehind	Preceded by `...`
`(?<!...)`	Negative lookbehind	NOT preceded by `...`

         Lookbehind          Lookahead
         (?<=...) (?<!...)   (?=...) (?!...)
                    ↓            ↓
    ... [before] [current pos] [after] ...
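
To make "zero-width" concrete, the following minimal sketch inspects match spans with the standard `re` module. The sample string `'foobar'` is only an illustration and does not come from the original examples.

```python
import re

# A bare lookahead matches a position, so the match itself is the empty string.
m = re.search(r'(?=bar)', 'foobar')
print(m.span())         # (3, 3): start == end, nothing was consumed
print(repr(m.group()))  # ''

# The text inside the lookahead is checked but left out of the match.
m2 = re.search(r'foo(?=bar)', 'foobar')
print(m2.group())       # foo
print(m2.span())        # (0, 3)

# Because "bar" was not consumed, a following pattern can still match it.
m3 = re.search(r'foo(?=bar)bar', 'foobar')
print(m3.group())       # foobar
```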

Positive Lookahead `(?=...)`

Succeeds at a position where the lookahead pattern matches the text immediately ahead, without consuming that text:

import re

# Find "Python" only when followed by a space and a version number
re.findall(r'Python(?=\s\d)', 'Python 3 and Python are great')
# ['Python']  — only the first "Python" (followed by " 3")

# Find words followed by a comma
re.findall(r'\w+(?=,)', 'apple, banana, cherry')
# ['apple', 'banana']

The key insight is that the lookahead text is not consumed:

import re

# Without lookahead — "ing" is consumed
re.findall(r'\w+ing', 'running jumping sitting')
# ['running', 'jumping', 'sitting']

# With lookahead — "ing" is checked but not part of the match
re.findall(r'\w+(?=ing)', 'running jumping sitting')
# ['runn', 'jump', 'sitt']

Negative Lookahead `(?!...)`

Succeeds at a position where the lookahead pattern does not match the text immediately ahead:

import re

# Match "foo" NOT followed by "bar"
re.findall(r'foo(?!bar)', 'foobar foobaz foo')
# ['foo', 'foo']  — the 'foo' in 'foobaz' and standalone 'foo'

# Match numbers NOT followed by a percent sign
re.findall(r'\d+(?!%)', '42% 100 85% 7')
# ['4', '100', '8', '7']  (careful: \d+ backtracks to '4' and '8', which are not followed by '%')

# Better: use word boundary
re.findall(r'\b\d+\b(?!%)', '42% 100 85% 7')
# ['100', '7']
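
As an alternative to the word-boundary fix, the lookahead itself can forbid stopping in the middle of a number. This is a small sketch of that idea on the same sample text; the variable names `loose` and `strict` are illustrative only.

```python
import re

text = '42% 100 85% 7'

# Plain negative lookahead: the engine backtracks from "42" to "4",
# because "4" is followed by "2", which is not a percent sign.
loose = re.findall(r'\d+(?!%)', text)

# Adding \d to the lookahead rejects any match that ends inside a number.
strict = re.findall(r'\d+(?![\d%])', text)

print(loose)   # ['4', '100', '8', '7']
print(strict)  # ['100', '7']
```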

Password Validation Example

Lookaheads are often used for validation logic: positive lookaheads check that something is present, and a negative lookahead checks that something is absent (here, whitespace):

import re

def validate_password(pw):
    """
    Requires:
    - At least 8 characters
    - At least one digit
    - At least one uppercase letter
    - At least one lowercase letter
    - No spaces
    """
    pattern = r'^(?=.*\d)(?=.*[a-z])(?=.*[A-Z])(?!.*\s).{8,}$'
    return bool(re.fullmatch(pattern, pw))

print(validate_password("Abc12345"))     # True
print(validate_password("abc12345"))     # False — no uppercase
print(validate_password("ABC12345"))     # False — no lowercase
print(validate_password("Abcdefgh"))     # False — no digit
print(validate_password("Ab 12345"))     # False — contains space
print(validate_password("Ab12"))         # False — too short

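Because lookaheads are zero-width, stacked lookaheads are all evaluated at the same position (here, position 0) and act as AND conditions: each must be satisfied independently before `.{8,}` consumes the string.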
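
Since each lookahead is an independent condition, the rules can also be tested one at a time, which makes it possible to report which requirement failed. The sketch below assumes a hypothetical `RULES` table and `failed_rules` helper; neither is part of the original example.

```python
import re

# Hypothetical rule table: one lookahead per requirement.
RULES = {
    'digit': r'(?=.*\d)',
    'lowercase': r'(?=.*[a-z])',
    'uppercase': r'(?=.*[A-Z])',
    'no spaces': r'(?!.*\s)',
    'length >= 8': r'(?=.{8,})',
}

def failed_rules(pw):
    """Return the names of the rules that pw does not satisfy."""
    # re.match tests each assertion at position 0, like the combined pattern.
    return [name for name, rule in RULES.items() if not re.match(rule, pw)]

print(failed_rules('Abc12345'))  # []
print(failed_rules('abc 123'))   # ['uppercase', 'no spaces', 'length >= 8']
```

An empty list means every assertion held, which corresponds to the combined pattern matching.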

Positive Lookbehind `(?<=...)`

Succeeds at a position where the lookbehind pattern matches the text immediately before it:

import re

# Find numbers preceded by a dollar sign
re.findall(r'(?<=\$)\d+', 'Price: $42, €50, $100')
# ['42', '100']

# Find words preceded by "@" (mentions)
re.findall(r'(?<=@)\w+', 'Hello @alice and @bob')
# ['alice', 'bob']

Fixed-Width Lookbehind

In Python's built-in `re` module, a lookbehind pattern must match strings of one fixed length. Variable-length patterns like `(?<=\d+)` are not allowed and raise `re.error`. Alternation is permitted only when every alternative has the same length: `(?<=ab|cd)` works, but `(?<=ab|cde)` raises the same error.

import re

# Fixed-width: OK
re.findall(r'(?<=\$)\d+', '$42')          # ['42']
re.findall(r'(?<=USD\s)\d+', 'USD 42')    # ['42']

# Variable-width: ERROR
try:
    re.findall(r'(?<=\$\d+\.)\d+', '$3.50')
except re.error as e:
    print(e)  # look-behind requires fixed-width pattern

# Alternation where every branch has the same width: OK
re.findall(r'(?<=\$|€)\d+', '$42 €50')    # ['42', '50']
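
The following sketch checks how the standard `re` module treats alternation inside a lookbehind, and shows one common workaround when the prefix has variable length: match the prefix normally and capture only the part of interest. The sample strings are illustrative.

```python
import re

# Alternatives of equal width compile and match as expected.
same_width = re.findall(r'(?<=ab|cd)\d', 'ab1 cd2 ef3')
print(same_width)  # ['1', '2']

# Alternatives of different widths are rejected by the standard re module.
try:
    re.compile(r'(?<=ab|cde)\d')
    rejected = False
except re.error as e:
    rejected = True
    print(e)  # look-behind requires fixed-width pattern

# Workaround: consume the variable-length prefix and capture what follows.
cents = re.findall(r'\$\d+\.(\d+)', '$3.50 $12.99')
print(cents)  # ['50', '99']
```

With a single capturing group, `re.findall` returns only the captured text, so the result looks the same as a lookbehind would produce.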

Negative Lookbehind `(?<!...)`

Succeeds at a position where the lookbehind pattern does not match the text immediately before it:

import re

# Match numbers NOT preceded by a dollar sign
re.findall(r'(?<!\$)\b\d+', 'Price: $42, quantity: 100, code: $7')
# ['100']

# Match "test" NOT preceded by "unit"
re.findall(r'(?<!unit)test', 'unittest test mytest')
# ['test', 'test']  (the standalone 'test' and the one in 'mytest')

Combining Lookarounds

Lookarounds can be combined for precise matching:

import re

# Find numbers that are both preceded by $ and followed by a decimal point
re.findall(r'(?<=\$)\d+(?=\.)', '$42.99 $100 €50.00')
# ['42']

# Find words surrounded by underscores (like _word_) 
# without including the underscores in the match
re.findall(r'(?<=_)\w+(?=_)', 'This is _bold_ and _italic_ text')
# ['bold', 'italic']

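Overlapping Matches

Because lookarounds don't consume characters, they can find overlapping occurrences. Wrap the pattern in a capturing group inside a lookahead: the match itself stays empty, and `findall` returns the captured text.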

import re

# Find all positions where "aa" occurs (including overlapping)
text = "aaa"

# Without lookahead — non-overlapping only
re.findall(r'aa', text)
# ['aa']  — finds only one

# With lookahead — overlapping
re.findall(r'(?=(aa))', text)
# ['aa', 'aa']  — finds both positions (0 and 1)
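
When the positions matter as well as the text, `re.finditer` can be used with the same lookahead. This is a minimal sketch on a slightly longer illustrative string.

```python
import re

text = 'aaaa'

# Each zero-width match marks a start position; group(1) holds the peeked text.
hits = [(m.start(), m.group(1)) for m in re.finditer(r'(?=(aa))', text)]
print(hits)  # [(0, 'aa'), (1, 'aa'), (2, 'aa')]
```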

Practical Examples

Number Formatting (Thousands Separator)

import re

def add_commas(n):
    """Add thousand separators: 1234567 → '1,234,567'"""
    s = str(n)
    # Insert comma before groups of 3 digits from the right
    # Positive lookahead: followed by groups of exactly 3 digits to the end
    # Positive lookbehind: preceded by a digit
    return re.sub(r'(?<=\d)(?=(\d{3})+$)', ',', s)

print(add_commas(1234567))     # '1,234,567'
print(add_commas(1000000000))  # '1,000,000,000'
print(add_commas(42))          # '42'
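
As a sanity check, the regex can be compared with Python's built-in `,` format specifier, and probed with inputs it was not designed for. The sketch repeats `add_commas` so that it runs on its own; the extra test values are illustrative.

```python
import re

def add_commas(n):
    """Same substitution as in the example, repeated so this block is self-contained."""
    return re.sub(r'(?<=\d)(?=(\d{3})+$)', ',', str(n))

# For non-negative integers the result agrees with the built-in "," format spec.
for n in (0, 999, 1000, 1234567, 1000000000):
    assert add_commas(n) == f'{n:,}'

# The lookbehind keeps a comma from landing right after the minus sign.
print(add_commas(-123456))    # -123,456

# Caveat: the "$" anchor counts digits from the end of the string,
# so a decimal fraction gets grouped instead of the integer part.
print(add_commas(1234.5678))  # 1234.5,678
```

For plain number formatting, `f'{n:,}'` is usually the simpler choice; the regex version is mainly useful as a lookaround exercise or when the digits are embedded in other text.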

Extracting Values After Labels

import re

text = "Name: Alice  Age: 30  City: Seoul"

# Extract the value that follows each label
names = re.findall(r'(?<=Name:\s)\w+', text)      # ['Alice']
ages = re.findall(r'(?<=Age:\s)\d+', text)        # ['30']
cities = re.findall(r'(?<=City:\s)\w+', text)     # ['Seoul']

Splitting Without Losing Context

import re

# Split before uppercase letters (camelCase → words)
# Splitting on an empty (zero-width) match requires Python 3.7+
text = "camelCaseVariableName"
re.split(r'(?=[A-Z])', text)
# ['camel', 'Case', 'Variable', 'Name']

# Split between a digit and the letter that follows it
re.split(r'(?<=\d)(?=[a-zA-Z])', 'abc123def456ghi')
# ['abc123', 'def456', 'ghi']
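
The same zero-width boundary also works with `re.sub`, which inserts text at the position instead of splitting there. The helper `camel_to_snake` below is a hypothetical illustration, not part of the original examples.

```python
import re

def camel_to_snake(name):
    # Hypothetical helper: insert "_" wherever a lowercase letter or digit
    # is immediately followed by an uppercase letter, then lowercase it all.
    return re.sub(r'(?<=[a-z0-9])(?=[A-Z])', '_', name).lower()

print(camel_to_snake('camelCaseVariableName'))  # camel_case_variable_name

# Limitation of this simple boundary: runs of capitals are not separated.
print(camel_to_snake('parseHTTPResponse'))      # parse_httpresponse
```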

URL Protocol Check

import re

urls = [
    "https://example.com",
    "http://test.org",
    "ftp://files.example.com",
    "example.com",
]

# Report URLs that have a scheme but do NOT start with https://
for url in urls:
    if re.match(r'(?!https://)\S+', url) and '://' in url:
        print(f"Not HTTPS: {url}")
# Not HTTPS: http://test.org
# Not HTTPS: ftp://files.example.com
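
The separate `'://' in url` test can be folded into the pattern by requiring some scheme after the negative lookahead. This is a sketch under the assumption that a scheme is a letter followed by letters, digits, `+`, `.`, or `-`; the names `non_https` and `insecure` are illustrative.

```python
import re

urls = [
    "https://example.com",
    "http://test.org",
    "ftp://files.example.com",
    "example.com",
]

# (?!https://) rejects HTTPS; the scheme pattern then requires some other
# "scheme://" prefix, so bare host names are skipped without a second test.
non_https = re.compile(r'(?!https://)[A-Za-z][A-Za-z0-9+.-]*://')

insecure = [url for url in urls if non_https.match(url)]
print(insecure)  # ['http://test.org', 'ftp://files.example.com']
```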

Summary

Lookaround	Syntax	Meaning	Width
Positive lookahead	`(?=...)`	Must be followed by	Zero
Negative lookahead	`(?!...)`	Must NOT be followed by	Zero
Positive lookbehind	`(?<=...)`	Must be preceded by	Zero (pattern must be fixed-width)
Negative lookbehind	`(?<!...)`	Must NOT be preceded by	Zero (pattern must be fixed-width)
Combined	Stack multiple	AND conditions at a position	Zero

Runnable Example: `lookahead_lookbehind_tutorial.py`

"""
Python Regular Expressions - Tutorial 06: Lookahead and Lookbehind
==================================================================

LEARNING OBJECTIVES:
-------------------
1. Understand lookahead assertions (?=...) and (?!...)
2. Master lookbehind assertions (?<=...) and (?<!...)
3. Combine lookarounds with other patterns
4. Apply lookarounds to complex validation scenarios
5. Understand zero-width assertions

PREREQUISITES:
-------------
- Tutorials 01-05 (all previous tutorials)

DIFFICULTY: ADVANCED
"""

import re

# ==============================================================================
# SECTION 1: INTRODUCTION TO LOOKAROUNDS
# ==============================================================================

if __name__ == "__main__":

    """
    LOOKAROUNDS are zero-width assertions - they match a position, not characters.
    They CHECK if a pattern exists ahead or behind, without consuming characters.

    Types:
      (?=...)   Positive lookahead: must be followed by ...
      (?!...)   Negative lookahead: must NOT be followed by ...
      (?<=...)  Positive lookbehind: must be preceded by ...
      (?<!...)  Negative lookbehind: must NOT be preceded by ...

    Key concept: They don't capture or consume characters!
    """

    print("="*70)
    print("SECTION 1: POSITIVE LOOKAHEAD (?=...)")
    print("="*70)

    # Example 1: Basic positive lookahead
    # -----------------------------------
    """
    Match a pattern only if it's followed by another pattern.
    """

    text1 = "read reading reader"

    # Match "read" only if followed by "ing"
    pattern1 = r"read(?=ing)"
    matches1 = re.findall(pattern1, text1)

    print(f"Text: '{text1}'")
    print(f"Pattern: 'read(?=ing)' (read followed by ing)")
    print(f"Matches: {matches1}")
    print("(Note: 'ing' is not included in the match)")

    print()

    # Example 2: Lookahead for validation
    # -----------------------------------
    """
    Check if a number is followed by a specific unit.
    """

    text2 = "100kg 200g 300ml"

    # Match numbers followed by "kg"
    pattern2 = r"\d+(?=kg)"
    matches2 = re.findall(pattern2, text2)

    print(f"Text: '{text2}'")
    print(f"Pattern: '\\d+(?=kg)' (numbers before kg)")
    print(f"Matches: {matches2}")

    print()

    # ==============================================================================
    # SECTION 2: NEGATIVE LOOKAHEAD (?!...)
    # ==============================================================================

    """
    Match only if NOT followed by a pattern.
    """

    print("="*70)
    print("SECTION 2: NEGATIVE LOOKAHEAD (?!...)")
    print("="*70)

    # Example 3: Negative lookahead
    # -----------------------------
    text3 = "cat cats dog dogs"

    # Match "cat" not followed by "s"
    pattern3 = r"cat(?!s)"
    matches3 = re.findall(pattern3, text3)

    print(f"Text: '{text3}'")
    print(f"Pattern: 'cat(?!s)' (cat not followed by s)")
    print(f"Matches: {matches3}")

    print()

    # Example 4: Excluding specific suffixes
    # --------------------------------------
    text4 = "test.txt test.py test.md readme.txt"

    # Match filenames not ending with .txt
    pattern4 = r"\w+(?!\.txt)\.\w+"
    matches4 = re.findall(pattern4, text4)

    print(f"Text: '{text4}'")
    print(f"Non-.txt files: {matches4}")

    print()

    # ==============================================================================
    # SECTION 3: POSITIVE LOOKBEHIND (?<=...)
    # ==============================================================================

    """
    Match only if preceded by a pattern.
    """

    print("="*70)
    print("SECTION 3: POSITIVE LOOKBEHIND (?<=...)")
    print("="*70)

    # Example 5: Basic lookbehind
    # ---------------------------
    text5 = "$100 €200 £300"

    # Match numbers preceded by $
    pattern5 = r"(?<=\$)\d+"
    matches5 = re.findall(pattern5, text5)

    print(f"Text: '{text5}'")
    print(f"Pattern: '(?<=\\$)\\d+' (numbers after $)")
    print(f"Matches: {matches5}")

    print()

    # Example 6: Extracting values after labels
    # -----------------------------------------
    text6 = "Price: $50 Tax: $10 Total: $60"

    # Match numbers that come after "Total: $"
    pattern6 = r"(?<=Total: \$)\d+"
    total = re.search(pattern6, text6)

    print(f"Text: '{text6}'")
    if total:
        print(f"Total amount: ${total.group()}")

    print()

    # ==============================================================================
    # SECTION 4: NEGATIVE LOOKBEHIND (?<!...)
    # ==============================================================================

    """
    Match only if NOT preceded by a pattern.
    """

    print("="*70)
    print("SECTION 4: NEGATIVE LOOKBEHIND (?<!...)")
    print("="*70)

    # Example 7: Negative lookbehind
    # ------------------------------
    text7 = "10 -20 30 -40"

    # Match numbers NOT preceded by a minus sign.
    # \d is also excluded so the match cannot start in the middle of "-20".
    pattern7 = r"(?<![-\d])\d+"
    matches7 = re.findall(pattern7, text7)

    print(f"Text: '{text7}'")
    print("Pattern: '(?<![-\\d])\\d+' (numbers not after a minus sign)")
    print(f"Matches: {matches7}")

    print()

    # ==============================================================================
    # SECTION 5: COMBINING LOOKAROUNDS
    # ==============================================================================

    print("="*70)
    print("SECTION 5: COMBINING LOOKAROUNDS")
    print("="*70)

    # Example 8: Password validation
    # ------------------------------
    """
    Validate password:
    - At least 8 characters
    - Contains at least one digit
    - Contains at least one letter
    - Contains at least one special character
    """

    def validate_password(password):
        # Check minimum length
        if len(password) < 8:
            return False, "Too short"

        # Use lookaheads to check all requirements
        pattern = r"^(?=.*[a-zA-Z])(?=.*\d)(?=.*[@$!%*?&])[A-Za-z\d@$!%*?&]+$"

        if re.match(pattern, password):
            return True, "Valid password"
        return False, "Missing required characters"

    passwords = [
        "password",      # No digit or special char
        "pass1234",      # No special char
        "Pass@123",      # Valid!
        "12345678",      # No letters or special
        "P@ss1"          # Too short
    ]

    print("Password validation:")
    for pwd in passwords:
        valid, msg = validate_password(pwd)
        status = "✓" if valid else "✗"
        print(f"  {status} '{pwd}': {msg}")

    print()

    # Example 9: Finding words between specific patterns
    # --------------------------------------------------
    text9 = "start apple banana end start cherry date end"

    # Match the first word after "start" when exactly one more word comes before "end"
    pattern9 = r"(?<=start )\w+(?= \w* end)"
    matches9 = re.findall(pattern9, text9)

    print(f"Text: '{text9}'")
    print(f"Words after 'start': {matches9}")

    print()

    # ==============================================================================
    # SECTION 6: PRACTICAL APPLICATIONS
    # ==============================================================================

    print("="*70)
    print("SECTION 6: PRACTICAL APPLICATIONS")
    print("="*70)

    # Example 10: Currency conversion validation
    # ------------------------------------------
    text10 = "$100.00 €200.50 £300.75 ¥400"

    # Extract amounts in dollars only
    dollar_pattern = r"(?<=\$)\d+\.\d{2}"
    dollar_amounts = re.findall(dollar_pattern, text10)

    print(f"Text: '{text10}'")
    print(f"Dollar amounts: {['$' + amt for amt in dollar_amounts]}")

    print()

    # Example 11: Extracting @mentions (not in emails)
    # ------------------------------------------------
    text11 = "Hi @john! Email: user@example.com, @alice says hi"

    # Match @username but not when it's part of an email
    pattern11 = r"(?<!\w)@\w+(?![\w.])"
    mentions = re.findall(pattern11, text11)

    print(f"Text: '{text11}'")
    print(f"Mentions: {mentions}")
    print("(Excludes @example.com because it's part of email)")

    print()

    # Example 12: Matching numbers not in equations
    # ---------------------------------------------
    text12 = "Age: 25, Score: 3+7=10, Price: 30"

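    # Match standalone numbers (not part of equations).
    # \d in both lookarounds stops the match from starting or ending inside
    # a longer number (otherwise the "0" of "10" would slip through).
    pattern12 = r"(?<![+=\d])\d+(?![+=\d])"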
    standalone = re.findall(pattern12, text12)

    print(f"Text: '{text12}'")
    print(f"Standalone numbers: {standalone}")

    print()

    # ==============================================================================
    # SECTION 7: ADVANCED PATTERNS
    # ==============================================================================

    print("="*70)
    print("SECTION 7: ADVANCED LOOKAROUND PATTERNS")
    print("="*70)

    # Example 13: Remove duplicate words but keep one
    # -----------------------------------------------
    text13 = "the the cat sat sat on the mat mat"

    # Remove the first copy of each adjacent duplicate (and the whitespace after it), keeping the second
    pattern13 = r"\b(\w+)\s+(?=\1\b)"
    result13 = re.sub(pattern13, "", text13)

    print(f"Original: '{text13}'")
    print(f"Deduplicated: '{result13}'")

    print()

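    # Example 14: Validate hex color codes
    # ------------------------------------
    """
    Valid: #FFF, #FFFFFF, #123ABC
    Invalid: #GG, #12345
    No lookaround is needed here: anchors plus alternation are enough,
    a reminder that simpler alternatives sometimes exist.
    """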

    colors = ["#FFF", "#FFFFFF", "#123ABC", "#GG", "#12345"]

    pattern14 = r"^#(?:[0-9A-Fa-f]{3}|[0-9A-Fa-f]{6})$"

    print("Hex color validation:")
    for color in colors:
        valid = bool(re.match(pattern14, color))
        status = "✓" if valid else "✗"
        print(f"  {status} {color}")

    print()

    # ==============================================================================
    # SECTION 8: SUMMARY
    # ==============================================================================

    print("="*70)
    print("LOOKAROUND CHEAT SHEET")
    print("="*70)

    # Raw string: keeps \d, \$ and \w literal instead of invalid escape sequences
    cheat_sheet = r"""
    LOOKAHEAD (looks forward):
      (?=...)     Positive: must be followed by ...
      (?!...)     Negative: must NOT be followed by ...

    LOOKBEHIND (looks backward):
      (?<=...)    Positive: must be preceded by ...
      (?<!...)    Negative: must NOT be preceded by ...

    KEY FEATURES:
      - Zero-width: don't consume characters
      - Don't capture: not included in match
      - Multiple lookarounds can be combined
      - Lookbehind must be fixed-width in Python

    COMMON PATTERNS:
      (?=.*\d)               Contains at least one digit
      (?!.*admin)            Doesn't contain "admin"
      (?<=\$)\d+             Number after $
      (?<![@\w])\w+          Word not preceded by @

    PASSWORD VALIDATION:
      ^(?=.*[a-z])(?=.*[A-Z])(?=.*\d).{8,}$
      - At least 8 chars
      - Has lowercase, uppercase, and digit

    CAUTION:
      - Lookbehind must have fixed width
      - Can impact performance
      - Sometimes simpler alternatives exist
    """

    print(cheat_sheet)

    print("\n" + "="*70)
    print("END OF TUTORIAL - Lookahead and Lookbehind mastered!")
    print("Next: Tutorial 07 - Practical Applications")
    print("="*70)