Groups and Capturing
Capturing Groups
Parentheses (...) serve two purposes in regex: grouping (treating multiple tokens as a unit) and capturing (extracting the matched substring).
import re
text = "2024-01-15"
match = re.search(r'(\d{4})-(\d{2})-(\d{2})', text)
print(match.group(0)) # '2024-01-15' — full match
print(match.group(1)) # '2024' — first group (year)
print(match.group(2)) # '01' — second group (month)
print(match.group(3)) # '15' — third group (day)
print(match.groups()) # ('2024', '01', '15')
Groups are numbered left to right by their opening parenthesis:
import re
# Groups numbered by opening parenthesis position
#                   12  3
match = re.search(r'((a)(b))', 'ab')
print(match.group(1)) # 'ab' — outer group
print(match.group(2)) # 'a' — first inner
print(match.group(3)) # 'b' — second inner
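As a further sketch of the numbering rule, the pattern below nests a month group and a day group inside a combined group. The pattern and the sample date are illustrative assumptions rather than part of the examples above.

```python
import re

# Hypothetical pattern: count opening parentheses from left to right.
# Group 1 = year, group 2 = "MM-DD", group 3 = month, group 4 = day.
pattern = re.compile(r'(\d{4})-((\d{2})-(\d{2}))')
match = pattern.search('2024-01-15')

print(pattern.groups)   # 4: number of capturing groups in the pattern
print(match.group(1))   # '2024'
print(match.group(2))   # '01-15': the outer group opens before the inner ones
print(match.group(3))   # '01'
print(match.group(4))   # '15'
```

The outer group receives the lower number because its opening parenthesis comes first, even though it closes last.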
findall() with Groups
When a pattern contains capturing groups, re.findall() returns the group contents instead of the full match:
import re
text = "2024-01-15 and 2024-12-31"
# No groups — returns full matches
re.findall(r'\d{4}-\d{2}-\d{2}', text)
# ['2024-01-15', '2024-12-31']
# One group — returns list of strings (the group)
re.findall(r'(\d{4})-\d{2}-\d{2}', text)
# ['2024', '2024']
# Multiple groups — returns list of tuples
re.findall(r'(\d{4})-(\d{2})-(\d{2})', text)
# [('2024', '01', '15'), ('2024', '12', '31')]
findall() Behavior Changes with Groups
This is a common source of confusion. If you want the full match but also need groups, use re.finditer() and access both .group(0) and .groups().
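A minimal sketch of the re.finditer() approach, reusing the sample text from above. The `results` list is only an illustrative way to collect the output.

```python
import re

text = "2024-01-15 and 2024-12-31"

# finditer() yields match objects, so the full match and the groups
# are both available for every hit.
results = []
for m in re.finditer(r'(\d{4})-(\d{2})-(\d{2})', text):
    results.append((m.group(0), m.groups()))

print(results)
# [('2024-01-15', ('2024', '01', '15')), ('2024-12-31', ('2024', '12', '31'))]
```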
Non-Capturing Groups
Use (?:...) when you need grouping for quantifiers or alternation but do not need to capture:
import re
text = "gray grey"
# Capturing group — findall returns group content
re.findall(r'gr(a|e)y', text)
# ['a', 'e']
# Non-capturing group — findall returns full match
re.findall(r'gr(?:a|e)y', text)
# ['gray', 'grey']
Non-capturing groups are useful for applying quantifiers to a sub-pattern:
import re
# Match repeated word patterns
re.findall(r'(?:ha)+', 'hahaha haha ha')
# ['hahaha', 'haha', 'ha']
# Optional prefix
re.findall(r'(?:un)?happy', 'happy unhappy')
# ['happy', 'unhappy']
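Non-capturing groups also leave group numbering untouched, which can keep indices stable when a pattern grows. The following sketch uses a made-up sentence and title list to show this.

```python
import re

# Hypothetical sample: the title alternation is grouped but not captured,
# so the surname is group 1 rather than group 2.
match = re.search(r'(?:Mr|Ms|Dr)\. (\w+)', 'Contact Dr. Hopper today')

print(match.group(0))  # 'Dr. Hopper'
print(match.group(1))  # 'Hopper'
print(match.groups())  # ('Hopper',)
```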
Named Groups
Named groups use the syntax (?P<name>...) and can be accessed by name via .group('name') or .groupdict():
import re
text = "2024-01-15"
match = re.search(r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})', text)
print(match.group('year')) # '2024'
print(match.group('month')) # '01'
print(match.group('day')) # '15'
print(match.groupdict()) # {'year': '2024', 'month': '01', 'day': '15'}
Named groups improve readability, especially in complex patterns:
import re
log = '192.168.1.100 - - [15/Jan/2024:10:30:45] "GET /index.html HTTP/1.1" 200 1234'
pattern = r'(?P<ip>[\d.]+) .+ \[(?P<date>[^\]]+)\] "(?P<method>\w+) (?P<path>\S+)'
match = re.search(pattern, log)
if match:
    info = match.groupdict()
    print(info)
    # {'ip': '192.168.1.100', 'date': '15/Jan/2024:10:30:45',
    #  'method': 'GET', 'path': '/index.html'}
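Named groups remain numbered groups, so the two access styles can be mixed. The sketch below reuses the date pattern from this section. It assumes Python 3.6 or later for the subscript form on match objects.

```python
import re

pattern = re.compile(r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})')
match = pattern.search("2024-01-15")

# Named groups are still numbered, so both access styles work.
print(match.group('month'), match.group(2))  # 01 01

# Subscript access on match objects (Python 3.6+).
print(match['day'])  # 15

# groupindex maps each group name to its number.
print(dict(pattern.groupindex))  # {'year': 1, 'month': 2, 'day': 3}
```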
Backreferences
Backreferences match the same text that was previously captured by a group. Use \1, \2, etc. (or (?P=name) for named groups):
import re
# Find repeated words
text = "the the cat sat on on the mat"
re.findall(r'\b(\w+)\s+\1\b', text)
# ['the', 'on']
# Find matching HTML tags
html = '<b>bold</b> <i>italic</i> <b>mismatch</i>'
re.findall(r'<(\w+)>.*?</\1>', html)
# ['b', 'i'] — only properly closed tags
Named backreference:
import re
# Detect repeated words using named groups
text = "the the quick quick fox"
re.findall(r'\b(?P<word>\w+)\s+(?P=word)\b', text)
# ['the', 'quick']
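Backreferences can also drive a cleanup step. As a sketch, the pattern below extends the repeated-word example so that re.sub() collapses each run of duplicates to a single word. The `cleaned` variable name is an illustrative choice.

```python
import re

text = "the the cat sat on on the mat"

# (\w+) captures a word; (?:\s+\1\b)+ matches one or more repeats of it.
# The replacement r'\1' keeps a single copy of the word.
cleaned = re.sub(r'\b(\w+)(?:\s+\1\b)+', r'\1', text)
print(cleaned)  # the cat sat on the mat
```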
Groups with Quantifiers
When a quantifier is applied to a capturing group, the group keeps only the text from its last iteration:
import re
# Only the last repetition is captured
match = re.search(r'(\d)+', '12345')
print(match.group(0)) # '12345' — full match
print(match.group(1)) # '5' — only the last digit captured
# To capture all, put the quantifier inside the group
match = re.search(r'(\d+)', '12345')
print(match.group(1)) # '12345'
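When every repetition is needed, two common workarounds are to capture the whole run and split it, or to let re.findall() collect each occurrence. The comma-separated sample string below is an assumption made for illustration.

```python
import re

text = "ids: 10,20,30"

# A repeated group keeps only its last iteration.
last_only = re.search(r'(?:(\d+),?)+', text)
print(last_only.group(0))  # '10,20,30'
print(last_only.group(1))  # '30'

# Option 1: capture the whole run, then split it.
whole = re.search(r'((?:\d+,)*\d+)', text)
print(whole.group(1).split(','))  # ['10', '20', '30']

# Option 2: let findall() collect every occurrence of the inner pattern.
print(re.findall(r'\d+', text))  # ['10', '20', '30']
```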
Practical Examples
Parsing Key-Value Pairs
import re
config = "host=localhost port=5432 db=mydb user=admin"
pairs = re.findall(r'(\w+)=(\S+)', config)
print(pairs)
# [('host', 'localhost'), ('port', '5432'), ('db', 'mydb'), ('user', 'admin')]
print(dict(pairs))
# {'host': 'localhost', 'port': '5432', 'db': 'mydb', 'user': 'admin'}
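The pattern above stops each value at the first whitespace character. One possible variation, sketched here for a hypothetical config line containing a quoted value, tries a double-quoted alternative before falling back to \S+.

```python
import re

# Hypothetical config line in which one value is quoted and contains a space.
config = 'host=localhost name="my db" port=5432'

# Try a double-quoted value first, then fall back to a run of non-whitespace.
pairs = re.findall(r'(\w+)=("[^"]*"|\S+)', config)

# Strip the surrounding quotes from quoted values.
settings = {key: value.strip('"') for key, value in pairs}
print(settings)
# {'host': 'localhost', 'name': 'my db', 'port': '5432'}
```

This sketch does not handle escaped quotes inside values. A real config format may be better served by a dedicated parser.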
Swapping Name Order
import re
names = "Smith, John\nDoe, Jane\nLee, Alice"
# Swap "Last, First" to "First Last"
result = re.sub(r'(\w+), (\w+)', r'\2 \1', names)
print(result)
# John Smith
# Jane Doe
# Alice Lee
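The same swap can be written with named groups, which are referenced as \g<name> in the replacement string. The sketch below also shows \g<1>, the unambiguous form of \1 that is needed when a digit follows the reference. The group names and the currency sample are illustrative assumptions.

```python
import re

names = "Smith, John\nDoe, Jane\nLee, Alice"

# \g<name> refers to a named group in the replacement string.
swapped = re.sub(r'(?P<last>\w+), (?P<first>\w+)', r'\g<first> \g<last>', names)
print(swapped)
# John Smith
# Jane Doe
# Alice Lee

# \g<1> followed by digits is read as "group 1, then literal 00".
cents = re.sub(r'(\d+) USD', r'\g<1>00 cents', '5 USD')
print(cents)  # 500 cents
```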
Extracting URLs
import re
text = "Visit https://example.com or http://test.org/page?q=1"
urls = re.findall(r'https?://\S+', text)
print(urls)
# ['https://example.com', 'http://test.org/page?q=1']
# With named groups for parts
pattern = r'(?P<scheme>https?)://(?P<host>[\w.]+)(?P<path>/\S*)?'
for m in re.finditer(pattern, text):
    print(m.groupdict())
# {'scheme': 'https', 'host': 'example.com', 'path': None}
# {'scheme': 'http', 'host': 'test.org', 'path': '/page?q=1'}
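An optional group that does not take part in the match yields None, as the first result above shows. As a sketch, the `default` argument of groupdict() and groups() can substitute a placeholder instead. The sample sentence and the chosen defaults are assumptions.

```python
import re

pattern = r'(?P<scheme>https?)://(?P<host>[\w.]+)(?P<path>/\S*)?'
m = re.search(pattern, "Visit https://example.com today")

# The optional path group did not participate in the match.
print(m.group('path'))  # None

# Supply a fallback for groups that did not participate.
print(m.groupdict(default='/'))
# {'scheme': 'https', 'host': 'example.com', 'path': '/'}
print(m.groups(default=''))
# ('https', 'example.com', '')
```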
Summary
| Concept | Syntax | Key Takeaway |
|---|---|---|
| Capturing group | (...) | Groups and captures; accessed by number |
| Non-capturing | (?:...) | Groups without capturing |
| Named group | (?P<name>...) | Capture accessible by name |
| Backreference | \1 or (?P=name) | Matches same text as a previous group |
| findall + groups | n/a | Returns group contents, not full match |
| groupdict() | n/a | Dictionary of named groups |
Runnable Example: groups_capturing_tutorial.py
"""
Python Regular Expressions - Tutorial 05: Groups and Capturing
==============================================================
LEARNING OBJECTIVES:
-------------------
1. Understand capturing groups () and their uses
2. Use non-capturing groups (?:) for grouping without capture
3. Work with named groups (?P<name>) for readability
4. Extract data using groups with match.group()
5. Use backreferences to match repeated patterns
6. Apply groups in re.sub() for replacements
PREREQUISITES:
-------------
- Tutorials 01-04 (Basics through Anchors)
DIFFICULTY: INTERMEDIATE
"""
import re
# ==============================================================================
# SECTION 1: INTRODUCTION TO GROUPS
# ==============================================================================
if __name__ == "__main__":
    """
    GROUPS allow you to:
    1. Capture parts of a match for extraction
    2. Apply quantifiers to multiple characters
    3. Create alternatives with | (OR operator)
    4. Reference captured text later (backreferences)
    Syntax:
    (pattern)           Capturing group
    (?:pattern)         Non-capturing group
    (?P<name>pattern)   Named capturing group
    """
    print("="*70)
    print("SECTION 1: BASIC CAPTURING GROUPS")
    print("="*70)
    # Example 1: Simple capturing group
    # ---------------------------------
    text1 = "My phone is 555-1234"
    # Capture the phone number
    pattern1 = r"(\d{3}-\d{4})"
    match = re.search(pattern1, text1)
    if match:
        print(f"Text: '{text1}'")
        print(f"Full match: '{match.group(0)}'")  # group(0) is always the full match
        print(f"Group 1: '{match.group(1)}'")  # group(1) is the first captured group
    print()
    # Example 2: Multiple capturing groups
    # ------------------------------------
    text2 = "John Doe, age 30"
    # Capture first name, last name, and age
    pattern2 = r"(\w+) (\w+), age (\d+)"
    match2 = re.search(pattern2, text2)
    if match2:
        print(f"Text: '{text2}'")
        print(f"Full match: '{match2.group(0)}'")
        print(f"First name (group 1): '{match2.group(1)}'")
        print(f"Last name (group 2): '{match2.group(2)}'")
        print(f"Age (group 3): '{match2.group(3)}'")
        # Can also use groups() to get all groups as a tuple
        print(f"All groups: {match2.groups()}")
    print()
    # Example 3: Using groups with findall()
    # --------------------------------------
    """
    When using findall() with groups:
    - No groups: findall() returns a list of full-match strings
    - One group: findall() returns a list of that group's strings
    - Several groups: findall() returns a list of tuples of groups
    """
    text3 = "Emails: john@example.com, alice@test.org"
    # Pattern with two groups: username and domain
    pattern3 = r"(\w+)@([\w.]+)"
    matches = re.findall(pattern3, text3)
    print(f"Text: '{text3}'")
    print("Email parts (username, domain):")
    for username, domain in matches:
        print(f" {username} @ {domain}")
    print()
    # ==============================================================================
    # SECTION 2: NON-CAPTURING GROUPS
    # ==============================================================================
    """
    NON-CAPTURING GROUPS (?:...) group elements but don't capture them.
    Use when you need grouping for quantifiers or alternatives,
    but don't need to extract the matched text.
    """
    print("="*70)
    print("SECTION 2: NON-CAPTURING GROUPS")
    print("="*70)
    # Example 4: Capturing vs non-capturing
    # -------------------------------------
    text4 = "cat dog cat"
    # With capturing group
    pattern_capture = r"(cat|dog)"
    matches_capture = re.findall(pattern_capture, text4)
    print("With capturing group (cat|dog):")
    print(f" Result: {matches_capture}")
    # With non-capturing group
    pattern_no_capture = r"(?:cat|dog)"
    matches_no_capture = re.findall(pattern_no_capture, text4)
    print("With non-capturing group (?:cat|dog):")
    print(f" Result: {matches_no_capture}")
    print(" (Same result here, because the group spans the whole match;")
    print("  the non-capturing form simply skips storing the group)")
    print()
    # Example 5: Practical use of non-capturing groups
    # -----------------------------------------------
    """
    Extract phone numbers, but don't capture area code separately
    when you only care about the full number.
    """
    text5 = "Call 800-555-1234 or 888-555-5678"
    # Capture only the full number, group area code without capturing
    pattern5 = r"((?:\d{3})-\d{3}-\d{4})"
    matches5 = re.findall(pattern5, text5)
    print(f"Text: '{text5}'")
    print(f"Phone numbers: {matches5}")
    print("(Area code grouped but not captured separately)")
    print()
    # ==============================================================================
    # SECTION 3: NAMED GROUPS
    # ==============================================================================
    """
    NAMED GROUPS (?P<name>...) let you refer to groups by name instead of number.
    This makes patterns more readable and maintainable.
    Syntax: (?P<name>pattern)
    Access: match.group('name')
    """
    print("="*70)
    print("SECTION 3: NAMED CAPTURING GROUPS")
    print("="*70)
    # Example 6: Basic named groups
    # -----------------------------
    text6 = "2024-03-15"
    # Named groups for date parts
    pattern6 = r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})"
    match6 = re.search(pattern6, text6)
    if match6:
        print(f"Date: '{text6}'")
        print(f"Year: {match6.group('year')}")
        print(f"Month: {match6.group('month')}")
        print(f"Day: {match6.group('day')}")
        # Can also access by number
        print(f"Year (by index): {match6.group(1)}")
        # Get dict of all named groups
        print(f"As dict: {match6.groupdict()}")
    print()
    # Example 7: Practical parsing with named groups
    # ----------------------------------------------
    log_line = "2024-03-15 14:30:45 ERROR User authentication failed"
    # Parse log entry with named groups
    log_pattern = r"(?P<date>\S+) (?P<time>\S+) (?P<level>\w+) (?P<message>.+)"
    log_match = re.search(log_pattern, log_line)
    if log_match:
        print("Log entry parsing:")
        log_data = log_match.groupdict()
        for key, value in log_data.items():
            print(f" {key}: {value}")
    print()
    # ==============================================================================
    # SECTION 4: BACKREFERENCES
    # ==============================================================================
    """
    BACKREFERENCES let you match the same text that was captured earlier.
    - \\1, \\2, etc.: Reference captured groups by number
    - (?P=name): Reference named groups
    Useful for finding repeated patterns, palindromes, matching tags, etc.
    """
    print("="*70)
    print("SECTION 4: BACKREFERENCES")
    print("="*70)
    # Example 8: Finding repeated words
    # ---------------------------------
    text8 = "the the cat sat on the the mat"
    # Match repeated words
    pattern8 = r"\b(\w+)\s+\1\b"  # \1 refers to group 1
    matches8 = re.findall(pattern8, text8)
    print(f"Text: '{text8}'")
    print(f"Repeated words: {matches8}")
    print()
    # Example 9: Matching HTML/XML tags
    # ---------------------------------
    html = "<div>Content</div> <span>Text</span> <p>Wrong</div>"
    # Match properly closed tags
    pattern9 = r"<(\w+)>.*?</\1>"  # \1 matches the same tag name
    proper_tags = re.findall(pattern9, html)
    print(f"HTML: {html}")
    print(f"Properly closed tags: {proper_tags}")
    print()
    # Example 10: Named group backreferences
    # --------------------------------------
    text10 = "hello hello world world"
    # Using named groups and backreferences
    pattern10 = r"\b(?P<word>\w+)\s+(?P=word)\b"
    matches10 = re.finditer(pattern10, text10)
    print(f"Text: '{text10}'")
    print("Repeated words (using named groups):")
    for match in matches10:
        print(f" {match.group('word')}")
    print()
    # ==============================================================================
    # SECTION 5: GROUPS IN SUBSTITUTION
    # ==============================================================================
    """
    Groups are very powerful in re.sub() for transforming text.
    You can reference captured groups in the replacement string.
    """
    print("="*70)
    print("SECTION 5: USING GROUPS IN re.sub()")
    print("="*70)
    # Example 11: Swapping parts of text
    # ----------------------------------
    text11 = "Doe, John"
    # Swap last name and first name
    pattern11 = r"(\w+), (\w+)"
    result11 = re.sub(pattern11, r"\2 \1", text11)  # \2 \1 swaps the groups
    print(f"Original: '{text11}'")
    print(f"Swapped: '{result11}'")
    print()
    # Example 12: Formatting phone numbers
    # ------------------------------------
    text12 = "800-555-1234 and 888-555-5678"
    # Change format from XXX-XXX-XXXX to (XXX) XXX-XXXX
    pattern12 = r"(\d{3})-(\d{3})-(\d{4})"
    result12 = re.sub(pattern12, r"(\1) \2-\3", text12)
    print(f"Original: '{text12}'")
    print(f"Formatted: '{result12}'")
    print()
    # Example 13: Using named groups in substitution
    # ----------------------------------------------
    text13 = "Price: $100.00"
    # Extract and reformat using named groups
    pattern13 = r"\$(?P<dollars>\d+)\.(?P<cents>\d{2})"
    result13 = re.sub(pattern13, r"\g<dollars> dollars and \g<cents> cents", text13)
    print(f"Original: '{text13}'")
    print(f"Converted: '{result13}'")
    print("(Note: Use \\g<name> for named groups in replacement)")
    print()
    # ==============================================================================
    # SECTION 6: PRACTICAL APPLICATIONS
    # ==============================================================================
    print("="*70)
    print("SECTION 6: PRACTICAL APPLICATIONS")
    print("="*70)
    # Example 14: Parsing email addresses
    # -----------------------------------
    def parse_email(email):
        pattern = r"(?P<user>[\w.+-]+)@(?P<domain>[\w.-]+)\.(?P<tld>\w+)"
        match = re.match(pattern, email)
        if match:
            return match.groupdict()
        return None
    emails = ["john.doe@example.com", "user+tag@test.co.uk"]
    print("Email parsing:")
    for email in emails:
        parts = parse_email(email)
        if parts:
            print(f" {email}:")
            print(f" User: {parts['user']}")
            print(f" Domain: {parts['domain']}")
            print(f" TLD: {parts['tld']}")
    print()
    # Example 15: Extracting URL components
    # -------------------------------------
    url = "https://www.example.com:8080/path/to/page?query=value#section"
    url_pattern = r"(?P<protocol>https?://)?(?P<subdomain>www\.)?(?P<domain>[\w.-]+)(?::(?P<port>\d+))?(?P<path>/[\w/.-]*)?(?:\?(?P<query>[\w=&]+))?(?:#(?P<fragment>\w+))?"
    match = re.search(url_pattern, url)
    if match:
        print(f"URL: {url}")
        print("Components:")
        for key, value in match.groupdict().items():
            if value:
                print(f" {key}: {value}")
    print()
    # ==============================================================================
    # SECTION 7: SUMMARY
    # ==============================================================================
    print("="*70)
    print("GROUPS CHEAT SHEET")
    print("="*70)
    # Raw string: otherwise \1, \2, \3 would be read as octal escapes
    # and \g, \w, \s, \d as invalid escape sequences.
    cheat_sheet = r"""
    CAPTURING GROUPS:
    (pattern)            Capture matched text
    \1, \2, \3           Reference captured groups
    match.group(1)       Access group 1
    match.groups()       Get all groups as tuple
    NON-CAPTURING GROUPS:
    (?:pattern)          Group without capturing
    NAMED GROUPS:
    (?P<name>pattern)    Named capturing group
    (?P=name)            Named backreference
    match.group('name')  Access named group
    match.groupdict()    Get dict of named groups
    IN SUBSTITUTION:
    \1, \2, \3           Reference numbered groups
    \g<name>             Reference named groups
    \g<1>                Alternative numeric reference
    COMMON USES:
    (cat|dog)            Alternatives
    (\w+)\s+\1           Repeated words
    <(\w+)>.*?</\1>      Matching tags
    (\d{3})-(\d{4})      Capture phone parts
    KEY POINTS:
    - Group 0 is always the full match
    - Groups are numbered from 1
    - Use (?:...) when you don't need to capture
    - Named groups improve readability
    - Backreferences match the same text again
    """
    print(cheat_sheet)
    print("\n" + "="*70)
    print("END OF TUTORIAL - Groups and Capturing mastered!")
    print("Next: Tutorial 06 - Lookahead and Lookbehind")
    print("="*70)