Unicode Case Folding¶

Normalization fixes representation differences, but not case differences. For case-insensitive Unicode comparison, Python provides casefold().

`lower()` vs `casefold()`¶

For ASCII text, lower() and casefold() behave identically. The difference appears with certain Unicode characters: python print("ß".lower()) # ß (unchanged) print("ß".casefold()) # ss (correct for case-insensitive matching)

German ß (U+00DF) has no uppercase form — lower() leaves it alone, but casefold() maps it to ss so that "ß" and "SS" can match after folding.

This makes casefold() the correct choice for user-facing, language-aware case-insensitive comparison.

Combining Normalization and Case Folding¶

Case folding alone does not solve the representation problem from Unicode Normalization. For fully robust comparison, combine both steps: ```python from unicodedata import normalize

def fold_equal(s1, s2): return normalize("NFC", s1).casefold() == normalize("NFC", s2).casefold() ```

This handles both sources of mismatch at once: python print(fold_equal("Café", "CAFÉ")) # True (case difference) print(fold_equal("café", "cafe\u0301")) # True (representation difference) print(fold_equal("ß", "SS")) # True (language-specific folding)

Practical Example: Name Search¶

User-supplied input rarely arrives in a consistent Unicode form or case. fold_equal() makes search robust against both: ```python users = [ {"name": "José", "email": "jose@example.com"}, {"name": "Françoise", "email": "francoise@example.com"}, {"name": "Müller", "email": "muller@example.com"}, ]

def find_user(query): return [u for u in users if fold_equal(u["name"], query)]

print(find_user("jose")) # matches José print(find_user("JOSÉ")) # matches José print(find_user("FRANCOISE")) # matches Françoise print(find_user("muller")) # matches Müller ```

Without fold_equal(), none of these queries would match.

Mental Model¶

text bytes ↓ decode Unicode string ↓ normalize("NFC") standard representation ↓ casefold() comparison-ready text ↓ ==

When to Use What¶

Situation	Approach
ASCII or known-normalized strings	`s1 == s2`
Unicode-aware, case-sensitive	`normalize("NFC", s1) == normalize("NFC", s2)`
Unicode-aware, case-insensitive	`normalize("NFC", s1).casefold() == normalize("NFC", s2).casefold()`

Key Takeaways¶

casefold() is more thorough than lower() for Unicode-aware case matching.
For robust user-facing comparison, always combine normalization and case folding: python normalize("NFC", text).casefold()

Exercises¶

Exercise 1. Show the difference between .lower() and .casefold() using the German character "\u00df" (sharp s). Explain why casefold() is more suitable for case-insensitive comparison.

Solution to Exercise 1

```python sharp_s = "\u00df" print(f"lower(): '{sharp_s.lower()}'") # ß (unchanged) print(f"casefold(): '{sharp_s.casefold()}'") # ss

print("SS".casefold() == sharp_s.casefold()) # True print("SS".lower() == sharp_s.lower()) # False ```

lower() does not change ß because it has no uppercase form. casefold() maps it to ss, enabling correct case-insensitive matching with "SS".

Exercise 2. Write a fold_equal(s1, s2) function that combines NFC normalization and case folding. Test it with "Cafe\u0301" and "CAFE".

Solution to Exercise 2

```python from unicodedata import normalize

def fold_equal(s1, s2): return normalize("NFC", s1).casefold() == normalize("NFC", s2).casefold()

print(fold_equal("Cafe\u0301", "CAFE")) # True print(fold_equal("\u00df", "SS")) # True print(fold_equal("hello", "HELLO")) # True ```

NFC normalization handles representation differences, and casefold() handles case differences, making this the most robust comparison approach.

Exercise 3. Implement a case-insensitive search function search_users(query, users) where users is a list of name strings. The search should match regardless of case and Unicode representation differences.

Solution to Exercise 3