Unicode Comparison Strategies¶
Python provides three levels of string comparison, each appropriate for different situations.
Mental Model
String comparison has three levels of strictness: exact (==), normalized (same visual form), and case-folded (case-insensitive). Each level handles more real-world variation but requires more processing. Choose the simplest level that matches your requirements -- exact for internal keys, normalized for user input, case-folded for search.
Level 1: Exact Comparison¶
Use == when both strings are known to be in the same Unicode form and
case matters:
```python
s1 = "café"
s2 = "café"
print(s1 == s2) # True ```
This is the fastest approach but fails silently if the strings come from different sources with different internal representations.
Level 2: Unicode-Aware Case-Sensitive Comparison¶
Use normalize() when strings may have different internal representations
but case still matters:
```python
from unicodedata import normalize
def nfc_equal(s1, s2): return normalize("NFC", s1) == normalize("NFC", s2)
print(nfc_equal("café", "cafe\u0301")) # True print(nfc_equal("Café", "café")) # False (case-sensitive) ```
Typical use cases: comparing database keys, filenames, identifiers.
Level 3: Unicode-Aware Case-Insensitive Comparison¶
Use normalize() combined with casefold() when strings may differ in
both representation and case:
```python
def fold_equal(s1, s2):
return normalize("NFC", s1).casefold() == normalize("NFC", s2).casefold()
print(fold_equal("Café", "CAFÉ")) # True print(fold_equal("café", "cafe\u0301")) # True print(fold_equal("ß", "SS")) # True ```
Typical use cases: user authentication, search queries, international names.
Summary¶
| Level | Approach | Use When |
|---|---|---|
| 1 | s1 == s2 |
Strings are ASCII or already normalized |
| 2 | nfc_equal(s1, s2) |
Case matters, but representation may vary |
| 3 | fold_equal(s1, s2) |
User-facing input, case and representation may vary |
Each level adds robustness at a small performance cost. For user-facing comparison, always prefer Level 3.
See Also¶
Exercises¶
Exercise 1.
Demonstrate the three levels of Unicode string comparison (exact, normalized, fold-equal) by comparing "Cafe\u0301" with "cafe" at each level.
Solution to Exercise 1
```python from unicodedata import normalize
a = "Cafe\u0301" b = "cafe"
Level 1: Exact¶
print(a == b) # False
Level 2: Normalized (case-sensitive)¶
print(normalize("NFC", a) == normalize("NFC", b)) # False
Level 3: Fold-equal (case-insensitive)¶
print(normalize("NFC", a).casefold() == normalize("NFC", b).casefold()) # True ```
Level 1 fails because representations differ. Level 2 fails because case differs. Level 3 succeeds by handling both differences.
Exercise 2.
Write a function that takes a list of Unicode strings and returns them deduplicated using NFC normalization. For example, ["cafe\u0301", "caf\u00e9"] should return a single entry.
Solution to Exercise 2
```python from unicodedata import normalize
def deduplicate(strings): seen = set() result = [] for s in strings: normalized = normalize("NFC", s) if normalized not in seen: seen.add(normalized) result.append(s) return result
strings = ["cafe\u0301", "caf\u00e9", "hello", "cafe\u0301"] print(deduplicate(strings)) # Two unique entries ```
By normalizing strings before adding to the set, visually identical strings with different internal representations are treated as duplicates.
Exercise 3. Explain when you would use Level 1 (exact), Level 2 (normalized), and Level 3 (fold-equal) comparison. Give a practical example for each level.
Solution to Exercise 3
Level 1 (exact ==): comparing strings you control, like dictionary keys that were all created in the same way. Example: checking if a config key matches a known constant.
Level 2 (normalized): comparing identifiers or filenames from different sources where case matters but representation may vary. Example: matching database keys.
Level 3 (fold-equal): user-facing search or authentication where both case and representation can vary. Example: searching a user directory by name.