Unicode Normalization¶
Unicode allows multiple internal representations for text that looks identical.
For example, the character é may be stored in two different ways:
- as a single precomposed character: U+00E9
- as two code points: U+0065 (e) followed by U+0301 (combining acute accent)
These forms look the same to humans, but Python treats them as different strings:
```python
from unicodedata import normalize

s1 = "caf\u00e9"   # precomposed: é is one character (U+00E9)
s2 = "cafe\u0301"  # decomposed: e + combining accent (two code points)

print(s1 == s2)  # False
print(len(s1))   # 4
print(len(s2))   # 5
```
The two strings look identical when printed, but have different lengths — proof that they are internally different.
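One way to see the difference directly is to inspect each code point by name. The sketch below uses `unicodedata.name()` to list the code points of both strings (the variable names `s1`/`s2` follow the example above):

```python
from unicodedata import name

s1 = "caf\u00e9"   # precomposed
s2 = "cafe\u0301"  # decomposed

# Print each code point with its official Unicode name
for s in (s1, s2):
    print([f"U+{ord(c):04X} {name(c)}" for c in s])
```

The first string ends in `U+00E9 LATIN SMALL LETTER E WITH ACUTE`; the second ends in a plain `e` followed by `U+0301 COMBINING ACUTE ACCENT`.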
Why This Happens¶
Unicode separates visual appearance from code point sequence. Two strings may render identically while having entirely different internal representations:
| String | Internal Representation |
|---|---|
| café | `c` `a` `f` `U+00E9` |
| café | `c` `a` `f` `U+0065` `U+0301` |
Because the code point sequences differ, direct comparison with == fails.
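The same problem affects hashing, so dictionary and set lookups can silently miss. The `menu` dictionary below is an invented illustration, not from the source:

```python
s1 = "caf\u00e9"   # precomposed key
s2 = "cafe\u0301"  # decomposed lookup string

menu = {s1: 3.50}

print(s1 in menu)  # True
print(s2 in menu)  # False — visually identical key is not found
```

Because the two strings hash differently, a lookup with one form never finds a key stored in the other.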
Normalization Forms¶
Unicode normalization converts text into a standard form so equivalent strings
compare reliably. Python provides this through unicodedata.normalize().
Unicode defines four normalization forms:
| Form | Meaning |
|---|---|
| NFC | Canonical decomposition, then recomposition |
| NFD | Canonical decomposition only |
| NFKC | Compatibility decomposition, then recomposition |
| NFKD | Compatibility decomposition only |
For most string comparison tasks, NFC is the most useful form.
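A quick sketch comparing all four forms on the running example. The ligature example in the last line is an addition to show what "compatibility" means: NFKC/NFKD also replace compatibility characters such as the single-code-point ligature `U+FB01` (ﬁ) with their plain equivalents:

```python
from unicodedata import normalize

s = "cafe\u0301"  # decomposed form, 5 code points
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    print(form, len(normalize(form, s)))
# NFC and NFKC recompose to 4 code points; NFD and NFKD stay at 5

# Compatibility forms additionally fold ligatures and similar characters:
print(normalize("NFKC", "\ufb01"))  # fi (two ordinary letters)
```

Because the compatibility forms change which characters appear, they can lose information; that is one reason NFC is the safer default for plain equality checks.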
NFC¶
NFC produces precomposed characters — it collapses decomposed sequences into a single code point where possible:
```python
print(normalize("NFC", "cafe\u0301"))  # café (length 4)
```
NFD¶
NFD produces decomposed characters — it expands precomposed characters into base character + combining marks:
```python
print(normalize("NFD", "caf\u00e9"))  # café (length 5, e + U+0301)
```
Both render as café, but their internal lengths differ.
Unicode-Aware Comparison¶
With normalization, visually identical strings compare equal:
```python
s1 = "caf\u00e9"
s2 = "cafe\u0301"
print(normalize("NFC", s1) == normalize("NFC", s2))  # True
```
For convenience, wrap this in a helper:
```python
def nfc_equal(s1, s2):
    return normalize("NFC", s1) == normalize("NFC", s2)

print(nfc_equal("café", "cafe\u0301"))  # True
```
Note that nfc_equal() is still case-sensitive — "Café" and "CAFÉ" are not equal.
For case-insensitive comparison, see Unicode Case Folding.
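As a preview, a case-insensitive variant can combine normalization with `str.casefold()`. The `fold_equal()` helper below is a minimal sketch, not part of the source's API, and it ignores the edge case that casefolding can itself produce non-normalized output:

```python
from unicodedata import normalize

def fold_equal(s1, s2):
    # Normalize first, then casefold for case-insensitive comparison
    return (normalize("NFC", s1).casefold()
            == normalize("NFC", s2).casefold())

print(fold_equal("Café", "cafe\u0301"))  # True
```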
Key Takeaways¶
- Unicode strings that look identical may have different internal representations.
- `normalize("NFC", s)` converts to a consistent precomposed form.
- Always normalize before comparing Unicode strings from external sources.