Unicode Normalization¶
Unicode allows multiple internal representations for text that looks identical.
For example, the character é may be stored in two different ways:
- as a single precomposed character: U+00E9
- as two code points: U+0065 (e) followed by U+0301 (combining acute accent)
These forms look the same to humans, but Python treats them as different strings:
```python
from unicodedata import normalize

s1 = "caf\u00e9"   # precomposed: é is one character (U+00E9)
s2 = "cafe\u0301"  # decomposed: e + combining accent (two code points)

print(s1 == s2)  # False
print(len(s1))   # 4
print(len(s2))   # 5
```
The two strings look identical when printed, but have different lengths — proof that they are internally different.
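One way to see the difference directly is to inspect each code point by name. The sketch below uses `unicodedata.name()` to list the code points of both strings (the variable names `s1`/`s2` follow the example above):

```python
from unicodedata import name

s1 = "caf\u00e9"   # precomposed
s2 = "cafe\u0301"  # decomposed

# Print each code point with its official Unicode name
for s in (s1, s2):
    print([f"U+{ord(c):04X} {name(c)}" for c in s])
```

The first string ends in `U+00E9 LATIN SMALL LETTER E WITH ACUTE`; the second ends in a plain `e` followed by `U+0301 COMBINING ACUTE ACCENT`.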
Why This Happens¶
Unicode separates visual appearance from code point sequence. Two strings may render identically while having entirely different internal representations:
| String | Internal Representation |
|---|---|
| café | `c` `a` `f` `U+00E9` |
| café | `c` `a` `f` `U+0065` `U+0301` |
Because the code point sequences differ, direct comparison with == fails.
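The same problem affects hashing, so dictionary and set lookups can silently miss. The `menu` dictionary below is an invented illustration, not from the source:

```python
s1 = "caf\u00e9"   # precomposed key
s2 = "cafe\u0301"  # decomposed lookup string

menu = {s1: 3.50}

print(s1 in menu)  # True
print(s2 in menu)  # False — visually identical key is not found
```

Because the two strings hash differently, a lookup with one form never finds a key stored in the other.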
Normalization Forms¶
Unicode normalization converts text into a standard form so equivalent strings
compare reliably. Python provides this through unicodedata.normalize().
Unicode defines four normalization forms:
| Form | Meaning |
|---|---|
| NFC | Canonical decomposition, then recomposition |
| NFD | Canonical decomposition only |
| NFKC | Compatibility decomposition, then recomposition |
| NFKD | Compatibility decomposition only |
For most string comparison tasks, NFC is the most useful form.
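A quick sketch comparing all four forms on the running example. The ligature example in the last line is an addition to show what "compatibility" means: NFKC/NFKD also replace compatibility characters such as the single-code-point ligature `U+FB01` (ﬁ) with their plain equivalents:

```python
from unicodedata import normalize

s = "cafe\u0301"  # decomposed form, 5 code points
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    print(form, len(normalize(form, s)))
# NFC and NFKC recompose to 4 code points; NFD and NFKD stay at 5

# Compatibility forms additionally fold ligatures and similar characters:
print(normalize("NFKC", "\ufb01"))  # fi (two ordinary letters)
```

Because the compatibility forms change which characters appear, they can lose information; that is one reason NFC is the safer default for plain equality checks.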
NFC¶
NFC produces precomposed characters — it collapses decomposed sequences into a single code point where possible:
```python
print(normalize("NFC", "cafe\u0301"))  # café (length 4)
```
NFD¶
NFD produces decomposed characters — it expands precomposed characters into base character + combining marks:
```python
print(normalize("NFD", "caf\u00e9"))  # café (length 5, e + U+0301)
```
Both render as café, but their internal lengths differ.
Unicode-Aware Comparison¶
With normalization, visually identical strings compare equal:
```python
s1 = "caf\u00e9"
s2 = "cafe\u0301"
print(normalize("NFC", s1) == normalize("NFC", s2))  # True
```
For convenience, wrap this in a helper:
```python
def nfc_equal(s1, s2):
    return normalize("NFC", s1) == normalize("NFC", s2)

print(nfc_equal("café", "cafe\u0301"))  # True
```
Note that nfc_equal() is still case-sensitive — "Café" and "CAFÉ" are not equal.
For case-insensitive comparison, see Unicode Case Folding.
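As a preview, a case-insensitive variant can combine normalization with `str.casefold()`. The `fold_equal()` helper below is a minimal sketch, not part of the source's API, and it ignores the edge case that casefolding can itself produce non-normalized output:

```python
from unicodedata import normalize

def fold_equal(s1, s2):
    # Normalize first, then casefold for case-insensitive comparison
    return (normalize("NFC", s1).casefold()
            == normalize("NFC", s2).casefold())

print(fold_equal("Café", "cafe\u0301"))  # True
```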
Key Takeaways¶
- Unicode strings that look identical may have different internal representations.
- `normalize("NFC", s)` converts to a consistent precomposed form.
- Always normalize before comparing Unicode strings from external sources.