Skip to content

Unicode Normalization

Unicode allows multiple internal representations for text that looks identical.

For example, the character é may be stored in two different ways:

  • as a single precomposed character: U+00E9
  • as two code points: U+0065 (e) followed by U+0301 (combining acute accent)

These forms look the same to humans, but Python treats them as different strings:

from unicodedata import normalize

s1 = "caf\u00e9"   # precomposed: é is one character (U+00E9)
s2 = "cafe\u0301"  # decomposed:  e + combining accent (two code points)

print(s1 == s2)    # False
print(len(s1))     # 4
print(len(s2))     # 5

The two strings look identical when printed, but have different lengths — proof that they are internally different.


Why This Happens

Unicode separates visual appearance from code point sequence. Two strings may render identically while having entirely different internal representations:

String Internal Representation
café c a f U+00E9
café c a f U+0065 U+0301

Because the code point sequences differ, direct comparison with == fails.


Normalization Forms

Unicode normalization converts text into a standard form so equivalent strings compare reliably. Python provides this through unicodedata.normalize().

Unicode defines four normalization forms:

Form Meaning
NFC Canonical decomposition, then recomposition
NFD Canonical decomposition only
NFKC Compatibility decomposition, then recomposition
NFKD Compatibility decomposition only

For most string comparison tasks, NFC is the most useful form.

NFC

NFC produces precomposed characters — it collapses decomposed sequences into a single code point where possible:

print(normalize("NFC", "cafe\u0301"))  # café  (length 4)

NFD

NFD produces decomposed characters — it expands precomposed characters into base character + combining marks:

print(normalize("NFD", "caf\u00e9"))  # café  (length 5, e + U+0301)

Both render as café, but their internal lengths differ.


Unicode-Aware Comparison

With normalization, visually identical strings compare equal:

s1 = "caf\u00e9"
s2 = "cafe\u0301"

print(normalize("NFC", s1) == normalize("NFC", s2))  # True

For convenience, wrap this in a helper:

def nfc_equal(s1, s2):
    return normalize("NFC", s1) == normalize("NFC", s2)

print(nfc_equal("café", "cafe\u0301"))  # True

Note that nfc_equal() is still case-sensitive — "Café" and "CAFÉ" are not equal. For case-insensitive comparison, see Unicode Case Folding.


Key Takeaways

  • Unicode strings that look identical may have different internal representations.
  • normalize("NFC", s) converts to a consistent precomposed form.
  • Always normalize before comparing Unicode strings from external sources.