str: UTF-8 Encoding

UTF-8 is a variable-length encoding used to convert Unicode characters into bytes.

In the previous section we introduced Unicode code points. This section explains how those code points are encoded into bytes using UTF-8.

flowchart TD

A["Unicode Character"]
B["Unicode Code Point"]
C["UTF-8 Encoding"]
D["Bytes"]

A --> B --> C --> D

Why UTF-8?

ASCII Limitations

ASCII supports only 128 characters (0–127), which is insufficient for representing most languages.

# ASCII works for English
char = 'A'

# But not for other scripts
char = '好'

The character '好' has code point U+597D, far outside the 0–127 range, so it has no ASCII representation.
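To make the limitation concrete, the sketch below tries to encode both characters with Python's built-in "ascii" codec. The helper name `try_ascii` is hypothetical and exists only for this illustration.

```python
def try_ascii(text):
    """Return the ASCII bytes for text, or None if it cannot be encoded."""
    try:
        return text.encode("ascii")
    except UnicodeEncodeError:
        # Raised when a character falls outside the 0-127 ASCII range.
        return None


def main():
    print(try_ascii("A"))   # an ASCII letter encodes fine
    print(try_ascii("好"))  # a CJK character cannot be encoded


if __name__ == "__main__":
    main()
```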


UTF-8 Advantages

UTF-8 solves this problem while preserving compatibility with ASCII.

UTF-8 provides:

  • Backward compatibility with ASCII
  • Compact storage for common characters
  • Full Unicode support
  • Self-synchronizing byte sequences

Because of these properties, UTF-8 has become the dominant text encoding on the web and in modern systems.


UTF-8 Encoding Structure

UTF-8 encodes Unicode code points using 1 to 4 bytes.

Code Point Range     | Byte Pattern                        | Bytes
U+0000 – U+007F      | 0xxxxxxx                            | 1
U+0080 – U+07FF      | 110xxxxx 10xxxxxx                   | 2
U+0800 – U+FFFF      | 1110xxxx 10xxxxxx 10xxxxxx          | 3
U+10000 – U+10FFFF   | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | 4

The surrogate range U+D800 – U+DFFF is reserved for UTF-16 and is not valid in UTF-8.
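The ranges in the table can be turned into a small lookup. This is a minimal sketch, not a library function: `utf8_length` is a hypothetical helper, and for simplicity it does not reject the surrogate range. The loop cross-checks it against Python's own encoder.

```python
def utf8_length(code_point):
    """Return how many bytes UTF-8 needs for a code point (illustrative helper)."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    return 4


def main():
    for ch in "Añ世😀":
        # Compare the table lookup with the length of the real encoding.
        print(ch, utf8_length(ord(ch)), len(ch.encode("utf-8")))


if __name__ == "__main__":
    main()
```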

How UTF-8 Packs Bits

UTF-8 encodes a Unicode code point by inserting its binary bits into predefined byte templates.

flowchart LR

A["Unicode Code Point Bits"]
B["UTF-8 Template"]
C["Encoded Bytes"]

A --> B --> C

Bytes | UTF-8 Pattern                       | Code Point Bits
1     | 0xxxxxxx                            | 7 bits
2     | 110xxxxx 10xxxxxx                   | 11 bits
3     | 1110xxxx 10xxxxxx 10xxxxxx          | 16 bits
4     | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | 21 bits

The x positions are filled with bits from the Unicode code point.


Example: Encoding '世'

Unicode code point:

U+4E16

Binary:

0100111000010110

Fill into the 3-byte template:

1110xxxx 10xxxxxx 10xxxxxx

Result:

11100100 10111000 10010110

Hex:

E4 B8 96
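The same bit packing can be written with shifts and masks. The following is a sketch for illustration only, assuming a hypothetical helper named `encode_code_point`; in real code, `str.encode("utf-8")` does this work.

```python
def encode_code_point(cp):
    """Encode one code point into UTF-8 bytes by filling the byte templates."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("code point cannot be encoded in UTF-8")
    if cp <= 0x7F:
        return bytes([cp])                              # 0xxxxxxx
    if cp <= 0x7FF:
        return bytes([0b11000000 | (cp >> 6),           # 110xxxxx
                      0b10000000 | (cp & 0b111111)])    # 10xxxxxx
    if cp <= 0xFFFF:
        return bytes([0b11100000 | (cp >> 12),          # 1110xxxx
                      0b10000000 | ((cp >> 6) & 0b111111),
                      0b10000000 | (cp & 0b111111)])
    return bytes([0b11110000 | (cp >> 18),              # 11110xxx
                  0b10000000 | ((cp >> 12) & 0b111111),
                  0b10000000 | ((cp >> 6) & 0b111111),
                  0b10000000 | (cp & 0b111111)])


def main():
    encoded = encode_code_point(0x4E16)  # '世'
    print(" ".join(f"{b:02X}" for b in encoded))
    print(encoded == "世".encode("utf-8"))


if __name__ == "__main__":
    main()
```

Each continuation byte takes the next six bits of the code point, so the lead byte only has to hold whatever bits are left over at the top.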

Leading Bit Patterns

The first byte of a UTF-8 sequence indicates the number of bytes in the encoding.

Pattern  | Meaning
0xxxxxxx | Single-byte ASCII character
110xxxxx | Start of 2-byte sequence
1110xxxx | Start of 3-byte sequence
11110xxx | Start of 4-byte sequence
10xxxxxx | Continuation byte

Continuation bytes always start with 10xxxxxx.

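This makes UTF-8 self-synchronizing: a decoder that starts mid-character or hits a corrupted byte can find the next character boundary by skipping continuation bytes, so damage stays confined to the affected character instead of spreading through the rest of the stream.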
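The pattern table maps directly onto bit masks. Below is a minimal sketch with a hypothetical `classify_byte` helper. It checks only the leading bits, so it is a simplification: a full validator would also reject bytes such as C0, C1, and F5 and above, which match a lead pattern but never occur in valid UTF-8.

```python
def classify_byte(byte):
    """Classify one byte (0-255) by its leading bit pattern."""
    if (byte & 0b10000000) == 0b00000000:
        return "ascii"          # 0xxxxxxx
    if (byte & 0b11000000) == 0b10000000:
        return "continuation"   # 10xxxxxx
    if (byte & 0b11100000) == 0b11000000:
        return "lead-2"         # 110xxxxx
    if (byte & 0b11110000) == 0b11100000:
        return "lead-3"         # 1110xxxx
    if (byte & 0b11111000) == 0b11110000:
        return "lead-4"         # 11110xxx
    return "invalid"            # 11111xxx never appears in UTF-8


def main():
    for byte in "A世".encode("utf-8"):
        print(f"{byte:08b}", classify_byte(byte))


if __name__ == "__main__":
    main()
```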


Why Continuation Bytes Start with 10

A decoder can immediately determine whether a byte is:

  • a single-byte ASCII character
  • the start of a multi-byte sequence
  • a continuation byte

No valid UTF-8 character starts with 10. Because of this rule, a decoder can scan any byte stream and determine where characters begin.

Example byte sequence:

11100100 10111000 10010110

Breakdown:

11100100   → start of 3-byte sequence
10111000   → continuation
10010110   → continuation

This represents the character '世'.

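If a byte is corrupted, or the decoder starts reading in the middle of a character, it can resynchronize by skipping bytes of the form 10xxxxxx until it reaches a byte that can start a character. In otherwise valid data this means skipping at most three bytes.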

UTF-8 continuation bytes always begin with 10, allowing decoders to reliably detect character boundaries.
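Resynchronization can be sketched as a short scan. The helper name `next_boundary` is an assumption made for this example. It moves forward from an arbitrary byte offset until it reaches a byte that is not a continuation byte.

```python
def next_boundary(data, index):
    """Return the first position >= index that is not a continuation byte."""
    while index < len(data) and (data[index] & 0b11000000) == 0b10000000:
        index += 1
    return index


def main():
    data = "A世B".encode("utf-8")   # bytes: 41 | E4 B8 96 | 42
    start = next_boundary(data, 2)  # offset 2 falls inside '世'
    # Decoding from the boundary succeeds even though we began mid-character.
    print(start, data[start:].decode("utf-8"))


if __name__ == "__main__":
    main()
```

Starting at offset 2 skips the two remaining continuation bytes of '世' and resumes at 'B'. The partial character is lost, but nothing after it is affected.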


Encoding Examples

ASCII Character

ASCII characters remain identical in UTF-8.

'A' → U+0041 → 01000001

This uses 1 byte.


Accented Character

'ñ' → U+00F1 → 11000011 10110001

This uses 2 bytes.


Chinese Character

'世' → U+4E16 → 11100100 10111000 10010110

This uses 3 bytes (hex E4 B8 96).

Musical Symbol (Supplementary Plane)

'𝄞' → U+1D11E → 11110000 10011101 10000100 10011110

This uses 4 bytes.
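The bit patterns in these examples can be reproduced with Python's built-in encoder. This is a small sketch; `utf8_bits` is a hypothetical helper that formats each encoded byte as eight binary digits.

```python
def utf8_bits(ch):
    """Return the UTF-8 encoding of ch as space-separated 8-bit groups."""
    return " ".join(f"{byte:08b}" for byte in ch.encode("utf-8"))


def main():
    for ch in ["A", "ñ", "世", "𝄞"]:
        print(f"{ch!r} U+{ord(ch):04X} {utf8_bits(ch)}")


if __name__ == "__main__":
    main()
```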


UTF-8 Encoding in Python

Python provides built-in methods for converting strings into UTF-8 bytes.

def main():
    text = "A ñ 世 😀"

    encoded = text.encode("utf-8")

    print("string:", text)
    print("bytes:", encoded)
    print("byte values:", list(encoded))

if __name__ == "__main__":
    main()

Example output:

string: A ñ 世 😀
bytes: b'A \xc3\xb1 \xe4\xb8\x96 \xf0\x9f\x98\x80'
byte values: [65, 32, 195, 177, 32, 228, 184, 150, 32, 240, 159, 152, 128]

Comparison with Other Encodings

Encoding | Bytes per Code Point | Typical Use
UTF-8    | 1–4                  | Web, Linux, modern systems
UTF-16   | 2 or 4               | Windows, Java
UTF-32   | 4                    | Internal processing
ASCII    | 1                    | English text
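The size differences can be measured directly. The sketch below assumes a hypothetical helper `encoded_sizes`. It uses the little-endian codec names ("utf-16-le", "utf-32-le") so that Python does not prepend a byte order mark, which would otherwise inflate the counts.

```python
def encoded_sizes(text):
    """Return the encoded length of text in bytes for three Unicode encodings."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),  # -le avoids the BOM
        "utf-32": len(text.encode("utf-32-le")),  # -le avoids the BOM
    }


def main():
    for ch in ["A", "世", "😀"]:
        print(ch, encoded_sizes(ch))


if __name__ == "__main__":
    main()
```

'世' takes 3 bytes in UTF-8 but only 2 in UTF-16, so UTF-8 is not the most compact choice for every script. Its advantage is strongest for ASCII-heavy text.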

ASCII Is a Subset of UTF-8

UTF-8 was designed so that all ASCII characters remain unchanged.

The ASCII range (U+0000 – U+007F) is encoded in UTF-8 using exactly one byte, with the same value as ASCII.

Character | ASCII (hex) | UTF-8 (hex)
A         | 41          | 41
0         | 30          | 30
!         | 21          | 21

So an ASCII file is already a valid UTF-8 file.

def main():
    text = "Hello"

    print(text.encode("ascii"))
    print(text.encode("utf-8"))

if __name__ == "__main__":
    main()

Output:

b'Hello'
b'Hello'

The byte sequences are identical.

UTF-8 was designed so that the entire ASCII character set is encoded identically, making every ASCII file a valid UTF-8 file.


Why Python Uses UTF-8 by Default

Python 3 adopted UTF-8 as the default source encoding (PEP 3120).

In Python 2, source files were typically interpreted as ASCII, and non-ASCII characters required an explicit encoding declaration:

# -*- coding: utf-8 -*-
name = "José"

In Python 3, UTF-8 is the default source encoding, so non-ASCII characters can appear directly in string literals:

name = "José"
greeting = "你好"
emoji = "😀"

print(name, greeting, emoji)

No special encoding declaration is required.

In Python 3:

  • str objects represent Unicode text
  • bytes represent encoded binary data

Typical workflow:

text (str) → encode → bytes
bytes → decode → text (str)
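The workflow can be sketched as a round trip, together with what happens when the bytes are not valid UTF-8. The helper names `round_trip` and `safe_decode` are hypothetical. The `errors="replace"` argument is a standard option of `bytes.decode` that substitutes U+FFFD for undecodable input instead of raising.

```python
def round_trip(text):
    """Encode text to UTF-8 bytes and decode it back to str."""
    data = text.encode("utf-8")   # str -> bytes
    return data.decode("utf-8")   # bytes -> str


def safe_decode(data):
    """Decode UTF-8 bytes, replacing invalid sequences with U+FFFD."""
    return data.decode("utf-8", errors="replace")


def main():
    print(round_trip("José 你好 😀"))

    truncated = "世".encode("utf-8")[:2]  # last byte of the sequence is missing
    try:
        truncated.decode("utf-8")
    except UnicodeDecodeError as exc:
        print("decode failed:", exc.reason)
    print(safe_decode(truncated))


if __name__ == "__main__":
    main()
```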

Key Takeaways

  • UTF-8 converts Unicode code points into bytes.
  • UTF-8 uses 1–4 bytes per code point.
  • ASCII characters remain single-byte in UTF-8.
  • Leading bits identify the length of the sequence.
  • Continuation bytes start with 10, enabling self-synchronization.
  • ASCII is a subset of UTF-8.
  • Python 3 uses UTF-8 as the default source encoding.
  • UTF-8 is the most widely used text encoding today.