str: UTF-8 Encoding¶
UTF-8 is a variable-length encoding used to convert Unicode characters into bytes.
In the previous section we introduced Unicode code points. This section explains how those code points are encoded into bytes using UTF-8.
flowchart TD
A["Unicode Character"]
B["Unicode Code Point"]
C["UTF-8 Encoding"]
D["Bytes"]
A --> B --> C --> D
Why UTF-8?¶
ASCII Limitations¶
ASCII supports only 128 characters (0–127), which is insufficient for representing most languages.
# ASCII works for English
char = 'A'
# But not for other scripts
char = '好'
ASCII cannot represent characters from most writing systems.
UTF-8 Advantages¶
UTF-8 solves this problem while preserving compatibility with ASCII.
UTF-8 provides:
- Backward compatibility with ASCII
- Compact storage for common characters
- Full Unicode support
- Self-synchronizing byte sequences
Because of these properties, UTF-8 has become the dominant text encoding on the web and in modern systems.
UTF-8 Encoding Structure¶
UTF-8 encodes Unicode code points using 1 to 4 bytes.
| Code Point Range | Byte Pattern | UTF-8 Bytes |
|---|---|---|
| U+0000 – U+007F | 0xxxxxxx |
1 |
| U+0080 – U+07FF | 110xxxxx 10xxxxxx |
2 |
| U+0800 – U+FFFF | 1110xxxx 10xxxxxx 10xxxxxx |
3 |
| U+10000 – U+10FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx |
4 |
How UTF-8 Packs Bits¶
UTF-8 encodes a Unicode code point by inserting its binary bits into predefined byte templates.
flowchart LR
A["Unicode Code Point Bits"]
B["UTF-8 Template"]
C["Encoded Bytes"]
A --> B --> C
| Bytes | UTF-8 Pattern | Code Point Bits |
|---|---|---|
| 1 | 0xxxxxxx |
7 bits |
| 2 | 110xxxxx 10xxxxxx |
11 bits |
| 3 | 1110xxxx 10xxxxxx 10xxxxxx |
16 bits |
| 4 | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx |
21 bits |
The x positions are filled with bits from the Unicode code point.
Example: Encoding '世'¶
Unicode code point:
U+4E16
Binary:
0100111000010110
Fill into the 3-byte template:
1110xxxx 10xxxxxx 10xxxxxx
Result:
11100100 10111000 10010110
Hex:
E4 B8 96
Leading Bit Patterns¶
The first byte of a UTF-8 sequence indicates the number of bytes in the encoding.
| Pattern | Meaning |
|---|---|
0xxxxxxx |
Single-byte ASCII character |
110xxxxx |
Start of 2-byte sequence |
1110xxxx |
Start of 3-byte sequence |
11110xxx |
Start of 4-byte sequence |
10xxxxxx |
Continuation byte |
Continuation bytes always start with 10xxxxxx.
This makes UTF-8 self-synchronizing, meaning corrupted bytes rarely affect surrounding characters.
Why Continuation Bytes Start with 10¶
A decoder can immediately determine whether a byte is:
- a single-byte ASCII character
- the start of a multi-byte sequence
- a continuation byte
No valid UTF-8 character starts with 10. Because of this rule, a decoder can scan any byte stream and determine where characters begin.
Example byte sequence:
11100100 10111000 10010110
Breakdown:
11100100 → start of 3-byte sequence
10111000 → continuation
10010110 → continuation
This represents the character '世'.
If a byte is corrupted, the decoder can resynchronize quickly because valid continuation bytes must start with 10xxxxxx.
UTF-8 continuation bytes always begin with 10, allowing decoders to reliably detect character boundaries.
Encoding Examples¶
ASCII Character¶
ASCII characters remain identical in UTF-8.
'A' → U+0041 → 01000001
This uses 1 byte.
Accented Character¶
'ñ' → U+00F1 → 11000011 10110001
This uses 2 bytes.
Chinese Character¶
'世' → U+4E16 → 11100100 10111000 10010110
UTF-8 bytes:
E4 B8 96
Musical Symbol (Supplementary Plane)¶
'𝄞' → U+1D11E → 11110000 10011101 10000100 10011110
This uses 4 bytes.
UTF-8 Encoding in Python¶
Python provides built-in methods for converting strings into UTF-8 bytes.
def main():
text = "A ñ 世 😀"
encoded = text.encode("utf-8")
print("string:", text)
print("bytes:", encoded)
print("byte values:", list(encoded))
if __name__ == "__main__":
main()
Example output:
string: A ñ 世 😀
bytes: b'A \xc3\xb1 \xe4\xb8\x96 \xf0\x9f\x98\x80'
byte values: [65, 32, 195, 177, 32, 228, 184, 150, 32, 240, 159, 152, 128]
Comparison with Other Encodings¶
| Encoding | Bytes per Character | Typical Use |
|---|---|---|
| UTF-8 | 1–4 | Web, Linux, modern systems |
| UTF-16 | 2–4 | Windows, Java |
| UTF-32 | 4 | Internal processing |
| ASCII | 1 | English text |
ASCII Is a Subset of UTF-8¶
UTF-8 was designed so that all ASCII characters remain unchanged.
The ASCII range (U+0000 – U+007F) is encoded in UTF-8 using exactly one byte, with the same value as ASCII.
| Character | ASCII | UTF-8 |
|---|---|---|
A |
41 |
41 |
0 |
30 |
30 |
! |
21 |
21 |
So an ASCII file is already a valid UTF-8 file.
def main():
text = "Hello"
print(text.encode("ascii"))
print(text.encode("utf-8"))
if __name__ == "__main__":
main()
Output:
b'Hello'
b'Hello'
The byte sequences are identical.
UTF-8 was designed so that the entire ASCII character set is encoded identically, making every ASCII file a valid UTF-8 file.
Why Python Uses UTF-8 by Default¶
Python 3 adopted UTF-8 as the default source encoding (PEP 3120).
In Python 2, source files were typically interpreted as ASCII, and non-ASCII characters required an explicit encoding declaration:
# -*- coding: utf-8 -*-
name = "José"
In Python 3, UTF-8 is the default. Unicode characters can appear directly in code:
name = "José"
greeting = "你好"
emoji = "😀"
print(name, greeting, emoji)
No special encoding declaration is required.
In Python 3:
strobjects represent Unicode textbytesrepresent encoded binary data
Typical workflow:
text (str) → encode → bytes
bytes → decode → text (str)
Key Takeaways¶
- UTF-8 converts Unicode code points into bytes.
- UTF-8 uses 1–4 bytes per character.
- ASCII characters remain single-byte in UTF-8.
- Leading bits identify the length of the sequence.
- Continuation bytes start with
10, enabling self-synchronization. - ASCII is a subset of UTF-8.
- Python 3 uses UTF-8 as the default source encoding.
- UTF-8 is the most widely used text encoding today.