Encoding Issues
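Text encoding mismatches are a common source of errors: bytes written with one encoding and read with another either raise a UnicodeDecodeError or silently produce garbled text. Knowing how Python converts between str and bytes, and naming the encoding explicitly wherever text crosses that boundary, prevents most of these bugs.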
UTF-8 Basics
Default Encoding
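import sys
# Codec used by str.encode() and bytes.decode() when none is given; always 'utf-8' in Python 3
print(f"Default: {sys.getdefaultencoding()}")
text = "Hello 世界 🌍"
print(f"Text: {text}")
# Non-ASCII characters become multi-byte sequences in UTF-8
print(f"Bytes (UTF-8): {text.encode('utf-8')}")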
Output:
Default: utf-8
Text: Hello 世界 🌍
Bytes (UTF-8): b'Hello \xe4\xb8\x96\xe7\x95\x8c \xf0\x9f\x8c\x8d'
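The byte string above is longer than the text it encodes because UTF-8 is variable-width. A minimal sketch of that relationship, using the same sample text (the names `char_count`, `byte_count`, and `widths` are illustrative only):

```python
# Compare the character count with the UTF-8 byte count for the sample text.
text = "Hello 世界 🌍"
char_count = len(text)
byte_count = len(text.encode("utf-8"))
print(f"Characters: {char_count}, bytes: {byte_count}")

# UTF-8 uses 1 to 4 bytes per code point, depending on the character.
widths = {ch: len(ch.encode("utf-8")) for ch in "Aé世🌍"}
for ch, width in widths.items():
    print(f"{ch!r}: {width} byte(s)")
```

This is why `len()` on a str and `len()` on its encoded bytes should not be used interchangeably, for example when enforcing a size limit.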
Encoding/Decoding
Encoding Text to Bytes
text = "café"
utf8 = text.encode('utf-8')
latin1 = text.encode('latin-1')
ascii_err = text.encode('ascii', errors='replace')
print(f"UTF-8: {utf8}")
print(f"Latin-1: {latin1}")
print(f"ASCII (replace): {ascii_err}")
Output:
UTF-8: b'caf\xc3\xa9'
Latin-1: b'caf\xe9'
ASCII (replace): b'caf?'
Decoding Bytes to Text
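utf8_bytes = b'caf\xc3\xa9'
latin1_bytes = b'caf\xe9'
# Decoding with the codec that produced the bytes round-trips cleanly
print(f"UTF-8: {utf8_bytes.decode('utf-8')}")
print(f"Latin-1: {latin1_bytes.decode('latin-1')}")
# Decoding UTF-8 bytes as Latin-1 raises no error but yields garbled text
print(f"UTF-8 as Latin-1: {utf8_bytes.decode('latin-1')}")
Output:
UTF-8: café
Latin-1: café
UTF-8 as Latin-1: cafÃ©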
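Decoding with the wrong codec can succeed silently and leave garbled text (mojibake) behind. The sketch below shows that case and one way to undo it, assuming the garbled string came from UTF-8 bytes that were decoded as Latin-1 and has not been altered since. The variable names are illustrative.

```python
utf8_bytes = "café".encode("utf-8")

# Wrong codec, no exception: Latin-1 maps every byte to some character.
garbled = utf8_bytes.decode("latin-1")

# Reverse the mistake: recover the original bytes, then decode them correctly.
repaired = garbled.encode("latin-1").decode("utf-8")
print(f"Garbled: {garbled!r}, repaired: {repaired!r}")

# The opposite mismatch fails loudly: a lone 0xE9 byte is not valid UTF-8.
decode_error = None
try:
    "café".encode("latin-1").decode("utf-8")
except UnicodeDecodeError as exc:
    decode_error = exc
    print(f"Decode failed: {exc.reason}")
```

The repair only works when the wrong decode was lossless, which holds for Latin-1 because it assigns a character to every byte value.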
File Encoding
Specifying File Encoding
import os
import tempfile
# Write with an explicit encoding instead of relying on the platform default
with tempfile.NamedTemporaryFile(mode='w', encoding='utf-8', delete=False) as f:
    f.write("Hello 世界")
    temp_file = f.name
# Read back with the same encoding that was used to write
with open(temp_file, 'r', encoding='utf-8') as f:
    content = f.read()
print(f"Read: {content}")
os.unlink(temp_file)
Output:
Read: Hello 世界
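When a file's encoding is not known in advance, one common approach is to try candidate encodings in order. The following is a rough sketch, not a general solution: `read_text_with_fallback` is a hypothetical helper, and the candidate order (UTF-8, then cp1252, then Latin-1) is an assumption that may not suit every data source.

```python
import os
import tempfile

def read_text_with_fallback(path, encodings=("utf-8", "cp1252")):
    """Hypothetical helper: return (text, encoding) for the first codec that decodes cleanly."""
    with open(path, "rb") as f:
        raw = f.read()
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Last resort: Latin-1 accepts any byte sequence, so this never raises.
    return raw.decode("latin-1"), "latin-1"

# Write a file in a legacy encoding, then read it back without naming that encoding.
with tempfile.NamedTemporaryFile(mode="w", encoding="cp1252", delete=False) as f:
    f.write("café")
    temp_file = f.name

content, used = read_text_with_fallback(temp_file)
os.unlink(temp_file)
print(f"Read {content!r} using {used}")
```

Trying UTF-8 first is deliberate: it rejects most byte sequences that were produced by other encodings, whereas single-byte codecs accept almost anything and can therefore guess wrong without raising.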
Error Handling
Encoding Error Strategies
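text = "café"
# 'replace' substitutes '?' for each character the codec cannot encode
replace = text.encode('ascii', errors='replace')
# 'ignore' drops unencodable characters entirely
ignore = text.encode('ascii', errors='ignore')
print(f"Replace: {replace}")
print(f"Ignore: {ignore}")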
Output:
Replace: b'caf?'
Ignore: b'caf'
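Both `replace` and `ignore` discard information. As a brief sketch, the standard library also provides the default `strict` handler, which raises instead, and two handlers that keep the original code point recoverable. The variable names are illustrative.

```python
text = "café"

# 'strict' is the default handler and raises on unencodable characters.
try:
    text.encode("ascii")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# These handlers keep the information that 'replace' and 'ignore' discard.
backslash = text.encode("ascii", errors="backslashreplace")
xmlref = text.encode("ascii", errors="xmlcharrefreplace")

# On decoding, 'replace' substitutes U+FFFD for the undecodable byte.
decoded = b"caf\xe9".decode("utf-8", errors="replace")

print(f"Strict raised: {strict_failed}")
print(f"Backslashreplace: {backslash}")
print(f"XML char ref: {xmlref}")
print(f"Decode (replace): {decoded!r}")
```

Which handler fits depends on the destination: `strict` suits data that must stay intact, while the escaping handlers suit logs or markup where a readable placeholder is acceptable.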