Encoding Issues
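Text encoding mismatches are a common source of errors: bytes written in one encoding and read back in another either raise an exception or produce garbled text. Understanding character encodings and how Python converts between str and bytes prevents most of these bugs.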
Mental Model
Encoding errors happen when the decoder guesses wrong about which encoding was used to write the bytes. The fix is always the same: know your encoding, specify it explicitly (open(f, encoding='utf-8')), and never rely on the platform default. When in doubt, UTF-8 is the right choice for nearly all modern text.
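A wrong guess can fail in two different ways, which the following minimal sketch illustrates. The sample string and variable names are chosen for illustration only. One mismatch raises an exception, while the other succeeds silently and returns garbled text, which is usually the harder bug to notice.

```python
original = "naïve café"

# Written as Latin-1, read as UTF-8: the byte 0xEF starts a multi-byte
# sequence that never completes, so decoding fails loudly.
latin1_bytes = original.encode("latin-1")
try:
    latin1_bytes.decode("utf-8")
    loud_failure = False
except UnicodeDecodeError as exc:
    loud_failure = True
    print(f"Loud failure: {exc}")

# Written as UTF-8, read as Latin-1: every byte is a valid Latin-1
# character, so decoding "succeeds" and silently returns garbled text.
utf8_bytes = original.encode("utf-8")
garbled = utf8_bytes.decode("latin-1")
print(f"Silent failure: {garbled}")
```

Specifying the encoding explicitly on both the writing and the reading side avoids both outcomes.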
UTF-8 Basics
Default Encoding
```python
import sys

print(f"Default: {sys.getdefaultencoding()}")

text = "Hello 世界 🌍"
print(f"Text: {text}")
print(f"Bytes (UTF-8): {text.encode('utf-8')}")
```
Output:
Default: utf-8
Text: Hello 世界 🌍
Bytes (UTF-8): b'Hello \xe4\xb8\x96\xe7\x95\x8c \xf0\x9f\x8c\x8d'
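Note that sys.getdefaultencoding() reports the default used by str.encode() and bytes.decode(), which is always UTF-8 in Python 3. It is not the encoding that open() falls back to when encoding= is omitted. As a rough sketch, the two can be compared as follows. The variable names are illustrative, and the second value depends on the platform and locale settings.

```python
import locale
import sys

# Default for str.encode() / bytes.decode() when no codec is given.
str_default = sys.getdefaultencoding()

# Encoding open() typically falls back to when encoding= is omitted.
# This is platform and locale dependent (for example cp1252 on some
# Windows setups), which is why it should not be relied upon.
file_default = locale.getpreferredencoding(False)

print(f"str/bytes default: {str_default}")
print(f"open() default:    {file_default}")
```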
Encoding/Decoding
Encoding Text to Bytes
```python
text = "café"

utf8 = text.encode('utf-8')
latin1 = text.encode('latin-1')
ascii_err = text.encode('ascii', errors='replace')

print(f"UTF-8: {utf8}")
print(f"Latin-1: {latin1}")
print(f"ASCII (replace): {ascii_err}")
```
Output:
UTF-8: b'caf\xc3\xa9'
Latin-1: b'caf\xe9'
ASCII (replace): b'caf?'
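The outputs above show é taking two bytes in UTF-8 but only one in Latin-1. UTF-8 is a variable-width encoding, so the number of characters and the number of bytes generally differ. The following sketch (sample text chosen for illustration) counts the bytes per character:

```python
text = "café 世界 🌍"

# Map each distinct character to the number of bytes it needs in UTF-8.
widths = {ch: len(ch.encode("utf-8")) for ch in text}

for ch, n in widths.items():
    print(f"{ch!r}: {n} byte(s)")

char_count = len(text)
byte_count = len(text.encode("utf-8"))
print(f"characters: {char_count}, UTF-8 bytes: {byte_count}")
```

ASCII characters take one byte, é takes two, the CJK characters take three each, and the emoji takes four.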
Decoding Bytes to Text
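```python
utf8_bytes = b'caf\xc3\xa9'
latin1_bytes = b'caf\xe9'

print(f"UTF-8: {utf8_bytes.decode('utf-8')}")
print(f"Latin-1: {latin1_bytes.decode('latin-1')}")
# Decoding with the wrong codec does not fail here; it silently yields mojibake.
print(f"UTF-8 as Latin-1: {utf8_bytes.decode('latin-1')}")
```
Output:
UTF-8: café
Latin-1: café
UTF-8 as Latin-1: cafÃ©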
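When UTF-8 bytes have been decoded as Latin-1, the result is mojibake such as cafÃ©. Because Latin-1 maps every byte to exactly one code point, this particular mix-up can often be reversed by re-encoding as Latin-1 and decoding as UTF-8. The helper below is a hypothetical sketch, not a general repair tool: it only handles this one mismatch and returns the input unchanged when the round trip does not apply.

```python
def fix_mojibake(garbled: str) -> str:
    """Undo a UTF-8-read-as-Latin-1 mix-up; return the input unchanged otherwise."""
    try:
        return garbled.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters outside Latin-1, or the bytes
        # are not valid UTF-8, so this was not that kind of mojibake.
        return garbled


garbled = b"caf\xc3\xa9".decode("latin-1")
print(garbled)
print(fix_mojibake(garbled))
print(fix_mojibake("café"))  # already correct, left unchanged
```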
File Encoding
Specifying File Encoding
```python
import os
import tempfile

with tempfile.NamedTemporaryFile(mode='w', encoding='utf-8', delete=False) as f:
    f.write("Hello 世界")
    temp_file = f.name

with open(temp_file, 'r', encoding='utf-8') as f:
    content = f.read()
    print(f"Read: {content}")

os.unlink(temp_file)
```
Output:
Read: Hello 世界
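For comparison, the sketch below reads the same content in two other ways: in binary mode, which returns the stored bytes untouched, and in text mode with a deliberately wrong encoding plus errors="replace". The file name greeting.txt is a placeholder chosen for this example.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "greeting.txt"  # hypothetical file name
    path.write_text("Hello 世界", encoding="utf-8")

    # Binary mode: no decoding happens, the bytes come back exactly as stored.
    raw = path.read_bytes()

    # Text mode with the wrong encoding: errors="replace" avoids an exception
    # but substitutes U+FFFD for every byte ASCII cannot decode.
    with open(path, "r", encoding="ascii", errors="replace") as f:
        lossy = f.read()

print(raw)
print(lossy)
```

The lossy read cannot be undone, so errors="replace" is best kept for display or logging rather than for data that will be written back.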
Error Handling
Encoding Error Strategies
```python
text = "café"

replace = text.encode('ascii', errors='replace')
ignore = text.encode('ascii', errors='ignore')

print(f"Replace: {replace}")
print(f"Ignore: {ignore}")
```
Output:
Replace: b'caf?'
Ignore: b'caf'
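Both strategies above discard information. Python's standard codecs also provide handlers that keep the original character recoverable in escaped form. As a short sketch, the handlers can be compared side by side:

```python
text = "café"

handlers = ["replace", "ignore", "backslashreplace", "xmlcharrefreplace", "namereplace"]

# Encode the same text to ASCII with each error handler.
results = {name: text.encode("ascii", errors=name) for name in handlers}

for name, encoded in results.items():
    print(f"{name:>18}: {encoded}")
```

Here backslashreplace, xmlcharrefreplace, and namereplace emit a Python escape, an XML character reference, and a Unicode name escape respectively, so the original é can still be identified.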
Exercises
Exercise 1.
Write a script that writes the string "Hello, 世界!" to a file using UTF-8 encoding, then tries to read it back using ASCII encoding. What happens? Handle the error gracefully.
Solution to Exercise 1
```python
with open("/tmp/test.txt", "w", encoding="utf-8") as f:
    f.write("Hello, \u4e16\u754c!")  # Hello, 世界!

try:
    with open("/tmp/test.txt", "r", encoding="ascii") as f:
        content = f.read()
except UnicodeDecodeError as e:
    print(f"Error: {e}")
    # Read correctly with UTF-8
    with open("/tmp/test.txt", "r", encoding="utf-8") as f:
        print(f.read())  # Hello, 世界!
```
The ASCII codec accepts only bytes in the range 0x00 to 0x7F, so the multi-byte UTF-8 sequences for 世界 raise UnicodeDecodeError.
Exercise 2.
Write a function detect_encoding(filepath) that reads the first 100 bytes of a file and returns "utf-8" if they decode successfully as UTF-8, or "unknown" otherwise.
Solution to Exercise 2
```python
def detect_encoding(filepath):
    with open(filepath, "rb") as f:
        raw = f.read(100)
    try:
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "unknown"
```
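This is a simple heuristic. Because it reads a fixed 100 bytes, it can cut a multi-byte character in half and report "unknown" for a valid UTF-8 file; it also reports "utf-8" for pure ASCII content, which is valid UTF-8. For production use, consider the chardet library for more reliable detection.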
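One way to tolerate a character that is cut off at the end of the sample is an incremental decoder from the standard codecs module. The function name looks_like_utf8 is hypothetical, and this remains a sketch of the idea rather than a full detector.

```python
import codecs


def looks_like_utf8(raw: bytes) -> bool:
    """Return True if raw is valid UTF-8, allowing a character cut off at the end."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        # final=False buffers an incomplete trailing sequence instead of raising.
        decoder.decode(raw, final=False)
        return True
    except UnicodeDecodeError:
        return False


sample = "世界".encode("utf-8")  # 6 bytes, 3 per character
truncated = sample[:4]           # second character cut after its first byte

try:
    truncated.decode("utf-8")
except UnicodeDecodeError:
    print("plain decode: fails on the truncated sample")

print(f"incremental:  {looks_like_utf8(truncated)}")
print(f"invalid data: {looks_like_utf8(b'Hello ' + bytes([0xFF, 0xFE]))}")
```

Bytes that are invalid anywhere in the sample, such as 0xFF, are still rejected; only an unfinished sequence at the very end is forgiven.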
Exercise 3.
Demonstrate the difference between errors="ignore", errors="replace", and errors="strict" when decoding bytes that contain invalid UTF-8 sequences.
Solution to Exercise 3
```python
data = b"Hello \xff\xfe World"

print(data.decode("utf-8", errors="ignore"))   # Hello  World (two spaces remain)
print(data.decode("utf-8", errors="replace"))  # Hello �� World

try:
    data.decode("utf-8", errors="strict")
except UnicodeDecodeError as e:
    print(f"strict: {e}")
```
"ignore" silently skips invalid bytes. "replace" inserts replacement characters. "strict" (the default) raises an exception.