Encoding Issues

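Encoding mismatches, where bytes written with one encoding are read back with another, are a common source of errors. Understanding how Python separates text (str) from bytes, and which encoding applies at each conversion, helps prevent these bugs.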


UTF-8 Basics

Default Encoding

import sys

# Encoding used by str.encode() and bytes.decode() when none is given (UTF-8 in Python 3)
print(f"Default: {sys.getdefaultencoding()}")

text = "Hello 世界 🌍"
print(f"Text: {text}")
print(f"Bytes (UTF-8): {text.encode('utf-8')}")

Output:

Default: utf-8
Text: Hello 世界 🌍
Bytes (UTF-8): b'Hello \xe4\xb8\x96\xe7\x95\x8c \xf0\x9f\x8c\x8d'
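
UTF-8 is a variable-width encoding, which is why the byte string above is longer than the text. As a small sketch (the `byte_lengths` name is only illustrative), the following counts how many bytes each character of the same string occupies:

```python
text = "Hello 世界 🌍"

# Map each distinct character to the number of bytes UTF-8 uses for it
byte_lengths = {ch: len(ch.encode("utf-8")) for ch in text}

for ch, n in byte_lengths.items():
    print(f"{ch!r}: {n} byte(s)")

# Character count and byte count diverge as soon as non-ASCII text is involved
print(f"Characters: {len(text)}, bytes: {len(text.encode('utf-8'))}")
```

ASCII characters take one byte, the CJK characters take three, and the emoji takes four, so the 10-character string encodes to 17 bytes.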

Encoding/Decoding

Encoding Text to Bytes

text = "café"

# 'é' takes two bytes in UTF-8, one byte in Latin-1, and has no ASCII representation
utf8 = text.encode('utf-8')
latin1 = text.encode('latin-1')
ascii_err = text.encode('ascii', errors='replace')

print(f"UTF-8: {utf8}")
print(f"Latin-1: {latin1}")
print(f"ASCII (replace): {ascii_err}")

Output:

UTF-8: b'caf\xc3\xa9'
Latin-1: b'caf\xe9'
ASCII (replace): b'caf?'
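
Without an `errors` argument, `str.encode()` uses strict error handling and raises `UnicodeEncodeError` for any character the target encoding cannot represent. One way to check this up front is a small helper along these lines; `can_encode` is a hypothetical name used for illustration, not a standard-library function:

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character in text is representable in encoding."""
    try:
        text.encode(encoding)  # errors='strict' is the default
        return True
    except UnicodeEncodeError:
        return False


for codec in ("utf-8", "latin-1", "ascii"):
    print(f"café -> {codec}: {can_encode('café', codec)}")
    print(f"世界 -> {codec}: {can_encode('世界', codec)}")
```

UTF-8 accepts both strings, Latin-1 accepts only "café", and ASCII accepts neither.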

Decoding Bytes to Text

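utf8_bytes = b'caf\xc3\xa9'   # "café" encoded as UTF-8
latin1_bytes = b'caf\xe9'     # "café" encoded as Latin-1

print(f"UTF-8: {utf8_bytes.decode('utf-8')}")
print(f"Latin-1: {latin1_bytes.decode('latin-1')}")
# Decoding with the wrong codec does not fail here; it silently produces garbled text
print(f"UTF-8 as Latin-1: {utf8_bytes.decode('latin-1')}")

Output:

UTF-8: café
Latin-1: café
UTF-8 as Latin-1: cafÃ©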
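
Decoding UTF-8 bytes as Latin-1 never raises an error, because every byte value is a valid Latin-1 character. The result is garbled text (mojibake) rather than an exception. If the wrong codec is known, the damage can often be reversed by round-tripping through it. This is a minimal sketch that assumes the text was mis-decoded exactly once and not modified afterwards:

```python
original = "café"

# Simulate the bug: UTF-8 bytes decoded with the wrong codec
garbled = original.encode("utf-8").decode("latin-1")
print(f"Garbled: {garbled}")

# Undo it: re-encode with the wrong codec to recover the original bytes,
# then decode those bytes with the correct codec
repaired = garbled.encode("latin-1").decode("utf-8")
print(f"Repaired: {repaired}")
```

The repair only works when the wrong decode was lossless. If characters were replaced or dropped along the way, the original bytes cannot be recovered.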

File Encoding

Specifying File Encoding

import os
import tempfile

with tempfile.NamedTemporaryFile(mode='w', encoding='utf-8', delete=False) as f:
    f.write("Hello 世界")
    temp_file = f.name

with open(temp_file, 'r', encoding='utf-8') as f:
    content = f.read()
    print(f"Read: {content}")

os.unlink(temp_file)

Output:

Read: Hello 世界
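
When a file's encoding is not known in advance, one possible approach is to try a short list of candidate encodings in order. The helper below is a sketch under that assumption; `read_text_with_fallback` and its default candidate list are hypothetical and should be adapted to the data actually expected:

```python
import os
import tempfile


def read_text_with_fallback(path, encodings=("utf-8", "latin-1")):
    """Try each encoding in order and return (text, encoding_used)."""
    for enc in encodings:
        try:
            with open(path, "r", encoding=enc) as f:
                return f.read(), enc
        except UnicodeDecodeError:
            continue
    raise ValueError(f"None of {encodings} could decode {path}")


# Write Latin-1 bytes; b'caf\xe9' is not valid UTF-8
with tempfile.NamedTemporaryFile(mode="wb", delete=False) as f:
    f.write("café".encode("latin-1"))
    temp_file = f.name

try:
    content, used = read_text_with_fallback(temp_file)
    print(f"Read {content!r} using {used}")
finally:
    os.unlink(temp_file)
```

Because Latin-1 can decode any byte sequence, it only makes sense as the last candidate. A successful Latin-1 decode does not prove the file was actually written as Latin-1.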

Error Handling

Encoding Error Strategies

text = "café"

# 'replace' substitutes '?' for each unencodable character; 'ignore' drops it silently
replace = text.encode('ascii', errors='replace')
ignore = text.encode('ascii', errors='ignore')

print(f"Replace: {replace}")
print(f"Ignore: {ignore}")

Output:

Replace: b'caf?'
Ignore: b'caf'
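
Both strategies above lose information. Python also ships error handlers that keep the original code points recoverable, and the `errors` argument works for decoding as well as encoding. The following sketch compares several built-in handlers; the `results` dictionary is only an illustrative name:

```python
text = "café 世界"

handlers = ("replace", "ignore", "backslashreplace", "xmlcharrefreplace")

# Encode the same text to ASCII with each error handler
results = {h: text.encode("ascii", errors=h) for h in handlers}

for handler, encoded in results.items():
    print(f"{handler:>17}: {encoded}")

# When decoding, 'replace' inserts U+FFFD for bytes that cannot be decoded
decoded = b"caf\xe9".decode("utf-8", errors="replace")
print(f"Decoded: {decoded!r}")
```

`backslashreplace` and `xmlcharrefreplace` write the code point into the output as an escape sequence, so the original text can still be reconstructed. That makes them a reasonable choice for logs or HTML output.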