UnicodeDecodeError: 'utf-8' codec can't decode byte: Causes and Solutions

Overview

UnicodeDecodeError is raised when Python reads a file or decodes a byte string and the bytes are not valid in the encoding being used.

Error messages

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x93 in position 0: invalid start byte

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 5: ordinal not in range(128)

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 100: character maps to <undefined>
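
The three messages above differ only in the codec, the offending byte, its position, and the reason. As a minimal sketch, the helper below (the name `describe_decode_error` is hypothetical) reads those fields from the exception object instead of parsing the message text. It assumes `cp1252` as an example of a codec that reports itself as 'charmap'.

```python
def describe_decode_error(raw, encoding):
    """Return (codec, position, reason) for the first decode failure, or None."""
    try:
        raw.decode(encoding)
    except UnicodeDecodeError as e:
        # e.encoding, e.start and e.reason are the same values shown in the message
        return (e.encoding, e.start, e.reason)
    return None


utf8_case = describe_decode_error(b'\x93', 'utf-8')
ascii_case = describe_decode_error(b'hello\xe3', 'ascii')
charmap_case = describe_decode_error(b'\x9d', 'cp1252')
```

The position is a byte offset into the input, so it can be used to slice the raw bytes and look at what surrounds the failure.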

Causes

1. The file's encoding differs from the one used to read it

# Opened as UTF-8, but the file is actually Shift_JIS
with open('file.txt', 'r', encoding='utf-8') as f:
    content = f.read()  # UnicodeDecodeError
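
To reproduce this cause without an existing file, the sketch below writes a small CP932 file to a temporary path and reads it back twice. The helper name `read_text` and the sample text are assumptions made for illustration only.

```python
import os
import tempfile


def read_text(path, encoding):
    """Read a whole file as text using an explicit encoding."""
    with open(path, 'r', encoding=encoding) as f:
        return f.read()


# Create a temporary file containing CP932-encoded Japanese text.
with tempfile.NamedTemporaryFile(delete=False, suffix='.txt') as tmp:
    tmp.write('日本語'.encode('cp932'))
    sample_path = tmp.name

try:
    try:
        read_text(sample_path, 'utf-8')
        failed = False
    except UnicodeDecodeError:
        # The first byte of the file (0x93) is not a valid UTF-8 start byte.
        failed = True
    # Reading with the encoding the file was written in succeeds.
    text = read_text(sample_path, 'cp932')
finally:
    os.remove(sample_path)
```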

2. Reading a binary file as text

# Opening an image file in text mode
with open('image.png', 'r', encoding='utf-8') as f:
    content = f.read()  # UnicodeDecodeError
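
One way to avoid this is to look at the first bytes before choosing text mode. The following is a rough sketch, not a complete file-type detector: `looks_like_png` checks the fixed 8-byte PNG signature, and `looks_binary` is a hypothetical heuristic that treats a NUL byte as a sign of binary data (it will misfire on UTF-16 text, which legitimately contains NUL bytes).

```python
PNG_SIGNATURE = b'\x89PNG\r\n\x1a\n'


def looks_like_png(head: bytes) -> bool:
    """True if the data starts with the PNG file signature."""
    return head.startswith(PNG_SIGNATURE)


def looks_binary(head: bytes) -> bool:
    """Heuristic only: a NUL byte rarely appears in UTF-8 or CP932 text."""
    return b'\x00' in head
```

In use, one would open the file with mode 'rb', read the first few hundred bytes, and fall back to text mode only when neither check fires.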

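3. Files from older Windows environments

Files created on Japanese Windows are often encoded in CP932 (Microsoft's variant of Shift_JIS), so reading them as UTF-8 fails.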
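
CP932 and Shift_JIS are close but not identical in Python: `cp932` includes vendor extension characters that the stricter `shift_jis` codec rejects. The sketch below illustrates this with the circled digit '①'; the helper name `decodes_with` is hypothetical.

```python
def decodes_with(raw: bytes, encoding: str) -> bool:
    """True if raw can be decoded with the given encoding."""
    try:
        raw.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# '①' is one of the extension characters available in CP932.
raw = '①'.encode('cp932')

ok_cp932 = decodes_with(raw, 'cp932')
ok_shift_jis = decodes_with(raw, 'shift_jis')
```

For files that come from Windows, `cp932` is therefore usually the safer choice of the two.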

Solutions

1. Specify the correct encoding

# Shift_JIS (CP932)
with open('file.txt', 'r', encoding='cp932') as f:
    content = f.read()

# Latin-1
with open('file.txt', 'r', encoding='latin-1') as f:
    content = f.read()

# UTF-16
with open('file.txt', 'r', encoding='utf-16') as f:
    content = f.read()
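
A caveat about Latin-1: it maps every one of the 256 byte values to a character, so decoding with it never raises, even when it is the wrong choice. The short sketch below shows the resulting mojibake on a sample string chosen for illustration.

```python
utf8_bytes = 'café'.encode('utf-8')

# Wrong encoding: no exception, but the two UTF-8 bytes of 'é' become two characters.
wrong = utf8_bytes.decode('latin-1')

# Correct encoding: the original text comes back.
right = utf8_bytes.decode('utf-8')

# Every possible byte value decodes under Latin-1.
all_bytes_text = bytes(range(256)).decode('latin-1')
```

The absence of an error with `latin-1` is not evidence that the encoding is right; the decoded text still needs to be checked.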

2. Detect the encoding automatically

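import chardet

# Read in binary mode and detect the encoding
with open('file.txt', 'rb') as f:
    raw = f.read()
    result = chardet.detect(raw)
    # detect() returns a guess; 'encoding' can be None when nothing matches
    encoding = result['encoding'] or 'utf-8'

# Read the file again using the detected encoding
with open('file.txt', 'r', encoding=encoding) as f:
    content = f.read()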
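
Statistical detection is a guess, but a byte order mark (BOM) at the start of the data is a much stronger signal. As a complement that uses only the standard library, the sketch below checks for a BOM first; the function name `sniff_bom` is hypothetical, and it returns None when no BOM is present, in which case a detector such as chardet would still be needed.

```python
import codecs

# UTF-32 must be checked before UTF-16: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF8, 'utf-8-sig'),
    (codecs.BOM_UTF32_LE, 'utf-32'),
    (codecs.BOM_UTF32_BE, 'utf-32'),
    (codecs.BOM_UTF16_LE, 'utf-16'),
    (codecs.BOM_UTF16_BE, 'utf-16'),
]


def sniff_bom(raw: bytes):
    """Return a codec name implied by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name
    return None
```

The returned names ('utf-8-sig', 'utf-16', 'utf-32') are codecs that consume the BOM while decoding, so the decoded text does not start with a stray U+FEFF character.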

3. Ignore or replace errors

# Ignore errors (undecodable bytes are silently dropped)
with open('file.txt', 'r', encoding='utf-8', errors='ignore') as f:
    content = f.read()

# Replace undecodable bytes with the replacement character (U+FFFD)
with open('file.txt', 'r', encoding='utf-8', errors='replace') as f:
    content = f.read()

# Replace undecodable bytes with backslash escapes such as \x93
with open('file.txt', 'r', encoding='utf-8', errors='backslashreplace') as f:
    content = f.read()
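
To compare the three handlers side by side, the sketch below applies each one to the same byte string, which contains a single invalid byte (0x93) between two ASCII runs. The sample bytes are chosen only for illustration.

```python
raw = b'abc\x93def'

ignored = raw.decode('utf-8', errors='ignore')            # invalid byte dropped
replaced = raw.decode('utf-8', errors='replace')          # invalid byte becomes U+FFFD
escaped = raw.decode('utf-8', errors='backslashreplace')  # invalid byte becomes the text \x93
```

All three lose or alter information, so they suit logging and inspection better than data that will be written back out; `backslashreplace` at least keeps the original byte value visible.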

4. Read in binary mode

# Read the raw bytes in binary mode
with open('file.txt', 'rb') as f:
    content = f.read()

# Try decoding afterwards, falling back to CP932 if UTF-8 fails
try:
    text = content.decode('utf-8')
except UnicodeDecodeError:
    text = content.decode('cp932')
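
The same try-and-fall-back idea can be generalized to a list of candidates. This is a sketch under stated assumptions: the function name `decode_with_fallback` and the default candidate order are hypothetical, and the order matters because a permissive codec listed early can "succeed" on bytes that were written in a different encoding and produce mojibake.

```python
def decode_with_fallback(raw: bytes, encodings=('utf-8', 'cp932', 'euc-jp')):
    """Try each encoding in order; return (text, encoding) for the first that decodes."""
    for name in encodings:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue
    raise ValueError('none of the candidate encodings could decode the data')


text_utf8, used_utf8 = decode_with_fallback('テスト'.encode('utf-8'))
text_cp932, used_cp932 = decode_with_fallback('テスト'.encode('cp932'))
```

Putting UTF-8 first is a reasonable default because UTF-8 is strict: bytes from other encodings rarely happen to form valid UTF-8.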

5. Reading with pandas

import pandas as pd

# Specify the encoding
df = pd.read_csv('file.csv', encoding='cp932')

# Ignore encoding errors (encoding_errors requires pandas 1.3 or later)
df = pd.read_csv('file.csv', encoding='utf-8', encoding_errors='ignore')

6. Web requests

import requests

response = requests.get('https://example.com')

# Check which encoding requests chose for the response
print(response.encoding)

# Set the encoding explicitly before reading the text
response.encoding = 'utf-8'
content = response.text
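
The encoding of an HTTP response is normally declared in the charset parameter of its Content-Type header. As a standard-library sketch of how that value can be extracted from a header string (the function name `charset_from_content_type` is hypothetical and this is not how requests itself is implemented):

```python
from email.message import Message


def charset_from_content_type(header_value: str, default=None):
    """Return the lowercased charset parameter of a Content-Type value, or default."""
    msg = Message()
    msg['Content-Type'] = header_value
    return msg.get_content_charset(default)


declared = charset_from_content_type('text/html; charset=Shift_JIS')
missing = charset_from_content_type('text/html', default='utf-8')
```

When the header carries no charset, the server has not said how the body is encoded, which is the case where setting the encoding explicitly becomes necessary.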

Commonly used encodings

Encoding	Use
utf-8	Standard; multilingual
cp932, shift_jis	Japanese Windows
euc-jp	Japanese Unix
latin-1, iso-8859-1	Western European languages
utf-16	Some Windows files
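
Some names in the table are aliases of one codec while others are separate codecs. A quick way to check is `codecs.lookup`, which resolves any accepted spelling to the codec's canonical name; the helper name `same_codec` below is hypothetical.

```python
import codecs


def same_codec(name_a: str, name_b: str) -> bool:
    """True if both names resolve to the same Python codec."""
    return codecs.lookup(name_a).name == codecs.lookup(name_b).name


latin_aliases = same_codec('latin-1', 'iso-8859-1')   # two names for one codec
japanese_pair = same_codec('cp932', 'shift_jis')      # two distinct codecs
```

`codecs.lookup` raises LookupError for a name Python does not know, which also makes it a cheap way to validate an encoding name before opening a file.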
