I've got about 1000 filenames read by os.listdir(); some of them are encoded in UTF-8 and some in CP1252.
I want to decode all of them to Unicode for further processing in my script. Is there a way to detect the source encoding so that each name is decoded correctly?
Example (Python 2):

    import os

    for item in os.listdir(rootPath):
        # Convert to Unicode
        if isinstance(item, str):
            item = item.decode('cp1252')  # or item = item.decode('utf-8')
        print item

BartoszKP
asked Apr 10, 2013 at 6:14
Use the chardet library. It is very easy:

    import chardet
    the_encoding = chardet.detect('your string')['encoding']

and that's it!
In Python 3 you need to pass bytes or a bytearray, so:

    import chardet
    the_encoding = chardet.detect(b'your string')['encoding']

answered Aug 5, 2017 at 19:08
george
If your filenames are in either cp1252 or utf-8, then there is an easy way: try each codec in turn.

    import logging
    import os

    def force_decode(string, codecs=['utf8', 'cp1252']):
        # Return the text from the first codec that decodes without error.
        for i in codecs:
            try:
                return string.decode(i)
            except UnicodeDecodeError:
                pass
        # No codec matched; log it and implicitly return None.
        logging.warning("cannot decode url %s" % ([string]))

    for item in os.listdir(rootPath):
        # Convert to Unicode
        if isinstance(item, str):
            item = force_decode(item)
        print item

Otherwise, use a charset detection library such as chardet (see the related question "Python - detect charset and convert to utf-8"):
https://pypi.python.org/pypi/chardet
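On Python 3, the try-each-codec approach from this answer could look like the sketch below. It assumes the names are obtained as bytes (passing a bytes path to os.listdir makes it return bytes entries). The helper names decode_name and list_decoded, and the final errors="replace" fallback, are hypothetical choices for illustration and are not part of the original answer.

```python
import os


def decode_name(raw, encodings=("utf-8", "cp1252")):
    """Return raw decoded with the first encoding that succeeds."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort (an assumption of this sketch): keep the name usable
    # by substituting U+FFFD for any undecodable bytes.
    return raw.decode("utf-8", errors="replace")


def list_decoded(root):
    """List directory entries as str, decoding each raw bytes name."""
    # A bytes path makes os.listdir return bytes names.
    return [decode_name(name) for name in os.listdir(os.fsencode(root))]
```

UTF-8 is tried first because it is the stricter of the two: non-ASCII text encoded as cp1252 is rarely valid UTF-8, whereas most UTF-8 byte sequences would also decode under cp1252 without an error and come out as mojibake.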
answered Apr 10, 2013 at 6:27
lucemia
You can also use the json package to detect the encoding:

    import json
    json.detect_encoding(b"Hello")

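A caveat worth checking before relying on this: json.detect_encoding is a helper meant for JSON input, and it only distinguishes UTF-8, UTF-16 and UTF-32 (by BOM or by the pattern of null bytes). The short experiment below, with sample strings chosen purely for illustration, suggests it cannot tell cp1252 apart from UTF-8.

```python
import json

# Plain ASCII bytes are reported as UTF-8.
print(json.detect_encoding(b"Hello"))

# UTF-16 is recognised from the pattern of null bytes.
print(json.detect_encoding("Hello".encode("utf-16-le")))

# cp1252 bytes are still reported as UTF-8, and decoding then fails.
raw = "café".encode("cp1252")
guess = json.detect_encoding(raw)
try:
    raw.decode(guess)
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False
print(guess, decoded_ok)
```

For a mix of UTF-8 and cp1252 filenames, this helper therefore does not replace the try/except fallback or chardet.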
answered May 5, 2021 at 12:51
Suyog Shimpi
The encoding detected by chardet can usually be used to decode a bytearray without raising an exception, but the resulting string may not be correct.
The try ... except ... approach works well when the candidate encodings are known, but it does not cover every scenario.
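To make the "decodes without an exception but is still wrong" case concrete, here is a small illustration using the two encodings from the question. The sample character is an arbitrary choice.

```python
utf8_bytes = "é".encode("utf-8")      # b'\xc3\xa9'
cp1252_bytes = "é".encode("cp1252")   # b'\xe9'

# Wrong codec, no exception: UTF-8 bytes read as cp1252 give mojibake.
wrong = utf8_bytes.decode("cp1252")
print(repr(wrong))  # 'Ã©'

# The reverse mistake is caught, because b'\xe9' alone is not valid UTF-8.
try:
    cp1252_bytes.decode("utf-8")
    caught = False
except UnicodeDecodeError:
    caught = True
print(caught)  # True
```

This asymmetry is why the order of the candidate encodings matters: the strict one (UTF-8) should be tried before the permissive one (cp1252).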
We can try the preferred encodings with try ... except ... first and fall back to chardet as plan B:

    from typing import List, Optional

    import chardet

    def decode(byte_array: bytearray, preferred_encodings: Optional[List[str]] = None) -> str:
        if preferred_encodings is None:
            preferred_encodings = [
                'utf8',   # Works for most cases
                'cp1252'  # Other encodings may appear in your project
            ]

        for encoding in preferred_encodings:
            # Try preferred encodings first
            try:
                return byte_array.decode(encoding)
            except UnicodeDecodeError:
                pass

        # All preferred encodings failed: use the detected encoding.
        # chardet may report None, and the decode below can still raise.
        encoding = chardet.detect(byte_array)['encoding']
        return byte_array.decode(encoding)

answered Feb 24 at 6:21
Shawn Hu