Python detect encoding of string

I've got about 1000 filenames read by os.listdir(); some of them are encoded in UTF-8 and some in CP1252.

I want to decode all of them to Unicode for further processing in my script. Is there a way to get the source encoding to correctly decode into Unicode?

Example:

import os

for item in os.listdir(rootPath):
    # Convert to Unicode (Python 2: str holds raw bytes)
    if isinstance(item, str):
        item = item.decode('cp1252')  # or item = item.decode('utf-8')
    print item

BartoszKP

asked Apr 10, 2013 at 6:14

Use the chardet library. It is super easy:

import chardet

the_encoding = chardet.detect('your string')['encoding']

and that's it!

In Python 3 you need to pass bytes or a bytearray, so:

import chardet
the_encoding = chardet.detect(b'your string')['encoding']

answered Aug 5, 2017 at 19:08

george

If your files are either in cp1252 or utf-8, then there is an easy way.

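import logging
import os

def force_decode(string, codecs=['utf8', 'cp1252']):
    # Try each codec in order; utf8 goes first because it is the stricter one
    for i in codecs:
        try:
            return string.decode(i)
        except UnicodeDecodeError:
            pass

    # Every codec failed: log it and fall through, returning None
    logging.warning("cannot decode url %s" % ([string]))

for item in os.listdir(rootPath):
    # Convert to Unicode (Python 2: str holds raw bytes)
    if isinstance(item, str):
        item = force_decode(item)
    print item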

Otherwise, there is a charset detection library:

Python - detect charset and convert to utf-8

//pypi.python.org/pypi/chardet
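For Python 3, the same try-in-order idea might look like the sketch below. It assumes the names are read as raw bytes by passing a bytes path to os.listdir(). The helper names decode_filename and list_decoded, and the final errors="replace" fallback, are illustrative choices and not part of the answer above.

```python
import os

def decode_filename(raw: bytes, encodings=("utf-8", "cp1252")) -> str:
    """Decode raw filename bytes, trying each encoding in order."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Assumption: if nothing matches, keep going with U+FFFD placeholders.
    return raw.decode(encodings[0], errors="replace")

def list_decoded(root: bytes):
    """List a directory given as a bytes path and return decoded names."""
    # A bytes path makes os.listdir() return undecoded bytes names.
    return [decode_filename(name) for name in os.listdir(root)]
```

UTF-8 is tried first because it rejects most non-UTF-8 input, whereas CP1252 accepts almost any byte sequence.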

answered Apr 10, 2013 at 6:27

lucemia

You can also use the json package to detect the encoding.

import json

json.detect_encoding(b"Hello")
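One caveat worth checking before relying on this: json.detect_encoding() is a helper meant for JSON input, and it only chooses among the UTF-8, UTF-16 and UTF-32 families by looking at BOMs and null bytes. It therefore cannot tell UTF-8 from CP1252. A small demonstration, with illustrative sample strings:

```python
import json

utf16_bytes = "Hello".encode("utf-16-le")
cp1252_bytes = "café".encode("cp1252")   # b'caf\xe9', not valid UTF-8

print(json.detect_encoding(b"Hello"))      # utf-8
print(json.detect_encoding(utf16_bytes))   # utf-16-le
print(json.detect_encoding(cp1252_bytes))  # utf-8, although decoding as UTF-8 would fail
```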

answered May 5, 2021 at 12:51

Suyog Shimpi

The encoding that chardet detects can usually be used to decode a bytearray without raising an exception, but the resulting string may not be correct.

The try ... except ... approach works well when the candidate encodings are known, but it does not cover every scenario.

We can use try ... except ... first and then chardet as plan B:

    from typing import List, Optional

    import chardet

    def decode(byte_array: bytearray, preferred_encodings: Optional[List[str]] = None) -> str:
        if preferred_encodings is None:
            preferred_encodings = [
                'utf8',       # Works for most cases
                'cp1252'      # Other encodings may appear in your project
            ]

        for encoding in preferred_encodings:
            # Try preferred encodings first
            try:
                return byte_array.decode(encoding)
            except UnicodeDecodeError:
                pass

        # No preferred encoding worked: fall back to the detected encoding
        encoding = chardet.detect(byte_array)['encoding']
        return byte_array.decode(encoding)
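The warning that a decode can succeed and still be wrong is easy to reproduce with the standard library alone. In this small sketch (the sample character is illustrative), UTF-8 bytes decoded as CP1252 raise no exception but yield mojibake, which is why the stricter UTF-8 is a sensible first attempt:

```python
utf8_bytes = "é".encode("utf-8")          # b'\xc3\xa9'

wrong = utf8_bytes.decode("cp1252")       # no exception, but wrong text
right = utf8_bytes.decode("utf-8")

print(repr(wrong))   # 'Ã©'
print(repr(right))   # 'é'

# The reverse direction fails loudly instead of producing garbage.
try:
    "é".encode("cp1252").decode("utf-8")
except UnicodeDecodeError as exc:
    print("not UTF-8:", exc.reason)
```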

answered Feb 24 at 6:21

Shawn Hu

How do I know the encoding of a string?

To detect the encoding of strings in R, use the detect_str_enc() function. It is vectorized and accepts a character vector; missing values are skipped. Strings in R can only be in one of three encodings: UTF-8, Latin1, or native.

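How do I check if a file is UTF-8?

It can be done in one line: codecs.open("path/to/file", encoding="utf-8", errors="strict"). Decoding happens when the file is read, so invalid bytes raise UnicodeDecodeError on read rather than on open.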
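As a minimal sketch of that check, the function below reads the whole file with strict UTF-8 decoding and reports whether it succeeded. It uses the built-in open() rather than codecs.open(), and the name is_utf8_file is made up for illustration.

```python
def is_utf8_file(path: str) -> bool:
    """Return True if the whole file decodes as strict UTF-8."""
    try:
        with open(path, encoding="utf-8", errors="strict") as handle:
            # Read in chunks so large files are not loaded at once.
            while handle.read(64 * 1024):
                pass
    except UnicodeDecodeError:
        return False
    return True
```

Note that a pure-ASCII file also passes, since ASCII is a subset of UTF-8.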

How do I check if a string is Unicode?

In Python 3, every str is already Unicode: it is an ordered sequence of Unicode code points, not text stored in UTF-8 or any other UTF-x encoding. Only bytes objects carry an encoding, and you call decode() on them to get a str.
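In Python 3 the practical check is usually whether a value is str (text) or bytes (still encoded). A minimal sketch, where the helper name to_text and the UTF-8 default are assumptions made for illustration:

```python
def to_text(value, encoding="utf-8"):
    """Return value as str, decoding it first if it is bytes-like."""
    if isinstance(value, str):
        return value                      # already Unicode text
    if isinstance(value, (bytes, bytearray)):
        return value.decode(encoding)     # raw bytes need an explicit encoding
    raise TypeError(f"expected str or bytes, got {type(value).__name__}")
```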

What does .encode do in Python?

The encode() method encodes the string using the specified encoding. If no encoding is specified, UTF-8 is used.
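A short illustration of the default and of an explicit encoding (the sample word is arbitrary):

```python
text = "café"

default_bytes = text.encode()           # same as text.encode("utf-8")
cp1252_bytes = text.encode("cp1252")

print(default_bytes)   # b'caf\xc3\xa9'
print(cp1252_bytes)    # b'caf\xe9'
```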
