MS Notepad gives the user a choice of 4 encodings, expressed in clumsy, confusing terminology:
"Unicode" is UTF-16, written little-endian. "Unicode big endian" is UTF-16, written big-endian. In both UTF-16 cases, this means that the appropriate BOM will be written. Use utf-16
to decode such a file.
"UTF-8" is UTF-8; Notepad explicitly writes a "UTF-8 BOM". Use utf-8-sig
to decode such a file.
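The BOMs involved can be inspected directly; a Python 3 sketch using the codecs module's BOM constants:

```python
import codecs

# The BOMs Notepad writes for each choice
print(codecs.BOM_UTF16_LE)  # b'\xff\xfe'     -> Notepad "Unicode"
print(codecs.BOM_UTF16_BE)  # b'\xfe\xff'     -> Notepad "Unicode big endian"
print(codecs.BOM_UTF8)      # b'\xef\xbb\xbf' -> Notepad "UTF-8"

# The utf-16 codec consumes the BOM and picks the right byte order;
# utf-8-sig strips the UTF-8 BOM if one is present.
data_le = 'abc'.encode('utf-16')     # BOM + code units (CPython writes little-endian)
assert data_le.decode('utf-16') == 'abc'

data_u8 = 'abc'.encode('utf-8-sig')  # b'\xef\xbb\xbfabc'
assert data_u8.decode('utf-8-sig') == 'abc'
```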
"ANSI" is a shocker. This is MS terminology for "whatever the default legacy encoding is on this computer".
Here is a list of Windows encodings that I know of and the languages/scripts that they are used for:
cp874 Thai
cp932 Japanese
cp936 Unified Chinese [P.R. China, Singapore]
cp949 Korean
cp950 Traditional Chinese [Taiwan, Hong Kong, Macao[?]]
cp1250 Central and Eastern Europe
cp1251 Cyrillic [ Belarusian, Bulgarian, Macedonian, Russian, Serbian, Ukrainian]
cp1252 Western European languages
cp1253 Greek
cp1254 Turkish
cp1255 Hebrew
cp1256 Arabic script
cp1257 Baltic languages
cp1258 Vietnamese
cp???? languages/scripts of India
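The practical consequence: the same bytes mean entirely different characters under different "ANSI" codepages, which is why a wrong guess produces mojibake rather than an error. A quick Python 3 illustration with two codepages from the list above:

```python
raw = b'\xe0\xe1\xe2'

# cp1252 (Western European): accented Latin letters 'àáâ'
assert raw.decode('cp1252') == '\xe0\xe1\xe2'

# cp1251 (Cyrillic): the very same bytes are 'абв'
assert raw.decode('cp1251') == '\u0430\u0431\u0432'
```

Both decodes succeed without complaint; only the reader can tell which result is nonsense.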
If the file has been created on the computer where it is being read, then you can obtain the "ANSI" encoding from locale.getpreferredencoding(). Otherwise, if you know where it came from, you can specify what encoding to use if it's not UTF-16. Failing that, guess.
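For reference, querying that default is a one-liner (the result varies per machine; e.g. 'cp1252' on many Western-European Windows installs):

```python
import locale

# Returns the name of the locale's preferred encoding as a string
enc = locale.getpreferredencoding()
print(enc)  # e.g. 'cp1252' on Windows, commonly 'UTF-8' elsewhere
assert isinstance(enc, str) and enc
```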
Be careful using codecs.open() to read files on Windows. The docs say: """Note Files are always opened in binary mode, even if no binary mode was specified. This is done to avoid data loss due to encodings using 8-bit values. This means that no automatic conversion of '\n' is done on reading and writing.""" This means that your lines will end in \r\n and you will need/want to strip those off.
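Stripping those line endings is straightforward; note that rstrip('\r\n') removes any trailing run of either character, which is what you want here:

```python
line = 'The quick brown fox\r\n'
assert line.rstrip('\r\n') == 'The quick brown fox'

# splitlines() is another option when the whole decoded text is in memory:
text = 'line one\r\nline two\r\n'
assert text.splitlines() == ['line one', 'line two']
```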
Putting it all together:
Sample text file, saved with all 4 encoding choices, looks like this in Notepad:
The quick brown fox jumped over the lazy dogs.
àáâãäå
Here is some demo code:
import locale

def guess_notepad_encoding(filepath, default_ansi_encoding=None):
    with open(filepath, 'rb') as f:
        data = f.read(3)
    if data[:2] in ('\xff\xfe', '\xfe\xff'):
        return 'utf-16'
    if data == u''.encode('utf-8-sig'):
        return 'utf-8-sig'
    # presumably "ANSI"
    return default_ansi_encoding or locale.getpreferredencoding()

if __name__ == "__main__":
    import sys, glob, codecs
    defenc = sys.argv[1]
    for fpath in glob.glob(sys.argv[2]):
        print
        print (fpath, defenc)
        with open(fpath, 'rb') as f:
            print "raw:", repr(f.read())
        enc = guess_notepad_encoding(fpath, defenc)
        print "guessed encoding:", enc
        with codecs.open(fpath, 'r', enc) as f:
            for lino, line in enumerate(f, 1):
                print lino, repr(line)
                print lino, repr(line.rstrip('\r\n'))
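On Python 3 the same heuristic looks like this (a sketch; bytes literals and the codecs BOM constants replace the Python 2 byte-string handling above):

```python
import codecs
import locale

def guess_notepad_encoding(filepath, default_ansi_encoding=None):
    """Guess which encoding Notepad used, from the first bytes of the file."""
    with open(filepath, 'rb') as f:
        data = f.read(3)
    if data[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE):
        return 'utf-16'
    if data == codecs.BOM_UTF8:
        return 'utf-8-sig'
    # no BOM: presumably "ANSI"
    return default_ansi_encoding or locale.getpreferredencoding()
```

The decode step then needs no codecs.open: Python 3's built-in open(filepath, encoding=enc) handles it, and also translates \r\n to \n by default.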
and here is the output when run in a Windows "Command Prompt" window using the command \python27\python read_notepad.py "" t1-*.txt
('t1-ansi.txt', '')
raw: 'The quick brown fox jumped over the lazy dogs.\r\n\xe0\xe1\xe2\xe3\xe4\xe5
\r\n'
guessed encoding: cp1252
1 u'The quick brown fox jumped over the lazy dogs.\r\n'
1 u'The quick brown fox jumped over the lazy dogs.'
2 u'\xe0\xe1\xe2\xe3\xe4\xe5\r\n'
2 u'\xe0\xe1\xe2\xe3\xe4\xe5'
('t1-u8.txt', '')
raw: '\xef\xbb\xbfThe quick brown fox jumped over the lazy dogs.\r\n\xc3\xa0\xc3
\xa1\xc3\xa2\xc3\xa3\xc3\xa4\xc3\xa5\r\n'
guessed encoding: utf-8-sig
1 u'The quick brown fox jumped over the lazy dogs.\r\n'
1 u'The quick brown fox jumped over the lazy dogs.'
2 u'\xe0\xe1\xe2\xe3\xe4\xe5\r\n'
2 u'\xe0\xe1\xe2\xe3\xe4\xe5'
('t1-uc.txt', '')
raw: '\xff\xfeT\x00h\x00e\x00 \x00q\x00u\x00i\x00c\x00k\x00 \x00b\x00r\x00o\x00w
\x00n\x00 \x00f\x00o\x00x\x00 \x00j\x00u\x00m\x00p\x00e\x00d\x00 \x00o\x00v\x00e
\x00r\x00 \x00t\x00h\x00e\x00 \x00l\x00a\x00z\x00y\x00 \x00d\x00o\x00g\x00s\x00.
\x00\r\x00\n\x00\xe0\x00\xe1\x00\xe2\x00\xe3\x00\xe4\x00\xe5\x00\r\x00\n\x00'
guessed encoding: utf-16
1 u'The quick brown fox jumped over the lazy dogs.\r\n'
1 u'The quick brown fox jumped over the lazy dogs.'
2 u'\xe0\xe1\xe2\xe3\xe4\xe5\r\n'
2 u'\xe0\xe1\xe2\xe3\xe4\xe5'
('t1-ucb.txt', '')
raw: '\xfe\xff\x00T\x00h\x00e\x00 \x00q\x00u\x00i\x00c\x00k\x00 \x00b\x00r\x00o\
x00w\x00n\x00 \x00f\x00o\x00x\x00 \x00j\x00u\x00m\x00p\x00e\x00d\x00 \x00o\x00v\
x00e\x00r\x00 \x00t\x00h\x00e\x00 \x00l\x00a\x00z\x00y\x00 \x00d\x00o\x00g\x00s\
x00.\x00\r\x00\n\x00\xe0\x00\xe1\x00\xe2\x00\xe3\x00\xe4\x00\xe5\x00\r\x00\n'
guessed encoding: utf-16
1 u'The quick brown fox jumped over the lazy dogs.\r\n'
1 u'The quick brown fox jumped over the lazy dogs.'
2 u'\xe0\xe1\xe2\xe3\xe4\xe5\r\n'
2 u'\xe0\xe1\xe2\xe3\xe4\xe5'
Things to be aware of:
[1] "mbcs" is a file-system pseudo-encoding which has no relevance at all to decoding the contents of files. On a system where the default encoding is cp1252
, it makes like latin1
[aarrgghh!!]; see below
>>> all_bytes = "".join(map(chr, range(256)))
>>> u1 = all_bytes.decode('cp1252', 'replace')
>>> u2 = all_bytes.decode('mbcs', 'replace')
>>> u1 == u2
False
>>> [(i, u1[i], u2[i]) for i in xrange(256) if u1[i] != u2[i]]
[(129, u'\ufffd', u'\x81'), (141, u'\ufffd', u'\x8d'), (143, u'\ufffd', u'\x8f'), (144, u'\ufffd', u'\x90'), (157, u'\ufffd', u'\x9d')]
>>>
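The cp1252 half of that comparison reproduces on any platform (mbcs itself exists only on Windows): the five byte values 0x81, 0x8D, 0x8F, 0x90, 0x9D are undefined in cp1252, so decoding them with 'replace' yields U+FFFD, whereas mbcs on a cp1252 system silently passes them through as-is. A Python 3 check:

```python
# The five byte values that cp1252 leaves undefined
undefined = [0x81, 0x8D, 0x8F, 0x90, 0x9D]

for b in undefined:
    assert bytes([b]).decode('cp1252', 'replace') == '\ufffd'

# Every other byte value has a defined cp1252 mapping
for b in range(256):
    if b not in undefined:
        assert bytes([b]).decode('cp1252') != '\ufffd'
```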
(2) chardet is very good at detecting encodings based on non-Latin scripts (Chinese/Japanese/Korean, Cyrillic, Hebrew, Greek) but not much good at Latin-based encodings (Western/Central/Eastern Europe, Turkish, Vietnamese) and doesn't grok Arabic at all.