Convert windows-1252 to utf-8 python

I want to convert from Windows-1252 to UTF-8 in Python, so I wrote this code:

def encode(input_file, output_file):
        f = open(input_file, "r")
        data = f.read()
        f.close()

        # Convert from Windows-1252 to UTF-8
        encoded = data.encode('Windows-1252').decode('utf-8')
        with safe_open_w(output_file) as f:
            f.write(encoded)

but I have this error:

encoded = data.encode('Windows-1252').decode('utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe8 in position 5653: invalid continuation byte

I tried to convert an HTML file with this meta tag:


asked Jan 3, 2021 at 18:18


You are converting the wrong way. You want to decode from cp1252 and then encode into UTF-8. But the latter isn't really necessary; Python already does it for you.

When you decode something, the input should be bytes and the result is a Python string. Writing a string to a file already implicitly converts it, and you can actually do the same for reading, too, by specifying an encoding.

Additionally, reading the entire file into memory is inelegant and wasteful.

with open(input_file, 'r', encoding='cp1252') as inp,\
        open(output_file, 'w', encoding='utf-8') as outp:
    for line in inp:
        outp.write(line)
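For comparison, a byte-level sketch of the same fix that keeps the asker's read-the-whole-file structure (plain open is used here instead of the custom safe_open_w helper):

def encode(input_file, output_file):
    # Read the raw bytes and decode them as Windows-1252
    with open(input_file, 'rb') as f:
        text = f.read().decode('cp1252')

    # Writing with encoding='utf-8' converts the string back to bytes
    with open(output_file, 'w', encoding='utf-8') as f:
        f.write(text)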

answered Jan 3, 2021 at 18:50

tripleee


When reading and writing from various systems, it is not uncommon to encounter encoding issues when the systems have different locales. In this post I show several options for handling such issues.

Example

Say you have a field containing names and there’s a Czech name "Mořic" containing an r with caron, which you have to export to a CSV file with Windows-1252 encoding. This will fail:

>>> example = 'Mořic'
>>> example.encode('WINDOWS-1252')
UnicodeEncodeError: 'charmap' codec can't encode character '\u0159' in position 2: character maps to <undefined>

Unfortunately, Windows-1252 does not support this character and thus an exception is raised, so we need a way to handle such encoding issues.

Encoding options

Since Python 3.3, the str type is represented in Unicode. Unicode characters have no inherent byte representation; that is what a character encoding provides – a mapping from Unicode characters to bytes. Each encoding handles this mapping differently, and not all encodings support all Unicode characters, which can cause issues when converting from one encoding to another. Only the UTF family supports all Unicode characters. The most commonly used encoding is UTF-8, so stick with that whenever possible.
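A minimal round trip through UTF-8 illustrates the difference between str and bytes:

>>> s = 'Mořic'             # str: a sequence of Unicode code points
>>> b = s.encode('utf-8')   # bytes: the UTF-8 representation
>>> b
b'Mo\xc5\x99ic'
>>> b.decode('utf-8') == s
True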

With str.encode you have several error handling options. The default signature is str.encode(encoding="utf-8", errors="strict"). Given the example "Mořic", the error options are:

Errors value      | Description                                                      | Result
strict            | Encoding errors raise a UnicodeError (default).                  | Exception
ignore            | Ignore erroneous characters.                                     | Moic
replace           | Replace erroneous characters with ?.                             | Mo?ic
xmlcharrefreplace | Replace erroneous characters with an XML character reference.    | Mo&#345;ic
backslashreplace  | Replace erroneous characters with a backslashed escape sequence. | Mo\u0159ic
namereplace       | Replace erroneous characters with a \N{...} escape sequence.     | Mo\N{LATIN SMALL LETTER R WITH CARON}ic
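A quick way to reproduce the table in the shell (the bytes repr shown by print doubles the backslashes):

>>> example = 'Mořic'
>>> for errors in ('ignore', 'replace', 'xmlcharrefreplace', 'backslashreplace', 'namereplace'):
...     print(errors, example.encode('WINDOWS-1252', errors=errors))
...
ignore b'Moic'
replace b'Mo?ic'
xmlcharrefreplace b'Mo&#345;ic'
backslashreplace b'Mo\\u0159ic'
namereplace b'Mo\\N{LATIN SMALL LETTER R WITH CARON}ic'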

A nice alternative is to normalise the data first with unicodedata.normalize. The Unicode standard defines some characters as composed from multiple other characters. For example, "ř" is composed of "r" (Latin small letter r, U+0072) and "ˇ" (combining caron, U+030C). Not every character information website shows this, but this page, for example, displays the composed characters and normalisation forms: https://chars.suikawiki.org/char/0159.

Normalisation can be applied in four forms:

Normal Form | Full Name
NFD         | Normalisation Form Canonical Decomposition
NFC         | Normalisation Form Canonical Composition
NFKD        | Normalisation Form Compatibility Decomposition
NFKC        | Normalisation Form Compatibility Composition
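As a quick preview of what these forms do to the example (the full walkthrough follows below), only the decomposing forms change the number of code points:

>>> import unicodedata
>>> example = 'Mořic'
>>> for form in ('NFD', 'NFC', 'NFKD', 'NFKC'):
...     print(form, len(unicodedata.normalize(form, example)))
...
NFD 6
NFC 5
NFKD 6
NFKC 5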

To understand Unicode normal forms, we need a bit of background information first.

Unicode and composed characters

In Unicode, characters are mapped to so-called code points. Every character in the Unicode universe is expressed by a code point written as U+ followed by four or more hexadecimal digits; e.g. U+0061 represents lowercase "a".

The Unicode standard provides two ways for specifying composed characters:

  1. Decomposed: as a sequence of combining characters
  2. Precomposed: as a single combined character

For example, the character "ã" (lowercase a with tilde) in decomposed form is given as U+0061 (a) U+0303 (˜), or in precomposed form as U+00E3 (ã).
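A small check makes this concrete: the two spellings compare unequal until they are normalised to the same form.

>>> import unicodedata
>>> precomposed = '\u00e3'   # ã as a single code point
>>> decomposed = 'a\u0303'   # a followed by the combining tilde
>>> precomposed == decomposed
False
>>> unicodedata.normalize('NFC', decomposed) == precomposed
True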

Composition and Decomposition

Composition is the process of combining multiple characters to form a single character, typically a base character and one or more marks. Decomposition is the reverse: splitting a composed character into multiple characters.

Before diving into normalisation, let’s define a function for printing the Unicode code points for each character in a string:

>>> def unicodes(string):
...     return ' '.join('U+{:04X}'.format(ord(c)) for c in string)
...
>>> example = 'Mořic'
>>> print(unicodes(example))
U+004D U+006F U+0159 U+0069 U+0063

Canonical and Compatibility Equivalence

A problem arises when characters have multiple representations. For example the Ångström symbol Å (one Ångström unit equals one ten-billionth of a meter) can be represented in three ways:

U+212B
U+00C5
U+0041 U+030A

How can we determine whether two strings are equal when their underlying code point sequences differ? Unicode equivalence is defined in two ways:

  1. Canonical equivalence
  2. Compatibility equivalence

When different code point sequences represent a character with the same appearance and meaning, they are considered canonically equivalent. For example, all three representations of the Ångström symbol above have the same appearance and meaning, and are thus canonically equivalent.

Compatibility equivalence applies to sequences of code points that have the same meaning but not necessarily the same visual appearance. For example, fractions are considered compatibility equivalent: ¼ (U+00BC) and 1⁄4 (U+0031 U+2044 U+0034) do not have the same visual appearance, but do have the same meaning and are thus compatibility equivalent.

Compatibility equivalence is the weaker of the two; canonical equivalence is a subset of it. When two sequences are canonically equivalent, they are also compatibility equivalent, but not vice versa.
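Both kinds of equivalence can be tested by normalising to a common form first: the canonical forms (NFC/NFD) unify the Ångström spellings, while only the compatibility forms (NFKC/NFKD) also unify the fraction example.

>>> import unicodedata
>>> angstrom_sign, a_with_ring, combined = '\u212b', '\u00c5', 'A\u030a'
>>> unicodedata.normalize('NFC', angstrom_sign) == unicodedata.normalize('NFC', a_with_ring) == unicodedata.normalize('NFC', combined)
True
>>> vulgar_fraction = '\u00bc'            # ¼ as a single code point
>>> spelled_out = '1' + '\u2044' + '4'    # 1, fraction slash, 4
>>> unicodedata.normalize('NFC', vulgar_fraction) == unicodedata.normalize('NFC', spelled_out)
False
>>> unicodedata.normalize('NFKC', vulgar_fraction) == unicodedata.normalize('NFKC', spelled_out)
True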

Applying Unicode normalisation forms

Now, with this background information, we can get to the Unicode normal forms. Given the example "Mořic" at the start, we can apply normalisation before encoding this string with Windows-1252:

>>> import unicodedata
>>>
>>> def unicodes(string):
...     return ' '.join('U+{:04X}'.format(ord(c)) for c in string)
...
>>> example = "Mořic"
>>>
>>> print(unicodes(example))
U+004D U+006F U+0159 U+0069 U+0063
# 5 Unicode code points, so the ř is given in precomposed form

>>> example.encode("WINDOWS-1252")
UnicodeEncodeError: 'charmap' codec can't encode character '\u0159' in position 2: character maps to <undefined>
# Windows-1252 cannot encode U+0159 (ř)

>>> nfd_example = unicodedata.normalize("NFD", example)
>>> print(unicodes(nfd_example))
U+004D U+006F U+0072 U+030C U+0069 U+0063
# 6 Unicode code points, so the ř is given in decomposed form

>>> print(nfd_example)
Mořic
# A Python shell using UTF-8 still displays the r with caron

>>> nfd_example.encode("WINDOWS-1252")
UnicodeEncodeError: 'charmap' codec can't encode character '\u030c' in position 3: character maps to <undefined>
# Windows-1252 can now encode U+0072 (r), but not U+030C (the combining caron)

>>> print(nfd_example.encode('WINDOWS-1252', 'ignore'))
b'Moric'
# Successfully encoded to Windows-1252, ignoring U+030C (the combining caron)

That’s it! With unicodedata.normalize("NFD", "Mořic").encode('WINDOWS-1252', 'ignore') we normalise first and then encode to Windows-1252, ignoring the characters Windows-1252 does not know, resulting in Moric. I like this alternative; usually people are okay with it, since it doesn’t mangle the data too much and keeps it readable.
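Put together as a small helper, the recipe might look roughly like this sketch (write_cp1252 is just an illustrative name, not an existing function):

import unicodedata

def write_cp1252(text, path):
    # Illustrative helper: decompose first so base letters survive, then
    # drop anything Windows-1252 still cannot represent (e.g. combining marks).
    normalised = unicodedata.normalize('NFD', text)
    with open(path, 'wb') as f:
        f.write(normalised.encode('WINDOWS-1252', errors='ignore'))

write_cp1252('Mořic', 'names.csv')  # the file then contains "Moric"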


References

  • https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/
  • https://medium.com/concerning-pharo/an-implementation-of-unicode-normalization-7c6719068f43
  • http://unicode.org/reports/tr15/#Canon_Compat_Equivalence
  • https://www.b-list.org/weblog/2017/sep/05/how-python-does-unicode


How do I change the encoding from cp1252 to UTF-8?

“cp1252 to utf-8 python” Code Answer:

with open(ff_name, 'rb') as source_file:
    with open(target_file_name, 'w+b') as dest_file:
        contents = source_file.read()
        dest_file.write(contents.decode('cp1252').encode('utf-8'))

Is Windows-1252 a subset of UTF-8?

Windows-1252 is a subset of UTF-8 in terms of which characters are available, but not in terms of their byte-by-byte representation. Windows-1252 has characters between bytes 127 and 255 that UTF-8 encodes differently. Any visible character in the ASCII range (127 and below) is encoded identically in UTF-8.
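For example, the euro sign sits at byte 0x80 in Windows-1252 but takes three bytes in UTF-8, while a plain ASCII letter is encoded identically in both:

>>> b'\x80'.decode('cp1252')
'€'
>>> '€'.encode('utf-8')
b'\xe2\x82\xac'
>>> 'A'.encode('cp1252') == 'A'.encode('utf-8')
True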

How do I convert data to UTF-8?

How to convert a string to UTF-8 in Python:

string1 = "apple"
string2 = "Preeti125"
string3 = "12345"
string4 = "pre@12"

string.encode(encoding='UTF-8', errors='strict')

# unicode string
string = 'pythön!'
# default encoding to utf-8
string_utf = string.encode()
print('The encoded version is:', string_utf)

What is UTF-8 encoding?

UTF-8 is one of the most commonly used encodings, and Python often defaults to using it. UTF stands for “Unicode Transformation Format”, and the '8' means that 8-bit values are used in the encoding.
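Because UTF-8 is a variable-width encoding built from 8-bit units, different characters take different numbers of bytes, for example:

>>> [len(c.encode('utf-8')) for c in ('a', 'ř', '€', '🐍')]
[1, 2, 3, 4]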