Python read file with mixed encoding
If you try to decode this string as utf-8, as you already know, you will get a UnicodeDecodeError, as these spurious cp1252 bytes are invalid utf-8.
However, Python codecs allow you to register a callback to handle encoding/decoding errors, with the codecs.register_error function - it gets the UnicodeDecodeError as a parameter. You can write such a handler that attempts to decode the data as "cp1252", and continues the decoding in utf-8 for the rest of the string. In my utf-8 terminal, I can build a mixed incorrect string like this:
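A hedged reconstruction of the elided snippet (the sample word "maçã" is my choice, not necessarily the original author's):

```python
# Build a byte string that mixes valid utf-8 with cp1252-encoded bytes
mixed = "maçã ".encode("utf-8") + "maçã".encode("cp1252")
print(mixed)  # b'ma\xc3\xa7\xc3\xa3 ma\xe7\xe3'

# Decoding the whole thing as utf-8 fails on the cp1252 bytes:
try:
    mixed.decode("utf-8")
except UnicodeDecodeError as exc:
    print(exc)
```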
I wrote the said callback function here, and found a catch: even if you increment the position from which to decode the string by 1, so that it would start on the next character, if the next character is also not utf-8 and out of range(128), the error is raised again at the first out-of-range(128) character - that means the decoding "walks back" if consecutive non-ascii, non-utf-8 chars are found. The workaround for this is to keep a state variable in the error handler which detects this "walking back" and resumes decoding from the position reached in the last call to it - in this short example, I implemented it as a global variable (it will have to be manually reset to -1 before each call to the decoder):
And on the console:

Dave Angel d at davea.name
Mon Nov 7 09:33:33 EST 2011
On 11/07/2011 09:23 AM, Jaroslav Dobrek wrote:
> Hello,
>
> in Python3, I often have this problem: I want to do something with
> every line of a file. Like Python3, I presuppose that every line is
> encoded in utf-8. If this isn't the case, I would like Python3 to do
> something specific (like skipping the line, writing the line to
> standard error, ...)
>
> Like so:
>
> try:
>     ....
> except UnicodeDecodeError:
>     ...
>
> Yet, there is no place for this construction. If I simply do:
>
> for line in f:
>     print(line)
>
> this will result in a UnicodeDecodeError if some line is not utf-8,
> but I can't tell Python3 to stop:
>
> This will not work:
>
> for line in f:
>     try:
>         print(line)
>     except UnicodeDecodeError:
>         ...
>
> because the UnicodeDecodeError is caused in the "for line in f"-part.
>
> How can I catch such exceptions?
>
> Note that recoding the file before opening it is not an option,
> because often files contain many different strings in many different
> encodings.
>
> Jaroslav

A file with mixed encodings isn't a text file. So open it with 'rb' mode, and use read() on it. Find your own line-endings, since a given '\n' byte may or may not be a line-ending. Once you've got something that looks like a line, explicitly decode it using utf-8. Some invalid lines will give an exception and some will not. But perhaps you've got some other gimmick to tell the encoding for each line.

-- DaveA
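A minimal sketch of the approach DaveA describes (the helper name and the skip-on-failure policy are my choices): read the file as raw bytes, split on b'\n' as a naive line guess, and decode each candidate line individually so a failure only affects that line:

```python
def decode_lines(path):
    """Yield (line_number, text) pairs; text is None for undecodable lines."""
    with open(path, "rb") as f:
        raw = f.read()
    # Naive line splitting: as DaveA warns, a given b'\n' byte may or
    # may not be a real line ending in a mixed-encoding file.
    for number, chunk in enumerate(raw.split(b"\n"), start=1):
        try:
            yield number, chunk.decode("utf-8")
        except UnicodeDecodeError:
            yield number, None  # caller decides: skip, log to stderr, ...
```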
A recent discussion on the python-ideas mailing list made it clear that we (i.e.
the core Python developers) need to provide some clearer guidance on how to handle text processing tasks that trigger exceptions by default in Python 3, but were previously swept under the rug by Python 2’s blithe assumption that all files are encoded in “latin-1”. While we’ll have something in the official docs before too long, this is my own preliminary attempt at summarising the options for processing text files, and the
various trade-offs between them.

The obvious question to ask is what changed in Python 3 so that the common approaches that developers used to use for text processing in Python 2 have now started to throw UnicodeDecodeError and UnicodeEncodeError exceptions. The key difference is that the default text processing behaviour in Python 3 aims to detect text encoding problems as early as possible - either when reading improperly encoded text (indicated by a UnicodeDecodeError) or when being asked to write out text that cannot be represented in the target encoding (indicated by a UnicodeEncodeError). This contrasts with the Python 2 approach, which allowed data corruption by default: strict correctness checks had to be requested explicitly. That could certainly be convenient when the data being processed was predominantly ASCII text, and the occasional bit of data corruption was unlikely to be even detected, let alone cause problems, but it's hardly a solid foundation for building robust multilingual applications (as anyone that has ever had to track down an errant UnicodeDecodeError in a large Python 2 code base can attest). However,
Python 3 does provide a number of mechanisms for relaxing the default strict checks in order to handle various text processing use cases (in particular, use cases where "best effort" processing is acceptable, and strict correctness is not required). This article aims to explain some of them by looking at cases where it would be appropriate to use them. Note that many of the features I discuss below are available in Python 2 as well, but you have to explicitly access them via the codecs module.

To process text effectively
in Python 3, it’s necessary to learn at least a tiny amount about Unicode and text encodings.

Unicode Error Handlers

To help standardise various techniques for dealing with Unicode encoding and decoding errors, Python includes a concept of Unicode error handlers that are automatically invoked whenever a problem is encountered in the process of encoding or decoding text. I’m not going to cover all of them in this article, but three are of particular significance:

- strict: the default error handler, which raises UnicodeDecodeError or UnicodeEncodeError as soon as a problem is encountered
- surrogateescape: smuggles any bytes that cannot be decoded into the text as individual code points in the Unicode lone surrogate range, allowing the original bytes to be recreated exactly when encoding back out with the same handler
- backslashreplace: replaces data that cannot be represented with the equivalent Python backslash escape sequences
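By way of illustration (this snippet is mine, not from the original article), here is how those three handlers behave on the same byte sequence:

```python
data = b"caf\xe9"  # 'café' encoded as latin-1/cp1252; invalid as utf-8

# strict (the default): raises as soon as the bad byte is hit
try:
    data.decode("utf-8")
except UnicodeDecodeError as exc:
    print("strict:", exc.reason)

# surrogateescape: smuggles the bad byte through as a lone surrogate...
text = data.decode("utf-8", errors="surrogateescape")
print("surrogateescape:", ascii(text))  # 'caf\udce9'
# ...and round-trips back to the original bytes with the same handler
assert text.encode("utf-8", errors="surrogateescape") == data

# backslashreplace: human-readable escape sequences (decoding support
# for this handler requires Python 3.5+)
print("backslashreplace:", data.decode("utf-8", errors="backslashreplace"))
```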
The Binary Option

One alternative that is always available is to open files in binary mode and process them as bytes rather than as text. This can work in many cases, especially those where the ASCII markers are embedded in genuinely arbitrary binary data. However, for both “text data with unknown encoding” and “text data with known encoding, but potentially containing encoding errors”, it is often preferable to get them into a form that can be handled as text strings. In particular, some APIs that accept both bytes and text may be very strict about the encoding of the bytes they accept.

Text File Processing

This section explores a number of use cases that can arise when processing text. Text encoding is a sufficiently complex topic that there’s no one size fits all answer - the right answer for a given application will depend on factors like:
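Returning to the binary option described above: a minimal sketch of processing ASCII markers embedded in arbitrary binary data without ever decoding it (the record format here is entirely hypothetical):

```python
import re

# Treat the content purely as bytes, using a bytes pattern to find
# ASCII markers in otherwise arbitrary binary data
blob = b"\x00\x01NAME=widget;\xff\xfeNAME=gadget;\x7f"
names = [m.group(1) for m in re.finditer(rb"NAME=([^;]*);", blob)]
print(names)  # [b'widget', b'gadget']
```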
Files in an ASCII compatible encoding, best effort is acceptable

Use case: the files to be processed are in an ASCII compatible encoding, but you don’t know exactly which one. All files must be processed without triggering any exceptions, but some risk of data corruption is deemed acceptable (e.g. collating log files from multiple sources where some data errors are acceptable, so long as the logs remain largely intact).

Approach: use the “latin-1” encoding to map byte values directly to the first 256 Unicode code points. This is the closest equivalent Python 3 offers to the permissive Python 2 text handling model.

Example:

Note: while the Windows cp1252 encoding is sometimes also referred to as “latin-1”, it does not map all possible byte values, so the real “latin-1” (ISO-8859-1) codec should be used for this technique.

Consequences:
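The elided example might have looked like the following sketch (file name and sample data are illustrative):

```python
# First create a file containing bytes that are invalid as utf-8...
with open("variable-encoding.txt", "wb") as f:
    f.write(b"ok line\n\xe9\xff bad utf-8 line\n")

# ...then read it as "latin-1": every possible byte value maps to a
# code point, so no line can ever raise UnicodeDecodeError
with open("variable-encoding.txt", encoding="latin-1") as f:
    lines = f.read().splitlines()
print(lines)  # ['ok line', 'éÿ bad utf-8 line']
```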
Files in an ASCII compatible encoding, minimise risk of data corruption

Use case: the files to be processed are in an ASCII compatible encoding, but you don’t know exactly which one. All files must be processed without triggering any exceptions, but some Unicode related errors are acceptable in order to reduce the risk of data corruption (e.g. collating log files from multiple sources, but wanting more explicit notification when the collated data is at risk of corruption due to programming errors that violate the assumption of writing the data back out only in its original encoding).

Approach: use the surrogateescape error handler.

Example:

Consequences:
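A sketch of what the elided example may have shown (file names are illustrative): reading and writing with surrogateescape preserves the original bytes exactly, while any attempt to encode the smuggled surrogates with the default strict handler would raise UnicodeEncodeError - the explicit notification this use case wants:

```python
original = b"ascii text \xe9 stray cp1252 byte\n"
with open("log-src.txt", "wb") as f:
    f.write(original)

# the undecodable 0xe9 byte is smuggled through as U+DCE9
with open("log-src.txt", encoding="utf-8", errors="surrogateescape") as f:
    text = f.read()

# writing back out with the same handler recreates the bytes exactly
with open("log-dest.txt", "w", encoding="utf-8",
          errors="surrogateescape") as f:
    f.write(text)

with open("log-dest.txt", "rb") as f:
    assert f.read() == original  # byte-for-byte round trip
```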
Files in a typical platform specific encoding

Use case: the files to be processed are in a consistent encoding, the encoding can be determined from the OS details and locale settings, and it is acceptable to refuse to process files that are not properly encoded.

Approach: simply open the file in text mode. This use case describes the default behaviour in Python 3.

Example:

Consequences:
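The elided example is presumably just a plain text-mode open (file name is illustrative); when no encoding is given, Python 3 uses the locale's preferred encoding:

```python
import locale

# the encoding text-mode open() will use by default on this system
print(locale.getpreferredencoding(False))

with open("platform-encoded.txt", "w") as f:   # platform default encoding
    f.write("some text\n")
with open("platform-encoded.txt") as f:        # decoded with the same default
    print(f.read())
```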
Files in a consistent, known encoding

Use case: the files to be processed are nominally in a consistent encoding, you know the exact encoding in advance, and it is acceptable to refuse to process files that are not properly encoded. This is becoming more and more common, especially with many text file formats beginning to standardise on UTF-8 as the preferred text encoding.

Approach: open the file in text mode with the appropriate encoding.

Example:

Consequences:
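A sketch of the elided example (file names are illustrative): passing the known encoding explicitly makes the code independent of platform defaults, and improperly encoded files are refused with UnicodeDecodeError:

```python
with open("utf8-data.txt", "w", encoding="utf-8") as f:
    f.write("maçã\n")

with open("utf8-data.txt", encoding="utf-8") as f:
    data = f.read()
print(data)

# an improperly encoded file is refused rather than silently corrupted
with open("bad.txt", "wb") as f:
    f.write(b"\xff\xfe")
try:
    with open("bad.txt", encoding="utf-8") as f:
        f.read()
    refused = False
except UnicodeDecodeError:
    refused = True
print(refused)  # True
```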
Files with a reliable encoding marker

Use case: the files to be processed include markers that specify the nominal encoding (with a default encoding assumed if no marker is present), and it is acceptable to refuse to process files that are not properly encoded.

Approach: first open the file in binary mode to look for the encoding marker, then reopen in text mode with the identified encoding.

Example:

Consequences:
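A sketch of the two-pass approach described above, using a PEP 263 style "coding:" marker on the first line (the helper function and file names are my own, not from the original article):

```python
import re

def open_with_marker(path, default="utf-8"):
    """Peek at the first line in binary mode for a coding marker,
    then reopen the file in text mode with the identified encoding."""
    with open(path, "rb") as f:
        first_line = f.readline()
    match = re.search(rb"coding[:=]\s*([-\w.]+)", first_line)
    encoding = match.group(1).decode("ascii") if match else default
    return open(path, encoding=encoding)

# build a sample file whose marker disagrees with the utf-8 default
with open("marked.txt", "wb") as f:
    f.write(b"# coding: latin-1\n" + "café\n".encode("latin-1"))

with open_with_marker("marked.txt") as f:
    print(f.read())
```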