Guide: reading a UTF-8 file in Python
You have stumbled over the general problem with encodings: How can I tell in which encoding a file is?
Answer: You can't unless the file format provides for this. XML, for example, begins with a declaration that names the encoding.
This header was carefully chosen so that it can be read no matter the encoding. In your case, there is no such hint, hence neither your editor nor Python has any idea what is going on. Therefore, you must state the encoding yourself when you open the file in Python. As for your editor, you must check if it offers some way to set the encoding of a file.

The point of UTF-8 is to be able to encode 21-bit characters (Unicode) as an 8-bit data stream (because that's the only thing all computers in the world can handle). But since most OSs predate the Unicode era, they don't have suitable tools to attach the encoding information to files on the hard disk.

The next issue is the representation in Python. This is explained perfectly in the comment by heikogerlach. You must understand that your console may only be able to display ASCII. In order to display Unicode or anything >= charcode 128, it must use some means of escaping. In your editor, you must not type the escaped display string but what the string means (in this case, you must enter the umlaut and save the file).

That said, you can use the Python function eval() to turn an escaped string into a string:
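In Python 3, where text and bytes are separate types, the same idea can be sketched with a bytes literal. This is an illustration under assumptions, not the original session: the sample string is assumed, and ast.literal_eval is swapped in for eval() because it only parses literals and never runs arbitrary code.

```python
import ast

# The text as it would appear, escaped, in an ASCII-only file (assumed sample).
escaped = r"b'Capit\xc3\xa1n\n'"

# Parse the escaped literal into a real bytes object.
raw = ast.literal_eval(escaped)

print(raw)         # the bytes object, displayed with escapes
print(raw[5:6])    # a single byte: b'\xc3'
print(len(raw))    # 9 bytes; the accented letter occupies two of them
```

Slicing with raw[5:6] is used instead of raw[5] because indexing a bytes object in Python 3 returns an integer.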
As you can see, the string "\xc3" has been turned into a single character. This is now an 8-bit string, UTF-8 encoded. To get Unicode:
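In Python 3 this step is a call to bytes.decode(). A short sketch, using assumed sample bytes:

```python
raw = b'Capit\xc3\xa1n\n'        # UTF-8 encoded bytes (assumed sample)
text = raw.decode('utf-8')       # bytes -> str (Unicode text)

print(len(raw))                  # 9 bytes
print(len(text))                 # 8 characters: two bytes became one letter
print(text == 'Capit\u00e1n\n')  # True
```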
Gregg Lind asked: I think there are some pieces missing here: the file f2 contains: hex:
Answer: That depends on what you mean. ASCII can't represent characters > 127. So you need some way to say "the next few characters mean something special", which is what the sequence "\x" does. It says: The next two characters are the hexadecimal code of a single character. "\u" does the same using four characters to encode Unicode up to 0xFFFF (65535).

So you can't directly write Unicode to ASCII (because ASCII simply doesn't contain the same characters). You can write it as string escapes (as in f2); in this case, the file can be represented as ASCII. Or you can write it as UTF-8, in which case you need an 8-bit safe stream.

Remember that a file is just a sequence of bytes with 8 bits. Neither the bits nor the bytes have a meaning. It's you who says "65 means 'A'".
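The three options can be compared side by side in Python 3. This is a sketch with an assumed sample word. Note that the 'unicode_escape' codec escapes the character's code point (\xe1), whereas the file f2 discussed above held the escaped UTF-8 bytes (\xc3\xa1).

```python
text = 'Capit\u00e1n'            # "Capitan" with an accented a (assumed sample)

# Plain ASCII has no code for the accented letter, so encoding fails.
try:
    text.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

escaped = text.encode('unicode_escape')   # ASCII-safe backslash escapes
utf8 = text.encode('utf-8')               # needs an 8-bit safe stream

print(ascii_ok)    # False
print(escaped)     # b'Capit\\xe1n'
print(utf8)        # b'Capit\xc3\xa1n'

# Both representations can be turned back into the same text.
print(escaped.decode('unicode_escape') == utf8.decode('utf-8'))
```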
On this page: open(), file.read(), file.readlines(), file.write(), file.writelines().

Opening and Closing a "File Object"
As seen in Tutorials #12 and #13, file IO (input/output) operations are done through a file data object. It typically proceeds as follows:
    myfile = open('alice.txt', 'r')        # Reading. 'r' can be omitted
    # ... read from myfile ...
    myfile.close()                         # Closing file

Below, myfile is opened for writing. In the second instance, the 'a' switch makes sure that the new content is tacked on at the end of the existing text file. Had you used 'w' instead, the original file would have been overwritten.

    myfile = open('results.txt', 'w')      # The file is newly created where foo.py is
    # ... write to myfile ...
    myfile.close()                         # Closing file. VERY IMPORTANT!

    myfile = open('results.txt', 'a')      # 'a': appending instead of overwriting.
    # ... add text to the file ...
    myfile.close()                         # Closing file. DON'T FORGET!

There is one more piece of crucial information: encoding. Some files may have to be read in a particular encoding, and sometimes you need to write out a file in a specific encoding. For such cases, the open() statement should include an encoding specification, with the encoding='xxx' switch:

    myfile = open('alice.txt', encoding='utf-8')          # Reading a UTF-8 file; 'r' is omitted
    myfile = open('results.txt', 'w', encoding='utf-8')   # File will be written in UTF-8

Mostly, you will need 'utf-8' (8-bit Unicode), 'utf-16' (16-bit Unicode), or 'utf-32' (32-bit), but it may be something different, especially if you are dealing with a foreign language text. The full list of supported encodings is given in the documentation of Python's codecs module.

Reading from a File
OK, we know how to open and close a file object. But what are the actual commands for reading? There are multiple methods.

First off, .read() reads in the entire text content of the file as a single string. Below, the file is read into a variable named marytxt, which ends up being a string-type object. Download mary-short.txt and try it out yourself.
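A minimal sketch of .read() with an explicit encoding. So that it runs anywhere, it first creates a small stand-in for mary-short.txt in a temporary directory; the path and the sample text are assumptions, not the tutorial's actual file.

```python
import os
import tempfile

# Hypothetical stand-in for mary-short.txt.
path = os.path.join(tempfile.mkdtemp(), 'mary-short.txt')
sample = 'Mary had a little lamb,\nIts fleece was white as snow.\n'

myfile = open(path, 'w', encoding='utf-8')   # create the sample file
myfile.write(sample)
myfile.close()

myfile = open(path, encoding='utf-8')        # 'r' is the default mode
marytxt = myfile.read()                      # whole file as one string
myfile.close()

print(type(marytxt))    # <class 'str'>
print(len(marytxt))     # number of characters read
```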
    f = open('bible-kjv.txt')              # This is a big file
    for line in f:                         # Using 'for ... in' on file object
        if 'smite' in line:
            print(line, end='')            # end='' keeps print from adding a line break
    f.close()

Writing to a File
Writing methods also come in a pair: .write() and .writelines(). Like the corresponding reading methods, .write() handles a single string, while .writelines() handles a list of strings.

Below, .write() writes a single string each time to the designated output file:
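A minimal sketch of both writing methods, together with the 'w' and 'a' modes described earlier. The output path is hypothetical and placed in a temporary directory so the example can run anywhere.

```python
import os
import tempfile

# Hypothetical output location; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), 'results.txt')

myfile = open(path, 'w', encoding='utf-8')   # 'w': create or overwrite
myfile.write('First line\n')                 # .write() takes one string
myfile.write('Second line\n')                # line breaks are not added for you
myfile.close()

myfile = open(path, 'a', encoding='utf-8')   # 'a': append to the existing text
myfile.writelines(['Third line\n', 'Fourth line\n'])   # a list of strings
myfile.close()

myfile = open(path, encoding='utf-8')
lines = myfile.readlines()                   # read back as a list of lines
myfile.close()
print(lines)
```

Neither method adds line breaks, so each string carries its own '\n'.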
Common Pitfalls
File I/O is notoriously fraught with stumbling blocks for beginning programmers. Below are the most common ones.
"No such file or directory" error
Issues with encoding
Entire file content can be read in only ONCE per opening
Only the string type can be written
Your output file is empty
This happens to everyone: you write something out, open up the file to view, only to find it empty. At other times, the file content may be incomplete. Curious, isn't it? Well, the cause is simple: YOU FORGOT .close(). Writing out happens in buffers; flushing out the last writing buffer does not happen until you close your file object. ALWAYS REMEMBER TO CLOSE YOUR FILE OBJECT.
(Windows) Line breaks do not show up

How do I open a file with an encoding in Python?
To open a file, you can use Python's built-in open() function. Inside the open() function parentheses, you insert the filepath to be opened in quotation marks. You should also insert a character encoding, which we will talk more about below. This function returns what's called a file object.

How do I open a UTF-8 file?
How to Open UTF-8 in Excel:
1. Launch Excel and select "Open Other Workbooks" from the opening screen. ...
2. Select "Computer," and then click "Browse."
3. Navigate to the location of the UTF file, and then change the file type option to "All Files."
4. Select the UTF file, and then click "Open" to launch the Text Import Wizard.

Can Python read UTF-8?
UTF-8 is one of the most commonly used encodings, and Python often defaults to using it.

How do you decode a file in Python?
decode() is a string method in Python 2; in Python 3 it is a method of bytes objects. It is the opposite of encode(): it takes the name of the encoding in which the data was encoded and returns the decoded text.
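Three of the pitfalls listed above (the empty output file, writing non-string values, and reading a file twice) can be reproduced in a few lines. This is a sketch with a hypothetical file name; the with statement is a suggestion beyond the original text, used here because it closes the file automatically when the block ends.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'pitfalls.txt')   # hypothetical file name

# Empty output file: a 'with' block closes (and flushes) the file on exit.
with open(path, 'w', encoding='utf-8') as out:
    out.write('total: ')
    out.write(str(42) + '\n')    # only strings can be written, so convert numbers

# Read only once per opening: a second .read() starts at the end of the file.
with open(path, encoding='utf-8') as src:
    first = src.read()
    second = src.read()          # nothing left to read
    src.seek(0)                  # rewind to the beginning
    again = src.read()

print(repr(first))     # 'total: 42\n'
print(repr(second))    # ''
print(repr(again))     # 'total: 42\n'
```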