Can Python handle special characters?
Python will check the first or second line of a source file for an Emacs/Vim-style encoding specification.
Source: PEP 263. (A byte-order mark would also make Python interpret the source as UTF-8, though an explicit coding declaration is the more common convention.)
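As a minimal sketch of such a declaration (the string below is just an example; in Python 3, UTF-8 is already the default source encoding, so the comment is optional):

```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# With the declaration above (checked on the first or second line of the
# file), Python 2 would read this source as UTF-8.
greeting = "naïve café"
print(greeting)
```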
In any case, you can encode text to bytes explicitly, and decode the bytes back to text.
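A sketch of that round trip (the string here is just an example):

```python
# str -> bytes via an explicit encoding...
text = "Ivan Krstić"
data = text.encode("utf-8")
print(data)  # the non-ASCII ć becomes the two bytes 0xc4 0x87

# ...and bytes -> str going the other way:
print(data.decode("utf-8"))
```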
There is a technical difference, however: u"something" instructs the parser that the literal is a Unicode literal, which can be slightly faster.

String operations on an array that mixes accented and special-character strings with regular ASCII strings can be quite an annoyance. These days I am involved with web/mobile automation. The other day I had a challenge: parse all strings on a page for a generic automation library I am writing. Since I was supposed to write a generic library to parse all strings on the page, I did not have the luxury of using IDs for a specific
control/component on the page. So I used the reliable XPath. As the solution was so easy, I found it difficult to believe that the code had handled all the edge cases. To clear my doubts, I went about testing it on different applications with different inputs, until I hit a road block where the page was returning a mixture of accented strings, strings containing special characters, and regular ASCII strings. Here is how the array looked:

strs = ["hell°", "hello", "tromsø", "boy", "stävänger", "ölut", "world"]

If you have hit a similar challenge, read on for the solution. Strings with accented or special characters are Unicode strings, while the regular ones are ASCII. To handle both kinds uniformly, one approach is to encode the Unicode strings to bytes using UTF-8. (For a history of Unicode, read a detailed article.) Here is how you do it in Python:

text = text.encode('utf-8')

Simple, isn't it? But wait: you need to strip out the extra non-ASCII escape bytes before doing plain string operations. Here is how you can strip those out:

import re

def extract_word(text):
    # The original function body was lost in extraction; one plausible
    # version simply drops every non-ASCII byte from the encoded string.
    return re.sub(rb'[^\x00-\x7f]', b'', text)

With the returned string, you are now good to go and do other string operations on the array. (If this has helped you, do let me know in the comment section.)

This tutorial has a related video course created by the Real Python team: Unicode in Python: Working With Character Encodings. Handling character encodings in Python or any other language can at times seem painful. Places such as Stack Overflow have thousands of questions stemming from
confusion over exceptions like UnicodeDecodeError and UnicodeEncodeError. This tutorial is different because it's not language-agnostic but instead deliberately Python-centric. You'll still get a language-agnostic primer, but you'll then dive into illustrations in Python, with text-heavy paragraphs kept to a minimum. You'll see how to use concepts of character encodings in live Python code. By the end of this tutorial, you'll:
Character encoding and numbering systems are so closely connected that they need to be covered in the same tutorial or else the treatment of either would be totally inadequate.

What's a Character Encoding?

There are tens if not hundreds of character encodings. The best way to start understanding what they are is to cover one of the simplest character encodings, ASCII. Whether you're self-taught or have a formal computer science background, chances are you've seen an ASCII table once or twice. ASCII is a good place to start learning about character encoding because it is a small and contained encoding. (Too small, as it turns out.) It encompasses the following:
So what is a more formal definition of a character encoding? At a very high level, it’s a way of translating characters (such as letters, punctuation, symbols, whitespace, and control characters) to integers and ultimately to bits. Each character can be encoded to a unique sequence of bits. Don’t worry if you’re shaky on the concept of bits, because we’ll get to them shortly. The various categories outlined represent groups of characters. Each single character has a corresponding code point, which you can think of as just an integer. Characters are segmented into different ranges within the ASCII table:
The entire ASCII table contains 128 characters. This table captures the complete character set that ASCII permits. If you don’t see a character here, then you simply can’t express it as printed text under the ASCII encoding scheme.
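In Python, you can move between a character and its code point with the built-ins ord() and chr() (a small sketch):

```python
# Each ASCII character maps to an integer code point, and back:
print(ord("a"))   # 97
print(chr(97))    # 'a'

# A slice of the uppercase range:
print([chr(cp) for cp in range(65, 70)])  # ['A', 'B', 'C', 'D', 'E']
```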
The string Module

Python's string module is a convenient one-stop-shop for string constants that fall in ASCII's character set. Here's the core of the module in all its glory:

Most of these constants should be self-documenting in their identifier names. You can use these constants for everyday string manipulation:
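A few of those constants, and one typical everyday use (a small sketch):

```python
import string

print(string.ascii_lowercase)  # 'abcdefghijklmnopqrstuvwxyz'
print(string.digits)           # '0123456789'

# Everyday manipulation: strip trailing punctuation from a word
word = "ASCII?!?!?"
print(word.rstrip(string.punctuation))  # 'ASCII'
```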
A Bit of a Refresher

Now is a good time for a short refresher on the bit, the most fundamental unit of information that a computer knows. A bit is a signal that has only two possible states. There are different ways of symbolically representing a bit that all mean the same thing:
Our ASCII table from the previous section uses what you and I would just call numbers (0 through 127), but what are more precisely called numbers in base 10 (decimal). You can also express each of these base-10 numbers with a sequence of bits (base 2). Here are the binary versions of 0 through 10 in decimal:
Notice that as the decimal number n increases, you need more significant bits to represent the character set up to and including that number. Here's a handy way to represent ASCII strings as sequences of bits in Python. Each character from the ASCII string gets pseudo-encoded into 8 bits, with spaces in between the 8-bit sequences that each represent a single character:
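A sketch of such a helper, along the lines the text describes (the function name is my own):

```python
def make_bitseq(s: str) -> str:
    """Pseudo-encode an ASCII string as space-separated 8-bit groups."""
    if not s.isascii():
        raise ValueError("ASCII only allowed")
    return " ".join(f"{ord(c):08b}" for c in s)

print(make_bitseq("bits"))  # 01100010 01101001 01110100 01110011
```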
The f-string f"{ord(c):08b}" uses Python's Format Specification Mini-Language: the 08 pads the result to eight places with zeros, and the b formats the integer as binary.
This trick is mainly just for fun, and it will fail very badly for any character that you don't see present in the ASCII table. We'll discuss how other encodings fix this problem later on.

We Need More Bits!

There's a critically important formula that's related to the definition of a bit. Given a number of bits, n, the number of distinct possible values that can be represented in n bits is 2ⁿ:
Here’s what that means:
There's a corollary to this formula: given a range of distinct possible values, how can we find the number of bits, n, that is required for the range to be fully represented? What you're trying to solve for is n in the equation 2ⁿ = x (where you already know x). Here's what that works out to:
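One way to compute this (the helper name is mine):

```python
from math import ceil, log2

def n_bits_required(nvalues: int) -> int:
    # Solve 2**n >= nvalues for the smallest whole n
    return ceil(log2(nvalues))

print(n_bits_required(256))        # 8: one full byte covers 256 values
print(n_bits_required(128))        # 7: exactly what ASCII needs
print(n_bits_required(1_114_112))  # 21 bits for all of Unicode's code points
```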
The reason that you need to use a ceiling is to account for ranges whose size is not a clean power of 2:
All of this serves to prove one concept: ASCII is, strictly speaking, a 7-bit code. The ASCII table that you saw above contains 128 code points and characters, 0 through 127 inclusive. This requires 7 bits: >>>
The issue with this is that modern computers don't store much of anything in 7-bit slots. They traffic in units of 8 bits, conventionally known as a byte. This means that the storage space used by ASCII is half-empty. If it's not clear why this is, think back to the decimal-to-binary table from above. You can express the numbers 0 and 1 with just 1 bit, or you can use 8 bits to express them as 00000000 and 00000001, respectively. You can express the numbers 0 through 3 with just 2 bits, or 00 through 11, or you can use 8 bits to express them as 00000000, 00000001, 00000010, and 00000011, respectively. The highest ASCII code point, 127, requires only 7 significant bits. Knowing this, you can see that ASCII uses only half of what an 8-bit byte can express, leaving the top bit of every byte unused.
ASCII's underutilization of the 8-bit bytes offered by modern computers led to a family of conflicting, informally specified encodings, each of which used the remaining 128 available code points of an 8-bit character encoding scheme for its own additional characters. Not only did these different encodings clash with each other, but each one of them was by itself still a grossly incomplete representation of the world's characters, regardless of the fact that they made use of one additional bit. Over the years, one character encoding mega-scheme came to rule them all. However, before we get there, let's talk for a minute about numbering systems, which are a fundamental underpinning of character encoding schemes.

Covering All the Bases: Other Number Systems

In the discussion of ASCII above, you saw that each character maps to an integer in the range 0 through 127. This range of numbers is expressed in decimal (base 10). It's the way that you, me, and the rest of us humans are used to counting, for no reason more complicated than that we have 10 fingers. But there are other numbering systems as well that are especially prevalent throughout the CPython source code. While the "underlying number" is the same, all numbering systems are just different ways of expressing the same number. If I asked you what number the string "11" represents, you'd be right to say decimal eleven. However, this string representation can express different underlying numbers in different numbering systems. In addition to decimal, the alternatives include the following common numbering systems:
But what does it mean for us to say that, in a certain numbering system, numbers are represented in base N? Here is the best way that I know of to articulate what this means: it's the number of fingers that you'd count on in that system. If you want a much fuller but still gentle introduction to numbering systems, Charles Petzold's Code is an incredibly cool book that explores the foundations of computer code in detail. One way to demonstrate how different numbering systems interpret the same thing is with Python's int() constructor, which accepts an optional base argument:
There’s a more common way of telling Python that your integer is typed in a base other than 10. Python accepts literal forms of each of the 3 alternative numbering systems above:
All of these are sub-forms of integer literals. You can see that these produce the same results, respectively, as the calls to int() with the corresponding non-default base values:
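For example:

```python
# int() with an explicit base parses the same string differently:
print(int("11", base=2))   # 3
print(int("11", base=8))   # 9
print(int("11", base=16))  # 17

# The literal forms give identical results:
print(0b11 == 3 and 0o11 == 9 and 0x11 == 17)  # True
```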
Here’s how you could type the binary, octal, and hexadecimal equivalents of the decimal numbers 0 through 20. Any of these are perfectly valid in
a Python interpreter shell or source code, and all work out to be of type int.
It's amazing just how prevalent these expressions are in the Python Standard Library. If you want to see for yourself, navigate to wherever your Python standard library source files sit and search them for these literal prefixes.

This should work on any Unix system that has a search tool such as grep available. What's the argument for using these alternate integer literals? In short, readability: a binary, octal, or hexadecimal form can express a value's intent (bit masks, file permissions, raw bytes) more clearly than decimal can.

Enter Unicode

As you saw, the problem with ASCII is that it's not nearly a big enough set of characters to accommodate the world's set of languages, dialects, symbols, and glyphs. (It's not even big enough for English alone.) Unicode fundamentally serves the same purpose as ASCII, but it just encompasses a way, way, way bigger set of code points. There are a handful of encodings that emerged chronologically between ASCII and Unicode, but they are not really worth mentioning just yet because Unicode and one of its encoding schemes, UTF-8, have become so predominantly used.

Think of Unicode as a massive version of the ASCII table—one that has 1,114,112 possible code points. That's 0 through 1,114,111, or 0 through 17 × 2¹⁶ − 1, or 0x10FFFF in hexadecimal.
In the interest of being technically exacting, Unicode itself is not an encoding. Rather, Unicode is implemented by different character encodings, which you'll see soon. Unicode is better thought of as a map (something like a dict) or a two-column database table that associates characters with code points.

Unicode contains virtually every character that you can imagine, including additional non-printable ones too. One of my favorites is the pesky right-to-left mark, which has code point 8207 and is used in text with both left-to-right and right-to-left language scripts, such as an article containing both English and Arabic paragraphs.

Unicode vs UTF-8

It didn't take long for people to realize that all of the world's characters could not be packed into one byte each. It's evident from this that modern, more comprehensive encodings would need to use multiple bytes to encode some characters. You also saw above that Unicode is not technically a full-blown character encoding. Why is that? There is one thing that Unicode doesn't tell you: it doesn't tell you how to get actual bits from text—just code points. It doesn't tell you enough about how to convert text to binary data and vice versa. Unicode is an abstract encoding standard, not an encoding.

That's where UTF-8 and other encoding schemes come into play. The Unicode standard (a map of characters to code points) defines several different encodings from its single character set. UTF-8, as well as its lesser-used cousins UTF-16 and UTF-32, are encoding formats for representing Unicode characters as binary data of one or more bytes per character. We'll discuss UTF-16 and UTF-32 in a moment, but UTF-8 has taken the largest share of the pie by far.

That brings us to a definition that is long overdue. What does it mean, formally, to encode and decode?

Encoding and Decoding in Python 3

Python 3's str type is meant to represent human-readable text and can contain any Unicode character. The bytes type, conversely, represents binary data: a sequence of raw bytes with no encoding intrinsically attached. Encoding and decoding is the process of going from one to the other:

[Encoding vs decoding (Image: Real Python)]

In Python 3, str.encode() converts text to bytes, and bytes.decode() converts back:
The result of str.encode() is a bytes object, and both bytes literals and the representations of bytes permit only ASCII characters. This is why, when calling .encode() on text that contains non-ASCII characters, those characters show up as escaped byte sequences:
That is, a character outside the ASCII range requires more than one byte in its UTF-8 representation.

Python 3: All-In on Unicode

Python 3 is all-in on Unicode and UTF-8 specifically. Here's what that means:
There is one other property that is more nuanced, which is that the default encoding for the built-in open() is platform-dependent:
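You can inspect that default yourself; the value printed below depends on your platform and locale settings:

```python
import locale

# open() falls back to this encoding when none is given explicitly
print(locale.getpreferredencoding())
```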
Again, the lesson here is to be careful about making assumptions when it comes to the universality of UTF-8, even if it is the predominant encoding. It never hurts to be explicit in your code.

One Byte, Two Bytes, Three Bytes, Four

A crucial feature is that UTF-8 is a variable-length encoding. It's tempting to gloss over what this means, but it's worth delving into. Think back to the section on ASCII. Everything in extended-ASCII-land demands at most one byte of space. You can quickly prove this with the following generator expression:
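A sketch of that check, using Latin-1 to stand in for the extended range:

```python
# Every extended-ASCII (Latin-1) code point encodes to exactly one byte:
print(all(len(chr(i).encode("latin-1")) == 1 for i in range(256)))  # True
```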
UTF-8 is quite different. A given Unicode character can occupy anywhere from one to four bytes. Here's an example of a single Unicode character taking up four bytes:
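For instance (the emoji is just an example of a character outside the Basic Multilingual Plane):

```python
face = "🤨"  # U+1F928, well beyond the Basic Multilingual Plane
print(len(face))                  # 1: one character as a Python str...
print(len(face.encode("utf-8")))  # 4: ...but four bytes once encoded
```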
This is a subtle but important point: the length of a single character as a Python str is always 1, no matter how many bytes it occupies once encoded.
The table below summarizes what general types of characters fit into each byte-length bucket:
*Such as English, Arabic, Greek, and Irish

What About UTF-16 and UTF-32?

Let's get back to two other encoding variants, UTF-16 and UTF-32. The difference between these and UTF-8 is substantial in practice. Here's an example of how major the difference is with a round-trip conversion:
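Here is a sketch of that round trip gone wrong:

```python
letters = "αβγδ"                   # four Greek letters
rawdata = letters.encode("utf-8")  # eight bytes: two per letter

print(rawdata.decode("utf-8"))     # 'αβγδ' -- same codec both ways: fine
print(rawdata.decode("utf-16"))    # the same bytes misread as UTF-16
```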
In this case, encoding four Greek letters with UTF-8 and then decoding the resulting bytes with UTF-16 produces text in a completely different script. Glaringly wrong results like this are possible when the same encoding isn't used bidirectionally: two variations of decoding the same bytes object can produce results that aren't even in the same language.

This table summarizes the range or number of bytes under UTF-8, UTF-16, and UTF-32:
One other curious aspect of the UTF family is that UTF-8 will not always take up less space than UTF-16. That may seem mathematically counterintuitive, but it's quite possible:
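For instance, with a short run of CJK text (the sample string is arbitrary):

```python
text = "記者 鄭啟源 羅智堅"
print(len(text.encode("utf-8")))   # 26: 3 bytes per CJK character, 1 per space
print(len(text.encode("utf-16")))  # 22: 2 bytes each, plus a 2-byte BOM
```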
The reason for this is that code points in the range U+0800 through U+FFFF (2048 through 65535 in decimal) take up three bytes in UTF-8 versus only two in UTF-16.

I'm not by any means recommending that you jump aboard the UTF-16 train, regardless of whether or not you operate in a language whose characters are commonly in this range. Among other reasons, one of the strong arguments for using UTF-8 is that, in the world of encoding, it's a great idea to blend in with the crowd. Not to mention, it's 2019: computer memory is cheap, so saving 4 bytes by going out of your way to use UTF-16 is arguably not worth it.

Python's Built-In Functions

You've made it through the hard part. Time to use what you've seen thus far in Python. Python has a group of built-in functions that relate in some way to numbering systems and character encoding:
These can be logically grouped together based on their purpose:
Here’s a more detailed look at each of these nine functions:
The following examples demonstrate each of these functions.
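A quick tour, one line per function (the values are chosen purely for illustration):

```python
print(ascii("café"))           # "'caf\\xe9'": non-ASCII escaped
print(bin(65))                 # '0b1000001'
print(oct(65))                 # '0o101'
print(hex(65))                 # '0x41'
print(ord("A"))                # 65
print(chr(65))                 # 'A'
print(int("41", base=16))      # 65
print(str(65))                 # '65'
print(bytes("café", "utf-8"))  # b'caf\xc3\xa9'
```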
Python String Literals: Ways to Skin a Cat

Rather than using the str() constructor, it's commonplace to type a str literally:
That may seem easy enough. But the interesting side of things is that, because Python 3 is Unicode-centric through and through, you can “type” Unicode characters that you probably won’t even find on your keyboard. You can copy and paste this right into a Python 3 interpreter shell: >>>
Besides placing the actual, unescaped Unicode characters in the console, there are other ways to type Unicode strings as well. One of the densest sections of Python’s documentation is the portion on lexical analysis, specifically the section on string and bytes literals. Personally, I had to read this section about one, two, or maybe nine times for it to really sink in. Part of what it says is that there are up to six ways that Python will allow you to type the same Unicode character. The first and most common way is to type the character itself literally, as you’ve already seen. The tough part with this method is finding the actual keystrokes. That’s where the other methods for getting and representing characters come into play. Here’s the full list:
Here's some proof and validation of the above:
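For instance, all of these spell the lowercase "a", using the forms listed above:

```python
# Literal, hex escape, octal escape, 16-bit escape, 32-bit escape, name escape:
print("a" == "\x61" == "\141" == "\u0061" == "\U00000061"
      == "\N{LATIN SMALL LETTER A}")  # True
```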
Now, there are two main caveats:
For instance, if you consult unicode-table.com for information on the Gothic letter faihu (or fehu), you'll see that its code point is U+10346.

How do you put this into a Python str? The code point is too wide for the \u escape, which takes exactly four hex digits, so you need the eight-digit form: "\U00010346". This also means that the \u escape can only express code points up to 0xFFFF.

Other Encodings Available in Python

So far, you've seen four character encodings:
There are a ton of other ones out there. One example is Latin-1 (also called ISO-8859-1), which is technically the default for the Hypertext Transfer Protocol (HTTP), per RFC 2616. Windows has its own Latin-1 variant called cp1252. The complete list of accepted encodings is buried way down in the documentation for the codecs module in the Standard Library.

There's one more useful recognized encoding to be aware of:
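One such codec, part of Python's standard set, is "unicode-escape"; a sketch:

```python
# "unicode-escape" turns non-ASCII characters into their backslash escapes:
alef = "א"  # HEBREW LETTER ALEF, U+05D0
escaped = alef.encode("unicode-escape")
print(escaped)                                   # b'\\u05d0'
print(escaped.decode("unicode-escape") == alef)  # True
```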
You Know What They Say About Assumptions…

Just because Python makes the assumption of UTF-8 encoding for files and code that you generate doesn't mean that you, the programmer, should operate with the same assumption for external data. Let's say that again because it's a rule to live by: when you receive binary data (bytes) from a third-party source, whether it be from a file or over a network, the best practice is to check that the data specifies an encoding. If it doesn't, then it's on you to ask. All I/O happens in bytes, not text, and bytes are just ones and zeros to a computer until you tell it otherwise by informing it of an encoding. Here's an example of where things can go wrong. You're subscribed to an API that sends you a recipe of the day, which you receive in bytes and have always decoded without issue:
It looks as if the recipe calls for some flour, but we don’t know how much: >>>
Uh oh. There's that pesky UnicodeDecodeError that can bite you when you make assumptions about encoding:
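A minimal reconstruction of the failure, assuming the sender actually used Latin-1 (the recipe bytes are illustrative):

```python
data = b"\xbc cup of flour"   # what the API actually sent

try:
    data.decode("utf-8")      # our long-standing assumption
except UnicodeDecodeError as exc:
    print(exc)                # 0xbc is not a valid UTF-8 start byte

print(data.decode("latin-1")) # '¼ cup of flour'
```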
There we go. In Latin-1, every character fits into a single byte, whereas the "¼" character takes up two bytes in UTF-8 (0xC2 0xBC).

The lesson here is that it can be dangerous to assume the encoding of any data that is handed off to you. It's usually UTF-8 these days, but it's the small percentage of cases where it's not that will blow things up. If you really do need to abandon ship and guess an encoding, then have a look at a charset-detection library.

Odds and Ends: unicodedata

We would be remiss not to mention unicodedata from the Standard Library, which lets you look up information in the Unicode Character Database (UCD):
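A taste of what it can do:

```python
import unicodedata

# Map a character to its official Unicode name, and back again:
print(unicodedata.name("¼"))            # 'VULGAR FRACTION ONE QUARTER'
print(unicodedata.lookup("EURO SIGN"))  # '€'
```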
Wrapping Up

In this article, you've decoded the wide and imposing subject of character encoding in Python. You've covered a lot of ground here:
Now, go forth and encode!

Resources

For even more detail about the topics covered here, check out these resources:
The Python docs have two pages on the subject:
Does Python allow special characters?

Apart from these restrictions, Python allows identifiers to be a combination of lowercase letters (a to z), uppercase letters (A to Z), digits (0 to 9), or an underscore (_).
Can you use Unicode in Python?

Python's string type uses the Unicode Standard for representing characters, which lets Python programs work with all these different possible characters. Unicode (https://www.unicode.org/) is a specification that aims to list every character used by human languages and give each character its own unique code.
Does Python have a character type?

Like many other popular programming languages, Python represents strings as sequences of Unicode characters. However, Python does not have a separate character data type; a single character is simply a string with a length of 1.