The great escapes:
    >>> 'e' == u'e'
    True
    >>> '\xc9' == u'\xc9'
    False
    >>> u'\xc9' == u'\u00c9' == u'\U000000c9'
    True

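
The session on this slide is Python 2. As a rough sketch of how the same comparisons look under Python 3 (an assumption, since the slide does not cover it), where every str literal is already Unicode and bytes is a separate type; the variable names are illustrative only:

```python
# Python 3 sketch: str literals are Unicode, so the u prefix is optional.
e_acute_x = '\xc9'          # two-hex-digit escape
e_acute_u = '\u00c9'        # four-hex-digit escape
e_acute_U = '\U000000c9'    # eight-hex-digit escape

# All three escapes name the same code point, U+00C9.
same_code_point = e_acute_x == e_acute_u == e_acute_U

# A bytes literal is a different type and does not compare equal to a str.
bytes_equal_text = b'\xc9' == '\xc9'

print(same_code_point, bytes_equal_text)
```

The second comparison is the Python 3 counterpart of the False result on the slide: a raw byte 0xC9 is not the same thing as the character U+00C9.
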
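UTF-8
● There is no difference between an ASCII-encoded and a UTF-8-encoded file if no “extended” (non-ASCII) characters appear in it.
● Except if there's a BOM (byte order mark):
    ● UTF-8: EF BB BF ( shows up as ï»¿ when misread as Latin-1 )
    ● UTF-16: FE FF big-endian, FF FE little-endian ( the BOM is U+FEFF; the byte-swapped value U+FFFE is reserved as a noncharacter for this very purpose )
NOT HELPFUL: [slide content not preserved]
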
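
For the BOM case, a minimal Python 3 sketch of detecting and dropping a UTF-8 BOM. The helper name strip_utf8_bom is hypothetical; codecs.BOM_UTF8 and the 'utf-8-sig' codec come from the standard library.

```python
import codecs

def strip_utf8_bom(raw):
    """Return raw without a leading UTF-8 BOM (EF BB BF), if present."""
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):]
    return raw

plain = b'hello'
with_bom = codecs.BOM_UTF8 + plain

# Pure-ASCII text encodes to identical bytes under ASCII and UTF-8.
ascii_matches_utf8 = 'hello'.encode('ascii') == 'hello'.encode('utf-8')

# The 'utf-8-sig' codec drops a leading BOM while decoding.
decoded = with_bom.decode('utf-8-sig')
```

Decoding with 'utf-8-sig' is usually the simpler choice when reading files of unknown origin, since it also accepts input that has no BOM.
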
Encode/decode:
● Encode to bytes
● Decode to unicode
● or, forget decode completely:
    >>> 'fort\xc3\xa9'.decode('utf-8')
    u'fort\xe9'
    >>> unicode('fort\xc3\xa9', 'utf-8')
    u'fort\xe9'

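
A small round-trip sketch of the two directions, written for Python 3 (an assumption; the slide uses Python 2, where the text type is called unicode). The variable names are illustrative.

```python
raw = b'fort\xc3\xa9'            # UTF-8 bytes, e.g. as read from a file or socket

text = raw.decode('utf-8')       # decode: bytes -> text
assert text == 'fort\xe9'        # U+00E9 is é

# str(bytes, encoding) is an equivalent spelling that avoids calling decode().
assert str(raw, 'utf-8') == text

round_tripped = text.encode('utf-8')   # encode: text -> bytes
```

Note that the text has five characters while the byte string has six bytes, because é takes two bytes in UTF-8.
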
This is why we declare encodings:
RIGHT SINGLE QUOTATION MARK U+2019
    >>> u'\u2019'.encode('utf-8')
    '\xe2\x80\x99'
    >>> '\xe2\x80\x99'.decode('cp1252')
    u'\xe2\u20ac\u2122'
    >>> print u'\xe2\u20ac\u2122'
    â€™
All because of a missing <meta charset="utf-8">

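
The same mix-up can be reproduced, and in simple cases reversed, in Python 3. This is a sketch under the assumption that the text was garbled exactly once (UTF-8 bytes decoded as cp1252) and not altered afterwards; the variable names are illustrative.

```python
quote = '\u2019'                        # RIGHT SINGLE QUOTATION MARK

utf8_bytes = quote.encode('utf-8')      # three bytes: E2 80 99
garbled = utf8_bytes.decode('cp1252')   # wrong codec: one character becomes three

# Undo the mistake by reversing it: back to the original bytes,
# then decode with the codec that was actually used.
repaired = garbled.encode('cp1252').decode('utf-8')
```

The repair only works while every garbled character still maps back to a cp1252 byte; text that has been through several lossy conversions generally cannot be recovered this way.
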
If you REALLY need ASCII:
    >>> print u'r\xe9sum\xe9'
    résumé
    >>> print u'r\xe9sum\xe9'.encode('ascii', errors='ignore')
    rsum
    >>> print u'r\xe9sum\xe9'.encode('ascii', errors='replace')
    r?sum?
    $ pip install unidecode
    >>> from unidecode import unidecode
    >>> print unidecode(u'r\xe9sum\xe9')
    resume

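
When installing unidecode is not an option, a rough standard-library approximation is to decompose accented letters and then discard whatever is not ASCII. This Python 3 sketch is not how unidecode works, and the helper name fold_to_ascii is hypothetical.

```python
import unicodedata

def fold_to_ascii(text):
    """Drop accents by decomposing (NFKD), then discarding non-ASCII code points."""
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')

print(fold_to_ascii('r\xe9sum\xe9'))
```

Unlike unidecode, this does not transliterate: a character with no ASCII decomposition, such as ß, is simply dropped.
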
Combining marks:
LATIN SMALL LETTER E (U+0065) + COMBINING DIAERESIS (U+0308)
LATIN SMALL LETTER E WITH DIAERESIS (U+00EB)
    >>> print u'Zo\xeb'
    Zoë
    >>> print u'Zoe\u0308'
    Zoë
    >>> from unicodedata import normalize
    >>> normalize('NFC', u'Zoe\u0308')
    u'Zo\xeb'
    >>> normalize('NFD', u'Zo\xeb')
    u'Zoe\u0308'
OS X on HFS+ normalises filenames, others don't

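
Because the two spellings look identical but compare unequal, comparisons (of filenames, for example) are safer after normalising both sides to one form. A minimal Python 3 sketch; the helper name same_text is hypothetical.

```python
from unicodedata import normalize

composed = 'Zo\xeb'        # U+00EB, one precomposed character
decomposed = 'Zoe\u0308'   # U+0065 followed by U+0308

def same_text(a, b, form='NFC'):
    """Compare two strings after normalising both to the same form."""
    return normalize(form, a) == normalize(form, b)

print(composed == decomposed)           # raw comparison
print(same_text(composed, decomposed))  # normalised comparison
```

Either NFC or NFD works for the comparison, as long as both sides use the same form.
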
PEP-8
Code in the core Python distribution should always use the ASCII or Latin-1 encoding (a.k.a. ISO-8859-1). For Python 3.0 and beyond, UTF-8 is preferred over Latin-1, see PEP 3120.
Files using ASCII should not have a coding cookie. Latin-1 (or UTF-8) should only be used when a comment or docstring needs to mention an author name that requires Latin-1; otherwise, using \x, \u or \U escapes is the preferred way to include non-ASCII data in string literals.
For Python 3.0 and beyond, the following policy is prescribed for the standard library (see PEP 3131): All identifiers in the Python standard library MUST use ASCII-only identifiers, and SHOULD use English words wherever feasible (in many cases, abbreviations and technical terms are used which aren't English). In addition, string literals and comments must also be in ASCII. The only exceptions are (a) test cases testing the non-ASCII features, and (b) names of authors. Authors whose names are not based on the latin alphabet MUST provide a latin transliteration of their names.

Libraries:
● unidecode
    ● For when you absolutely need ASCII – folds accents and transliterates from many languages.
● chardet
    ● Guesses the most likely character encoding of a given bytestring. Based on Mozilla's code.
● unicode-nazi
    ● Yells about any implicit unicode/bytestring conversion in your code. Useful when porting code to Python 3.

Links:
● All About Python and Unicode
    ● A detailed reference on all things pertaining to Python and Unicode.
● Pragmatic Unicode
    ● PyCon 2012 talk on Unicode in Python, covering v3 as well.
● Love Hotels and Unicode
    ● A look at the inside politics and other quirky aspects of Unicode.
● Python Unicode – Fixing UTF-8 encoded as Latin-1
    ● Another poor soul who ran into this problem.
● Why the Obama tweet was garbled
    ● A quick explanation with comments from the people responsible.
● Unicode Support Shootout
    ● An advanced treatise on how most languages (including Python) fail at Unicode.