Except UnicodeError: battling Unicode demons in Python


Issues that come up in practice when working with Unicode in Python, and how to avoid them.

Published in: Technology

  1. except UnicodeError: # A practical guide to fighting Unicode demons
     Aram Dulyan (@Aramgutang)
     Sydney Python Users group (SyPy)
     05 APR 2012
  2. What is Unicode?
  3. Looking inside:
  4. In Python: class unicode(basestring): ...
  5. The great escapes:
       >>> 'e' == u'e'
       True
       >>> '\xc9' == u'\xc9'
       False
       >>> u'\xc9' == u'\u00c9' == u'\U000000c9'
       True
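A minimal sketch of the same comparison, assuming Python 3 rather than the Python 2 used on the slides. In Python 3 every plain string literal is Unicode text, so the three escape forms always name the same code point, and a bytes literal never compares equal to text. The variable names are illustrative only.

```python
# Python 3: a str literal is always Unicode text, so all three escapes
# name the same code point, U+00C9 (LATIN CAPITAL LETTER E WITH ACUTE).
same_code_point = "\xc9" == "\u00c9" == "\U000000c9"

# A bytes literal is a different type; comparing it with str is False
# (no implicit decoding is attempted).
bytes_equals_text = b"\xc9" == "\xc9"

print(same_code_point)    # True
print(bytes_equals_text)  # False
```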
  6. UTF-8
     ● There is no difference between an ASCII-encoded and a UTF-8-encoded file if no “extended” characters appear in it.
     ● Except if there's a BOM (byte order mark):
       ● UTF-8: EF BB BF (rendered as "ï»¿" when misread as Latin-1)
       ● UTF-16: FE FF (the byte-swapped value, U+FFFE, is reserved for this very purpose)
     NOT HELPFUL:
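The two points on this slide can be checked with the standard library. This is a sketch assuming Python 3; it relies on the built-in `utf-8-sig` codec, which writes the BOM when encoding and strips it when decoding. The sample text is made up for illustration.

```python
import codecs

text = "plain ASCII text"  # illustrative sample with no non-ASCII characters

# With no "extended" characters, ASCII and UTF-8 produce identical bytes.
identical = text.encode("ascii") == text.encode("utf-8")
print(identical)  # True

# 'utf-8-sig' prepends the BOM (EF BB BF) on encode and strips it on decode.
with_bom = text.encode("utf-8-sig")
print(with_bom[:3] == codecs.BOM_UTF8)       # True
print(with_bom.decode("utf-8-sig") == text)  # True

# Plain 'utf-8' keeps the BOM as a leading U+FEFF character in the text.
leading_char = with_bom.decode("utf-8")[0]
print(ascii(leading_char))  # '\ufeff'
```

Decoding with `utf-8-sig` is harmless when no BOM is present, which makes it a reasonable default for reading files of unknown origin.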
  7. Encode/decode:
     ● Encode to bytes
     ● Decode to unicode
     ● or, forget decode completely:
       >>> 'fort\xc3\xa9'.decode('utf-8')
       u'fort\xe9'
       >>> unicode('fort\xc3\xa9', 'utf-8')
       u'fort\xe9'
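The round trip looks like this in Python 3, which is an assumption here since the slides use Python 2. There, `str` plays the role of `unicode`, `bytes` is the byte-string type, and `str(data, encoding)` is the counterpart of the `unicode(data, encoding)` call shown on the slide.

```python
text = "fort\xe9"  # 'forté' as Unicode text (str)

# Encode: text -> bytes
encoded = text.encode("utf-8")
print(encoded)  # b'fort\xc3\xa9'

# Decode: bytes -> text
decoded = encoded.decode("utf-8")
print(decoded == text)  # True

# str(bytes, encoding) is equivalent to bytes.decode(encoding)
via_constructor = str(encoded, "utf-8")
print(via_constructor == decoded)  # True
```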
  8. This is why we declare encodings:
     RIGHT SINGLE QUOTATION MARK, U+2019
       >>> u'\u2019'.encode('utf-8')
       '\xe2\x80\x99'
       >>> '\xe2\x80\x99'.decode('cp1252')
       u'\xe2\u20ac\u2122'
       >>> print u'\xe2\u20ac\u2122'
       â€™
     All because of a missing <meta charset="utf-8">
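The same mix-up can be reproduced, and in this case undone, in Python 3 (an assumption; the slide shows Python 2). The repair step only works when every byte survived the wrong decode, so treat it as a sketch rather than a general fix.

```python
original = "\u2019"  # RIGHT SINGLE QUOTATION MARK
utf8_bytes = original.encode("utf-8")
print(utf8_bytes)  # b'\xe2\x80\x99'

# Decoding UTF-8 bytes with the wrong codec yields three unrelated characters.
garbled = utf8_bytes.decode("cp1252")
print(ascii(garbled))  # '\xe2\u20ac\u2122'

# Reversing the mistake: re-encode with the wrong codec, decode with the right one.
repaired = garbled.encode("cp1252").decode("utf-8")
print(repaired == original)  # True
```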
  9. If you REALLY need ASCII:
       >>> print u'r\xe9sum\xe9'
       résumé
       >>> print u'r\xe9sum\xe9'.encode(errors='ignore')
       rsum
       >>> print u'r\xe9sum\xe9'.encode(errors='replace')
       r?sum?
       $ pip install unidecode
       >>> from unidecode import unidecode
       >>> print unidecode(u'r\xe9sum\xe9')
       resume
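For comparison, a Python 3 sketch of the same fallbacks (Python 3 is an assumption; the slide uses Python 2, where the default codec is ASCII). The `strip_accents` helper is hypothetical and is not how unidecode works: it only removes combining marks using the standard library and does no transliteration of non-Latin scripts.

```python
import unicodedata

word = "r\xe9sum\xe9"  # 'résumé'

# The codec must be named explicitly in Python 3; the default is UTF-8.
ignored = word.encode("ascii", errors="ignore")
replaced = word.encode("ascii", errors="replace")
print(ignored)   # b'rsum'
print(replaced)  # b'r?sum?'


def strip_accents(text):
    """Hypothetical helper: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents(word))  # resume
```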
  10. The “u” prefix:
        >>> '%s %s' % (u'unicode', 'string')
        u'unicode string'
        >>> 'string ' + u'unicode'
        u'string unicode'

        class Loonie(object):
            def __str__(self):
                return 'Throatwobbler Mangrove'
            def __unicode__(self):
                return u'Richard Luxuryyacht'

        >>> '%s' % Loonie()
        'Throatwobbler Mangrove'
        >>> u'%s' % Loonie()
        u'Richard Luxuryyacht'
        >>> '%s %s' % (Loonie(), u'is silly')
        u'Throatwobbler Mangrove is silly'
  11. Combining marks:
        LATIN SMALL LETTER E (U+0065) + COMBINING DIAERESIS (U+0308)
        LATIN SMALL LETTER E WITH DIAERESIS (U+00EB)
        >>> print u'Zo\xeb'
        Zoë
        >>> print u'Zoe\u0308'
        Zoë
        >>> from unicodedata import normalize
        >>> normalize('NFC', u'Zoe\u0308')
        u'Zo\xeb'
        >>> normalize('NFD', u'Zo\xeb')
        u'Zoe\u0308'
      OS X on HFS+ normalises filenames, others don't
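One practical consequence is that two strings can print identically and still compare unequal. The sketch below assumes Python 3, and the `same_text` helper is a hypothetical name; it normalises both sides to NFC before comparing.

```python
from unicodedata import normalize

composed = "Zo\xeb"        # precomposed U+00EB
decomposed = "Zoe\u0308"   # 'e' followed by COMBINING DIAERESIS

# They render the same but are different code point sequences.
print(composed == decomposed)          # False
print(len(composed), len(decomposed))  # 3 4


def same_text(a, b):
    """Hypothetical helper: compare after normalising both sides to NFC."""
    return normalize("NFC", a) == normalize("NFC", b)


print(same_text(composed, decomposed))  # True
```

Normalising to a single form at the boundaries of a program (file names, user input, database keys) avoids this class of mismatch.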
  12. Warning:
  13. PEP-8
      Code in the core Python distribution should always use the ASCII or Latin-1 encoding (a.k.a. ISO-8859-1). For Python 3.0 and beyond, UTF-8 is preferred over Latin-1, see PEP 3120.
      Files using ASCII should not have a coding cookie. Latin-1 (or UTF-8) should only be used when a comment or docstring needs to mention an author name that requires Latin-1; otherwise, using \x, \u or \U escapes is the preferred way to include non-ASCII data in string literals.
      For Python 3.0 and beyond, the following policy is prescribed for the standard library (see PEP 3131): All identifiers in the Python standard library MUST use ASCII-only identifiers, and SHOULD use English words wherever feasible (in many cases, abbreviations and technical terms are used which aren't English). In addition, string literals and comments must also be in ASCII. The only exceptions are (a) test cases testing the non-ASCII features, and (b) names of authors. Authors whose names are not based on the latin alphabet MUST provide a latin transliteration of their names.
  14. Libraries:
      ● unidecode
        ● For when you absolutely need ASCII: folds accents and transliterates from many languages.
      ● chardet
        ● Guesses the most likely character encoding of a given bytestring. Based on Mozilla's code.
      ● unicode-nazi
        ● Yells about any implicit unicode/bytestring conversion in your code. Useful when porting code to Python 3.
  15. Links:
      ● All About Python and Unicode
        ● A detailed reference on all things pertaining to Python and Unicode.
      ● Pragmatic Unicode
        ● PyCon 2012 talk on Unicode in Python, covering v3 as well.
      ● Love Hotels and Unicode
        ● A look at the inside politics and other quirky aspects of Unicode.
      ● Python Unicode – Fixing UTF-8 encoded as Latin-1
        ● Another poor soul who ran into this problem.
      ● Why the Obama tweet was garbled
        ● A quick explanation with comments from the people responsible.
      ● Unicode Support Shootout
        ● An advanced treatise on how most languages (including Python) fail at Unicode.