Except UnicodeError: battling Unicode demons in Python
Issues that come up in practice when working with Unicode in Python, and how to avoid them.

    Presentation Transcript

    • except UnicodeError: # A practical guide to fighting Unicode demons
      Aram Dulyan (@Aramgutang)
      Sydney Python Users group (SyPy)
      05 APR 2012
    • What is Unicode?
    • Looking inside:
    • In Python: class unicode(basestring): ...
    • The great escapes:
      >>> 'e' == u'e'
      True
      >>> '\xc9' == u'\xc9'
      False
      >>> u'\xc9' == u'\u00c9' == u'\U000000c9'
      True
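The slide's session is Python 2. As a rough sketch of the same comparison under Python 3 (an assumption; the talk predates widespread Python 3 use), every plain string literal is already Unicode, so the three escape forms agree, while a bytes literal with the same escape is a single raw byte that never compares equal to text:

```python
# Python 3 sketch: the three escape lengths name the same code point, U+00C9.
short = '\xc9'            # 2 hex digits
medium = '\u00c9'         # 4 hex digits
long_form = '\U000000c9'  # 8 hex digits
all_equal = short == medium == long_form

# The bytes literal is one raw byte (0xC9), not a character.
raw = b'\xc9'
bytes_equal_text = (raw == short)  # bytes and str never compare equal

print(all_equal, bytes_equal_text, ord(short) == raw[0])
```

Here the Python 2 surprise (a silent `False` with a warning) becomes an explicit type distinction.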
    • UTF-8
      ● There is no difference between an ASCII-encoded and a UTF-8 encoded file if no “extended” characters appear in it.
      ● Except if there's a BOM (byte order mark):
        ● UTF-8: EF BB BF (rendered as ï»¿ when misread as Latin-1)
        ● UTF-16: FE FF ( U+FFFE is reserved for this very purpose )
      NOT HELPFUL:
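A minimal sketch of the BOM behaviour described on this slide, assuming Python 3 and the standard `utf-8-sig` codec (neither is mentioned in the talk). It checks that pure-ASCII text encodes to identical bytes either way, and shows what happens to the BOM when it is decoded with and without BOM awareness:

```python
import codecs

text = 'plain ascii'

# Without extended characters, the ASCII and UTF-8 encodings are byte-identical.
same_bytes = text.encode('ascii') == text.encode('utf-8')

# 'utf-8-sig' prepends the BOM on encode and strips it on decode.
with_bom = text.encode('utf-8-sig')
has_bom = with_bom.startswith(codecs.BOM_UTF8)  # bytes EF BB BF

# Plain 'utf-8' keeps the BOM as a leading U+FEFF character in the text.
kept = with_bom.decode('utf-8')
stripped = with_bom.decode('utf-8-sig')

print(same_bytes, has_bom, repr(kept[0]), stripped == text)
```

Decoding files of unknown origin with `utf-8-sig` is one way to tolerate a BOM, since that codec also accepts input that has none.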
    • Encode/decode:
      ● Encode to bytes
      ● Decode to unicode
      ● or, forget decode completely:
      >>> 'fort\xc3\xa9'.decode('utf-8')
      u'fort\xe9'
      >>> unicode('fort\xc3\xa9', 'utf-8')
      u'fort\xe9'
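The direction of each call is easy to mix up, so here is a small round trip, sketched for Python 3 (an assumption; the slide uses Python 2, where `unicode(...)` plays the role `str(...)` plays here). The variable names are illustrative only:

```python
text = 'fort\xe9'                  # Unicode text, 5 characters

data = text.encode('utf-8')        # text -> bytes (6 bytes: the accent takes two)
round_trip = data.decode('utf-8')  # bytes -> text
constructor = str(data, 'utf-8')   # same decode, spelled as a constructor call

# Decoding with the wrong codec fails loudly for non-ASCII bytes.
try:
    data.decode('ascii')
except UnicodeDecodeError:
    ascii_decode_failed = True
else:
    ascii_decode_failed = False

print(data, round_trip == text, constructor == text, ascii_decode_failed)
```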
    • This is why we declare encodings:
      RIGHT SINGLE QUOTATION MARK U+2019
      >>> u'\u2019'.encode('utf-8')
      '\xe2\x80\x99'
      >>> '\xe2\x80\x99'.decode('cp1252')
      u'\xe2\u20ac\u2122'
      >>> print u'\xe2\u20ac\u2122'
      â€™
      All because of a missing <meta charset="utf-8">
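The garbling above can be reproduced and, in this particular case, undone. The following is a sketch assuming Python 3; the reverse step only works when every garbled character maps back to a cp1252 byte, so treat it as a diagnostic trick rather than a general repair:

```python
quote = '\u2019'                       # RIGHT SINGLE QUOTATION MARK

utf8_bytes = quote.encode('utf-8')     # three bytes: E2 80 99
garbled = utf8_bytes.decode('cp1252')  # wrong codec: three unrelated characters

# Undo the mistake: re-encode with the wrong codec, decode with the right one.
repaired = garbled.encode('cp1252').decode('utf-8')

print(utf8_bytes, ascii(garbled), repaired == quote)
```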
    • If you REALLY need ASCII:
      >>> print u'r\xe9sum\xe9'
      résumé
      >>> print u'r\xe9sum\xe9'.encode('ascii', errors='ignore')
      rsum
      >>> print u'r\xe9sum\xe9'.encode('ascii', errors='replace')
      r?sum?
      $ pip install unidecode
      >>> from unidecode import unidecode
      >>> print unidecode(u'r\xe9sum\xe9')
      resume
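For comparison, a standard-library-only sketch in Python 3 (an assumption; the slide uses Python 2 and the third-party unidecode package). The `strip_accents` helper is hypothetical and much cruder than unidecode: it only drops combining marks after decomposition and does not transliterate letters such as "ß" or "ø":

```python
import unicodedata

text = 'r\xe9sum\xe9'

ignored = text.encode('ascii', errors='ignore')    # drops unencodable characters
replaced = text.encode('ascii', errors='replace')  # substitutes '?' for each one


def strip_accents(s):
    """Hypothetical helper: decompose, then drop the combining marks."""
    decomposed = unicodedata.normalize('NFKD', s)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))


folded = strip_accents(text)

print(ignored, replaced, folded)
```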
    • The “u” prefix:
      >>> '%s %s' % (u'unicode', 'string')
      u'unicode string'
      >>> 'string ' + u'unicode'
      u'string unicode'

      class Loonie(object):
          def __str__(self):
              return 'Throatwobbler Mangrove'
          def __unicode__(self):
              return u'Richard Luxuryyacht'

      >>> '%s' % Loonie()
      'Throatwobbler Mangrove'
      >>> u'%s' % Loonie()
      u'Richard Luxuryyacht'
      >>> '%s %s' % (Loonie(), u'is silly')
      u'Throatwobbler Mangrove is silly'
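The implicit promotion shown on this slide is specific to Python 2. As a point of contrast, here is a sketch of the closest Python 3 equivalent (an assumption, not part of the talk): `__str__` returns text, `__bytes__` is the separate byte-string hook, and mixing the two types raises instead of coercing silently:

```python
class Loonie:
    def __str__(self):
        return 'Richard Luxuryyacht'        # text representation

    def __bytes__(self):
        return b'Throatwobbler Mangrove'    # byte-string representation


text_form = '%s' % Loonie()     # formatting text always uses __str__
bytes_form = bytes(Loonie())    # bytes() uses __bytes__

# Concatenating bytes and str is an error rather than a silent promotion.
try:
    b'string ' + 'unicode'
except TypeError:
    mixing_allowed = False
else:
    mixing_allowed = True

print(text_form, bytes_form, mixing_allowed)
```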
    • Combining marks:
      LATIN SMALL LETTER E + COMBINING DIAERESIS: U+0065 U+0308
      LATIN SMALL LETTER E WITH DIAERESIS: U+00EB
      >>> print u'Zo\xeb'
      Zoë
      >>> print u'Zoe\u0308'
      Zoë
      >>> from unicodedata import normalize
      >>> normalize('NFC', u'Zoe\u0308')
      u'Zo\xeb'
      >>> normalize('NFD', u'Zo\xeb')
      u'Zoe\u0308'
      OS X on HFS+ normalises filenames, others don't
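One practical consequence is that two strings can print identically and still compare unequal. A minimal sketch, assuming Python 3 and a hypothetical `same_text` helper that normalises both sides to NFC before comparing:

```python
from unicodedata import normalize

composed = 'Zo\xeb'        # precomposed U+00EB
decomposed = 'Zoe\u0308'   # 'e' followed by COMBINING DIAERESIS

naive_equal = composed == decomposed             # compares code points
lengths = (len(composed), len(decomposed))       # the decomposed form is longer


def same_text(a, b):
    """Hypothetical helper: compare after normalising both strings to NFC."""
    return normalize('NFC', a) == normalize('NFC', b)


print(naive_equal, lengths, same_text(composed, decomposed))
```

This kind of comparison matters when matching filenames read back from a filesystem that normalises them against names held in memory.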
    • Warning:
    • PEP-8
      Code in the core Python distribution should always use the ASCII or Latin-1 encoding (a.k.a. ISO-8859-1). For Python 3.0 and beyond, UTF-8 is preferred over Latin-1, see PEP 3120.
      Files using ASCII should not have a coding cookie. Latin-1 (or UTF-8) should only be used when a comment or docstring needs to mention an author name that requires Latin-1; otherwise, using \x, \u or \U escapes is the preferred way to include non-ASCII data in string literals.
      For Python 3.0 and beyond, the following policy is prescribed for the standard library (see PEP 3131): All identifiers in the Python standard library MUST use ASCII-only identifiers, and SHOULD use English words wherever feasible (in many cases, abbreviations and technical terms are used which aren't English). In addition, string literals and comments must also be in ASCII. The only exceptions are (a) test cases testing the non-ASCII features, and (b) names of authors. Authors whose names are not based on the latin alphabet MUST provide a latin transliteration of their names.
    • Libraries:
      ● unidecode
        ● For when you absolutely need ASCII – folds accents and transliterates from many languages.
      ● chardet
        ● Guesses most likely character encoding of a given bytestring. Based on Mozilla's code.
      ● unicode-nazi
        ● Yells about any implicit unicode/bytestring conversion in your code. Useful when porting code to Python 3.
    • Links:
      ● All About Python and Unicode
        ● A detailed reference on all things pertaining to Python and Unicode.
      ● Pragmatic Unicode
        ● PyCon 2012 talk on Unicode in Python, covering v3 as well.
      ● Love Hotels and Unicode
        ● A look at the inside politics and other quirky aspects of Unicode.
      ● Python Unicode – Fixing UTF-8 encoded as Latin-1
        ● Another poor soul who ran into this problem.
      ● Why the Obama tweet was garbled
        ● A quick explanation with comments from the people responsible.
      ● Unicode Support Shootout
        ● An advanced treatise on how most languages (including Python) fail at Unicode.