Unicode for Small Children (and Children at Heart)
Feihong Hsu, Chicago Python Users Group, March 8, 2007
An allegorical explanation of Unicode, suitable for small children (sort of).

Transcript of "Unicode for Small Children (and Children at Heart)"

  43. Unicode for Small Children (and Children at Heart)
      Feihong Hsu, Chicago Python Users Group, March 8, 2007
      Notes: Thanks to Chris McAvoy for the conversation at PyCon that inspired this talk.
  44. Welcome to the Wonderful World of Unicorns! A Magical Guide to the World's Most Beloved Mythological Equine
      Notes: Completely drawn on my tablet PC using the free Ink Art program. Unfortunately, Ink Art doesn't come with good coloring tools, so I just left it colorless.
  45. Welcome to the Useful World of Unicode! A Practical Guide to the World's Most Popular International Text Standard
  46. Top 3 reasons that unicorns are great
      ● Friendly and wise
      ● Healing power
      ● Bane of evil
  47. Top 3 reasons that Unicode is important
      ● Comprehensive language coverage
      ● Multiple languages in a single document
      ● Standardized
      Notes: The Unicode Standard is maintained by the Unicode Consortium, an organization based in California.
  48. The difference between Horses and Unicorns
                 Horses                              Unicorns
      Habitat    Grasslands                          Enchanted forests
      Diet       Apples, oats, grass, barley, etc.   Love, spirit of wonder
      Abilities  Galloping, eating, pooping          Sentience, telepathy, laser vision (unconfirmed)
      Notes: I really wasn't sure about including the laser vision ability. I honestly thought it was an urban myth. But when a friend of my cousin's sister's friend said that she saw it in person, I finally relented.
  49. Difference between ISO 8859 and Unicode
                                   ISO 8859   Unicode
      # supported languages        Some       A lot
      # supported characters       256        100,000+
      # bytes for each character   1          1-4
      Notes: Somebody noted that ISO 8859 can actually support more than 256 characters through its various extensions, so this is an oversimplification.
  50. So what, exactly, is Unicode?
      Unicode is a standard that assigns a unique number to each character in every human language.
      (OK, not every language; see the next slide.)
      Notes: The "unique number" for each character is called a code point in Unicode terminology.
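To make "code point" and the "1-4 bytes" figure concrete, here is a small sketch. It is written for Python 3, which is an assumption (the talk itself uses Python 2), and the helper name `describe` and the sample characters are illustrative only. The byte counts are for UTF-8; other encodings give different lengths.

```python
# Python 3 sketch (assumption: the slides themselves target Python 2).
# `describe` is a hypothetical helper, not something from the talk.

def describe(ch):
    """Return (code point as U+XXXX, number of bytes in UTF-8) for one character."""
    return "U+%04X" % ord(ch), len(ch.encode("utf-8"))

# Escapes are used so the example does not depend on the editor's encoding.
samples = ["A", "\u00e9", "\u5ba0", "\U0001F600"]

for ch in samples:
    code_point, size = describe(ch)
    print(code_point, size)
```

The code point stays the same whatever encoding is chosen later; only the byte representation changes.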
  51. What is Unicode not?
      ● Doesn't address how the characters are rendered (that's up to font makers)
      ● Doesn't deal with imaginary languages like Klingon and Elvish
      ● Doesn't deal with ancient languages
      ● Doesn't deal with obscure languages that no one uses
      Notes: Although there are many languages that Unicode doesn't directly support, there are extensions to Unicode that are designed to handle these cases.
  52. How does Hollywood "create" unicorns?
      ● CGI
      ● Horse with horn glued to forehead
      ● Two dudes in a costume
      Notes: It helps if the two dudes are very high. And if they have circus experience. And if neither of them has a trick leg.
  53. How does a programmer create Unicode documents?
      ● Technically, you can't make a Unicode document
      ● Usually you pick an official encoding (UTF-8, UTF-16, etc.)
      ● Sometimes you use a language-specific encoding (GB2312, Shift-JIS)
      Notes: In the vast majority of cases, I think UTF-8 is more than adequate. If in doubt, just go with that encoding.
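The point that a document is always stored in some encoding, never as "Unicode" itself, can be sketched as follows. This assumes Python 3 rather than the Python 2 of the talk, and `encode_all` is a hypothetical helper: the same single character becomes different byte sequences under different encodings, and some encodings cannot represent it at all.

```python
# Python 3 sketch (assumption); `encode_all` is an illustrative helper name.

def encode_all(text, encodings):
    """Map each encoding name to the encoded bytes, or None if it cannot encode `text`."""
    results = {}
    for name in encodings:
        try:
            results[name] = text.encode(name)
        except UnicodeEncodeError:
            # The encoding has no representation for at least one character.
            results[name] = None
    return results

# U+5BA0 is the character used in the talk's unichr(23456) example.
results = encode_all("\u5ba0", ["utf-8", "utf-16-le", "gb2312", "ascii"])

for name, data in results.items():
    print(name, data)
```

UTF-8 is the safe default here, in line with the speaker's note; GB2312 works for this character only because it happens to be a common simplified Chinese character.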
  54. Python and Unicorn: Working together to combat evil!
      Notes: I think this is a case of the graphic actually undermining the point I'm trying to make. This is my attempt to render a dynamic, exciting action scene of a pitched battle between orc, unicorn and python. They are fighting for the fate of the damsel in distress because she is, like, oh so fine (well, at least when she's got her makeup on, which she doesn't in this picture). Unfortunately, the unicorn looks like it's about to be stabbed in the ass, and the python seems more interested in biting a chunk out of the damsel than in saving her.
  55. Python and Unicode: Working together to create international applications!
      Notes: The only time I actually visited the Unicode Consortium's web site was to get a copy of the Unicode logo.
  56. Unicode-related functions
      ● unichr()
      ● ord()
      ● unicode.encode()
      ● str.decode()
      Notes: Thanks to Ian Bicking for pointing out that it should be unicode.encode(), not str.encode().
  57. Examples of usage
      >>> s = unichr(23456)
      >>> print s
      宠
      >>> ord(s)
      23456
      >>> s.encode('utf-8')
      '\xe5\xae\xa0'
      >>> s.encode('gb2312')
      '\xb3\xe8'
      >>> print _
      ³è
      >>> '\xe5\xae\xa0'.decode('utf-8')
      u'\u5ba0'
      >>> print _
      宠
      Notes: The PDF version of this presentation doesn't render the Chinese character properly. But if you copy and paste it into a Unicode-aware editor, you'll probably be able to see it. I admit it is pretty rare to put a Chinese character in Courier New font.
  58. unicode and str: two different types!
      ● They have exactly the same API
      ● But they don't have the same repr()
      ● And they don't have the same type()
      ● Use isinstance() to tell them apart
      Notes: Thanks to Atul Varma for making some comments that led me to adding this slide (and the next one).
  59. unicode and str example
      >>> u = unicode()
      >>> type(u)
      <type 'unicode'>
      >>> print repr(u)
      u''
      >>> isinstance(u, str)
      False
      >>> s = str()
      >>> type(s)
      <type 'str'>
      >>> print repr(s)
      ''
      >>> isinstance(s, unicode)
      False
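For readers on Python 3, the same two-type split still exists, only under different names: text is `str` and encoded data is `bytes`. The sketch below is an assumption about how the slide's advice carries over, not part of the talk, and `kind` is a hypothetical helper.

```python
# Python 3 sketch (assumption): Python 2's `unicode` roughly maps to `str`,
# and Python 2's `str` roughly maps to `bytes`.

def kind(value):
    """Classify a value as text or bytes using isinstance(), as the slide suggests."""
    if isinstance(value, str):
        return "text"
    if isinstance(value, (bytes, bytearray)):
        return "bytes"
    raise TypeError("expected str or bytes, got %s" % type(value).__name__)

text = "\u5ba0"              # one character
data = text.encode("utf-8")  # three bytes

print(kind(text), len(text))
print(kind(data), len(data))
```

Note that `len()` counts characters for text and bytes for encoded data, which is one practical reason to keep the two apart.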
  60. Two ways to write a Unicode file
      ● Use the file object returned by codecs.open()
      ● Use a regular file object along with unicode.encode()
  61. Example using codecs.open()
      >>> import codecs
      >>> s = u'\u4f60\u597d\u4e16\u754c'
      >>> fout = codecs.open('document.txt', 'w', 'utf-8')
      >>> fout.write(s)
      >>> fout.close()
      >>> open('document.txt').read().decode('utf-8')
      u'\u4f60\u597d\u4e16\u754c'
  62. Example using unicode.encode()
      >>> s = u'\u4f60\u597d\u4e16\u754c'
      >>> fout = open('document.txt', 'w')
      >>> fout.write(s.encode('utf-8'))
      >>> fout.close()
      >>> open('document.txt').read().decode('utf-8')
      u'\u4f60\u597d\u4e16\u754c'
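The two approaches above can be sketched side by side in Python 3 (an assumption; the slides use Python 2), where the built-in `open()` with an `encoding` argument plays the role of `codecs.open()`. The file names and the helper `write_both_ways` are illustrative. Both routes should leave identical bytes on disk.

```python
# Python 3 sketch (assumption); file names and helper name are hypothetical.
import os
import tempfile

text = "\u4f60\u597d\u4e16\u754c"  # "Hello World" in Chinese, as in the slides

def write_both_ways(directory, text):
    """Write `text` as UTF-8 in two ways and return the raw bytes of each file."""
    path_a = os.path.join(directory, "text_mode.txt")
    path_b = os.path.join(directory, "binary_mode.txt")
    # Way 1: the file object encodes on write (the codecs.open() role).
    with open(path_a, "w", encoding="utf-8") as fout:
        fout.write(text)
    # Way 2: encode by hand, then write bytes (the unicode.encode() role).
    with open(path_b, "wb") as fout:
        fout.write(text.encode("utf-8"))
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        return fa.read(), fb.read()

with tempfile.TemporaryDirectory() as tmp:
    bytes_a, bytes_b = write_both_ways(tmp, text)

print(bytes_a == bytes_b)
```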
  63. Two ways to read Unicode files
      ● Use the file object returned by codecs.open()
      ● Use a regular file object along with str.decode()
      ● Watch out for the BOM!
  64. What is Byte Order Mark?
      ● Called BOM for short
      ● In UTF-16 docs, indicates little-endian or big-endian
      ● Often appears in UTF-8 docs to distinguish them from ASCII docs
      ● Use read(1) to consume the BOM in UTF-8 documents that have one
      Notes: The actual value of the BOM is 0xfeff. If you try to print it in the Python interpreter, you won't see anything.
  65. Example of reading from a UTF-8 file with BOM
      >>> import codecs
      >>> fin = codecs.open('bom_document.txt', 'r', 'utf-8')
      >>> fin.read(1)
      u'\ufeff'
      >>> fin.read()
      u'\u4f60\u597d\u4e16\u754c'
      >>> fin.close()
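As an alternative to the `read(1)` trick, Python ships a `utf-8-sig` codec that drops a leading BOM if one is present. The talk does not mention it, so treat this as a suggested variation rather than the speaker's method; the sketch assumes Python 3 and a throwaway file created just for the demonstration.

```python
# Python 3 sketch (assumption). Compares the plain "utf-8" codec, which keeps
# the BOM as U+FEFF, with "utf-8-sig", which strips it.
import codecs
import os
import tempfile

text = "\u4f60\u597d\u4e16\u754c"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "bom_document.txt")
    # Build a UTF-8 file that starts with a BOM, as some editors do.
    with open(path, "wb") as fout:
        fout.write(codecs.BOM_UTF8 + text.encode("utf-8"))

    with open(path, encoding="utf-8") as fin:
        with_bom = fin.read()      # BOM survives as the first character
    with open(path, encoding="utf-8-sig") as fin:
        without_bom = fin.read()   # BOM is removed by the codec

print(ascii(with_bom[0]), len(with_bom), len(without_bom))
```

`utf-8-sig` also reads BOM-less files correctly, so it is a reasonable default when the presence of a BOM is unknown.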
  66. Reading and writing XML
      ● ElementTree handles everything implicitly
      ● It even eats the BOM without complaining
      ● It doesn't even need the XML declaration (as long as you use ASCII or UTF-8)
      ● cElementTree works great too!
      Notes: The lxml module is similarly awesome.
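A minimal sketch of the "no XML declaration needed" point, using the standard-library `xml.etree.ElementTree` under Python 3 (an assumption; in 2007 ElementTree was also available as a separate package). The element name `greeting` is made up for the example, and BOM handling is not exercised here.

```python
# Python 3 sketch (assumption). Parses UTF-8 bytes that carry no XML
# declaration, then serializes the element two ways.
import xml.etree.ElementTree as ET

text = "\u4f60\u597d\u4e16\u754c"
raw = ("<greeting>" + text + "</greeting>").encode("utf-8")

root = ET.fromstring(raw)  # UTF-8 is assumed when nothing is declared

# Default serialization is ASCII bytes with numeric character references.
ascii_xml = ET.tostring(root)
# encoding="unicode" returns a str with the characters intact.
unicode_xml = ET.tostring(root, encoding="unicode")

print(ascii_xml)
print(ascii(unicode_xml))
```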
  67. File system directory listing
      ● On Windows, os.listdir('.') won't show you intl characters
      ● You need to use os.listdir(u'.') to see the Unicode files
      ● os.getcwd() doesn't show intl characters
      ● Use os.getcwdu() instead
      Notes: The behavior under Mac OS X is somewhat different. I don't know about Linux.
  68. String interpolation
      ● str template strings can be interpolated with both unicode and str objects (automatic conversion to unicode)
      ● unicode template strings need to be interpolated with unicode objects (a non-ASCII str raises UnicodeDecodeError)
      Notes: Template engines have these sorts of issues as well. In particular, if you want to render a unicode string in Mako or Myghty, you need to pass unicode strings into the template.
  69. String interpolation example
      >>> 'Hello %s' % u'\u98db\u9d3b'
      u'Hello \u98db\u9d3b'
      >>> u'Hello %s' % u'\u98db\u9d3b'
      u'Hello \u98db\u9d3b'
      >>> 'Hello %s' % '\xe9\xa3\x9b\xe9\xb4\xbb'
      'Hello \xe9\xa3\x9b\xe9\xb4\xbb'
      >>> u'Hello %s' % '\xe9\xa3\x9b\xe9\xb4\xbb'
      Traceback (most recent call last):
        File "<pyshell#36>", line 1, in ?
          u'Hello %s' % '\xe9\xa3\x9b\xe9\xb4\xbb'
      UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 0: ordinal not in range(128)
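One way to avoid the error in the last example is to decode any byte string explicitly before interpolating, so the template only ever sees text. The sketch below assumes Python 3 and UTF-8 input; `greet` and its `encoding` parameter are hypothetical names, not part of the talk.

```python
# Python 3 sketch (assumption): decode at the boundary, interpolate text only.

def greet(name, encoding="utf-8"):
    """Return a greeting; byte strings are decoded first using `encoding`."""
    if isinstance(name, bytes):
        name = name.decode(encoding)
    return "Hello %s" % name

from_text = greet("\u98db\u9d3b")
from_bytes = greet(b"\xe9\xa3\x9b\xe9\xb4\xbb")  # the same name, UTF-8 encoded

print(from_text == from_bytes)
```

The same habit applies to template engines: convert to text once, at the point where data enters the program.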
  70. Putting Unicode in your Python source code
      ● Put "# -*- coding: utf-8 -*-" at top of your file
      ● IDLE automatically detects non-ASCII characters and prompts to edit your file
      ● Not generally recommended
      Notes: I don't recommend putting Unicode strings in your source code because people who don't have Unicode-aware editors will just see annoying gibberish.
  71. Regular expressions
      ● The \w special character doesn't usually match non-ASCII characters
      ● To match non-ASCII characters, use the re.UNICODE flag
      ● Remember that punctuation in different languages uses different characters
      Notes: Punctuation characters in English: . ? ! Compare with punctuation characters in Chinese: 。 ？ ！ Although they only look slightly different, they do have different code points in Unicode.
  72. Regular expression example
      >>> import re
      >>> s = u'ABC\u4f60\u597d\u4e16\u754c'
      >>> m = re.match(r"\w+", s)
      >>> m.group()
      u'ABC'
      >>> m = re.match(r"\w+", s, re.UNICODE)
      >>> m.group()
      u'ABC\u4f60\u597d\u4e16\u754c'
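In Python 3 the default is reversed: `\w` matches Unicode word characters on `str` patterns unless `re.ASCII` is given. The sketch below shows that, plus the punctuation point from the notes; it assumes Python 3, and the `sentence_end` pattern is only an illustration that covers three Chinese marks.

```python
# Python 3 sketch (assumption; the slide's behavior is Python 2's).
import re

s = "ABC\u4f60\u597d\u4e16\u754c"

unicode_match = re.match(r"\w+", s).group()          # Unicode-aware by default
ascii_match = re.match(r"\w+", s, re.ASCII).group()  # Python 2-like behavior

# English and Chinese sentence enders are different code points, so a pattern
# must list both: U+3002, U+FF1F and U+FF01 are the Chinese forms.
sentence_end = re.compile("[.?!\u3002\uff1f\uff01]")
chinese_sentence = "\u4f60\u597d\u3002"

print(ascii(unicode_match), ascii(ascii_match))
print(sentence_end.search(chinese_sentence) is not None)
```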
  73. Considerations for web pages
      ● Don't make pages or folders with intl characters (Firefox doesn't handle intl URLs well)
      ● Make sure you use the <meta> tag when generating web pages
      ● You can display Unicode even in ASCII-encoded pages (use character entities)
      Notes: As Atul Varma pointed out, Firefox mangles the URL but does so in a standard way. However, it still ends up not finding the page. IE can actually find and display pages with Unicode names. This is probably the only thing IE does better than Firefox.
  74. Web page with <meta> tag
      <html>
        <head>
          <meta http-equiv="Content-Type" content="text/html;charset=utf-8">
        </head>
        <body>
          <h1> 你好世界 </h1>
        </body>
      </html>
      Notes: The text is Chinese for "Hello World".
  75. Web page with character entities
      <html>
        <head>
          <meta http-equiv="Content-Type" content="text/html;charset=ascii">
        </head>
        <body>
          <h1>&#20320;&#22909;&#19990;&#30028;</h1>
        </body>
      </html>
      Conversion recipe: s.encode('ascii', 'xmlcharrefreplace')
      Notes: Thanks to Ian Bicking for pointing out a shorter conversion recipe. For the record, the original one is: ''.join('&#%d;' % ord(c) for c in s)
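The conversion recipe can be checked in both directions with a short sketch. It assumes Python 3, where `encode()` returns `bytes`, and it uses the standard-library `html.unescape` for the reverse step, which is an addition not mentioned in the talk.

```python
# Python 3 sketch (assumption). Round-trips text through numeric character
# references so it can live in an ASCII-only page.
import html

text = "\u4f60\u597d\u4e16\u754c"

# Characters outside ASCII become &#NNNNN; references.
ascii_bytes = text.encode("ascii", "xmlcharrefreplace")

# A browser would do this step; html.unescape is used here to verify it.
restored = html.unescape(ascii_bytes.decode("ascii"))

print(ascii_bytes)
print(restored == text)
```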
  76. Processing documents of unknown encoding
      ● Use the chardet module
      ● chardet.detect() function:
        – accepts a string
        – returns a dictionary with two keys: encoding and confidence
      ● Also try BeautifulSoup for web pages
  77. Encoding detection example
      >>> import chardet, urllib2
      >>> html = urllib2.urlopen('http://chol.co.kr').read()
      >>> result = chardet.detect(html)
      >>> result
      {'confidence': 0.98999999999999999, 'encoding': 'EUC-KR'}
      >>> print html.decode(result['encoding'])
      Notes: You can also try BeautifulSoup for web pages. Example:
      content = urllib2.urlopen(url).read()
      soup = BeautifulSoup(content)
      encoding = soup.originalEncoding
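When chardet is not available, a much cruder technique is to try a short list of candidate encodings in order and keep the first that decodes cleanly. This is not what chardet does (chardet uses statistical detection), and the sketch below is only a stand-in: it assumes Python 3, and the function name and the candidate list are hypothetical choices.

```python
# Python 3 sketch (assumption). A naive try-in-order fallback, NOT statistical
# detection; the candidate order matters and the list is only an example.

def decode_with_fallback(data, candidates=("utf-8", "euc-kr", "latin-1")):
    """Return (text, encoding_name) for the first candidate that decodes `data`."""
    for name in candidates:
        try:
            return data.decode(name), name
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")

korean = "\uc548\ub155"  # a short Korean greeting
text_a, name_a = decode_with_fallback(korean.encode("euc-kr"))
text_b, name_b = decode_with_fallback("\u5ba0".encode("utf-8"))

print(name_a, name_b)
```

Putting `latin-1` last matters because it accepts any byte sequence, so it only works as a last resort and may silently produce the wrong text.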
  78. Tools that play nice with Unicode
      ● IDLE (raw_input() accepts Unicode)
      ● Notepad++ (can autodetect UTF-8 files with BOM)
      ● jEdit
      Notes: Note that only IDLE on Windows has this feature.
  79. Libraries that play nice with Unicode
      ● Tkinter
      ● wxPython
      ● Mako
      ● BeautifulSoup
      ● feedparser
      ● ElementTree
      ● lxml
  80. Libraries that don't play nice with Unicode
      ● cStringIO (StringIO.write() doesn't accept Unicode strings)
      ● buzhug
      ● Various ID3 libraries
      ● ?
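The cStringIO limitation has a rough Python 3 counterpart: the `io` module splits in-memory files into a text flavor and a bytes flavor, and each rejects the other type outright. This sketch assumes Python 3 and is offered as context, not as a description of cStringIO itself; `rejects` is a hypothetical helper.

```python
# Python 3 sketch (assumption): io.StringIO holds text, io.BytesIO holds bytes.
import io

text = "\u4f60\u597d"

text_buffer = io.StringIO()
text_buffer.write(text)

byte_buffer = io.BytesIO()
byte_buffer.write(text.encode("utf-8"))

def rejects(buffer, value):
    """Return True if writing `value` to `buffer` raises TypeError."""
    try:
        buffer.write(value)
    except TypeError:
        return True
    return False

print(rejects(io.StringIO(), b"bytes"), rejects(io.BytesIO(), "text"))
```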
  81. Databases
      ● SQLite has no problem with Unicode
      ● SQLAlchemy with SQLite is fine too
      ● Other databases - ?
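A quick way to confirm the SQLite claim is a round trip through an in-memory database. The sketch assumes Python 3 and the standard-library `sqlite3` module (the talk's demo used pysqlite); the table and column names are made up for the example.

```python
# Python 3 sketch (assumption); table/column names are hypothetical.
import sqlite3

def roundtrip(text):
    """Store `text` in an in-memory SQLite table and read it back."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE greetings (body TEXT)")
        # Parameter binding passes the text through without manual encoding.
        conn.execute("INSERT INTO greetings (body) VALUES (?)", (text,))
        (stored,) = conn.execute("SELECT body FROM greetings").fetchone()
        return stored
    finally:
        conn.close()

original = "\u4f60\u597d\u4e16\u754c"
stored = roundtrip(original)

print(stored == original)
```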
  82. Platform-specific issues
      ● Windows DOS prompt has no love for Unicode
      ● MacOS X IDLE can't handle Unicode
      ● MacOS X terminal doesn't like Unicode, likes UTF-8
      ● Recommendation: Use PyCrust?
      Notes: I checked and it turns out that PyCrust chokes on intl characters sent through raw_input(), even on Windows. So I formally withdraw my recommendation of PyCrust.
  83. Demos
      ● Filesystem demo
      ● Mako template engine demo
      ● chardet demo
      ● pysqlite demo
      ● wxPython demo
  84. Questions? 有问题吗?
      Notes: Thanks to the experts in the audience who provided hard-hitting answers to the tough questions. And, of course, thanks to everyone who attended my first talk at ChiPy. I hope there will be more.