Perl And Unicode

Perl and Unicode talk from Italian Perl Workshop 2009 and London Perl Workshop 2010.

CC Attribution-ShareAlike License


Perl And Unicode Presentation Transcript

  • 1. Perl and Unicode
  • 2. Perl and Unicode. Mike Whitaker, BBC/EnlightenedPerl.org
  • 3. The problem
      • Keeping track of input and output encodings
      • Not losing encoding data in the middle
      • Understanding the difference between characters and bytes
  • 4-12. Characters vs bytes
      • characters: $ (U+0024) and € (U+20AC)
      • bytes (UTF-8): 0x24 and 0xE2 0x82 0xAC
  • 13-14. Characters vs bytes
      • 2 characters: $ € (U+0024 U+20AC)
      • 4 bytes (UTF-8): 0x24 0xE2 0x82 0xAC
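The character and byte counts on the slides above can be checked directly. This is a minimal sketch, not code from the talk; the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Two characters: "$" (U+0024) followed by the euro sign (U+20AC).
my $chars = "\x{24}\x{20AC}";

# The same text as UTF-8 encoded bytes.
my $bytes = encode('UTF-8', $chars);

printf "characters: %d\n", length $chars;    # 2
printf "bytes:      %d\n", length $bytes;    # 4

# Show each byte in hex: 0x24 0xE2 0x82 0xAC
my $hex = join ' ', map { sprintf '0x%02X', ord } split //, $bytes;
print "hex:        $hex\n";
```

length() counts whatever the string holds: characters for decoded text, bytes for encoded data.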
  • 15-24. Handling Encodings
      • input: àbçdé as bytes in some encoding or other
      • decode: use Encode; $chars = decode($enc, $bytes);
      • àbçdé in a character-based internal representation
      • encode: use Encode; $bytes = encode($enc, $chars);
      • output: àbçdé as bytes in desired encoding
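A small round trip through the pipeline on the slides above might look like the following. The choice of ISO-8859-1 for input and UTF-8 for output is an assumption made for illustration; the slides only say "some encoding or other" and "desired encoding".

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumed input: "àbçdé" as ISO-8859-1 bytes (one byte per character).
my $in_bytes = "\xE0b\xE7d\xE9";

my $chars     = decode('ISO-8859-1', $in_bytes);   # bytes -> characters
my $out_bytes = encode('UTF-8', $chars);           # characters -> bytes

printf "input bytes:  %d\n", length $in_bytes;     # 5
printf "characters:   %d\n", length $chars;        # 5
printf "output bytes: %d\n", length $out_bytes;    # 8 (three accented letters take 2 bytes each)
```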
  • 25-28. The Holy Grail
      • Can represent all encodings
      • Has multibyte character support
          • for example, length() should count characters, not bytes
  • 29. It doesn't work like that
  • 30-33. use Encode;
      • Only works in Perl 5.8 and above
      • Why the $£%^&*() are you using 5.6 ANYWAY?
      • There are solutions for 5.6 and even earlier. But they're HORRIBLE.
  • 34-39. character-based internal representation
      • Perl has one!
      • Magic internal representation.
      • All string functions know about it.
      • It's encoding-agnostic.
      • In fact....
  • 40. IT'S UTF-8!
  • 41. IT'S almost UTF-8!
  • 42. Handling Encodings
      • input: àbçdé as bytes in some encoding or other
      • decode: use Encode; $chars = decode($enc, $bytes);
      • àbçdé in Perl's magic internal representation
      • encode: use Encode; $bytes = encode($enc, $chars);
      • output: àbçdé as bytes in desired encoding
  • 43-44. I18N? What the £$%^&*('s that?
      • input: àbçdé as bytes in machine's 8-bit encoding
      • àbçdé as bytes in machine's 8-bit encoding
      • output: àbçdé as bytes in machine's 8-bit encoding
  • 45-48. People are still writing Perl like it was Perl 4
      • ...and we have to support them.
      • Even though our string functions expect chars.
  • 49-53. ???? Perl's magic internal representation
      • if all characters are representable in local machine's 8-bit charset, use that;
      • else use UTF-8
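The two storage forms described above can be observed with utf8::is_utf8(), which reports only how Perl is holding a string internally. A minimal sketch follows, using string literals as stand-ins for decoded text. Peeking at the flag is done here purely for illustration; it is not something to build program logic on.

```perl
use strict;
use warnings;

my $latin = "\xE9";        # é: fits in a single byte, so it stays in the 8-bit form
my $wide  = "\x{20AC}";    # €: does not fit in a byte, so it is held as UTF-8

my $latin_flag = utf8::is_utf8($latin) ? 'yes' : 'no';
my $wide_flag  = utf8::is_utf8($wide)  ? 'yes' : 'no';

print "latin held as UTF-8: $latin_flag\n";    # no
print "wide  held as UTF-8: $wide_flag\n";     # yes

# Either way, each string is one character long.
printf "lengths: %d %d\n", length $latin, length $wide;    # 1 1
```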
  • 54-59. àbçdé as UTF-8 characters, or àbçdé as machine bytes
      • from UTF-8 characters: use Encode; $bytes = encode($enc, $chars); gives àbçdé as bytes in desired output encoding
      • from machine bytes: use Encode; $bytes = encode($enc, $chars); gives àbçdé as bytes in desired output encoding
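To illustrate that encode() produces the same output from either internal form, the sketch below holds the same five characters both ways. utf8::upgrade() is used only to force the UTF-8 internal form for the demonstration; the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $native   = "\xE0b\xE7d\xE9";   # àbçdé, held in the 8-bit form
my $upgraded = $native;
utf8::upgrade($upgraded);          # same characters, now held as UTF-8 internally

my $from_native   = encode('UTF-8', $native);
my $from_upgraded = encode('UTF-8', $upgraded);

print "same characters:   ", ($native eq $upgraded ? 'yes' : 'no'), "\n";             # yes
print "same output bytes: ", ($from_native eq $from_upgraded ? 'yes' : 'no'), "\n";   # yes
printf "output length:     %d bytes\n", length $from_native;                          # 8
```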
  • 60-67. UTF-8 characters + àbçdé machine bytes = ?????
      • promote: àbçdé machine bytes become àbçdé UTF-8 characters
      • UTF-8 characters + àbçdé machine bytes = àbçdé UTF-8 characters
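The promotion step can be seen by concatenating a string held in the 8-bit form with one held as UTF-8. This is a sketch under the assumption that the 8-bit string really contains text (as Perl will treat it that way); if it actually held encoded bytes, the same promotion would silently corrupt them.

```perl
use strict;
use warnings;

my $native = "\xE0b\xE7d\xE9";   # àbçdé in the 8-bit form (5 characters)
my $wide   = "\x{20AC}";         # € in the UTF-8 form (1 character)

my $joined = $wide . $native;    # the 8-bit string is promoted during the join

printf "length: %d\n", length $joined;                                     # 6
print  "held as UTF-8: ", (utf8::is_utf8($joined) ? 'yes' : 'no'), "\n";   # yes

# The promoted part still compares equal to the original text.
print  "tail intact: ", (substr($joined, 1) eq $native ? 'yes' : 'no'), "\n";   # yes
```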
  • 68-79. àbçdé as machine bytes vs àbçdé as UTF-8 characters, sent to output
      • machine bytes, Content-Encoding: UTF-8: bd ? ? ?
      • machine bytes, Content-Encoding: ISO-8859-1: àbçdé
      • UTF-8 characters, Content-Encoding: UTF-8: àbçdé
      • UTF-8 characters, Content-Encoding: ISO-8859-1: àbçdé
  • 80. ARRRGH!!!!
  • 81. It gets worse.
  • 82-92. You can't tell what you've actually got
      • utf8::is_utf8() does not mean what you think it means
      • encoded bytes: utf8::is_utf8() = false, EVEN IF they're UTF-8
      • decoded UTF-8 chars: utf8::is_utf8() = true
      • decoded machine bytes: utf8::is_utf8() = false
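The first two cases above can be reproduced with Encode. In this sketch the encoded string is perfectly valid UTF-8, yet utf8::is_utf8() reports false for it, because the function describes Perl's internal storage and not the content.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $bytes = encode('UTF-8', "\x{20AC}");   # three bytes of valid UTF-8
my $chars = decode('UTF-8', $bytes);       # one decoded character

my $bytes_flag = utf8::is_utf8($bytes) ? 'true' : 'false';
my $chars_flag = utf8::is_utf8($chars) ? 'true' : 'false';

printf "encoded bytes: length %d, is_utf8 = %s\n", length $bytes, $bytes_flag;   # 3, false
printf "decoded chars: length %d, is_utf8 = %s\n", length $chars, $chars_flag;   # 1, true
```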
  • 93-96. The science bit
      • Encode.pm: use Encode; $bytes = encode($enc, $chars);
      • 3 argument form of open() - PerlIO layers: open(FILEHANDLE, ">:encoding(UTF-8)", $file);
      • binmode(FILEHANDLE,
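The PerlIO approach above could be exercised as follows. This sketch uses lexical filehandles and a temporary file from File::Temp, which are choices made here for a self-contained example; the slides use a bareword FILEHANDLE and an unspecified $file.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# A scratch file to write to and read back from (removed at exit).
my ($tmp_fh, $file) = tempfile(UNLINK => 1);
close $tmp_fh;

my $text = "\x{E0}b\x{E7}d\x{E9} \x{20AC}";   # "àbçdé €": 7 characters

# Encode on the way out: the layer turns characters into UTF-8 bytes.
open(my $out, '>:encoding(UTF-8)', $file) or die "open for write: $!";
print {$out} $text;
close $out or die "close: $!";

my $size = -s $file;                          # bytes on disk

# Decode on the way in: the layer turns UTF-8 bytes back into characters.
open(my $in, '<:encoding(UTF-8)', $file) or die "open for read: $!";
my $read = do { local $/; <$in> };
close $in;

# binmode() applies a layer to a handle that is already open.
binmode(STDOUT, ':encoding(UTF-8)');
printf "characters written: %d\n", length $text;   # 7
printf "bytes on disk:      %d\n", $size;          # 12
printf "characters read:    %d\n", length $read;   # 7
```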
  • 97-103. utf8 vs UTF-8
      • Encode.pm
          • utf8 = marks it as UTF-8 and hopes...
          • UTF-8 = is actually valid UTF-8
      • PerlIO layers:
          • :utf8
          • :encoding(UTF-8)
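In Encode, the two spellings resolve to different encoding objects, which is one way to see that the distinction above is real. A minimal sketch using Encode's find_encoding():

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Look up the canonical name Encode uses for each spelling.
my %canonical = map { $_ => find_encoding($_)->name } ('utf8', 'UTF-8');

printf "%-5s resolves to %s\n", $_, $canonical{$_} for ('utf8', 'UTF-8');
# utf8  resolves to utf8
# UTF-8 resolves to utf-8-strict
```

The "strict" in the second name reflects the slide's point: that encoding checks for valid UTF-8, while the lax one is more permissive.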
  • 104-106. use utf8;
      • Does NOT do what you might think it does
      • All it says is "my source code is UTF-8".
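The effect of the pragma on a string literal can be sketched as below. This assumes the script itself is saved as UTF-8 and that the é in the source is the single precomposed character U+00E9; the pragma changes only how the source text is read, not how input or output is handled.

```perl
use strict;
use warnings;

# Without the pragma, the literal is read as the two bytes of its UTF-8 encoding.
my $without = do { no utf8;  length "é" };

# With the pragma, the same literal is read as one character.
my $with    = do { use utf8; length "é" };

print "without use utf8: $without\n";   # 2
print "with use utf8:    $with\n";      # 1
```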
  • 107-112. Modules
      • It depends on the module:
          • CGI - $CGI::PARAM_UTF8=1;
          • LWP::UserAgent - ->decoded_content() method honours Content-Encoding:
          • DBI - mysql_enable_utf8 in DBI::connect()
          • XML::LibXML - looks at encoding,
  • 113-118. In summary
      • decode bytes as soon as you get them:
          • decode(), binmode(STDIN), 3 arg open()
      • encode characters just before you output:
          • encode(), binmode(STDOUT), 3 arg open()
      • keep track of whether your strings are
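The first two summary points can be sketched as a decode, process, encode sequence. Reversing the text stands in for "any string processing"; the input bytes are generated inline as a stand-in for data arriving from outside, and UTF-8 on both sides is an assumption.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Stand-in for bytes received from a file, socket, or form: "àbçdé" in UTF-8.
my $in_bytes = encode('UTF-8', "\x{E0}b\x{E7}d\x{E9}");

my $chars     = decode('UTF-8', $in_bytes);   # decode as soon as you get them
my $reversed  = scalar reverse $chars;        # work on characters in the middle
my $out_bytes = encode('UTF-8', $reversed);   # encode just before output

# Doing the same work on the undecoded bytes splits the multibyte sequences.
my $broken = scalar reverse $in_bytes;

print "characters reversed correctly: ",
    ($reversed eq "\x{E9}d\x{E7}b\x{E0}" ? 'yes' : 'no'), "\n";                      # yes
print "byte-level reverse matches:    ", ($broken eq $out_bytes ? 'yes' : 'no'), "\n";   # no
```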
  • 119. NEVER EVER EVER rely on the encoding of Perl's internal representation
  • 120. and...
  • 121. ...there is NO SUCH THING as "plain text"
  • 122. Handling Encodings
      • input: àbçdé as bytes in some encoding or other
      • decode: use Encode; $chars = decode($enc, $bytes);
      • àbçdé in Perl's magic internal representation
      • encode: use Encode; $bytes = encode($enc, $chars);
      • output: àbçdé as bytes in desired encoding
  • 123. The Holy Fail (thanks Joel!)
      • input: àbçdé as bytes in some encoding or other
      • decode: use Encode; $chars = decode($enc, $bytes);
      • àbçdé in Perl's magic internal representation
      • encode: use Encode; $bytes = encode($enc, $chars);
      • output: àbçdé as bytes in desired encoding
  • 124. Questions?