Perl And Unicode

Perl and Unicode talk from Italian Perl Workshop 2009 and London Perl Workshop 2010.

Published in: Technology, Business

  • Perl And Unicode

    1. 1. Perl and Unicode
    2. 2. Perl and Unicode. Mike Whitaker, BBC/EnlightenedPerl.org
    3. 3. The problem • Keeping track of input and output encodings • Not losing encoding data in the middle • Understanding the difference between characters and bytes
    4. 4. Characters vs bytes
    5. 5. Characters vs bytes: characters
    6. 6. Characters vs bytes: characters $
    7. 7. Characters vs bytes: characters $ (U+0024)
    8. 8. Characters vs bytes: characters $ (U+0024); bytes (UTF-8)
    9. 9. Characters vs bytes: characters $ (U+0024); bytes (UTF-8) 0x24
    10. 10. Characters vs bytes: characters $ (U+0024), €; bytes (UTF-8) 0x24
    11. 11. Characters vs bytes: characters $ (U+0024), € (U+20AC); bytes (UTF-8) 0x24
    12. 12. Characters vs bytes: characters $ (U+0024), € (U+20AC); bytes (UTF-8) 0x24, 0xE2 0x82 0xAC
    13. 13. Characters vs bytes: 2 characters $ (U+0024), € (U+20AC); bytes (UTF-8) 0x24, 0xE2 0x82 0xAC
    14. 14. Characters vs bytes: 2 characters $ (U+0024), € (U+20AC); 4 bytes (UTF-8) 0x24, 0xE2 0x82 0xAC
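The character and byte counts on this slide can be checked directly. The following is a minimal sketch, not part of the original deck; the variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Two characters, written by code point: DOLLAR SIGN and EURO SIGN.
my $chars = "\x{24}\x{20AC}";

# Encoding turns the character string into a byte string.
my $bytes = encode('UTF-8', $chars);

my $char_count = length $chars;    # length() of a character string counts characters
my $byte_count = length $bytes;    # length() of an encoded string counts bytes

# Hex dump of the encoded bytes, in the notation used on the slide.
my $hex = join ' ', map { sprintf '0x%02X', ord($_) } split //, $bytes;

print "$char_count characters, $byte_count bytes: $hex\n";
```

The same length() call gives different answers depending on whether it is handed characters or encoded bytes, which is the distinction the rest of the talk builds on.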
    15. 15. Handling Encodings
    16. 16. Handling Encodings: input
    17. 17. Handling Encodings: input (àbçdé, bytes in some encoding or other)
    18. 18. Handling Encodings: input (àbçdé, bytes in some encoding or other) → decode
    19. 19. Handling Encodings: input (àbçdé, bytes in some encoding or other) → decode → àbçdé (character-based internal representation)
    20. 20. Handling Encodings: input (àbçdé, bytes in some encoding or other) → decode → àbçdé (character-based internal representation) → encode
    21. 21. Handling Encodings: input (àbçdé, bytes in some encoding or other) → decode → àbçdé (character-based internal representation) → encode → àbçdé (bytes in desired encoding)
    22. 22. Handling Encodings: input (àbçdé, bytes in some encoding or other) → decode → àbçdé (character-based internal representation) → encode → output (àbçdé, bytes in desired encoding)
    23. 23. Handling Encodings: input (àbçdé, bytes in some encoding or other) → decode [use Encode; $chars = decode($enc, $bytes);] → àbçdé (character-based internal representation) → encode → output (àbçdé, bytes in desired encoding)
    24. 24. Handling Encodings: input (àbçdé, bytes in some encoding or other) → decode [use Encode; $chars = decode($enc, $bytes);] → àbçdé (character-based internal representation) → encode [use Encode; $bytes = encode($enc, $chars);] → output (àbçdé, bytes in desired encoding)
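The decode and encode steps in this diagram can be sketched as a small round trip. The choice of ISO-8859-1 for input and UTF-8 for output is an assumption made for illustration; the slides leave $enc unspecified.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical input: "àbçdé" as it would arrive in ISO-8859-1 (5 bytes).
my $input_bytes = "\xE0b\xE7d\xE9";

# Decode: bytes in a known encoding become characters.
my $chars = decode('ISO-8859-1', $input_bytes);

# Encode: characters become bytes in the desired output encoding.
my $output_bytes = encode('UTF-8', $chars);

printf "%d bytes in, %d characters, %d bytes out\n",
    length $input_bytes, length $chars, length $output_bytes;
```

The character count stays at five throughout; only the byte counts differ between the two encodings.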
    25. 25. The Holy Grail
    26. 26. The Holy Grail • Can represent all encodings
    27. 27. The Holy Grail • Can represent all encodings • Has multibyte character support
    28. 28. The Holy Grail • Can represent all encodings • Has multibyte character support • for example, length() should count characters, not bytes
    29. 29. It doesn't work like that
    30. 30. use Encode;
    31. 31. use Encode; Only works in Perl 5.8 and above
    32. 32. use Encode; Only works in Perl 5.8 and above. Why the $£%^&*() are you using 5.6 ANYWAY?
    33. 33. use Encode; Only works in Perl 5.8 and above. There are solutions for 5.6 and even earlier. But they're HORRIBLE.
    34. 34. character-based internal representation
    35. 35. character-based internal representation: Perl has one!
    36. 36. character-based internal representation: Perl has one! Magic internal representation.
    37. 37. character-based internal representation: Perl has one! Magic internal representation. All string functions know about it.
    38. 38. character-based internal representation: Perl has one! Magic internal representation. All string functions know about it. It's encoding-agnostic.
    39. 39. character-based internal representation: Perl has one! Magic internal representation. All string functions know about it. It's encoding-agnostic. In fact....
    40. 40. IT'S UTF-8!
    41. 41. IT'S UTF-8! (almost)
    42. 42. Handling Encodings: input (àbçdé, bytes in some encoding or other) → decode [use Encode; $chars = decode($enc, $bytes);] → àbçdé (Perl's magic internal representation) → encode [use Encode; $bytes = encode($enc, $chars);] → output (àbçdé, bytes in desired encoding)
    43. 43. input (àbçdé, bytes in machine's 8-bit encoding) → àbçdé (bytes in machine's 8-bit encoding) → output (àbçdé, bytes in machine's 8-bit encoding)
    44. 44. I18N? What the £$%^&*('s that? input (àbçdé, bytes in machine's 8-bit encoding) → àbçdé (bytes in machine's 8-bit encoding) → output (àbçdé, bytes in machine's 8-bit encoding)
    45. 45. People are still writing Perl like it was Perl 4
    46. 46. People are still writing Perl like it was Perl 4
    47. 47. People are still writing Perl like it was Perl 4... and we have to support them.
    48. 48. People are still writing Perl like it was Perl 4... and we have to support them. Even though our string functions expect chars.
    49. 49. ???? Perl's magic internal representation
    50. 50. ???? Perl's magic internal representation: if
    51. 51. ???? Perl's magic internal representation: if all characters are representable in local machine's 8-bit charset, use that;
    52. 52. ???? Perl's magic internal representation: if all characters are representable in local machine's 8-bit charset, use that; else
    53. 53. ???? Perl's magic internal representation: if all characters are representable in local machine's 8-bit charset, use that; else use UTF-8
    54. 54. àbçdé (UTF-8 characters)
    55. 55. àbçdé (UTF-8 characters) → [use Encode; $bytes = encode($enc, $chars);]
    56. 56. àbçdé (UTF-8 characters) → [use Encode; $bytes = encode($enc, $chars);] → àbçdé (bytes in desired output encoding)
    57. 57. àbçdé (UTF-8 characters) → [use Encode; $bytes = encode($enc, $chars);] → àbçdé (bytes in desired output encoding). àbçdé (machine bytes)
    58. 58. àbçdé (UTF-8 characters) → [use Encode; $bytes = encode($enc, $chars);] → àbçdé (bytes in desired output encoding). àbçdé (machine bytes) → [use Encode; $bytes = encode($enc, $chars);]
    59. 59. àbçdé (UTF-8 characters) → [use Encode; $bytes = encode($enc, $chars);] → àbçdé (bytes in desired output encoding). àbçdé (machine bytes) → [use Encode; $bytes = encode($enc, $chars);] → àbçdé (bytes in desired output encoding)
    60. 60. UTF-8 characters
    61. 61. UTF-8 characters +
    62. 62. UTF-8 characters + àbçdé (machine bytes)
    63. 63. UTF-8 characters + àbçdé (machine bytes) = ?????
    64. 64. UTF-8 characters + àbçdé (machine bytes) = ?????
    65. 65. UTF-8 characters + àbçdé (machine bytes) = ?????. àbçdé (machine bytes) → promote
    66. 66. UTF-8 characters + àbçdé (machine bytes) = ?????. àbçdé (machine bytes) → promote → àbçdé (UTF-8 characters)
    67. 67. UTF-8 characters + àbçdé (machine bytes) = àbçdé (UTF-8 bytes). àbçdé (machine bytes) → promote → àbçdé (UTF-8 characters)
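One practical consequence of this promotion is double encoding when undecoded bytes are joined to a character string. The sketch below is an illustration under that assumption, not code from the talk; the sample strings are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $chars      = "\x{20AC}";                    # a decoded character string: EURO SIGN
my $utf8_bytes = encode('UTF-8', "\x{E9}");     # "é" as two undecoded bytes: 0xC3 0xA9

# Concatenation promotes each byte to a character of the same number,
# so the two bytes become U+00C3 and U+00A9 rather than U+00E9.
my $mixed = $chars . $utf8_bytes;

# Encoding the mixed string encodes those two stray characters again.
my $wrong = encode('UTF-8', $mixed);

# Decoding the bytes before joining keeps "é" as a single character.
my $right = encode('UTF-8', $chars . decode('UTF-8', $utf8_bytes));

printf "mixed: %d characters, wrong: %d bytes, right: %d bytes\n",
    length $mixed, length $wrong, length $right;
```

Perl raises no error in the mixed case, because it cannot know that the byte string was meant to be UTF-8.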
    68. 68. àbçdé (machine bytes)
    69. 69. àbçdé (machine bytes) → output
    70. 70. àbçdé (machine bytes) → output: Content-Encoding: UTF-8
    71. 71. àbçdé (machine bytes) → output: Content-Encoding: UTF-8 → bd ? ? ?
    72. 72. àbçdé (machine bytes) → output: Content-Encoding: UTF-8 → bd ? ? ?; Content-Encoding: ISO-8859-1
    73. 73. àbçdé (machine bytes) → output: Content-Encoding: UTF-8 → bd ? ? ?; Content-Encoding: ISO-8859-1 → àbçdé
    74. 74. àbçdé (machine bytes) → output: Content-Encoding: UTF-8 → bd ? ? ?; Content-Encoding: ISO-8859-1 → àbçdé. àbçdé (UTF-8 characters)
    75. 75. àbçdé (machine bytes) → output: Content-Encoding: UTF-8 → bd ? ? ?; Content-Encoding: ISO-8859-1 → àbçdé. àbçdé (UTF-8 characters) → output
    76. 76. àbçdé (machine bytes) → output: Content-Encoding: UTF-8 → bd ? ? ?; Content-Encoding: ISO-8859-1 → àbçdé. àbçdé (UTF-8 characters) → output: Content-Encoding: UTF-8
    77. 77. àbçdé (machine bytes) → output: Content-Encoding: UTF-8 → bd ? ? ?; Content-Encoding: ISO-8859-1 → àbçdé. àbçdé (UTF-8 characters) → output: Content-Encoding: UTF-8 → àbçdé
    78. 78. àbçdé (machine bytes) → output: Content-Encoding: UTF-8 → bd ? ? ?; Content-Encoding: ISO-8859-1 → àbçdé. àbçdé (UTF-8 characters) → output: Content-Encoding: UTF-8 → àbçdé; Content-Encoding: ISO-8859-1
    79. 79. àbçdé (machine bytes) → output: Content-Encoding: UTF-8 → bd ? ? ?; Content-Encoding: ISO-8859-1 → àbçdé. àbçdé (UTF-8 characters) → output: Content-Encoding: UTF-8 → àbçdé; Content-Encoding: ISO-8859-1 → àbçdé
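The "bd ? ? ?" case can be reproduced by decoding 8-bit bytes under the wrong label. This is a rough sketch of what a receiver does when it trusts an incorrect declaration; treating the machine's 8-bit encoding as ISO-8859-1 is an assumption made for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# "àbçdé" as ISO-8859-1 bytes, standing in for the machine's 8-bit encoding.
my $latin1_bytes = encode('ISO-8859-1', "\x{E0}b\x{E7}d\x{E9}");

# A receiver told the data is UTF-8 decodes it as such; the accented
# bytes are not valid UTF-8 and are replaced with U+FFFD.
my $misread = decode('UTF-8', $latin1_bytes);

# A receiver given the correct label recovers the original characters.
my $correct = decode('ISO-8859-1', $latin1_bytes);

my $replaced = () = $misread =~ /\x{FFFD}/g;    # count replacement characters
print "replacement characters in misread text: $replaced\n";
```

The plain ASCII letters survive either way, which is why this kind of bug often goes unnoticed until accented input arrives.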
    80. 80. ARRRGH!!!!
    81. 81. It gets worse.
    82. 82. You can't tell what you've actually got
    83. 83. You can't tell what you've actually got: utf8::is_utf8()
    84. 84. You can't tell what you've actually got: utf8::is_utf8() does not mean what you think it means
    85. 85. You can't tell what you've actually got
    86. 86. You can't tell what you've actually got • encoded bytes
    87. 87. You can't tell what you've actually got • encoded bytes: utf8::is_utf8() = false
    88. 88. You can't tell what you've actually got • encoded bytes: utf8::is_utf8() = false, EVEN IF they're UTF-8
    89. 89. You can't tell what you've actually got • encoded bytes: utf8::is_utf8() = false, EVEN IF they're UTF-8 • decoded UTF-8 chars
    90. 90. You can't tell what you've actually got • encoded bytes: utf8::is_utf8() = false, EVEN IF they're UTF-8 • decoded UTF-8 chars: utf8::is_utf8() = true
    91. 91. You can't tell what you've actually got • encoded bytes: utf8::is_utf8() = false, EVEN IF they're UTF-8 • decoded UTF-8 chars: utf8::is_utf8() = true • decoded machine bytes
    92. 92. You can't tell what you've actually got • encoded bytes: utf8::is_utf8() = false, EVEN IF they're UTF-8 • decoded UTF-8 chars: utf8::is_utf8() = true • decoded machine bytes: utf8::is_utf8() = false
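The three cases on this slide can be observed with a short script. This is a sketch only; the "machine bytes" case is approximated here by a string literal whose characters all fit in 8 bits, which is an assumption about what the slide means.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Characters that all fit in 8 bits: Perl stores these without the flag.
my $eight_bit_chars = "\x{E0}b\x{E7}d\x{E9}";

# The same text encoded to UTF-8 bytes: still no flag, even though it IS UTF-8.
my $utf8_bytes = encode('UTF-8', $eight_bit_chars);

# Those bytes decoded back to characters: now the flag is set.
my $decoded = decode('UTF-8', $utf8_bytes);

my %flag = (
    eight_bit_chars => utf8::is_utf8($eight_bit_chars) ? 1 : 0,
    utf8_bytes      => utf8::is_utf8($utf8_bytes)      ? 1 : 0,
    decoded         => utf8::is_utf8($decoded)         ? 1 : 0,
);

# Two strings holding the same characters compare equal despite different flags.
my $same_text = $eight_bit_chars eq $decoded ? 1 : 0;

print "$_: $flag{$_}\n" for sort keys %flag;
print "same text: $same_text\n";
```

The flag reports how Perl happens to store the string, not whether the string has been decoded, so it cannot be used to tell bytes from characters.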
    93. 93. The science bit
    94. 94. The science bit • Encode.pm: use Encode; $bytes = encode($enc, $chars);
    95. 95. The science bit • Encode.pm: use Encode; $bytes = encode($enc, $chars); • 3 argument form of open() - PerlIO layers: open(FILEHANDLE, ">:encoding(UTF-8)", $file);
    96. 96. The science bit • Encode.pm: use Encode; $bytes = encode($enc, $chars); • 3 argument form of open() - PerlIO layers: open(FILEHANDLE, ">:encoding(UTF-8)", $file); • binmode(FILEHANDLE,
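The PerlIO layer approach listed here can be sketched with a temporary file. The example uses lexical filehandles and File::Temp rather than the bareword FILEHANDLE shown on the slide; both choices are assumptions made to keep the sketch self-contained.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# A scratch file that is removed when the program exits.
my ($tmp_fh, $file) = tempfile(UNLINK => 1);
close $tmp_fh;

my $chars = "\x{20AC}100";    # EURO SIGN followed by "100": 4 characters

# 3 argument open() with an encoding layer: characters are encoded on write.
open(my $out, '>:encoding(UTF-8)', $file) or die "open for write: $!";
print {$out} $chars;
close $out or die "close: $!";

# Reading the file raw shows the bytes that actually reached the disk.
open(my $raw, '<:raw', $file) or die "open raw: $!";
my $bytes = do { local $/; <$raw> };
close $raw;

# binmode() adds the layer to an already open handle: bytes are decoded on read.
open(my $in, '<', $file) or die "open for read: $!";
binmode($in, ':encoding(UTF-8)') or die "binmode: $!";
my $read_back = do { local $/; <$in> };
close $in;

printf "%d bytes on disk, %d characters read back\n", length $bytes, length $read_back;
```

With the layers in place the program itself never calls encode() or decode(); the conversion happens at the filehandle boundary.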
    97. 97. utf8 vs UTF-8
    98. 98. utf8 vs UTF-8 • Encode.pm
    99. 99. utf8 vs UTF-8 • Encode.pm • utf8 = marks it as UTF-8 and hopes...
    100. 100. utf8 vs UTF-8 • Encode.pm • utf8 = marks it as UTF-8 and hopes... • UTF-8 = is actually valid UTF-8
    101. 101. utf8 vs UTF-8 • Encode.pm • utf8 = marks it as UTF-8 and hopes... • UTF-8 = is actually valid UTF-8 • PerlIO layers:
    102. 102. utf8 vs UTF-8 • Encode.pm • utf8 = marks it as UTF-8 and hopes... • UTF-8 = is actually valid UTF-8 • PerlIO layers: • :utf8
    103. 103. utf8 vs UTF-8 • Encode.pm • utf8 = marks it as UTF-8 and hopes... • UTF-8 = is actually valid UTF-8 • PerlIO layers: • :utf8 • :encoding(UTF-8)
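The difference between the two Encode names shows up when a string contains something that is not legal Unicode. The sketch below uses a code point above U+10FFFF as the test case; that choice is an assumption for illustration, and other illegal input may behave differently across Encode versions.

```perl
use strict;
use warnings;
use Encode qw(encode FB_CROAK LEAVE_SRC);

# A code point just past the end of the Unicode range.
my $beyond_unicode = chr(0x110000);

# FB_CROAK makes encode() die on input it rejects;
# LEAVE_SRC stops it from modifying the source string.
my $check = FB_CROAK() | LEAVE_SRC();

# Lax "utf8": Perl's own extended encoding, accepts the code point.
my $lax_ok = eval { encode('utf8', $beyond_unicode, $check); 1 } ? 1 : 0;

# Strict "UTF-8": only well-formed, in-range Unicode is accepted.
my $strict_ok = eval { encode('UTF-8', $beyond_unicode, $check); 1 } ? 1 : 0;

print "utf8 accepted it: $lax_ok, UTF-8 accepted it: $strict_ok\n";
```

For data that leaves the program, the strict name is usually the safer default, since other software may refuse what the lax form lets through.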
    104. 104. use utf8;
    105. 105. use utf8; • Does NOT do what you might think it does
    106. 106. use utf8; • Does NOT do what you might think it does • All it says is "my source code is UTF-8".
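The effect of the pragma on string literals can be seen by measuring the same literal with and without it. This sketch assumes the script file itself is saved as UTF-8; saved in any other encoding, the numbers will differ.

```perl
use strict;
use warnings;

my ($with_pragma, $without_pragma);

{
    use utf8;                         # literals in this block are read as UTF-8 text
    $with_pragma = length "€";        # one character
}

{
    no utf8;                          # literals in this block are the raw source bytes
    $without_pragma = length "€";     # the bytes of the UTF-8 encoded euro sign
}

print "with use utf8: $with_pragma, without: $without_pragma\n";
```

The pragma only changes how the source text is read. It does not touch input or output, so printing such a literal still needs an encoding layer or an explicit encode().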
    107. 107. Modules
    108. 108. Modules • It depends on the module:
    109. 109. Modules • It depends on the module: • CGI - $CGI::PARAM_UTF8=1;
    110. 110. Modules • It depends on the module: • CGI - $CGI::PARAM_UTF8=1; • LWP::UserAgent - ->decoded_content() method honours Content-Encoding:
    111. 111. Modules • It depends on the module: • CGI - $CGI::PARAM_UTF8=1; • LWP::UserAgent - ->decoded_content() method honours Content-Encoding: • DBI - mysql_enable_utf8 in DBI::connect()
    112. 112. Modules • It depends on the module: • CGI - $CGI::PARAM_UTF8=1; • LWP::UserAgent - ->decoded_content() method honours Content-Encoding: • DBI - mysql_enable_utf8 in DBI::connect() • XML::LibXML - looks at encoding,
    113. 113. In summary
    114. 114. In summary • decode bytes as soon as you get them:
    115. 115. In summary • decode bytes as soon as you get them: • decode(), binmode(STDIN), 3 arg open()
    116. 116. In summary • decode bytes as soon as you get them: • decode(), binmode(STDIN), 3 arg open() • encode characters just before you output:
    117. 117. In summary • decode bytes as soon as you get them: • decode(), binmode(STDIN), 3 arg open() • encode characters just before you output: • encode(), binmode(STDOUT), 3 arg open()
    118. 118. In summary • decode bytes as soon as you get them: • decode(), binmode(STDIN), 3 arg open() • encode characters just before you output: • encode(), binmode(STDOUT), 3 arg open() • keep track of whether your strings are
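The summary can be sketched as a small filter that decodes at the input edge, works on characters, and encodes at the output edge. In-memory filehandles stand in for STDIN and STDOUT here so the sketch runs on its own; the encodings and the sample data are assumptions for illustration.

```perl
use strict;
use warnings;

# Stand-in for STDIN: "àbçdé" plus a newline, as UTF-8 bytes.
my $in_bytes  = "\xC3\xA0b\xC3\xA7d\xC3\xA9\n";
# Stand-in for STDOUT: collects the bytes that would be written.
my $out_bytes = '';

open(my $in,  '<', \$in_bytes)  or die "open input: $!";
open(my $out, '>', \$out_bytes) or die "open output: $!";

# Decode bytes as soon as they arrive, encode characters just before they leave.
binmode($in,  ':encoding(UTF-8)')      or die "binmode input: $!";
binmode($out, ':encoding(ISO-8859-1)') or die "binmode output: $!";

while (my $line = <$in>) {
    chomp $line;
    # In between, everything is characters: reverse() flips letters, not bytes.
    print {$out} scalar(reverse $line), "\n";
}

close $in;
close $out or die "close output: $!";

printf "wrote %d bytes\n", length $out_bytes;
```

Reversing the undecoded UTF-8 bytes instead would have split the two-byte sequences and produced invalid output, which is the kind of failure the decode-early, encode-late rule avoids.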
    119. 119. NEVER EVER EVER rely on the encoding of Perl's internal representation
    120. 120. and...
    121. 121. ...there is NO SUCH THING as "plain text"
    122. 122. Handling Encodings: input (àbçdé, bytes in some encoding or other) → decode [use Encode; $chars = decode($enc, $bytes);] → àbçdé (Perl's magic internal representation) → encode [use Encode; $bytes = encode($enc, $chars);] → output (àbçdé, bytes in desired encoding)
    123. 123. The Holy Fail (thanks Joel!): input (àbçdé, bytes in some encoding or other) → decode [use Encode; $chars = decode($enc, $bytes);] → àbçdé (Perl's magic internal representation) → encode [use Encode; $bytes = encode($enc, $chars);] → output (àbçdé, bytes in desired encoding)
    124. 124. Questions?
