Perl And Unicode

    Perl And Unicode - Presentation Transcript

    3. Perl and Unicode Italian Perl Workshop 2009 Mike Whitaker, BBC/EnlightenedPerl.org
    4. The problem • Keeping track of input and output encodings • Not losing encoding data in the middle • Understanding the difference between characters and bytes
    15. Characters vs bytes • 2 characters: $ (U+0024) and € (U+20AC) • 4 bytes (UTF-8): 0x24 for $, 0xE2 0x82 0xAC for €
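
The character and byte counts on this slide can be checked directly. The following is a minimal sketch, assuming Perl 5.8 or later with the core Encode module; the variable names are illustrative and not from the talk.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Two characters: DOLLAR SIGN (U+0024) and EURO SIGN (U+20AC).
my $chars = "\x{24}\x{20AC}";

# Encoding turns characters into bytes. In UTF-8 the dollar sign
# takes one byte and the euro sign takes three.
my $bytes = encode('UTF-8', $chars);

my $n_chars = length $chars;    # length() of a character string counts characters
my $n_bytes = length $bytes;    # length() of an encoded string counts bytes
my $hex     = join ' ', map { sprintf '0x%02X', ord } split //, $bytes;

print "$n_chars characters, $n_bytes bytes: $hex\n";
# prints "2 characters, 4 bytes: 0x24 0xE2 0x82 0xAC"
```
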
    25. Handling Encodings • input: àbçdé, bytes in some input encoding or other • decode: use Encode; $chars = decode($enc, $bytes); • àbçdé, character-based internal representation • encode: use Encode; $bytes = encode($enc, $chars); • output: àbçdé, bytes in desired output encoding
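
The decode/encode flow in this diagram might look like the sketch below. It assumes the input happens to be ISO-8859-1 and the desired output is UTF-8; both encodings and all variable names are assumptions chosen for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# "àbçdé" as it would arrive in ISO-8859-1: five bytes.
my $input_bytes = "\x{E0}b\x{E7}d\x{E9}";

# Decode at the boundary: bytes in, characters out.
my $chars = decode('ISO-8859-1', $input_bytes);

# Encode at the other boundary: characters in, bytes out.
my $output_bytes = encode('UTF-8', $chars);

# Decoding the output again should give back the same characters.
my $round_trip = decode('UTF-8', $output_bytes);

printf "%d input bytes, %d characters, %d output bytes\n",
    length $input_bytes, length $chars, length $output_bytes;
# prints "5 input bytes, 5 characters, 8 output bytes"
```

The three accented letters need two bytes each in UTF-8, which is why the output is longer than the input even though the character count is unchanged.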
    29. The Holy Grail • Can represent all encodings • Has multibyte character support • for example, length() should count characters, not bytes
    30. It doesn't work like that
    33. use Encode; Only works in Perl 5.8 and above. Why the $£%^&*() are you using 5.6 ANYWAY?
    34. use Encode; Only works in Perl 5.8 and above. There are solutions for 5.6 and even earlier. But they're HORRIBLE.
    40. Character-based internal representation: Perl has one! Magic internal representation. All string functions know about it. It's encoding-agnostic. In fact....
    41. IT'S UTF-8!
    42. IT'S (almost) UTF-8!
    43. Handling Encodings • input: àbçdé, bytes in some input encoding or other • decode: use Encode; $chars = decode($enc, $bytes); • àbçdé, Perl's magic internal representation • encode: use Encode; $bytes = encode($enc, $chars); • output: àbçdé, bytes in desired output encoding
    45. I18N? What the £$%^&*('s that? • input: àbçdé, bytes in machine's 8-bit encoding • àbçdé, bytes in machine's 8-bit encoding • output: àbçdé, bytes in machine's 8-bit encoding
    49. People are still writing Perl like it was Perl 4 ...and we have to support them. Even though our string functions expect chars.
    54. ???? Perl's magic internal representation: if all characters are representable in local machine's 8-bit charset, use that; else use UTF-8
    60. • àbçdé UTF-8 characters: use Encode; $bytes = encode($enc, $chars); gives àbçdé bytes in desired output encoding • àbçdé machine bytes: use Encode; $bytes = encode($enc, $chars); gives àbçdé bytes in desired output encoding
    68. UTF-8 characters + àbçdé machine bytes = ????? • promote: àbçdé machine bytes become àbçdé UTF-8 characters • result: àbçdé UTF-8 bytes
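
The promotion step can be observed by concatenating the two kinds of string. This is a sketch that assumes the "machine" 8-bit charset is ISO-8859-1, which is how Perl interprets unflagged bytes when it upgrades them; the variable names are illustrative.

```perl
use strict;
use warnings;

my $wide    = "\x{20AC}";                # euro sign: cannot be stored 8-bit
my $machine = "\x{E0}b\x{E7}d\x{E9}";    # "àbçdé", every code point below 256

# Concatenation promotes the 8-bit string, reading each byte as the
# Latin-1 character with the same number.
my $joined       = $wide . $machine;
my $joined_len   = length $joined;
my $tail_matches = substr($joined, 1) eq $machine ? 1 : 0;

# The pitfall: bytes that are already UTF-8 encoded but were never decoded
# get the same byte-by-byte treatment, so one "à" turns into two characters.
my $utf8_bytes = "\x{C3}\x{A0}";
my $mixed      = $wide . $utf8_bytes;
my $mixed_len  = length $mixed;

print "joined: $joined_len characters, tail intact: $tail_matches\n";
print "mixed with undecoded UTF-8: $mixed_len characters\n";
# prints "joined: 6 characters, tail intact: 1"
# prints "mixed with undecoded UTF-8: 3 characters"
```

Promotion is harmless when the 8-bit string really holds Latin-1 text. It silently produces double-encoded output when the 8-bit string holds undecoded UTF-8.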
    80. • àbçdé machine bytes, output: labelled charset=UTF-8 (in Content-Type) displays bd ? ? ?; labelled charset=ISO-8859-1 displays àbçdé • àbçdé UTF-8 characters, output: labelled charset=UTF-8 displays àbçdé; labelled charset=ISO-8859-1 displays àbçdé
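
What a reader sees when the declared encoding and the actual bytes disagree can be simulated with decode(). This is a rough sketch, not the talk's own demonstration: the decode() calls stand in for a browser that trusts the label, and the variable names are assumptions.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $chars = "\x{E0}b\x{E7}d\x{E9}";    # "àbçdé"

my $latin1_bytes = encode('ISO-8859-1', $chars);    # 5 bytes
my $utf8_bytes   = encode('UTF-8',      $chars);    # 8 bytes

# Latin-1 bytes read under a UTF-8 label: the accented letters are not
# valid UTF-8, so they are replaced with U+FFFD REPLACEMENT CHARACTER.
my $misread = decode('UTF-8', $latin1_bytes);

# UTF-8 bytes read under a Latin-1 label: every byte becomes a character,
# so each accented letter shows up as two unrelated characters.
my $mojibake = decode('ISO-8859-1', $utf8_bytes);

my $has_replacement = $misread =~ /\x{FFFD}/ ? 1 : 0;

printf "latin1=%d bytes, utf8=%d bytes, mojibake=%d characters\n",
    length $latin1_bytes, length $utf8_bytes, length $mojibake;
# prints "latin1=5 bytes, utf8=8 bytes, mojibake=8 characters"
```
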
    81. ARRRGH!!!!
    82. It gets worse.
    85. You can't tell what you've actually got utf8::is_utf8() does not mean what you think it means
    93. You can't tell what you've actually got • encoded bytes: utf8::is_utf8() = false, EVEN IF they're UTF-8 • decoded UTF-8 chars: utf8::is_utf8() = true • decoded machine bytes: utf8::is_utf8() = false
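
The three cases on this slide can be reproduced as follows. This is a sketch with illustrative variable names; utf8::is_utf8() is always available and needs no `use utf8`.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# "àbçdé" built from code points below 256: Perl stores it 8-bit.
my $machine = "\x{E0}b\x{E7}d\x{E9}";

# Encoded bytes: valid UTF-8 data, but only bytes as far as Perl knows.
my $encoded = encode('UTF-8', $machine);

# Decoded characters: the same text, stored internally as UTF-8.
my $decoded = decode('UTF-8', $encoded);

my $flag_encoded = utf8::is_utf8($encoded) ? 1 : 0;
my $flag_decoded = utf8::is_utf8($decoded) ? 1 : 0;
my $flag_machine = utf8::is_utf8($machine) ? 1 : 0;

# The flag differs, yet the two character strings are equal.
my $same_text = $decoded eq $machine ? 1 : 0;

print "encoded=$flag_encoded decoded=$flag_decoded machine=$flag_machine same=$same_text\n";
# prints "encoded=0 decoded=1 machine=0 same=1"
```

The flag describes how Perl happens to store the string, not whether the string holds bytes or characters, so it cannot be used to decide whether data still needs decoding.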
    97. The science bit • Encode.pm: use Encode; $bytes = encode($enc, $chars); • 3 argument form of open() - PerlIO layers: open(FILEHANDLE, ">:encoding(UTF-8)", $file); • binmode(FILEHANDLE,
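
The PerlIO layer approach could be used like this. It is a minimal sketch that writes to a temporary file created with the core File::Temp module; the file and the variable names are assumptions for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# A scratch file that is removed when the program exits.
my ($tmp_fh, $file) = tempfile(UNLINK => 1);
close $tmp_fh;

my $chars = "\x{24}\x{20AC}";    # "$" and the euro sign

# Writing through the layer encodes characters to UTF-8 bytes on the way out.
open(my $out, '>:encoding(UTF-8)', $file) or die "Cannot write $file: $!";
print {$out} $chars;
close $out or die "Cannot close $file: $!";

my $raw_size = -s $file;    # size on disk, in bytes

# Reading through the layer decodes bytes back to characters on the way in.
open(my $in, '<:encoding(UTF-8)', $file) or die "Cannot read $file: $!";
my $read_back = do { local $/; <$in> };
close $in;

# binmode() attaches a layer to a handle that is already open.
binmode(STDOUT, ':encoding(UTF-8)');
print "$read_back ($raw_size bytes on disk)\n";
```
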
    104. 'utf8' vs 'UTF-8' • Encode.pm • utf8 = marks it as UTF-8 and hopes... • UTF-8 = is actually valid UTF-8 • PerlIO layers: • :utf8 • :encoding(UTF-8)
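
Encode treats the two names as different encodings, which can be checked with find_encoding(). The sketch below also feeds the strict variant an encoded surrogate (U+D800), which is not valid UTF-8; the choice of test bytes and the variable names are assumptions for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode find_encoding);

# The two spellings resolve to two different encoding objects.
my $strict_name = find_encoding('UTF-8')->name;
my $lax_name    = find_encoding('utf8')->name;

# The bytes ED A0 80 would encode the surrogate U+D800.
my $surrogate_bytes = "\x{ED}\x{A0}\x{80}";

# FB_CROAK makes decode() die on invalid input instead of substituting;
# LEAVE_SRC keeps it from modifying $surrogate_bytes.
my $strict_accepts = eval {
    decode('UTF-8', $surrogate_bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    1;
} ? 1 : 0;

print "UTF-8 resolves to $strict_name, utf8 resolves to $lax_name\n";
print "strict decoder accepts a surrogate: $strict_accepts\n";
# prints "UTF-8 resolves to utf-8-strict, utf8 resolves to utf8"
# prints "strict decoder accepts a surrogate: 0"
```

For data from outside the program, the strict spelling is usually the safer default, both as an Encode name and as the :encoding(UTF-8) layer.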
    107. use utf8; • Does NOT do what you might think it does • All it says is 'my source code is UTF-8'.
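
The effect of the pragma is visible on string literals only. This sketch assumes the script itself is saved as UTF-8; the variable names are illustrative.

```perl
use strict;
use warnings;

my ($with_pragma, $without_pragma);

{
    use utf8;    # literals in this block are decoded from the UTF-8 source
    $with_pragma = length "€";
}

{
    no utf8;     # literals in this block stay as the raw bytes in the file
    $without_pragma = length "€";
}

print "with use utf8: $with_pragma, without: $without_pragma\n";
# prints "with use utf8: 1, without: 3"
```

The pragma says nothing about input or output: data read from files, sockets or databases still has to be decoded, and output still has to be encoded.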
    113. Modules • It depends on the module: • CGI - does nothing • LWP::UserAgent - does nothing • DBI - mysql_enable_utf8 in DBI::connect() • XML::LibXML - looks at encoding, decode()'s
    119. In summary • decode bytes as soon as you get them: • decode(), binmode(STDIN), 3 arg open() • encode characters just before you output: • encode(), binmode(STDOUT), 3 arg open() • keep track of whether your strings are
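
Put together, the decode-early, encode-late rule might look like the sketch below. It uses in-memory filehandles so it runs without external files; the ISO-8859-1 input, the UTF-8 output and the variable names are assumptions for illustration.

```perl
use strict;
use warnings;

# Pretend input: "àbçdé" arriving as ISO-8859-1 bytes.
my $input_bytes = "\x{E0}b\x{E7}d\x{E9}";

# Decode as soon as the data arrives, here with a layer on the input handle.
open(my $in, '<:encoding(ISO-8859-1)', \$input_bytes)
    or die "Cannot open input: $!";
my $chars = do { local $/; <$in> };
close $in;

# In the middle, work on characters only.
my $upper = uc $chars;

# Encode just before the data leaves, here with a layer on the output handle.
my $output_bytes = '';
open(my $out, '>:encoding(UTF-8)', \$output_bytes)
    or die "Cannot open output: $!";
print {$out} $upper;
close $out or die "Cannot close output: $!";

printf "%d characters in, %d bytes out\n", length $chars, length $output_bytes;
# prints "5 characters in, 8 bytes out"
```

Because the string was decoded before uc() ran, the accented letters are uppercased as characters rather than left alone as unknown bytes.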
    120. NEVER EVER EVER rely on the encoding of Perl's internal representation
    121. and...
    122. ...there is NO SUCH THING as "plain text"
    123. Handling Encodings • input: àbçdé, bytes in some input encoding or other • decode: use Encode; $chars = decode($enc, $bytes); • àbçdé, Perl's magic internal representation • encode: use Encode; $bytes = encode($enc, $chars); • output: àbçdé, bytes in desired output encoding
    124. The Holy Fail (thanks Joel!) • input: àbçdé, bytes in some input encoding or other • decode: use Encode; $chars = decode($enc, $bytes); • àbçdé, Perl's magic internal representation • encode: use Encode; $bytes = encode($enc, $chars); • output: àbçdé, bytes in desired output encoding
    125. Questions?
