Perl and Unicode
Mike Whitaker, BBC/EnlightenedPerl.org
The problem

• Keeping track of input and output
  encodings
• Not losing encoding data in the middle
• Understanding the difference between
  characters and bytes
Characters vs bytes

2 characters      $        €
                  U+0024   U+20AC

4 bytes (UTF-8)   0x24     0xE2 0x82 0xAC
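
The slide's count can be reproduced in a few lines of Perl. This is a minimal sketch, not part of the original talk; the variable names are illustrative. It builds the two characters from their code points, encodes them as UTF-8, and asks length() about each string.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Two characters: DOLLAR SIGN (U+0024) and EURO SIGN (U+20AC).
my $chars = "\x{24}\x{20AC}";

# Encoding turns the characters into a string of UTF-8 bytes.
my $bytes = encode('UTF-8', $chars);

# length() counts characters in $chars and bytes in $bytes.
my $char_count = length $chars;
my $byte_count = length $bytes;

# Hex dump of the encoded bytes, one entry per byte.
my $hex = join ' ', map { sprintf '0x%02X', ord } split //, $bytes;

print "$char_count characters, $byte_count bytes: $hex\n";
```

The same length() call gives two different answers because the two strings hold different things: one holds characters, the other holds bytes.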
Handling Encodings

input:    àbçdé (bytes in some encoding or other)
             |
             |  decode:  use Encode; $chars = decode($enc, $bytes);
             v
          àbçdé (character-based internal representation)
             |
             |  encode:  use Encode; $bytes = encode($enc, $chars);
             v
output:   àbçdé (bytes in desired encoding)
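
A sketch of the decode/encode pipeline above. The choice of ISO-8859-1 for input and UTF-8 for output is an assumption made for illustration; the slide leaves $enc open.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Input: "àbçdé" as ISO-8859-1 bytes (one byte per character).
my $input_bytes = "\xE0b\xE7d\xE9";

# Decode as soon as the bytes arrive.
my $chars = decode('ISO-8859-1', $input_bytes);

# Work on characters: uc() maps U+00E0 (à) to U+00C0 (À), and so on.
my $shout = uc $chars;

# Encode just before output, here as UTF-8.
my $output_bytes = encode('UTF-8', $shout);

printf "%d bytes in, %d characters, %d bytes out\n",
    length $input_bytes, length $chars, length $output_bytes;
```

The byte counts differ between input and output because the three accented letters need two bytes each in UTF-8, while the character count in the middle stays at five.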
The Holy Grail

•   Can represent all
    encodings

•   Has multibyte character
    support

    •   for example, length()
        should count
        characters, not bytes
It doesn't work like that
use Encode;
Only works in Perl 5.8 and above
(Why the $£%^&*() are you using 5.6 ANYWAY?)

There are solutions for 5.6 and even earlier. But they're HORRIBLE.
character-based internal representation: Perl has one!

Magic internal representation.
All string functions know about it.
It's encoding-agnostic.

In fact....
IT'S UTF-8!
...almost
Handling Encodings

input:    àbçdé (bytes in some encoding or other)
             |
             |  decode:  use Encode; $chars = decode($enc, $bytes);
             v
          àbçdé (Perl's magic internal representation)
             |
             |  encode:  use Encode; $bytes = encode($enc, $chars);
             v
output:   àbçdé (bytes in desired encoding)
I18N? What the £$%^&*('s that?

input:    àbçdé (bytes in machine's 8bit encoding)
             |
             v
          àbçdé (bytes in machine's 8bit encoding)
             |
             v
output:   àbçdé (bytes in machine's 8bit encoding)
People are still writing Perl like it was Perl 4
...and we have to support them.

Even though our string functions expect chars.
????

Perl's magic internal representation

if
    all characters are representable in local machine's 8 bit charset, use that;
else
    use UTF-8
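
The two storage forms can be observed, though not relied on, with the built-in utf8:: functions. The sketch below assumes a platform where the 8 bit charset behaves like ISO-8859-1, which is the usual case; the variable names are illustrative. It forces a copy of a string into the UTF-8 form and checks that nothing visible to string functions has changed.

```perl
use strict;
use warnings;

# "àbçdé" built from code points below 256: Perl can store it one byte per character.
my $compact = "\xE0b\xE7d\xE9";

# A copy forced into the UTF-8 internal form.
my $wide = $compact;
utf8::upgrade($wide);

# String functions see the same five characters either way.
my $same_string = $compact eq $wide ? 1 : 0;
my $same_length = length($compact) == length($wide) ? 1 : 0;

# Only the internal storage flag differs.
my $compact_flag = utf8::is_utf8($compact) ? 1 : 0;
my $wide_flag    = utf8::is_utf8($wide)    ? 1 : 0;

print "equal: $same_string, same length: $same_length, ",
      "flags: $compact_flag vs $wide_flag\n";
```

Because the two copies compare equal, code that branches on the storage form is branching on something that carries no meaning for the text itself.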
àbçdé (UTF-8 characters)
   |
   |  use Encode; $bytes = encode($enc, $chars);
   v
àbçdé (bytes in desired encoding)  ->  output

àbçdé (machine bytes)
   |
   |  use Encode; $bytes = encode($enc, $chars);
   v
àbçdé (bytes in desired encoding)  ->  output
UTF-8 characters  +  àbçdé (machine bytes)  =  àbçdé (UTF-8 bytes)

àbçdé (machine bytes)  -- promote -->  àbçdé (UTF-8 characters)
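
The promotion step can be seen by concatenating a character string with a byte string. This is a sketch assuming the 8 bit charset is ISO-8859-1; it also shows what happens when the byte string is really UTF-8 encoded data that was never decoded.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $euro_chars   = "\x{20AC}";   # a character string holding the euro sign
my $eacute_bytes = "\xE9";       # one byte: é in ISO-8859-1

# Concatenation promotes the byte string, reading each byte as one code point.
my $joined = $euro_chars . $eacute_bytes;          # 2 characters: €, é

# The same promotion applied to UTF-8 *encoded* bytes gives mojibake.
my $utf8_bytes = "\xC3\xA9";                       # é encoded as UTF-8, two bytes
my $garbled    = $euro_chars . $utf8_bytes;        # 3 characters: €, Ã, ©

# Decoding before concatenating gives the intended text.
my $fixed = $euro_chars . decode('UTF-8', $utf8_bytes);   # 2 characters: €, é

printf "joined=%d garbled=%d fixed=%d\n",
    length $joined, length $garbled, length $fixed;
```

Perl cannot know that $utf8_bytes was meant as UTF-8, so it promotes byte by byte; only an explicit decode() tells it otherwise.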
àbçdé (machine bytes)  ->  output
    Content-Type charset=UTF-8:        ?b?d?
    Content-Type charset=ISO-8859-1:   àbçdé

àbçdé (UTF-8 characters)  ->  output
    Content-Type charset=UTF-8:        àbçdé
    Content-Type charset=ISO-8859-1:   àbçdé
ARRRGH!!!!
It gets worse.
You can't tell what you've actually got

utf8::is_utf8() does not mean what you think it means
You can't tell what you've actually got

encoded bytes           utf8::is_utf8() = false   (EVEN IF they're UTF-8)
decoded UTF-8 chars     utf8::is_utf8() = true
decoded machine bytes   utf8::is_utf8() = false
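
The three rows of the table can be checked directly. This is a sketch with made-up sample strings, not code from the talk; "machine bytes" is modelled here as a string literal whose code points are all below 256.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $decoded = "\x{20AC}uro";               # character string containing U+20AC
my $encoded = encode('UTF-8', $decoded);   # the same text as UTF-8 bytes
my $native  = "\xE0b\xE7d\xE9";            # àbçdé, code points below 256

my %flag = (
    encoded => utf8::is_utf8($encoded) ? 'true' : 'false',
    decoded => utf8::is_utf8($decoded) ? 'true' : 'false',
    native  => utf8::is_utf8($native)  ? 'true' : 'false',
);

printf "%-8s utf8::is_utf8() = %s\n", $_, $flag{$_}
    for qw(encoded decoded native);
```

The flag reports how Perl happens to store the string, not whether the string holds bytes or characters, which is why it answers false both for encoded UTF-8 data and for perfectly good text.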
The science bit
• Encode.pm
  use Encode; $bytes = encode($enc, $chars);
• 3 argument form of open() - PerlIO layers
  open(FILEHANDLE, ">:encoding(UTF-8)", $file);
• binmode(FILEHANDLE,
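
A sketch of the two filehandle approaches. The binmode() call on the slide is cut off, so the layer argument used below is an assumption; the temporary file and variable names are also illustrative.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# A scratch file that is removed when the program exits.
my ($tmp_fh, $file) = tempfile(UNLINK => 1);
close $tmp_fh;

# Writing: the layer encodes characters to UTF-8 bytes on the way out.
open(my $out, '>:encoding(UTF-8)', $file) or die "Cannot write $file: $!";
print {$out} "\x{20AC}";
close $out or die "Cannot close $file: $!";

my $size_on_disk = -s $file;    # size in bytes

# Reading: open the file plainly, then add the decoding layer with binmode().
open(my $in, '<', $file) or die "Cannot read $file: $!";
binmode($in, ':encoding(UTF-8)');
my $text = <$in>;
close $in;

printf "%d bytes on disk, %d character read\n", $size_on_disk, length $text;
```

With a layer on the handle there is no explicit encode() or decode() call in the program: the conversion happens at the boundary, which is where the deck says it belongs.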
'utf8' vs 'UTF-8'
• Encode.pm
 • utf8 = marks it as UTF-8 and hopes...
 • UTF-8 = is actually valid UTF-8
• PerlIO layers:
 • :utf8
 • :encoding(UTF-8)
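
For Encode.pm the difference can be probed as follows. This is a sketch: the sample bytes are an assumption chosen for illustration (they spell out the surrogate U+D800 in UTF-8 form, which strict UTF-8 does not allow), and only the strict side is exercised.

```perl
use strict;
use warnings;
use Encode qw(decode find_encoding);

# The two names resolve to two different Encode encodings.
my $strict_name = find_encoding('UTF-8')->name;
my $lax_name    = find_encoding('utf8')->name;

# Bytes shaped like UTF-8 that encode the surrogate U+D800.
my $suspect = "\xED\xA0\x80";

# Ask the strict encoding to die on bad input instead of substituting.
my $strict_accepted = eval {
    decode('UTF-8', $suspect, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    1;
} ? 1 : 0;

print "UTF-8 is '$strict_name', utf8 is '$lax_name', ",
      "strict accepted the bytes: $strict_accepted\n";
```

The lax 'utf8' name is Perl's own looser notion of the format, so 'UTF-8' is the safer choice for data that arrives from outside the program.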
use utf8;


• Does NOT do what you might think it
  does
• All it says is 'my source code is UTF-8'.
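
A sketch of what the pragma changes, assuming this source file is itself saved as UTF-8. The same string literal is measured once inside the pragma's scope and once outside it.

```perl
use strict;
use warnings;

my ($with_pragma, $without_pragma);

{
    use utf8;    # literals in this block are read as UTF-8 encoded characters
    $with_pragma = length "é";
}

{
    no utf8;     # literals in this block are read as raw bytes
    $without_pragma = length "é";
}

print "with use utf8: $with_pragma, without: $without_pragma\n";
```

Nothing here touches input or output: the pragma only affects how Perl reads the program text, so filehandles still need their own layers or explicit encode() and decode() calls.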
Modules
• It depends on the module:
 • CGI - $CGI::PARAM_UTF8=1;
 • LWP::UserAgent - the response's ->decoded_content() method honours
   Content-Encoding: and the Content-Type charset
 • DBI (DBD::mysql) - mysql_enable_utf8 in
   DBI->connect()
 • XML::LibXML - looks at encoding,
In summary
• decode bytes as soon as you get them:
 • decode(), binmode(STDIN), 3 arg
    open()
• encode characters just before you output:
 • encode(), binmode(STDOUT), 3 arg
    open()
• keep track of whether your strings are
NEVER EVER EVER rely on the encoding of Perl's internal representation
and...
...there is NO SUCH THING as "plain text"
The Holy Fail (thanks Joel!)

input:    àbçdé (bytes in some encoding or other)
             |
             |  decode:  use Encode; $chars = decode($enc, $bytes);
             v
          àbçdé (Perl's magic internal representation)
             |
             |  encode:  use Encode; $bytes = encode($enc, $chars);
             v
output:   àbçdé (bytes in desired encoding)
Questions?
