Perl and Unicode
Mike Whitaker, BBC/EnlightenedPerl.org
The problem

• Keeping track of input and output
  encodings
• Not losing encoding data in the middle
• Understanding the difference between
  characters and bytes
Characters vs bytes

2 characters:      $        €
                   U+0024   U+20AC

4 bytes (UTF-8):   0x24     0xE2 0x82 0xAC
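A minimal sketch of the same count in Perl, assuming the Encode module that the later slides introduce. The variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Two characters: U+0024 ($) and U+20AC (the euro sign), written as escapes
# so the example does not depend on the encoding of this source file.
my $chars = "\x{24}\x{20AC}";

# The same text as UTF-8 bytes.
my $bytes = encode('UTF-8', $chars);

my $char_count = length $chars;    # length() on characters
my $byte_count = length $bytes;    # length() on encoded bytes
my $hex = join ' ', map { sprintf '0x%02X', ord($_) } split //, $bytes;

print "$char_count characters, $byte_count bytes: $hex\n";
# prints "2 characters, 4 bytes: 0x24 0xE2 0x82 0xAC"
```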
Handling Encodings

input:   àbçdé   bytes in some encoding or other
           |
           |  decode:  use Encode; $chars = decode($enc, $bytes);
           v
         àbçdé   character-based internal representation
           |
           |  encode:  use Encode; $bytes = encode($enc, $chars);
           v
output:  àbçdé   bytes in desired encoding
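The decode/encode round trip can be sketched as below. The choice of ISO-8859-1 for input and UTF-8 for output is an assumption for illustration; the slides leave `$enc` open.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# "àbçdé" as ISO-8859-1 bytes (assumed input encoding for this sketch).
my $input_bytes = "\xE0b\xE7d\xE9";

# Bytes in -> characters.
my $chars = decode('ISO-8859-1', $input_bytes);

# Characters -> bytes out, in the encoding we want to emit.
my $output_bytes = encode('UTF-8', $chars);

printf "%d input bytes, %d characters, %d output bytes\n",
    length($input_bytes), length($chars), length($output_bytes);
# prints "5 input bytes, 5 characters, 8 output bytes"
```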
The Holy Grail

• Can represent all encodings
• Has multibyte character support
  • for example, length() should count characters, not bytes
It doesn't work like that
use Encode;
Only works in Perl 5.8 and above.
(Why the $£%^&*() are you using 5.6 ANYWAY?)

There are solutions for 5.6 and even earlier. But they're HORRIBLE.
Character-based internal representation: Perl has one!

Magic internal representation.
All string functions know about it.
It's encoding-agnostic.

In fact....
IT'S UTF-8!
IT'S almost UTF-8!
Handling Encodings

input:   àbçdé   bytes in some encoding or other
           |
           |  decode:  use Encode; $chars = decode($enc, $bytes);
           v
         àbçdé   Perl's magic internal representation
           |
           |  encode:  use Encode; $bytes = encode($enc, $chars);
           v
output:  àbçdé   bytes in desired encoding
I18N? What the £$%^&*('s that?

input:   àbçdé   bytes in machine's 8bit encoding
           |
           v
         àbçdé   bytes in machine's 8bit encoding
           |
           v
output:  àbçdé   bytes in machine's 8bit encoding
People are still writing Perl like it was Perl 4
...and we have to support them.

Even though our string functions expect chars.
????

Perl's magic internal representation:

if all characters are representable in local machine's 8 bit charset, use that;
else use UTF-8
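A small sketch of the two storage forms, using the built-in `utf8::upgrade` and `utf8::downgrade` functions. On ASCII platforms the 8-bit form behaves as ISO-8859-1, which is what this example assumes. The point is that the stored form changes while the string, as far as Perl's string functions are concerned, does not.

```perl
use strict;
use warnings;

my $native   = "\xE9";        # é: fits in the 8-bit form
my $upgraded = $native;
utf8::upgrade($upgraded);     # switch the internal storage to UTF-8

# Same characters, same length, whatever the storage.
my $same_text   = ($native eq $upgraded)                 ? 1 : 0;
my $same_length = (length($native) == length($upgraded)) ? 1 : 0;

# A character above 0xFF cannot be stored in the 8-bit form.
my $wide = "\x{20AC}";
my $copy = $wide;
my $fits_in_8_bits = utf8::downgrade($copy, 1) ? 1 : 0;   # 1 = fail quietly

print "same text: $same_text, same length: $same_length, ",
      "euro fits in 8 bits: $fits_in_8_bits\n";
# prints "same text: 1, same length: 1, euro fits in 8 bits: 0"
```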
àbçdé (UTF-8 characters)  -->  use Encode; $bytes = encode($enc, $chars);  -->  àbçdé (bytes in desired encoding)  -->  output

àbçdé (machine bytes)     -->  use Encode; $bytes = encode($enc, $chars);  -->  àbçdé (bytes in desired encoding)  -->  output
UTF-8 characters + machine bytes = ?????

àbçdé (machine bytes)  --promote-->  àbçdé (UTF-8 characters)

Result: àbçdé (UTF-8 bytes)
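A sketch of what promotion does on concatenation, and of how it goes wrong when the "machine bytes" are really undecoded UTF-8. The sample strings are assumptions chosen for illustration.

```perl
use strict;
use warnings;

my $chars   = "\x{20AC}";     # euro sign, a character string
my $machine = "\xE9";         # é as a single 8-bit character
my $encoded = "\xC3\xA9";     # é as UTF-8 bytes that were never decoded

# Promotion keeps the 8-bit character as one character.
my $joined = $chars . $machine;
my $joined_length = length $joined;
my $second_char   = ord substr($joined, 1, 1);

# Undecoded UTF-8 bytes are promoted one byte at a time, so é becomes
# two separate characters (0xC3 and 0xA9) instead of one.
my $mixed = $chars . $encoded;
my $mixed_length = length $mixed;

printf "joined: %d chars (second is U+%04X); mixed: %d chars\n",
    $joined_length, $second_char, $mixed_length;
# prints "joined: 2 chars (second is U+00E9); mixed: 3 chars"
```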
Held as                   Declared charset   Reader sees   (output)
àbçdé machine bytes       UTF-8              ?b?d?
àbçdé machine bytes       ISO-8859-1         àbçdé
àbçdé UTF-8 characters    UTF-8              àbçdé
àbçdé UTF-8 characters    ISO-8859-1         àbçdé
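The broken case can be reproduced with Encode alone: a sketch, assuming the reader is told the data is UTF-8 but is handed ISO-8859-1 bytes. With no CHECK argument, `decode` substitutes U+FFFD for malformed input, which is the "?" a browser would show.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $chars = "\xE0b\xE7d\xE9";                      # àbçdé as characters

my $latin1_bytes = encode('ISO-8859-1', $chars);   # 5 bytes
my $utf8_bytes   = encode('UTF-8', $chars);        # 8 bytes

# Declared as UTF-8, but the bytes are ISO-8859-1: the accented
# characters are not valid UTF-8 sequences and get replaced.
my $misread = decode('UTF-8', $latin1_bytes);
my $replacements = () = $misread =~ /\x{FFFD}/g;

# Declared as UTF-8 and actually UTF-8: round-trips cleanly.
my $round_trip_ok = (decode('UTF-8', $utf8_bytes) eq $chars) ? 1 : 0;

print "replacement characters seen: ", ($replacements ? "yes" : "no"),
      "; clean round trip: $round_trip_ok\n";
```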
ARRRGH!!!!
It gets worse.
You can't tell what you've actually got

utf8::is_utf8() does not mean what you think it means
You can't tell what you've actually got

encoded bytes:           utf8::is_utf8() = false   (EVEN IF they're UTF-8)
decoded UTF-8 chars:     utf8::is_utf8() = true
decoded machine bytes:   utf8::is_utf8() = false
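The three cases can be checked directly. This is a sketch: the "machine bytes" case is modelled here by a plain string literal that Perl holds in its 8-bit form, which is an assumption about what the slide means.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $encoded = encode('UTF-8', "\x{20AC}");   # UTF-8 bytes for the euro sign
my $decoded = decode('UTF-8', $encoded);     # the euro sign as a character
my $native  = "\xE9";                        # é held in the 8-bit form

my @flags = map { utf8::is_utf8($_) ? 'true' : 'false' }
            $encoded, $decoded, $native;

print "encoded bytes: $flags[0], decoded chars: $flags[1], ",
      "8-bit string: $flags[2]\n";
# prints "encoded bytes: false, decoded chars: true, 8-bit string: false"
```

The flag only reports how Perl is storing the string, so it cannot tell encoded bytes from 8-bit characters.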
The science bit
• Encode.pm
  use Encode; $bytes = encode($enc, $chars);
• 3 argument form of open() - PerlIO layers
  open(FILEHANDLE, ">:encoding(UTF-8)", $file);
• binmode(FILEHANDLE,
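A sketch of the PerlIO layer approach, using lexical filehandles and a temporary file from File::Temp (both are choices made for this example, not taken from the slides). The layer encodes on write and decodes on read, so the program only ever handles characters.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my $chars = "\x{24}\x{20AC}";              # $ and the euro sign

# Scratch file for the demonstration; removed when the program exits.
my ($tmp_fh, $file) = tempfile(UNLINK => 1);
close $tmp_fh;

# Write characters; the layer turns them into UTF-8 bytes.
open(my $out, '>:encoding(UTF-8)', $file) or die "open for write: $!";
print {$out} $chars;
close $out or die "close: $!";

my $size_on_disk = -s $file;               # bytes, not characters

# Read bytes; the layer turns them back into characters.
open(my $in, '<:encoding(UTF-8)', $file) or die "open for read: $!";
my $read_back = do { local $/; <$in> };
close $in;

# binmode() applies a layer to a handle that is already open.
binmode(STDOUT, ':encoding(UTF-8)');
print "$size_on_disk bytes on disk, ", length($read_back),
      " characters read back\n";
# prints "4 bytes on disk, 2 characters read back"
```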
'utf8' vs 'UTF-8'
• Encode.pm
 • utf8 = marks it as UTF-8 and hopes...
 • UTF-8 = is actually valid UTF-8
• PerlIO layers:
 • :utf8
 • :encoding(UTF-8)
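A sketch of the difference as Encode sees it. The byte sequence used is an encoded surrogate (U+D800), which is not valid UTF-8; it is chosen here purely as an example of input that the strict encoding should reject.

```perl
use strict;
use warnings;
use Encode qw(decode find_encoding);

# The two spellings resolve to two different encoding objects.
my $strict_name = find_encoding('UTF-8')->name;
my $lax_name    = find_encoding('utf8')->name;

# An encoded surrogate: shaped like UTF-8, but not valid UTF-8.
my $bad = "\xED\xA0\x80";

# FB_CROAK makes decode() die on malformed input; LEAVE_SRC keeps
# $bad untouched so the example has no side effects.
my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();
my $strict_accepts = eval { decode('UTF-8', $bad, $check); 1 } ? 1 : 0;

print "'UTF-8' is $strict_name, 'utf8' is $lax_name; ",
      "strict accepts the bad input: $strict_accepts\n";
# prints "'UTF-8' is utf-8-strict, 'utf8' is utf8; strict accepts the bad input: 0"
```

The same split applies to the layers: `:encoding(UTF-8)` goes through the strict encoding, while `:utf8` simply trusts the data.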
use utf8;

• Does NOT do what you might think it does
• All it says is 'my source code is UTF-8'.
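A sketch of the pragma's only effect, assuming this example file is itself saved as UTF-8 (so the literal é is two bytes on disk). The pragma is lexically scoped, which lets both cases sit in one file.

```perl
use strict;
use warnings;

# Without the pragma, the literal is read as the raw bytes in the file.
my $without_pragma = length "é";

# With the pragma, the same literal is read as one character.
my $with_pragma = do {
    use utf8;    # applies only inside this block
    length "é";
};

print "without 'use utf8': $without_pragma, with: $with_pragma\n";
# prints "without 'use utf8': 2, with: 1" when the file is saved as UTF-8
```

It changes nothing about input, output, or data read at run time.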
Modules
• It depends on the module:
 • CGI - $CGI::PARAM_UTF8=1;
 • LWP::UserAgent - the response's ->decoded_content() method honours Content-Encoding: and the declared charset
 • DBI (with DBD::mysql) - mysql_enable_utf8 in DBI->connect()
 • XML::LibXML - looks at encoding,
In summary
• decode bytes as soon as you get them:
 • decode(), binmode(STDIN), 3 arg open()
• encode characters just before you output:
 • encode(), binmode(STDOUT), 3 arg open()
• keep track of whether your strings are
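Put together, the pattern might look like the sketch below. In-memory filehandles stand in for real input and output, and ISO-8859-1 in / UTF-8 out is an assumed pairing. Everything between the two `open()` calls works on characters only.

```perl
use strict;
use warnings;

# Input: "àbçdé" as ISO-8859-1 bytes (assumed input encoding).
my $input_bytes = "\xE0b\xE7d\xE9";

# Decode at the boundary: the layer hands us characters.
open(my $in, '<:encoding(ISO-8859-1)', \$input_bytes)
    or die "open input: $!";
my $chars = do { local $/; <$in> };
close $in;

# Work on characters. Reversing is safe here; reversing the UTF-8
# bytes of the same text would split multi-byte sequences.
my $reversed = reverse $chars;

# Encode at the boundary: the layer writes UTF-8 bytes.
my $output_bytes = '';
open(my $out, '>:encoding(UTF-8)', \$output_bytes)
    or die "open output: $!";
print {$out} $reversed;
close $out or die "close output: $!";

printf "%d characters in, %d bytes out\n",
    length($chars), length($output_bytes);
# prints "5 characters in, 8 bytes out"
```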
NEVER EVER EVER rely on the encoding of Perl's internal representation

and...

...there is NO SUCH THING as "plain text"
The Holy Fail (thanks Joel!)

input:   àbçdé   bytes in some encoding or other
           |
           |  decode:  use Encode; $chars = decode($enc, $bytes);
           v
         àbçdé   Perl's magic internal representation
           |
           |  encode:  use Encode; $bytes = encode($enc, $chars);
           v
output:  àbçdé   bytes in desired encoding
Questions?

ChatGPT and the Future of Work - Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search Intent
 
How to have difficult conversations
How to have difficult conversations How to have difficult conversations
How to have difficult conversations
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best Practices
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project management
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
 

Perl And Unicode

  • 2. Perl and Unicode Mike Whitaker, BBC/EnlightenedPerl.org
  • 3. The problem • Keeping track of input and output encodings • Not losing encoding data in the middle • Understanding the difference between characters and bytes
  • 8-14. Characters vs bytes: 2 characters, $ (U+0024) and € (U+20AC), are 4 bytes in UTF-8: 0x24 and 0xE2 0x82 0xAC
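
The 2-characters-versus-4-bytes count on these slides can be checked with a few lines of Perl. This is a minimal sketch rather than anything from the deck: the variable names are made up for illustration, and the characters are written as code point escapes so the result does not depend on how the script file itself is saved.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Two characters, "$" (U+0024) and "€" (U+20AC), written as code points.
my $chars = "\x{24}\x{20AC}";

# Turn the characters into UTF-8 bytes.
my $bytes = encode('UTF-8', $chars);

# length() counts characters in $chars and bytes in $bytes.
printf "characters: %d\n", length $chars;    # 2
printf "bytes:      %d\n", length $bytes;    # 4

# Show each byte in hex: 0x24 0xE2 0x82 0xAC
print join(' ', map { sprintf '0x%02X', ord } split //, $bytes), "\n";
```

The same `length()` call gives two different answers because it counts whatever units the string holds: characters before encoding, bytes after.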
  • 18-25. Handling Encodings: input àbçdé (bytes in some encoding or other) → decode (use Encode; $chars = decode($enc, $bytes);) → àbçdé (character-based internal representation) → encode (use Encode; $bytes = encode($enc, $chars);) → output àbçdé (bytes in desired encoding)
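
The decode, work on characters, encode flow in this diagram can be sketched end to end as follows. The input and output encodings (ISO-8859-1 in, UTF-8 out) and the variable names are assumptions chosen for illustration; the slides leave `$enc` open.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumed input: "àbçdé" as ISO-8859-1 bytes (one byte per character).
my $in_bytes = "\xE0b\xE7d\xE9";

# Decode as soon as the bytes arrive.
my $chars = decode('ISO-8859-1', $in_bytes);

# String functions now work on characters.
my $count = length $chars;    # 5

# Encode just before output, here to UTF-8.
my $out_bytes = encode('UTF-8', $chars);

printf "characters: %d, input bytes: %d, output bytes: %d\n",
    $count, length $in_bytes, length $out_bytes;
```

The three accented characters take one byte each in ISO-8859-1 and two bytes each in UTF-8, so the output is 8 bytes long although the character count never changes.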
  • 27. The Holy Grail • Can represent all encodings
  • 28. The Holy Grail • Can represent all encodings • Has multibyte character support
  • 29. The Holy Grail • Can represent all encodings • Has multibyte character support • for example, length() should count characters, not bytes
  • 30. It doesn't work like that
  • 32. use Encode; Only works in Perl 5.8 and above
  • 33. use Encode; Only works in Perl 5.8 and above. Why the $£%^&*() are you using 5.6 ANYWAY?
  • 34. use Encode; Only works in Perl 5.8 and above There are solutions for 5.6 and even earlier. But they're HORRIBLE.
  • 35. character-based internal representation
  • 36. character-based internal representation: Perl has one!
  • 37. character-based internal representation: Perl has one! Magic internal representation.
  • 38. character-based internal representation: Perl has one! Magic internal representation. All string functions know about it.
  • 39. character-based internal representation: Perl has one! Magic internal representation. All string functions know about it. It's encoding-agnostic.
  • 40. character-based internal representation: Perl has one! Magic internal representation. All string functions know about it. It's encoding-agnostic. In fact....
  • 43. IT'S UTF-8! (almost)
  • 44. Handling Encodings: input àbçdé (bytes in some encoding or other) → decode (use Encode; $chars = decode($enc, $bytes);) → àbçdé (Perl's magic internal representation) → encode (use Encode; $bytes = encode($enc, $chars);) → output àbçdé (bytes in desired encoding)
  • 45. input àbçdé (bytes in machine's 8bit encoding) → àbçdé (bytes in machine's 8bit encoding) → output àbçdé (bytes in machine's 8bit encoding)
  • 46. I18N? What the £$%^&*('s that? input àbçdé (bytes in machine's 8bit encoding) → àbçdé (bytes in machine's 8bit encoding) → output àbçdé (bytes in machine's 8bit encoding)
  • 47. People are still writing Perl like it was Perl 4
  • 49. People are still writing Perl like it was Perl 4 ...and we have to support them.
  • 50. People are still writing Perl like it was Perl 4 ...and we have to support them. Even though our string functions expect chars.
  • 51. ???? Perl's magic internal representation
  • 52. ???? Perl's magic internal representation if
  • 53. ???? Perl's magic internal representation if all characters are representable in local machine's 8 bit charset, use that;
  • 54. ???? Perl's magic internal representation if all characters are representable in local machine's 8 bit charset, use that; else
  • 55. ???? Perl's magic internal representation if all characters are representable in local machine's 8 bit charset, use that; else use UTF-8
  • 58-62. àbçdé (UTF-8 characters) → use Encode; $bytes = encode($enc, $chars); → àbçdé (bytes in desired output encoding). àbçdé (machine bytes) → use Encode; $bytes = encode($enc, $chars); → àbçdé (bytes in desired output encoding)
  • 67-71. UTF-8 characters + àbçdé (machine bytes) = ????? The machine bytes are promoted: àbçdé (machine bytes) → promote → àbçdé (UTF-8 characters), giving àbçdé (UTF-8 bytes)
  • 74-84. àbçdé output as machine bytes: with Content-Encoding: UTF-8 it shows as bd ? ? ?; with Content-Encoding: ISO-8859-1 it shows as àbçdé. àbçdé output as UTF-8 characters: with Content-Encoding: UTF-8 it shows as àbçdé; with Content-Encoding: ISO-8859-1 it shows as àbçdé
  • 85. ARRR GH!!!!
  • 88. You can't tell what you've actually got
  • 89. You can't tell what you've actually got utf8::is_utf8()
  • 90. You can't tell what you've actually got utf8::is_utf8() does not mean what you think it means
  • 91. You can't tell what you've actually got
  • 92. You can't tell what you've actually got encoded bytes
  • 93. You can't tell what you've actually got encoded bytes utf8::is_utf8() = false
  • 94. You can't tell what you've actually got encoded bytes utf8::is_utf8() = false EVEN IF they're UTF-8
  • 95. You can't tell what you've actually got encoded bytes utf8::is_utf8() = false EVEN IF they're UTF-8 decoded UTF-8 chars
  • 96. You can't tell what you've actually got encoded bytes utf8::is_utf8() = false EVEN IF they're UTF-8 decoded UTF-8 chars utf8::is_utf8() = true
  • 97. You can't tell what you've actually got encoded bytes utf8::is_utf8() = false EVEN IF they're UTF-8 decoded UTF-8 chars utf8::is_utf8() = true decoded machine bytes
  • 98. You can't tell what you've actually got encoded bytes utf8::is_utf8() = false EVEN IF they're UTF-8 decoded UTF-8 chars utf8::is_utf8() = true decoded machine bytes utf8::is_utf8() = false
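
The three cases on this slide can be reproduced with a short script. It is a sketch under the assumption that "machine bytes" means a native 8-bit (Latin-1 range) string; the variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Encoded bytes: "àbçdé" as UTF-8, still undecoded.
my $utf8_bytes = "\xC3\xA0b\xC3\xA7d\xC3\xA9";

# Decoded characters, obtained through Encode.
my $chars = decode('UTF-8', $utf8_bytes);

# The same five characters held as a native 8-bit string.
my $native = "\xE0b\xE7d\xE9";

for my $case (
    [ 'encoded UTF-8 bytes', $utf8_bytes ],
    [ 'decoded characters',  $chars ],
    [ 'native 8-bit string', $native ],
) {
    my ($label, $string) = @$case;
    printf "%-20s length=%d is_utf8=%s\n",
        $label, length $string, utf8::is_utf8($string) ? 'true' : 'false';
}

# Same characters, different internal flag.
print "equal: ", ($chars eq $native ? 'yes' : 'no'), "\n";
```

`$chars` and `$native` compare equal even though only one of them carries the flag, which is why `utf8::is_utf8()` says something about internal storage and nothing about whether a string has been decoded.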
  • 100. The science bit • Encode.pm use Encode; $bytes = encode($enc, $chars);
  • 101. The science bit • Encode.pm use Encode; $bytes = encode($enc, $chars); • 3 argument form of open() - PerlIO layers open(FILEHANDLE, ">:encoding(UTF-8)", $file);
  • 102. The science bit • Encode.pm use Encode; $bytes = encode($enc, $chars); • 3 argument form of open() - PerlIO layers open(FILEHANDLE, ">:encoding(UTF-8)", $file); • binmode(FILEHANDLE,
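
A round trip through the 3-argument `open()` with an encoding layer might look like the sketch below. The temporary file, the sample string and the lexical filehandles (instead of the bareword `FILEHANDLE` on the slide) are choices made for this illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# "àbçdé €" as characters: 7 characters in total.
my $chars = "\x{E0}b\x{E7}d\x{E9} \x{20AC}";

# Scratch file that is removed when the program exits.
my ($tmp_fh, $file) = tempfile(UNLINK => 1);
close $tmp_fh;

# Writing: the layer encodes characters to UTF-8 bytes on the way out.
open(my $out, '>:encoding(UTF-8)', $file) or die "Cannot write $file: $!";
print {$out} $chars;
close $out or die "Cannot close $file: $!";

# Size on disk is counted in bytes: 2+1+2+1+2+1+3 = 12.
my $size = -s $file;

# Reading: the layer decodes the bytes back into characters.
open(my $in, '<:encoding(UTF-8)', $file) or die "Cannot read $file: $!";
my $read = do { local $/; <$in> };
close $in;

printf "characters written: %d, bytes on disk: %d, characters read: %d\n",
    length $chars, $size, length $read;
```

With the layer in place the program never handles encoded bytes itself; the decode and encode steps happen at the filehandle boundary.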
  • 104. 'utf8' vs 'UTF-8' • Encode.pm
  • 105. 'utf8' vs 'UTF-8' • Encode.pm • utf8 = marks it as UTF-8 and hopes...
  • 106. 'utf8' vs 'UTF-8' • Encode.pm • utf8 = marks it as UTF-8 and hopes... • UTF-8 = is actually valid UTF-8
  • 107. 'utf8' vs 'UTF-8' • Encode.pm • utf8 = marks it as UTF-8 and hopes... • UTF-8 = is actually valid UTF-8 • PerlIO layers:
  • 108. 'utf8' vs 'UTF-8' • Encode.pm • utf8 = marks it as UTF-8 and hopes... • UTF-8 = is actually valid UTF-8 • PerlIO layers: • :utf8
  • 109. 'utf8' vs 'UTF-8' • Encode.pm • utf8 = marks it as UTF-8 and hopes... • UTF-8 = is actually valid UTF-8 • PerlIO layers: • :utf8 • :encoding(UTF-8)
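
The difference between the two names can be seen by asking Encode what each resolves to. The sketch below also feeds the strict encoding a UTF-16 surrogate written out as bytes, which is not valid UTF-8; the sample bytes and variable names are illustrative assumptions.

```perl
use strict;
use warnings;
use Encode qw(find_encoding decode FB_CROAK LEAVE_SRC);

# The two spellings resolve to two different encoding objects.
my $strict_name = find_encoding('UTF-8')->name;
my $lax_name    = find_encoding('utf8')->name;
print "UTF-8 resolves to: $strict_name\n";
print "utf8  resolves to: $lax_name\n";

# U+D800 (a surrogate) written as three bytes: well-formed looking,
# but not allowed in valid UTF-8.
my $bad_bytes = "\xED\xA0\x80";

# FB_CROAK makes decode() die on invalid input; LEAVE_SRC keeps
# $bad_bytes untouched.
my $strict_failed
    = !eval { decode('UTF-8', $bad_bytes, FB_CROAK | LEAVE_SRC); 1 };
print "strict decode rejected the bytes: ",
    ($strict_failed ? 'yes' : 'no'), "\n";
```

Asking for `FB_CROAK` turns bad input into an exception instead of silent substitution characters, which is usually what you want when decoding data from outside.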
  • 111. use utf8; • Does NOT do what you might think it does
  • 112. use utf8; • Does NOT do what you might think it does • All it says is 'my source code is UTF-8'.
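
What `use utf8;` changes is how string literals in the source are read, and nothing about input or output. A small sketch, assuming the script file itself is saved as UTF-8 (the variable names are illustrative):

```perl
use strict;
use warnings;
use utf8;    # the source code of this file is UTF-8

# With the pragma on, the literal is one character.
my $with_pragma = "é";

# With the pragma off for this block, the same literal is the
# two raw bytes that the file contains for "é".
my $without_pragma = do { no utf8; "é" };

printf "with use utf8:    length %d\n", length $with_pragma;
printf "without use utf8: length %d\n", length $without_pragma;
```

Printing `$with_pragma` would still need an encoding layer or `encode()`; the pragma has no effect on filehandles.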
  • 114. Modules • It depends on the module:
  • 115. Modules • It depends on the module: • CGI - $CGI::PARAM_UTF8=1;
  • 116. Modules • It depends on the module: • CGI - $CGI::PARAM_UTF8=1; • LWP::UserAgent - the ->decoded_content() method honours Content-Encoding:
  • 117. Modules • It depends on the module: • CGI - $CGI::PARAM_UTF8=1; • LWP::UserAgent - the ->decoded_content() method honours Content-Encoding: • DBI - mysql_enable_utf8 in DBI::connect()
  • 118. Modules • It depends on the module: • CGI - $CGI::PARAM_UTF8=1; • LWP::UserAgent - the ->decoded_content() method honours Content-Encoding: • DBI - mysql_enable_utf8 in DBI::connect() • XML::LibXML - looks at encoding,
  • 120. In summary • decode bytes as soon as you get them:
  • 121. In summary • decode bytes as soon as you get them: • decode(), binmode(STDIN), 3 arg open()
  • 122. In summary • decode bytes as soon as you get them: • decode(), binmode(STDIN), 3 arg open() • encode characters just before you output:
  • 123. In summary • decode bytes as soon as you get them: • decode(), binmode(STDIN), 3 arg open() • encode characters just before you output: • encode(), binmode(STDOUT), 3 arg open()
  • 124. In summary • decode bytes as soon as you get them: • decode(), binmode(STDIN), 3 arg open() • encode characters just before you output: • encode(), binmode(STDOUT), 3 arg open() • keep track of whether your strings are
  • 126. NEVER EVER EVER rely on the encoding of Perl's internal representation
  • 128. and...
  • 130. ...there is NO SUCH THING as "plain text"
  • 131. Handling Encodings: input àbçdé (bytes in some encoding or other) → decode (use Encode; $chars = decode($enc, $bytes);) → àbçdé (Perl's magic internal representation) → encode (use Encode; $bytes = encode($enc, $chars);) → output àbçdé (bytes in desired encoding)
  • 132. The Holy Fail (thanks Joel!): input àbçdé (bytes in some encoding or other) → decode (use Encode; $chars = decode($enc, $bytes);) → àbçdé (Perl's magic internal representation) → encode (use Encode; $bytes = encode($enc, $chars);) → output àbçdé (bytes in desired encoding)
