Understand unicode & utf8 in perl (2)


I'm not a Unicode Guru, but working with third parties, I often find that a lot of people consistently fail to get the basics right about Unicode and encoding. There must be something esoteric about it. So here's yet another set of slides about Unicode/UTF8 in Perl.

It's not meant to be a comprehensive presentation of all things Unicode in Perl. It insists on a couple of guidelines and gives some pointers for getting a good start on writing a Unicode-compliant application and avoiding common issues.



Usage Rights

© All Rights Reserved


Presentation Transcript

  • Understand Unicode & UTF8 in Perl: avoid common issues and gain guru status. (You too can be John)
  • Characters and Glyphs
    A character: é
    Combination of 2 glyphs: e (LATIN SMALL LETTER E) followed by ´ (COMBINING ACUTE ACCENT, U+0301)
  • Characters and Glyphs
    A character: é
    Or a combined glyph: é (LATIN SMALL LETTER E WITH ACUTE)
  • So what is Unicode (in this context)?
    A collection of glyphs (mainly) called code points, each with a unique number and a set of properties.
    Example: E (U+0045)
    Name: LATIN CAPITAL LETTER E
    Block: Basic Latin
    Category: Letter, Uppercase [Lu]
    Combine: 0
    BIDI: Left-to-Right [L]
    Lower case: U+0065
  • What is a String?
    An ordered collection of glyphs, i.e. an ordered collection of Unicode code points.
    In Perl: my $s = "he"; or my $s = "\N{U+0068}\N{U+0065}";
  • What is a String? The glyph pitfall
    An ordered collection of glyphs. There's more than one way to write it.
    In Perl: my $s = "é" is my $s = "\N{U+00E9}"; OR my $s = "\N{U+0065}\N{U+0301}";
    In practice, software prefers the first way (phew), but not always. See Unicode::Normalize.
  • How does Perl represent Strings?
    Short answer: It's not your business.
    Long answer: It depends :(
    Only "latin1 characters" -> Latin1. Anything outside that -> UTF-8.
    Feeling fiddly, bug fixing? Use the utf8::* functions.
    Bedtime read: perldoc perlunicode
  • Not my business? So what's this fuss about UTF-8 encoding?
    How strings are represented internally is not your business.
    How they are transmitted from/to the outside world is.
    The outside world doesn't understand Strings. It understands bytes.
    An encoding is a bijection: Unicode points (glyphs) <-> bytes
  • UTF-8 encoding
    Unicode points (glyphs) <-> bytes
    Variable number of bytes per Unicode point.
    Examples: a <-> \x{61}, ☭ <-> \x{E2}\x{98}\x{AD} (gdrive FAIL)
    Sometimes, the bytes begin with a BOM.
  • The encoding law
    Never transfer Strings. Always transfer Bytes.
    But inside Perl: you want to work with Strings as much as possible.
    Sending: Encode as LATE as possible.
    Receiving: Decode as EARLY as possible.
  • Common outside worlds: STDOUT
    Latin1 encoding by default :(
    -> You can only output Latin1-compliant Strings. And your shell should expect Latin1.
    In the modern world:
    # Set STDOUT to encode as UTF8
    binmode STDOUT, ':utf8';
  • Common outside worlds: A text file
    If you know the file encoding: open(my $fh, "<:encoding(UTF-8)", "filename");
    If you don't know: maybe you can count on the BOM.
    But you don't want that. You want to know for sure -> set a convention.
  • Common outside worlds: XML file
    Encoding specified in the preamble: <?xml version="1.0" encoding="utf-8"?>
    If not specified -> utf8 is assumed.
    Feed your XML parser with BYTES.
    Write XML files in binary mode.
    XML::LibXML calls bytes "Strings". People are confused. Trust no one.
  • Common outside worlds: WWW
    From a given page, browsers send parameters in the encoding of the page.
    Correctly encode your binary responses.
    Decode $c->params()
    In Catalyst: Catalyst::Plugin::Unicode::Encoding
  • Common outside worlds: Your own
    Every time you communicate with a system, you will send/receive bytes. Never strings.
    Think about encoding/decoding your strings to/from bytes, according to what your system expects/provides.
    Sometimes it's done automagically through some library options.
  • Bug avoiding guidelines
    Test everything with Unicode characters.
    English keyboard? chartables.de, Unicode lorem ipsum.
    Unit test => "\N{U+262D}"
    Never i/o strings. Never. i/o is about bytes.
    Choose encodings explicitly.
  • Bonus: Escaping
    What if you want to represent your nice shiny UTF8 bytes as part of something else?
    You need to escape them!
    Example in URI, escaping parameters (URI::Escape): http://foo.com/?q=%E2%98%AD
  • Bonus: Escaping for email headers
    Encode AND escape for email subjects (encode with MIME-Q):
    Encode::encode('MIME-Q', "a\N{U+262D}c");
    =?UTF-8?Q?a=E2=98=ADc?=
    It encodes and escapes at the same time. Beware of confusion.
    Keep strings for as long as you can.
  • Conclusion
    Make sure you make a difference between Strings and Bytes. In Perl, it must come from discipline.
    Make sure you always encode/decode on i/o as explicitly as possible. Don't let confused others confuse you.
    Always wonder: what does this thing operate on, Bytes or Strings? When in doubt, investigate.
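
The "glyph pitfall" and "UTF-8 encoding" slides can be illustrated with a small sketch. It assumes only the core modules Encode and Unicode::Normalize; the variable names are made up for this example and do not come from the slides.

```perl
use strict;
use warnings;
use Encode qw(encode decode);
use Unicode::Normalize qw(NFC);

# Two spellings of the same character "é" (illustrative variable names).
my $composed   = "\N{U+00E9}";            # LATIN SMALL LETTER E WITH ACUTE
my $decomposed = "\N{U+0065}\N{U+0301}";  # e followed by COMBINING ACUTE ACCENT

# As plain strings they differ, in content and in length.
my $equal_raw      = $composed eq $decomposed ? 1 : 0;
my $len_composed   = length $composed;     # counted in code points
my $len_decomposed = length $decomposed;

# Once both are normalized to the same form (NFC here), they compare equal.
my $equal_nfc = NFC($composed) eq NFC($decomposed) ? 1 : 0;

# Encoding maps a string to bytes; decoding maps the bytes back.
my $hammer     = "\N{U+262D}";
my $bytes      = encode('UTF-8', $hammer);
my $hex        = uc unpack('H*', $bytes);   # hex dump of the encoded bytes
my $round_trip = decode('UTF-8', $bytes) eq $hammer ? 1 : 0;

print "raw equal: $equal_raw, NFC equal: $equal_nfc\n";
print "lengths: $len_composed vs $len_decomposed\n";
print "UTF-8 bytes of U+262D: $hex, round trip ok: $round_trip\n";
```

Comparing user input only after normalization is one way to keep the two spellings from being treated as different strings.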
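
The "encoding law" (encode late, decode early) and the text-file slide can be sketched as follows. This is a minimal example under one assumption: an in-memory filehandle stands in for a real file so the snippet runs anywhere. The `:encoding(UTF-8)` layer is the same one the slide uses with `open`.

```perl
use strict;
use warnings;

# A string of 6 characters: c, a, f, é, space, ☭ (illustrative data).
my $text = "caf\N{U+00E9} \N{U+262D}";

# Sending: work with the string, let the I/O layer encode at the last moment.
# $buffer is an in-memory stand-in for a file on disk (assumption of this sketch).
my $buffer = '';
open(my $out, '>:encoding(UTF-8)', \$buffer) or die "open for write: $!";
print {$out} $text;
close($out) or die "close after write: $!";

# What crossed the boundary is bytes: é takes 2 bytes, ☭ takes 3.
my $byte_count = length $buffer;

# Receiving: let the I/O layer decode as soon as the bytes come in.
open(my $in, '<:encoding(UTF-8)', \$buffer) or die "open for read: $!";
my $read_back = <$in>;
close($in) or die "close after read: $!";

my $char_count = length $read_back;
my $same       = $read_back eq $text ? 1 : 0;

# STDOUT is an outside world too: choose its encoding explicitly.
binmode(STDOUT, ':encoding(UTF-8)');
print "bytes=$byte_count chars=$char_count same=$same text=$read_back\n";
```

With a real file, the third argument to `open` would simply be the filename, and the rest of the code stays the same.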
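
The two "Bonus: Escaping" slides both follow the pattern "encode to bytes, then escape the bytes". The sketch below assumes only the core Encode module. The percent-escaping is done by hand as a stand-in for URI::Escape, which the slide names but which may not be installed everywhere; the URL is the one from the slide.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $query = "\N{U+262D}";

# Step 1: encode the string to UTF-8 bytes.
my $bytes = encode('UTF-8', $query);

# Step 2: percent-escape every byte that is not an unreserved URI character.
# Hand-rolled for illustration; URI::Escape offers this ready-made.
(my $escaped = $bytes) =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord($1))/ge;
my $url = "http://foo.com/?q=$escaped";

# For email headers, the MIME-Q encoding does both steps in one call:
# it takes a string and returns an ASCII-only encoded-word.
my $original = "a\N{U+262D}c";
my $subject  = encode('MIME-Q', $original);

# Decoding the header gives the original string back.
my $decoded_ok = decode('MIME-Header', $subject) eq $original ? 1 : 0;

print "$url\n";
print "$subject (decodes back: $decoded_ok)\n";
```

Keeping `$query` and `$original` as strings until the very last step is what the slides mean by "keep strings for as long as you can".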