Understand Unicode & UTF-8 in Perl (2)

I'm not a Unicode Guru, but working with third parties, I often find that a lot of people consistently fail to get the basics right about Unicode and encoding. There must be something esoteric about it. So here's yet another set of slides about Unicode/UTF8 in Perl.

It's not meant to be a comprehensive presentation of all things Unicode in Perl. It's meant to insist on a couple of guidelines and give some pointers for getting a good start on writing a Unicode-compliant application and avoiding common issues.

Transcript

  • 1. Understand Unicode & UTF-8 in Perl
    Avoid common issues and gain guru status. (You too can be John)
  • 2. Characters and Glyphs
    A character: é
    Combination of 2 glyphs:
    e (LATIN SMALL LETTER E)
    Followed by:
    ´ (COMBINING ACUTE ACCENT, U+0301)
  • 3. Characters and Glyphs
    A character: é
    Or a combined glyph:
    é (LATIN SMALL LETTER E WITH ACUTE)
  • 4. So what is Unicode (in this context)?
    A collection of glyphs (mainly) called code points, each with a unique number and a set of properties.
    Example: E (U+0045)
    Name: LATIN CAPITAL LETTER E
    Block: Basic Latin
    Category: Letter, Uppercase [Lu]
    Combine: 0
    BIDI: Left-to-Right [L]
    Lower case: U+0065
  • 5. What is a String?
    An ordered collection of glyphs, i.e. an ordered collection of Unicode code points.
    In Perl:
    my $s = "he";
    or
    my $s = "\N{U+0068}\N{U+0065}";
  • 6. What is a String? - The glyph Pitfall
    An ordered collection of glyphs. There's more than one way to write it.
    In Perl:
    my $s = "é";   (needs "use utf8" in the source file)
    is
    my $s = "\N{U+00E9}";
    OR..
    my $s = "\N{U+0065}\N{U+0301}";
    In practice, software prefers the first way (phew), but not always. See Unicode::Normalize
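
The pitfall on this slide can be sketched with Unicode::Normalize, which ships with Perl. This is a minimal illustration, not the author's code: the variable names are made up, and the decomposed form uses the combining accent U+0301.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

# The same character "é", written two ways.
my $composed   = "\N{U+00E9}";            # LATIN SMALL LETTER E WITH ACUTE
my $decomposed = "\N{U+0065}\N{U+0301}";  # e + COMBINING ACUTE ACCENT

# As raw strings they differ: one code point versus two.
my $raw_equal = ($composed eq $decomposed) ? 1 : 0;

# Normalizing both sides to the same form makes them compare equal.
my $nfc_equal  = (NFC($composed) eq NFC($decomposed)) ? 1 : 0;

# NFD splits the composed form back into two code points.
my $nfd_length = length NFD($composed);

printf "raw equal: %d, NFC equal: %d, NFD length: %d\n",
    $raw_equal, $nfc_equal, $nfd_length;
```

Normalizing at the boundary (for example right after decoding input) is one way to make comparisons and hash lookups behave predictably.
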
  • 7. How does Perl represent Strings?
    Short answer: It's not your business.
    Long answer: It depends :(
    Only "latin1 characters" -> Latin1. Anything outside that -> UTF-8.
    Feeling fiddly, bug fixing? Use the utf8::* functions.
    Bedtime read: perldoc perlunicode
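
To see why the internal representation is "not your business", here is a small sketch using the built-in utf8::upgrade and utf8::is_utf8 functions. The variable names are illustrative. The same string compares equal whichever way Perl stores it.

```perl
use strict;
use warnings;

my $latin1   = "\xE9";       # é, stored internally as one Latin-1 byte
my $upgraded = $latin1;
utf8::upgrade($upgraded);    # same string, now stored internally as UTF-8

# The internal flag differs between the two copies...
my $flag_before = utf8::is_utf8($latin1)   ? 1 : 0;
my $flag_after  = utf8::is_utf8($upgraded) ? 1 : 0;

# ...but at the string level nothing has changed.
my $same_string = ($latin1 eq $upgraded) ? 1 : 0;
my $len_after   = length $upgraded;

printf "flag before: %d, flag after: %d, same string: %d, length: %d\n",
    $flag_before, $flag_after, $same_string, $len_after;
```

The flag is mostly useful when debugging. Application logic should not depend on it.
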
  • 8. Not my business? So what's this fuss about UTF-8 encoding?
    How strings are represented internally is not your business.
    How they are transmitted from/to the outside world is.
    The outside world doesn't understand Strings. It understands bytes.
    An encoding is a bijection:
    Unicode code points (glyphs) <-> bytes
  • 9. UTF-8 encoding
    Unicode code points (glyphs) <-> bytes
    Variable number of bytes per Unicode code point.
    Examples:
    a <-> \x{61}
    ☭ <-> \x{E2}\x{98}\x{AD} (gdrive FAIL)
    Sometimes, the bytes begin with a BOM.
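
The two examples on this slide can be checked with the core Encode module. A minimal sketch (the variable names are illustrative):

```perl
use strict;
use warnings;
use Encode qw(encode);

my $string = "\N{U+262D}";              # one character: ☭
my $bytes  = encode('UTF-8', $string);  # its UTF-8 byte sequence

# Render each byte as two hex digits.
my $hex = join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;

# An ASCII character encodes to a single byte.
my $ascii_bytes = encode('UTF-8', 'a');

printf "characters: %d, bytes: %d, hex: %s\n",
    length($string), length($bytes), $hex;
# prints "characters: 1, bytes: 3, hex: E2 98 AD"
```
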
  • 10. The encoding law
    Never transfer Strings. Always transfer Bytes.
    But inside Perl: you want to work with Strings as much as possible.
    Sending: Encode as LATE as possible.
    Receiving: Decode as EARLY as possible.
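
A minimal sketch of the law in practice, assuming the outside world speaks UTF-8. The input bytes and variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Receiving: these are BYTES from the outside world ("café" in UTF-8).
my $incoming = "caf\xC3\xA9";

# Decode as EARLY as possible: from here on we hold a String.
my $text = decode('UTF-8', $incoming);

# Work on the String: character operations behave as expected.
my $shout = uc $text;    # "CAFÉ", 4 characters

# Sending: encode as LATE as possible, right before the bytes leave.
my $outgoing = encode('UTF-8', $shout);

printf "bytes in: %d, characters: %d, bytes out: %d\n",
    length($incoming), length($text), length($outgoing);
```

Had `uc` been applied to `$incoming` directly, it would have operated on the two bytes of "é" separately rather than on the character.
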
  • 11. Common outside worlds: STDOUT
    Latin1 encoding by default :(
    -> You can only output Latin1-compliant Strings. And your shell should expect Latin1.
    In the modern world:
    # Set STDOUT to encode as UTF8
    binmode STDOUT, ':utf8';
  • 12. Common outside worlds: A text file
    If you know the file encoding:
    open(my $fh, "<:encoding(UTF-8)", "filename");
    If you don't know: maybe you can count on the BOM.
    But you don't want that. You want to know for sure -> set a convention.
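
With a UTF-8 convention in place, reading and writing a text file could look like the sketch below. The temporary file from File::Temp and the sample text are only there to keep the example self-contained.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# A throwaway file so the example can run anywhere.
my ($tmp_fh, $filename) = tempfile(UNLINK => 1);
close $tmp_fh;

my $text = "h\N{U+00E9}llo \N{U+262D}\n";

# Writing: the layer encodes the String to UTF-8 bytes on the way out.
open(my $out, '>:encoding(UTF-8)', $filename)
    or die "Cannot write $filename: $!";
print {$out} $text;
close $out or die "Cannot close $filename: $!";

# Reading: the layer decodes the bytes back into a String on the way in.
open(my $in, '<:encoding(UTF-8)', $filename)
    or die "Cannot read $filename: $!";
my $read_back = do { local $/; <$in> };
close $in;

# The file holds more bytes than the String has characters.
my $size_on_disk = -s $filename;

printf "characters: %d, bytes on disk: %d\n",
    length($read_back), $size_on_disk;
```
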
  • 13. Common outside worlds: XML file
    Encoding specified in the preamble:
    <?xml version="1.0" encoding="utf-8"?>
    If not specified -> utf8 is assumed.
    Feed your XML parser with BYTES.
    Write XML files in binary mode.
    XML::LibXML calls bytes "Strings". People are confused. Trust no one.
  • 14. Common outside worlds: WWW
    From a given page, browsers send parameters in the encoding of the page.
    Correctly encode your binary responses.
    Decode $c->params()
    In Catalyst: Catalyst::Plugin::Unicode::Encoding
  • 15. Common outside worlds: Your own
    Every time you communicate with a system, you will send/receive bytes. Never strings.
    Think about encoding/decoding your strings to/from bytes, according to what your system expects/provides.
    Sometimes, it's done automagically through some library options.
  • 16. Bug-avoiding guidelines
    Test everything with Unicode characters.
    English keyboard? chartables.de, Unicode lorem ipsum.
    Unit test => "\N{U+262D}"
    Never i/o strings. Never. i/o is about bytes.
    Choose encodings explicitly.
  • 17. Bonus: Escaping
    What if you want to represent your nice shiny UTF8 bytes as part of something else?
    You need to escape them!
    Example in URI, escaping parameters (URI::Escape):
    http://foo.com/?q=%E2%98%AD
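
The two steps behind that URL (encode to UTF-8 bytes, then percent-escape the bytes) can be sketched with core modules only. The `percent_escape` helper below is a hypothetical hand-rolled stand-in written for this illustration. In real code a module such as URI::Escape, mentioned on the slide, is the usual choice.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: encode a String to UTF-8 bytes, then percent-escape
# every byte that is not an RFC 3986 "unreserved" character.
sub percent_escape {
    my ($string) = @_;
    my $bytes = encode('UTF-8', $string);
    $bytes =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord($1))/ge;
    return $bytes;
}

my $url = 'http://foo.com/?q=' . percent_escape("\N{U+262D}");

print "$url\n";    # prints "http://foo.com/?q=%E2%98%AD"
```
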
  • 18. Bonus: Escaping for email headers
    Encode AND Escape for Email subjects (Encode with MIME-Q):
    Encode::encode('MIME-Q', "a\N{U+262D}c");
    =?UTF-8?Q?a=E2=98=ADc?=
    It encodes and escapes at the same time. Beware of confusion.
    Keep strings for as long as you can.
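
A short sketch of the MIME-Q round trip with the core Encode module. The variable names are illustrative, and the exact folding of long headers may differ between Encode versions.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $subject = "a\N{U+262D}c";    # a String, still unencoded

# Encode AND escape in one step: the result is pure ASCII bytes,
# safe to place in an email Subject header.
my $header = encode('MIME-Q', $subject);

# Decoding the header gives the original String back.
my $round_trip = decode('MIME-Header', $header);

print "Subject: $header\n";
```
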
  • 19. Conclusion
    Make sure you differentiate between Strings and Bytes. In Perl, this must come from discipline.
    Make sure you always encode/decode on i/o as explicitly as possible. Don't let confused others confuse you.
    Always wonder: what does this thing operate on? Bytes or Strings? If in doubt, investigate.