I'm not a Unicode Guru, but working with third parties, I often find that a lot of people consistently fail to get the basics right about Unicode and encoding. There must be something esoteric about it. So here's yet another set of slides about Unicode/UTF8 in Perl.
It's not meant to be a comprehensive presentation of all things Unicode in Perl. It's meant to insist on a couple of guidelines and give some pointers for getting a good start writing a Unicode-compliant application and avoiding common issues.
2. Characters and Glyphs
A character: 'é'
Combination of 2 glyphs:
e (LATIN SMALL LETTER E)
Followed by:
◌́ (COMBINING ACUTE ACCENT)
3. Characters and Glyphs
A character: 'é'
Or a combined glyph:
é (LATIN SMALL LETTER E WITH ACUTE)
4. So what is Unicode (in this context)?
A collection of glyphs (mainly) called
Codepoints with a unique number and a set of
properties.
Example: E ( U+0045 )
Name: LATIN CAPITAL LETTER E
Block: Basic Latin
Category: Letter, Uppercase [Lu]
Combining class: 0
BIDI: Left-to-Right [L]
Lower case: U+0065
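The properties listed on this slide can be looked up from Perl itself. A minimal sketch, assuming the core Unicode::UCD module and its charinfo() function (neither is mentioned on the slide):

```perl
use strict;
use warnings;
use Unicode::UCD qw(charinfo);

# Look up U+0045 in the Unicode Character Database shipped with perl.
my $info = charinfo(0x45);

# Print the same fields as the slide. 'lower' is a hex string such as "0065".
for my $field (qw(name block category combining bidi lower)) {
    printf "%-10s %s\n", $field, $info->{$field};
}
```

The hash returned by charinfo() has more keys than these (upper, title, script and so on); only the ones matching the slide are printed here.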
5. What is a String?
An ordered collection of glyphs, i.e. an ordered
collection of Unicode code points.
In Perl:
my $s = "he";
or
my $s = "\N{U+0068}\N{U+0065}";
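To check that the two spellings really are the same string, here is a small sketch. The variable names are illustrative only:

```perl
use strict;
use warnings;

my $literal = "he";
my $escaped = "\N{U+0068}\N{U+0065}";

# Both spellings produce the same two-character string.
my $same = ($literal eq $escaped) ? 1 : 0;

# List the code points of the string, in order.
my @code_points = map { sprintf 'U+%04X', ord($_) } split //, $escaped;

print "same=$same length=", length($escaped), " @code_points\n";
# prints: same=1 length=2 U+0068 U+0065
```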
6. What is a String ? - The glyph Pitfall
An ordered collection of glyphs. There's more
than one way to write it.
In Perl:
my $s = "é";
is
my $s = "\N{U+00E9}"; OR..
my $s = "\N{U+0065}\N{U+0301}";
In practice, software prefers the first way (phew),
but not always. See Unicode::Normalize.
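A minimal sketch of what Unicode::Normalize buys you, using U+0301 COMBINING ACUTE ACCENT as the combining mark. The NFC and NFD functions are the module's own; the variable names are illustrative:

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

my $composed   = "\N{U+00E9}";            # one code point
my $decomposed = "\N{U+0065}\N{U+0301}";  # 'e' + COMBINING ACUTE ACCENT

# They render alike but are different code point sequences.
my $equal_raw = ($composed eq $decomposed) ? 1 : 0;

# Normalizing both sides to the same form makes them comparable.
my $equal_nfc = (NFC($composed) eq NFC($decomposed)) ? 1 : 0;
my $equal_nfd = (NFD($composed) eq NFD($decomposed)) ? 1 : 0;

printf "raw=%d nfc=%d nfd=%d lengths=%d/%d\n",
    $equal_raw, $equal_nfc, $equal_nfd,
    length($composed), length($decomposed);
# prints: raw=0 nfc=1 nfd=1 lengths=1/2
```

Picking one normalization form at the boundaries of the application (NFC is a common choice) avoids surprises in comparisons and lookups.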
7. How does Perl represent Strings?
Short answer: It's not your business.
Long answer: It depends :(
Only "latin1 characters" -> Latin1. Anything
outside that -> UTF-8.
Feeling fiddly, bug fixing? Use the utf8::* functions.
Bedtime read: perldoc perlunicode
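For the curious, a small sketch that peeks at the internal representation with the built-in utf8::is_utf8 and utf8::upgrade functions. It is for debugging only; application logic should not depend on the flag:

```perl
use strict;
use warnings;

my $latin = "\x{E9}";     # fits in Latin1
my $wide  = "\x{262D}";   # does not fit in Latin1

# The flag tells which internal representation perl picked.
my $latin_flag = utf8::is_utf8($latin) ? 1 : 0;
my $wide_flag  = utf8::is_utf8($wide)  ? 1 : 0;

# utf8::upgrade switches the internal representation, not the string itself.
my $upgraded = $latin;
utf8::upgrade($upgraded);
my $upgraded_flag = utf8::is_utf8($upgraded) ? 1 : 0;
my $still_equal   = ($upgraded eq $latin)    ? 1 : 0;

print "latin=$latin_flag wide=$wide_flag ",
      "upgraded=$upgraded_flag equal=$still_equal\n";
# prints: latin=0 wide=1 upgraded=1 equal=1
```

The two copies of 'é' compare equal even though they are stored differently, which is exactly why the representation is "not your business".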
8. Not my business? So what's this fuss about UTF-8 encoding?
How strings are represented internally is not
your business.
How they are transmitted from/to the outside
world is.
The outside world doesn't understand 'Strings'.
It understands 'bytes'.
An encoding is a bijection:
Unicode Points (glyphs) <-> bytes
9. UTF-8 encoding
Unicode Points (glyphs) <-> bytes
Variable number of bytes per unicode point.
Examples:
a <-> \x{61}
☭ <-> \x{E2}\x{98}\x{AD} (gdrive FAIL)
Sometimes, the bytes begin with a BOM.
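The mapping can be tried out with the core Encode module. A minimal sketch with illustrative variable names:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $string = "\N{U+262D}";              # one character
my $bytes  = encode('UTF-8', $string);  # its UTF-8 bytes

# Show the bytes in hex.
my $hex = join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;

# Decoding the bytes gives the original string back.
my $round_trip = decode('UTF-8', $bytes);

printf "chars=%d bytes=%d hex=%s\n", length($string), length($bytes), $hex;
# prints: chars=1 bytes=3 hex=E2 98 AD
```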
10. The encoding law
Never transfer Strings. Always transfer Bytes.
But inside Perl: You want to work with Strings
as much as possible.
Sending: Encode as LATE as possible.
Receiving: Decode as EARLY as possible.
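A sketch of the law applied to one small function. The shout() helper is hypothetical; it stands for any piece of code that receives bytes, works on characters, and sends bytes back:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: takes UTF-8 bytes, returns UTF-8 bytes.
sub shout {
    my ($bytes_in) = @_;
    my $string = decode('UTF-8', $bytes_in);  # receiving: decode EARLY
    my $result = uc($string) . '!';           # work on characters
    return encode('UTF-8', $result);          # sending: encode LATE
}

my $bytes_in  = "\xC3\xA9t\xC3\xA9";          # UTF-8 bytes for "été"
my $bytes_out = shout($bytes_in);

# Calling uc() on the raw bytes instead would leave the accented letters as is.
my $naive = uc($bytes_in);
```

Here $bytes_out holds the UTF-8 bytes of "ÉTÉ!", while $naive only has the plain ASCII "t" uppercased.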
11. Common outside worlds: STDOUT
Latin1 encoding by default :(
-> You can only output 'Latin1 compliant
Strings'. And your shell should expect Latin1.
In the modern world:
# Set STDOUT to encode as UTF-8
binmode STDOUT, ':encoding(UTF-8)';
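To see what such an encoding layer does to the output, here is a sketch that uses an in-memory filehandle as a stand-in for STDOUT, so that the resulting bytes can be inspected. The stand-in is an assumption made for the demonstration; on the real STDOUT the layer is set with binmode:

```perl
use strict;
use warnings;

# An in-memory handle standing in for STDOUT.
my $buffer = '';
open(my $out, '>:encoding(UTF-8)', \$buffer)
    or die "Cannot open in-memory handle: $!";

# Print a string; the layer turns it into UTF-8 bytes on the way out.
print {$out} "\N{U+262D}";
close($out) or die "Cannot close in-memory handle: $!";

my $hex = join ' ', map { sprintf '%02X', ord($_) } split //, $buffer;
print "written bytes: $hex\n";
# prints: written bytes: E2 98 AD
```

Without an encoding layer, printing a character above U+00FF makes perl emit a "Wide character" warning, which is a good hint that an encode step is missing.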
12. Common outside worlds: A text file
If you know the file encoding:
open(my $fh, "<:encoding(UTF-8)", "filename");
If you don't know:
Maybe you can count on the BOM.
But you don't want that. You want to know for
sure -> set a convention.
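A sketch of such a convention: write and read the file with the same explicit encoding layer. The temporary file from the core File::Temp module is only there to keep the example self-contained:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# A scratch file for the demonstration.
my ($tmp_fh, $filename) = tempfile(UNLINK => 1);
close($tmp_fh);

# Write a string: the layer encodes it to UTF-8 bytes.
open(my $out, '>:encoding(UTF-8)', $filename) or die "Cannot write: $!";
print {$out} "caf\N{U+00E9}";
close($out) or die "Cannot close: $!";

# Read it back: the layer decodes the bytes into a string.
open(my $in, '<:encoding(UTF-8)', $filename) or die "Cannot read: $!";
my $line = <$in>;
close($in);

# Four characters in Perl, five bytes on disk.
printf "chars=%d bytes=%d\n", length($line), -s $filename;
# prints: chars=4 bytes=5
```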
13. Common outside worlds: XML file
Encoding specified in the preamble:
<?xml version="1.0" encoding="utf-8"?>
If not specified -> utf8 is assumed.
Feed your XML parser with BYTES.
Write XML files in binary mode.
XML::LibXML calls bytes 'Strings'. People
are confused. Trust no one.
14. Common outside worlds: WWW
From a given page, browsers send parameters
in the encoding of the page.
Correctly encode your binary responses.
Decode $c->params()
In Catalyst:
Catalyst::Plugin::Unicode::Encoding
15. Common outside worlds: Your own
Every time you communicate with a system,
you will send/receive bytes. Never strings.
Think about encoding/decoding your strings
to/from bytes, according to what your system
expects/provides.
Sometimes, it's done automagically through
some library options.
16. Bug avoiding guidelines.
Test everything with Unicode characters.
English keyboard? chartables.de, unicode
lorem ipsum.
Unit test => "\N{U+262D}"
Never i/o strings. Never. i/o is about bytes.
Choose encodings explicitly.
17. Bonus: Escaping
What if you want to represent your nice shiny
UTF8 bytes as part of something else?
You need to escape them!
Example in URI, escaping parameters:
(URI::Escape):
http://foo.com/?q=%E2%98%AD
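To show the two distinct steps (encode, then escape), here is a hand-rolled sketch that uses only the core Encode module. It is a stand-in for illustration, not URI::Escape's implementation, and the escape_param and unescape_param helpers are hypothetical names:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# String -> UTF-8 bytes -> percent-escaped ASCII.
sub escape_param {
    my ($string) = @_;
    my $bytes = encode('UTF-8', $string);
    # Escape every byte that is not an RFC 3986 unreserved character.
    $bytes =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord($1))/ge;
    return $bytes;
}

# Percent-escaped ASCII -> UTF-8 bytes -> string.
sub unescape_param {
    my ($escaped) = @_;
    (my $bytes = $escaped) =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return decode('UTF-8', $bytes);
}

my $escaped   = escape_param("\N{U+262D}");
my $unescaped = unescape_param($escaped);
print "http://foo.com/?q=$escaped\n";
# prints: http://foo.com/?q=%E2%98%AD
```

In real code a tested library is the better choice; the point here is only that escaping operates on bytes, so the encoding step has to come first.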
18. Bonus: Escaping for email headers
Encode AND Escape for Email subjects
(Encode with MIME-Q):
Encode::encode('MIME-Q', "a\N{U+262D}b");
=?UTF-8?Q?a=E2=98=ADb?=
It encodes and escapes at the same time.
Beware of confusion.
Keep strings for as long as you can.
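A round-trip sketch with the core Encode module, assuming its 'MIME-Q' and 'MIME-Header' encodings. The variable names are illustrative:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $subject = "a\N{U+262D}b";               # a string

# Encode AND escape in one go: the result is plain ASCII bytes.
my $header = encode('MIME-Q', $subject);

# Going back from the header bytes to a string.
my $decoded = decode('MIME-Header', $header);

my $is_ascii   = ($header !~ /[^\x00-\x7F]/) ? 1 : 0;
my $round_trip = ($decoded eq $subject)      ? 1 : 0;
print "$header\n";
```

The header value is only built at the very last moment, when the email is assembled. Everything before that works on $subject as a string.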
19. Conclusion
Make sure you make a difference between Strings and
Bytes. In Perl, it must come from discipline.
Make sure you always encode/decode on i/o as
explicitly as possible. Don't let confused others
confuse you.
Always wonder: What does this thing operate
on? Bytes or Strings? When in doubt, investigate.