This document gives an overview of Unicode concepts (characters, glyphs, code points, character encodings, normalization, and collation) with examples in Perl. Characters are abstract concepts, while glyphs are their visual representations. A character code maps characters to numeric code points, and a character encoding converts those code points to a digital form. The terms character set, character code, character encoding, and character repertoire are commonly confused. Unicode defines over 1 million code points, with encodings such as UTF-8, UTF-16, and UTF-32. Normalization and collation are covered at a high level.
2. “The smallest component of
written language that has semantic value;
refers to the abstract meaning and/or shape,
rather than a specific shape.”
—The Unicode Consortium
What Is a Character?
6. Glyphs are visual
representations of characters.
Fonts are collections of glyphs.
There may be many different glyphs
for the same character.
This talk is not about fonts or glyphs.
What Is a Glyph?
12. Many people use “character set”
to mean one or more of these:
Character Code
Character Encoding
Character Repertoire
Which makes for a confusing situation.
Character Set
13. A defined mapping of
characters to numbers.
A ⇒ 41
B ⇒ 42
C ⇒ 43
Each value in a character code
is called a code point.
Character Code
14. An algorithm to convert
code points to a digital form for ease
of transmitting or storing data.
41 (A) ⇒ 1000001
42 (B) ⇒ 1000010
43 (C) ⇒ 1000011
Character Encoding
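The two slides above can be tied together with a minimal sketch. It assumes an ASCII-based platform; the helper names code_point_hex and seven_bits are made up for illustration and are not part of the talk.

```perl
use strict;
use warnings;

# Character code: look up the code point of a character (shown in hex).
sub code_point_hex { return sprintf '%02X', ord $_[0] }

# Character encoding: render that code point as 7 bits, ASCII style.
sub seven_bits { return sprintf '%07b', ord $_[0] }

for my $char (qw( A B C )) {
    printf "%s => code point %s => bits %s\n",
        $char, code_point_hex($char), seven_bits($char);
}
```

The character code answers "which number is this character?", while the encoding answers "how is that number written down?".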
15. A character repertoire is a
collection of distinct characters.
Character codes, keyboards, and
written languages all have
well-defined character repertoires.
Character Repertoire
17. ASCII
character code: 128 code points
character encoding: 7 bits each
Latin 1 (ISO-8859-1)
character code: 256 code points
character encoding: 8 bits (1 byte) each
Character Codes & Encodings
19. Unicode (character code)
1,114,112 code points (1,112,064 excluding surrogates; 110,000+ assigned)
character encodings:
UTF-8 — 1 to 4 bytes each
UTF-16 — 2 or 4 bytes each
UTF-32 — 4 bytes each
Character Codes & Encodings
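To make the byte counts above concrete, here is a small sketch using the core Encode module. The helper name encoded_length and the sample code points are chosen for illustration only; the big-endian variants are used so that no byte order mark is added to the output.

```perl
use strict;
use warnings;
use Encode qw( encode );

# Number of bytes one code point occupies in the given encoding.
sub encoded_length {
    my ($encoding, $code_point) = @_;
    return length( encode($encoding, chr $code_point) );
}

# A, n with tilde, euro sign, and an emoji outside the Basic Multilingual Plane
for my $cp (0x41, 0xF1, 0x20AC, 0x1F600) {
    printf "U+%04X: UTF-8 %d, UTF-16 %d, UTF-32 %d bytes\n", $cp,
        map { encoded_length($_, $cp) } qw( UTF-8 UTF-16BE UTF-32BE );
}
```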
21. Some code points have
precomposed diacritics.
ȫ
U+022B
LATIN SMALL LETTER O
WITH DIAERESIS AND MACRON
Code Points
22. Other characters must be composed
from multiple code points
using “combining characters.”
n̈
U+006E
LATIN SMALL LETTER N
U+0308
COMBINING DIAERESIS
Code Points
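As a rough illustration of a composed character, the sketch below builds the string from the two code points shown on the slide and lists their official names. The helper code_point_names is a made-up name; charnames::viacode is the core function that does the lookup.

```perl
use strict;
use warnings;
use charnames ();

# "n" followed by COMBINING DIAERESIS: one visible character, two code points.
my $n_diaeresis = "\x{006E}\x{0308}";

# Official Unicode name of every code point in a string.
sub code_point_names {
    my ($string) = @_;
    return map { charnames::viacode( ord $_ ) } split //, $string;
}

printf "%d code points\n", length $n_diaeresis;
print "$_\n" for code_point_names($n_diaeresis);
```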
23. Any series of code points that are composed
into a single user-perceived character.
Informally known as “graphemes.”
A (U+0041)
n̥̈ (U+006E U+0308 U+0325)
CRLF (U+000D U+000A)
Grapheme Clusters
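The three examples on this slide can be checked with the \X regex escape, which matches one grapheme cluster. This is a sketch that assumes a reasonably recent Perl (older versions used a different definition of \X); grapheme_clusters is a hypothetical helper name.

```perl
use strict;
use warnings;

# Split a decoded string into grapheme clusters.
# Call in list context; each element is one user-perceived character.
sub grapheme_clusters {
    my ($string) = @_;
    return $string =~ /(\X)/g;
}

my @examples = (
    [ 'A'                => "\x{0041}" ],
    [ 'n + two marks'    => "\x{006E}\x{0308}\x{0325}" ],
    [ 'CRLF'             => "\x{000D}\x{000A}" ],
);

for my $example (@examples) {
    my ($label, $string) = @$example;
    my @clusters = grapheme_clusters($string);
    printf "%-14s %d code point(s), %d grapheme cluster(s)\n",
        $label, length $string, scalar @clusters;
}
```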
28. use charnames qw( :full );
say "\N{INVERTED EXCLAMATION MARK}jalape\N{LATIN SMALL LETTER N WITH TILDE}o!";
use utf8;
say '¡jalapeño!';
String constants ... TIMTOWTDI
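In the same TIMTOWTDI spirit, a string constant can also name its characters by code point number. The sketch below is not from the talk; it only checks that three spellings of the same string compare equal.

```perl
use strict;
use warnings;
use charnames qw( :full );

# The same ten-character string, written three ways.
my $by_name   = "\N{INVERTED EXCLAMATION MARK}jalape\N{LATIN SMALL LETTER N WITH TILDE}o!";
my $by_number = "\N{U+00A1}jalape\N{U+00F1}o!";
my $by_hex    = "\x{A1}jalape\x{F1}o!";

print "all three spellings are equal\n"
    if $by_name eq $by_number && $by_number eq $by_hex;
printf "%d characters\n", length $by_name;
```

Escapes keep the source file pure ASCII, which sidesteps any question about how the file itself is encoded.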
32. open my $fh, '<:encoding(UTF-8)', $filename;
open my $fh, '>:encoding(UTF-8)', $filename;
binmode $fh, ':encoding(UTF-8)';
binmode STDIN, ':encoding(UTF-8)';
I/O
35. use open qw( :encoding(UTF-8) );
open my $fh, '<', $filename;
# :std for STDIN, STDOUT, STDERR
use open qw( :encoding(UTF-8) :std );
# CPAN module to enable everything UTF-8
use utf8::all;
I/O
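A minimal round trip through a file shows what the encoding layers do. This sketch uses the core File::Temp module for a scratch file; the function name round_trip is made up for illustration.

```perl
use strict;
use warnings;
use File::Temp qw( tempfile );

# Write a character string through an encoding layer, read it back,
# and report the text read plus the file size in bytes.
sub round_trip {
    my ($text) = @_;

    my ($tmp, $filename) = tempfile( UNLINK => 1 );
    close $tmp;

    open my $out, '>:encoding(UTF-8)', $filename
        or die "Cannot write $filename: $!";
    print {$out} $text;
    close $out or die "Cannot close $filename: $!";

    my $bytes_on_disk = -s $filename;

    open my $in, '<:encoding(UTF-8)', $filename
        or die "Cannot read $filename: $!";
    my $read = do { local $/; <$in> };
    close $in;

    return ($read, $bytes_on_disk);
}

my ($read, $bytes) = round_trip("\x{A1}jalape\x{F1}o!");
printf "%d characters in memory, %d bytes on disk\n", length $read, $bytes;
```

The program only ever sees characters; the layer converts to and from bytes at the file boundary.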
36. use Encode;
my $internal = decode('UTF-8', $input);
my $output = encode('UTF-8', $internal);
Explicit Encoding & Decoding
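A common pattern is to decode bytes as they arrive, work on characters, and encode again on the way out. The following is a sketch under that assumption; upcase_utf8 is a hypothetical helper, and the FB_CROAK check (a real Encode constant) is an added choice so that malformed input raises an exception instead of being silently replaced.

```perl
use strict;
use warnings;
use feature 'unicode_strings';
use Encode qw( decode encode );

# Take UTF-8 bytes, upper-case the text, and return UTF-8 bytes.
sub upcase_utf8 {
    my ($bytes) = @_;
    my $text = decode('UTF-8', $bytes, Encode::FB_CROAK);  # bytes -> characters
    return encode('UTF-8', uc $text);                      # characters -> bytes
}

my $output = upcase_utf8("jalape\xC3\xB1o");
printf "%s\n", join ' ', map { sprintf '%02X', ord } split //, $output;

# Malformed input is rejected instead of being passed through.
my $ok = eval { upcase_utf8("\xFF"); 1 };
print "malformed input rejected\n" unless $ok;
```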
37. Let’s use this grapheme cluster as the
string in our next example:
ю́
U+044E
CYRILLIC SMALL LETTER YU
U+0301
COMBINING ACUTE ACCENT
String Length
40. # UTF-8 encoded: D1 8E CC 81
say length $encoded; # 4
use Encode;
# Unicode string: 044E 0301
my $grapheme = decode('UTF-8', $encoded);
say length $grapheme; # 2
# \X matches one grapheme cluster
my $length = () = $grapheme =~ /\X/g;
say $length; # 1
String Length
43. # sort of complex for a simple length, eh?
my $length = () = $str =~ /\X/g;
say $length;
# and tricky depending on the context
say scalar( () = $str =~ /\X/g );
# a little better
$length = 0; # reset before counting again
$length++ while $str =~ /\X/g;
say $length;
String Length
45. # an alternative approach
use Unicode::GCString;
say Unicode::GCString->new($str)->length;
# and yet another (Warning: I wrote it!)
use Unicode::Util qw( grapheme_length );
say grapheme_length($str);
String Length
46. Standard ordering of strings
for comparison and sorting.
sort @names
$a cmp $b
$x gt $y
$foo eq $bar
Collation
48. Perl provides a collation algorithm
based on code points.
my @words = qw( Äpfel durian Xerxes );
say join ' ', sort @words;
# Xerxes durian Äpfel
Collation
51. Unicode Collation Algorithm (UCA) provides
collation based on natural language usage.
use Unicode::Collate;
my $collator = Unicode::Collate->new;
$collator->sort(@words);
# Äpfel durian Xerxes
Collation
52. Unicode Collation Algorithm (UCA) provides
collation based on natural language usage.
$collator->sort(@names)
$collator->cmp($a, $b)
$collator->gt($x, $y)
$collator->eq($foo, $bar)
Collation
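The difference between the two orderings can be seen side by side in a short sketch using the core Unicode::Collate module. The `level => 1` option (compare base letters only, ignoring accents and case) is an extra illustration that is not in the talk.

```perl
use strict;
use warnings;
use Unicode::Collate;

binmode STDOUT, ':encoding(UTF-8)';

# "Äpfel" is written with an escape so the source file stays ASCII.
my @words = ( "\N{U+00C4}pfel", 'durian', 'Xerxes' );

my @by_code_point = sort @words;                           # built-in: code point order
my @by_uca        = Unicode::Collate->new->sort(@words);   # UCA: natural language order

print "code points: @by_code_point\n";
print "UCA:         @by_uca\n";

# At level 1, only base letters are compared.
my $loose = Unicode::Collate->new( level => 1 );
my $same  = $loose->eq( 'apfel', "\N{U+00C4}pfel" ) ? 'equal' : 'different';
print "apfel vs Äpfel at level 1: $same\n";
```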
54. UCA also provides locale-specific collations
for different languages.
use Unicode::Collate::Locale;
my $kolator = Unicode::Collate::Locale->new(
locale => 'pl' # Polish
);
Collation
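To see what a locale changes, compare the default collator with the Polish one on a pair of strings. This sketch relies on the Polish alphabet treating "ą" as a separate letter that follows "a"; the sample words are made up for illustration.

```perl
use strict;
use warnings;
use Unicode::Collate;
use Unicode::Collate::Locale;

binmode STDOUT, ':encoding(UTF-8)';

# "ą" (a with ogonek) and "az"
my @words = ( "\N{U+0105}", 'az' );

# Default UCA: "ą" is treated as an accented "a", so it sorts before "az".
my @default = Unicode::Collate->new->sort(@words);

# Polish tailoring: "ą" is its own letter after "a", so "az" comes first.
my @polish = Unicode::Collate::Locale->new( locale => 'pl' )->sort(@words);

print "default: @default\n";
print "Polish:  @polish\n";
```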
55. Unicode has 4 normalization forms.
The most important are:
NFD: Normalization Form
Canonical Decomposition
NFC: Normalization Form
Canonical Composition
Normalization
56. use Unicode::Normalize;
# NFD can be helpful on input
$str = NFD($input);
# NFC is recommended on output
$output = NFC($str);
Normalization
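Why normalize at all? Two strings can look identical yet contain different code points, and a plain comparison treats them as different. A minimal sketch with the core Unicode::Normalize module:

```perl
use strict;
use warnings;
use Unicode::Normalize qw( NFC NFD );

my $precomposed = "\N{U+00F1}";   # LATIN SMALL LETTER N WITH TILDE
my $decomposed  = "n\x{0303}";    # "n" + COMBINING TILDE

printf "raw strings equal:  %s\n", $precomposed eq $decomposed ? 'yes' : 'no';
printf "equal after NFC:    %s\n",
    NFC($precomposed) eq NFC($decomposed) ? 'yes' : 'no';
printf "NFD length: %d, NFC length: %d\n",
    length NFD($precomposed), length NFC($decomposed);
```

Normalizing both sides to the same form before comparing or storing avoids this kind of mismatch.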
61. By default, unfortunately, strings and regexes are
not guaranteed to use Unicode semantics.
This is known as “The Unicode Bug.”
There are a few ways to fix this:
utf8::upgrade($str); # upgrade one string in place
use v5.12; # implicitly enables unicode_strings
use feature 'unicode_strings'; # or enable just that feature
Unicode Semantics
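The bug is easiest to see with a character in the U+0080 to U+00FF range. The sketch below assumes Perl 5.14 or later, where the unicode_strings feature also applies to regexes; the two helper subs are hypothetical names used to pin down the lexical scope of the feature.

```perl
use strict;
use warnings;

# Legacy semantics: the feature is explicitly off in this scope.
sub is_word_char_legacy {
    no feature 'unicode_strings';
    my ($char) = @_;
    return $char =~ /\A\w\z/ ? 1 : 0;
}

# Unicode semantics: the feature is on in this scope.
sub is_word_char_unicode {
    use feature 'unicode_strings';
    my ($char) = @_;
    return $char =~ /\A\w\z/ ? 1 : 0;
}

my $e_acute = "\x{E9}";   # LATIN SMALL LETTER E WITH ACUTE, stored as a native byte

printf "legacy, as is:        %d\n", is_word_char_legacy($e_acute);
printf "with unicode_strings: %d\n", is_word_char_unicode($e_acute);

utf8::upgrade($e_acute);  # same characters, different internal representation
printf "legacy, upgraded:     %d\n", is_word_char_legacy($e_acute);
```

The same character gives different answers depending on how the string happens to be stored, which is why enabling the feature everywhere is the more predictable fix.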
62. You’ll see the “utf8” encoding
used frequently in Perl.
“utf8” follows the UTF-8 standard very
loosely and allows many errors
in your data without warnings.
By default, use “UTF-8” instead.
UTF-8 vs. utf8 vs. :utf8
64. # utf8 is Perl's internal encoding form
my $internal = decode('utf8', $input);
# UTF-8 is the official UTF-8 encoding
my $internal = decode('UTF-8', $input);
# insecure! no encoding validation at all
open my $fh, '<:utf8', $filename;
# proper UTF-8 validation
open my $fh, '<:encoding(UTF-8)', $filename;
UTF-8 vs. utf8 vs. :utf8
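One way to observe the difference is to ask Encode how it resolves each name and then feed both a byte sequence that the UTF-8 standard forbids. This is a sketch; decodes_cleanly is a made-up helper, and the exact behavior of the lax encoding may vary between Encode versions, so no output is claimed for it here.

```perl
use strict;
use warnings;
use Encode qw( decode find_encoding );

# How Encode resolves the two names.
my $strict_name = find_encoding('UTF-8')->name;
my $lax_name    = find_encoding('utf8')->name;
print "UTF-8 resolves to: $strict_name\n";
print "utf8  resolves to: $lax_name\n";

# Returns 1 if $bytes decodes without error under the named encoding, else 0.
sub decodes_cleanly {
    my ($encoding, $bytes) = @_;
    return eval { decode($encoding, $bytes, Encode::FB_CROAK); 1 } ? 1 : 0;
}

# Bytes that spell the surrogate U+D800, which is not valid in UTF-8.
my $surrogate_bytes = "\xED\xA0\x80";

printf "%-5s accepts the surrogate bytes: %d\n",
    $_, decodes_cleanly($_, $surrogate_bytes) for qw( UTF-8 utf8 );
```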