Fundamental Unicode in Perl

Fundamental
Unicode
Nick Patch

“The smallest component of
written language that has semantic value;
refers to the abstract meaning and/or shape,
rather than a specific shape.”
—The Unicode Consortium
What Is a Character?

Glyphs are visual
representations of characters.
Fonts are collections of glyphs.
There may be many different glyphs
for the same character.
This talk is not about fonts or glyphs.
What Is a Glyph?

. / ?
「 « » 」
Punctuation

CARRIAGE RETURN
NO-BREAK SPACE
COMBINING GRAPHEME JOINER
RIGHT-TO-LEFT MARK
Control Characters

Many people use “character set”
to mean one or more of these:
Character Code
Character Encoding
Character Repertoire
Which makes for a confusing situation.
Character Set

A defined mapping of
characters to numbers.
A ⇒ 41
B ⇒ 42
C ⇒ 43
Each value in a character code
is called a code point.
Character Code
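
The mapping above can be checked with Perl's built-in ord and chr. This is a minimal sketch; the variable names are only illustrative.

```perl
use strict;
use warnings;

# ord maps a character to its code point (a plain integer).
my @code_points = map { ord $_ } qw( A B C );             # 65, 66, 67 in decimal
my @hex         = map { sprintf('%X', $_) } @code_points;

print join(' ', @hex), "\n";                              # prints "41 42 43"

# chr is the inverse: code point back to character.
my $letters = join '', map { chr hex $_ } @hex;
print "$letters\n";                                       # prints "ABC"
```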

An algorithm to convert
code points to a digital form for ease
of transmitting or storing data.
41 (A) ⇒ 1000001
42 (B) ⇒ 1000010
43 (C) ⇒ 1000011
Character Encoding

A character repertoire is a
collection of distinct characters.
Character codes, keyboards, and
written languages all have
well-defined character repertoires.
Character Repertoire

ASCII
character code: 128 code points
character encoding: 7 bits each
Character Codes & Encodings

Latin 1 (ISO-8859-1)
character code: 256 code points
character encoding: 8 bits (1 byte) each

Unicode (character code)
1,112,064 code points (110,000+ defined)
character encodings:
UTF-8 — 1 to 4 bytes each
UTF-16 — 2 or 4 bytes each
UTF-32 — 4 bytes each
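
The byte counts above can be observed with the core Encode module. This sketch assumes the big-endian UTF-16BE and UTF-32BE variants so that no byte order mark is added to the counts; the %sample and %bytes names are illustrative.

```perl
use strict;
use warnings;
use Encode qw( encode );

# Sample characters from the one-, three-, and four-byte UTF-8 ranges.
my %sample = (
    'U+0041'  => "\x{41}",       # LATIN CAPITAL LETTER A
    'U+0ED3'  => "\x{ED3}",      # LAO DIGIT THREE
    'U+1F4A9' => "\x{1F4A9}",    # PILE OF POO
);

my %bytes;    # $bytes{$label}{$encoding} = encoded length in bytes
for my $label (sort keys %sample) {
    for my $encoding (qw( UTF-8 UTF-16BE UTF-32BE )) {
        $bytes{$label}{$encoding} = length( encode($encoding, $sample{$label}) );
    }
    printf "%-8s UTF-8: %d  UTF-16: %d  UTF-32: %d\n",
        $label, @{ $bytes{$label} }{qw( UTF-8 UTF-16BE UTF-32BE )};
}
```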

A
U+0041
LATIN CAPITAL LETTER A
໓
U+0ED3
LAO DIGIT THREE
💩
U+1F4A9
PILE OF POO
Code Points
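
A code point and its official name can be looked up from Perl itself. This sketch uses charnames::viacode from the core charnames module; describe is a hypothetical helper name, not something from the talk.

```perl
use strict;
use warnings;
use charnames ();    # load the lookup functions without enabling \N{...}

# Format a single character as "U+XXXX NAME".
sub describe {
    my ($char) = @_;
    my $code_point = ord $char;
    return sprintf 'U+%04X %s', $code_point, charnames::viacode($code_point);
}

print describe("\x{41}"),  "\n";    # prints "U+0041 LATIN CAPITAL LETTER A"
print describe("\x{ED3}"), "\n";    # prints "U+0ED3 LAO DIGIT THREE"
```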

Some code points have
precomposed diacritics.
ȫ
U+022B
LATIN SMALL LETTER O
WITH DIAERESIS AND MACRON
Code Points

Other characters must be composed
from multiple code points
using “combining characters.”
n̈
U+006E
LATIN SMALL LETTER N
U+0308
COMBINING DIAERESIS
Code Points

Any series of code points that are composed
into a single user-perceived character.
Informally known as “graphemes.”
A (U+0041)
n̥̈ (U+006E U+0308 U+0325)
CRLF (U+000D U+000A)
Grapheme Clusters

🐪
U+1F42A
DROMEDARY CAMEL
Time for some…

# ¡jalapeño!
say "\x{A1}jalape\x{F1}o!";
String constants ... TIMTOWTDI

use v5.12;
say "\N{U+00A1}jalape\N{U+00F1}o!";

use charnames qw( :full );
say "\N{INVERTED EXCLAMATION MARK}jalape\N{LATIN SMALL LETTER N WITH TILDE}o!";

use utf8;
say '¡jalapeño!';
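
As a quick check, the four spellings should all produce the same string. This sketch assumes the source file itself is saved as UTF-8, which the use utf8 literal requires; @spellings and $all_equal are illustrative names.

```perl
use strict;
use warnings;
use utf8;                       # the last literal below is UTF-8 source text
use charnames qw( :full );

my @spellings = (
    "\x{A1}jalape\x{F1}o!",
    "\N{U+00A1}jalape\N{U+00F1}o!",
    "\N{INVERTED EXCLAMATION MARK}jalape\N{LATIN SMALL LETTER N WITH TILDE}o!",
    '¡jalapeño!',
);

# Every spelling should yield the same 10 code points.
my $all_equal = !grep { $_ ne $spellings[0] } @spellings;
printf "%d code points, all equal: %s\n",
    length( $spellings[0] ), $all_equal ? 'yes' : 'no';
```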

=encoding UTF-8
=head1 ¡jalapeño!
String constants ... POD

UTF-8 encoded input
⇩
decode
⇩
Perl Unicode string
⇩
encode
⇩
UTF-8 encoded output
I/O

open my $fh, '<:encoding(UTF-8)', $filename;
open my $fh, '>:encoding(UTF-8)', $filename;
I/O

# or set the layer on a handle that is already open
binmode $fh, ':encoding(UTF-8)';
binmode STDIN, ':encoding(UTF-8)';
I/O
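
A round trip through a file shows the encoding layers doing the encode and decode steps. This is a sketch that assumes a scratch file from the core File::Temp module; the variable names are illustrative.

```perl
use strict;
use warnings;
use File::Temp qw( tempfile );

my $text = "\x{A1}jalape\x{F1}o!";          # 10 code points
my ($tmp_fh, $filename) = tempfile( UNLINK => 1 );
close $tmp_fh;

# Encode on the way out.
open my $out, '>:encoding(UTF-8)', $filename or die "Cannot write $filename: $!";
print {$out} $text;
close $out or die "Cannot close $filename: $!";

# Decode on the way in.
open my $in, '<:encoding(UTF-8)', $filename or die "Cannot read $filename: $!";
my $round_trip = do { local $/; <$in> };
close $in;

# The two non-ASCII characters take 2 bytes each in UTF-8.
my $bytes_on_disk = -s $filename;
printf "%d code points, %d bytes on disk\n", length($round_trip), $bytes_on_disk;
# prints "10 code points, 12 bytes on disk"
```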

use open qw( :encoding(UTF-8) );
open my $fh, '<', $filename;
I/O

# :std for STDIN, STDOUT, STDERR
use open qw( :encoding(UTF-8) :std );
I/O

# CPAN module to enable everything UTF-8
use utf8::all;
I/O

use Encode;
my $internal = decode('UTF-8', $input);
my $output = encode('UTF-8', $internal);
Explicit Encoding & Decoding

Let’s use this grapheme cluster as the
string in our next example:
ю́
U+044E
CYRILLIC SMALL LETTER YU
U+0301
COMBINING ACUTE ACCENT
String Length

# UTF-8 encoded: D1 8E CC 81
say length $encoded_grapheme; # 4
String Length

use Encode;
# Unicode string: 044E 0301
my $grapheme = decode('UTF-8', $encoded);
say length $grapheme; # 2
String Length

my $length = () = $grapheme =~ /\X/g;
say $length; # 1
String Length
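
The three lengths can be compared side by side. This sketch wraps the \X idiom in a hypothetical grapheme_count helper and spells out the UTF-8 bytes of the example grapheme by hand.

```perl
use strict;
use warnings;
use Encode qw( decode );

# Count extended grapheme clusters by matching \X repeatedly.
sub grapheme_count {
    my ($string) = @_;
    my $count = () = $string =~ /\X/g;
    return $count;
}

my $encoded = "\xD1\x8E\xCC\x81";              # UTF-8 bytes for U+044E U+0301
my $decoded = decode('UTF-8', $encoded);

printf "bytes: %d, code points: %d, graphemes: %d\n",
    length($encoded), length($decoded), grapheme_count($decoded);
# prints "bytes: 4, code points: 2, graphemes: 1"
```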

# sort of complex for a simple length, eh?
my $length = () = $str =~ /\X/g;
say $length;
String Length

# and tricky depending on the context
say scalar( () = $str =~ /\X/g );
String Length

# a little better
my $length = 0;
$length++ while $str =~ /\X/g;
say $length;
String Length

# an alternative approach
use Unicode::GCString;
say Unicode::GCString->new($str)->length;
String Length

# and yet another (Warning: I wrote it!)
use Unicode::Util qw( grapheme_length );
say grapheme_length($str);
String Length

Standard ordering of strings
for comparison and sorting.
sort @names
$a cmp $b
$x gt $y
$foo eq $bar
Collation

Perl provides a collation algorithm
based on code points.
Collation

@words = qw( Äpfel durian Xerxes )
sort @words
# Xerxes durian Äpfel
Collation

sort { lc $a cmp lc $b } @words
# durian Xerxes Äpfel
Collation

Unicode Collation Algorithm (UCA) provides
collation based on natural language usage.
Collation

use Unicode::Collate;
my $collator = Unicode::Collate->new;
$collator->sort(@words);
# Äpfel durian Xerxes
Collation

$collator->sort(@names)
$collator->cmp($a, $b)
$collator->gt($x, $y)
$collator->eq($foo, $bar)
Collation
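
Both orderings can be compared in one runnable script. This sketch assumes the core Unicode::Collate module with its default collation table, and writes Ä as an escape so the file encoding does not matter.

```perl
use strict;
use warnings;
use Unicode::Collate;

binmode STDOUT, ':encoding(UTF-8)';

my @words = ( "\x{C4}pfel", 'durian', 'Xerxes' );   # \x{C4} is Ä

my $collator      = Unicode::Collate->new;
my @by_code_point = sort @words;                    # Xerxes durian Äpfel
my @by_uca        = $collator->sort(@words);        # Äpfel durian Xerxes

print "code point order: @by_code_point\n";
print "UCA order:        @by_uca\n";

# The comparison methods mirror cmp, lt, gt, and eq.
my $apfel_first = $collator->lt( "\x{C4}pfel", 'durian' );
```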

UCA also provides locale-specific collations
for different languages.
Collation

use Unicode::Collate::Locale;
my $kolator = Unicode::Collate::Locale->new(
locale => 'pl' # Polish
);
Collation
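
A small comparison shows what the Polish tailoring changes. This sketch assumes Unicode::Collate::Locale is installed (it ships with recent Perl releases) and uses two Polish words chosen only for illustration.

```perl
use strict;
use warnings;
use Unicode::Collate;
use Unicode::Collate::Locale;

binmode STDOUT, ':encoding(UTF-8)';

# \x{107} is ć: "ćma" (moth) and "czas" (time)
my @words = ( "\x{107}ma", 'czas' );

# Default UCA treats ć as c plus an accent, so "ćma" sorts before "czas".
my @default = Unicode::Collate->new->sort(@words);

# The Polish tailoring treats ć as a separate letter that follows c.
my @polish = Unicode::Collate::Locale->new( locale => 'pl' )->sort(@words);

print "default: @default\n";
print "Polish:  @polish\n";
```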

Unicode has 4 normalization forms.
The most important are:
NFD: Normalization Form
Canonical Decomposition
NFC: Normalization Form
Canonical Composition
Normalization

use Unicode::Normalize;
# NFD can be helpful on input
$str = NFD($input);
# NFC is recommended on output
$output = NFC($str);
Normalization
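
The two forms can be inspected code point by code point. This sketch uses the precomposed ȫ from the earlier slide and prints code points in hex with sprintf's %vX; the variable names are illustrative.

```perl
use strict;
use warnings;
use Unicode::Normalize qw( NFC NFD );

my $precomposed = "\x{22B}";             # ȫ as a single code point
my $decomposed  = NFD($precomposed);     # o + COMBINING DIAERESIS + COMBINING MACRON

printf "NFD: %vX\n", $decomposed;        # prints "NFD: 6F.308.304"
printf "NFC: %vX\n", NFC($decomposed);   # prints "NFC: 22B"

# The two spellings differ as raw strings but agree once normalized.
my $raw_equal        = $precomposed eq $decomposed;
my $normalized_equal = NFC($precomposed) eq NFC($decomposed);
```

n̈ has no precomposed form, so NFC leaves it as two code points.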

UTF-8 encoded input
⇩
decode
⇩
NFD
⇩
Perl Unicode string
⇩
NFC
⇩
encode
⇩
UTF-8 encoded output
Normalization
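
The pipeline above might be sketched as a single function. process_utf8 and its callback argument are hypothetical names for illustration; the callback stands in for whatever work the program does on the decoded string.

```perl
use strict;
use warnings;
use Encode qw( decode encode );
use Unicode::Normalize qw( NFC NFD );

# Hypothetical helper: run a callback on decoded, NFD-normalized text,
# then hand back NFC-normalized, UTF-8 encoded bytes.
sub process_utf8 {
    my ($input_bytes, $callback) = @_;
    my $string = NFD( decode('UTF-8', $input_bytes) );
    $string = $callback->($string);
    return encode( 'UTF-8', NFC($string) );
}

# Example callback: strip combining marks, which NFD has split out
# into separate code points.
my $output = process_utf8( "jalape\xC3\xB1o", sub {
    my ($string) = @_;
    $string =~ s/\p{Mn}//g;
    return $string;
} );

print "$output\n";    # prints "jalapeno"
```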

By default, unfortunately, strings and regexes are
not guaranteed to use Unicode semantics.
This is known as “The Unicode Bug.”
There are a few ways to fix this:
Unicode Semantics

utf8::upgrade($str);
Unicode Semantics

use v5.12;
Unicode Semantics

use feature 'unicode_strings';
Unicode Semantics
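
The bug is easiest to see with a case change on a character between U+0080 and U+00FF. This sketch assumes it runs without a file-wide use v5.12, so that the first lc call gets the default, non-Unicode behavior.

```perl
use strict;
use warnings;

# LATIN CAPITAL LETTER E WITH ACUTE, stored internally as a single byte.
my $e_acute_upper = "\x{C9}";

# Default semantics: lc leaves the character unchanged.
my $native = lc $e_acute_upper;

# Fix 1: enable Unicode semantics lexically.
my $fixed;
{
    use feature 'unicode_strings';
    $fixed = lc $e_acute_upper;
}

# Fix 2: upgrade a copy of the string to Perl's internal UTF-8 form.
my $upgraded = $e_acute_upper;
utf8::upgrade($upgraded);
my $via_upgrade = lc $upgraded;

printf "default: %vX, unicode_strings: %vX, upgraded: %vX\n",
    $native, $fixed, $via_upgrade;
# prints "default: C9, unicode_strings: E9, upgraded: E9"
```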

You’ll see the “utf8” encoding
used frequently in Perl.
“utf8” follows the UTF-8 standard very
loosely and allows many errors
in your data without warnings.
By default, use “UTF-8” instead.
UTF-8 vs. utf8 vs. :utf8

# utf8 is Perl's internal encoding form
my $internal = decode('utf8', $input);
# UTF-8 is the official UTF-8 encoding
my $internal = decode('UTF-8', $input);

# insecure! no encoding validation at all
open my $fh, '<:utf8', $filename;
# proper UTF-8 validation
open my $fh, '<:encoding(UTF-8)', $filename;
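
The difference between the two names shows up in which encoding object Encode resolves them to. This sketch feeds the strict decoder a hand-built byte sequence for a surrogate code point, chosen here only as an example of input the UTF-8 standard forbids.

```perl
use strict;
use warnings;
use Encode qw( decode find_encoding FB_CROAK LEAVE_SRC );

# The two names resolve to different encoding objects.
my $strict_name = find_encoding('UTF-8')->name;    # "utf-8-strict"
my $lax_name    = find_encoding('utf8')->name;     # "utf8"

# These bytes follow the UTF-8 bit pattern but encode U+D800, a surrogate,
# which the UTF-8 standard does not allow.
my $ill_formed = "\xED\xA0\x80";

# FB_CROAK makes decode die on invalid input; LEAVE_SRC keeps $ill_formed intact.
my $strict_ok
    = eval { decode( 'UTF-8', $ill_formed, FB_CROAK | LEAVE_SRC ); 1 } ? 1 : 0;

print "UTF-8 is $strict_name, utf8 is $lax_name\n";
print 'strict decoder ', ( $strict_ok ? 'accepted' : 'rejected' ), " the surrogate\n";
```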

Slides will be posted to:
@nickpatch
Questions?
