Fundamental Unicode in Perl
Nick Patch
“The smallest component of
written language that has semantic value;
refers to the abstract meaning and/or shape,
rather than a specific shape.”
—The Unicode Consortium
What Is a Character?
Glyphs are visual
representations of characters.
Fonts are collections of glyphs.
There may be many different glyphs
for the same character.
This talk is not about fonts or glyphs.
What Is a Glyph?
a b c
π ‫ث‬ й
Letters
1 2 3
໓
๓ ३
Numbers
. / ?
「 « » 」
Punctuation
™ © ≠
☺ ☠
Symbols
CARRIAGE RETURN
NO-BREAK SPACE
COMBINING GRAPHEME JOINER
RIGHT-TO-LEFT MARK
Control Characters
Many people use “character set”
to mean one or more of these:
Character Code
Character Encoding
Character Repertoire
Which makes for a confusing situation.
Character Set
A defined mapping of
characters to numbers.
A ⇒ 41
B ⇒ 42
C ⇒ 43
Each value in a character code
is called a code point.
Character Code
An algorithm to convert
code points to a digital form for ease
of transmitting or storing data.
41 (A) ⇒ 1000001
42 (B) ⇒ 1000010
43 (C) ⇒ 1000011
Character Encoding
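The two mappings above can be sketched in a few lines of Perl. This is only an illustration: ord returns the code point, and sprintf with %07b stands in for the 7-bit encoding step. The @rows variable is a name chosen for this example.

```perl
use strict;
use warnings;

# Character code: each character maps to a code point (printed in hex).
# Character encoding: here, the 7-bit binary form of that code point.
my @rows;
for my $char (qw( A B C )) {
    my $code_point = ord $char;
    push @rows, sprintf '%s => %X => %07b', $char, $code_point, $code_point;
}

print "$_\n" for @rows;
# A => 41 => 1000001
# B => 42 => 1000010
# C => 43 => 1000011
```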
A character repertoire is a
collection of distinct characters.
Character codes, keyboards, and
written languages all have
well-defined character repertoires.
Character Repertoire
ASCII
character code: 128 code points
character encoding: 7 bits each
Latin 1 (ISO-8859-1)
character code: 256 code points
character encoding: 8 bits (1 byte) each
Character Codes & Encodings
Unicode (character code)
1,112,064 code points (110,000+ defined)
character encodings:
UTF-8 — 1 to 4 bytes each
UTF-16 — 2 or 4 bytes each
UTF-32 — 4 bytes each
Character Codes & Encodings
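A small sketch of how the byte counts differ per encoding, using the core Encode module. It assumes the big-endian variants (UTF-16BE, UTF-32BE) so that no byte order mark inflates the counts; the %byte_count hash is a name made up for this example.

```perl
use strict;
use warnings;
use Encode qw( encode );

# BE variants are used so that no byte order mark is added to the count.
my @encodings = qw( UTF-8 UTF-16BE UTF-32BE );

my %byte_count;
for my $code_point (0x0041, 0x0ED3, 0x1F4A9) {
    my $label = sprintf 'U+%04X', $code_point;
    for my $encoding (@encodings) {
        # encode() turns a character string into bytes; length counts them.
        $byte_count{$label}{$encoding}
            = length( encode($encoding, chr $code_point) );
    }
    printf "%-8s %s\n", $label,
        join ', ', map { "$_: $byte_count{$label}{$_}" } @encodings;
}
# U+0041   UTF-8: 1, UTF-16BE: 2, UTF-32BE: 4
# U+0ED3   UTF-8: 3, UTF-16BE: 2, UTF-32BE: 4
# U+1F4A9  UTF-8: 4, UTF-16BE: 4, UTF-32BE: 4
```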
A
U+0041
LATIN CAPITAL LETTER A
໓
U+0ED3
LAO DIGIT THREE
💩
U+1F4A9
PILE OF POO
Code Points
Some code points have
precomposed diacritics.
ȫ
U+022B
LATIN SMALL LETTER O
WITH DIAERESIS AND MACRON
Code Points
Other characters must be composed
from multiple code points
using “combining characters.”
n̈
U+006E
LATIN SMALL LETTER N
U+0308
COMBINING DIAERESIS
Code Points
Any series of code points that are composed
into a single user-perceived character.
Informally known as “graphemes.”
A (U+0041)
n̥̈ (U+006E U+0308 U+0325)
CRLF (U+000D U+000A)
Grapheme Clusters
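The three examples on this slide can be checked with the regex escape \X, which matches one extended grapheme cluster. This is a minimal sketch; grapheme_count is a hypothetical helper written for this illustration, not a built-in.

```perl
use strict;
use warnings;

# Hypothetical helper: count grapheme clusters by matching \X repeatedly.
sub grapheme_count {
    my ($str) = @_;
    my $count = () = $str =~ /\X/g;
    return $count;
}

my @examples = (
    [ 'A',                          "\x{41}" ],
    [ 'n + diaeresis + ring below', "\x{6E}\x{308}\x{325}" ],
    [ 'CRLF',                       "\x{0D}\x{0A}" ],
);

my @report;
for my $example (@examples) {
    my ($name, $str) = @$example;
    push @report, sprintf '%s: %d code point(s), %d grapheme cluster(s)',
        $name, length($str), grapheme_count($str);
}
print "$_\n" for @report;
# A: 1 code point(s), 1 grapheme cluster(s)
# n + diaeresis + ring below: 3 code point(s), 1 grapheme cluster(s)
# CRLF: 2 code point(s), 1 grapheme cluster(s)
```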
🐪
U+1F42A
DROMEDARY CAMEL
Time for some…
# ¡jalapeño!
say "\x{A1}jalape\x{F1}o!";
use v5.12;
say "\N{U+00A1}jalape\N{U+00F1}o!";
String constants ... TIMTOWTDI
use charnames qw( :full );
say "\N{INVERTED EXCLAMATION MARK}jalape\N{LATIN SMALL LETTER N WITH TILDE}o!";
use utf8;
say '¡jalapeño!';
String constants ... TIMTOWTDI
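As a quick check that these spellings are interchangeable, the sketch below builds the same string four ways and compares them. It assumes the source file is saved as UTF-8 (required for the literal under use utf8); the variable names are made up for this example.

```perl
use strict;
use warnings;
use utf8;                       # the source file itself is UTF-8
use charnames qw( :full );

my $by_hex     = "\x{A1}jalape\x{F1}o!";
my $by_code    = "\N{U+00A1}jalape\N{U+00F1}o!";
my $by_name    = "\N{INVERTED EXCLAMATION MARK}jalape\N{LATIN SMALL LETTER N WITH TILDE}o!";
my $by_literal = '¡jalapeño!';

# eq compares characters, so all four spellings should be identical.
my $all_equal = $by_hex  eq $by_code
             && $by_code eq $by_name
             && $by_name eq $by_literal;

print $all_equal ? "same string\n" : "mismatch\n";
print length($by_literal), " characters\n";   # 10 characters
```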
=encoding UTF-8
=head1 ¡jalapeño!
String constants ... POD
UTF-8 encoded input
⇩
decode
⇩
Perl Unicode string
⇩
encode
⇩
UTF-8 encoded output
I/O
open my $fh, '<:encoding(UTF-8)', $filename;
open my $fh, '>:encoding(UTF-8)', $filename;
binmode $fh, ':encoding(UTF-8)';
binmode STDIN, ':encoding(UTF-8)';
I/O
use open qw( :encoding(UTF-8) );
open my $fh, '<', $filename;
# :std for STDIN, STDOUT, STDERR
use open qw( :encoding(UTF-8) :std );
# CPAN module to enable everything UTF-8
use utf8::all;
I/O
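A minimal round trip through a file, assuming a temporary file from the core File::Temp module. The encoding layer encodes on write and decodes on read, so the program only ever sees characters while the file holds bytes.

```perl
use strict;
use warnings;
use File::Temp qw( tempfile );

my $text = "\x{A1}jalape\x{F1}o!";    # 10 characters

# tempfile() returns an open handle and a file name; reopen with layers.
my ($tmp_fh, $filename) = tempfile( UNLINK => 1 );
close $tmp_fh;

open my $out, '>:encoding(UTF-8)', $filename
    or die "Cannot write $filename: $!";
print {$out} $text;
close $out;

# Two of the characters need 2 bytes each in UTF-8, so 12 bytes on disk.
my $size_in_bytes = -s $filename;

open my $in, '<:encoding(UTF-8)', $filename
    or die "Cannot read $filename: $!";
my $read_back = do { local $/; <$in> };
close $in;

print "bytes on disk: $size_in_bytes\n";               # 12
print "characters read: ", length($read_back), "\n";   # 10
```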
use Encode;
my $internal = decode('UTF-8', $input);
my $output = encode('UTF-8', $internal);
Explicit Encoding & Decoding
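To make the difference between bytes and characters concrete, here is a small sketch with the input given as raw UTF-8 bytes. The byte string is an assumption standing in for data read from a socket or a file opened without layers.

```perl
use strict;
use warnings;
use Encode qw( decode encode );

# Raw UTF-8 bytes for "¡jalapeño!" (stand-in for undecoded input).
my $input = "\xC2\xA1jalape\xC3\xB1o!";

my $internal = decode('UTF-8', $input);     # bytes -> characters
my $output   = encode('UTF-8', $internal);  # characters -> bytes

print length($input),    "\n";   # 12 (bytes)
print length($internal), "\n";   # 10 (characters)
print $output eq $input ? "round trip ok\n" : "round trip failed\n";
```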
Let’s use this grapheme cluster as the
string in our next example:
ю́
U+044E
CYRILLIC SMALL LETTER YU
U+0301
COMBINING ACUTE ACCENT
String Length
# UTF-8 encoded: D1 8E CC 81
say length $encoded; # 4
use Encode;
# Unicode string: 044E 0301
my $grapheme = decode('UTF-8', $encoded);
say length $grapheme; # 2
my $length = () = $grapheme =~ /\X/g;
say $length; # 1
String Length
# sort of complex for a simple length, eh?
my $length = () = $str =~ /\X/g;
say $length;
# and tricky depending on the context
say scalar( () = $str =~ /\X/g );
# a little better
my $count = 0;
$count++ while $str =~ /\X/g;
say $count;
String Length
# an alternative approach
use Unicode::GCString;
say Unicode::GCString->new($str)->length;
# and yet another (Warning: I wrote it!)
use Unicode::Util qw( grapheme_length );
say grapheme_length($str);
String Length
Standard ordering of strings
for comparison and sorting.
sort @names
$a cmp $b
$x gt $y
$foo eq $bar
Collation
Perl provides a collation algorithm
based on code points.
my @words = qw( Äpfel durian Xerxes );
say join ' ', sort @words;
# Xerxes durian Äpfel
say join ' ', sort { lc $a cmp lc $b } @words;
# durian Xerxes Äpfel
Collation
Unicode Collation Algorithm (UCA) provides
collation based on natural language usage.
use Unicode::Collate;
my $collator = Unicode::Collate->new;
say join ' ', $collator->sort(@words);
# Äpfel durian Xerxes
Collation
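The three orderings from the collation slides can be compared side by side. This is a sketch that assumes a UTF-8 source file and uses the core Unicode::Collate module with its default settings.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Collate;

my @words = qw( Äpfel durian Xerxes );

my @by_code_point = sort @words;                        # built-in sort
my @by_lowercase  = sort { lc $a cmp lc $b } @words;    # case-folded
my @by_uca        = Unicode::Collate->new->sort(@words);

binmode STDOUT, ':encoding(UTF-8)';
print "@by_code_point\n";   # Xerxes durian Äpfel
print "@by_lowercase\n";    # durian Xerxes Äpfel
print "@by_uca\n";          # Äpfel durian Xerxes
```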
Unicode Collation Algorithm (UCA) provides
collation based on natural language usage.
$collator->sort(@names)
$collator->cmp($a, $b)
$collator->gt($x, $y)
$collator->eq($foo, $bar)
Collation
UCA also provides locale-specific collations
for different languages.
use Unicode::Collate::Locale;
my $kolator = Unicode::Collate::Locale->new(
locale => 'pl' # Polish
);
Collation
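A hedged illustration of what the Polish tailoring changes: in Polish, ó is a separate letter that sorts after o, whereas the default UCA order treats the accent as a secondary difference. The two sample words are chosen for this example only.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Collate;
use Unicode::Collate::Locale;

my @words = qw( ósmy owoc );

# Default UCA: the accent is ignored at the primary level, so "s" < "w" decides.
my @default_order = Unicode::Collate->new->sort(@words);

# Polish tailoring: "ó" is its own letter, sorted after every plain "o".
my @polish_order = Unicode::Collate::Locale->new( locale => 'pl' )->sort(@words);

binmode STDOUT, ':encoding(UTF-8)';
print "@default_order\n";   # ósmy owoc
print "@polish_order\n";    # owoc ósmy
```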
Unicode has 4 normalization forms.
The most important are:
NFD: Normalization Form
Canonical Decomposition
NFC: Normalization Form
Canonical Composition
Normalization
use Unicode::Normalize;
# NFD can be helpful on input
$str = NFD($input);
# NFC is recommended on output
$output = NFC($str);
Normalization
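To see what the two forms do to actual code points, the sketch below normalizes the precomposed ȫ (U+022B) from the earlier slide. The code_points helper is hypothetical and exists only to print a readable list.

```perl
use strict;
use warnings;
use Unicode::Normalize qw( NFD NFC );

# Hypothetical helper: render a string as a list of U+XXXX code points.
sub code_points {
    my ($str) = @_;
    return join ' ', map { sprintf 'U+%04X', ord } split //, $str;
}

my $precomposed = "\x{022B}";          # one code point
my $decomposed  = NFD($precomposed);   # base letter plus two combining marks
my $recomposed  = NFC($decomposed);    # back to the single code point

print code_points($precomposed), "\n";   # U+022B
print code_points($decomposed),  "\n";   # U+006F U+0308 U+0304
print code_points($recomposed),  "\n";   # U+022B

# The two forms look identical to a reader but are not equal as strings,
# which is why normalizing before comparison matters.
print $precomposed eq $decomposed ? "equal\n" : "not equal\n";   # not equal
```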
UTF-8 encoded input
⇩
decode
⇩
NFD
⇩
Perl Unicode string
⇩
NFC
⇩
encode
⇩
UTF-8 encoded output
Normalization
By default, unfortunately, strings and regexes are
not guaranteed to use Unicode semantics.
This is known as “The Unicode Bug.”
There are a few ways to fix this:
utf8::upgrade($str);
use v5.12;
use feature 'unicode_strings';
Unicode Semantics
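A small demonstration of the bug and two of the fixes, assuming Perl 5.14 or later (where the feature also affects regexes). The same character é either matches \w or not depending on how the string is stored and which features are in scope.

```perl
use strict;
use warnings;
no feature 'unicode_strings';   # make sure the legacy default applies here

my $e_acute = "\xE9";           # é stored as a single native byte

# Legacy semantics: a byte above 0x7F is not treated as a word character.
my $legacy_match = $e_acute =~ /\w/ ? 1 : 0;

# Fix 1: upgrade the string's internal representation.
my $upgraded = $e_acute;
utf8::upgrade($upgraded);
my $upgraded_match = $upgraded =~ /\w/ ? 1 : 0;

# Fix 2: enable the feature in the scope where the regex is compiled.
my $feature_match;
{
    use feature 'unicode_strings';
    $feature_match = $e_acute =~ /\w/ ? 1 : 0;
}

print "legacy:   $legacy_match\n";     # 0
print "upgraded: $upgraded_match\n";   # 1
print "feature:  $feature_match\n";    # 1
```

Note that the upgraded copy still compares equal to the original with eq; only the semantics applied to it change.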
You’ll see the “utf8” encoding
used frequently in Perl.
“utf8” follows the UTF-8 standard very
loosely and allows many errors
in your data without warnings.
By default, use “UTF-8” instead.
UTF-8 vs. utf8 vs. :utf8
# utf8 is Perl's internal encoding form
my $internal = decode('utf8', $input);
# UTF-8 is the official UTF-8 encoding
my $internal = decode('UTF-8', $input);
# insecure! no encoding validation at all
open my $fh, '<:utf8', $filename;
# proper UTF-8 validation
open my $fh, '<:encoding(UTF-8)', $filename;
UTF-8 vs. utf8 vs. :utf8
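The distinction is visible from Encode itself: the two names resolve to different encoding objects. The sketch below also feeds the strict decoder a byte sequence that encodes a surrogate (U+D800), which is not well-formed UTF-8, and uses FB_CROAK so the failure is raised as an exception. The sample bytes are an assumption chosen for illustration.

```perl
use strict;
use warnings;
use Encode qw( decode find_encoding );

my $strict_name = find_encoding('UTF-8')->name;
my $lax_name    = find_encoding('utf8')->name;

print "UTF-8 resolves to: $strict_name\n";   # utf-8-strict
print "utf8  resolves to: $lax_name\n";      # utf8

# These bytes follow the UTF-8 bit pattern but encode the surrogate U+D800.
my $bad_bytes = "\xED\xA0\x80";

# With FB_CROAK, decode() dies on malformed input (and may modify its
# argument), so work on a copy inside eval.
my $strict_accepts = eval {
    my $copy = $bad_bytes;
    decode('UTF-8', $copy, Encode::FB_CROAK());
    1;
} ? 1 : 0;

print "strict decoder accepts the bytes: $strict_accepts\n";   # 0
```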
Slides will be posted to:
@nickpatch
Questions?

More Related Content

What's hot

Types and perl language
Types and perl languageTypes and perl language
Types and perl languageMasahiro Honma
 
Clean code: meaningful Name
Clean code: meaningful NameClean code: meaningful Name
Clean code: meaningful Namenahid035
 
CSharp Language Overview Part 1
CSharp Language Overview Part 1CSharp Language Overview Part 1
CSharp Language Overview Part 1Hossein Zahed
 
Four Languages From Forty Years Ago
Four Languages From Forty Years AgoFour Languages From Forty Years Ago
Four Languages From Forty Years AgoScott Wlaschin
 
Learn About Simple Tricks For Coding & Decoding
Learn About Simple Tricks For Coding & DecodingLearn About Simple Tricks For Coding & Decoding
Learn About Simple Tricks For Coding & DecodingMukesh Kumar
 
Regex Presentation
Regex PresentationRegex Presentation
Regex Presentationarnolambert
 
natural language processing
natural language processing natural language processing
natural language processing sunanthakrishnan
 
Aspects of software naturalness through the generation of IdentifierNames
Aspects of software naturalness through the generation of IdentifierNamesAspects of software naturalness through the generation of IdentifierNames
Aspects of software naturalness through the generation of IdentifierNamesOleksandr Zaitsev
 
Lecture 2 php basics (1)
Lecture 2  php basics (1)Lecture 2  php basics (1)
Lecture 2 php basics (1)Core Lee
 
Naming Standards, Clean Code
Naming Standards, Clean CodeNaming Standards, Clean Code
Naming Standards, Clean CodeCleanestCode
 
How to improve the quality of your TYPO3 extensions
How to improve the quality of your TYPO3 extensionsHow to improve the quality of your TYPO3 extensions
How to improve the quality of your TYPO3 extensionsChristian Trabold
 

What's hot (17)

C# slid
C# slidC# slid
C# slid
 
Types and perl language
Types and perl languageTypes and perl language
Types and perl language
 
Perl slid
Perl slidPerl slid
Perl slid
 
Clean code: meaningful Name
Clean code: meaningful NameClean code: meaningful Name
Clean code: meaningful Name
 
CSharp Language Overview Part 1
CSharp Language Overview Part 1CSharp Language Overview Part 1
CSharp Language Overview Part 1
 
Four Languages From Forty Years Ago
Four Languages From Forty Years AgoFour Languages From Forty Years Ago
Four Languages From Forty Years Ago
 
7.1.intro perl
7.1.intro perl7.1.intro perl
7.1.intro perl
 
Learn About Simple Tricks For Coding & Decoding
Learn About Simple Tricks For Coding & DecodingLearn About Simple Tricks For Coding & Decoding
Learn About Simple Tricks For Coding & Decoding
 
Regex Presentation
Regex PresentationRegex Presentation
Regex Presentation
 
natural language processing
natural language processing natural language processing
natural language processing
 
note
notenote
note
 
Aspects of software naturalness through the generation of IdentifierNames
Aspects of software naturalness through the generation of IdentifierNamesAspects of software naturalness through the generation of IdentifierNames
Aspects of software naturalness through the generation of IdentifierNames
 
Parsing
ParsingParsing
Parsing
 
Lecture 2 php basics (1)
Lecture 2  php basics (1)Lecture 2  php basics (1)
Lecture 2 php basics (1)
 
Naming Standards, Clean Code
Naming Standards, Clean CodeNaming Standards, Clean Code
Naming Standards, Clean Code
 
How to improve the quality of your TYPO3 extensions
How to improve the quality of your TYPO3 extensionsHow to improve the quality of your TYPO3 extensions
How to improve the quality of your TYPO3 extensions
 
C++
C++C++
C++
 

Viewers also liked

Linea del tempo .. del patinaje de carreras
Linea del tempo .. del  patinaje  de carrerasLinea del tempo .. del  patinaje  de carreras
Linea del tempo .. del patinaje de carrerasLeidypiracoca
 
Grelha Baseado Na Afe 6 1 Me Lamas
Grelha Baseado Na Afe 6 1 Me LamasGrelha Baseado Na Afe 6 1 Me Lamas
Grelha Baseado Na Afe 6 1 Me Lamasguest8100d11
 
Specialty Silica
Specialty SilicaSpecialty Silica
Specialty SilicaEssien Jae
 
El cajero automatico
El cajero automaticoEl cajero automatico
El cajero automaticosheryl0072
 
NEXT GAS 486.4 MILES copy
NEXT GAS 486.4 MILES copyNEXT GAS 486.4 MILES copy
NEXT GAS 486.4 MILES copyEd Gines
 
Símbolos Da 1ª República
Símbolos Da 1ª RepúblicaSímbolos Da 1ª República
Símbolos Da 1ª RepúblicaBibJoseRegio
 
ТЕМА 11. ПРОИЗВОДСТВЕННАЯ ЦЕПОЧКА И УЧАСТНИКИ РЫНКА РЕКЛАМЫ И СВЯЗЕЙ С ОБЩЕСТ...
ТЕМА 11. ПРОИЗВОДСТВЕННАЯ ЦЕПОЧКА И УЧАСТНИКИ РЫНКА РЕКЛАМЫ И СВЯЗЕЙ С ОБЩЕСТ...ТЕМА 11. ПРОИЗВОДСТВЕННАЯ ЦЕПОЧКА И УЧАСТНИКИ РЫНКА РЕКЛАМЫ И СВЯЗЕЙ С ОБЩЕСТ...
ТЕМА 11. ПРОИЗВОДСТВЕННАЯ ЦЕПОЧКА И УЧАСТНИКИ РЫНКА РЕКЛАМЫ И СВЯЗЕЙ С ОБЩЕСТ...Usanov Aleksey
 
Gene expression analysis in storage root of cassava using microarray data
Gene expression analysis in storage root of cassava using microarray dataGene expression analysis in storage root of cassava using microarray data
Gene expression analysis in storage root of cassava using microarray dataCIAT
 
Para liberar el estress
Para liberar el estressPara liberar el estress
Para liberar el estresspacheco
 
Best Exercise bike 2016
Best Exercise bike 2016Best Exercise bike 2016
Best Exercise bike 2016dhoni45
 
BSA Four Pillars
BSA Four PillarsBSA Four Pillars
BSA Four PillarsRita Saco
 

Viewers also liked (20)

Trading StocksSemanal04/03/2011
Trading StocksSemanal04/03/2011Trading StocksSemanal04/03/2011
Trading StocksSemanal04/03/2011
 
Linea del tempo .. del patinaje de carreras
Linea del tempo .. del  patinaje  de carrerasLinea del tempo .. del  patinaje  de carreras
Linea del tempo .. del patinaje de carreras
 
ORDINARY DIPLOMA
ORDINARY DIPLOMAORDINARY DIPLOMA
ORDINARY DIPLOMA
 
1469797563-109815594
1469797563-1098155941469797563-109815594
1469797563-109815594
 
Тема 10
Тема 10Тема 10
Тема 10
 
Pndh3
Pndh3Pndh3
Pndh3
 
Grelha Baseado Na Afe 6 1 Me Lamas
Grelha Baseado Na Afe 6 1 Me LamasGrelha Baseado Na Afe 6 1 Me Lamas
Grelha Baseado Na Afe 6 1 Me Lamas
 
Specialty Silica
Specialty SilicaSpecialty Silica
Specialty Silica
 
El cajero automatico
El cajero automaticoEl cajero automatico
El cajero automatico
 
NEXT GAS 486.4 MILES copy
NEXT GAS 486.4 MILES copyNEXT GAS 486.4 MILES copy
NEXT GAS 486.4 MILES copy
 
Símbolos Da 1ª República
Símbolos Da 1ª RepúblicaSímbolos Da 1ª República
Símbolos Da 1ª República
 
ТЕМА 11. ПРОИЗВОДСТВЕННАЯ ЦЕПОЧКА И УЧАСТНИКИ РЫНКА РЕКЛАМЫ И СВЯЗЕЙ С ОБЩЕСТ...
ТЕМА 11. ПРОИЗВОДСТВЕННАЯ ЦЕПОЧКА И УЧАСТНИКИ РЫНКА РЕКЛАМЫ И СВЯЗЕЙ С ОБЩЕСТ...ТЕМА 11. ПРОИЗВОДСТВЕННАЯ ЦЕПОЧКА И УЧАСТНИКИ РЫНКА РЕКЛАМЫ И СВЯЗЕЙ С ОБЩЕСТ...
ТЕМА 11. ПРОИЗВОДСТВЕННАЯ ЦЕПОЧКА И УЧАСТНИКИ РЫНКА РЕКЛАМЫ И СВЯЗЕЙ С ОБЩЕСТ...
 
Gene expression analysis in storage root of cassava using microarray data
Gene expression analysis in storage root of cassava using microarray dataGene expression analysis in storage root of cassava using microarray data
Gene expression analysis in storage root of cassava using microarray data
 
Presentation_NEW.PPTX
Presentation_NEW.PPTXPresentation_NEW.PPTX
Presentation_NEW.PPTX
 
Mada's New logo
Mada's New logoMada's New logo
Mada's New logo
 
Para liberar el estress
Para liberar el estressPara liberar el estress
Para liberar el estress
 
Seducción en la publicidad B2C
Seducción en la publicidad B2CSeducción en la publicidad B2C
Seducción en la publicidad B2C
 
Best Exercise bike 2016
Best Exercise bike 2016Best Exercise bike 2016
Best Exercise bike 2016
 
BSA Four Pillars
BSA Four PillarsBSA Four Pillars
BSA Four Pillars
 
Calendario
CalendarioCalendario
Calendario
 

Similar to Fundamental Unicode in Perl

An Explanation of the Unicode, the Text Encoding Standard, Its Usages and Imp...
An Explanation of the Unicode, the Text Encoding Standard, Its Usages and Imp...An Explanation of the Unicode, the Text Encoding Standard, Its Usages and Imp...
An Explanation of the Unicode, the Text Encoding Standard, Its Usages and Imp...Yann-Gaël Guéhéneuc
 
Unicode Regular Expressions
Unicode Regular ExpressionsUnicode Regular Expressions
Unicode Regular ExpressionsNova Patch
 
Understand unicode & utf8 in perl (2)
Understand unicode & utf8 in perl (2)Understand unicode & utf8 in perl (2)
Understand unicode & utf8 in perl (2)Jerome Eteve
 
Unicode and character sets
Unicode and character setsUnicode and character sets
Unicode and character setsrenchenyu
 
Parsing Expression Grammars
Parsing Expression GrammarsParsing Expression Grammars
Parsing Expression Grammarsteknico
 
Jun 29 new privacy technologies for unicode and international data standards ...
Jun 29 new privacy technologies for unicode and international data standards ...Jun 29 new privacy technologies for unicode and international data standards ...
Jun 29 new privacy technologies for unicode and international data standards ...Ulf Mattsson
 
Lecture 04 syntax analysis
Lecture 04 syntax analysisLecture 04 syntax analysis
Lecture 04 syntax analysisIffat Anjum
 
ITU - MDD - Textural Languages and Grammars
ITU - MDD - Textural Languages and GrammarsITU - MDD - Textural Languages and Grammars
ITU - MDD - Textural Languages and GrammarsTonny Madsen
 
Internationalisation And Globalisation
Internationalisation And GlobalisationInternationalisation And Globalisation
Internationalisation And GlobalisationAlan Dean
 
New compiler design 101 April 13 2024.pdf
New compiler design 101 April 13 2024.pdfNew compiler design 101 April 13 2024.pdf
New compiler design 101 April 13 2024.pdfeliasabdi2024
 
CryptX '22 W1 Release (1).pptx
CryptX '22 W1 Release (1).pptxCryptX '22 W1 Release (1).pptx
CryptX '22 W1 Release (1).pptxBhavikaGianey
 
System Programming Unit IV
System Programming Unit IVSystem Programming Unit IV
System Programming Unit IVManoj Patil
 

Similar to Fundamental Unicode in Perl (20)

An Explanation of the Unicode, the Text Encoding Standard, Its Usages and Imp...
An Explanation of the Unicode, the Text Encoding Standard, Its Usages and Imp...An Explanation of the Unicode, the Text Encoding Standard, Its Usages and Imp...
An Explanation of the Unicode, the Text Encoding Standard, Its Usages and Imp...
 
PHP for Grown-ups
PHP for Grown-upsPHP for Grown-ups
PHP for Grown-ups
 
Unicode Regular Expressions
Unicode Regular ExpressionsUnicode Regular Expressions
Unicode Regular Expressions
 
Your code is not a string
Your code is not a stringYour code is not a string
Your code is not a string
 
Syntax analysis
Syntax analysisSyntax analysis
Syntax analysis
 
Unicode 101
Unicode 101Unicode 101
Unicode 101
 
Understand unicode & utf8 in perl (2)
Understand unicode & utf8 in perl (2)Understand unicode & utf8 in perl (2)
Understand unicode & utf8 in perl (2)
 
Unicode and character sets
Unicode and character setsUnicode and character sets
Unicode and character sets
 
Parsing Expression Grammars
Parsing Expression GrammarsParsing Expression Grammars
Parsing Expression Grammars
 
Jun 29 new privacy technologies for unicode and international data standards ...
Jun 29 new privacy technologies for unicode and international data standards ...Jun 29 new privacy technologies for unicode and international data standards ...
Jun 29 new privacy technologies for unicode and international data standards ...
 
What Reika Taught us
What Reika Taught usWhat Reika Taught us
What Reika Taught us
 
Cfg part i
Cfg   part iCfg   part i
Cfg part i
 
Lecture 04 syntax analysis
Lecture 04 syntax analysisLecture 04 syntax analysis
Lecture 04 syntax analysis
 
Working with text, Regular expressions
Working with text, Regular expressionsWorking with text, Regular expressions
Working with text, Regular expressions
 
ITU - MDD - Textural Languages and Grammars
ITU - MDD - Textural Languages and GrammarsITU - MDD - Textural Languages and Grammars
ITU - MDD - Textural Languages and Grammars
 
Internationalisation And Globalisation
Internationalisation And GlobalisationInternationalisation And Globalisation
Internationalisation And Globalisation
 
New compiler design 101 April 13 2024.pdf
New compiler design 101 April 13 2024.pdfNew compiler design 101 April 13 2024.pdf
New compiler design 101 April 13 2024.pdf
 
CryptX '22 W1 Release (1).pptx
CryptX '22 W1 Release (1).pptxCryptX '22 W1 Release (1).pptx
CryptX '22 W1 Release (1).pptx
 
System Programming Unit IV
System Programming Unit IVSystem Programming Unit IV
System Programming Unit IV
 
Compiler Design Tutorial
Compiler Design Tutorial Compiler Design Tutorial
Compiler Design Tutorial
 

Recently uploaded

Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraDeakin University
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your BudgetHyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your BudgetEnjoy Anytime
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 

Recently uploaded (20)

Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your BudgetHyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 

Fundamental Unicode in Perl

  • 2. “The smallest component of written language that has semantic value; refers to the abstract meaning and/or shape, rather than a specific shape.” —The Unicode Consortium What Is a Character?
  • 3. Glyphs are visual representations of characters. What Is a Glyph?
  • 4. Glyphs are visual representations of characters. Fonts are collections of glyphs. What Is a Glyph?
  • 5. Glyphs are visual representations of characters. Fonts are collections of glyphs. There may be many different glyphs for the same character. What Is a Glyph?
  • 6. Glyphs are visual representations of characters. Fonts are collections of glyphs. There may be many different glyphs for the same character. This talk is not about fonts or glyphs. What Is a Glyph?
  • 7. a b c π ‫ث‬ й Letters
  • 8. 1 2 3 ໓ ๓ ३ Numbers
  • 9. . / ? 「 « » 」 Punctuation
  • 10. ™ © ≠ ☺ ☠ Symbols
  • 11. CARRIAGE RETURN NO-BREAK SPACE COMBINING GRAPHEME JOINER RIGHT-TO-LEFT MARK Control Characters
  • 12. Many people use “character set” to mean one or more of these: Character Code Character Encoding Character Repertoire Which makes for a confusing situation. Character Set
  • 13. A defined mapping of characters to numbers. A ⇒ 41 B ⇒ 42 C ⇒ 43 Each value in a character code is called a code point. Character Code
  • 14. An algorithm to convert code points to a digital form for ease of transmitting or storing data. 41 (A) ⇒ 1000001 42 (B) ⇒ 1000010 43 (C) ⇒ 1000011 Character Encoding
  • 15. A character repertoire is a collection of distinct characters. Character codes, keyboards, and written languages all have well-defined character repertoires. Character Repertoire
  • 16. ASCII character code: 128 code points character encoding: 7 bits each Character Codes & Encodings
  • 17. ASCII character code: 128 code points character encoding: 7 bits each Latin 1 (ISO-8859-1) character code: 256 code points character encoding: 8 bits (1 byte) each Character Codes & Encodings
  • 18. Unicode (character code) 1,112,064 code points (110,000+ defined) Character Codes & Encodings
  • 19. Unicode (character code) 1,112,064 code points (110,000+ defined) character encodings: UTF-8 — 1 to 4 bytes each UTF-16 — 2 or 4 bytes each UTF-32 — 4 bytes each Character Codes & Encodings
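
The byte counts on the slide above can be checked with a short Perl sketch. It assumes the core Encode module and uses the big-endian variants (UTF-16BE, UTF-32BE) so that no byte order mark is added to the count; the `%bytes_for` hash is only an illustrative name.

```perl
use strict;
use warnings;
use Encode qw( encode );

# Two sample code points: U+0041 (A) and U+1F4A9 (PILE OF POO).
my %bytes_for;
for my $char ("\x{41}", "\x{1F4A9}") {
    my $label = sprintf 'U+%04X', ord $char;
    for my $encoding (qw( UTF-8 UTF-16BE UTF-32BE )) {
        # encode() returns a byte string, so length() counts bytes here.
        my $bytes = encode($encoding, $char);
        $bytes_for{$label}{$encoding} = length $bytes;
        printf "%-8s %-9s %d byte(s)\n", $label, $encoding, length $bytes;
    }
}
```

U+0041 takes 1, 2, and 4 bytes respectively, while U+1F4A9 takes 4 bytes in all three encodings.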
  • 20. A U+0041 LATIN CAPITAL LETTER A ໓ U+0ED3 LAO DIGIT THREE 💩 U+1F4A9 PILE OF POO Code Points
  • 21. Some code points have precomposed diacritics. ȫ U+022B LATIN SMALL LETTER O WITH DIAERESIS AND MACRON Code Points
  • 22. Other characters must be composed from multiple code points using “combining characters.” n̈ U+006E LATIN SMALL LETTER N U+0308 COMBINING DIAERESIS Code Points
  • 23. Any series of code points that are composed into a single user-perceived character. Informally known as “graphemes.” A (U+0041) n̥̈ (U+006E U+0308 U+0325) CRLF (U+000D U+000A) Grapheme Clusters
  • 26. # ¡jalapeño! say "\x{A1}jalape\x{F1}o!"; use v5.12; say "\N{U+00A1}jalape\N{U+00F1}o!"; String constants ... TIMTOWTDI
  • 27. use charnames qw( :full ); say "\N{INVERTED EXCLAMATION MARK}jalape\N{LATIN SMALL LETTER N WITH TILDE}o!"; String constants ... TIMTOWTDI
  • 28. use charnames qw( :full ); say "\N{INVERTED EXCLAMATION MARK}jalape\N{LATIN SMALL LETTER N WITH TILDE}o!"; use utf8; say '¡jalapeño!'; String constants ... TIMTOWTDI
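
As a minimal sketch, the four spellings from the String constants slides can be compared in one script. The variable names are illustrative, and the source file is assumed to be saved as UTF-8 (which is what `use utf8` declares).

```perl
use strict;
use warnings;
use utf8;                       # this source file is UTF-8 encoded
use charnames qw( :full );      # named characters via \N{...}

my $by_hex        = "\x{A1}jalape\x{F1}o!";
my $by_code_point = "\N{U+00A1}jalape\N{U+00F1}o!";
my $by_name       = "\N{INVERTED EXCLAMATION MARK}jalape\N{LATIN SMALL LETTER N WITH TILDE}o!";
my $literal       = '¡jalapeño!';

# Encode on output so the terminal receives UTF-8 bytes.
binmode STDOUT, ':encoding(UTF-8)';
for my $string ($by_hex, $by_code_point, $by_name, $literal) {
    print( ($string eq $literal ? 'same' : 'different'), ": $string\n" );
}
```

All four variables hold the same ten-character string; only the notation in the source code differs.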
  • 30. UTF-8 encoded input ⇩ decode ⇩ Perl Unicode string ⇩ encode ⇩ UTF-8 encoded output I/O
  • 31. open my $fh, '<:encoding(UTF-8)', $filename; open my $fh, '>:encoding(UTF-8)', $filename; I/O
  • 32. open my $fh, '<:encoding(UTF-8)', $filename; open my $fh, '>:encoding(UTF-8)', $filename; binmode $fh, ':encoding(UTF-8)'; binmode STDIN, ':encoding(UTF-8)'; I/O
  • 33. use open qw( :encoding(UTF-8) ); open my $fh, '<', $filename; I/O
  • 34. use open qw( :encoding(UTF-8) ); open my $fh, '<', $filename; # :std for STDIN, STDOUT, STDERR use open qw( :encoding(UTF-8) :std ); I/O
  • 35. use open qw( :encoding(UTF-8) ); open my $fh, '<', $filename; # :std for STDIN, STDOUT, STDERR use open qw( :encoding(UTF-8) :std ); # CPAN module to enable everything UTF-8 use utf8::all; I/O
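
The decode-on-input, encode-on-output flow from the I/O slides might be exercised with a round trip like the one below. It is a sketch only: the scratch file comes from the core File::Temp module and exists purely for the demonstration.

```perl
use strict;
use warnings;
use utf8;
use File::Temp qw( tempfile );

my $text = '¡jalapeño!';    # 10 characters

# Hypothetical scratch file, created only for this demonstration.
my ($tmp_fh, $filename) = tempfile( UNLINK => 1 );
close $tmp_fh;

# Encode on the way out.
open my $out, '>:encoding(UTF-8)', $filename or die "open: $!";
print {$out} $text;
close $out or die "close: $!";

# Raw read: what is stored on disk is bytes.
open my $raw, '<:raw', $filename or die "open: $!";
my $bytes = do { local $/; <$raw> };
close $raw;

# Decoding read: the bytes become characters again.
open my $in, '<:encoding(UTF-8)', $filename or die "open: $!";
my $chars = do { local $/; <$in> };
close $in;

printf "bytes on disk: %d, characters after decoding: %d\n",
    length $bytes, length $chars;
```

The file holds 12 bytes because ¡ and ñ each take two bytes in UTF-8, while the decoded string has 10 characters.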
  • 36. use Encode; my $internal = decode('UTF-8', $input); my $output = encode('UTF-8', $internal); Explicit Encoding & Decoding
  • 37. Let’s use this grapheme cluster as the string in our next example: ю́ U+044E CYRILLIC SMALL LETTER YU U+0301 COMBINING ACUTE ACCENT String Length
  • 38. # UTF-8 encoded: D1 8E CC 81 say length $encoded_grapheme; # 4 String Length
  • 39. # UTF-8 encoded: D1 8E CC 81 say length $encoded_grapheme; # 4 use Encode; # Unicode string: 044E 0301 my $grapheme = decode('UTF-8', $encoded_grapheme); say length $grapheme; # 2 String Length
  • 40. # UTF-8 encoded: D1 8E CC 81 say length $encoded_grapheme; # 4 use Encode; # Unicode string: 044E 0301 my $grapheme = decode('UTF-8', $encoded_grapheme); say length $grapheme; # 2 my $length = () = $grapheme =~ /\X/g; say $length; # 1 String Length
  • 41. # sort of complex for a simple length, eh? my $length = () = $str =~ /\X/g; say $length; String Length
  • 42. # sort of complex for a simple length, eh? my $length = () = $str =~ /\X/g; say $length; # and tricky depending on the context say scalar( () = $str =~ /\X/g ); String Length
  • 43. # sort of complex for a simple length, eh? my $length = () = $str =~ /\X/g; say $length; # and tricky depending on the context say scalar( () = $str =~ /\X/g ); # a little better $length++ while $str =~ /\X/g; say $length; String Length
  • 44. # an alternative approach use Unicode::GCString; say Unicode::GCString->new($str)->length; String Length
  • 45. # an alternative approach use Unicode::GCString; say Unicode::GCString->new($str)->length; # and yet another (Warning: I wrote it!) use Unicode::Util qw( grapheme_length ); say grapheme_length($str); String Length
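
The three lengths from the String Length slides can be reproduced in one self-contained sketch. It uses only core Perl and Encode; the variable names for the counts are illustrative.

```perl
use strict;
use warnings;
use Encode qw( decode );

# The grapheme cluster ю́ as UTF-8 bytes: D1 8E CC 81
my $encoded_grapheme = "\xD1\x8E\xCC\x81";

# Decoding yields the code points U+044E U+0301.
my $grapheme = decode('UTF-8', $encoded_grapheme);

my $byte_count       = length $encoded_grapheme;     # counts bytes
my $code_point_count = length $grapheme;             # counts code points
my $cluster_count    = () = $grapheme =~ /\X/g;      # counts grapheme clusters

print "$byte_count bytes, $code_point_count code points, ",
      "$cluster_count grapheme cluster\n";
```

The same string therefore has a length of 4, 2, or 1 depending on whether bytes, code points, or grapheme clusters are counted.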
  • 46. Standard ordering of strings for comparison and sorting. sort @names $a cmp $b $x gt $y $foo eq $bar Collation
  • 47. Perl provides a collation algorithm based on code points. Collation
  • 48. Perl provides a collation algorithm based on code points. @words = qw( Äpfel durian Xerxes ); sort @words; # Xerxes durian Äpfel Collation
  • 49. Perl provides a collation algorithm based on code points. @words = qw( Äpfel durian Xerxes ); sort @words; # Xerxes durian Äpfel sort { lc $a cmp lc $b } @words; # durian Xerxes Äpfel Collation
  • 50. Unicode Collation Algorithm (UCA) provides collation based on natural language usage. Collation
  • 51. Unicode Collation Algorithm (UCA) provides collation based on natural language usage. use Unicode::Collate; my $collator = Unicode::Collate->new; $collator->sort(@words); # Äpfel durian Xerxes Collation
  • 52. Unicode Collation Algorithm (UCA) provides collation based on natural language usage. $collator->sort(@names) $collator->cmp($a, $b) $collator->gt($x, $y) $collator->eq($foo, $bar) Collation
  • 53. UCA also provides locale-specific collations for different languages. Collation
  • 54. UCA also provides locale-specific collations for different languages. use Unicode::Collate::Locale; my $kolator = Unicode::Collate::Locale->new( locale => 'pl' # Polish ); Collation
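
To make the contrast in the Collation slides concrete, here is a small sketch that sorts the same three words by code point and by the default Unicode Collation Algorithm. It assumes the core Unicode::Collate module with its default settings and does not cover locale tailoring.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Collate;

my @words = qw( Äpfel durian Xerxes );

# Built-in sort compares code points: X (U+0058) < d (U+0064) < Ä (U+00C4).
my @by_code_point = sort @words;

# UCA treats Ä as a variant of A and ignores case at the primary level.
my @by_uca = Unicode::Collate->new->sort(@words);

binmode STDOUT, ':encoding(UTF-8)';
print "code points: @by_code_point\n";
print "UCA:         @by_uca\n";
```

The code-point sort yields Xerxes durian Äpfel, and the UCA sort yields Äpfel durian Xerxes, matching the slides.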
  • 55. Unicode has 4 normalization forms. The most important are: NFD: Normalization Form Canonical Decomposition NFC: Normalization Form Canonical Composition Normalization
  • 56. use Unicode::Normalize; # NFD can be helpful on input $str = NFD($input); # NFC is recommended on output $output = NFC($str); Normalization
  • 57. UTF-8 encoded input ⇩ decode ⇩ NFD ⇩ Perl Unicode string ⇩ NFC ⇩ encode ⇩ UTF-8 encoded output Normalization
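
A brief sketch of what the two normalization forms do to a string, assuming the core Unicode::Normalize module. The sample word is chosen for illustration.

```perl
use strict;
use warnings;
use Unicode::Normalize qw( NFD NFC );

# "jalapeño" with a precomposed ñ (U+00F1): 8 code points.
my $composed = "jalape\x{F1}o";

# NFD splits ñ into n (U+006E) + COMBINING TILDE (U+0303): 9 code points.
my $decomposed = NFD($composed);

# NFC recombines them into the precomposed form.
my $recomposed = NFC($decomposed);

printf "composed: %d, decomposed: %d, recomposed: %d code points\n",
    length $composed, length $decomposed, length $recomposed;
print "NFC(NFD(x)) eq x: ", ($recomposed eq $composed ? "yes" : "no"), "\n";
```

Because the composed and decomposed strings are not `eq` to each other, normalizing to one form before comparing or matching avoids false mismatches between text that looks identical.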
  • 58. By default, unfortunately, strings and regexes are not guaranteed to use Unicode semantics. This is known as “The Unicode Bug.” There are a few ways to fix this: Unicode Semantics
  • 59. By default, unfortunately, strings and regexes are not guaranteed to use Unicode semantics. This is known as “The Unicode Bug.” There are a few ways to fix this: utf8::upgrade($str); Unicode Semantics
  • 60. By default, unfortunately, strings and regexes are not guaranteed to use Unicode semantics. This is known as “The Unicode Bug.” There are a few ways to fix this: utf8::upgrade($str); use v5.12; Unicode Semantics
  • 61. By default, unfortunately, strings and regexes are not guaranteed to use Unicode semantics. This is known as “The Unicode Bug.” There are a few ways to fix this: utf8::upgrade($str); use v5.12; use feature 'unicode_strings'; Unicode Semantics
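
One way to observe “The Unicode Bug” and two of the fixes listed above is to uppercase é (U+00E9). This sketch assumes the script does not enable `unicode_strings` at file scope (no `use v5.12` or later at the top) and that no locale pragma is in effect.

```perl
use strict;
use warnings;

# é stored as a single character below 0x100, not upgraded internally.
my $e_acute = "\x{E9}";

# Without Unicode semantics, uc() leaves the character unchanged.
my $without_fix = uc $e_acute;

# Fix 1: upgrade the string's internal representation first.
my $with_upgrade = do {
    my $copy = $e_acute;
    utf8::upgrade($copy);
    uc $copy;
};

# Fix 2: enable the feature for the enclosing lexical scope.
my $with_feature = do {
    use feature 'unicode_strings';
    uc $e_acute;
};

printf "without fix: U+%04X\n",   ord $without_fix;
printf "utf8::upgrade: U+%04X\n", ord $with_upgrade;
printf "unicode_strings: U+%04X\n", ord $with_feature;
```

Only the two fixed variants produce U+00C9 (É); the unfixed call returns U+00E9 unchanged.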
  • 62. You’ll see the “utf8” encoding used frequently in Perl. “utf8” follows the UTF-8 standard very loosely and allows many errors in your data without warnings. By default, use “UTF-8” instead. UTF-8 vs. utf8 vs. :utf8
  • 63. # utf8 is Perl's internal encoding form my $internal = decode('utf8', $input); # UTF-8 is the official UTF-8 encoding my $internal = decode('UTF-8', $input); UTF-8 vs. utf8 vs. :utf8
  • 64. # utf8 is Perl's internal encoding form my $internal = decode('utf8', $input); # UTF-8 is the official UTF-8 encoding my $internal = decode('UTF-8', $input); # insecure! no encoding validation at all open my $fh, '<:utf8', $filename; # proper UTF-8 validation open my $fh, '<:encoding(UTF-8)', $filename; UTF-8 vs. utf8 vs. :utf8
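
The difference between the strict and lax decoders can be sketched with bytes that encode a surrogate code point (U+D800), which is not valid in standard UTF-8. The helper `decodes_ok` is a hypothetical name introduced for this illustration; `FB_CROAK` is Encode's flag for dying on malformed input.

```perl
use strict;
use warnings;
use Encode qw( decode find_encoding FB_CROAK );

# These bytes follow the UTF-8 bit pattern for the surrogate U+D800.
my $bad_bytes = "\xED\xA0\x80";

# Hypothetical helper: returns 1 if decoding succeeds, 0 if it dies.
sub decodes_ok {
    my ($encoding, $bytes) = @_;    # a copy, since decode() with a CHECK
                                    # value may modify its input
    return eval { decode($encoding, $bytes, FB_CROAK); 1 } ? 1 : 0;
}

my $strict_ok = decodes_ok('UTF-8', $bad_bytes);
my $lax_ok    = decodes_ok('utf8',  $bad_bytes);

# Encode reports different canonical names for the two encodings.
my $strict_name = find_encoding('UTF-8')->name;
my $lax_name    = find_encoding('utf8')->name;

print "$strict_name accepts the surrogate: ", ($strict_ok ? 'yes' : 'no'), "\n";
print "$lax_name accepts the surrogate: ",    ($lax_ok    ? 'yes' : 'no'), "\n";
```

The strict decoder rejects the input while the lax one accepts it, which is why the slides recommend “UTF-8” by default.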
  • 65. Slides will be posted to: @nickpatch Questions?