Demystifying Unicode
@colinodell
Colin O’Dell
● Principal Engineer at Unleashed Technologies
● PHP for ~20 years; 13 years professionally
● Creator & maintainer of league/commonmark library
● PHP League leadership team
● Owner of moderngeekware.com
● @colinodell
Agenda
● A History of Encoding Systems
● Unicode Standard
● Unicode Encodings
● Using Unicode in PHP
● Tips & Tricks
● Questions & Answers
Assumptions
● Some familiarity with PHP
● Basic understanding of binary and hexadecimal
● Focus on high-level concepts!
Encoding Systems
L = 1001100
A (Brief) History of Encoding Systems
1837: Morse Code (Internationalized in 1844)
“Morse-Vail Telegraph Key” by the National Museum of American History is licensed under CC BY-NC 2.0
1930s: Teleprinters
1960s: Teletypes (TTYs) For Computing
1960s: ASCII
● American Standard Code for Information Interchange
● 7-bit binary encoding
○ 0000000 = 0
○ ...
○ 1111111 = 127
0 1 2 3 4 5 6 7 8 9 A B C D E F
0 NUL SOH STX ETX EOT ENQ ACK BEL BS HT LF VT FF CR SO SI
1 DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC FS GS RS US
2 SPACE ! " # $ % & ' ( ) * + , - . /
3 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
4 @ A B C D E F G H I J K L M N O
5 P Q R S T U V W X Y Z [ \ ] ^ _
6 ` a b c d e f g h i j k l m n o
7 p q r s t u v w x y z { | } ~ DEL
Character Hex Binary
LF (line feed) 0x0A 0001010
3 0x33 0110011
E 0x45 1000101
e 0x65 1100101
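The hex and binary columns above can be reproduced in PHP. This is only a small sketch; the helper name describeAscii() is made up for this example, while ord() and sprintf() are standard PHP functions.

```php
<?php
// Hypothetical helper: show the hex and 7-bit binary form of one ASCII character.
function describeAscii(string $char): string
{
    $code = ord($char); // byte value; 0-127 for ASCII input
    return sprintf('0x%02X %07b', $code, $code);
}

echo describeAscii("\n"), PHP_EOL; // 0x0A 0001010
echo describeAscii('3'), PHP_EOL;  // 0x33 0110011
echo describeAscii('E'), PHP_EOL;  // 0x45 1000101
echo describeAscii('e'), PHP_EOL;  // 0x65 1100101
```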
00xxxxx = 32 control codes
01xxxxx = 32 numbers & symbols
10xxxxx = 32 uppercase letters and some extra symbols
11xxxxx = 32 lowercase letters and some extra symbols
A = 0x41 = 1000001
B = 0x42 = 1000010
…
Z = 0x5A = 1011010
a = 0x61 = 1100001
b = 0x62 = 1100010
…
z = 0x7A = 1111010
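Comparing the two columns, an uppercase letter and its lowercase twin differ in a single bit (0100000, or 0x20). A minimal sketch of that observation, assuming the input is a single ASCII letter; the function names are made up for this example:

```php
<?php
// Hypothetical helpers: flip ASCII case by setting or clearing bit 0x20.
function asciiToLower(string $letter): string
{
    return chr(ord($letter) | 0x20);  // 'A' (1000001) -> 'a' (1100001)
}

function asciiToUpper(string $letter): string
{
    return chr(ord($letter) & ~0x20); // 'z' (1111010) -> 'Z' (1011010)
}

echo asciiToLower('A'), asciiToUpper('z'), PHP_EOL; // aZ
```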
But computers use 8-bit bytes...
ASCII (7 Bits) ???
Start 00000000 10000000
End 01111111 11111111
Count 128 128
0 1 2 3 4 5 6 7 8 9 A B C D E F
0 NUL SOH STX ETX EOT ENQ ACK BEL BS HT LF VT FF CR SO SI
1 DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC FS GS RS US
2 SPACE ! " # $ % & ' ( ) * + , - . /
3 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
4 @ A B C D E F G H I J K L M N O
5 P Q R S T U V W X Y Z [ \ ] ^ _
6 ` a b c d e f g h i j k l m n o
7 p q r s t u v w x y z { | } ~ DEL
8-F ???
8-bit “Extended ASCII”
0 1 2 3 4 5 6 7 8 9 A B C D E F
0
1
2 SPACE ! " # $ % & ' ( ) * + , - . /
3 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
4 @ A B C D E F G H I J K L M N O
5 P Q R S T U V W X Y Z [ \ ] ^ _
6 ` a b c d e f g h i j k l m n o
7 p q r s t u v w x y z { | } ~
8
9
A NBSP ¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬ SHY ® ¯
B ° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿
C À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï
D Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß
E à á â ã ä å æ ç è é ê ë ì í î ï
F ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ
ISO 8859-1
0 1 2 3 4 5 6 7 8 9 A B C D E F
0 NUL SOH STX ETX EOT ENQ ACK BEL BS HT LF VT FF CR SO SI
1 DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC FS GS RS US
2 SPACE ! " # $ % & ' ( ) * + , - . /
3 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
4 @ A B C D E F G H I J K L M N O
5 P Q R S T U V W X Y Z [ \ ] ^ _
6 ` a b c d e f g h i j k l m n o
7 p q r s t u v w x y z { | } ~ DEL
8 € ‚ ƒ „ … † ‡ ˆ ‰ Š ‹ Œ Ž
9 ‘ ’ “ ” • – — ˜ ™ š › œ ž Ÿ
A NBSP ¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬ SHY ® ¯
B ° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿
C À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï
D Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß
E à á â ã ä å æ ç è é ê ë ì í î ï
F ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ
Windows-1252
0 1 2 3 4 5 6 7 8 9 A B C D E F
0
1
2 SPACE ! " # $ % & ' ( ) * + , - . /
3 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
4 @ A B C D E F G H I J K L M N O
5 P Q R S T U V W X Y Z [ \ ] ^ _
6 ` a b c d e f g h i j k l m n o
7 p q r s t u v w x y z { | } ~
8
9
A NBSP Ą ˘ Ł ¤ Ľ Ś § ¨ Š Ş Ť Ź SHY Ž Ż
B ° ą ˛ ł ´ ľ ś ˇ ¸ š ş ť ź ˝ ž ż
C Ŕ Á Â Ă Ä Ĺ Ć Ç Č É Ę Ë Ě Í Î Ď
D Đ Ń Ň Ó Ô Ő Ö × Ř Ů Ú Ű Ü Ý Ţ ß
E ŕ á â ă ä ĺ ć ç č é ę ë ě í î ď
F đ ń ň ó ô ő ö ÷ ř ů ú ű ü ý ţ ˙
ISO 8859-2
0 1 2 3 4 5 6 7 8 9 A B C D E F
0 NUL ☺ ☻ ♥ ♦ ♣ ♠ • ◘ ○ ◙ ♂ ♀ ♪ ♫ ☼
1 ► ◄ ↕ ‼ ¶ § ▬ ↨ ↑ ↓ → ← ∟ ↔ ▲ ▼
2 SPACE ! " # $ % & ' ( ) * + , - . /
3 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
4 @ A B C D E F G H I J K L M N O
5 P Q R S T U V W X Y Z [ \ ] ^ _
6 ` a b c d e f g h i j k l m n o
7 p q r s t u v w x y z { | } ~ ⌂
8 Ç ü é â ä à å ç ê ë è ï î ì Ä Å
9 É æ Æ ô ö ò û ù ÿ Ö Ü ¢ £ ¥ ₧ ƒ
A á í ó ú ñ Ñ ª º ¿ ⌐ ¬ ½ ¼ ¡ « »
B ░ ▒ ▓ │ ┤ ╡ ╢ ╖ ╕ ╣ ║ ╗ ╝ ╜ ╛ ┐
C └ ┴ ┬ ├ ─ ┼ ╞ ╟ ╚ ╔ ╩ ╦ ╠ ═ ╬ ╧
D ╨ ╤ ╥ ╙ ╘ ╒ ╓ ╫ ╪ ┘ ┌ █ ▄ ▌ ▐ ▀
E α ß Γ π Σ σ µ τ Φ Θ Ω δ ∞ φ ε ∩
F ≡ ± ≥ ≤ ⌠ ⌡ ÷ ≈ ° ∙ · √ ⁿ ² ■ NBSP
Code Page 437 (IBM PC)
8-bit “Extended ASCII”
● ISO 8859 - 16 variations:
○ ISO 8859-1 (“Latin 1”, Western European)
○ ISO 8859-2 (“Latin 2”, Central European)
○ ISO 8859-3 (“Latin 3”, South European)
○ ISO 8859-4 (“Latin 4”, North European)
○ ISO 8859-5 (Latin/Cyrillic)
○ ISO 8859-6 (Latin/Arabic)
○ ISO 8859-7 (Latin/Greek)
○ ISO 8859-8 (Latin/Hebrew)
○ ISO 8859-9 (“Latin 5”, Turkish)
○ ISO 8859-10 (“Latin 6”, Nordic)
○ ISO 8859-11 (Latin/Thai)
○ ISO 8859-12 (Latin/Devanagari) - abandoned
○ ISO 8859-13 (“Latin 7”, Baltic Rim)
○ ISO 8859-14 (“Latin 8”, Celtic)
○ ISO 8859-15 (“Latin 9”)
■ Revision of 8859-1 that swaps out less-used characters; adds the euro currency symbol
○ ISO 8859-16 (“Latin 10”, South-Eastern European)
● Windows-1252
● CP 437 - Original IBM PC
● Mac OS Roman character set
● TRS-80 character set
● Atari’s ATASCII
● Commodore’s PETSCII
● HP Roman-8 and Roman-9
● DEC’s Multinational Character Set
● Lotus International Character Set
● ECMA-94
But then along came the Internet...
https://xkcd.com/927/
“The Unicode Standard is the universal character
encoding standard for written characters and text. It
defines a consistent way of encoding multilingual text
that enables the exchange of text data internationally and
creates the foundation for global software”
Code Points
Problem:
How to accommodate larger character sets without wasting memory?
Solution:
Break the one-to-one correspondence between characters and
bits/encoding! Offer different ways to encode based on
different needs.
ASCII vs. Unicode
ASCII:
Character Encoded Bits
H 01001000 (0x48)
P 01010000 (0x50)
Unicode:
Glyph Code Point Name Encoded Bits
P U+0050 LATIN CAPITAL LETTER P ????
H U+0048 LATIN CAPITAL LETTER H ????
Glyph Code Point Name Encoded Bits
P U+0050 LATIN CAPITAL LETTER P ????
h U+0068 LATIN SMALL LETTER H ????
Σ U+03A3 GREEK CAPITAL LETTER SIGMA ????
U+0634 ARABIC LETTER SHEEN ????
U+1D2ED MAYAN NUMERAL THIRTEEN ????
😸 U+1F638 GRINNING CAT FACE WITH SMILING EYES ????
H U+0048 LATIN CAPITAL LETTER H ????
D U+0044 LATIN CAPITAL LETTER D
U+1F604 SMILING FACE WITH OPEN MOUTH AND SMILING EYES
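Moving between a character and its code point is a one-liner in PHP. A minimal sketch, assuming ext/mbstring is loaded (mb_ord() and mb_chr() exist since PHP 7.2); formatCodePoint() is a made-up helper name:

```php
<?php
// Hypothetical helper: print a character's code point in U+hhhh notation.
function formatCodePoint(string $char): string
{
    return sprintf('U+%04X', mb_ord($char, 'UTF-8'));
}

echo formatCodePoint('P'), PHP_EOL;         // U+0050
echo formatCodePoint("\u{3A3}"), PHP_EOL;   // U+03A3 (GREEK CAPITAL LETTER SIGMA)
echo formatCodePoint("\u{1F638}"), PHP_EOL; // U+1F638

// And back again: code point number to character.
$cat = mb_chr(0x1F638, 'UTF-8');
```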
Code Planes
Recap
● Code Point: a number representing a single character*
○ 143,859 defined as of Unicode 13.0
○ Format: U+hhhhhh
● Codespace: A range of numerical values available for encoding characters
○ Support for 1,114,112 codepoints (0x000000 - 0x10FFFF)
● Code Planes: Contiguous group of 65,536 (2^16) code points
○ 17 planes, numbered 0 - 16, which correspond to the possible values 0x00–0x10 of the first two positions in the six-position hexadecimal format (U+hhhhhh)
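Since each plane holds exactly 0x10000 code points, the plane number is simply the code point shifted right by 16 bits. A small sketch; codePlane() is a made-up name for this example:

```php
<?php
// Hypothetical helper: which of the 17 planes does a code point live in?
function codePlane(int $codePoint): int
{
    if ($codePoint < 0 || $codePoint > 0x10FFFF) {
        throw new InvalidArgumentException('Outside the Unicode codespace');
    }

    return $codePoint >> 16; // drop the low 16 bits (65,536 values per plane)
}

echo codePlane(0x0050), PHP_EOL;   // 0 (Basic Multilingual Plane)
echo codePlane(0x1F638), PHP_EOL;  // 1
echo codePlane(0x10FFFF), PHP_EOL; // 16
```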
Glyphs and Graphemes
Character / Code Point:
a
U+0061
LATIN SMALL LETTER A
Glyphs (the same character drawn in different typefaces):
a a a a a a a a
Glyphs and Graphemes
Glyph / Grapheme: c a f e
Unicode Character: c a f e
Code Point: U+0063 U+0061 U+0066 U+0065
Name: LATIN SMALL LETTER C, LATIN SMALL LETTER A, LATIN SMALL LETTER F, LATIN SMALL LETTER E
Glyphs and Graphemes: Combining Diacritical Marks
Glyph / Grapheme: c a f é
Unicode Character: c a f e ◌́
Code Point: U+0063 U+0061 U+0066 U+0065 U+0301
Name: LATIN SMALL LETTER C, LATIN SMALL LETTER A, LATIN SMALL LETTER F, LATIN SMALL LETTER E, COMBINING ACUTE ACCENT
e + ◌́ = é
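The combining form is a different sequence of code points (and bytes) from a precomposed é, even though both render the same. A short sketch, assuming ext/mbstring; the precomposed U+00E9 is not on the slide and is included here only for comparison:

```php
<?php
$precomposed = "caf\u{E9}";   // é as one code point, U+00E9 (assumption: added for comparison)
$combining   = "cafe\u{301}"; // e followed by U+0301 COMBINING ACUTE ACCENT

var_dump($precomposed === $combining); // bool(false): the bytes differ

echo strlen($precomposed), ' ', mb_strlen($precomposed, 'UTF-8'), PHP_EOL; // 5 4
echo strlen($combining), ' ', mb_strlen($combining, 'UTF-8'), PHP_EOL;     // 6 5
```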
Glyphs and Graphemes: Combining Diacritical Marks
Z̷̧̨̰̋Å̸̮͉ ̵͉̣̄̇̀
L̵͉̣̄̇̀G
̸̮͉̊ O
̸̱͒̓ ̷̧̨̰̋Ț͝E̪̘̗̓͝X̪̘̗T
̸̰̺̝̍̈
Glyphs and Graphemes: Variation Selectors
Text style:
Glyph / Grapheme: ✈
Unicode Character: ✈ VS15
Code Point: U+2708 U+FE0E
Name: AIRPLANE, VARIATION SELECTOR 15 (TEXT STYLE)
Emoji style:
Unicode Character: ✈ VS16
Code Point: U+2708 U+FE0F
Name: AIRPLANE, VARIATION SELECTOR 16 (EMOJI STYLE)
Glyphs and Graphemes: Regional Indicator Symbols
Glyph / Grapheme: 🇺🇸
Unicode Character: 🇺 🇸
Code Point: U+1F1FA U+1F1F8
Name: REGIONAL INDICATOR SYMBOL LETTER U, REGIONAL INDICATOR SYMBOL LETTER S
Glyph / Grapheme: 🇨🇦
Unicode Character: 🇨 🇦
Code Point: U+1F1E8 U+1F1E6
Name: REGIONAL INDICATOR SYMBOL LETTER C, REGIONAL INDICATOR SYMBOL LETTER A
Glyphs and Graphemes: Modifiers
Unicode Character: 👋
Code Point: U+1F44B U+1F3FC
Name: WAVING HAND SIGN, EMOJI MODIFIER FITZPATRICK TYPE-3
Unicode Character: 👋
Code Point: U+1F44B U+1F3FE
Name: WAVING HAND SIGN, EMOJI MODIFIER FITZPATRICK TYPE-5
Glyphs and Graphemes: ZWJ Sequences
Without joiners:
Glyph / Grapheme: 👨 👩 👶 👧
Unicode Character: 👨 👩 👶 👧
Code Point: U+1F468 U+1F469 U+1F476 U+1F467
Name: MAN, WOMAN, BABY, GIRL
With joiners:
Unicode Character: 👨 ZWJ 👩 ZWJ 👶 ZWJ 👧
Code Point: U+1F468 U+200D U+1F469 U+200D U+1F476 U+200D U+1F467
Name: MAN, ZERO WIDTH JOINER, WOMAN, ZERO WIDTH JOINER, BABY, ZERO WIDTH JOINER, GIRL
Glyphs and Graphemes: ZWJ Sequences
Code Point: U+1F477 U+200D U+2642
Name: CONSTRUCTION WORKER, ZERO WIDTH JOINER, MALE SIGN
Code Point: U+1F477 U+200D U+2640
Name: CONSTRUCTION WORKER, ZERO WIDTH JOINER, FEMALE SIGN
Code Point: U+1F477 U+1F3FE U+200D U+2640
Name: CONSTRUCTION WORKER, EMOJI MODIFIER FITZPATRICK TYPE-5, ZERO WIDTH JOINER, FEMALE SIGN
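The last sequence can be built and taken apart again in PHP, which makes the point that one visible emoji may be several code points. A sketch assuming ext/mbstring and PHP 7.4+ (for mb_str_split() and arrow functions); the variable names are illustrative only:

```php
<?php
// CONSTRUCTION WORKER + FITZPATRICK TYPE-5 + ZERO WIDTH JOINER + FEMALE SIGN
$worker = "\u{1F477}\u{1F3FE}\u{200D}\u{2640}";

// Split into individual code points and format each as U+hhhh.
$codePoints = array_map(
    fn (string $char): string => sprintf('U+%04X', mb_ord($char, 'UTF-8')),
    mb_str_split($worker, 1, 'UTF-8')
);

echo implode(' ', $codePoints), PHP_EOL;   // U+1F477 U+1F3FE U+200D U+2640
echo mb_strlen($worker, 'UTF-8'), PHP_EOL; // 4 code points
echo strlen($worker), PHP_EOL;             // 14 bytes in UTF-8
```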
Enough about code points...
Encoding Schemes
● Most popular:
○ UTF-8
○ UTF-16
○ UTF-32
UTF-32
Fixed-width encoding; 4 bytes per code point
Codepoint range => Unicode scalar value (binary) => Encoded bytes
U+0000..U+D7FF, U+E000..U+10FFFF => xxxxxxxxxxxxxxxxxxxxx => 00000000 000xxxxx xxxxxxxx xxxxxxxx
Examples:
A U+0041 LATIN CAPITAL LETTER A
0x0041 => 1000001 => 00000000 00000000 00000000 01000001
😸 U+1F638 GRINNING CAT FACE WITH SMILING EYES
0x1F638 => 11111011000111000 => 00000000 00000001 11110110 00111000
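The two examples can be checked with mbstring. A sketch assuming ext/mbstring and big-endian byte order ('UTF-32BE', no byte order mark), which is how the bytes are written on the slide; utf32Bits() is a made-up helper:

```php
<?php
// Hypothetical helper: show the UTF-32 (big-endian) bytes of a UTF-8 character as bits.
function utf32Bits(string $char): string
{
    $bytes = mb_convert_encoding($char, 'UTF-32BE', 'UTF-8');

    $groups = [];
    foreach (str_split($bytes) as $byte) {
        $groups[] = sprintf('%08b', ord($byte)); // one 8-bit group per byte
    }

    return implode(' ', $groups);
}

echo utf32Bits('A'), PHP_EOL;         // 00000000 00000000 00000000 01000001
echo utf32Bits("\u{1F638}"), PHP_EOL; // 00000000 00000001 11110110 00111000
```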
UTF-16
Variable-length encoding; 2 or 4 bytes per code point
Codepoint range => Unicode scalar value (binary) => Encoded bytes
U+0000..U+D7FF, U+E000..U+FFFF (Basic Multilingual Plane) => 00000 xxxxxxxx xxxxxxxx => xxxxxxxx xxxxxxxx
U+010000..U+10FFFF (Supplementary Planes) => Uuuuu uuuuuuuu uuuuuuuu => 110110xx xxxxxxxx 110111yy yyyyyyyy
U' = xxxxxxxxxxyyyyyyyyyy // U - 0x10000
W1 = 110110xxxxxxxxxx // 0xD800 + xxxxxxxxxx (high surrogate, 0xD800-0xDBFF)
W2 = 110111yyyyyyyyyy // 0xDC00 + yyyyyyyyyy (low surrogate, 0xDC00-0xDFFF)
Examples:
A U+0041 LATIN CAPITAL LETTER A
0x0041 => 1000001 => 00000000 01000001
😸 U+1F638 GRINNING CAT FACE WITH SMILING EYES
0x1F638 => 1 11110110 00111000 => 11011000 00111101 11011110 00111000
U' = 0000111101 1000111000 // 0x1F638 - 0x10000 = 0xF638
W1 = 11011000 00111101 // 0xD800 + 0000111101
W2 = 11011110 00111000 // 0xDC00 + 1000111000
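The U' / W1 / W2 arithmetic above translates directly into a few lines of PHP. This is a sketch, not a production encoder; toSurrogatePair() is a made-up name, and the cross-check assumes ext/mbstring:

```php
<?php
// Hypothetical helper: split a supplementary-plane code point into its UTF-16 surrogate pair.
function toSurrogatePair(int $codePoint): array
{
    if ($codePoint < 0x10000 || $codePoint > 0x10FFFF) {
        throw new InvalidArgumentException('Only U+10000..U+10FFFF need surrogates');
    }

    $u  = $codePoint - 0x10000;   // U': a 20-bit value
    $w1 = 0xD800 + ($u >> 10);    // high surrogate carries the top 10 bits
    $w2 = 0xDC00 + ($u & 0x3FF);  // low surrogate carries the bottom 10 bits

    return [$w1, $w2];
}

[$w1, $w2] = toSurrogatePair(0x1F638);
printf("%04X %04X\n", $w1, $w2); // D83D DE38

// Cross-check against mbstring's own UTF-16 (big-endian) encoder.
$viaMbstring = strtoupper(bin2hex(mb_convert_encoding("\u{1F638}", 'UTF-16BE', 'UTF-8')));
echo $viaMbstring, PHP_EOL;      // D83DDE38
```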
UTF-8
Variable-length encoding; 1-4 bytes per code point
Codepoint range Byte 1 Byte 2 Byte 3 Byte 4 Notes
U+0000..U+007F 0xxxxxxx Covers first 128 codepoints; ASCII
U+0080..U+07FF 110xxxxx 10xxxxxx (Almost) all Latin, Greek, Cyrillic, Arabic, Hebrew, and more, plus combining diacritical marks
U+0800..U+FFFF 1110xxxx 10xxxxxx 10xxxxxx Rest of the BMP, including most Chinese, Japanese, and Korean characters
U+10000..U+10FFFF 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx All other planes, including historic scripts, mathematical symbols, and emoji.
Trick 1: ASCII === UTF-8 (any ASCII text is already valid UTF-8)
Trick 2: Virtually all languages only need 1, 2, or 3 bytes
Trick 3: First byte tells you the length (count its leading 1 bits)
Trick 4: Self-synchronization (continuation bytes always start with 10)
Trick 5: No 0x00 bytes, except for NUL
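The byte patterns in the table can be turned into a hand-rolled encoder, plus the "first byte tells you the length" trick. This is only an illustrative sketch (in real code, mb_chr() or "\u{...}" does this for you); both function names are made up:

```php
<?php
// Hypothetical helper: encode one code point as UTF-8 following the table's bit patterns.
function utf8Encode(int $cp): string
{
    if ($cp < 0 || $cp > 0x10FFFF || ($cp >= 0xD800 && $cp <= 0xDFFF)) {
        throw new InvalidArgumentException('Not a Unicode scalar value');
    }
    if ($cp <= 0x7F) {   // 0xxxxxxx
        return chr($cp);
    }
    if ($cp <= 0x7FF) {  // 110xxxxx 10xxxxxx
        return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp <= 0xFFFF) { // 1110xxxx 10xxxxxx 10xxxxxx
        return chr(0xE0 | ($cp >> 12))
            . chr(0x80 | (($cp >> 6) & 0x3F))
            . chr(0x80 | ($cp & 0x3F));
    }
    // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return chr(0xF0 | ($cp >> 18))
        . chr(0x80 | (($cp >> 12) & 0x3F))
        . chr(0x80 | (($cp >> 6) & 0x3F))
        . chr(0x80 | ($cp & 0x3F));
}

// Hypothetical helper: sequence length implied by a lead byte (0 if it is not a lead byte).
function utf8SequenceLength(string $firstByte): int
{
    $b = ord($firstByte[0]);
    if ($b < 0x80) {
        return 1;
    }
    if (($b & 0xE0) === 0xC0) {
        return 2;
    }
    if (($b & 0xF0) === 0xE0) {
        return 3;
    }
    if (($b & 0xF8) === 0xF0) {
        return 4;
    }
    return 0; // continuation byte (10xxxxxx) or invalid lead byte
}

echo bin2hex(utf8Encode(0x1F638)), PHP_EOL; // f09f98b8
```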
UTF Encoding Summary
Property UTF-32 UTF-16 UTF-8
Encoding length Fixed Variable Variable
Bytes per code point 4 2 or 4 1-4
Memory-efficient No Somewhat Yes
CPU-efficient Yes Somewhat Somewhat
Self-synchronizing No Yes Yes
Contains null (0x00) bytes Yes Yes No
ASCII-compatible No No Yes
https://commons.wikimedia.org/wiki/File:Utf8webgrowth.svg
Unicode in PHP
Handling Text In Programming Languages
1. Treat text as a sequence of bytes (PHP, C)
$smile = "\xF0\x9F\x98\x80";
echo $smile; // => '😀'
echo strlen($smile); // => 4
2. Treat text as a sequence of Unicode code points (Python 3)
3. Treat text as a sequence of UTF-16 code units (JavaScript, C#)
const smile = '\uD83D\uDE00';
console.log(smile); // => '😀'
console.log(smile.length); // => 2
PHP Strings
Be careful!
● Strings are simply byte sequences
● Encoding-agnostic
● Some (not all) string functions assume fixed-width, 8-bit ASCII encoding
PHP String Functions
Function What It Actually Does
strlen() Counts the length in bytes
str_replace() Replaces bytes
substr() Returns a subset of bytes
strtoupper() Converts alphabetic ASCII bytes to uppercase based on the globally-set locale (locale-insensitive as of PHP 8.2)
Works for ASCII; not entirely safe* for Unicode!
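To see what "counts bytes" and "returns a subset of bytes" mean in practice, here is a short sketch. It assumes ext/mbstring for the mb_* calls used as a comparison; the sample word is arbitrary:

```php
<?php
$word = "na\u{EF}ve"; // "naïve": ï (U+00EF) takes two bytes in UTF-8

$byteLength = strlen($word);             // 6 (bytes)
$charLength = mb_strlen($word, 'UTF-8'); // 5 (code points)

$brokenCut = substr($word, 0, 3);            // "na" plus only the first byte of ï
$safeCut   = mb_substr($word, 0, 3, 'UTF-8'); // "naï"

var_dump(mb_check_encoding($brokenCut, 'UTF-8')); // bool(false): no longer valid UTF-8
var_dump(mb_check_encoding($safeCut, 'UTF-8'));   // bool(true)
```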
ext/mbstring
Provides multibyte-safe string functions
Standard Function mbstring Alternative
strlen() mb_strlen()
str_replace() (none)
substr() mb_substr()
strtoupper() mb_strtoupper()
Tip: Most functions accept an optional parameter to specify the encoding, if known; when omitted, the internal encoding (mb_internal_encoding(), UTF-8 by default) is used.
mb_convert_case(string $string, int $mode, ?string $encoding = null): string
Input $string: Mary had a little lamb
$mode Output
MB_CASE_UPPER MARY HAD A LITTLE LAMB
MB_CASE_LOWER mary had a little lamb
MB_CASE_TITLE Mary Had A Little Lamb
MB_CASE_FOLD mary had a little lamb
Input $string: Ich grüße den Mann (I greet the man)
$mode Output
MB_CASE_UPPER ICH GRÜSSE DEN MANN
MB_CASE_LOWER ich grüße den mann
MB_CASE_TITLE Ich Grüße Den Mann
MB_CASE_FOLD ich grüsse den mann
ext/pcre
Enable UTF-8 support with u modifier: preg_match('/foo/u', $subject)
Match a character with a Unicode property: \p{xx} (37 different codes)
Property Code Matches Example
L Any letter \p{L}
Ll Lower case letter \p{Ll}
Lu Upper case letter \p{Lu}
Lm Modifier letter \p{Lm}
Lt Title case letter \p{Lt}
Lo Other letter \p{Lo}
S Any symbol \p{S}
Sc Currency symbol \p{Sc}
Sk Modifier symbol \p{Sk}
Sm Mathematical symbol \p{Sm}
So Other symbol \p{So}
Match a character with a Unicode script: \p{xxxx} (102 different scripts)
Examples: \p{Greek} or \p{Egyptian_Hieroglyphs}
Match a character without a Unicode property: \P{xx}
Match a Unicode extended grapheme cluster: \X
Think of it like a . but for multiple characters that combine into a single glyph
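A few of these patterns in action. This is a sketch with arbitrary sample strings; it only relies on ext/pcre with the u modifier:

```php
<?php
// \p{Lu}: collect the uppercase letters.
preg_match_all('/\p{Lu}/u', 'Hello World', $upper);
// $upper[0] is ['H', 'W']

// \P{L}: strip everything that is NOT a letter.
$lettersOnly = preg_replace('/\P{L}+/u', '', "abc 123 \u{3A3}!");
// "abc" followed by Σ (U+03A3)

// \p{Greek}: is the whole string written in Greek script?
$isGreek = preg_match('/^\p{Greek}+$/u', "\u{3A3}\u{3B1}") === 1;

// \X: count extended grapheme clusters; e + U+0301 counts as one.
$clusters = preg_match_all('/\X/u', "cafe\u{301}");

echo $clusters, PHP_EOL; // 4
```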
ext/intl - IntlChar class
var_dump(IntlChar::charName('⛄'));
// string(20) "SNOWMAN WITHOUT SNOW"
$name = "RECYCLING SYMBOL FOR TYPE-1 PLASTICS";
var_dump(IntlChar::charFromName($name));
// int(9843)
var_dump(IntlChar::isupper("A"));
// bool(true)
ext/intl - Normalizer class
1. U+01FA - “Precomposed” character (LATIN CAPITAL
LETTER A WITH RING ABOVE AND ACUTE)
2. A + U+030A + U+0301 - A base letter A followed by two
combining marks (U+030A COMBINING RING ABOVE
and U+0301 COMBINING ACUTE ACCENT)
3. U+00C5 + U+0301 - An accented letter (U+00C5 LATIN
CAPITAL LETTER A WITH RING ABOVE) followed by a
combining accent (U+0301 COMBINING ACUTE
ACCENT)
4. U+212B + U+0301 - A compatibility character (U+212B
ANGSTROM SIGN) followed by a combining accent
(U+0301 COMBINING ACUTE ACCENT)
Ǻ
$variations = [
    "\xC7\xBA",                 // 1. U+01FA (precomposed)
    "A" . "\xCC\x8A\xCC\x81",   // 2. A + U+030A + U+0301
    "\xC3\x85\xCC\x81",         // 3. U+00C5 + U+0301
    "\xE2\x84\xAB\xCC\x81",     // 4. U+212B + U+0301
];
foreach ($variations as $str) {
    echo urlencode(Normalizer::normalize($str));
    echo "\n";
}
// %C7%BA
// %C7%BA
// %C7%BA
// %C7%BA
ext/intl - Grapheme Functions
● grapheme_extract()
● grapheme_stripos()
● grapheme_stristr()
● grapheme_strlen()
● grapheme_strpos()
● grapheme_strripos()
● grapheme_strrpos()
● grapheme_strstr()
● grapheme_substr()
$str = "⛄ Cafe\u{301}"; // é written as e + U+0301 COMBINING ACUTE ACCENT
echo strlen($str); // 10
echo mb_strlen($str); // 7
echo grapheme_strlen($str); // 6
ext/iconv - iconv() function to convert encodings
$text = "This is the Euro symbol '€'."; // UTF-8 string
echo iconv("UTF-8", "ISO-8859-1", $text), PHP_EOL;
// Notice: iconv(): Detected an illegal character in input string
echo iconv("UTF-8", "ISO-8859-1//TRANSLIT", $text), PHP_EOL;
// This is the Euro symbol 'EUR'.
echo iconv("UTF-8", "ISO-8859-1//IGNORE", $text), PHP_EOL;
// This is the Euro symbol ''.
PHP Extension Summary
ext/iconv: Convert between encodings
ext/mbstring: Work with multi-byte string encodings like UTF-8
ext/pcre: Special UTF-compatible matching when /u modifier enabled
ext/intl: Work with individual codepoints and graphemes
Fun Tricks & Micro-Optimizations
Disclaimer
Clever hacks and micro-optimizations are usually unnecessary and can be
detrimental to long-term maintenance!
Don’t use these unless you absolutely need them.
Taking Advantage of UTF-Encoded Bytes
PHP string functions can still be used in some cases:
if (str_contains($utf8, '&')) { … }
$trimmed = trim($utf8);
$firstChar = substr($utf32, 0, 4);
Requires solid understanding of UTF encodings and what the functions do
Don’t be clever unless there’s a clear advantage!
Splitting Strings Into Codepoints
mb_str_split($str) - returns array of individual codepoints (PHP 7.4+)
UTF-8 polyfill for older versions: preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY)
(Works for codepoints, not graphemes)
ASCII-Only UTF-8 Strings
Is a UTF-8 string pure ASCII? If so, no need for (slower) mbstring functions:
$isAscii = mb_detect_encoding($str, 'ASCII', true) !== false;
Micro-optimization (2x faster):
$isAscii = strlen($str) === mb_strlen($str);
Speed is fractions of milliseconds; micro-optimization only
important for parsing-heavy applications
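Wrapped up as a function, the length comparison might look like the sketch below. It assumes ext/mbstring and that the input is already valid UTF-8 (for malformed input the two lengths can match by accident); isAscii() is a made-up name:

```php
<?php
// Hypothetical helper: a valid UTF-8 string is pure ASCII when every
// code point takes exactly one byte, i.e. byte count equals code point count.
function isAscii(string $utf8): bool
{
    return strlen($utf8) === mb_strlen($utf8, 'UTF-8');
}

var_dump(isAscii('hello'));        // bool(true)
var_dump(isAscii("h\u{E9}llo"));   // bool(false): é needs two bytes
```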
Writing Silly Code (Seriously, Don’t Do This)
PHP supports Unicode in variable, function, and class names:
class （╯°□°）╯︵┻━┻ extends Exception {}
throw new （╯°□°）╯︵┻━┻;
Uses U+FF08 FULLWIDTH LEFT PARENTHESIS and U+FF09 FULLWIDTH RIGHT PARENTHESIS since normal parens (U+0028/U+0029) are not allowed here.
$👉😎👉 = "Ann Perkins!"; // Parks and Rec reference
$you can use = 'U+2000 EN QUAD whitespace';
Recap & Recommendations
● Unicode supports virtually every known modern and historic writing system
● Codepoints != Glyphs/Graphemes != Encoding
● Use and support UTF-8 everywhere, especially for user input
● PHP strings are just raw bytes
● Use mbstring functions
Questions?
Thank You!
Slides & feedback: https://joind.in/talk/9bdc2
Questions? @colinodell or colinodell@gmail.com

Demystifying Unicode - Longhorn PHP 2021

  • 1.
  • 2.
    Colin O’Dell ● PrincipalEngineer at Unleashed Technologies ● PHP for ~20 years; 13 years professionally ● Creator & maintainer of league/commonmark library ● PHP League leadership team ● Owner of moderngeekware.com ● @colinodell
  • 3.
    Agenda ● A Historyof Encoding Systems ● Unicode Standard ● Unicode Encodings ● Using Unicode in PHP ● Tips & Tricks ● Questions & Answers
  • 4.
    Assumptions ● Some familiaritywith PHP ● Basic understanding of binary and hexadecimal ● Focus on high-level concepts!
  • 5.
  • 6.
  • 7.
    A (Brief) Historyof Encoding Systems
  • 8.
    1837: Morse Code(Internationalized in 1844) “Morse-Vail Telegraph Key” by the National Museum of American History is licenced under CC BY-NC 2.0
  • 9.
  • 10.
  • 11.
    1960s: ASCII ● AmericanStandard Code for Information Interchange ● 7-bit binary encoding ○ 0000000 = 0 ○ ... ○ 1111111 = 127
  • 12.
    0 1 23 4 5 6 7 8 9 A B C D E F 0 NUL SOH STX ETX EOT ENQ ACK BEL BS HT LF VT FF CR SO SI 1 DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC FS GS RS US 2 SPAC E ! " # $ % & ' ( ) * + , - . / 3 0 1 2 3 4 5 6 7 8 9 : ; < = > ? 4 @ A B C D E F G H I J K L M N O 5 P Q R S T U V W X Y Z [ ] ^ _ 6 ` a b c d e f g h i j k l m n o 7 p q r s t u v w x y z { | } ~ DEL
  • 13.
    0 1 23 4 5 6 7 8 9 A B C D E F 0 NUL SOH STX ETX EOT ENQ ACK BEL BS HT LF VT FF CR SO SI 1 DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC FS GS RS US 2 SPAC E ! " # $ % & ' ( ) * + , - . / 3 0 1 2 3 4 5 6 7 8 9 : ; < = > ? 4 @ A B C D E F G H I J K L M N O 5 P Q R S T U V W X Y Z [ ] ^ _ 6 ` a b c d e f g h i j k l m n o 7 p q r s t u v w x y z { | } ~ DEL Character Hex Binary Character Hex Binary LF (line feed) 0x0A 0001010 E 0x45 1000101 3 0x33 0110011 e 0x65 1100101
  • 14.
    0 1 23 4 5 6 7 8 9 A B C D E F 0 NUL SOH STX ETX EOT ENQ ACK BEL BS HT LF VT FF CR SO SI 1 DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC FS GS RS US 2 SPACE ! " # $ % & ' ( ) * + , - . / 3 0 1 2 3 4 5 6 7 8 9 : ; < = > ? 4 @ A B C D E F G H I J K L M N O 5 P Q R S T U V W X Y Z [ ] ^ _ 6 ` a b c d e f g h i j k l m n o 7 p q r s t u v w x y z { | } ~ DEL 00xxxxx 01xxxxx 10xxxxx 11xxxxx 00xxxxx = 32 control codes 01xxxxx = 32 numbers & symbols 10xxxxx = 32 uppercase letters and some extra symbols 11xxxxx = 32 lowercase letters and some extra symbols
  • 15.
    A = 0x41= 1000001 B = 0x42 = 1000010 … Z = 0x5A = 1011010 0 1 2 3 4 5 6 7 8 9 A B C D E F 0 NUL SOH STX ETX EOT ENQ ACK BEL BS HT LF VT FF CR SO SI 1 DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC FS GS RS US 2 SPACE ! " # $ % & ' ( ) * + , - . / 3 0 1 2 3 4 5 6 7 8 9 : ; < = > ? 4 @ A B C D E F G H I J K L M N O 5 P Q R S T U V W X Y Z [ ] ^ _ 6 ` a b c d e f g h i j k l m n o 7 p q r s t u v w x y z { | } ~ DEL 00xxxxx 01xxxxx 10xxxxx 11xxxxx
  • 16.
    A = 0x41= 1000001 B = 0x42 = 1000010 … Z = 0x5A = 1011010 a = 0x61 = 1100001 b = 0x62 = 1100010 … z = 0x7A = 1111010 0 1 2 3 4 5 6 7 8 9 A B C D E F 0 NUL SOH STX ETX EOT ENQ ACK BEL BS HT LF VT FF CR SO SI 1 DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC FS GS RS US 2 SPACE ! " # $ % & ' ( ) * + , - . / 3 0 1 2 3 4 5 6 7 8 9 : ; < = > ? 4 @ A B C D E F G H I J K L M N O 5 P Q R S T U V W X Y Z [ ] ^ _ 6 ` a b c d e f g h i j k l m n o 7 p q r s t u v w x y z { | } ~ DEL 00xxxxx 01xxxxx 10xxxxx 11xxxxx
  • 17.
    But computers use8-bit bytes... ASCII (7 Bits) ??? Start 00000000 10000000 End 01111111 11111111 Count 128 128
  • 18.
    0 1 23 4 5 6 7 8 9 A B C D E F 0 NUL SOH STX ETX EOT ENQ ACK BEL BS HT LF VT FF CR SO SI 1 DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC FS GS RS US 2 SPACE ! " # $ % & ' ( ) * + , - . / 3 0 1 2 3 4 5 6 7 8 9 : ; < = > ? 4 @ A B C D E F G H I J K L M N O 5 P Q R S T U V W X Y Z [ ] ^ _ 6 ` a b c d e f g h i j k l m n o 7 p q r s t u v w x y z { | } ~ DEL 7-bit ASCII
  • 19.
    0 1 23 4 5 6 7 8 9 A B C D E F 0 NUL SOH STX ETX EOT ENQ ACK BEL BS HT LF VT FF CR SO SI 1 DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC FS GS RS US 2 SPACE ! " # $ % & ' ( ) * + , - . / 3 0 1 2 3 4 5 6 7 8 9 : ; < = > ? 4 @ A B C D E F G H I J K L M N O 5 P Q R S T U V W X Y Z [ ] ^ _ 6 ` a b c d e f g h i j k l m n o 7 p q r s t u v w x y z { | } ~ DEL 8 ??? 9 A B C D E F 8-bit “Extended ASCII”
  • 20.
    0 1 23 4 5 6 7 8 9 A B C D E F 0 1 2 SPACE ! " # $ % & ' ( ) * + , - . / 3 0 1 2 3 4 5 6 7 8 9 : ; < = > ? 4 @ A B C D E F G H I J K L M N O 5 P Q R S T U V W X Y Z [ ] ^ _ 6 ` a b c d e f g h i j k l m n o 7 p q r s t u v w x y z { | } ~ 8 9 A NBSP ¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬ SHY ® ¯ B ° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿ C À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï D Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß E à á â ã ä å æ ç è é ê ë ì í î ï F ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ ISO 8859-1
  • 21.
    0 1 23 4 5 6 7 8 9 A B C D E F 0 NUL SOH STX ETX EOT ENQ ACK BEL BS HT LF VT FF CR SO SI 1 DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC FS GS RS US 2 SPACE ! " # $ % & ' ( ) * + , - . / 3 0 1 2 3 4 5 6 7 8 9 : ; < = > ? 4 @ A B C D E F G H I J K L M N O 5 P Q R S T U V W X Y Z [ ] ^ _ 6 ` a b c d e f g h i j k l m n o 7 p q r s t u v w x y z { | } ~ DEL 8 € ‚ ƒ „ … † ‡ ˆ ‰ Š ‹ Œ Ž 9 ‘ ’ “ ” • – — ˜ ™ š › œ ž Ÿ A NBSP ¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬ SHY ® ¯ B ° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿ C À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï D Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß E à á â ã ä å æ ç è é ê ë ì í î ï F ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ Windows-1252
  • 22.
    ISO 8859-2

    Rows 0-1 and 8-9 are blank on the slide; rows 2-7 hold the printable ASCII characters (SPACE through ~).

        0    1    2    3    4    5    6    7    8    9    A    B    C    D    E    F
    A   NBSP Ą    ˘    Ł    ¤    Ľ    Ś    §    ¨    Š    Ş    Ť    Ź    SHY  Ž    Ż
    B   °    ą    ˛    ł    ´    ľ    ś    ˇ    ¸    š    ş    ť    ź    ˝    ž    ż
    C   Ŕ    Á    Â    Ă    Ä    Ĺ    Ć    Ç    Č    É    Ę    Ë    Ě    Í    Î    Ď
    D   Đ    Ń    Ň    Ó    Ô    Ő    Ö    ×    Ř    Ů    Ú    Ű    Ü    Ý    Ţ    ß
    E   ŕ    á    â    ă    ä    ĺ    ć    ç    č    é    ę    ë    ě    í    î    ď
    F   đ    ń    ň    ó    ô    ő    ö    ÷    ř    ů    ú    ű    ü    ý    ţ    ˙
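The ISO 8859-1 and ISO 8859-2 tables disagree about what the upper half of the byte range means. As a small sketch (assuming ext/mbstring is available and that it ships both code pages, which is the usual case), the single byte 0xA3 decodes to two different characters depending on which table the reader assumes:

```php
<?php
// The same byte, interpreted under two different 8-bit code pages.
$byte = "\xA3";

// mb_convert_encoding($string, $to, $from) re-encodes the byte as UTF-8.
$asLatin1 = mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-1'); // POUND SIGN (U+00A3)
$asLatin2 = mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-2'); // LATIN CAPITAL LETTER L WITH STROKE (U+0141)

echo bin2hex($asLatin1), PHP_EOL; // c2a3
echo bin2hex($asLatin2), PHP_EOL; // c581
```

Without out-of-band knowledge of the code page, the receiver has no way to tell which character was meant.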
  • 23.
    Code Page 437 (IBM PC)

    Rows 2-7 match 7-bit ASCII, except that 0x7F (DEL) is shown as ⌂.

        0    1    2    3    4    5    6    7    8    9    A    B    C    D    E    F
    0   NUL  ☺    ☻    ♥    ♦    ♣    ♠    •    ◘    ○    ◙    ♂    ♀    ♪    ♫    ☼
    1   ►    ◄    ↕    ‼    ¶    §    ▬    ↨    ↑    ↓    →    ←    ∟    ↔    ▲    ▼
    8   Ç    ü    é    â    ä    à    å    ç    ê    ë    è    ï    î    ì    Ä    Å
    9   É    æ    Æ    ô    ö    ò    û    ù    ÿ    Ö    Ü    ¢    £    ¥    ₧    ƒ
    A   á    í    ó    ú    ñ    Ñ    ª    º    ¿    ⌐    ¬    ½    ¼    ¡    «    »
    B   ░    ▒    ▓    │    ┤    ╡    ╢    ╖    ╕    ╣    ║    ╗    ╝    ╜    ╛    ┐
    C   └    ┴    ┬    ├    ─    ┼    ╞    ╟    ╚    ╔    ╩    ╦    ╠    ═    ╬    ╧
    D   ╨    ╤    ╥    ╙    ╘    ╒    ╓    ╫    ╪    ┘    ┌    █    ▄    ▌    ▐    ▀
    E   α    ß    Γ    π    Σ    σ    µ    τ    Φ    Θ    Ω    δ    ∞    φ    ε    ∩
    F   ≡    ±    ≥    ≤    ⌠    ⌡    ÷    ≈    °    ∙    ·    √    ⁿ    ²    ■    NBSP
  • 25.
    8-bit “Extended ASCII”
    ● ISO 8859 - 16 variations:
      ○ ISO 8859-1 (“Latin 1”, Western European)
      ○ ISO 8859-2 (“Latin 2”, Central European)
      ○ ISO 8859-3 (“Latin 3”, South European)
      ○ ISO 8859-4 (“Latin 4”, North European)
      ○ ISO 8859-5 (Latin/Cyrillic)
      ○ ISO 8859-6 (Latin/Arabic)
      ○ ISO 8859-7 (Latin/Greek)
      ○ ISO 8859-8 (Latin/Hebrew)
      ○ ISO 8859-9 (“Latin 5”, Turkish)
      ○ ISO 8859-10 (“Latin 6”, Nordic)
      ○ ISO 8859-11 (Latin/Thai)
      ○ ISO 8859-12 (Latin/Devanagari) - abandoned
      ○ ISO 8859-13 (“Latin 7”, Baltic Rim)
      ○ ISO 8859-14 (“Latin 8”, Celtic)
      ○ ISO 8859-15 (“Latin 9”)
        ■ Revision of 8859-1 that swaps out less-used characters and adds the euro currency symbol
      ○ ISO 8859-16 (“Latin 10”, South-Eastern European)
    ● Windows-1252
    ● CP 437 - Original IBM PC
    ● Mac OS Roman character set
    ● TRS-80 character set
    ● Atari’s ATASCII
    ● Commodore’s PETSCII
    ● HP Roman-8 and Roman-9
    ● DEC’s Multinational Character Set
    ● Lotus International Character Set
    ● ECMA-94
  • 28.
    But then along came the Internet...
  • 29.
  • 31.
    “The Unicode Standard is the universal character encoding standard for written characters and text. It defines a consistent way of encoding multilingual text that enables the exchange of text data internationally and creates the foundation for global software”
  • 32.
    Code Points

    Problem: How to accommodate larger character sets without wasting memory?

    Solution: Break the one-to-one correspondence between characters and bits/encoding! Offer different ways to encode based on different needs.
  • 33.
    ASCII vs. Unicode

    ASCII:
    Character   Encoded Bits
    H           01001000 (0x48)
    P           01010000 (0x50)

    Unicode:
    Glyph   Code Point                       Encoded Bits
    P       U+0050 LATIN CAPITAL LETTER P    ????
    H       U+0048 LATIN CAPITAL LETTER H    ????
  • 34.
    Glyph   Code Point                                     Encoded Bits
    P       U+0050 LATIN CAPITAL LETTER P                  ????
    h       U+0068 LATIN SMALL LETTER H                    ????
    Σ       U+03A3 GREEK CAPITAL LETTER SIGMA              ????
    ش       U+0634 ARABIC LETTER SHEEN                     ????
    𝋭       U+1D2ED MAYAN NUMERAL THIRTEEN                 ????
    😸      U+1F638 GRINNING CAT FACE WITH SMILING EYES    ????
    H       U+0048 LATIN CAPITAL LETTER H                  ????
  • 36.
  • 38.
    U+1F604 SMILING FACE WITH OPEN MOUTH AND SMILING EYES
  • 43.
  • 45.
    Recap
    ● Code Point: a number representing a single character*
      ○ 143,859 defined as of Unicode 13.0
      ○ Format: U+hhhhhh
    ● Codespace: the range of numerical values available for encoding characters
      ○ Support for 1,114,112 code points (0x000000 - 0x10FFFF)
    ● Code Planes: contiguous groups of 65,536 (2^16) code points
      ○ 17 planes, numbered 0 - 16, corresponding to the possible values 00-10 (hexadecimal) of the first two positions in the six-position hexadecimal format (U+hhhhhh)
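The relationship between a code point, its plane, and the U+ notation can be sketched in a few lines of PHP. The helper names `codePlane()` and `formatCodePoint()` are made up for this illustration; they are not part of PHP or any extension.

```php
<?php
// Hypothetical helper: which of the 17 planes does a code point belong to?
function codePlane(int $codePoint): int
{
    if ($codePoint < 0 || $codePoint > 0x10FFFF) {
        throw new InvalidArgumentException('Outside the Unicode codespace');
    }

    // Each plane holds 0x10000 (65,536) code points, so the plane number
    // is everything above the low 16 bits.
    return $codePoint >> 16;
}

// Hypothetical helper: format as U+hhhh, using at least four hex digits.
function formatCodePoint(int $codePoint): string
{
    return sprintf('U+%04X', $codePoint);
}

echo formatCodePoint(0x50), ' is in plane ', codePlane(0x50), PHP_EOL;       // U+0050 is in plane 0
echo formatCodePoint(0x1F638), ' is in plane ', codePlane(0x1F638), PHP_EOL; // U+1F638 is in plane 1
```

Seventeen planes of 65,536 code points each give the 1,114,112 total quoted above.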
  • 46.
  • 47.
    Character / Code Point: a U+0061 LATIN SMALL LETTER A
  • 48.
    Character / Code Point: a U+0061 LATIN SMALL LETTER A

    Glyphs: a a a a a a a a (the same character drawn in different typefaces)
  • 49.
    Glyphs and Graphemes

    Glyph / Grapheme   Unicode Character   Code Point
    c                  c                   U+0063 LATIN SMALL LETTER C
    a                  a                   U+0061 LATIN SMALL LETTER A
    f                  f                   U+0066 LATIN SMALL LETTER F
    e                  e                   U+0065 LATIN SMALL LETTER E
  • 50.
    Glyphs and Graphemes: Combining Diacritical Marks

    Glyph / Grapheme   Unicode Character   Code Point
    c                  c                   U+0063 LATIN SMALL LETTER C
    a                  a                   U+0061 LATIN SMALL LETTER A
    f                  f                   U+0066 LATIN SMALL LETTER F
    é                  e                   U+0065 LATIN SMALL LETTER E
                       ◌́                   U+0301 COMBINING ACUTE ACCENT
  • 51.
    Glyphs and Graphemes: Combining Diacritical Marks

    Glyph / Grapheme   Unicode Character   Code Point
    c                  c                   U+0063 LATIN SMALL LETTER C
    a                  a                   U+0061 LATIN SMALL LETTER A
    f                  f                   U+0066 LATIN SMALL LETTER F
    é                  e                   U+0065 LATIN SMALL LETTER E
                       ◌́                   U+0301 COMBINING ACUTE ACCENT

    e + ◌́ = é
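To make the "two code points, one grapheme" idea concrete, here is a small sketch (assuming ext/mbstring and PHP 7+ for the `\u{...}` escape) that compares the precomposed and combining spellings of "café". Both render identically, but they are different strings:

```php
<?php
$precomposed = "caf\u{00E9}";  // é as the single code point U+00E9
$combining   = "cafe\u{0301}"; // e followed by U+0301 COMBINING ACUTE ACCENT

echo mb_strlen($precomposed, 'UTF-8'), PHP_EOL; // 4 code points
echo mb_strlen($combining, 'UTF-8'), PHP_EOL;   // 5 code points

echo strlen($precomposed), PHP_EOL; // 5 bytes in UTF-8
echo strlen($combining), PHP_EOL;   // 6 bytes in UTF-8

var_dump($precomposed === $combining); // bool(false)
```

A byte-wise comparison treats the two spellings as unequal, which is why normalization matters when comparing user input.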
  • 52.
    Glyphs and Graphemes:Combining Diacritical Marks Z̷̧̨̰̋Å̸̮͉ ̵͉̣̄̇̀ L̵͉̣̄̇̀G ̸̮͉̊ O ̸̱͒̓ ̷̧̨̰̋Ț͝E̪̘̗̓͝X̪̘̗T ̸̰̺̝̍̈
  • 53.
    Glyphs and Graphemes: Variation Selectors

    Glyph / Grapheme: ✈ (text style)
    Code points:
      U+2708 AIRPLANE
      U+FE0E VARIATION SELECTOR 15 (VS 15, text style)
  • 54.
    Glyphs and Graphemes: Variation Selectors

    Glyph / Grapheme: ✈ (text style)
    Code points:
      U+2708 AIRPLANE
      U+FE0E VARIATION SELECTOR 15 (VS 15, text style)

    Glyph / Grapheme: ✈ (emoji style)
    Code points:
      U+2708 AIRPLANE
      U+FE0F VARIATION SELECTOR 16 (VS 16, emoji style)
  • 55.
    Glyphs and Graphemes: Regional Indicator Symbols

    Glyph / Grapheme: 🇺🇸
    Code points:
      U+1F1FA REGIONAL INDICATOR SYMBOL LETTER U
      U+1F1F8 REGIONAL INDICATOR SYMBOL LETTER S

    Glyph / Grapheme: 🇨🇦
    Code points:
      U+1F1E8 REGIONAL INDICATOR SYMBOL LETTER C
      U+1F1E6 REGIONAL INDICATOR SYMBOL LETTER A
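The regional indicator symbols sit in alphabetical order starting at U+1F1E6 (letter A), so a flag sequence can be derived from a two-letter country code with simple arithmetic. The sketch below assumes ext/mbstring (for `mb_chr()`, PHP 7.2+); `flagEmoji()` is a hypothetical helper name, and whether the result renders as a flag depends on the font and platform.

```php
<?php
// Hypothetical helper: build a regional-indicator pair from a two-letter code.
function flagEmoji(string $countryCode): string
{
    if (preg_match('/^[A-Za-z]{2}$/', $countryCode) !== 1) {
        throw new InvalidArgumentException('Expected exactly two ASCII letters');
    }

    $flag = '';
    foreach (str_split(strtoupper($countryCode)) as $letter) {
        // 'A' maps to U+1F1E6, 'B' to U+1F1E7, and so on.
        $flag .= mb_chr(0x1F1E6 + ord($letter) - ord('A'), 'UTF-8');
    }

    return $flag;
}

echo flagEmoji('US'), ' ', flagEmoji('ca'), PHP_EOL;
```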
  • 56.
    Glyphs and Graphemes: Modifiers

    Glyph / Grapheme: 👋🏼
    Code points:
      U+1F44B WAVING HAND SIGN
      U+1F3FC EMOJI MODIFIER FITZPATRICK TYPE-3

    Glyph / Grapheme: 👋🏾
    Code points:
      U+1F44B WAVING HAND SIGN
      U+1F3FE EMOJI MODIFIER FITZPATRICK TYPE-5
  • 57.
    Glyphs and Graphemes: ZWJ Sequences

    Glyph / Grapheme   Unicode Character   Code Point
    👨                 👨                  U+1F468 MAN
    👩                 👩                  U+1F469 WOMAN
    👶                 👶                  U+1F476 BABY
    👧                 👧                  U+1F467 GIRL
  • 58.
    Glyphs and Graphemes: ZWJ Sequences

    One glyph / grapheme, built from seven code points:
      U+1F468 MAN
      U+200D ZERO WIDTH JOINER (ZWJ)
      U+1F469 WOMAN
      U+200D ZERO WIDTH JOINER (ZWJ)
      U+1F476 BABY
      U+200D ZERO WIDTH JOINER (ZWJ)
      U+1F467 GIRL
  • 59.
    Glyphs and Graphemes: ZWJ Sequences
  • 60.
    Glyphs and Graphemes: ZWJ Sequences
  • 61.
    Glyphs and Graphemes: ZWJ Sequences

    One glyph / grapheme (male construction worker):
      U+1F477 CONSTRUCTION WORKER
      U+200D ZERO WIDTH JOINER (ZWJ)
      U+2642 MALE SIGN

    One glyph / grapheme (female construction worker):
      U+1F477 CONSTRUCTION WORKER
      U+200D ZERO WIDTH JOINER (ZWJ)
      U+2640 FEMALE SIGN
  • 62.
    Glyphs and Graphemes: ZWJ Sequences

    One glyph / grapheme, combining a modifier with a ZWJ sequence:
      U+1F477 CONSTRUCTION WORKER
      U+1F3FE EMOJI MODIFIER FITZPATRICK TYPE-5
      U+200D ZERO WIDTH JOINER (ZWJ)
      U+2640 FEMALE SIGN
  • 63.
  • 64.
  • 65.
    Glyph   Code Point                                     Encoded Bits
    P       U+0050 LATIN CAPITAL LETTER P                  ????
    h       U+0068 LATIN SMALL LETTER H                    ????
    Σ       U+03A3 GREEK CAPITAL LETTER SIGMA              ????
    ش       U+0634 ARABIC LETTER SHEEN                     ????
    𝋭       U+1D2ED MAYAN NUMERAL THIRTEEN                 ????
    😸      U+1F638 GRINNING CAT FACE WITH SMILING EYES    ????
    H       U+0048 LATIN CAPITAL LETTER H                  ????
  • 66.
    Encoding Schemes
    ● Most popular:
      ○ UTF-8
      ○ UTF-16
      ○ UTF-32
  • 67.
    UTF-32

    Fixed-length encoding; 4 bytes per code point

    Code point range                    Unicode scalar value (binary)   Encoded bytes
    U+0000..U+D7FF, U+E000..U+10FFFF    xxxxxxxxxxxxxxxxxxxxx           00000000 000xxxxx xxxxxxxx xxxxxxxx
  • 68.
    UTF-32

    Fixed-length encoding; 4 bytes per code point

    Code point range                    Unicode scalar value (binary)   Encoded bytes
    U+0000..U+D7FF, U+E000..U+10FFFF    xxxxxxxxxxxxxxxxxxxxx           00000000 000xxxxx xxxxxxxx xxxxxxxx

    Examples:

    A   U+0041 LATIN CAPITAL LETTER A
        0x0041 => 1000001
        Encoded bytes: 00000000 00000000 00000000 01000001

    😸  U+1F638 GRINNING CAT FACE WITH SMILING EYES
        0x1F638 => 11111011000111000
        Encoded bytes: 00000000 00000001 11110110 00111000
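Because UTF-32 is just the code point written as a 32-bit integer, the two examples above can be reproduced with PHP's `pack()`. This is a sketch that assumes big-endian byte order (UTF-32BE), matching the byte layout shown on the slide; `utf32be()` is a hypothetical helper name.

```php
<?php
// Hypothetical helper: encode one code point as four big-endian bytes.
function utf32be(int $codePoint): string
{
    $isSurrogate = $codePoint >= 0xD800 && $codePoint <= 0xDFFF;
    if ($codePoint < 0 || $codePoint > 0x10FFFF || $isSurrogate) {
        throw new InvalidArgumentException('Not a Unicode scalar value');
    }

    // 'N' packs an unsigned 32-bit integer in big-endian order.
    return pack('N', $codePoint);
}

echo bin2hex(utf32be(0x0041)), PHP_EOL;  // 00000041
echo bin2hex(utf32be(0x1F638)), PHP_EOL; // 0001f638
```

The leading zero bytes in both results are the memory cost, and the source of the embedded null bytes, mentioned for UTF-32.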
  • 69.
    UTF-16

    Variable-length encoding; 2 or 4 bytes per character

    Code point range                                             Unicode scalar value (binary)   Encoded bytes
    U+0000..U+D7FF, U+E000..U+FFFF (Basic Multilingual Plane)    00000 xxxxxxxx xxxxxxxx         xxxxxxxx xxxxxxxx
  • 70.
    UTF-16

    Variable-length encoding; 2 or 4 bytes per character

    Example (Basic Multilingual Plane, 2 bytes):

    A   U+0041 LATIN CAPITAL LETTER A
        0x0041 => 1000001
        Encoded bytes: 00000000 01000001
  • 71.
    UTF-16

    Variable-length encoding; 2 or 4 bytes per character

    Code point range                                             Unicode scalar value (binary)   Encoded bytes
    U+0000..U+D7FF, U+E000..U+FFFF (Basic Multilingual Plane)    00000 xxxxxxxx xxxxxxxx         xxxxxxxx xxxxxxxx
    U+010000..U+10FFFF (Supplementary Planes)                    uuuuu uuuuuuuu uuuuuuuu         110110xx xxxxxxxx 110111yy yyyyyyyy
  • 72.
    UTF-16

    Variable-length encoding; 2 or 4 bytes per character

    Supplementary planes (U+010000..U+10FFFF) are encoded as two 16-bit code units, W1 and W2:

    U' = xxxxxxxxxxyyyyyyyyyy   // U - 0x10000
    W1 = 110110xxxxxxxxxx       // 0xD800 + xxxxxxxxxx
    W2 = 110111yyyyyyyyyy       // 0xDC00 + yyyyyyyyyy
  • 73.
    UTF-16

    Variable-length encoding; 2 or 4 bytes per character

    U' = xxxxxxxxxxyyyyyyyyyy   // U - 0x10000
    W1 = 110110xxxxxxxxxx       // 0xD800 + xxxxxxxxxx; always in the range 0xD800-0xDBFF
    W2 = 110111yyyyyyyyyy       // 0xDC00 + yyyyyyyyyy; always in the range 0xDC00-0xDFFF
  • 74.
    UTF-16

    Variable-length encoding; 2 or 4 bytes per character

    Example (Supplementary Planes, 4 bytes):

    😸  U+1F638 GRINNING CAT FACE WITH SMILING EYES
        0x1F638 => 1 11110110 00111000

    U' = 0000 11110110 00111000   // 0x1F638 - 0x10000 = 0xF638, padded to 20 bits
    W1 = 11011000 00111101        // 0xD800 + 0000111101 = 0xD83D
    W2 = 11011110 00111000        // 0xDC00 + 1000111000 = 0xDE38

    Encoded bytes: 11011000 00111101 11011110 00111000
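The U' / W1 / W2 arithmetic above translates almost line for line into code. The following is a minimal sketch, not a production encoder; `utf16CodeUnits()` is a hypothetical helper that returns the 16-bit code units as integers rather than bytes, so byte order is left out of the picture.

```php
<?php
// Hypothetical helper: return the UTF-16 code unit(s) for one code point.
function utf16CodeUnits(int $codePoint): array
{
    $isSurrogate = $codePoint >= 0xD800 && $codePoint <= 0xDFFF;
    if ($codePoint < 0 || $codePoint > 0x10FFFF || $isSurrogate) {
        throw new InvalidArgumentException('Not a Unicode scalar value');
    }

    // Basic Multilingual Plane: the code point is its own single code unit.
    if ($codePoint < 0x10000) {
        return [$codePoint];
    }

    // Supplementary planes: split the 20-bit offset into two 10-bit halves.
    $offset = $codePoint - 0x10000;    // U'
    $high = 0xD800 + ($offset >> 10);  // W1: top 10 bits
    $low = 0xDC00 + ($offset & 0x3FF); // W2: bottom 10 bits

    return [$high, $low];
}

vprintf("%04X %04X\n", utf16CodeUnits(0x1F638)); // D83D DE38
```

The result matches the W1 and W2 values worked out by hand above.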
  • 75.
    UTF-8

    Variable-length encoding; 1-4 bytes per code point

    Code point range     Byte 1     Byte 2     Byte 3     Byte 4     Notes
    U+0000..U+007F       0xxxxxxx                                    Covers first 128 code points; ASCII
    U+0080..U+07FF       110xxxxx   10xxxxxx                         (Almost) all Latin, Greek, Cyrillic, Arabic, Hebrew, and more, plus combining diacritical marks
    U+0800..U+FFFF       1110xxxx   10xxxxxx   10xxxxxx              Rest of the BMP, including most Chinese, Japanese, and Korean characters
    U+10000..U+10FFFF    11110xxx   10xxxxxx   10xxxxxx   10xxxxxx   All other planes, including historic scripts, mathematical symbols, and emoji
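The four rows of the UTF-8 table map directly onto four branches of an encoder. The sketch below is for illustration only (in real code, `mb_chr()` or a `"\u{...}"` string literal does the same job); `utf8Encode()` is a hypothetical helper name.

```php
<?php
// Hypothetical helper: encode one code point as UTF-8 bytes by hand.
function utf8Encode(int $cp): string
{
    $isSurrogate = $cp >= 0xD800 && $cp <= 0xDFFF;
    if ($cp < 0 || $cp > 0x10FFFF || $isSurrogate) {
        throw new InvalidArgumentException('Not a Unicode scalar value');
    }

    if ($cp <= 0x7F) { // 0xxxxxxx
        return chr($cp);
    }
    if ($cp <= 0x7FF) { // 110xxxxx 10xxxxxx
        return chr(0xC0 | ($cp >> 6))
            . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp <= 0xFFFF) { // 1110xxxx 10xxxxxx 10xxxxxx
        return chr(0xE0 | ($cp >> 12))
            . chr(0x80 | (($cp >> 6) & 0x3F))
            . chr(0x80 | ($cp & 0x3F));
    }

    // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return chr(0xF0 | ($cp >> 18))
        . chr(0x80 | (($cp >> 12) & 0x3F))
        . chr(0x80 | (($cp >> 6) & 0x3F))
        . chr(0x80 | ($cp & 0x3F));
}

echo bin2hex(utf8Encode(0x50)), PHP_EOL;    // 50
echo bin2hex(utf8Encode(0x3A3)), PHP_EOL;   // cea3
echo bin2hex(utf8Encode(0x1F638)), PHP_EOL; // f09f98b8
```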
  • 76.
    UTF-8 Trick 1: ASCII === UTF-8

    U+0000..U+007F are encoded as a single byte (0xxxxxxx), exactly as in ASCII, so every ASCII document is already valid UTF-8.
  • 77.
    UTF-8 Trick 2: Virtually all languages only need 1, 2, or 3 bytes

    The 1-, 2-, and 3-byte forms cover the entire BMP (U+0000..U+FFFF); only the other planes (historic scripts, mathematical symbols, emoji) need 4 bytes.
  • 78.
    UTF-8 Trick 3: First byte tells you the length

    First byte   Sequence length
    0xxxxxxx     1 byte
    110xxxxx     2 bytes
    1110xxxx     3 bytes
    11110xxx     4 bytes
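Reading the length from the first byte comes down to testing its leading bits. A minimal sketch, with `utf8SequenceLength()` as a hypothetical helper name; it does not validate the bytes that follow.

```php
<?php
// Hypothetical helper: how many bytes does a sequence starting with
// this byte occupy? Returns 0 if the byte cannot start a sequence.
function utf8SequenceLength(int $firstByte): int
{
    if (($firstByte & 0x80) === 0x00) { // 0xxxxxxx
        return 1;
    }
    if (($firstByte & 0xE0) === 0xC0) { // 110xxxxx
        return 2;
    }
    if (($firstByte & 0xF0) === 0xE0) { // 1110xxxx
        return 3;
    }
    if (($firstByte & 0xF8) === 0xF0) { // 11110xxx
        return 4;
    }

    return 0; // 10xxxxxx continuation byte, or an invalid lead byte
}

$cat = "\u{1F638}"; // 😸
echo utf8SequenceLength(ord($cat[0])), PHP_EOL; // 4
```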
  • 79.
    UTF-8 Trick 4: Self-synchronization

    Continuation bytes always start with 10 and first bytes never do, so a decoder that lands in the middle of a sequence can skip forward to the next first byte and resume.
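Self-synchronization means a reader dropped at an arbitrary byte offset can find the next character boundary without rescanning from the start. A small sketch of that idea, where `utf8NextBoundary()` is a hypothetical helper name:

```php
<?php
// Hypothetical helper: starting at $offset, skip any continuation bytes
// (10xxxxxx) and return the offset of the next byte that starts a sequence.
function utf8NextBoundary(string $bytes, int $offset): int
{
    $length = strlen($bytes);
    while ($offset < $length && (ord($bytes[$offset]) & 0xC0) === 0x80) {
        $offset++;
    }

    return $offset;
}

$text = "a\u{1F638}b"; // bytes: 61 | F0 9F 98 B8 | 62

// Offset 2 is in the middle of the 4-byte emoji; the next boundary is 'b'.
echo utf8NextBoundary($text, 2), PHP_EOL; // 5
```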
  • 80.
    UTF-8 Trick 5: No 0x00 bytes, except for NUL

    Every byte of a multi-byte sequence has its high bit set, so a 0x00 byte can only ever mean U+0000 (NUL).
  • 81.
    UTF Encoding Summary

                                 UTF-32                   UTF-16                        UTF-8
    Encoding length              Fixed                    Variable                      Variable
                                 4 bytes per code point   2 or 4 bytes per code point   1-4 bytes per code point
    Memory-efficient             No                       Somewhat                      Yes
    CPU-efficient                Yes                      Somewhat                      Somewhat
    Self-synchronizing           No                       Yes                           Yes
    Contains null (0x00) bytes   Yes                      Yes                           No
    ASCII-compatible             No                       No                            Yes
  • 82.
  • 83.
  • 84.
    Handling Text In Programming Languages

    1. Treat text as a sequence of bytes (PHP, C)

       $smile = "\xF0\x9F\x98\x80";
       echo $smile;          // => '😀'
       echo strlen($smile);  // => 4

    2. Treat text as a sequence of Unicode code points (Python 3)

    3. Treat text as a sequence of UTF-16 code units (JavaScript, C#)

       const smile = '\uD83D\uDE00';
       console.log(smile);         // => '😀'
       console.log(smile.length);  // => 2
  • 85.
    PHP Strings

    Be careful!
    ● Strings are simply byte sequences
    ● Encoding-agnostic
    ● Some (not all) string functions assume fixed-width, 8-bit ASCII encoding
  • 86.
    PHP String Functions

    Function        What It Actually Does
    strlen()        Counts the length in bytes
    str_replace()   Replaces bytes
    substr()        Returns a subset of bytes
    strtoupper()    Converts alphabetic ASCII bytes to uppercase based on globally-set locale

    Works for ASCII; not entirely safe* for Unicode!
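A short sketch of what "not entirely safe" looks like in practice, assuming ext/mbstring is installed: the byte-oriented `substr()` can cut a multi-byte character in half and leave invalid UTF-8 behind.

```php
<?php
$word = "\u{03A3}igma"; // "Σigma"; Σ takes two bytes in UTF-8

echo strlen($word), PHP_EOL;             // 6 (bytes)
echo mb_strlen($word, 'UTF-8'), PHP_EOL; // 5 (code points)

// substr() returns only the first byte of the two-byte Σ.
$broken = substr($word, 0, 1);
var_dump(mb_check_encoding($broken, 'UTF-8')); // bool(false)

// mb_substr() counts in code points and keeps the character intact.
$intact = mb_substr($word, 0, 1, 'UTF-8');
var_dump($intact === "\u{03A3}"); // bool(true)
```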
  • 87.
    ext/mbstring

    Provides multibyte-safe string functions

    Standard Function   mbstring Alternative
    strlen()            mb_strlen()
    str_replace()       (none)
    substr()            mb_substr()
    strtoupper()        mb_strtoupper()

    Tip: All functions accept an optional parameter to specify the encoding, if known; otherwise the internal encoding is used (see mb_internal_encoding(); UTF-8 by default in current PHP versions).
  • 88.
    ext/mbstring

    Provides multibyte-safe string functions

    mb_convert_case(string $string, int $mode, ?string $encoding = null): string

    Input $string: Mary had a little lamb

    $mode           Output
    MB_CASE_UPPER   MARY HAD A LITTLE LAMB
    MB_CASE_LOWER   mary had a little lamb
    MB_CASE_TITLE   Mary Had A Little Lamb
    MB_CASE_FOLD    mary had a little lamb
  • 89.
    ext/mbstring

    Provides multibyte-safe string functions

    mb_convert_case(string $string, int $mode, ?string $encoding = null): string

    Input $string: Ich grüße den Mann (I greet the man)

    $mode           Output
    MB_CASE_UPPER   ICH GRÜSSE DEN MANN
    MB_CASE_LOWER   ich grüße den mann
    MB_CASE_TITLE   Ich Grüße Den Mann
    MB_CASE_FOLD    ich grüsse den mann
  • 90.
    ext/pcre

    Enable UTF-8 support with the u modifier: preg_match('/foo/u', $subject)
    Match a character with a Unicode property: \p{xx} (37 different codes)

    Property Code   Matches               Example
    L               Any letter            \p{L}
    Ll              Lower case letter     \p{Ll}
    Lu              Upper case letter     \p{Lu}
    Lm              Modifier letter       \p{Lm}
    Lt              Title case letter     \p{Lt}
    Lo              Other letter          \p{Lo}

    Property Code   Matches               Example
    S               Any symbol            \p{S}
    Sc              Currency symbol       \p{Sc}
    Sk              Modifier symbol       \p{Sk}
    Sm              Mathematical symbol   \p{Sm}
    So              Other symbol          \p{So}
  • 91.
    ext/pcre

    Enable UTF-8 support with the u modifier: preg_match('/foo/u', $subject)
    Match a character with a Unicode property: \p{xx} (37 different codes)
    Match a character with a Unicode script: \p{xxxx} (102 different scripts)
      Examples: \p{Greek} or \p{Egyptian_Hieroglyphs}
  • 92.
    ext/pcre

    Enable UTF-8 support with the u modifier: preg_match('/foo/u', $subject)
    Match a character with a Unicode property: \p{xx} (37 different codes)
    Match a character with a Unicode script: \p{xxxx} (102 different scripts)
    Match a character without a Unicode property: \P{xx}
  • 93.
    ext/pcre

    Enable UTF-8 support with the u modifier: preg_match('/foo/u', $subject)
    Match a character with a Unicode property: \p{xx} (37 different codes)
    Match a character with a Unicode script: \p{xxxx} (102 different scripts)
    Match a character without a Unicode property: \P{xx}
    Match a Unicode extended grapheme cluster: \X
      Think of it like a . but for multiple characters that combine into a single glyph
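A brief sketch of the property and grapheme-cluster escapes in use. The sample strings are made up for illustration, and the example assumes the bundled PCRE was built with Unicode support, which is the norm for PHP.

```php
<?php
// "Ünïcode Σ", written with escapes so the code points are unambiguous.
$subject = "\u{00DC}n\u{00EF}code \u{03A3}";

// \p{Lu} matches any upper case letter, not just A-Z.
preg_match_all('/\p{Lu}/u', $subject, $upper);
echo implode(' ', $upper[0]), PHP_EOL; // Ü Σ

// \X matches one extended grapheme cluster: here, "e" plus its combining accent.
$cafe = "cafe\u{0301}";
preg_match_all('/\X/u', $cafe, $clusters);
echo count($clusters[0]), PHP_EOL; // 4
```

Without the u modifier, the pattern and subject are treated as raw bytes and these escapes would not see whole characters.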
  • 94.
    ext/intl - IntlChar class

    var_dump(IntlChar::charName('⛄'));
    // string(20) "SNOWMAN WITHOUT SNOW"

    $name = "RECYCLING SYMBOL FOR TYPE-1 PLASTICS";
    var_dump(IntlChar::charFromName($name));
    // int(9843)

    var_dump(IntlChar::isupper("A"));
    // bool(true)
  • 95.
    ext/intl - Normalizer class

    Ǻ can be represented in four ways:

    1. U+01FA - “Precomposed” character (LATIN CAPITAL LETTER A WITH RING ABOVE AND ACUTE)
    2. A + U+030A + U+0301 - A base letter A followed by two combining marks (U+030A COMBINING RING ABOVE and U+0301 COMBINING ACUTE ACCENT)
    3. U+00C5 + U+0301 - An accented letter (U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE) followed by a combining accent (U+0301 COMBINING ACUTE ACCENT)
    4. U+212B + U+0301 - A compatibility character (U+212B ANGSTROM SIGN) followed by a combining accent (U+0301 COMBINING ACUTE ACCENT)
  • 96.
    // Four byte sequences that all render as Ǻ
    $variations = [
        "\xC7\xBA",
        "A" . "\xCC\x8A\xCC\x81",
        "\xC3\x85\xCC\x81",
        "\xE2\x84\xAB\xCC\x81",
    ];
  • 97.
    // Four byte sequences that all render as Ǻ
    $variations = [
        "\xC7\xBA",
        "A" . "\xCC\x8A\xCC\x81",
        "\xC3\x85\xCC\x81",
        "\xE2\x84\xAB\xCC\x81",
    ];

    foreach ($variations as $str) {
        echo urlencode(Normalizer::normalize($str));
        echo "\n";
    }
  • 98.
    // Four byte sequences that all render as Ǻ
    $variations = [
        "\xC7\xBA",
        "A" . "\xCC\x8A\xCC\x81",
        "\xC3\x85\xCC\x81",
        "\xE2\x84\xAB\xCC\x81",
    ];

    foreach ($variations as $str) {
        echo urlencode(Normalizer::normalize($str));
        echo "\n";
    }

    // %C7%BA
    // %C7%BA
    // %C7%BA
    // %C7%BA
  • 99.
    ext/intl - Grapheme Functions

    grapheme_extract()    grapheme_stripos()    grapheme_stristr()
    grapheme_strlen()     grapheme_strpos()     grapheme_strripos()
    grapheme_strrpos()    grapheme_strstr()     grapheme_substr()

    $str = "⛄ Cafe\u{0301}";    // é written as e + U+0301 COMBINING ACUTE ACCENT
    echo strlen($str);          // 10
    echo mb_strlen($str);       // 7
    echo grapheme_strlen($str); // 6
  • 100.
    ext/iconv - iconv() function to convert encodings

    $text = "This is the Euro symbol '€'."; // UTF-8 string
  • 101.
    ext/iconv - iconv() function to convert encodings

    $text = "This is the Euro symbol '€'."; // UTF-8 string

    echo iconv("UTF-8", "ISO-8859-1", $text), PHP_EOL;
    // Notice: iconv(): Detected an illegal character in input string
  • 102.
    ext/iconv - iconv() function to convert encodings

    $text = "This is the Euro symbol '€'."; // UTF-8 string

    echo iconv("UTF-8", "ISO-8859-1", $text), PHP_EOL;
    // Notice: iconv(): Detected an illegal character in input string

    echo iconv("UTF-8", "ISO-8859-1//TRANSLIT", $text), PHP_EOL;
    // This is the Euro symbol 'EUR'.

    echo iconv("UTF-8", "ISO-8859-1//IGNORE", $text), PHP_EOL;
    // This is the Euro symbol ''.
  • 103.
    PHP Extension Summary

    ext/iconv: Convert between encodings
    ext/mbstring: Work with multi-byte string encodings like UTF-8
    ext/pcre: Special UTF-compatible matching when /u modifier enabled
    ext/intl: Work with individual code points and graphemes
  • 104.
    Fun Tricks & Micro-Optimizations
  • 105.
    Disclaimer

    Clever hacks and micro-optimizations are usually unnecessary and can be detrimental to long-term maintenance! Don’t use these unless you absolutely need them.
  • 106.
    Taking Advantage of UTF-Encoded Bytes

    PHP string functions can still be used in some cases:

    if (str_contains($utf8, '&')) { … }
    $trimmed = trim($utf8);
    $firstChar = substr($utf32, 0, 4);

    Requires solid understanding of UTF encodings and what the functions do
    Don’t be clever unless there’s a clear advantage!
  • 107.
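    Splitting Strings Into Code Points

    mb_str_split($str) - returns array of individual code points (PHP 7.4+)

    UTF-8 polyfill for older versions: preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY)
    (the flag drops the empty strings that preg_split() would otherwise return at the start and end)

    (Works for code points, not graphemes)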
  • 108.
    ASCII-Only UTF-8 Strings

    Is a UTF-8 string pure ASCII? If so, no need for (slower) mbstring functions:

    $isAscii = mb_detect_encoding($str, 'ASCII', true) !== false;

    Micro-optimization (2x faster):

    $isAscii = strlen($str) === mb_strlen($str);

    The difference is a fraction of a millisecond; the micro-optimization only matters for parsing-heavy applications
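For environments where ext/mbstring may not be loaded, the same check can be sketched with core functions only. This is an alternative to the two approaches above, not what the slide uses; `isAscii()` is a hypothetical helper name, and no claim is made here about its relative speed.

```php
<?php
// Hypothetical helper: true if the string contains only 7-bit ASCII bytes.
function isAscii(string $str): bool
{
    // Without the u modifier PCRE works on raw bytes, so this looks for
    // any byte with the high bit set (0x80-0xFF).
    return preg_match('/[\x80-\xFF]/', $str) === 0;
}

var_dump(isAscii('Hello, world!')); // bool(true)
var_dump(isAscii("caf\u{00E9}"));   // bool(false)
```

This works because, in UTF-8, every byte of a non-ASCII character has its high bit set.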
  • 109.
    Writing Silly Code

    PHP supports Unicode in variable and function names:

    class （╯°□°）╯︵┻━┻ extends Exception {}
    throw new （╯°□°）╯︵┻━┻;
  • 110.
    Writing Silly Code

    PHP supports Unicode in variable and function names:

    class （╯°□°）╯︵┻━┻ extends Exception {}
    throw new （╯°□°）╯︵┻━┻;

    Uses U+FF08 FULLWIDTH LEFT PARENTHESIS and U+FF09 FULLWIDTH RIGHT PARENTHESIS since normal parens (U+0028/U+0029) are not allowed here.
  • 111.
    Writing Silly Code (Don’t Do This)

    PHP supports Unicode in variable and function names:

    class （╯°□°）╯︵┻━┻ extends Exception {}
    throw new （╯°□°）╯︵┻━┻;

    $👉😎👉 = "Ann Perkins!"; // Parks and Rec reference
  • 112.
    Writing Silly Code (Seriously, Don’t Do This)

    PHP supports Unicode in variable and function names:

    class （╯°□°）╯︵┻━┻ extends Exception {}
    throw new （╯°□°）╯︵┻━┻;

    $👉😎👉 = "Ann Perkins!"; // Parks and Rec reference

    $you can use = 'U+2000 EN QUAD whitespace'; // the gaps in the variable name are U+2000, not ordinary spaces
  • 113.
  • 114.
    Recap & Recommendations
    ● Unicode supports virtually every known modern and historic writing system
    ● Code points != Glyphs/Graphemes != Encoding
    ● Use and support UTF-8 everywhere, especially for user input
    ● PHP strings are just raw bytes
    ● Use mbstring functions
  • 115.
  • 116.
    Thank You!

    Slides & feedback: https://joind.in/talk/9bdc2
    Questions? @colinodell or colinodell@gmail.com

Editor's Notes

  • #4 Questions as we go? Raise hand
  • #8 Converts characters into electrical signals
  • #9 Standardized in 1865
  • #10 Simple device Type a key, sends some numbers, same letter comes out the other side
  • #11 But there needs to be a standard
  • #12 Developed in 1960s for teleprinters (“Teletype”) and early computers 7-bit: each letter you type in gets converted into 7 bits
  • #13 Support for: Upper and lowercase letters Numbers Basic, common symbols More control codes (CR, LF, BS, HT, BEL) (next for examples)
  • #14 (how to encode/decode)
  • #15 Something really clever going on here Group by first two bits 4 “pages” or sections, 32 chars each
  • #16 Letters in alphabetical order, starting at 1 (not random)
  • #17 Even more clever - converting between upper and lowercase by changing one bit
  • #20 “Extended ASCII” sounds like a standard, but it’s not
  • #21 AKA Latin 1 for the Americas, Western Europe, Oceania, and much of Africa
  • #22 Superset/extension of ISO 8859-1 Adds curly quotation marks De-facto standard for Windows
  • #23 Aka Latin 2 for Central or Eastern European Languages
  • #24 UI graphics, science, and math Standard EGA VGA encoding on gfx cards
  • #27 That’s a lot! However,
  • #28 In practice, most users only used one standard locally. Which was fine...
  • #30 Standards proliferation
  • #33 (Problem) You could add more bits, but that wasted computing resources (which were scarce at the time) for users who only needed Latin or ASCII-like characters
  • #35 ATTN: 4 vs 5 char convention
  • #44 Support for 1,114,112 code points (0x000000 - 0x10FFFF). Code Planes: contiguous groups of 65,536 (2^16) code points. There are 17 planes, identified by the numbers 0 to 16, which correspond to the possible values 00-10 (hexadecimal) of the first two positions in the six-position hexadecimal format (U+hhhhhh). Codespace: entire range of numerical values available for encoding characters
  • #45 Code Planes: contiguous groups of 65,536 (2^16) code points. There are 17 planes, identified by the numbers 0 to 16, which correspond to the possible values 00-10 (hexadecimal) of the first two positions in the six-position hexadecimal format (U+hhhhhh). Codespace: entire range of numerical values available for encoding characters. Support for 1,114,112 code points (0x000000 - 0x10FFFF)
  • #48 Unicode does not specify how the character / code point should be displayed (or encoded)!
  • #49 Unicode does not specify how the character / code point should be displayed (or encoded)!
  • #51 Combining Diacritical Marks
  • #52 In this example: 5 code points but 4 graphemes GRAPHEME = smallest unit of a writing system Think about putting cursor in this text and selecting something or pressing backspace
  • #53 “Zalgo text” or “glitch text”
  • #54 Combining Diacritical Marks
  • #55 Combining Diacritical Marks
  • #56 Combining Diacritical Marks
  • #57 Combining Diacritical Marks
  • #58 Combining Diacritical Marks
  • #59 Windows supports 52,000 family combinations
  • #60 Windows supports 52,000 family combinations
  • #61 If system lacks dedicated image, individual emojis are shown
  • #63 Combining Diacritical Marks
  • #69 Pros: Code points always use the same number of bytes; very straightforward. Cons: not very memory efficient, can contain null bytes, not self-synchronizing
  • #70 BMP = basically everything except emojis and historical scripts
  • #74 “Surrogate pairs”; values are reserved, no code points with those values
  • #75 Pros: more memory efficient (most of the time), works well for BMP; is self-synchronizing Cons: 4-byte encoding logic somewhat messy; can contain null bytes
  • #96 This symbol can be encoded 4 different ways
  • #98 Intl normalizer class
  • #100 In UTF-8: 3 bytes for the snowman, 1 for the space, 1 for each of the letters C a f e, and 2 for the combining acute accent mark (10 bytes in total)
  • #110 Now for some fun tricks