Demystifying Unicode - Longhorn PHP 2021

Demystifying Unicode
@colinodell

Colin O’Dell
● Principal Engineer at Unleashed Technologies
● PHP for ~20 years; 13 years professionally
● Creator & maintainer of league/commonmark library
● PHP League leadership team
● Owner of moderngeekware.com
● @colinodell

Agenda
● A History of Encoding Systems
● Unicode Standard
● Unicode Encodings
● Using Unicode in PHP
● Tips & Tricks
● Questions & Answers

Assumptions
● Some familiarity with PHP
● Basic understanding of binary and hexadecimal
● Focus on high-level concepts!

A (Brief) History of
Encoding Systems

1837: Morse Code (Internationalized in 1844)
“Morse-Vail Telegraph Key” by the National Museum of American History is licensed under CC BY-NC 2.0

1960s: Teletypes (TTYs) For Computing

1960s: ASCII
● American Standard Code for Information Interchange
● 7-bit binary encoding
○ 0000000 = 0
○ ...
○ 1111111 = 127

0 1 2 3 4 5 6 7 8 9 A B C D E F
0 NUL SOH STX ETX EOT ENQ ACK BEL BS HT LF VT FF CR SO SI
1 DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC FS GS RS US
2 SPACE ! " # $ % & ' ( ) * + , - . /
3 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
4 @ A B C D E F G H I J K L M N O
5 P Q R S T U V W X Y Z [ \ ] ^ _
6 ` a b c d e f g h i j k l m n o
7 p q r s t u v w x y z { | } ~ DEL

Character | Hex | Binary
LF (line feed) | 0x0A | 0001010
3 | 0x33 | 0110011
E | 0x45 | 1000101
e | 0x65 | 1100101

00xxxxx = 32 control codes
01xxxxx = 32 numbers & symbols
10xxxxx = 32 uppercase letters and some extra symbols
11xxxxx = 32 lowercase letters and some extra symbols

A = 0x41 = 1000001
B = 0x42 = 1000010
…
Z = 0x5A = 1011010
a = 0x61 = 1100001
b = 0x62 = 1100010
…
z = 0x7A = 1111010
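The bit patterns above show that an uppercase ASCII letter and its lowercase twin differ in a single bit (0x20). A minimal sketch of that observation follows; `toggleAsciiCase()` is a hypothetical helper written for illustration, not something from the talk.

```php
<?php
// Uppercase (10xxxxx) and lowercase (11xxxxx) ASCII letters differ only in bit 0x20.
function toggleAsciiCase(string $char): string
{
    if (strlen($char) !== 1) {
        return $char; // only single bytes are handled
    }
    $lower = ord($char) | 0x20;
    if ($lower < 0x61 || $lower > 0x7A) {
        return $char; // not an ASCII letter, leave untouched
    }
    return chr(ord($char) ^ 0x20); // flip the case bit
}

printf("%s = 0x%02X = %07b\n", 'A', ord('A'), ord('A')); // A = 0x41 = 1000001
printf("%s = 0x%02X = %07b\n", 'a', ord('a'), ord('a')); // a = 0x61 = 1100001
echo toggleAsciiCase('A'), toggleAsciiCase('z'), "\n";   // aZ
```

This only works for the 52 ASCII letters; it says nothing about accented or non-Latin characters.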

But computers use 8-bit bytes...
ASCII (7 Bits) ???
Start 00000000 10000000
End 01111111 11111111
Count 128 128

7-bit ASCII: rows 0-7 of the table (0x00-0x7F)
8-bit “Extended ASCII”: rows 8-F of the table (0x80-0xFF) = ???

0 1 2 3 4 5 6 7 8 9 A B C D E F
0
1
2 SPACE ! " # $ % & ' ( ) * + , - . /
3 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
5 P Q R S T U V W X Y Z [ \ ] ^ _
7 p q r s t u v w x y z { | } ~
8
9
A NBSP ¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬ SHY ® ¯
B ° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿
C À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï
D Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß
E à á â ã ä å æ ç è é ê ë ì í î ï
F ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ
ISO 8859-1

0 1 2 3 4 5 6 7 8 9 A B C D E F
2 SPACE ! " # $ % & ' ( ) * + , - . /
3 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
5 P Q R S T U V W X Y Z [ \ ] ^ _
8 € ‚ ƒ „ … † ‡ ˆ ‰ Š ‹ Œ Ž
9 ‘ ’ “ ” • – — ˜ ™ š › œ ž Ÿ
A NBSP ¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬ SHY ® ¯
B ° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿
C À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï
D Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß
E à á â ã ä å æ ç è é ê ë ì í î ï
F ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ
Windows-1252

0 1 2 3 4 5 6 7 8 9 A B C D E F
0
1
2 SPACE ! " # $ % & ' ( ) * + , - . /
3 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
5 P Q R S T U V W X Y Z [ \ ] ^ _
7 p q r s t u v w x y z { | } ~
8
9
A NBSP Ą ˘ Ł ¤ Ľ Ś § ¨ Š Ş Ť Ź SHY Ž Ż
B ° ą ˛ ł ´ ľ ś ˇ ¸ š ş ť ź ˝ ž ż
C Ŕ Á Â Ă Ä Ĺ Ć Ç Č É Ę Ë Ě Í Î Ď
D Đ Ń Ň Ó Ô Ő Ö × Ř Ů Ú Ű Ü Ý Ţ ß
E ŕ á â ă ä ĺ ć ç č é ę ë ě í î ď
F đ ń ň ó ô ő ö ÷ ř ů ú ű ü ý ţ ˙
ISO 8859-2

0 1 2 3 4 5 6 7 8 9 A B C D E F
0 NUL ☺ ☻ ♥ ♦ ♣ ♠ • ◘ ○ ◙ ♂ ♀ ♪ ♫ ☼
1 ► ◄ ↕ ‼ ¶ § ▬ ↨ ↑ ↓ → ← ∟ ↔ ▲ ▼
2 SPACE ! " # $ % & ' ( ) * + , - . /
3 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
5 P Q R S T U V W X Y Z [ \ ] ^ _
7 p q r s t u v w x y z { | } ~ ⌂
8 Ç ü é â ä à å ç ê ë è ï î ì Ä Å
9 É æ Æ ô ö ò û ù ÿ Ö Ü ¢ £ ¥ ₧ ƒ
A á í ó ú ñ Ñ ª º ¿ ⌐ ¬ ½ ¼ ¡ « »
B ░ ▒ ▓ │ ┤ ╡ ╢ ╖ ╕ ╣ ║ ╗ ╝ ╜ ╛ ┐
C └ ┴ ┬ ├ ─ ┼ ╞ ╟ ╚ ╔ ╩ ╦ ╠ ═ ╬ ╧
D ╨ ╤ ╥ ╙ ╘ ╒ ╓ ╫ ╪ ┘ ┌ █ ▄ ▌ ▐ ▀
E α ß Γ π Σ σ µ τ Φ Θ Ω δ ∞ φ ε ∩
F ≡ ± ≥ ≤ ⌠ ⌡ ÷ ≈ ° ∙ · √ ⁿ ² ■ NBSP
Code Page 437 (IBM PC)

8-bit “Extended ASCII”
● ISO 8859 - 16 variations:
○ ISO 8859-1 (“Latin 1”, Western European)
○ ISO 8859-2 (“Latin 2”, Central European)
○ ISO 8859-3 (“Latin 3”, South European)
○ ISO 8859-4 (“Latin 4”, North European)
○ ISO 8859-5 (Latin/Cyrillic)
○ ISO 8859-6 (Latin/Arabic)
○ ISO 8859-7 (Latin/Greek)
○ ISO 8859-8 (Latin/Hebrew)
○ ISO 8859-9 (“Latin 5”, Turkish)
○ ISO 8859-10 (“Latin 6”, Nordic)
○ ISO 8859-11 (Latin/Thai)
○ ISO 8859-12 (Latin/Devanagari) - abandoned
○ ISO 8859-13 (“Latin 7”, Baltic Rim)
○ ISO 8859-14 (“Latin 8”, Celtic)
○ ISO 8859-15 (“Latin 9”)
■ Revision of 8859-1 that swaps out less-used characters and adds the euro currency symbol
○ ISO 8859-16 (“Latin 10”, South-Eastern European)
● Windows-1252
● CP 437 - Original IBM PC
● Mac OS Roman character set
● TRS-80 character set
● Atari’s ATASCII
● Commodore’s PETSCII
● HP Roman-8 and Roman-9
● DEC’s Multinational Character Set
● Lotus International Character Set
● ECMA-94

But then along came the Internet...

“The Unicode Standard is the universal character
encoding standard for written characters and text. It
defines a consistent way of encoding multilingual text
that enables the exchange of text data internationally and
creates the foundation for global software”

Code Points
Problem:
How to accommodate larger character sets without wasting memory?
Solution:
Break the one-to-one correspondence between characters and
bits/encoding! Offer different ways to encode based on
different needs.

ASCII vs. Unicode
ASCII:
Character | Encoded Bits
H | 01001000 (0x48)
P | 01010000 (0x50)
Unicode:
Glyph | Code Point | Encoded Bits
H | U+0048 LATIN CAPITAL LETTER H | ????
P | U+0050 LATIN CAPITAL LETTER P | ????

Glyph | Code Point | Encoded Bits
P | U+0050 LATIN CAPITAL LETTER P | ????
h | U+0068 LATIN SMALL LETTER H | ????
Σ | U+03A3 GREEK CAPITAL LETTER SIGMA | ????
ش | U+0634 ARABIC LETTER SHEEN | ????
| U+1D2ED MAYAN NUMERAL THIRTEEN | ????
😸 | U+1F638 GRINNING CAT FACE WITH SMILING EYES | ????
H | U+0048 LATIN CAPITAL LETTER H | ????

D | U+0044 LATIN CAPITAL LETTER D
😄 | U+1F604 SMILING FACE WITH OPEN MOUTH AND SMILING EYES

Recap
● Code Point: a number representing a single character*
○ 143,859 defined as of Unicode 13.0
○ Format: U+hhhhhh
● Codespace: A range of numerical values available for encoding characters
○ Support for 1,114,112 codepoints (0x000000 - 0x10FFFF)
● Code Planes: Contiguous groups of 65,536 (2^16) code points
○ 17 planes, numbered 0 - 16, corresponding to the possible values 00–10 (hexadecimal) of the first
two positions in the six-position hexadecimal format (U+hhhhhh)
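As a rough illustration of the recap, the sketch below looks up a character's code point with `mb_ord()` and derives its plane by dropping the low 16 bits. `describeCodePoint()` is a made-up name for this example.

```php
<?php
// Describe a character by its code point and the plane it belongs to.
function describeCodePoint(string $char): string
{
    $codePoint = mb_ord($char, 'UTF-8'); // integer value of the first code point
    $plane = $codePoint >> 16;           // each plane holds 0x10000 (65,536) code points
    return sprintf('U+%04X (plane %d)', $codePoint, $plane);
}

echo describeCodePoint('H'), "\n";         // U+0048 (plane 0)
echo describeCodePoint("\u{03A3}"), "\n";  // U+03A3 (plane 0)
echo describeCodePoint("\u{1F638}"), "\n"; // U+1F638 (plane 1)
```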

Character / Code Point:
a
U+0061
LATIN SMALL LETTER A
Glyphs (the same character drawn in different typefaces):
a a a a a a a a

Glyphs and Graphemes
Glyph / Grapheme: c a f e
Unicode Character: c a f e
Code Point: U+0063 U+0061 U+0066 U+0065
Name: LATIN SMALL LETTER C, LATIN SMALL LETTER A, LATIN SMALL LETTER F, LATIN SMALL LETTER E

Glyphs and Graphemes: Combining Diacritical Marks
Glyph / Grapheme: c a f é
Unicode Character: c a f e ◌́
Code Point: U+0063 U+0061 U+0066 U+0065 U+0301
Name: LATIN SMALL LETTER C, LATIN SMALL LETTER A, LATIN SMALL LETTER F, LATIN SMALL LETTER E, COMBINING ACUTE ACCENT
e + ◌́ = é
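To make the combining-mark idea concrete, here is a small sketch comparing the two spellings of "café". It assumes the precomposed form U+00E9 (LATIN SMALL LETTER E WITH ACUTE) as the counterpart to e + U+0301; the variable names are illustrative only.

```php
<?php
$precomposed = "caf\u{00E9}";  // é as a single code point (U+00E9)
$decomposed  = "cafe\u{0301}"; // e followed by U+0301 COMBINING ACUTE ACCENT

var_dump($precomposed === $decomposed); // bool(false): same grapheme, different bytes
echo bin2hex($precomposed), "\n";       // 636166c3a9
echo bin2hex($decomposed), "\n";        // 63616665cc81
echo mb_strlen($precomposed, 'UTF-8'), ' vs ', mb_strlen($decomposed, 'UTF-8'), "\n"; // 4 vs 5
```

Both strings usually render identically, which is why byte-wise comparison of user input can surprise you.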

Z̷̧̨̰̋Å̸̮͉ ̵͉̣̄̇̀
L̵͉̣̄̇̀G
̸̮͉̊ O
̸̱͒̓ ̷̧̨̰̋Ț͝E̪̘̗̓͝X̪̘̗T
̸̰̺̝̍̈

Glyphs and Graphemes: Variation Selectors
Glyph / Grapheme: ✈ (text style)
Code Point: U+2708 U+FE0E
Name: AIRPLANE, VARIATION SELECTOR 15 (TEXT STYLE)
Glyph / Grapheme: ✈ (emoji style)
Code Point: U+2708 U+FE0F
Name: AIRPLANE, VARIATION SELECTOR 16 (EMOJI STYLE)

Glyphs and Graphemes: Regional Indicator Symbols
Glyph / Grapheme: 🇺🇸
Unicode Character: 🇺 🇸
Code Point: U+1F1FA U+1F1F8
Name: REGIONAL INDICATOR SYMBOL LETTER U, REGIONAL INDICATOR SYMBOL LETTER S
Glyph / Grapheme: 🇨🇦
Unicode Character: 🇨 🇦
Code Point: U+1F1E8 U+1F1E6
Name: REGIONAL INDICATOR SYMBOL LETTER C, REGIONAL INDICATOR SYMBOL LETTER A
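Because the regional indicator symbols run alphabetically from U+1F1E6 (letter A), a flag sequence can be derived from a two-letter country code. The sketch below assumes plain A-Z input and does no validation; `flagFor()` is a hypothetical helper.

```php
<?php
// Turn a two-letter ASCII country code into a pair of regional indicator symbols.
function flagFor(string $countryCode): string
{
    $flag = '';
    foreach (str_split(strtoupper($countryCode)) as $letter) {
        // U+1F1E6 is REGIONAL INDICATOR SYMBOL LETTER A; B..Z follow in order.
        $flag .= mb_chr(0x1F1E6 + ord($letter) - ord('A'), 'UTF-8');
    }
    return $flag;
}

echo flagFor('US'), ' ', flagFor('ca'), "\n"; // rendering as flags depends on the font
```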

Glyphs and Graphemes: Modifiers
Unicode Character: 👋
Code Point: U+1F44B U+1F3FC
Name: WAVING HAND SIGN, EMOJI MODIFIER FITZPATRICK TYPE-3
Unicode Character: 👋
Code Point: U+1F44B U+1F3FE
Name: WAVING HAND SIGN, EMOJI MODIFIER FITZPATRICK TYPE-5

Glyphs and Graphemes: ZWJ Sequences
Without joiners (four separate glyphs):
Unicode Character: 👨 👩 👶 👧
Code Point: U+1F468 U+1F469 U+1F476 U+1F467
Name: MAN, WOMAN, BABY, GIRL
With a ZERO WIDTH JOINER (ZWJ) between each pair (joined into a single glyph where supported):
Unicode Character: 👨 ZWJ 👩 ZWJ 👶 ZWJ 👧
Code Point: U+1F468 U+200D U+1F469 U+200D U+1F476 U+200D U+1F467
Name: MAN, ZERO WIDTH JOINER, WOMAN, ZERO WIDTH JOINER, BABY, ZERO WIDTH JOINER, GIRL
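A quick sketch of the sequence above: joining the four people with U+200D produces one string that contains seven code points. Whether it displays as a single family glyph depends on font and platform support.

```php
<?php
$zwj = "\u{200D}"; // ZERO WIDTH JOINER
$family = implode($zwj, ["\u{1F468}", "\u{1F469}", "\u{1F476}", "\u{1F467}"]);

echo $family, "\n";
echo mb_strlen($family, 'UTF-8'), "\n"; // 7 code points (4 people + 3 joiners)
echo strlen($family), "\n";             // 25 bytes in UTF-8 (4 x 4 + 3 x 3)
```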

Code Point: U+1F477 U+200D U+2642
Name: CONSTRUCTION WORKER, ZERO WIDTH JOINER, MALE SIGN
Code Point: U+1F477 U+200D U+2640
Name: CONSTRUCTION WORKER, ZERO WIDTH JOINER, FEMALE SIGN

Code Point: U+1F477 U+1F3FE U+200D U+2640
Name: CONSTRUCTION WORKER, EMOJI MODIFIER FITZPATRICK TYPE-5, ZERO WIDTH JOINER, FEMALE SIGN

Encoding Schemes
● Most popular:
○ UTF-8
○ UTF-16
○ UTF-32

UTF-32
Fixed-length encoding; 4 bytes per code point
Codepoint range | Unicode scalar value (binary) | Encoded bytes
U+0000..U+D7FF, U+E000..U+10FFFF | xxxxxxxxxxxxxxxxxxxxx | 00000000 000xxxxx xxxxxxxx xxxxxxxx
Examples:
A | U+0041 LATIN CAPITAL LETTER A | 0x0041 => 1000001 | 00000000 00000000 00000000 01000001
😸 | U+1F638 GRINNING CAT FACE WITH SMILING EYES | 0x1F638 => 11111011000111000 | 00000000 00000001 11110110 00111000
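Since UTF-32 is just the code point stored as a 32-bit integer, it can be sketched with `pack()`. The example assumes big-endian byte order, matching the byte layout shown above; `utf32be()` is a name invented for this sketch.

```php
<?php
// Encode one code point as big-endian UTF-32: a 32-bit unsigned integer.
function utf32be(int $codePoint): string
{
    return pack('N', $codePoint);
}

echo bin2hex(utf32be(0x41)), "\n";    // 00000041
echo bin2hex(utf32be(0x1F638)), "\n"; // 0001f638

// Cross-check against mbstring's own conversion.
var_dump(utf32be(0x1F638) === mb_convert_encoding("\u{1F638}", 'UTF-32BE', 'UTF-8')); // bool(true)
```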

UTF-16
Variable-length encoding; 2 or 4 bytes per code point
Codepoint range | Unicode scalar value (binary) | Encoded bytes
U+0000..U+D7FF, U+E000..U+FFFF (Basic Multilingual Plane) | 00000 xxxxxxxx xxxxxxxx | xxxxxxxx xxxxxxxx
U+010000..U+10FFFF (Supplementary Planes) | uuuuu uuuuuuuu uuuuuuuu | 110110xx xxxxxxxx 110111yy yyyyyyyy
Example (Basic Multilingual Plane):
A | U+0041 LATIN CAPITAL LETTER A | 0x0041 => 1000001 | 00000000 01000001
Supplementary planes are encoded as a surrogate pair:
U' = xxxxxxxxxxyyyyyyyyyy // U - 0x10000
W1 = 110110xxxxxxxxxx // 0xD800 + xxxxxxxxxx (range 0xD800-0xDBFF)
W2 = 110111yyyyyyyyyy // 0xDC00 + yyyyyyyyyy (range 0xDC00-0xDFFF)
Example (Supplementary Planes):
😸 | U+1F638 GRINNING CAT FACE WITH SMILING EYES | 0x1F638 => 1 11110110 00111000 | 11011000 00111101 11011110 00111000
U' = 11110110 00111000 // 0x1F638 - 0x10000 = 0xF638
W1 = 11011000 00111101 // 0xD800 + 0000111101
W2 = 11011110 00111000 // 0xDC00 + 1000111000
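The surrogate-pair arithmetic (U', W1, W2) can be sketched in a few lines. `surrogatePair()` is a hypothetical helper written for this example.

```php
<?php
// Split a supplementary-plane code point into its UTF-16 surrogate pair.
function surrogatePair(int $codePoint): array
{
    if ($codePoint < 0x10000 || $codePoint > 0x10FFFF) {
        throw new InvalidArgumentException('Only U+10000..U+10FFFF need surrogates');
    }
    $offset = $codePoint - 0x10000;     // U': a 20-bit value
    $high = 0xD800 + ($offset >> 10);   // W1: top 10 bits
    $low  = 0xDC00 + ($offset & 0x3FF); // W2: bottom 10 bits
    return [$high, $low];
}

[$w1, $w2] = surrogatePair(0x1F638);
printf("%04X %04X\n", $w1, $w2); // D83D DE38
```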

UTF-8
Variable-length encoding; 1-4 bytes per code point
Codepoint range | Byte 1 | Byte 2 | Byte 3 | Byte 4 | Notes
U+0000..U+007F | 0xxxxxxx | | | | Covers first 128 codepoints; ASCII
U+0080..U+07FF | 110xxxxx | 10xxxxxx | | | (Almost) all Latin, Greek, Cyrillic, Arabic, Hebrew, and more, plus combining diacritical marks
U+0800..U+FFFF | 1110xxxx | 10xxxxxx | 10xxxxxx | | Rest of the BMP, including most Chinese, Japanese, and Korean characters
U+10000..U+10FFFF | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx | All other planes, including historic scripts, mathematical symbols, and emoji
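The table translates fairly directly into code. Below is a minimal hand-rolled encoder, intended as a teaching sketch rather than a replacement for `mb_chr()`; `utf8Encode()` is a made-up name.

```php
<?php
// Encode a single Unicode scalar value as UTF-8, following the four rows of the table.
function utf8Encode(int $cp): string
{
    if ($cp < 0 || $cp > 0x10FFFF || ($cp >= 0xD800 && $cp <= 0xDFFF)) {
        throw new InvalidArgumentException('Not a Unicode scalar value');
    }
    if ($cp <= 0x7F) {   // 0xxxxxxx
        return chr($cp);
    }
    if ($cp <= 0x7FF) {  // 110xxxxx 10xxxxxx
        return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp <= 0xFFFF) { // 1110xxxx 10xxxxxx 10xxxxxx
        return chr(0xE0 | ($cp >> 12))
            . chr(0x80 | (($cp >> 6) & 0x3F))
            . chr(0x80 | ($cp & 0x3F));
    }
    // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return chr(0xF0 | ($cp >> 18))
        . chr(0x80 | (($cp >> 12) & 0x3F))
        . chr(0x80 | (($cp >> 6) & 0x3F))
        . chr(0x80 | ($cp & 0x3F));
}

echo bin2hex(utf8Encode(0x48)), "\n";    // 48
echo bin2hex(utf8Encode(0x3A3)), "\n";   // cea3
echo bin2hex(utf8Encode(0x1F638)), "\n"; // f09f98b8
```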

UTF-8
Trick 1: ASCII === UTF-8
Trick 2: Virtually all languages only need 1, 2, or 3 bytes
Trick 3: First byte tells you the length
Trick 4: Self-synchronization
Trick 5: No 0x00 bytes, except for NUL
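Tricks 3 and 4 can be illustrated with two small helpers. Both are hypothetical names for this sketch, and neither validates its input: overlong or out-of-range lead bytes such as 0xC0 or 0xF5 are not rejected.

```php
<?php
// Trick 3: the lead byte announces how long the sequence is.
// Returns 0 for a continuation byte (10xxxxxx) or an unrecognized lead byte.
function utf8SequenceLength(string $byte): int
{
    $b = ord($byte);
    if ($b < 0x80) { return 1; }            // 0xxxxxxx
    if (($b & 0xE0) === 0xC0) { return 2; } // 110xxxxx
    if (($b & 0xF0) === 0xE0) { return 3; } // 1110xxxx
    if (($b & 0xF8) === 0xF0) { return 4; } // 11110xxx
    return 0;
}

// Trick 4: from any byte offset, back up over continuation bytes
// to find where the current character starts.
function utf8CharStart(string $utf8, int $offset): int
{
    while ($offset > 0 && (ord($utf8[$offset]) & 0xC0) === 0x80) {
        $offset--;
    }
    return $offset;
}

$str = "a\u{1F638}";                      // bytes: 61 F0 9F 98 B8
echo utf8SequenceLength($str[1]), "\n";   // 4
echo utf8CharStart($str, 3), "\n";        // 1
```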

UTF Encoding Summary
Property | UTF-32 | UTF-16 | UTF-8
Encoding length | Fixed; 4 bytes per code point | Variable; 2 or 4 bytes per code point | Variable; 1-4 bytes per code point
Memory-efficient | No | Somewhat | Yes
CPU-efficient | Yes | Somewhat | Somewhat
Self-synchronizing | No | Yes | Yes
Contains null (0x00) bytes | Yes | Yes | No
ASCII-compatible | No | No | Yes

https://commons.wikimedia.org/wiki/File:Utf8webgrowth.svg

Handling Text In Programming Languages
1. Treat text as a sequence of bytes (PHP, C)
$smile = "\xF0\x9F\x98\x80";
echo $smile; // => '😀'
echo strlen($smile); // => 4
2. Treat text as a sequence of Unicode code points (Python 3)
3. Treat text as a sequence of UTF-16 code units (JavaScript, C#)
const smile = '\uD83D\uDE00';
console.log(smile); // => '😀'
console.log(smile.length); // => 2
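All three views of the same string can be reproduced from PHP. This is a rough sketch that uses mbstring to emulate the other two models; the variable names are illustrative.

```php
<?php
$smile = "\u{1F600}"; // the same character as "\xF0\x9F\x98\x80"

$bytes      = strlen($smile);                                               // byte view (PHP, C)
$codePoints = mb_strlen($smile, 'UTF-8');                                   // code point view (Python 3)
$utf16Units = strlen(mb_convert_encoding($smile, 'UTF-16LE', 'UTF-8')) / 2; // code unit view (JavaScript, C#)

printf("%d bytes, %d code point, %d UTF-16 code units\n", $bytes, $codePoints, $utf16Units);
// 4 bytes, 1 code point, 2 UTF-16 code units
```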

PHP Strings
Be careful!
● Strings are simply byte sequences
● Encoding-agnostic
● Some (not all) string functions assume fixed-width, 8-bit ASCII encoding

PHP String Functions
Function | What It Actually Does
strlen() | Counts the length in bytes
str_replace() | Replaces bytes
substr() | Returns a subset of bytes
strtoupper() | Converts alphabetic bytes to uppercase based on the globally-set locale (ASCII-only and locale-insensitive as of PHP 8.2)
Works for ASCII; not entirely safe* for Unicode!

ext/mbstring
Provides multibyte-safe string functions
Standard Function | mbstring Alternative
strlen() | mb_strlen()
str_replace() | (none)
substr() | mb_substr()
strtoupper() | mb_strtoupper()
Tip: These functions accept an optional parameter to specify the encoding, if known; otherwise the internal encoding (see mb_internal_encoding(), UTF-8 by default) is used.
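A short side-by-side sketch of the byte-oriented functions and their mbstring alternatives, passing the encoding explicitly. The sample word is an assumption chosen for illustration.

```php
<?php
$word = "caf\u{00E9}"; // "café" with a precomposed é (2 bytes in UTF-8)

echo strlen($word), ' ', mb_strlen($word, 'UTF-8'), "\n"; // 5 4
echo bin2hex(substr($word, 0, 4)), "\n";                  // 636166c3 (é cut in half)
echo mb_substr($word, 0, 4, 'UTF-8'), "\n";               // café
echo mb_strtoupper($word, 'UTF-8'), "\n";                 // CAFÉ
```

The `substr()` result is no longer valid UTF-8, which is the kind of bug the mbstring variants avoid.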

ext/mbstring
mb_convert_case(string $string, int $mode, ?string $encoding = null): string
Input $string: Mary had a little lamb
$mode | Output
MB_CASE_UPPER | MARY HAD A LITTLE LAMB
MB_CASE_LOWER | mary had a little lamb
MB_CASE_TITLE | Mary Had A Little Lamb
MB_CASE_FOLD | mary had a little lamb

ext/mbstring
mb_convert_case(string $string, int $mode, ?string $encoding = null): string
Input $string: Ich grüße den Mann (I greet the man)
$mode | Output
MB_CASE_UPPER | ICH GRÜSSE DEN MANN
MB_CASE_LOWER | ich grüße den mann
MB_CASE_TITLE | Ich Grüße Den Mann
MB_CASE_FOLD | ich grüsse den mann

ext/pcre
Enable UTF-8 support with u modifier: preg_match('/foo/u', $subject)
Match a character with a Unicode property: \p{xx} (37 different codes)
Property Code | Matches | Example
L | Any letter | \p{L}
Ll | Lower case letter | \p{Ll}
Lu | Upper case letter | \p{Lu}
Lm | Modifier letter | \p{Lm}
Lt | Title case letter | \p{Lt}
Lo | Other letter | \p{Lo}
S | Any symbol | \p{S}
Sc | Currency symbol | \p{Sc}
Sk | Modifier symbol | \p{Sk}
Sm | Mathematical symbol | \p{Sm}
So | Other symbol | \p{So}
Match a character with a Unicode script: \p{xxxx} (102 different scripts)
Examples: \p{Greek} or \p{Egyptian_Hieroglyphs}
Match a character without a Unicode property: \P{xx}
Match a Unicode extended grapheme cluster: \X
Think of it like a . but for multiple code points that combine into a single glyph
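A few of these escapes in action, as a sketch; the sample text is made up for the example.

```php
<?php
$text = "Caf\u{00E9} \u{03A3}\u{03B1} 42"; // "Café Σα 42"

// \p{Lu}: uppercase letters in any script.
preg_match_all('/\p{Lu}/u', $text, $upper);
echo implode(' ', $upper[0]), "\n"; // C Σ

// \p{Greek}: characters belonging to the Greek script.
preg_match_all('/\p{Greek}/u', $text, $greek);
echo implode('', $greek[0]), "\n"; // Σα

// \P{L}: anything that is not a letter.
echo preg_replace('/\P{L}+/u', '', $text), "\n"; // CaféΣα

// \X: one extended grapheme cluster (e + U+0301 counts once).
echo preg_match_all('/\X/u', "cafe\u{0301}"), "\n"; // 4
```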

ext/intl - IntlChar class
var_dump(IntlChar::charName('⛄'));
// string(20) "SNOWMAN WITHOUT SNOW"
$name = "RECYCLING SYMBOL FOR TYPE-1 PLASTICS";
var_dump(IntlChar::charFromName($name));
// int(9843)
var_dump(IntlChar::isupper("A"));
// bool(true)

ext/intl - Normalizer class
1. U+01FA - “Precomposed” character (LATIN CAPITAL
LETTER A WITH RING ABOVE AND ACUTE)
2. A + U+030A + U+0301 - A base letter A followed by two
combining marks (U+030A COMBINING RING ABOVE
and U+0301 COMBINING ACUTE ACCENT)
3. U+00C5 + U+0301 - An accented letter (U+00C5 LATIN
CAPITAL LETTER A WITH RING ABOVE) followed by a
combining accent (U+0301 COMBINING ACUTE
ACCENT)
4. U+212B + U+0301 - A compatibility character (U+212B
ANGSTROM SIGN) followed by a combining accent
(U+0301 COMBINING ACUTE ACCENT)
Ǻ

$variations = [
    "\xC7\xBA",                 // 1. U+01FA
    "A" . "\xCC\x8A\xCC\x81",   // 2. A + U+030A + U+0301
    "\xC3\x85\xCC\x81",         // 3. U+00C5 + U+0301
    "\xE2\x84\xAB\xCC\x81",     // 4. U+212B + U+0301
];
foreach ($variations as $str) {
    echo urlencode(Normalizer::normalize($str));
    echo "\n";
}
// %C7%BA
// %C7%BA
// %C7%BA
// %C7%BA

ext/intl - Grapheme Functions
grapheme_extract(), grapheme_stripos(), grapheme_stristr(), grapheme_strlen(), grapheme_strpos(), grapheme_strripos(), grapheme_strrpos(), grapheme_strstr(), grapheme_substr()
$str = "⛄ Cafe\u{0301}"; // '⛄ Café' with a decomposed é (e + U+0301)
echo strlen($str); // 10
echo mb_strlen($str); // 7
echo grapheme_strlen($str); // 6

ext/iconv - iconv() function to convert encodings
$text = "This is the Euro symbol '€'."; // UTF-8 string

echo iconv("UTF-8", "ISO-8859-1", $text), PHP_EOL;
// Notice: iconv(): Detected an illegal character in input string
echo iconv("UTF-8", "ISO-8859-1//TRANSLIT", $text), PHP_EOL;
// This is the Euro symbol 'EUR'.
echo iconv("UTF-8", "ISO-8859-1//IGNORE", $text), PHP_EOL;
// This is the Euro symbol ''.

PHP Extension Summary
ext/iconv: Convert between encodings
ext/mbstring: Work with multi-byte string encodings like UTF-8
ext/pcre: Special UTF-compatible matching when /u modifier enabled
ext/intl: Work with individual codepoints and graphemes

Fun Tricks & Micro-Optimizations

Disclaimer
Clever hacks and micro-optimizations are usually unnecessary and can be
detrimental to long-term maintenance!
Don’t use these unless you absolutely need them.

Taking Advantage of UTF-Encoded Bytes
PHP string functions can still be used in some cases:
if (str_contains($utf8, '&')) { … }
$trimmed = trim($utf8);
$firstChar = substr($utf32, 0, 4);
Requires solid understanding of UTF encodings and what the functions do
Don’t be clever unless there’s a clear advantage!

Splitting Strings Into Codepoints
mb_str_split($str) - returns array of individual codepoints (PHP 7.4+)
UTF-8 polyfill for older versions: preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY)
(Works for codepoints, not graphemes)
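A small sketch comparing the two approaches on a string with a combining mark. Note that the polyfill needs PREG_SPLIT_NO_EMPTY, otherwise `preg_split()` also returns empty strings at both ends.

```php
<?php
$str = "cafe\u{0301}"; // "café" with a decomposed é

$codePoints = mb_str_split($str, 1, 'UTF-8');                     // PHP 7.4+
$polyfill   = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);   // older versions

var_dump($codePoints === $polyfill); // bool(true)
echo count($codePoints), "\n";       // 5: the accent is its own element
```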

ASCII-Only UTF-8 Strings
Is a UTF-8 string pure ASCII? If so, no need for (slower) mbstring functions:
$isAscii = mb_detect_encoding($str, 'ASCII', true) !== false;
Micro-optimization (2x faster):
$isAscii = strlen($str) === mb_strlen($str);
Speed is fractions of milliseconds; micro-optimization only
important for parsing-heavy applications
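Both checks wrapped as functions, as a sketch. The function names are hypothetical, and the first one swaps in `mb_check_encoding()`, which returns a boolean directly, in place of the `mb_detect_encoding()` call on the slide. The second relies on the input already being valid UTF-8.

```php
<?php
// Straightforward check: is every byte valid in the ASCII encoding?
function isAscii(string $str): bool
{
    return mb_check_encoding($str, 'ASCII');
}

// Micro-optimized check. Assumes $utf8 is valid UTF-8:
// only ASCII characters take exactly one byte per code point.
function isAsciiFast(string $utf8): bool
{
    return strlen($utf8) === mb_strlen($utf8, 'UTF-8');
}

var_dump(isAscii('hello'), isAsciiFast('hello'));               // bool(true) bool(true)
var_dump(isAscii("caf\u{00E9}"), isAsciiFast("caf\u{00E9}"));   // bool(false) bool(false)
```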

Writing Silly Code
PHP supports Unicode in class, function, and variable names:
class （╯°□°）╯︵┻━┻ extends Exception {}
throw new （╯°□°）╯︵┻━┻;

The class name uses U+FF08 FULLWIDTH LEFT PARENTHESIS and U+FF09 FULLWIDTH RIGHT PARENTHESIS, since normal parens (U+0028/U+0029) are not allowed in identifiers.

Writing Silly Code (Seriously, Don’t Do This)
throw new （╯°□°）╯︵┻━┻;
$👉😎👉 = "Ann Perkins!"; // Parks and Rec reference
$you can use = 'U+2000 EN QUAD whitespace';

Recap & Recommendations
● Unicode supports virtually every known modern and historic writing system
● Codepoints != Glyphs/Graphemes != Encoding
● Use and support UTF-8 everywhere, especially for user input
● PHP strings are just raw bytes
● Use mbstring functions

Thank You!
Slides & feedback: https://joind.in/talk/9bdc2
Questions? @colinodell or colinodell@gmail.com
