Demystifying Unicode - Longhorn PHP 2021

ASCII is so 1963. Nowadays, computers must support a broad range of different characters beyond the 128 we had in the early days of computing - not just accents and emojis but also completely different writing systems used around the globe. The Unicode standard packs a whopping 143,859 characters into an elegant system used by over 95% of the Internet, but PHP's string functions don't play nicely with Unicode by default, making it difficult for developers to properly handle such a wide array of possible user inputs.

In this talk, we'll explore why Unicode is important, how encodings like UTF-8 work under the hood, how to handle them in PHP, and some nifty tricks and shortcuts to preserve performance.


  1. 1. Demystifying Unicode @colinodell
  2. 2. Colin O’Dell ● Principal Engineer at Unleashed Technologies ● PHP for ~20 years; 13 years professionally ● Creator & maintainer of league/commonmark library ● PHP League leadership team ● Owner of moderngeekware.com ● @colinodell
  3. 3. Agenda ● A History of Encoding Systems ● Unicode Standard ● Unicode Encodings ● Using Unicode in PHP ● Tips & Tricks ● Questions & Answers
  4. 4. Assumptions ● Some familiarity with PHP ● Basic understanding of binary and hexadecimal ● Focus on high-level concepts!
  5. 5. Encoding Systems
  6. 6. Encoding Systems L 1001100 L
  7. 7. A (Brief) History of Encoding Systems
  8. 8. 1837: Morse Code (Internationalized in 1844) “Morse-Vail Telegraph Key” by the National Museum of American History is licenced under CC BY-NC 2.0
  9. 9. 1930s: Teleprinters
  10. 10. 1960s: Teletypes (TTYs) For Computing
  11. 11. 1960s: ASCII ● American Standard Code for Information Interchange ● 7-bit binary encoding ○ 0000000 = 0 ○ ... ○ 1111111 = 127
  12. 12. 0 1 2 3 4 5 6 7 8 9 A B C D E F 0 NUL SOH STX ETX EOT ENQ ACK BEL BS HT LF VT FF CR SO SI 1 DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC FS GS RS US 2 SPACE ! " # $ % & ' ( ) * + , - . / 3 0 1 2 3 4 5 6 7 8 9 : ; < = > ? 4 @ A B C D E F G H I J K L M N O 5 P Q R S T U V W X Y Z [ \ ] ^ _ 6 ` a b c d e f g h i j k l m n o 7 p q r s t u v w x y z { | } ~ DEL
  13. 13. 0 1 2 3 4 5 6 7 8 9 A B C D E F 0 NUL SOH STX ETX EOT ENQ ACK BEL BS HT LF VT FF CR SO SI 1 DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC FS GS RS US 2 SPACE ! " # $ % & ' ( ) * + , - . / 3 0 1 2 3 4 5 6 7 8 9 : ; < = > ? 4 @ A B C D E F G H I J K L M N O 5 P Q R S T U V W X Y Z [ \ ] ^ _ 6 ` a b c d e f g h i j k l m n o 7 p q r s t u v w x y z { | } ~ DEL Examples (character = hex = binary): LF (line feed) = 0x0A = 0001010; 3 = 0x33 = 0110011; E = 0x45 = 1000101; e = 0x65 = 1100101
  14. 14. 0 1 2 3 4 5 6 7 8 9 A B C D E F 0 NUL SOH STX ETX EOT ENQ ACK BEL BS HT LF VT FF CR SO SI 1 DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC FS GS RS US 2 SPACE ! " # $ % & ' ( ) * + , - . / 3 0 1 2 3 4 5 6 7 8 9 : ; < = > ? 4 @ A B C D E F G H I J K L M N O 5 P Q R S T U V W X Y Z [ ] ^ _ 6 ` a b c d e f g h i j k l m n o 7 p q r s t u v w x y z { | } ~ DEL 00xxxxx 01xxxxx 10xxxxx 11xxxxx 00xxxxx = 32 control codes 01xxxxx = 32 numbers & symbols 10xxxxx = 32 uppercase letters and some extra symbols 11xxxxx = 32 lowercase letters and some extra symbols
  15. 15. A = 0x41 = 1000001 B = 0x42 = 1000010 … Z = 0x5A = 1011010 0 1 2 3 4 5 6 7 8 9 A B C D E F 0 NUL SOH STX ETX EOT ENQ ACK BEL BS HT LF VT FF CR SO SI 1 DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC FS GS RS US 2 SPACE ! " # $ % & ' ( ) * + , - . / 3 0 1 2 3 4 5 6 7 8 9 : ; < = > ? 4 @ A B C D E F G H I J K L M N O 5 P Q R S T U V W X Y Z [ ] ^ _ 6 ` a b c d e f g h i j k l m n o 7 p q r s t u v w x y z { | } ~ DEL 00xxxxx 01xxxxx 10xxxxx 11xxxxx
  16. 16. A = 0x41 = 1000001 B = 0x42 = 1000010 … Z = 0x5A = 1011010 a = 0x61 = 1100001 b = 0x62 = 1100010 … z = 0x7A = 1111010 0 1 2 3 4 5 6 7 8 9 A B C D E F 0 NUL SOH STX ETX EOT ENQ ACK BEL BS HT LF VT FF CR SO SI 1 DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC FS GS RS US 2 SPACE ! " # $ % & ' ( ) * + , - . / 3 0 1 2 3 4 5 6 7 8 9 : ; < = > ? 4 @ A B C D E F G H I J K L M N O 5 P Q R S T U V W X Y Z [ ] ^ _ 6 ` a b c d e f g h i j k l m n o 7 p q r s t u v w x y z { | } ~ DEL 00xxxxx 01xxxxx 10xxxxx 11xxxxx
  17. 17. But computers use 8-bit bytes... ASCII (7 bits): start 00000000, end 01111111, count 128. ???: start 10000000, end 11111111, count 128.
  18. 18. 0 1 2 3 4 5 6 7 8 9 A B C D E F 0 NUL SOH STX ETX EOT ENQ ACK BEL BS HT LF VT FF CR SO SI 1 DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC FS GS RS US 2 SPACE ! " # $ % & ' ( ) * + , - . / 3 0 1 2 3 4 5 6 7 8 9 : ; < = > ? 4 @ A B C D E F G H I J K L M N O 5 P Q R S T U V W X Y Z [ ] ^ _ 6 ` a b c d e f g h i j k l m n o 7 p q r s t u v w x y z { | } ~ DEL 7-bit ASCII
  19. 19. 0 1 2 3 4 5 6 7 8 9 A B C D E F 0 NUL SOH STX ETX EOT ENQ ACK BEL BS HT LF VT FF CR SO SI 1 DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC FS GS RS US 2 SPACE ! " # $ % & ' ( ) * + , - . / 3 0 1 2 3 4 5 6 7 8 9 : ; < = > ? 4 @ A B C D E F G H I J K L M N O 5 P Q R S T U V W X Y Z [ ] ^ _ 6 ` a b c d e f g h i j k l m n o 7 p q r s t u v w x y z { | } ~ DEL 8 ??? 9 A B C D E F 8-bit “Extended ASCII”
  20. 20. 0 1 2 3 4 5 6 7 8 9 A B C D E F 0 1 2 SPACE ! " # $ % & ' ( ) * + , - . / 3 0 1 2 3 4 5 6 7 8 9 : ; < = > ? 4 @ A B C D E F G H I J K L M N O 5 P Q R S T U V W X Y Z [ ] ^ _ 6 ` a b c d e f g h i j k l m n o 7 p q r s t u v w x y z { | } ~ 8 9 A NBSP ¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬ SHY ® ¯ B ° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿ C À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï D Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß E à á â ã ä å æ ç è é ê ë ì í î ï F ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ ISO 8859-1
  21. 21. 0 1 2 3 4 5 6 7 8 9 A B C D E F 0 NUL SOH STX ETX EOT ENQ ACK BEL BS HT LF VT FF CR SO SI 1 DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC FS GS RS US 2 SPACE ! " # $ % & ' ( ) * + , - . / 3 0 1 2 3 4 5 6 7 8 9 : ; < = > ? 4 @ A B C D E F G H I J K L M N O 5 P Q R S T U V W X Y Z [ ] ^ _ 6 ` a b c d e f g h i j k l m n o 7 p q r s t u v w x y z { | } ~ DEL 8 € ‚ ƒ „ … † ‡ ˆ ‰ Š ‹ Œ Ž 9 ‘ ’ “ ” • – — ˜ ™ š › œ ž Ÿ A NBSP ¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬ SHY ® ¯ B ° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿ C À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï D Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß E à á â ã ä å æ ç è é ê ë ì í î ï F ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ Windows-1252
  22. 22. 0 1 2 3 4 5 6 7 8 9 A B C D E F 0 1 2 SPACE ! " # $ % & ' ( ) * + , - . / 3 0 1 2 3 4 5 6 7 8 9 : ; < = > ? 4 @ A B C D E F G H I J K L M N O 5 P Q R S T U V W X Y Z [ ] ^ _ 6 ` a b c d e f g h i j k l m n o 7 p q r s t u v w x y z { | } ~ 8 9 A NBSP Ą ˘ Ł ¤ Ľ Ś § ¨ Š Ş Ť Ź SHY Ž Ż B ° ą ˛ ł ´ ľ ś ˇ ¸ š ş ť ź ˝ ž ż C Ŕ Á Â Ă Ä Ĺ Ć Ç Č É Ę Ë Ě Í Î Ď D Đ Ń Ň Ó Ô Ő Ö × Ř Ů Ú Ű Ü Ý Ţ ß E ŕ á â ă ä ĺ ć ç č é ę ë ě í î ď F đ ń ň ó ô ő ö ÷ ř ů ú ű ü ý ţ ˙ ISO 8859-2
  23. 23. 0 1 2 3 4 5 6 7 8 9 A B C D E F 0 NUL ☺ ☻ ♥ ♦ ♣ ♠ • ◘ ○ ◙ ♂ ♀ ♪ ♫ ☼ 1 ► ◄ ↕ ‼ ¶ § ▬ ↨ ↑ ↓ → ← ∟ ↔ ▲ ▼ 2 SPACE ! " # $ % & ' ( ) * + , - . / 3 0 1 2 3 4 5 6 7 8 9 : ; < = > ? 4 @ A B C D E F G H I J K L M N O 5 P Q R S T U V W X Y Z [ ] ^ _ 6 ` a b c d e f g h i j k l m n o 7 p q r s t u v w x y z { | } ~ ⌂ 8 Ç ü é â ä à å ç ê ë è ï î ì Ä Å 9 É æ Æ ô ö ò û ù ÿ Ö Ü ¢ £ ¥ ₧ ƒ A á í ó ú ñ Ñ ª º ¿ ⌐ ¬ ½ ¼ ¡ « » B ░ ▒ ▓ │ ┤ ╡ ╢ ╖ ╕ ╣ ║ ╗ ╝ ╜ ╛ ┐ C └ ┴ ┬ ├ ─ ┼ ╞ ╟ ╚ ╔ ╩ ╦ ╠ ═ ╬ ╧ D ╨ ╤ ╥ ╙ ╘ ╒ ╓ ╫ ╪ ┘ ┌ █ ▄ ▌ ▐ ▀ E α ß Γ π Σ σ µ τ Φ Θ Ω δ ∞ φ ε ∩ F ≡ ± ≥ ≤ ⌠ ⌡ ÷ ≈ ° ∙ · √ ⁿ ² ■ NBSP Code Page 437 (IBM PC)
  24. 24. 8-bit “Extended ASCII” ● ISO 8859 - 16 variations: ○ ISO 8859-1 (“Latin 1”, Western European) ○ ISO 8859-2 (“Latin 2”, Central European) ○ ISO 8859-3 (“Latin 3”, South European) ○ ISO 8859-4 (“Latin 4”, North European) ○ ISO 8859-5 (Latin/Cyrillic) ○ ISO 8859-6 (Latin/Arabic) ○ ISO 8859-7 (Latin/Greek) ○ ISO 8859-8 (Latin/Hebrew) ○ ISO 8859-9 (“Latin 5”, Turkish) ○ ISO 8859-10 (“Latin 6”, Nordic) ○ ISO 8859-11 (Latin/Thai) ○ ISO 8859-12 (Latin/Devanagari) - abandoned ○ ISO 8859-13 (“Latin 7”, Baltic Rim) ○ ISO 8859-14 (“Latin 8”, Celtic) ○ ISO 8859-15 (“Latin 9”) ■ Revision of 8859-1 that swaps out less-used characters and adds the euro currency symbol ○ ISO 8859-16 (“Latin 10”, South-Eastern European) ● Windows-1252 ● CP 437 - Original IBM PC ● Mac OS Roman character set ● TRS-80 character set ● Atari’s ATASCII ● Commodore’s PETSCII ● HP Roman-8 and Roman-9 ● DEC’s Multinational Character Set ● Lotus International Character Set ● ECMA-94
  25. 25. But then along came the Internet...
  26. 26. https://xkcd.com/927/
  27. 27. “The Unicode Standard is the universal character encoding standard for written characters and text. It defines a consistent way of encoding multilingual text that enables the exchange of text data internationally and creates the foundation for global software”
  28. 28. Code Points Problem: How to accommodate larger character sets without wasting memory? Solution: Break the one-to-one correspondence between characters and bits/encoding! Offer different ways to encode based on different needs.
  29. 29. ASCII vs. Unicode Character Encoded Bits H 01001000 (0x48) P 01010000 (0x50) Glyph Code Point P U+0050 LATIN CAPITAL LETTER P H U+0048 LATIN CAPITAL LETTER H Encoded Bits ???? ????
  30. 30. Glyph Code Point Encoded Bits P U+0050 LATIN CAPITAL LETTER P ???? h U+0068 LATIN SMALL LETTER H ???? Σ U+03A3 GREEK CAPITAL LETTER SIGMA ???? U+0634 ARABIC LETTER SHEEN ???? U+1D2ED MAYAN NUMERAL THIRTEEN ???? 😸 U+1F638 GRINNING CAT FACE WITH SMILING EYES ???? H U+0048 LATIN CAPITAL LETTER H ????
  31. 31. D U+0044 LATIN CAPITAL LETTER D
  32. 32. U+1F604 SMILING FACE WITH OPEN MOUTH AND SMILING EYES
  33. 33. Code Planes
  34. 34. Recap ● Code Point: a number representing a single character* ○ 143,859 defined as of Unicode 13.0 ○ Format: U+hhhhhh ● Codespace: A range of numerical values available for encoding characters ○ Support for 1,114,112 codepoints (0x000000 - 0x10FFFF) ● Code Planes: Contiguous groups of 65,536 (2^16) code points ○ 17 planes, numbered 0 - 16, corresponding to the possible values 0x00–0x10 of the first two positions in the six-position hexadecimal format (U+hhhhhh)
  35. 35. Glyphs and Graphemes
  36. 36. Character / Code Point: a U+0061 LATIN SMALL LETTER A
  37. 37. Character / Code Point: a U+0061 LATIN SMALL LETTER A a a a a a a a a Glyphs:
  38. 38. Glyphs and Graphemes Glyph / Grapheme c a f e Unicode Character c a f e Code Point U+0063 U+0061 U+0066 U+0065 LATIN SMALL LETTER C LATIN SMALL LETTER A LATIN SMALL LETTER F LATIN SMALL LETTER E
  39. 39. Glyphs and Graphemes: Combining Diacritical Marks Glyph / Grapheme c a f é Unicode Character c a f e ◌́ Code Point U+0063 U+0061 U+0066 U+0065 U+0301 LATIN SMALL LETTER C LATIN SMALL LETTER A LATIN SMALL LETTER F LATIN SMALL LETTER E COMBINING ACUTE ACCENT
  40. 40. Glyphs and Graphemes: Combining Diacritical Marks Glyph / Grapheme c a f é Unicode Character c a f e ◌́ Code Point U+0063 U+0061 U+0066 U+0065 U+0301 LATIN SMALL LETTER C LATIN SMALL LETTER A LATIN SMALL LETTER F LATIN SMALL LETTER E COMBINING ACUTE ACCENT e + ◌́ = é e
  41. 41. Glyphs and Graphemes: Combining Diacritical Marks Z̷̧̨̰̋Å̸̮͉ ̵͉̣̄̇̀ L̵͉̣̄̇̀G ̸̮͉̊ O ̸̱͒̓ ̷̧̨̰̋Ț͝E̪̘̗̓͝X̪̘̗T ̸̰̺̝̍̈
  42. 42. Glyphs and Graphemes: Variation Selectors Glyph / Grapheme ✈ Unicode Character ✈ Code Point U+2708 U+FE0E AIRPLANE VARIATION SELECTOR 15 (TEXT STYLE) VS 15
  43. 43. Glyphs and Graphemes: Variation Selectors Glyph / Grapheme ✈ Unicode Character ✈ Code Point U+2708 U+FE0E AIRPLANE VARIATION SELECTOR 15 (TEXT STYLE) Glyph / Grapheme Unicode Character ✈ Code Point U+2708 U+FE0F AIRPLANE VARIATION SELECTOR 16 (EMOJI STYLE) VS 16 VS 15
  44. 44. Glyphs and Graphemes: Regional Indicator Symbols Glyph / Grapheme 🇺🇸 Unicode Character 🇺 🇸 Code Point U+1F1FA U+1F1F8 REGIONAL INDICATOR SYMBOL LETTER U REGIONAL INDICATOR SYMBOL LETTER S Glyph / Grapheme 🇨🇦 Unicode Character 🇨 🇦 Code Point U+1F1E8 U+1F1E6 REGIONAL INDICATOR SYMBOL LETTER C REGIONAL INDICATOR SYMBOL LETTER A
  45. 45. Glyphs and Graphemes: Modifiers Glyph / Grapheme Unicode Character 👋 Code Point U+1F44B U+1F3FC WAVING HAND SIGN EMOJI MODIFIER FITZPATRICK TYPE-3 Glyph / Grapheme Unicode Character 👋 Code Point U+1F44B U+1F3FE WAVING HAND SIGN EMOJI MODIFIER FITZPATRICK TYPE-5
  46. 46. Glyphs and Graphemes: ZWJ Sequences Glyph / Grapheme 👨 👩 👶 👧 Unicode Character 👨 👩 👶 👧 Code Point U+1F468 U+1F469 U+1F476 U+1F467 MAN WOMAN BABY GIRL
  47. 47. Glyphs and Graphemes: ZWJ Sequences Glyph / Grapheme Unicode Character 👨 👩 👶 👧 Code Point U+1F468 U+200D U+1F469 U+200D U+1F476 U+200D U+1F467 MAN ZERO WIDTH JOINER WOMAN ZERO WIDTH JOINER BABY ZERO WIDTH JOINER GIRL ZWJ ZWJ ZWJ
  48. 48. Glyphs and Graphemes: ZWJ Sequences
  49. 49. Glyphs and Graphemes: ZWJ Sequences
  50. 50. Glyphs and Graphemes: ZWJ Sequences Glyph / Grapheme Unicode Character Code Point U+1F477 U+200D U+2642 CONSTRU CTION WORKER ZERO WIDTH JOINER MALE SIGN ZWJ Glyph / Grapheme Unicode Character Code Point U+1F477 U+200D U+2640 CONSTRU CTION WORKER ZERO WIDTH JOINER FEMALE SIGN ZWJ
  51. 51. Glyphs and Graphemes: ZWJ Sequences Glyph / Grapheme Unicode Character Code Point U+1F477 U+1F3FE U+200D U+2640 CONSTRUCTION WORKER EMOJI MODIFIER FITZPATRICK TYPE-5 ZERO WIDTH JOINER FEMALE SIGN ZWJ
  52. 52. Enough about code points...
  53. 53. Encoding Schemes
  54. 54. Glyph Code Point Encoded Bits P U+0050 LATIN CAPITAL LETTER P ???? h U+0068 LATIN SMALL LETTER H ???? Σ U+03A3 GREEK CAPITAL LETTER SIGMA ???? U+0634 ARABIC LETTER SHEEN ???? U+1D2ED MAYAN NUMERAL THIRTEEN ???? 😸 U+1F638 GRINNING CAT FACE WITH SMILING EYES ???? H U+0048 LATIN CAPITAL LETTER H ????
  55. 55. Encoding Schemes ● Most popular: ○ UTF-8 ○ UTF-16 ○ UTF-32
  56. 56. UTF-32 Fixed-byte encoding; 4 bytes per code point Codepoint range Unicode scalar value (binary) Encoded bytes U+0000..U+D7FF, U+E000..U+10FFFF xxxxxxxxxxxxxxxxxxxxx 00000000 000xxxxx xxxxxxxx xxxxxxxx
  57. 57. UTF-32 Fixed-byte encoding; 4 bytes per code point Codepoint range Unicode scalar value (binary) Encoded bytes U+0000..U+D7FF, U+E000..U+10FFFF xxxxxxxxxxxxxxxxxxxxx 00000000 000xxxxx xxxxxxxx xxxxxxxx Examples: A U+0041 LATIN CAPITAL LETTER A: 0x0041 => 1000001 => 00000000 00000000 00000000 01000001. 😸 U+1F638 GRINNING CAT FACE WITH SMILING EYES: 0x1F638 => 11111011000111000 => 00000000 00000001 11110110 00111000
  58. 58. UTF-16 Variable-length encoding; 2 or 4 bytes per code point. Codepoint range: U+0000..U+D7FF, U+E000..U+FFFF (Basic Multilingual Plane); Unicode scalar value (binary): 00000 xxxxxxxx xxxxxxxx; encoded bytes: xxxxxxxx xxxxxxxx
  59. 59. UTF-16 Variable-length encoding; 2 or 4 bytes per code point. Codepoint range: U+0000..U+D7FF, U+E000..U+FFFF (Basic Multilingual Plane); Unicode scalar value (binary): 00000 xxxxxxxx xxxxxxxx; encoded bytes: xxxxxxxx xxxxxxxx. Example: A U+0041 LATIN CAPITAL LETTER A: 0x0041 => 1000001 => 00000000 01000001
  60. 60. UTF-16 Variable-length encoding; 2 or 4 bytes per code point. Codepoint range: U+0000..U+D7FF, U+E000..U+FFFF (Basic Multilingual Plane); Unicode scalar value (binary): 00000 xxxxxxxx xxxxxxxx; encoded bytes: xxxxxxxx xxxxxxxx. Codepoint range: U+010000..U+10FFFF (Supplementary Planes); Unicode scalar value (binary): uuuuu uuuuuuuu uuuuuuuu; encoded bytes: 110110xx xxxxxxxx 110111yy yyyyyyyy
  61. 61. UTF-16 Variable-length encoding; 2 or 4 bytes per code point. Codepoint range: U+0000..U+D7FF, U+E000..U+FFFF (Basic Multilingual Plane); Unicode scalar value (binary): 00000 xxxxxxxx xxxxxxxx; encoded bytes: xxxxxxxx xxxxxxxx. Codepoint range: U+010000..U+10FFFF (Supplementary Planes); Unicode scalar value (binary): uuuuu uuuuuuuu uuuuuuuu; encoded bytes: 110110xx xxxxxxxx 110111yy yyyyyyyy. U' = xxxxxxxxxxyyyyyyyyyy // U - 0x10000; W1 = 110110xxxxxxxxxx // 0xD800 + xxxxxxxxxx; W2 = 110111yyyyyyyyyy // 0xDC00 + yyyyyyyyyy
  62. 62. UTF-16 Variable-length encoding; 2 or 4 bytes per code point. Codepoint range: U+0000..U+D7FF, U+E000..U+FFFF (Basic Multilingual Plane); Unicode scalar value (binary): 00000 xxxxxxxx xxxxxxxx; encoded bytes: xxxxxxxx xxxxxxxx. Codepoint range: U+010000..U+10FFFF (Supplementary Planes); Unicode scalar value (binary): uuuuu uuuuuuuu uuuuuuuu; encoded bytes: 110110xx xxxxxxxx 110111yy yyyyyyyy. U' = xxxxxxxxxxyyyyyyyyyy // U - 0x10000; W1 = 110110xxxxxxxxxx // 0xD800 + xxxxxxxxxx, range 0xD800-0xDBFF; W2 = 110111yyyyyyyyyy // 0xDC00 + yyyyyyyyyy, range 0xDC00-0xDFFF
  63. 63. UTF-16 Variable-length encoding; 2 or 4 bytes per code point. Codepoint range: U+010000..U+10FFFF (Supplementary Planes); Unicode scalar value (binary): uuuuu uuuuuuuu uuuuuuuu; encoded bytes: 110110xx xxxxxxxx 110111yy yyyyyyyy. Example: 😸 U+1F638 GRINNING CAT FACE WITH SMILING EYES: 0x1F638 => 1 11110110 00111000 => 11011000 00111101 11011110 00111000. U' = 11110110 00111000 // 0x1F638 - 0x10000 = 0xF638; W1 = 11011000 00111101 // 0xD800 + 0000111101; W2 = 11011110 00111000 // 0xDC00 + 1000111000
  64. 64. UTF-8 Variable-length encoding; 1-4 bytes per code point
      Codepoint range: bytes (notes)
      U+0000..U+007F: 0xxxxxxx (Covers first 128 codepoints; ASCII)
      U+0080..U+07FF: 110xxxxx 10xxxxxx ((Almost) all Latin, Greek, Cyrillic, Arabic, Hebrew, and more, plus combining diacritical marks)
      U+0800..U+FFFF: 1110xxxx 10xxxxxx 10xxxxxx (Rest of the BMP, including most Chinese, Japanese, and Korean characters)
      U+10000..U+10FFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx (All other planes, including historic scripts, mathematical symbols, and emoji)
  65. 65. UTF-8 Trick 1: ASCII === UTF-8 (same table as slide 64)
  66. 66. UTF-8 Trick 2: Virtually all languages only need 1, 2, or 3 bytes (same table as slide 64)
  67. 67. UTF-8 Trick 3: First byte tells you the length (same table as slide 64)
  68. 68. UTF-8 Trick 4: Self-synchronization (same table as slide 64)
  69. 69. UTF-8 Trick 5: No 0x00 bytes, except for NUL (same table as slide 64)
  70. 70. UTF Encoding Summary (values listed as UTF-32 / UTF-16 / UTF-8)
      Encoding length: Fixed, 4 bytes per code point / Variable, 2 or 4 bytes per code point / Variable, 1-4 bytes per code point
      Memory-efficient: No / Somewhat / Yes
      CPU-efficient: Yes / Somewhat / Somewhat
      Self-synchronizing: No / Yes / Yes
      Contains null (0x00) bytes: Yes / Yes / No
      ASCII-compatible: No / No / Yes
  71. 71. https://commons.wikimedia.org/wiki/File:Utf8webgrowth.svg
  72. 72. Unicode in PHP
  73. 73. Handling Text In Programming Languages 1. Treat text as a sequence of bytes (PHP, C) $smile = "\xF0\x9F\x98\x80"; echo $smile; // => '😀' echo strlen($smile); // => 4 2. Treat text as a sequence of Unicode code points (Python 3) 3. Treat text as a sequence of UTF-16 code units (JavaScript, C#) const smile = '\uD83D\uDE00'; console.log(smile); // => '😀' console.log(smile.length); // => 2
  74. 74. PHP Strings Be careful! ● Strings are simply byte sequences ● Encoding-agnostic ● Some (not all) string functions assume fixed-width, 8-bit ASCII encoding
  75. 75. PHP String Functions Function What It Actually Does strlen() Counts the length in bytes str_replace() Replaces bytes substr() Returns a subset of bytes strtoupper() Converts alphabetic ASCII bytes to uppercase based on globally-set locale Works for ASCII; not entirely safe* for Unicode!
  76. 76. ext/mbstring Provides multibyte-safe string functions. Standard function => mbstring alternative: strlen() => mb_strlen(); str_replace() => (none); substr() => mb_substr(); strtoupper() => mb_strtoupper(). Tip: All functions accept an optional parameter to specify the encoding; when it is omitted, mbstring falls back to its internal encoding (see mb_internal_encoding()), which defaults to UTF-8 in current PHP versions.
  77. 77. ext/mbstring Provides multibyte-safe string functions mb_convert_case(string $string, int $mode, ?string $encoding = null): string Input $string $mode Output Mary had a little lamb MB_CASE_UPPER MARY HAD A LITTLE LAMB MB_CASE_LOWER mary had a little lamb MB_CASE_TITLE Mary Had A Little Lamb MB_CASE_FOLD mary had a little lamb
  78. 78. ext/mbstring Provides multibyte-safe string functions mb_convert_case(string $string, int $mode, ?string $encoding = null): string Input $string $mode Output Ich grüße den Mann (I greet the man) MB_CASE_UPPER ICH GRÜSSE DEN MANN MB_CASE_LOWER ich grüße den mann MB_CASE_TITLE Ich Grüße Den Mann MB_CASE_FOLD ich grüsse den mann
  79. 79. ext/pcre Enable UTF-8 support with u modifier: preg_match('/foo/u') Match a character with a Unicode property: \p{xx} (37 different codes) Property codes (code: matches, example): L: Any letter, \p{L}; Ll: Lower case letter, \p{Ll}; Lu: Upper case letter, \p{Lu}; Lm: Modifier letter, \p{Lm}; Lt: Title case letter, \p{Lt}; Lo: Other letter, \p{Lo}; S: Any symbol, \p{S}; Sc: Currency symbol, \p{Sc}; Sk: Modifier symbol, \p{Sk}; Sm: Mathematical symbol, \p{Sm}; So: Other symbol, \p{So}
  80. 80. ext/pcre Enable UTF-8 support with u modifier: preg_match('/foo/u') Match a character with a Unicode property: \p{xx} (37 different codes) Match a character with a Unicode script: \p{xxxx} (102 different scripts) Examples: \p{Greek} or \p{Egyptian_Hieroglyphs}
  81. 81. ext/pcre Enable UTF-8 support with u modifier: preg_match('/foo/u') Match a character with a Unicode property: \p{xx} (37 different codes) Match a character with a Unicode script: \p{xxxx} (102 different scripts) Match a character without a Unicode property: \P{xx}
  82. 82. ext/pcre Enable UTF-8 support with u modifier: preg_match('/foo/u') Match a character with a Unicode property: \p{xx} (37 different codes) Match a character with a Unicode script: \p{xxxx} (102 different scripts) Match a character without a Unicode property: \P{xx} Match a Unicode extended grapheme cluster: \X Think of it like a . but for multiple characters that combine into a single glyph
  83. 83. ext/intl - IntlChar class var_dump(IntlChar::charName('⛄')); // string(20) "SNOWMAN WITHOUT SNOW" $name = "RECYCLING SYMBOL FOR TYPE-1 PLASTICS"; var_dump(IntlChar::charFromName($name)); // int(9843) var_dump(IntlChar::isupper("A")); // bool(true)
  84. 84. ext/intl - Normalizer class 1. U+01FA - “Precomposed” character (LATIN CAPITAL LETTER A WITH RING ABOVE AND ACUTE) 2. A + U+030A + U+0301 - A base letter A followed by two combining marks (U+030A COMBINING RING ABOVE and U+0301 COMBINING ACUTE ACCENT) 3. U+00C5 + U+0301 - An accented letter (U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE) followed by a combining accent (U+0301 COMBINING ACUTE ACCENT) 4. U+212B + U+0301 - A compatibility character (U+212B ANGSTROM SIGN) followed by a combining accent (U+0301 COMBINING ACUTE ACCENT) Ǻ
  85. 85. $variations = [ "\xC7\xBA", "A" . "\xCC\x8A\xCC\x81", "\xC3\x85\xCC\x81", "\xE2\x84\xAB\xCC\x81", ]; Ǻ
  86. 86. $variations = [ "\xC7\xBA", "A" . "\xCC\x8A\xCC\x81", "\xC3\x85\xCC\x81", "\xE2\x84\xAB\xCC\x81", ]; foreach ($variations as $str) { echo urlencode(Normalizer::normalize($str)); echo "\n"; } Ǻ
  87. 87. $variations = [ "\xC7\xBA", "A" . "\xCC\x8A\xCC\x81", "\xC3\x85\xCC\x81", "\xE2\x84\xAB\xCC\x81", ]; foreach ($variations as $str) { echo urlencode(Normalizer::normalize($str)); echo "\n"; } // %C7%BA // %C7%BA // %C7%BA // %C7%BA Ǻ
  88. 88. ext/intl - Grapheme Functions grapheme_extract() grapheme_stripos() grapheme_stristr() grapheme_strlen() grapheme_strpos() grapheme_strripos() grapheme_strrpos() grapheme_strstr() grapheme_substr() $str = '⛄ Café'; (the é here is e followed by U+0301 COMBINING ACUTE ACCENT) echo strlen($str); // 10 echo mb_strlen($str); // 7 echo grapheme_strlen($str); // 6
  89. 89. ext/iconv - iconv() function to convert encodings $text = "This is the Euro symbol '€'."; // UTF-8 string
  90. 90. ext/iconv - iconv() function to convert encodings $text = "This is the Euro symbol '€'."; // UTF-8 string echo iconv("UTF-8", "ISO-8859-1", $text), PHP_EOL; // Notice: iconv(): Detected an illegal character in input string
  91. 91. ext/iconv - iconv() function to convert encodings $text = "This is the Euro symbol '€'."; // UTF-8 string echo iconv("UTF-8", "ISO-8859-1", $text), PHP_EOL; // Notice: iconv(): Detected an illegal character in input string echo iconv("UTF-8", "ISO-8859-1//TRANSLIT", $text), PHP_EOL; // This is the Euro symbol 'EUR'. echo iconv("UTF-8", "ISO-8859-1//IGNORE", $text), PHP_EOL; // This is the Euro symbol ''.
  92. 92. PHP Extension Summary ext/iconv: Convert between encodings ext/mbstring: Work with multi-byte string encodings like UTF-8 ext/pcre: Special UTF-compatible matching when /u modifier enabled ext/intl: Work with individual codepoints and graphemes
  93. 93. Fun Tricks & Micro-Optimizations
  94. 94. Disclaimer Clever hacks and micro-optimizations are usually unnecessary and can be detrimental to long-term maintenance! Don’t use these unless you absolutely need them.
  95. 95. Taking Advantage of UTF-Encoded Bytes PHP string functions can still be used in some cases: if (str_contains($utf8, '&')) { … } $trimmed = trim($utf8); $firstChar = substr($utf32, 0, 4); Requires solid understanding of UTF encodings and what the functions do Don’t be clever unless there’s a clear advantage!
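  96. 96. Splitting Strings Into Codepoints mb_str_split($str) - returns array of individual codepoints (PHP 7.4+) UTF-8 polyfill for older versions: preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY) (the flag drops the empty strings that the empty pattern otherwise produces at both ends) (Works for codepoints, not graphemes)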
  97. 97. ASCII-Only UTF-8 Strings Is a UTF-8 string pure ASCII? If so, no need for (slower) mbstring functions: $isAscii = mb_detect_encoding($str, 'ASCII', true); Micro-optimization (2x faster): $isASCII = strlen($str) === mb_strlen($str); Speed is fractions of milliseconds; micro-optimization only important for parsing-heavy applications
  98. 98. Writing Silly Code PHP supports Unicode in variable and function names: class (╯°□°)╯︵┻━┻ extends Exception {} throw new (╯°□°)╯︵┻━┻;
  99. 99. Writing Silly Code PHP supports Unicode in variable and function names: class (╯°□°)╯︵┻━┻ extends Exception {} throw new (╯°□°)╯︵┻━┻; Uses U+FF08 FULLWIDTH LEFT PARENTHESIS and U+FF09 FULLWIDTH RIGHT PARENTHESIS since normal parens (U+0028/U+0029) are not allowed here.
  100. 100. Writing Silly Code (Don’t Do This) PHP supports Unicode in variable and function names: class (╯°□°)╯︵┻━┻ extends Exception {} throw new (╯°□°)╯︵┻━┻; $👉😎👉 = "Ann Perkins!"; // Parks and Rec reference
  101. 101. Writing Silly Code (Seriously, Don’t Do This) PHP supports Unicode in variable and function names: class (╯°□°)╯︵┻━┻ extends Exception {} throw new (╯°□°)╯︵┻━┻; $👉😎👉 = "Ann Perkins!"; // Parks and Rec reference $you can use = 'U+2000 EN QUAD whitespace';
  102. 102. Recap
  103. 103. Recap & Recommendations ● Unicode supports virtually every known modern and historic writing system ● Codepoints != Glyphs/Graphemes != Encoding ● Use and support UTF-8 everywhere, especially for user input ● PHP strings are just raw bytes ● Use mbstring functions
  104. 104. Questions?
  105. 105. Thank You! Slides & feedback: https://joind.in/talk/9bdc2 Questions? @colinodell or colinodell@gmail.com
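The UTF-8 bit layout from slide 64 can be tried out by hand. The following is a minimal sketch, not code from the talk: `utf8Encode()` is a hypothetical helper name, and it relies only on core PHP functions (`chr()`, `bin2hex()`) so that no extension is required.

```php
<?php
// Hypothetical helper (not from the slides): encode one Unicode code point
// as UTF-8 by following the bit layout in the slide 64 table.
function utf8Encode(int $cp): string
{
    // Surrogates (U+D800..U+DFFF) and values above U+10FFFF are not
    // Unicode scalar values and cannot be encoded.
    if ($cp < 0 || $cp > 0x10FFFF || ($cp >= 0xD800 && $cp <= 0xDFFF)) {
        throw new InvalidArgumentException('Not a Unicode scalar value');
    }
    if ($cp <= 0x7F) {
        // 0xxxxxxx: identical to ASCII
        return chr($cp);
    }
    if ($cp <= 0x7FF) {
        // 110xxxxx 10xxxxxx
        return chr(0xC0 | ($cp >> 6))
             . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp <= 0xFFFF) {
        // 1110xxxx 10xxxxxx 10xxxxxx
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

echo bin2hex(utf8Encode(0x50)), "\n";    // 50        (P, 1 byte)
echo bin2hex(utf8Encode(0x3A3)), "\n";   // cea3      (Σ, 2 bytes)
echo bin2hex(utf8Encode(0x1F638)), "\n"; // f09f98b8  (😸, 4 bytes)
```

In real code, `mb_chr()` from ext/mbstring does the same job; the manual version is only meant to make the table concrete.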

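The surrogate-pair arithmetic from slides 61 to 63 can be sketched the same way. This is an illustration under assumptions rather than the speaker's code: `utf16beEncode()` is a hypothetical name, and the output is big-endian UTF-16 without a byte order mark.

```php
<?php
// Hypothetical helper (not from the slides): encode one code point as
// big-endian UTF-16, using the U' / W1 / W2 steps from slides 61-63.
function utf16beEncode(int $cp): string
{
    if ($cp < 0 || $cp > 0x10FFFF || ($cp >= 0xD800 && $cp <= 0xDFFF)) {
        throw new InvalidArgumentException('Not a Unicode scalar value');
    }
    if ($cp <= 0xFFFF) {
        // BMP: the code point is stored directly as one 16-bit unit.
        return pack('n', $cp);
    }
    $u  = $cp - 0x10000;           // U': at most 20 bits
    $w1 = 0xD800 + ($u >> 10);     // high surrogate: top 10 bits of U'
    $w2 = 0xDC00 + ($u & 0x3FF);   // low surrogate: bottom 10 bits of U'
    return pack('nn', $w1, $w2);
}

echo bin2hex(utf16beEncode(0x41)), "\n";    // 0041
echo bin2hex(utf16beEncode(0x1F638)), "\n"; // d83dde38
```

The second result matches the slide 63 example: 0xD83D is 11011000 00111101 and 0xDE38 is 11011110 00111000.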
Editor's Notes

  • Questions as we go? Raise hand
  • Converts characters into electrical signals
  • Standardized in 1865
  • Simple device
    Type a key, sends some numbers, same letter comes out the other side
  • But there needs to be a standard
  • Developed in 1960s for teleprinters (“Teletype”) and early computers
    7-bit: each letter you type in gets converted into 7 bits
  • Support for:
    Upper and lowercase letters
    Numbers
    Basic, common symbols
    More control codes (CR, LF, BS, HT, BEL)
    (next for examples)
  • (how to encode/decode)
  • Something really clever going on here
    Group by first two bits
    4 “pages” or sections, 32 chars each
  • Letters in alphabetical order, starting at 1 (not random)
  • Even more clever - converting between upper and lowercase by changing one bit
  • “Extended ASCII” sounds like a standard, but it’s not
  • AKA Latin 1 for the Americas, Western Europe, Oceania, and much of Africa
  • Superset/extension of ISO 8859-1
    Adds curly quotation marks
    De-facto standard for Windows
  • Aka Latin 2 for Central or Eastern European Languages
  • UI graphics, science, and math
    Standard EGA VGA encoding on gfx cards
  • That’s a lot! However,
  • In practice, most users only used one standard locally. Which was fine...
  • Standards proliferation
  • (Problem) You could add more bits, but that wasted computing resources (which were scarce at the time) for users who only needed Latin or ASCII-like characters
  • ATTN: 4 vs 5 char convention
  • Support for 1,114,112 codepoints (0x000000 - 0x10FFFF)
    Code Planes: Contiguous groups of 65,536 (2^16) code points. There are 17 planes, identified by the numbers 0 to 16, which correspond to the possible values 0x00–0x10 of the first two positions in the six-position hexadecimal format (U+hhhhhh)
    Codespace: entire range of numerical values available for encoding characters
  • Unicode does not specify how the character / code point should be displayed (or encoded)!
  • Combining Diacritical Marks
  • In this example: 5 code points but 4 graphemes
    GRAPHEME = smallest unit of a writing system
    Think about putting cursor in this text and selecting something or pressing backspace
  • “Zalgo text” or “glitch text”
  • Windows supports 52,000 family combinations
  • If system lacks dedicated image, individual emojis are shown
  • Pros: Every code point uses the same number of bytes; very straightforward
    Cons: not very memory efficient, can contain null bytes, not self-synchronizing
  • BMP = basically everything except emojis and historical scripts
  • “Surrogate pairs”; values are reserved, no code points with those values
  • Pros: more memory efficient (most of the time), works well for BMP; is self-synchronizing
    Cons: 4-byte encoding logic somewhat messy; can contain null bytes
  • This symbol can be encoded 4 different ways
  • Intl normalizer class
  • In UTF-8: 3 bytes for snowman, 1 for space, 1 for each letter C a f e, and 2 for the combining acute accent (U+0301), for 10 bytes in total
  • Now for some fun tricks
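The byte, code point, and grapheme counts from slide 88 can be reproduced without ext/mbstring or ext/intl. The sketch below is an assumption-laden illustration, not the talk's code: it counts code points using the continuation-byte pattern (10xxxxxx) from the UTF-8 table and counts graphemes with PCRE's `\X`. It assumes the input is valid UTF-8.

```php
<?php
// "⛄ Café" with a decomposed é (e + U+0301), written with escapes so the
// byte sequence does not depend on how this file is saved.
$str = "\u{26C4} Cafe\u{0301}";

// Bytes: what strlen() reports.
$bytes = strlen($str);

// Code points: every byte that is NOT a continuation byte (10xxxxxx)
// starts a new code point. Only meaningful for valid UTF-8 input.
$codePoints = 0;
for ($i = 0; $i < $bytes; $i++) {
    if ((ord($str[$i]) & 0xC0) !== 0x80) {
        $codePoints++;
    }
}

// Grapheme clusters: PCRE's \X with the u modifier.
$graphemes = preg_match_all('/\X/u', $str);

printf("%d bytes, %d code points, %d graphemes\n", $bytes, $codePoints, $graphemes);
// 10 bytes, 7 code points, 6 graphemes
```

The loop is the self-synchronization property in action: a reader can land on any byte and tell whether it begins a code point.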

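The note about converting between upper and lowercase by changing one bit can be shown in a few lines. This is a sketch with a hypothetical helper name, `toggleAsciiCase()`; it is limited to single ASCII letters and is not a replacement for `mb_convert_case()`.

```php
<?php
// Hypothetical helper (not from the slides): flip the case of one ASCII
// letter by toggling bit 5 (0x20), e.g. A = 1000001 <-> a = 1100001.
function toggleAsciiCase(string $ch): string
{
    if (strlen($ch) !== 1) {
        return $ch; // only single bytes are handled
    }
    // Force bit 5 on and check for a..z; this accepts both cases of a letter
    // and rejects digits, symbols, and non-ASCII bytes.
    $lower = ord($ch) | 0x20;
    if ($lower < 0x61 || $lower > 0x7A) {
        return $ch;
    }
    return chr(ord($ch) ^ 0x20);
}

echo toggleAsciiCase('A'), toggleAsciiCase('z'), toggleAsciiCase('3'), "\n"; // aZ3
```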