The Good, the Bad, and the Ugly: What Happened to Unicode and PHP 6
In the halcyon days of early 2005, a project was launched to bring long-overdue native Unicode and internationalization support to PHP. It was deemed so far-reaching and important that PHP needed a version bump. After more than 4 years of development, the project (and PHP 6, for now) was shelved. This talk will introduce Unicode and i18n concepts, explain why the Web needs Unicode, why PHP needs Unicode, how we tried to solve it (with examples), and what eventually happened. No sordid details will be left uncovered.

Published in: Technology

  1. THE GOOD, THE BAD, AND THE UGLY: What Happened to Unicode and PHP 6
      Andrei Zmievski • PHP Community Conference
  2. ABOUT 1 YEAR AGO…
      “Hello PHP 5.4, open for all new stuff.” — Jani
  3. TIME OF DEATH
      March 11, 11:09:37 2010 GMT
  4. 5 YEARS EARLIER…
      PHP 5.0.0 released in July 2004
  5. 5 YEARS EARLIER…
      Firefox 1.0 released in November 2004
  6. 5 YEARS EARLIER…
      Chrome not even a twinkle in Google’s eye
  7. 5 YEARS EARLIER…
      Unicode version 4.0.1
  8. WHAT IS UNICODE? and why do I need it?
  9. Unicode …is a computing industry standard for the consistent encoding, representation and handling of text expressed in most of the world’s writing systems.
  10. Unicode provides a unique number for every character: no matter what the platform, no matter what the program, no matter what the language.
  11. UNICODE STANDARD
      • Developed by the Unicode Consortium
      • Covers all major living scripts
      • Version 6.0 has 109,000+ characters
      • Capacity for 1 million+ characters
      • Widely supported by standards & industry
  12. FEATURES
      • Rich property set for every character
      • Standard, unified encodings: UTF-8/16/32
      • Extensive rules and documents for implementation
      • Everything works, as long as everyone follows the rules
  13. UNICODE != I18N
      • Unicode simplifies development
      • Unicode does not fix all internationalization problems
  14. TIME FORMATS
      • USA: 4:00 P.M.
      • France: 16.00
      • Japan: 1600
      • Don’t forget to identify the time zone
  15. CURRENCY
      • Symbol placement
      • Symbol length (1-15)
      • Number width
      • Number precision:
        ‣ Spain, Japan: 0
        ‣ Mexico, Brazil: 2
        ‣ Egypt, Iraq: 3
      Examples: US $12.34 · 12.345,67 · 12$34 · 123
  16. SORTING
      • Swedish: z < ö
      • German: ö < z
      • Dictionary: öf < of
      • Phonebook: of < öf
      • Upper-first: A < a
      • Lower-first: a < A
      • Contractions: H < Z, but CH > CZ
      • Expansions: OE < Œ < OF
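The Swedish and German rows of the sorting slide can be reproduced with the intl extension that the talk introduces later. This is a minimal sketch, assuming ext/intl is loaded and the file is saved as UTF-8; the word list and the locale IDs `sv_SE` and `de_DE` are illustrative choices, not taken from the slides.

```php
<?php
// Assumption: ext/intl is available; the sample letters are illustrative.
$letters = array('z', 'ö', 'a');

$swedish = new Collator('sv_SE');
$german  = new Collator('de_DE');

$sv = $letters;
$swedish->sort($sv);   // Swedish treats ö as a separate letter placed after z

$de = $letters;
$german->sort($de);    // German sorts ö next to o, so it lands before z

echo implode(' ', $sv), "\n";   // a z ö
echo implode(' ', $de), "\n";   // a ö z
```

The same input yields two different orders, which is why sorting has to be driven by a locale rather than by byte or code point values.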
  17. CLDR
      • Hosted by Unicode Consortium
      • Latest release: December 2010 (CLDR 1.9)
      • 516 locales, with 187 languages and 166 territories
  18. WHY WEB NEEDS UNICODE
  19. MOJIBAKE
  20. MOJIBAKE
      noun: phenomenon of incorrect, unreadable characters shown when computer software fails to render a text correctly according to its associated character encoding.
  21. MOJIBAKE
  22. MOJIBAKE
  23. I UNICODE, YOU UNICODE
  24. I | UNICODE, YOU | UNICODE
  25. Helgi
  26. Helgi
  27. Helgi Þormar Þorbjörnsson
  28. ISLTHORP or Mr. SECURITY OVERRIDE
  29. Joel
  30. Joël
  31. Joël
  32. Joël
  33. Joël
  34. WEAKEST LINK PRINCIPLE APPLIES
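The mojibake described in slide 20 is easy to reproduce: one component writes UTF-8 bytes and another reads them as a legacy single-byte encoding. A minimal sketch, assuming the file is saved as UTF-8 and the mbstring extension is available; the choice of ISO-8859-1 as the wrong encoding is an illustrative assumption.

```php
<?php
// Assumption: UTF-8 source file, mbstring loaded.
$name = "Joël";                       // "ë" is stored as the two bytes C3 AB

// Misread the UTF-8 bytes as ISO-8859-1, then re-encode the result as UTF-8.
$garbled = mb_convert_encoding($name, 'UTF-8', 'ISO-8859-1');

echo bin2hex($name), "\n";            // 4a6fc3ab6c
echo $garbled, "\n";                  // JoÃ«l
```

Every link in the chain (form, database, script, output) has to agree on the encoding; a single mismatch produces the garbled name.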
  35. WHY PHP NEEDS UNICODE
  36. PHP
      • Essential Web platform
      • Since Web needs Unicode…
      • …so does PHP
      • Do not want to be the weakest link
  37. THE PROJECT
  38. THE PROJECT
      • Launched in February 2005 by me at Yahoo
      • Small group from Yahoo, Zend, and PHP development community
      • Design before code
  39. UNICODE SUPPORT
      • Everywhere:
        ‣ in the engine
        ‣ in the extensions
        ‣ in the API
  40. UNICODE SUPPORT
      • Native and complete
        ‣ no hacks
        ‣ no mishmash of external libraries
        ‣ no missing locales
        ‣ no language bias
  41. ICU LIBRARY: International Components for Unicode
      ✓ Unicode Character Properties
      ✓ Unicode String Class & text processing
      ✓ Text transformations (normalization, upper/lowercase, etc)
      ✓ Text Boundary Analysis (Character/Word/Sentence Break Iterators)
      ✓ Encoding Conversions for 500+ legacy encodings
      ✓ Language-sensitive collation (sorting) and searching
      ✓ Unicode regular expressions
      ✓ Thread-safe
      ✓ Formatting: Date/Time/Numbers/Currency
      ✓ Cultural Calendars & Time Zones
      ✓ (230+) Locale handling
      ✓ Resource Bundles
      ✓ Transliterations (50+ script pairs)
      ✓ Complex Text Layout for Arabic, Hebrew, Indic & Thai
      ✓ International Domain Names and Web addresses
      ✓ Java model for locale-hierarchical resource bundles. Multiple locales can be used at a time
  42. THE PROJECT
      • Development was in a separate repository
      • Merged into PHP tree once the basics were working
      • Initially slated for 5.x
      • Extensive changes necessitated a major version bump
  43. PHP 6 = PHP 5 + Unicode
  44. PHP 5 = PHP 6 - Unicode
  45. Unicode = PHP 6 - PHP 5
  46. PHP 6
  47. PHP 6 6
  48. STRING TYPES
      • Unicode
        ‣ text
        ‣ default for literals, etc
      • Binary
        ‣ bytes
        ‣ everything ∉ Unicode type
  49. Conversions Dataflow
      [Diagram: the PHP runtime holds Unicode strings and binary strings. The request enters through the HTTP input encoding and the response leaves through the HTTP output encoding; scripts are read using the script encoding, the filesystem is accessed using the filesystem encoding, streams use stream-specific encodings, and a runtime encoding applies inside PHP.]
  50. STRINGS
      • String literals are Unicode
      • String offsets work on code points
      $str = "…";      // 2 code points
      echo $str[1];    // result is …
      $str[0] = '…';   // full string is now …
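PHP 6 never shipped, so string offsets in released PHP versions still address bytes. The contrast with code point offsets can be sketched with mbstring; this assumes a UTF-8 source file and the mbstring extension, and the sample text is an illustrative choice rather than the string from the slide.

```php
<?php
// Assumption: UTF-8 source file, mbstring loaded; sample text is illustrative.
$str = "日本";                               // 2 code points, 3 bytes each in UTF-8

echo strlen($str), "\n";                     // 6  (bytes)
echo mb_strlen($str, 'UTF-8'), "\n";         // 2  (code points)
echo mb_substr($str, 1, 1, 'UTF-8'), "\n";   // 本 (the second code point)
echo bin2hex($str[1]), "\n";                 // 97 (a lone continuation byte)
```

The byte offset `$str[1]` returns a fragment of the first character, which is the behavior the Unicode string type was meant to replace.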
  51. IDENTIFIERS
      • Unicode identifiers are allowed
      class … {
          function …() { ... }
          function …() { ... }
          function …() { ... }
      }
      $… = array();
      $…['…'] = new …;
  52. FUNCTIONS
      • Functions understand Unicode text and apply appropriate rules
      • i.e. case manipulation
      $str = strtoupper("fußball");   // result is FUSSBALL
      $str = strtolower("ΣΕΛΛΆΣ");    // result is σελλάς
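The Unicode-aware `strtoupper()` and `strtolower()` on this slide belonged to the unreleased PHP 6. A comparable result can be sketched in released PHP through ICU's case transforms, assuming the intl extension is loaded; using `Transliterator` with the ICU transform IDs `Any-Upper` and `Any-Lower` is a substitute technique chosen for this sketch, not what the slide shows.

```php
<?php
// Assumption: ext/intl is available and this file is saved as UTF-8.
$upper = Transliterator::create('Any-Upper');
$lower = Transliterator::create('Any-Lower');

// Full case mapping: one character (ß) expands to two (SS).
$shout = $upper->transliterate('fußball');
echo $shout, "\n";   // FUSSBALL

// ICU lowercasing is context-aware, so the trailing sigma is expected
// to come out in its word-final form.
echo $lower->transliterate('ΣΕΛΛΆΣ'), "\n";
```

Both cases show why case conversion cannot be a byte-for-byte table lookup: the output length and the output letter depend on language rules and context.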
  53. TRANSLITERATION
      $names = "
          김, 국삼
          김, 명희
          たけだ, まさゆき
          おおはら, まなぶ
          Горбачев, Михаил
          Козырев, Андрей
          Καφετζόπουλος, Θεόφιλος
          Θεοδωράτου, Ελένη
      ";
      $r = strtotitle(str_transliterate($names, "Any", "Latin"));
      Gim, Gugsam
      Gim, Myeonghyi
      Takeda, Masayuki
      Oohara, Manabu
      Gorbačev, Mihail
      Kozyrev, Andrej
      Kaphetzópoulos, Theóphilos
      Theodōrátou, Elénē
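`str_transliterate()` and `strtotitle()` were PHP 6 functions and are not available in released PHP. A rough equivalent can be sketched with the intl extension's `Transliterator` class, assuming ext/intl is loaded; the two sample names and the `Any-Latin` transform ID are choices made for this sketch, and the exact romanization depends on the ICU version installed.

```php
<?php
// Assumption: ext/intl is available; output spelling depends on the ICU version.
$names = array('Горбачев, Михаил', 'Θεοδωράτου, Ελένη');

$toLatin = Transliterator::create('Any-Latin');

$latin = array();
foreach ($names as $name) {
    // Convert whatever script the name uses into Latin letters.
    $latin[] = $toLatin->transliterate($name);
}

echo implode("\n", $latin), "\n";
```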
  54. PECL/INTL
  55. FEATURES
      • Locales
      • Collation
      • Number and Currency Formatters
      • Date and Time Formatters
      • Time Zones
      • Calendars
      • Message Formatter
      • Choice Formatter
      • Resource Handler
      • Normalization
  56. COLLATION: sorting
      $strings = array(
          "cote", "côte", "Côte", "coté",
          "Coté", "côté", "Côté", "coter");
      $coll = new Collator("fr_FR");
      $coll->sort($strings);
      result: cote côte Côte coté Coté côté Côté coter
  57. NUMBER FORMATTING
      123456.789 in en_US
      • NumberFormatter::DECIMAL: 123456.789
      • NumberFormatter::CURRENCY: $123,456.79
      • NumberFormatter::ORDINAL: 123,457th
      • NumberFormatter::SPELLOUT: one hundred and twenty-three thousand, four hundred and fifty-six point seven eight nine
  58. MESSAGE FORMATTING: with modifiers
      $pattern = "On {0,date,full} you received
                  {1,number,#,##0.00} emails.";
      $args = array(time(), 1184);
      $fmt = new MessageFormatter('en_US', $pattern);
      echo $fmt->format($args);
      result: On Tuesday, November 22, 2007 you received 1,184.00 emails.
  59. POSTMORTEM
  60. WHAT WENT RIGHT
  61. 1. RAISED AWARENESS
      • Spoke at multiple conferences about the project
        ‣ including Unicode Conference
      • Shoved Unicode down people’s throats at every opportunity
  62. 2. CHOSE THE RIGHT TECH
      • ICU library had everything we needed
      • Low- and high-level functionality
      • Good support from its developers
  63. 3. UNIT TESTS
      • Every function handling strings had to be ported
      • Unit tests showed us where things broke
      • Also easy to track progress
  64. 4. PECL/INTL EXTENSION
      • A lot of i18n/l10n functionality in a self-contained extension
      • Ensuring that it worked with PHP 5
  65. 5. CODE SEGREGATION
      • Proof-of-concept developed by only a few people
      • Faster decisions, iteration, development
      • Things slowed down after merging into the main tree
        ‣ but was necessary to spread the workload
  66. WHAT WENT RONG
  67. 1. CHOICE OF UTF-16
      Thought to be the best compromise
  68. UTF-8
      • Backward-compatible with ASCII
      • Avoids complications of endianness
      • Dominant UTF encoding for the Web
      • Supported in a lot of libraries, APIs, etc
  69. UTF-8, BUT…
      • Variable-length encoding (1-4 bytes)
      • Uses 3 bytes for BMP code points > U+07FF
      • Not all byte sequences are valid
      • ICU did not have many UTF-8 APIs (at the time)
        ‣ on-the-fly conversion is necessary
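Two of the UTF-8 caveats above, the jump to 3 bytes past U+07FF and the existence of invalid byte sequences, can be checked directly. A minimal sketch, assuming a UTF-8 source file and the mbstring extension; the sample characters and the malformed sequence are illustrative choices.

```php
<?php
// Assumption: UTF-8 source file, mbstring loaded.
$twoBytes   = "é";          // U+00E9, below U+07FF
$threeBytes = "€";          // U+20AC, above U+07FF but still in the BMP

echo strlen($twoBytes), "\n";     // 2
echo strlen($threeBytes), "\n";   // 3

$valid   = "\xC3\xA9";      // well-formed encoding of é
$invalid = "\xC3\x28";      // lead byte followed by a non-continuation byte

var_dump(mb_check_encoding($valid, 'UTF-8'));     // bool(true)
var_dump(mb_check_encoding($invalid, 'UTF-8'));   // bool(false)
```

Code that accepts UTF-8 from the outside world therefore has to validate it, which is part of the cost the slide alludes to.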
  70. UTF-32
      • Uses exactly 4 bytes for each code point
        ‣ directly indexable!
  71. UTF-32, BUT...
      • Uses exactly 4 bytes for each code point
        ‣ 4x the size of UTF-8 for majority of languages
      • Only affordable by people from rich oil countries
      • Still needs conversion to UTF-16 when using ICU
      • Endianness
  72. UTF-16
      • “65,536 code points should be enough for everyone…”
      • 2 bytes to represent all of BMP (U+0 to U+FFFF)
        ‣ directly indexable in that plane
      • Internal encoding of ICU
  73. UTF-16, BUT…
      • Requires surrogate pairs for code points > U+FFFF
        ‣ still variable-length
      • 2x the size of UTF-8 for Latin, Greek, Cyrillic, Armenian, Hebrew, Arabic and other scripts
      • Can’t be manipulated by normal C string handling
      • Endianness
  74. CHOICE OF UTF-16
      • Thought that CJK languages would benefit from UTF-16
      • Primary driver was the ICU APIs
      • Problems: no direct indexing, many conversions
      • Would probably choose UTF-8, if started over
        ‣ no need for decoding/encoding on the periphery
        ‣ can be used by C-based libraries
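The size trade-offs discussed across the UTF-8, UTF-16 and UTF-32 slides can be measured by converting the same text into each encoding and counting bytes. A minimal sketch, assuming a UTF-8 source file and the mbstring extension; the sample strings are illustrative, and the little-endian variants are used only to get a fixed byte order.

```php
<?php
// Assumption: UTF-8 source file, mbstring loaded; samples are illustrative.
$samples = array(
    'Latin' => 'hello',
    'CJK'   => '日本語',
    'Emoji' => '😀',        // U+1F600, outside the BMP
);

$sizes = array();
foreach ($samples as $label => $text) {
    $sizes[$label] = array(
        'UTF-8'  => strlen($text),
        'UTF-16' => strlen(mb_convert_encoding($text, 'UTF-16LE', 'UTF-8')),
        'UTF-32' => strlen(mb_convert_encoding($text, 'UTF-32LE', 'UTF-8')),
    );
    printf("%-6s UTF-8: %2d  UTF-16: %2d  UTF-32: %2d\n",
        $label,
        $sizes[$label]['UTF-8'],
        $sizes[$label]['UTF-16'],
        $sizes[$label]['UTF-32']);
}
// Latin  UTF-8:  5  UTF-16: 10  UTF-32: 20
// CJK    UTF-8:  9  UTF-16:  6  UTF-32: 12
// Emoji  UTF-8:  4  UTF-16:  4  UTF-32:  4
```

The emoji takes 4 bytes in UTF-16 because it is stored as a surrogate pair, so UTF-16 gives up direct indexing just as UTF-8 does, while only the CJK sample is smaller than its UTF-8 form.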
  75. 2. CRUCIAL CODE LAGGED
  76. 2. CRUCIAL CODE LAGGED
      • PDO (and native DB extensions)
      • filter
      • proper substring search (collation-based)
      • some ext/standard functionality
  77. 3. LACK OF MINDSHARE
  78. 3. LACK OF MINDSHARE
      • Probably <10 people who understood the intricacies of Unicode and ICU
      • In the end, implementation deemed too technically difficult
      • People were bored converting large chunks of already working code
  79. 4. DELAYED NEW FEATURES
  80. 5. MEA CULPA
  81. END GAME
  82. RE-ORG
      • PHP 6 trunk was moved to a branch
      • PHP 5.4 became the trunk
      • Kick-started development of new features
      • Some clean-ups and improvements from 6 back-ported to 5.4
  83. PEOPLE MATTER
      • The project ran out of steam
        ‣ PHP development culture means that people work on what they’re interested in
        ‣ Clearly, the Unicode/i18n implementation wasn’t interesting enough to be viable
  84. INTERNALS
      “Because it’s nearly impossible to participate on Internals if your poo-throwing arm isn’t strong.” — @coates
  85. PERSISTENCE
      “Those with talent, competence, energy, and good ideas over a period of time tend to be the main drivers behind PHP development.” — me
  86. PERSISTENCE
      “Those with talent, competence, energy, and good ideas over a period of time and who outlast the rest tend to be the main drivers behind PHP development.” — me
  87. PROGRESS
      • No development on the Unicode branch
      • No visible effort to develop alternatives
  88. FUTURE?
      • Lighter, gentler implementations?
        ‣ mbstring is clunky
        ‣ separate Unicode String class would also be clunky
      • Open field for someone with a great idea, persistence, and people skills
  89. STEPPING AWAY
      • Invalidation of several man-years of hard work is discouraging
      • Did not feel like pushing the project up the hill again
      • Working on more fun stuff these days
  90. LESSONS LEARNED
      • Rewriting large existing code base is hard
      • Making people do tedious stuff is hard
        ‣ make it interesting for them (game-like)
      • Waiting for results of long iterations is hard
        ‣ short, results-oriented projects (if possible)
      • Stay committed
  91. FINITA LA COMEDIA
      http://joind.in/3349 • http://zazzle.com/andreiz
