The Good, the Bad, and the Ugly: What Happened to Unicode and PHP 6

Andrei Zmievski
Andrei ZmievskiDeveloper at Analog Coop
THE GOOD, THE BAD,
     AND THE UGLY




What Happened to Unicode and PHP 6


  Andrei Zmievski ❖ PHP Community Conference
ABOUT 1 YEAR AGO…

“Hello PHP 5.4, open for all new stuff.” — Jani
TIME OF DEATH

March 11, 11:09:37 2010 GMT
5 YEARS EARLIER…
       PHP 5.0.0
 released in July 2004
5 YEARS EARLIER…
         Firefox 1.0
released in November 2004
5 YEARS EARLIER…
             Chrome
not even a twinkle in Google’s eye
5 YEARS EARLIER…
      Unicode
    version 4.0.1
WHAT IS UNICODE?
  and why do I need it?
Unicode
    …is a computing industry
   standard for the consistent
  encoding, representation and
handling of text expressed in most
 of the world's writing systems.
Unicode

 provides a unique number
    for every character:

no matter what the platform,
no matter what the program,
no matter what the language.
UNICODE STANDARD

❖   Developed by the Unicode Consortium
❖   Covers all major living scripts
❖   Version 6.0 has 109,000+ characters
❖   Capacity for 1 million+ characters
❖   Widely supported by standards & industry
FEATURES


❖   Rich property set for every character

❖   Standard, unified encodings: UTF-8/16/32

❖   Extensive rules and documents for implementation

❖   Everything works, as long as everyone follows the rules
UNICODE != I18N



❖   Unicode simplifies development

❖   Unicode does not fix all internationalization problems
TIME FORMATS
❖   USA:    4:00  P.M.

❖   France: 16.00

❖   Japan: 1600  

❖   Don’t forget to identify the time zone
CURRENCY

❖   Symbol placement
                                     US  $12.34
❖   Symbol length (1-15)            12.345,67  €
❖   Number width                       12$34€
❖   Number precision:                   ¥123
    ‣   Spain, Japan   –0
    ‣   Mexico, Brazil – 2
    ‣   Egypt, Iraq    –3
SORTING
❖   Swedish:        z<ö
❖   German:         ö<z
❖   Dictionary:     öf < of
❖   Phonebook:      of < öf
❖   Upper-first:     A<a
❖   Lower-First:    a<A
❖   Contractions:   H < Z, but CH > CZ
❖   Expansions:     OE < Œ < OF
CLDR


❖   Hosted by Unicode Consortium

❖   Latest release: December 2010 (CLDR 1.9)

❖   516 locales, with 187 languages and 166 territories
WHY WEB NEEDS UNICODE
MOJIBAKE
MOJIBAKE

noun: phenomenon of incorrect, unreadable
characters shown when computer software
fails to render a text correctly according to
its associated character encoding.
MOJIBAKE
MOJIBAKE
I UNICODE,
YOU UNICODE
I | UNICODE,
YOU | UNICODE
Helgi
Helgi
Helgi
  Þormar
Þorbjörnsson
ISLTHORP
         or

Mr. SECURITY OVERRIDE
The Good, the Bad, and the Ugly: What Happened to Unicode and PHP 6
Joel
Joël
Joël
Joël
Joël
WEAKEST LINK
PRINCIPLE APPLIES
WHY PHP NEEDS UNICODE
PHP


❖   Essential Web platform

❖   Since Web needs Unicode…

❖   …so does PHP

❖   Do not want to be the weakest link
THE PROJECT
THE PROJECT


❖   Launched in February 2005 by me at Yahoo

❖   Small group from Yahoo, Zend, and PHP development
    community

❖   Design before code
UNICODE SUPPORT


❖   Everywhere:

    ‣   in the engine

    ‣   in the extensions

    ‣   in the API
UNICODE SUPPORT

❖   Native and complete

    ‣   no hacks

    ‣   no mishmash of external libraries

    ‣   no missing locales

    ‣   no language bias
ICU LIBRARY
       International Components for Unicode
✓   Unicode Character Properties       ✓ Formatting: Date/Time/
✓   Unicode String Class & text          Numbers/Currency
    processing                         ✓ Cultural Calendars & Time

✓   Text transformations                 Zones
    (normalization, upper/lowercase,   ✓ (230+) Locale handling
    etc)                               ✓ Resource Bundles
✓   Text Boundary Analysis             ✓ Transliterations (50+ script
    (Character/Word/Sentence Break       pairs)
    Iterators)
                                       ✓ Complex Text Layout for Arabic,
✓   Encoding Conversions for 500+        Hebrew, Indic & Thai
    legacy encodings
                                       ✓ International Domain Names
✓   Language-sensitive collation         and Web addresses
    (sorting) and searching
                                       ✓ Java model for locale-
✓   Unicode regular expressions          hierarchical resource bundles.
✓   Thread-safe                          Multiple locales can be used at a
                                         time
THE PROJECT


❖   Development was in a separate repository

❖   Merged into PHP tree once the basics were working

❖   Initially slated for 5.x

❖   Extensive changes necessitated a major version bump
PHP 6 = PHP 5 + Unicode
PHP 5 = PHP 6 - Unicode
Unicode = PHP 6 - PHP 5
PHP 6
PHP 6


   6
STRING TYPES

❖   Unicode

    ‣   text

    ‣   default for literals, etc

❖   Binary

    ‣   bytes

    ‣   everything ∉ Unicode type
Conversions
  Dataflow                                                  streams




                                                        s c
                                                      ng ifi
                                                    di ec
                                PHP




                                                  co -sp
                                              en am
                                Unicode




                                                 re
                                              st
                                strings
                          runtime encoding
request                                                       response
          HTTP input            binary       HTTP output
           encoding             strings       encoding
                          ng




                                          fil co
                            i




                                            es d
                         od




                                             en

                                              ys in
                       nc




                                                te g
                    te




                                                  m
                rip
               sc




             scripts                         filesystem
STRINGS
❖   String literals are Unicode

❖   String offsets work on code points

        $str  =  "   ";
 
   //  2  code  points
        echo  $str[1];
 
    //  result  is     
        $str[0]  =  ' ';
 //  full  string  is  now  
IDENTIFIERS
❖   Unicode identifiers are allowed

      class                       {  
            function  ᓱᓴᓐ  ᐊᒡᓗᒃᑲᖅ()
 
 {  ...  }
            function  !வா$  கேனச)

()    {  ...  }
            function  འ"ག་%ལ།()
 
 
         {  ...  }
      }  

      $              =  array();
      $            ['‫  =  ]'ַרעְיולּוחַ  ׁשָנָה‬new       ;
FUNCTIONS
❖ Functions understand Unicode text and apply
  appropriate rules
❖ i.e. case manipulation



    $str  =  strtoupper("fußball");
 //  result  is  FUSSBALL

    $str  =  strtolower("ΣΕΛΛΑΣ");
   //  result  is  σελλάς  
TRANSLITERATION
$names  =  "  
    김,  국삼  
    김,  명희  
         ,         
           ,          
    Горбачев,  Михаил  
    Козырев,  Андрей  
    Καφετζόπουλος,  Θεόφιλος  
    Θεοδωράτου,  Ελένη  
";  
$r  =  strtotitle(str_transliterate($names,  "Any",  "Latin"));

Gim,  Gugsam
Gim,  Myeonghyi
Takeda,  Masayuki
Oohara,  Manabu
Gorbačev,  Mihail
Kozyrev,  Andrej
Kaphetzópoulos,  Theóphilos
Theodōrátou,  Elénē
PECL/INTL
FEATURES
❖   Locales
❖   Collation
❖   Number and Currency Formatters
❖   Date and Time Formatters
❖   Time Zones
❖   Calendars
❖   Message Formatter
❖   Choice Formatter
❖   Resource Handler
❖   Normalization
COLLATION
sorting
$strings  =  array(  
                "cote",  "côte",  "Côte",  "coté",  
                "Coté",  "côté",  "Côté",  "coter");  
$coll  =  new  Collator("fr_FR");  
$coll-­‐>sort($strings);

result
cote
côte
Côte
coté
Coté
côté
Côté
coter
NUMBER FORMATTING
                123456.789  in  en_US
❖   NumberFormatter::DECIMAL
    123456.789

❖   NumberFormatter::CURRENCY
    $123,456.79

❖   NumberFormatter::ORDINAL
    123,457th

❖   NumberFormatter::SPELLOUT
    one hundred and twenty-three thousand, four hundred
    and fifty-six point seven eight nine
MESSAGE FORMATTING

with modifiers
$pattern  =  “On  {0,date,full}  you  received
                        {1,number,#,##0.00}  emails.”;
$args  =  array(time(),  1184);  
$fmt  =  new  MessageFormatter(‘en_US’,  $pattern);
echo  $fmt-­‐>format($args);

result
On  Tuesday,  November  22,  2007  you  received  
1,184.00  emails.
POSTMORTEM
WHAT WENT RIGHT
1. RAISED AWARENESS


❖   Spoke at multiple conferences about the project

    ‣   including Unicode Conference

❖   Shoved Unicode down people’s throats at every
    opportunity
2. CHOSE THE RIGHT TECH


❖   ICU library had everything we needed

❖   Low- and high-level functionality

❖   Good support from its developers
3. UNIT TESTS


❖   Every function handling strings had to be ported

❖   Unit tests showed us where things broke

❖   Also easy to track progress
4. PECL/INTL EXTENSION


❖   A lot of i18n/l10n functionality in a self-contained
    extension

❖   Ensuring that it worked with PHP 5
5. CODE SEGREGATION


❖   Proof-of-concept developed by only a few people

❖   Faster decisions, iteration, development

❖   Things slowed down after merging into the main tree

    ‣   but was necessary to spread the workload
WHAT WENT RONG
1. CHOICE OF UTF-16
Thought to be the best compromise
UTF-8


❖   Backward-compatible with ASCII

❖   Avoids complications of endianness

❖   Dominant UTF encoding for the Web

❖   Supported in a lot of libraries, APIs, etc
UTF-8, BUT…

❖   Variable-length encoding (1-4 bytes)

❖   Uses 3 bytes for BMP code points > U+07FF

❖   Not all byte sequences are valid

❖   ICU did not have many UTF-8 APIs (at the time)

    ‣   on-the-fly conversion is necessary
UTF-32



❖   Uses exactly 4 bytes for each code point

    ‣   directly indexable!
UTF-32, BUT...

❖   Uses exactly 4 bytes for each code point

    ‣   4x the size of UTF-8 for majority of languages

❖   Only affordable by people from rich oil countries

❖   Still needs conversion to UTF-16 when using ICU

❖   Endianness
UTF-16


❖   “65,536 code points should be enough for everyone…”

❖   2 bytes to represent all of BMP (U+0 to U+FFFF)

    ‣   directly indexable in that plane

❖   Internal encoding of ICU
UTF-16, BUT…

❖   Requires surrogate pairs for code points > U+FFFF

    ‣   still variable-length

❖   2x the size of UTF-8 for Latin, Greek, Cyrillic, Armenian,
    Hebrew, Arabic and other scripts

❖   Can’t be manipulated by normal C string handling

❖   Endianness
CHOICE OF UTF-16

❖   Thought that CJK languages would benefit from UTF-16

❖   Primary driver was the ICU APIs

❖   Problems: no direct indexing, many conversions

❖   Would probably choose UTF-8, if started over

    ‣   no need for decoding/encoding on the periphery

    ‣   can be used by C-based libraries
2. CRUCIAL CODE LAGGED
2. CRUCIAL CODE LAGGED


❖   PDO (and native DB extensions)

❖   filter

❖   proper substring search (collation-based)

❖   some ext/standard functionality
3. LACK OF MINDSHARE
3. LACK OF MINDSHARE

❖   Probably <10 people who understood the intricacies of
    the Unicode and ICU

❖   In the end, implementation deemed too technically
    difficult

❖   People were bored converting large chunks of already
    working code
4. DELAYED NEW FEATURES
5. MEA CULPA
END GAME
RE-ORG

❖   PHP 6 trunk was moved to a branch

❖   PHP 5.4 became the trunk

❖   Kick-started development of new features

❖   Some clean-ups and improvements from 6 back-ported
    to 5.4
PEOPLE MATTER


❖   The project ran out of steam

    ‣   PHP development culture means that people work on
        what they’re interested in

    ‣   Clearly, the Unicode/i18n implementation wasn’t
        interesting enough to be viable
INTERNALS

“Because it’s nearly impossible to
participate on Internals if your
poo-throwing arm isn’t strong.”
                       — @coates
PERSISTENCE

“Those with talent, competence, energy,
and good ideas over a period of time
tend to be the main drivers behind PHP
development.”
                                   — me
PERSISTENCE

“Those with talent, competence, energy,
and good ideas over a period of time
        and who outlast the rest
tend to be the main drivers behind PHP
development.”
                                  — me
PROGRESS



❖   No development on the Unicode branch

❖   No visible effort to develop alternatives
FUTURE?

❖   Lighter, gentler implementations?

    ‣   mbstring is clunky

    ‣   separate Unicode String class would also be clunky

❖   Open field for someone with a great idea, persistence,
    and people skills
STEPPING AWAY


❖   Invalidation of several man-years of hard work is
    discouraging

❖   Did not feel like pushing the project up the hill again

❖   Working on more fun stuff these days
LESSONS LEARNED

❖   Rewriting large existing code base is hard

❖   Making people do tedious stuff is hard

    ‣   make it interesting for them (game-like)

❖   Waiting for results of long iterations is hard

    ‣   short, results-oriented projects (if possible)

❖   Stay committed
FINITA LA COMEDIA




http://joind.in/3349 ❖ http://zazzle.com/andreiz
1 of 92

Recommended

What is dotnet (.NET) ? by
What is dotnet (.NET) ?What is dotnet (.NET) ?
What is dotnet (.NET) ?Talha Shahzad
480 views25 slides
Developing Hybrid Applications with IONIC by
Developing Hybrid Applications with IONICDeveloping Hybrid Applications with IONIC
Developing Hybrid Applications with IONICFuat Buğra AYDIN
920 views57 slides
[2019.04] 쿠버네티스 기반 하이퍼레저 패브릭 네트워크 구축하기 by
[2019.04] 쿠버네티스 기반 하이퍼레저 패브릭 네트워크 구축하기[2019.04] 쿠버네티스 기반 하이퍼레저 패브릭 네트워크 구축하기
[2019.04] 쿠버네티스 기반 하이퍼레저 패브릭 네트워크 구축하기Hyperledger Korea User Group
1.4K views22 slides
Developing rich SIP applications with SIPSIMPLE SDK by
Developing rich SIP applications with SIPSIMPLE SDKDeveloping rich SIP applications with SIPSIMPLE SDK
Developing rich SIP applications with SIPSIMPLE SDKSaúl Ibarra Corretgé
2.8K views24 slides
Docker: From Zero to Hero by
Docker: From Zero to HeroDocker: From Zero to Hero
Docker: From Zero to Herofazalraja
1.9K views21 slides
.Net Core by
.Net Core.Net Core
.Net CoreBertrand Le Roy
10.1K views21 slides

More Related Content

What's hot

Odoo - From v7 to v8: the new api by
Odoo - From v7 to v8: the new apiOdoo - From v7 to v8: the new api
Odoo - From v7 to v8: the new apiOdoo
61.8K views30 slides
PHP 와 MySQL을 이용한 게임 랭킹 구축하기 by
PHP 와 MySQL을 이용한 게임 랭킹 구축하기PHP 와 MySQL을 이용한 게임 랭킹 구축하기
PHP 와 MySQL을 이용한 게임 랭킹 구축하기Yo-Chang Song
12.2K views31 slides
Node.js Express by
Node.js  ExpressNode.js  Express
Node.js ExpressEyal Vardi
4.8K views26 slides
How to Import data into OpenERP V7 by
How to Import data into OpenERP V7How to Import data into OpenERP V7
How to Import data into OpenERP V7Audaxis
38K views14 slides
Introduction to .NET Core by
Introduction to .NET CoreIntroduction to .NET Core
Introduction to .NET CoreMarco Parenzan
3.6K views65 slides
Nodejs presentation by
Nodejs presentationNodejs presentation
Nodejs presentationArvind Devaraj
8.8K views17 slides

What's hot(20)

Odoo - From v7 to v8: the new api by Odoo
Odoo - From v7 to v8: the new apiOdoo - From v7 to v8: the new api
Odoo - From v7 to v8: the new api
Odoo61.8K views
PHP 와 MySQL을 이용한 게임 랭킹 구축하기 by Yo-Chang Song
PHP 와 MySQL을 이용한 게임 랭킹 구축하기PHP 와 MySQL을 이용한 게임 랭킹 구축하기
PHP 와 MySQL을 이용한 게임 랭킹 구축하기
Yo-Chang Song12.2K views
Node.js Express by Eyal Vardi
Node.js  ExpressNode.js  Express
Node.js Express
Eyal Vardi4.8K views
How to Import data into OpenERP V7 by Audaxis
How to Import data into OpenERP V7How to Import data into OpenERP V7
How to Import data into OpenERP V7
Audaxis38K views
Introduction to .NET Core by Marco Parenzan
Introduction to .NET CoreIntroduction to .NET Core
Introduction to .NET Core
Marco Parenzan3.6K views
07 윈도우 핸들 by jaypi Ko
07 윈도우 핸들07 윈도우 핸들
07 윈도우 핸들
jaypi Ko682 views
Why and How to Use Virtual DOM by Daiwei Lu
Why and How to Use Virtual DOMWhy and How to Use Virtual DOM
Why and How to Use Virtual DOM
Daiwei Lu1.3K views
Introducing ASP.NET Core 2.0 by Steven Smith
Introducing ASP.NET Core 2.0Introducing ASP.NET Core 2.0
Introducing ASP.NET Core 2.0
Steven Smith4.9K views
An Introduction to WebAssembly by Daniel Budden
An Introduction to WebAssemblyAn Introduction to WebAssembly
An Introduction to WebAssembly
Daniel Budden3.2K views
Asp.Net Core MVC , Razor page , Entity Framework Core by mohamed elshafey
Asp.Net Core MVC , Razor page , Entity Framework CoreAsp.Net Core MVC , Razor page , Entity Framework Core
Asp.Net Core MVC , Razor page , Entity Framework Core
mohamed elshafey372 views
WebRTC + Socket.io: building a skype-like video chat with native javascript by Michele Di Salvatore
WebRTC + Socket.io: building a skype-like video chat with native javascriptWebRTC + Socket.io: building a skype-like video chat with native javascript
WebRTC + Socket.io: building a skype-like video chat with native javascript
Michele Di Salvatore26.7K views

Viewers also liked

Php Unicode I18n by
Php Unicode I18nPhp Unicode I18n
Php Unicode I18nEnrique Place
4K views81 slides
PHP 7 – What changed internally? by
PHP 7 – What changed internally?PHP 7 – What changed internally?
PHP 7 – What changed internally?Nikita Popov
13.2K views122 slides
W3 conf hill-html5-security-realities by
W3 conf hill-html5-security-realitiesW3 conf hill-html5-security-realities
W3 conf hill-html5-security-realitiesBrad Hill
10.4K views60 slides
Php Presentation by
Php PresentationPhp Presentation
Php PresentationManish Bothra
95.8K views51 slides
HTML 5 Accessibility by
HTML 5 AccessibilityHTML 5 Accessibility
HTML 5 AccessibilitySteven Faulkner
10.4K views15 slides
PHP 5.3 And PHP 6 A Look Ahead by
PHP 5.3 And PHP 6 A Look AheadPHP 5.3 And PHP 6 A Look Ahead
PHP 5.3 And PHP 6 A Look Aheadthinkphp
4.1K views53 slides

Viewers also liked(20)

PHP 7 – What changed internally? by Nikita Popov
PHP 7 – What changed internally?PHP 7 – What changed internally?
PHP 7 – What changed internally?
Nikita Popov13.2K views
W3 conf hill-html5-security-realities by Brad Hill
W3 conf hill-html5-security-realitiesW3 conf hill-html5-security-realities
W3 conf hill-html5-security-realities
Brad Hill10.4K views
PHP 5.3 And PHP 6 A Look Ahead by thinkphp
PHP 5.3 And PHP 6 A Look AheadPHP 5.3 And PHP 6 A Look Ahead
PHP 5.3 And PHP 6 A Look Ahead
thinkphp4.1K views
Introduction to php 6 by pctechnology
Introduction to php   6Introduction to php   6
Introduction to php 6
pctechnology611 views
Diapositiva reclutamiento y seleccion leymar jimenez by Leymar jimenez
Diapositiva reclutamiento y seleccion leymar jimenezDiapositiva reclutamiento y seleccion leymar jimenez
Diapositiva reclutamiento y seleccion leymar jimenez
Leymar jimenez285 views
Consórcio realiza consta da relação das administradoras de consórcios autoriz... by Jessica R.
Consórcio realiza consta da relação das administradoras de consórcios autoriz...Consórcio realiza consta da relação das administradoras de consórcios autoriz...
Consórcio realiza consta da relação das administradoras de consórcios autoriz...
Jessica R.5.4K views
Prophet's Sunnah, tagalog by Arab Muslim
Prophet's Sunnah, tagalogProphet's Sunnah, tagalog
Prophet's Sunnah, tagalog
Arab Muslim1.8K views
Perfil sociodemográfico de los internautas 2013 - ONTSI by Aritz Pérez
Perfil sociodemográfico de los internautas 2013 - ONTSIPerfil sociodemográfico de los internautas 2013 - ONTSI
Perfil sociodemográfico de los internautas 2013 - ONTSI
Aritz Pérez521 views
Einführung ins eCampaigning by more onion
Einführung ins eCampaigning Einführung ins eCampaigning
Einführung ins eCampaigning
more onion526 views
HTML5 on Mobile by Adam Lu
HTML5 on MobileHTML5 on Mobile
HTML5 on Mobile
Adam Lu8.4K views
Vloge (funkcije) umetnosti v družbi (2) 2.del by sKastelic
Vloge (funkcije) umetnosti v družbi (2) 2.delVloge (funkcije) umetnosti v družbi (2) 2.del
Vloge (funkcije) umetnosti v družbi (2) 2.del
sKastelic1.3K views
First Beat Media - Rad od kuće #tnt3 by SICEF
First Beat Media - Rad od kuće #tnt3First Beat Media - Rad od kuće #tnt3
First Beat Media - Rad od kuće #tnt3
SICEF739 views

Similar to The Good, the Bad, and the Ugly: What Happened to Unicode and PHP 6

Unicode & PHP6 by
Unicode & PHP6Unicode & PHP6
Unicode & PHP6Karsten Dambekalns
4.1K views60 slides
Using unicode with php by
Using unicode with phpUsing unicode with php
Using unicode with phpElizabeth Smith
8K views51 slides
Using unicode with php by
Using unicode with phpUsing unicode with php
Using unicode with phpElizabeth Smith
15.3K views50 slides
How To Build And Launch A Successful Globalized App From Day One Or All The ... by
How To Build And Launch A Successful Globalized App From Day One  Or All The ...How To Build And Launch A Successful Globalized App From Day One  Or All The ...
How To Build And Launch A Successful Globalized App From Day One Or All The ...agileware
638 views46 slides
Uncdtalk by
UncdtalkUncdtalk
UncdtalkBilal Maqbool ツ
266 views52 slides
Perl5 VS JSON by
Perl5 VS JSONPerl5 VS JSON
Perl5 VS JSONkarupanerura
1.3K views33 slides

Similar to The Good, the Bad, and the Ugly: What Happened to Unicode and PHP 6(20)

How To Build And Launch A Successful Globalized App From Day One Or All The ... by agileware
How To Build And Launch A Successful Globalized App From Day One  Or All The ...How To Build And Launch A Successful Globalized App From Day One  Or All The ...
How To Build And Launch A Successful Globalized App From Day One Or All The ...
agileware638 views
Internationlization by Tuan Ngo
InternationlizationInternationlization
Internationlization
Tuan Ngo582 views
Unicode 101 by davidfstr
Unicode 101Unicode 101
Unicode 101
davidfstr752 views
Comprehasive Exam - IT by guest6ddfb98
Comprehasive Exam - ITComprehasive Exam - IT
Comprehasive Exam - IT
guest6ddfb98603 views
PHP for Grown-ups by Manuel Lemos
PHP for Grown-upsPHP for Grown-ups
PHP for Grown-ups
Manuel Lemos1.5K views
Exploring Challenges in Mining Historical Text by Beatrice Alex
Exploring Challenges in Mining Historical Text Exploring Challenges in Mining Historical Text
Exploring Challenges in Mining Historical Text
Beatrice Alex993 views
Abap slide class4 unicode-plusfiles by Milind Patil
Abap slide class4 unicode-plusfilesAbap slide class4 unicode-plusfiles
Abap slide class4 unicode-plusfiles
Milind Patil2.5K views
COMPUTER LANGUAGES AND THERE DIFFERENCE by Pavan Kalyan
COMPUTER LANGUAGES AND THERE DIFFERENCE COMPUTER LANGUAGES AND THERE DIFFERENCE
COMPUTER LANGUAGES AND THERE DIFFERENCE
Pavan Kalyan101 views
Unicode for Small Children (and Children at Heart) by Feihong Hsu
Unicode for Small Children (and Children at Heart)Unicode for Small Children (and Children at Heart)
Unicode for Small Children (and Children at Heart)
Feihong Hsu7.2K views
Lect 1. introduction to programming languages by Varun Garg
Lect 1. introduction to programming languagesLect 1. introduction to programming languages
Lect 1. introduction to programming languages
Varun Garg235.2K views
Desert Code Camp 2014: C#, the best programming language by James Montemagno
Desert Code Camp 2014: C#, the best programming languageDesert Code Camp 2014: C#, the best programming language
Desert Code Camp 2014: C#, the best programming language
James Montemagno976 views

Recently uploaded

DALI Basics Course 2023 by
DALI Basics Course  2023DALI Basics Course  2023
DALI Basics Course 2023Ivory Egg
16 views12 slides
Kyo - Functional Scala 2023.pdf by
Kyo - Functional Scala 2023.pdfKyo - Functional Scala 2023.pdf
Kyo - Functional Scala 2023.pdfFlavio W. Brasil
298 views92 slides
Business Analyst Series 2023 - Week 3 Session 5 by
Business Analyst Series 2023 -  Week 3 Session 5Business Analyst Series 2023 -  Week 3 Session 5
Business Analyst Series 2023 - Week 3 Session 5DianaGray10
237 views20 slides
Java Platform Approach 1.0 - Picnic Meetup by
Java Platform Approach 1.0 - Picnic MeetupJava Platform Approach 1.0 - Picnic Meetup
Java Platform Approach 1.0 - Picnic MeetupRick Ossendrijver
27 views39 slides
Info Session November 2023.pdf by
Info Session November 2023.pdfInfo Session November 2023.pdf
Info Session November 2023.pdfAleksandraKoprivica4
11 views15 slides
Web Dev - 1 PPT.pdf by
Web Dev - 1 PPT.pdfWeb Dev - 1 PPT.pdf
Web Dev - 1 PPT.pdfgdsczhcet
60 views45 slides

Recently uploaded(20)

DALI Basics Course 2023 by Ivory Egg
DALI Basics Course  2023DALI Basics Course  2023
DALI Basics Course 2023
Ivory Egg16 views
Business Analyst Series 2023 - Week 3 Session 5 by DianaGray10
Business Analyst Series 2023 -  Week 3 Session 5Business Analyst Series 2023 -  Week 3 Session 5
Business Analyst Series 2023 - Week 3 Session 5
DianaGray10237 views
Web Dev - 1 PPT.pdf by gdsczhcet
Web Dev - 1 PPT.pdfWeb Dev - 1 PPT.pdf
Web Dev - 1 PPT.pdf
gdsczhcet60 views
handbook for web 3 adoption.pdf by Liveplex
handbook for web 3 adoption.pdfhandbook for web 3 adoption.pdf
handbook for web 3 adoption.pdf
Liveplex22 views
Lilypad @ Labweek, Istanbul, 2023.pdf by Ally339821
Lilypad @ Labweek, Istanbul, 2023.pdfLilypad @ Labweek, Istanbul, 2023.pdf
Lilypad @ Labweek, Istanbul, 2023.pdf
Ally3398219 views
1st parposal presentation.pptx by i238212
1st parposal presentation.pptx1st parposal presentation.pptx
1st parposal presentation.pptx
i2382129 views
Automating a World-Class Technology Conference; Behind the Scenes of CiscoLive by Network Automation Forum
Automating a World-Class Technology Conference; Behind the Scenes of CiscoLiveAutomating a World-Class Technology Conference; Behind the Scenes of CiscoLive
Automating a World-Class Technology Conference; Behind the Scenes of CiscoLive
Perth MeetUp November 2023 by Michael Price
Perth MeetUp November 2023 Perth MeetUp November 2023
Perth MeetUp November 2023
Michael Price19 views
Special_edition_innovator_2023.pdf by WillDavies22
Special_edition_innovator_2023.pdfSpecial_edition_innovator_2023.pdf
Special_edition_innovator_2023.pdf
WillDavies2217 views
Five Things You SHOULD Know About Postman by Postman
Five Things You SHOULD Know About PostmanFive Things You SHOULD Know About Postman
Five Things You SHOULD Know About Postman
Postman30 views
From chaos to control: Managing migrations and Microsoft 365 with ShareGate! by sammart93
From chaos to control: Managing migrations and Microsoft 365 with ShareGate!From chaos to control: Managing migrations and Microsoft 365 with ShareGate!
From chaos to control: Managing migrations and Microsoft 365 with ShareGate!
sammart939 views
Empathic Computing: Delivering the Potential of the Metaverse by Mark Billinghurst
Empathic Computing: Delivering  the Potential of the MetaverseEmpathic Computing: Delivering  the Potential of the Metaverse
Empathic Computing: Delivering the Potential of the Metaverse
Mark Billinghurst476 views
Unit 1_Lecture 2_Physical Design of IoT.pdf by StephenTec
Unit 1_Lecture 2_Physical Design of IoT.pdfUnit 1_Lecture 2_Physical Design of IoT.pdf
Unit 1_Lecture 2_Physical Design of IoT.pdf
StephenTec12 views

The Good, the Bad, and the Ugly: What Happened to Unicode and PHP 6

  • 1. THE GOOD, THE BAD, AND THE UGLY What Happened to Unicode and PHP 6 Andrei Zmievski ❖ PHP Community Conference
  • 2. ABOUT 1 YEAR AGO… “Hello PHP 5.4, open for all new stuff.” — Jani
  • 4. 5 YEARS EARLIER… PHP 5.0.0 released in July 2004
  • 5. 5 YEARS EARLIER… Firefox 1.0 released in November 2004
  • 6. 5 YEARS EARLIER… Chrome not even a twinkle in Google’s eye
  • 7. 5 YEARS EARLIER… Unicode version 4.0.1
  • 8. WHAT IS UNICODE? and why do I need it?
  • 9. Unicode …is a computing industry standard for the consistent encoding, representation and handling of text expressed in most of the world's writing systems.
  • 10. Unicode provides a unique number for every character: no matter what the platform, no matter what the program, no matter what the language.
  • 11. UNICODE STANDARD ❖ Developed by the Unicode Consortium ❖ Covers all major living scripts ❖ Version 6.0 has 109,000+ characters ❖ Capacity for 1 million+ characters ❖ Widely supported by standards & industry
  • 12. FEATURES ❖ Rich property set for every character ❖ Standard, unified encodings: UTF-8/16/32 ❖ Extensive rules and documents for implementation ❖ Everything works, as long as everyone follows the rules
  • 13. UNICODE != I18N ❖ Unicode simplifies development ❖ Unicode does not fix all internationalization problems
  • 14. TIME FORMATS ❖ USA: 4:00  P.M. ❖ France: 16.00 ❖ Japan: 1600   ❖ Don’t forget to identify the time zone
  • 15. CURRENCY ❖ Symbol placement US  $12.34 ❖ Symbol length (1-15) 12.345,67  € ❖ Number width 12$34€ ❖ Number precision: ¥123 ‣ Spain, Japan –0 ‣ Mexico, Brazil – 2 ‣ Egypt, Iraq –3
  • 16. SORTING ❖ Swedish: z<ö ❖ German: ö<z ❖ Dictionary: öf < of ❖ Phonebook: of < öf ❖ Upper-first: A<a ❖ Lower-First: a<A ❖ Contractions: H < Z, but CH > CZ ❖ Expansions: OE < Œ < OF
  • 17. CLDR ❖ Hosted by Unicode Consortium ❖ Latest release: December 2010 (CLDR 1.9) ❖ 516 locales, with 187 languages and 166 territories
  • 18. WHY WEB NEEDS UNICODE
  • 20. MOJIBAKE noun: phenomenon of incorrect, unreadable characters shown when computer software fails to render a text correctly according to its associated character encoding.
  • 24. I | UNICODE, YOU | UNICODE
  • 25. Helgi
  • 26. Helgi
  • 28. ISLTHORP or Mr. SECURITY OVERRIDE
  • 30. Joel
  • 31. Joël
  • 32. Joël
  • 36. WHY PHP NEEDS UNICODE
  • 37. PHP ❖ Essential Web platform ❖ Since Web needs Unicode… ❖ …so does PHP ❖ Do not want to be the weakest link
  • 39. THE PROJECT ❖ Launched in February 2005 by me at Yahoo ❖ Small group from Yahoo, Zend, and PHP development community ❖ Design before code
  • 40. UNICODE SUPPORT ❖ Everywhere: ‣ in the engine ‣ in the extensions ‣ in the API
  • 41. UNICODE SUPPORT ❖ Native and complete ‣ no hacks ‣ no mishmash of external libraries ‣ no missing locales ‣ no language bias
  • 42. ICU LIBRARY International Components for Unicode ✓ Unicode Character Properties ✓ Formatting: Date/Time/ ✓ Unicode String Class & text Numbers/Currency processing ✓ Cultural Calendars & Time ✓ Text transformations Zones (normalization, upper/lowercase, ✓ (230+) Locale handling etc) ✓ Resource Bundles ✓ Text Boundary Analysis ✓ Transliterations (50+ script (Character/Word/Sentence Break pairs) Iterators) ✓ Complex Text Layout for Arabic, ✓ Encoding Conversions for 500+ Hebrew, Indic & Thai legacy encodings ✓ International Domain Names ✓ Language-sensitive collation and Web addresses (sorting) and searching ✓ Java model for locale- ✓ Unicode regular expressions hierarchical resource bundles. ✓ Thread-safe Multiple locales can be used at a time
  • 43. THE PROJECT ❖ Development was in a separate repository ❖ Merged into PHP tree once the basics were working ❖ Initially slated for 5.x ❖ Extensive changes necessitated a major version bump
  • 44. PHP 6 = PHP 5 + Unicode
  • 45. PHP 5 = PHP 6 - Unicode
  • 46. Unicode = PHP 6 - PHP 5
  • 47. PHP 6
  • 48. PHP 6 6
  • 49. STRING TYPES ❖ Unicode ‣ text ‣ default for literals, etc ❖ Binary ‣ bytes ‣ everything ∉ Unicode type
  • 50. Conversions Dataflow streams s c ng ifi di ec PHP co -sp en am Unicode re st strings runtime encoding request response HTTP input binary HTTP output encoding strings encoding ng fil co i es d od en ys in nc te g te m rip sc scripts filesystem
  • 51. STRINGS ❖ String literals are Unicode ❖ String offsets work on code points $str  =  " "; //  2  code  points echo  $str[1]; //  result  is     $str[0]  =  ' '; //  full  string  is  now  
  • 52. IDENTIFIERS ❖ Unicode identifiers are allowed class    {        function  ᓱᓴᓐ  ᐊᒡᓗᒃᑲᖅ() {  ...  }      function  !வா$  கேனச) ()    {  ...  }      function  འ"ག་%ལ།()        {  ...  } }   $  =  array(); $ ['‫  =  ]'ַרעְיולּוחַ  ׁשָנָה‬new   ;
  • 53. FUNCTIONS ❖ Functions understand Unicode text and apply appropriate rules ❖ i.e. case manipulation $str  =  strtoupper("fußball"); //  result  is  FUSSBALL $str  =  strtolower("ΣΕΛΛΑΣ"); //  result  is  σελλάς  
  • 54. TRANSLITERATION $names  =  "      김,  국삼      김,  명희       ,         ,        Горбачев,  Михаил      Козырев,  Андрей      Καφετζόπουλος,  Θεόφιλος      Θεοδωράτου,  Ελένη   ";   $r  =  strtotitle(str_transliterate($names,  "Any",  "Latin")); Gim,  Gugsam Gim,  Myeonghyi Takeda,  Masayuki Oohara,  Manabu Gorbačev,  Mihail Kozyrev,  Andrej Kaphetzópoulos,  Theóphilos Theodōrátou,  Elénē
  • 56. FEATURES ❖ Locales ❖ Collation ❖ Number and Currency Formatters ❖ Date and Time Formatters ❖ Time Zones ❖ Calendars ❖ Message Formatter ❖ Choice Formatter ❖ Resource Handler ❖ Normalization
  • 57. COLLATION sorting $strings  =  array(                  "cote",  "côte",  "Côte",  "coté",                  "Coté",  "côté",  "Côté",  "coter");   $coll  =  new  Collator("fr_FR");   $coll-­‐>sort($strings); result cote côte Côte coté Coté côté Côté coter
  • 58. NUMBER FORMATTING 123456.789  in  en_US ❖ NumberFormatter::DECIMAL 123456.789 ❖ NumberFormatter::CURRENCY $123,456.79 ❖ NumberFormatter::ORDINAL 123,457th ❖ NumberFormatter::SPELLOUT one hundred and twenty-three thousand, four hundred and fifty-six point seven eight nine
  • 59. MESSAGE FORMATTING with modifiers $pattern  =  “On  {0,date,full}  you  received                        {1,number,#,##0.00}  emails.”; $args  =  array(time(),  1184);   $fmt  =  new  MessageFormatter(‘en_US’,  $pattern); echo  $fmt-­‐>format($args); result On  Tuesday,  November  22,  2007  you  received   1,184.00  emails.
  • 62. 1. RAISED AWARENESS ❖ Spoke at multiple conferences about the project ‣ including Unicode Conference ❖ Shoved Unicode down people’s throats at every opportunity
  • 63. 2. CHOSE THE RIGHT TECH ❖ ICU library had everything we needed ❖ Low- and high-level functionality ❖ Good support from its developers
  • 64. 3. UNIT TESTS ❖ Every function handling strings had to be ported ❖ Unit tests showed us where things broke ❖ Also easy to track progress
  • 65. 4. PECL/INTL EXTENSION ❖ A lot of i18n/l10n functionality in a self-contained extension ❖ Ensuring that it worked with PHP 5
  • 66. 5. CODE SEGREGATION ❖ Proof-of-concept developed by only a few people ❖ Faster decisions, iteration, development ❖ Things slowed down after merging into the main tree ‣ but was necessary to spread the workload
  • 68. 1. CHOICE OF UTF-16 Thought to be the best compromise
  • 69. UTF-8 ❖ Backward-compatible with ASCII ❖ Avoids complications of endianness ❖ Dominant UTF encoding for the Web ❖ Supported in a lot of libraries, APIs, etc
  • 70. UTF-8, BUT… ❖ Variable-length encoding (1-4 bytes) ❖ Uses 3 bytes for BMP code points > U+07FF ❖ Not all byte sequences are valid ❖ ICU did not have many UTF-8 APIs (at the time) ‣ on-the-fly conversion is necessary
  • 71. UTF-32 ❖ Uses exactly 4 bytes for each code point ‣ directly indexable!
  • 72. UTF-32, BUT... ❖ Uses exactly 4 bytes for each code point ‣ 4x the size of UTF-8 for majority of languages ❖ Only affordable by people from rich oil countries ❖ Still needs conversion to UTF-16 when using ICU ❖ Endianness
  • 73. UTF-16 ❖ “65,536 code points should be enough for everyone…” ❖ 2 bytes to represent all of BMP (U+0 to U+FFFF) ‣ directly indexable in that plane ❖ Internal encoding of ICU
  • 74. UTF-16, BUT… ❖ Requires surrogate pairs for code points > U+FFFF ‣ still variable-length ❖ 2x the size of UTF-8 for Latin, Greek, Cyrillic, Armenian, Hebrew, Arabic and other scripts ❖ Can’t be manipulated by normal C string handling ❖ Endianness
  • 75. CHOICE OF UTF-16 ❖ Thought that CJK languages would benefit from UTF-16 ❖ Primary driver was the ICU APIs ❖ Problems: no direct indexing, many conversions ❖ Would probably choose UTF-8, if started over ‣ no need for decoding/encoding on the periphery ‣ can be used by C-based libraries
  • 76. 2. CRUCIAL CODE LAGGED
  • 77. 2. CRUCIAL CODE LAGGED ❖ PDO (and native DB extensions) ❖ filter ❖ proper substring search (collation-based) ❖ some ext/standard functionality
  • 78. 3. LACK OF MINDSHARE
  • 79. 3. LACK OF MINDSHARE ❖ Probably <10 people who understood the intricacies of the Unicode and ICU ❖ In the end, implementation deemed too technically difficult ❖ People were bored converting large chunks of already working code
  • 80. 4. DELAYED NEW FEATURES
  • 83. RE-ORG ❖ PHP 6 trunk was moved to a branch ❖ PHP 5.4 became the trunk ❖ Kick-started development of new features ❖ Some clean-ups and improvements from 6 back-ported to 5.4
  • 84. PEOPLE MATTER ❖ The project ran out of steam ‣ PHP development culture means that people work on what they’re interested in ‣ Clearly, the Unicode/i18n implementation wasn’t interesting enough to be viable
  • 85. INTERNALS “Because it’s nearly impossible to participate on Internals if your poo-throwing arm isn’t strong.” — @coates
  • 86. PERSISTENCE “Those with talent, competence, energy, and good ideas over a period of time tend to be the main drivers behind PHP development.” — me
  • 87. PERSISTENCE “Those with talent, competence, energy, and good ideas over a period of time and who outlast the rest tend to be the main drivers behind PHP development.” — me
  • 88. PROGRESS ❖ No development on the Unicode branch ❖ No visible effort to develop alternatives
  • 89. FUTURE? ❖ Lighter, gentler implementations? ‣ mbstring is clunky ‣ separate Unicode String class would also be clunky ❖ Open field for someone with a great idea, persistence, and people skills
  • 90. STEPPING AWAY ❖ Invalidation of several man-years of hard work is discouraging ❖ Did not feel like pushing the project up the hill again ❖ Working on more fun stuff these days
  • 91. LESSONS LEARNED ❖ Rewriting large existing code base is hard ❖ Making people do tedious stuff is hard ‣ make it interesting for them (game-like) ❖ Waiting for results of long iterations is hard ‣ short, results-oriented projects (if possible) ❖ Stay committed
  • 92. FINITA LA COMEDIA http://joind.in/3349 ❖ http://zazzle.com/andreiz