SlideShare a Scribd company logo
1 of 60
Download to read offline
Unicode & PHP6
  Andrei Zmievski




               Inspiring people to
               share
Section Title




                Inspiring people to
                share
Unicode & PHP6
   Sara Golemon




              Inspiring people to
              share
Section Title




                Inspiring people to
                share
Unicode & PHP6
  Karsten & Robert




               Inspiring people to
               share
Tower of babel


The tower of Babel

     “Come, let us descend and confuse
     their language, so that one will not
     understand the language of his
     companionquot;. — Genesis 11:6




                                            Inspiring people to
                                            share
Tower of Babel


We all know it’s true
     The Babel story rings true

     Dealing with multiple encodings is a pain

     Requires different algorithms, conversion, detection, validation,
     processing...

     Dealing with multiple languages is a pain too

     But cannot be avoided in this day and age




                                              Inspiring people to
                                              share
Tower of Babel


PHP in the past
     PHP has always been a binary processor

     The string type is byte-oriented and is used for everything
     from text to images

     Core language knows little to nothing about encodings and
     processing multilingual data

     iconv and mbstring extensions are not sufficient

     Relies on POSIX locales




                                              Inspiring people to
                                              share
Tower of Babel


PHP in the past
     But does it have to stay that way? No!




                                              Inspiring people to
                                              share
Unicode
 Unicode provides a unique number for every character, no
 matter what the platform, no matter what the program, no
 matter what the language is




                                       Inspiring people to
                                       share
Unicode


Unicode
    Developed by the Unicode Consortium

    Designed to allow text and symbols from all languages to be
    consistently represented and manipulated

    Covers all major living scripts

    Version 5.0 has 99,000+ characters

    Capacity for 1 million+ characters

    Unicode Character Set = ISO 10646



                                           Inspiring people to
                                           share
Unicode


Unicode is here to stay
    Multilingual

    Generative

    Rich and reliable set of character properties

    Standard encodings: UTF-8, UTF-16, UTF-32

    Algorithm specifications provide interoperability

    Unified character set and algorithms are essential for creating
    modern software

    But Unicode != i18n

                                             Inspiring people to
                                             share
Locales
 I18N and L10N rely on consistent and correct locale data

 Locale doesn’t refer to data like in POSIX

 Locale = identifier referring to linguistic and cultural
 preferences of a user community: en_US, en_GB, ja_JP

 These preferences can change over time due to cultural and
 political reasons:

 •   Introduction of new currencies, like the Euro

 •   Standard sorting of Spanish changes


                                           Inspiring people to
                                           share
Locales


Types of locale data
     Date/time formats

     Number/currency formats

     Measurement system

     Collation Specification, i.e. sorting, searching, matching

     Translated names for language, territory, script, time zones,
     currencies,...

     Script and characters used by a language



                                              Inspiring people to
                                              share
Locales


Common Locale Data
     Hosted by Unicode Consortium

     Goals:

    •     Common, necessary software locale data for all world
          languages

    •     Collect and maintain locale data

    •     XML format for effective interchange

    •     Freely available

     360 locales: 121 languages and 142 territories in CLDR 1.4

                                                 Inspiring people to
                                                 share
Goals for PHP6
 Native Unicode string type

 Distinct binary string type

 Updated language semantics

 Where possible, upgrade the existing functions

 Backwards compatibility

 Make simple things easy and complex things possible




                                         Inspiring people to
                                         share
Goals for PHP6


How will it work?
     UTF-16 as internal encoding

     All functions and operators work on Normalized Composed
     Characters (NFC)

     All identifiers can contain Unicode characters

     Internationalization is explicit, not implicit

     You can turn off Unicode semantics if you don't need it




                                                Inspiring people to
                                                share
ICU
 International Components for Unicode

 •   Encoding conversions

 •   Collation

 •   Unicode text processing

 •   much more

 The wheel has already been invented and is pretty good




                                        Inspiring people to
                                        share
ICU


ICU Features
      Unicode Character Properties

      Unicode String Class & text processing

      Text transformations (normalization, upper/lowercase, etc)

      Text Boundary Analysis (Character/Word/Sentence Break
      Iterators)

      Encoding Conversions for 500+ legacy encodings

      Language-sensitive collation (sorting) and searching

      Unicode regular expressions

                                               Inspiring people to
                                               share
ICU


more ICU Features
      Thread-safe

      Formatting: Date/Time/Numbers/Currency

      Cultural Calendars & Time Zones

      Transliterations (50+ script pairs)

      Complex Text Layout for Arabic, Hebrew, Indic & Thai

      International Domain Names and Web addresses

      Java model for locale-hierarchical resource bundles. Multiple
      locales can be used at a time

                                              Inspiring people to
                                              share
Let There Be Unicode!
 A control switch called unicode.semantics

 Global, not per request or virtual server

 No changes to program behavior unless enabled

 Does not imply no Unicode at all when disabled!




                                             Inspiring people to
                                             share
Functions
 All the functions in the PHP default distribution are being
 analyzed to see whether they need be upgraded to understand
 Unicode and if so, how

 The upgrade is in progress and requires involvement from
 extension authors

 Parameter parsing API will perform automatic conversions
 while upgrades are being done




                                        Inspiring people to
                                        share
Let There Be Unicode!


String Types
     PHP 4/5 string types

    •   only one, used for everything

     PHP 6 string types

    •   Unicode: textual data (in UTF-16 encoding)

    •   Binary: textual data in other encodings and true binary data




                                             Inspiring people to
                                             share
Let There Be Unicode!


String Literals
     With unicode.semantics=off, string literals are old‐fashioned 8-
     bit strings

     1 character = 1 byte




                                              Inspiring people to
                                              share
Let There Be Unicode!


Unicode String Literals
     With unicode.semantics=on, string literals are of Unicode type

     1 character may be > 1 byte

     To obtain length in bytes one would use a separate function




                                             Inspiring people to
                                             share
Let There Be Unicode!


Binary String Literals
     Binary string literals require new syntax

     The contents, which are the literal byte sequence inside the
     delimiters, depend on the encoding of the script




                                                 Inspiring people to
                                                 share
Let There Be Unicode!


Escape Sequences
     Inside Unicode strings uXXXX and UXXXXXX escape sequences
     may be used to specify Unicode code points explicitly

     Characters can also be specified by name, using the C{..}
     escape sequence




                                             Inspiring people to
                                             share
Conversions




              Inspiring people to
              share
Conversions & Encoding


Runtime Encoding
     Runtime Encoding

     Specifies which encoding to use when converting between
     Unicode and binary strings at runtime

     Also used when interfacing with functions that do not yet
     support Unicode type




                                             Inspiring people to
                                             share
Conversions & Encoding


Script/Source Encoding
     Currently, scripts may be written in a variety of encodings:
     ISO-8859-1, Shift-JIS, UTF-8, etc.

     The engine needs to know the encoding of a script in order to
     parse it correctly

     Encoding can be specified as an INI setting or with declare()
     pragma

     Affects how identifiers and string literals are interpreted




                                              Inspiring people to
                                              share
Conversions & Encoding


Script Encoding
     Whatever the encoding of the script, the resulting string literals
     are of Unicode type




     In both cases $uni is a Unicode string containing two
     codepoints: U+00F8 (ø) and U+006C (l)


                                               Inspiring people to
                                               share
Conversions & Encoding


Script Encoding
     Encoding can be also changed with a pragma

    •   Has to be the very first statement in the script

    •   Does not propagate to included files




                                              Inspiring people to
                                              share
Conversions & Encoding


Output Encoding
     Specifies the encoding for the standard output stream

     The script output is transcoded on the fly

     Affects only Unicode strings




                                             Inspiring people to
                                             share
Conversions & Encoding


“HTTP Input Encoding”
     With Unicode semantics switch enabled, we need to convert
     HTTP input to Unicode

     GET requests have no encoding at all and POST ones rarely
     come marked with the encoding

     Encoding detection is not reliable

     Correctly decoding HTTP input is somewhat of an unsolved
     problem




                                            Inspiring people to
                                            share
Conversions & Encoding


“HTTP Input Encoding”
     PHP will perform lazy decoding

     Delays decoding data in $_GET, $_POST, and $_REQUEST until
     the first time you access them

     Allows user to set expected encoding or just rely on a default
     one

     Allows decoding errors to be handled by the same mechanism

     Applications should also use filter extension to filter incoming
     data



                                              Inspiring people to
                                             share
Conversions & Encoding


Filesystem Encoding
     Specifies the encoding of the file and directory names on the
     filesystem

     Filesystem-related functions will do the conversion when
     accepting and returning filenames




                                            Inspiring people to
                                            share
Conversions & Encoding


Type Conversions
     Unicode and binary string types can be converted to one
     another explicitly or implicitly

     Conversions use unicode.runtime_encoding

     Explicit conversions: casting

    •   (binary) casts to binary string type
        (unicode) casts to Unicode string type
        (string) casts to Unicode type if unicode.semantics is on and
        to binary otherwise

     Implicit conversions: concatenation, comparison, parameter
     passing
                                              Inspiring people to
                                              share
Conversions & Encoding


Generic Conversions
     Casting is just a shortcut for converting using runtime encoding

     For all other encodings, use provided functions




                                             Inspiring people to
                                             share
Conversions & Encoding


Conversion Issues
     Unicode is a superset of legacy character sets

     Many Unicode characters cannot be represented in legacy
     encodings

     Strings may also contain corrupt data or irregular byte
     sequences

     You can customize what PHP should do when it runs into a
     conversion error

     Global settings apply to all conversions by default



                                              Inspiring people to
                                              share
Unicode Identifiers
 PHP will allow Unicode characters in identifiers

 You may start with something quite simple and old-fashioned




                                         Inspiring people to
                                         share
Unicode Identifiers


For beginners...
     Perhaps you feel that a few accented characters won’t hurt




                                             Inspiring people to
                                            share
Unicode Identifiers


...and advanced
     Then you learn a couple more languages…

     …and the fun begins




                                          Inspiring people to
                                          share
Text Iterator
  Get used to the idea of using TextIterator

  It is very fast and gives you access to various text units in a
  generic fashion

  You can iterate over code points, combining sequences,
  characters, words, lines, and sentences forward and backward

  Provides access to ICU’s boundary analysis API




                                            Inspiring people to
                                            share
Text Iterator
  Truncate text at a word boundary




  Get the last 2 sentences of the text




                                         Inspiring people to
                                         share
Locales in PHP6
 Unicode support in PHP relies exclusively on ICU locales

 The legacy setlocale() should not be used

 Default locale can be accessed with:

 •   locale_set_default()

 •   locale_get_default()

 ICU locale IDs have a somewhat different format from POSIX
 locale IDs:

 •   sr_Latn_YU_REVISED@currency=USD
                                   <language>[_<script>]_<country>[_<variant>][@<keywords>]
                                   Serbian (Latin, Yugoslavia, Revised Orthography,
                                   Currency=USInspiring people to
                                                Dollar)
                                              share
Collation
  Collation is the process of ordering units of textual information

  Specific to a particular locale, language, and document:




                                           Inspiring people to
                                           share
Collation


Collators
     Languages may sort more than one way

    •   German dictionary vs. phone book

    •   Japanese stroke-radical vs. radical-stroke

    •   Traditional vs. modern Spanish

     PHP comparison operators do not use collation

     But PHP6 provides collators to do the job




                                             Inspiring people to
                                             share
Collation


Using collators


     We can ignore accents if we want:

    •   $coll->setStrength(Collator::PRIMARY);

     We can sort arrays as well:

    •   $coll->sort(array(quot;cotequot;, quot;côtequot;, quot;Côtequot;, quot;cotéquot;));
                                                Inspiring people to
                                                share
Collation


Default collators
     There is a default collator associated with the default locale

     Can be accessed with:

    •   collator_get_default()

    •   collator_set_default()

     When the default locale is changed, the default collator
     changes as well




                                              Inspiring people to
                                              share
Collation


Collation API
     Full collation API is very flexible and customizable

     You can change collation strength, make it use numeric
     ordering, ignore or respect case level or punctuation, and much
     more

     PHP will always update its collation algorithm and data with
     each version of Unicode, without breaking backwards
     compatibility




                                              Inspiring people to
                                              share
Stream I/O
 PHP has a streams-based I/O system

 Generalized file, network, data compression, and other
 operations

 PHP cannot assume that data on the other end of the stream is
 in a particular encoding

 Need to apply encoding conversions




                                        Inspiring people to
                                        share
Stream I/O
 By default, a stream is in binary mode and no encoding
 conversion is done

 Applications can convert data explicitly

 But ... we’re lazy so it’s easier to let the streams do it

 t mode - it’s not just for Windows line endings anymore!

 Uses encoding setting in default context, which is UTF-8 unless
 changed




                                             Inspiring people to
                                             share
Stream I/O
 If you mainly work with files in an encoding other than UTF-8,
 change default context:

 •   stream_default_encoding('Shift-JIS');

 Or create a custom context and use it instead

 If you have a stream that was opened in binary mode, you can
 also automate encoding handling

 •   stream_encoding($fp, 'utf-8');

 fopen() can actually detect the encoding from the headers, if
 it’s available

                                             Inspiring people to
                                             share
Text Transforms
 Powerful and flexible way to process Unicode text

 •   script-to-script conversions

 •   normalization

 •   case mappings and full-/halfwidth conversions

 •   accent removal

 •   and more

 Allows chained transforms

 •   [:Latin:]; NFKD; Lower; Latin-Katakana;
                                           Inspiring people to
                                           share
Text Transforms


Transliteration
     Here’s how to get (a fairly reliable) Japanese pronunciation of
     your name:




     And the result:

    •
        Buritenei Supearusu

                                              Inspiring people to
                                              share
Text Transforms


Transliteration
     Here’s how to get (a fairly reliable) Japanese pronunciation of
     your name:




     And the result:

    •
        Buritenei Supearusu

                                              Inspiring people to
                                              share
Current Status
 Most of the described functionality has been implemented

 Development still underway, so minor feature tweaks are
 possible

 ext/standard and a number of other extensions are being
 upgraded first

 Overall about 61% (of 3047) of extension functions have been
 upgraded

 Download PHP 6 snapshots today!



                                        Inspiring people to
                                        share
Current Status


Finished Extensions
     XML extensions:
     dom, xml, xsl, simplexml, xmlreader/xmlwriter

     soap, json

     spl, reflection

     mysql, mysqli, sqlite, oci8

     pcre

     curl, gd, session

     zlib, bz2, zip, hash, mcrypt, shmop, tidy

                                                 Inspiring people to
                                                 share
Current Status


What’s Coming?
     PHP 6 “Unicode Preview Release”

    •   core functionality done

    •   first-tier extensions upgraded

     Afterwards

    •   Upgrade the rest of extensions

    •   Expose more ICU services

    •   Update PHP manual

    •   Optimize performance
                                         Inspiring people to
                                         share
Unicode & PHP6

More Related Content

Viewers also liked

JSP Tag Library
JSP Tag LibraryJSP Tag Library
JSP Tag Libraryjgiudici
 
080312 talk about 3D-Internet Overview
080312 talk about 3D-Internet Overview080312 talk about 3D-Internet Overview
080312 talk about 3D-Internet OverviewKohei Nishikawa
 
2 2010 T M C Community Building Preso
2 2010  T M C Community Building Preso2 2010  T M C Community Building Preso
2 2010 T M C Community Building PresoHack the Hood
 
Image Sensors 2008 Final Revised V5
Image Sensors 2008   Final   Revised   V5Image Sensors 2008   Final   Revised   V5
Image Sensors 2008 Final Revised V5Shri Sundaram
 
Building Tech Vision
Building Tech VisionBuilding Tech Vision
Building Tech VisionJosh Allen
 
It Works! Presenting DBAL use in real life
It Works! Presenting DBAL use in real lifeIt Works! Presenting DBAL use in real life
It Works! Presenting DBAL use in real lifeKarsten Dambekalns
 
Organizing with web 2.0
Organizing with web 2.0Organizing with web 2.0
Organizing with web 2.0Josh Allen
 
VIBEWELL Hard Chrome Process
VIBEWELL Hard Chrome Process VIBEWELL Hard Chrome Process
VIBEWELL Hard Chrome Process flyrr
 
A Content Repository for TYPO3 5.0
A Content Repository for TYPO3 5.0A Content Repository for TYPO3 5.0
A Content Repository for TYPO3 5.0Karsten Dambekalns
 
Hex Colors At A Glance
Hex Colors At A GlanceHex Colors At A Glance
Hex Colors At A GlanceDino Baskovic
 
INERGY SLIDE SHOW
INERGY SLIDE SHOWINERGY SLIDE SHOW
INERGY SLIDE SHOWguest5ae2c9
 
Implementing a JSR-283 Content Repository in PHP
Implementing a JSR-283 Content Repository in PHPImplementing a JSR-283 Content Repository in PHP
Implementing a JSR-283 Content Repository in PHPKarsten Dambekalns
 
Case Study: Knight News Challenge
Case Study: Knight News ChallengeCase Study: Knight News Challenge
Case Study: Knight News ChallengeHack the Hood
 
Wms Student Research Conference 2008 Presentation (F Niemi)
Wms Student Research Conference 2008 Presentation (F Niemi)Wms Student Research Conference 2008 Presentation (F Niemi)
Wms Student Research Conference 2008 Presentation (F Niemi)Fa Niemi
 

Viewers also liked (20)

JSP Tag Library
JSP Tag LibraryJSP Tag Library
JSP Tag Library
 
080312 talk about 3D-Internet Overview
080312 talk about 3D-Internet Overview080312 talk about 3D-Internet Overview
080312 talk about 3D-Internet Overview
 
2 2010 T M C Community Building Preso
2 2010  T M C Community Building Preso2 2010  T M C Community Building Preso
2 2010 T M C Community Building Preso
 
Image Sensors 2008 Final Revised V5
Image Sensors 2008   Final   Revised   V5Image Sensors 2008   Final   Revised   V5
Image Sensors 2008 Final Revised V5
 
Building Tech Vision
Building Tech VisionBuilding Tech Vision
Building Tech Vision
 
It Works! Presenting DBAL use in real life
It Works! Presenting DBAL use in real lifeIt Works! Presenting DBAL use in real life
It Works! Presenting DBAL use in real life
 
Tiny Frogs
Tiny FrogsTiny Frogs
Tiny Frogs
 
Organizing with web 2.0
Organizing with web 2.0Organizing with web 2.0
Organizing with web 2.0
 
I Like Shiny Things
I Like Shiny ThingsI Like Shiny Things
I Like Shiny Things
 
Lvw Model
Lvw ModelLvw Model
Lvw Model
 
VIBEWELL Hard Chrome Process
VIBEWELL Hard Chrome Process VIBEWELL Hard Chrome Process
VIBEWELL Hard Chrome Process
 
A Content Repository for TYPO3 5.0
A Content Repository for TYPO3 5.0A Content Repository for TYPO3 5.0
A Content Repository for TYPO3 5.0
 
Hex Colors At A Glance
Hex Colors At A GlanceHex Colors At A Glance
Hex Colors At A Glance
 
INERGY SLIDE SHOW
INERGY SLIDE SHOWINERGY SLIDE SHOW
INERGY SLIDE SHOW
 
Implementing a JSR-283 Content Repository in PHP
Implementing a JSR-283 Content Repository in PHPImplementing a JSR-283 Content Repository in PHP
Implementing a JSR-283 Content Repository in PHP
 
To italy by plane
To italy by planeTo italy by plane
To italy by plane
 
Case Study: Knight News Challenge
Case Study: Knight News ChallengeCase Study: Knight News Challenge
Case Study: Knight News Challenge
 
Comic strip i
Comic strip iComic strip i
Comic strip i
 
Pijamas Gutty
Pijamas GuttyPijamas Gutty
Pijamas Gutty
 
Wms Student Research Conference 2008 Presentation (F Niemi)
Wms Student Research Conference 2008 Presentation (F Niemi)Wms Student Research Conference 2008 Presentation (F Niemi)
Wms Student Research Conference 2008 Presentation (F Niemi)
 

Similar to Unicode & PHP6

The Good, the Bad, and the Ugly: What Happened to Unicode and PHP 6
The Good, the Bad, and the Ugly: What Happened to Unicode and PHP 6The Good, the Bad, and the Ugly: What Happened to Unicode and PHP 6
The Good, the Bad, and the Ugly: What Happened to Unicode and PHP 6Andrei Zmievski
 
Internationlization
InternationlizationInternationlization
InternationlizationTuan Ngo
 
2016 bioinformatics i_python_part_1_wim_vancriekinge
2016 bioinformatics i_python_part_1_wim_vancriekinge2016 bioinformatics i_python_part_1_wim_vancriekinge
2016 bioinformatics i_python_part_1_wim_vancriekingeProf. Wim Van Criekinge
 
Challenges In Dsl Design
Challenges In Dsl DesignChallenges In Dsl Design
Challenges In Dsl DesignSven Efftinge
 
Abap slide class4 unicode-plusfiles
Abap slide class4 unicode-plusfilesAbap slide class4 unicode-plusfiles
Abap slide class4 unicode-plusfilesMilind Patil
 
Unicode Fundamentals
Unicode Fundamentals Unicode Fundamentals
Unicode Fundamentals SamiHsDU
 
Porting E-poetry: The Case of First Screening
Porting E-poetry: The Case of First ScreeningPorting E-poetry: The Case of First Screening
Porting E-poetry: The Case of First ScreeningLeonardo Flores
 
Understanding Character Encodings
Understanding Character EncodingsUnderstanding Character Encodings
Understanding Character EncodingsMobisoft Infotech
 
How to build language technology resources for the next 100 years
How to build language technology resources for the next 100 yearsHow to build language technology resources for the next 100 years
How to build language technology resources for the next 100 yearsGuy De Pauw
 
Xml For Dummies Chapter 6 Adding Character(S) To Xml
Xml For Dummies   Chapter 6 Adding Character(S) To XmlXml For Dummies   Chapter 6 Adding Character(S) To Xml
Xml For Dummies Chapter 6 Adding Character(S) To Xmlphanleson
 
How To Build And Launch A Successful Globalized App From Day One Or All The ...
How To Build And Launch A Successful Globalized App From Day One  Or All The ...How To Build And Launch A Successful Globalized App From Day One  Or All The ...
How To Build And Launch A Successful Globalized App From Day One Or All The ...agileware
 
The Spoofax Language Workbench (SPLASH 2010)
The Spoofax Language Workbench (SPLASH 2010)The Spoofax Language Workbench (SPLASH 2010)
The Spoofax Language Workbench (SPLASH 2010)lennartkats
 

Similar to Unicode & PHP6 (20)

The Good, the Bad, and the Ugly: What Happened to Unicode and PHP 6
The Good, the Bad, and the Ugly: What Happened to Unicode and PHP 6The Good, the Bad, and the Ugly: What Happened to Unicode and PHP 6
The Good, the Bad, and the Ugly: What Happened to Unicode and PHP 6
 
Unicode
UnicodeUnicode
Unicode
 
Internationlization
InternationlizationInternationlization
Internationlization
 
Using unicode with php
Using unicode with phpUsing unicode with php
Using unicode with php
 
2016 bioinformatics i_python_part_1_wim_vancriekinge
2016 bioinformatics i_python_part_1_wim_vancriekinge2016 bioinformatics i_python_part_1_wim_vancriekinge
2016 bioinformatics i_python_part_1_wim_vancriekinge
 
Uncdtalk
UncdtalkUncdtalk
Uncdtalk
 
Unicode Primer for the Uninitiated
Unicode Primer for the UninitiatedUnicode Primer for the Uninitiated
Unicode Primer for the Uninitiated
 
Challenges In Dsl Design
Challenges In Dsl DesignChallenges In Dsl Design
Challenges In Dsl Design
 
Using unicode with php
Using unicode with phpUsing unicode with php
Using unicode with php
 
Abap slide class4 unicode-plusfiles
Abap slide class4 unicode-plusfilesAbap slide class4 unicode-plusfiles
Abap slide class4 unicode-plusfiles
 
Unicode Fundamentals
Unicode Fundamentals Unicode Fundamentals
Unicode Fundamentals
 
P1 2017 python
P1 2017 pythonP1 2017 python
P1 2017 python
 
Porting E-poetry: The Case of First Screening
Porting E-poetry: The Case of First ScreeningPorting E-poetry: The Case of First Screening
Porting E-poetry: The Case of First Screening
 
Understanding Character Encodings
Understanding Character EncodingsUnderstanding Character Encodings
Understanding Character Encodings
 
P1 2018 python
P1 2018 pythonP1 2018 python
P1 2018 python
 
How to build language technology resources for the next 100 years
How to build language technology resources for the next 100 yearsHow to build language technology resources for the next 100 years
How to build language technology resources for the next 100 years
 
Xml For Dummies Chapter 6 Adding Character(S) To Xml
Xml For Dummies   Chapter 6 Adding Character(S) To XmlXml For Dummies   Chapter 6 Adding Character(S) To Xml
Xml For Dummies Chapter 6 Adding Character(S) To Xml
 
How To Build And Launch A Successful Globalized App From Day One Or All The ...
How To Build And Launch A Successful Globalized App From Day One  Or All The ...How To Build And Launch A Successful Globalized App From Day One  Or All The ...
How To Build And Launch A Successful Globalized App From Day One Or All The ...
 
Drupal entity translation
Drupal entity translationDrupal entity translation
Drupal entity translation
 
The Spoofax Language Workbench (SPLASH 2010)
The Spoofax Language Workbench (SPLASH 2010)The Spoofax Language Workbench (SPLASH 2010)
The Spoofax Language Workbench (SPLASH 2010)
 

More from Karsten Dambekalns

The Perfect Neos Project Setup
The Perfect Neos Project SetupThe Perfect Neos Project Setup
The Perfect Neos Project SetupKarsten Dambekalns
 
Sawubona! Content Dimensions with Neos
Sawubona! Content Dimensions with NeosSawubona! Content Dimensions with Neos
Sawubona! Content Dimensions with NeosKarsten Dambekalns
 
Deploying TYPO3 Neos websites using Surf
Deploying TYPO3 Neos websites using SurfDeploying TYPO3 Neos websites using Surf
Deploying TYPO3 Neos websites using SurfKarsten Dambekalns
 
Profiling TYPO3 Flow Applications
Profiling TYPO3 Flow ApplicationsProfiling TYPO3 Flow Applications
Profiling TYPO3 Flow ApplicationsKarsten Dambekalns
 
Using Document Databases with TYPO3 Flow
Using Document Databases with TYPO3 FlowUsing Document Databases with TYPO3 Flow
Using Document Databases with TYPO3 FlowKarsten Dambekalns
 
How Git and Gerrit make you more productive
How Git and Gerrit make you more productiveHow Git and Gerrit make you more productive
How Git and Gerrit make you more productiveKarsten Dambekalns
 
The agile future of a ponderous project
The agile future of a ponderous projectThe agile future of a ponderous project
The agile future of a ponderous projectKarsten Dambekalns
 
How Domain-Driven Design helps you to migrate into the future
How Domain-Driven Design helps you to migrate into the futureHow Domain-Driven Design helps you to migrate into the future
How Domain-Driven Design helps you to migrate into the futureKarsten Dambekalns
 
Content Repository, Versioning and Workspaces in TYPO3 Phoenix
Content Repository, Versioning and Workspaces in TYPO3 PhoenixContent Repository, Versioning and Workspaces in TYPO3 Phoenix
Content Repository, Versioning and Workspaces in TYPO3 PhoenixKarsten Dambekalns
 
Transparent Object Persistence (within FLOW3)
Transparent Object Persistence (within FLOW3)Transparent Object Persistence (within FLOW3)
Transparent Object Persistence (within FLOW3)Karsten Dambekalns
 
Transparent Object Persistence with FLOW3
Transparent Object Persistence with FLOW3Transparent Object Persistence with FLOW3
Transparent Object Persistence with FLOW3Karsten Dambekalns
 
Knowledge Management in der TYPO3 Community
Knowledge Management in der TYPO3 CommunityKnowledge Management in der TYPO3 Community
Knowledge Management in der TYPO3 CommunityKarsten Dambekalns
 
Implementing a JSR-283 Content Repository in PHP
Implementing a JSR-283 Content Repository in PHPImplementing a JSR-283 Content Repository in PHP
Implementing a JSR-283 Content Repository in PHPKarsten Dambekalns
 
Introduction to Source Code Management
Introduction to Source Code ManagementIntroduction to Source Code Management
Introduction to Source Code ManagementKarsten Dambekalns
 
DB API Usage - How to write DBAL compliant code
DB API Usage - How to write DBAL compliant codeDB API Usage - How to write DBAL compliant code
DB API Usage - How to write DBAL compliant codeKarsten Dambekalns
 

More from Karsten Dambekalns (20)

The Perfect Neos Project Setup
The Perfect Neos Project SetupThe Perfect Neos Project Setup
The Perfect Neos Project Setup
 
Sawubona! Content Dimensions with Neos
Sawubona! Content Dimensions with NeosSawubona! Content Dimensions with Neos
Sawubona! Content Dimensions with Neos
 
Deploying TYPO3 Neos websites using Surf
Deploying TYPO3 Neos websites using SurfDeploying TYPO3 Neos websites using Surf
Deploying TYPO3 Neos websites using Surf
 
Profiling TYPO3 Flow Applications
Profiling TYPO3 Flow ApplicationsProfiling TYPO3 Flow Applications
Profiling TYPO3 Flow Applications
 
Using Document Databases with TYPO3 Flow
Using Document Databases with TYPO3 FlowUsing Document Databases with TYPO3 Flow
Using Document Databases with TYPO3 Flow
 
i18n and L10n in TYPO3 Flow
i18n and L10n in TYPO3 Flowi18n and L10n in TYPO3 Flow
i18n and L10n in TYPO3 Flow
 
FLOW3-Workshop F3X12
FLOW3-Workshop F3X12FLOW3-Workshop F3X12
FLOW3-Workshop F3X12
 
Doctrine in FLOW3
Doctrine in FLOW3Doctrine in FLOW3
Doctrine in FLOW3
 
How Git and Gerrit make you more productive
How Git and Gerrit make you more productiveHow Git and Gerrit make you more productive
How Git and Gerrit make you more productive
 
The agile future of a ponderous project
The agile future of a ponderous projectThe agile future of a ponderous project
The agile future of a ponderous project
 
How Domain-Driven Design helps you to migrate into the future
How Domain-Driven Design helps you to migrate into the futureHow Domain-Driven Design helps you to migrate into the future
How Domain-Driven Design helps you to migrate into the future
 
Content Repository, Versioning and Workspaces in TYPO3 Phoenix
Content Repository, Versioning and Workspaces in TYPO3 PhoenixContent Repository, Versioning and Workspaces in TYPO3 Phoenix
Content Repository, Versioning and Workspaces in TYPO3 Phoenix
 
Transparent Object Persistence (within FLOW3)
Transparent Object Persistence (within FLOW3)Transparent Object Persistence (within FLOW3)
Transparent Object Persistence (within FLOW3)
 
JavaScript for PHP Developers
JavaScript for PHP DevelopersJavaScript for PHP Developers
JavaScript for PHP Developers
 
Transparent Object Persistence with FLOW3
Transparent Object Persistence with FLOW3Transparent Object Persistence with FLOW3
Transparent Object Persistence with FLOW3
 
TDD (with FLOW3)
TDD (with FLOW3)TDD (with FLOW3)
TDD (with FLOW3)
 
Knowledge Management in der TYPO3 Community
Knowledge Management in der TYPO3 CommunityKnowledge Management in der TYPO3 Community
Knowledge Management in der TYPO3 Community
 
Implementing a JSR-283 Content Repository in PHP
Implementing a JSR-283 Content Repository in PHPImplementing a JSR-283 Content Repository in PHP
Implementing a JSR-283 Content Repository in PHP
 
Introduction to Source Code Management
Introduction to Source Code ManagementIntroduction to Source Code Management
Introduction to Source Code Management
 
DB API Usage - How to write DBAL compliant code
DB API Usage - How to write DBAL compliant codeDB API Usage - How to write DBAL compliant code
DB API Usage - How to write DBAL compliant code
 

Recently uploaded

Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraDeakin University
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxnull - The Open Security Community
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 

Recently uploaded (20)

Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptxVulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 

Unicode & PHP6

  • 1. Unicode & PHP6 Andrei Zmievski Inspiring people to share
  • 2. Section Title Inspiring people to share
  • 3. Unicode & PHP6 Sara Golemon Inspiring people to share
  • 4. Section Title Inspiring people to share
  • 5. Unicode & PHP6 Karsten & Robert Inspiring people to share
  • 6. Tower of babel The tower of Babel “Come, let us descend and confuse their language, so that one will not understand the language of his companionquot;. — Genesis 11:6 Inspiring people to share
  • 7. Tower of Babel We all know it’s true The Babel story rings true Dealing with multiple encodings is a pain Requires different algorithms, conversion, detection, validation, processing... Dealing with multiple languages is a pain too But cannot be avoided in this day and age Inspiring people to share
  • 8. Tower of Babel PHP in the past PHP has always been a binary processor The string type is byte-oriented and is used for everything from text to images Core language knows little to nothing about encodings and processing multilingual data iconv and mbstring extensions are not sufficient Relies on POSIX locales Inspiring people to share
  • 9. Tower of Babel PHP in the past But does it have to stay that way? No! Inspiring people to share
  • 10. Unicode Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language is Inspiring people to share
  • 11. Unicode Unicode Developed by the Unicode Consortium Designed to allow text and symbols from all languages to be consistently represented and manipulated Covers all major living scripts Version 5.0 has 99,000+ characters Capacity for 1 million+ characters Unicode Character Set = ISO 10646 Inspiring people to share
  • 12. Unicode Unicode is here to stay Multilingual Generative Rich and reliable set of character properties Standard encodings: UTF-8, UTF-16, UTF-32 Algorithm specifications provide interoperability Unified character set and algorithms are essential for creating modern software But Unicode != i18n Inspiring people to share
  • 13. Locales I18N and L10N rely on consistent and correct locale data Locale doesn’t refer to data like in POSIX Locale = identifier referring to linguistic and cultural preferences of a user community: en_US, en_GB, ja_JP These preferences can change over time due to cultural and political reasons: • Introduction of new currencies, like the Euro • Standard sorting of Spanish changes Inspiring people to share
  • 14. Locales Types of locale data Date/time formats Number/currency formats Measurement system Collation Specification, i.e. sorting, searching, matching Translated names for language, territory, script, time zones, currencies,... Script and characters used by a language Inspiring people to share
  • 15. Locales Common Locale Data Hosted by Unicode Consortium Goals: • Common, necessary software locale data for all world languages • Collect and maintain locale data • XML format for effective interchange • Freely available 360 locales: 121 languages and 142 territories in CLDR 1.4 Inspiring people to share
  • 16. Goals for PHP6 Native Unicode string type Distinct binary string type Updated language semantics Where possible, upgrade the existing functions Backwards compatibility Make simple things easy and complex things possible Inspiring people to share
  • 17. Goals for PHP6 How will it work? UTF-16 as internal encoding All functions and operators work on Normalized Composed Characters (NFC) All identifiers can contain Unicode characters Internationalization is explicit, not implicit You can turn off Unicode semantics if you don't need it Inspiring people to share
  • 18. ICU International Components for Unicode • Encoding conversions • Collation • Unicode text processing • much more The wheel has already been invented and is pretty good Inspiring people to share
  • 19. ICU ICU Features Unicode Character Properties Unicode String Class & text processing Text transformations (normalization, upper/lowercase, etc) Text Boundary Analysis (Character/Word/Sentence Break Iterators) Encoding Conversions for 500+ legacy encodings Language-sensitive collation (sorting) and searching Unicode regular expressions Inspiring people to share
  • 20. ICU more ICU Features Thread-safe Formatting: Date/Time/Numbers/Currency Cultural Calendars & Time Zones Transliterations (50+ script pairs) Complex Text Layout for Arabic, Hebrew, Indic & Thai International Domain Names and Web addresses Java model for locale-hierarchical resource bundles. Multiple locales can be used at a time Inspiring people to share
  • 21. Let There Be Unicode! A control switch called unicode.semantics Global, not per request or virtual server No changes to program behavior unless enabled Does not imply no Unicode at all when disabled! Inspiring people to share
  • 22. Functions All the functions in the PHP default distribution are being analyzed to see whether they need be upgraded to understand Unicode and if so, how The upgrade is in progress and requires involvement from extension authors Parameter parsing API will perform automatic conversions while upgrades are being done Inspiring people to share
  • 23. Let There Be Unicode! String Types PHP 4/5 string types • only one, used for everything PHP 6 string types • Unicode: textual data (in UTF-16 encoding) • Binary: textual data in other encodings and true binary data Inspiring people to share
  • 24. Let There Be Unicode! String Literals With unicode.semantics=off, string literals are old‐fashioned 8- bit strings 1 character = 1 byte Inspiring people to share
  • 25. Let There Be Unicode! Unicode String Literals With unicode.semantics=on, string literals are of Unicode type 1 character may be > 1 byte To obtain length in bytes one would use a separate function Inspiring people to share
  • 26. Let There Be Unicode! Binary String Literals Binary string literals require new syntax The contents, which are the literal byte sequence inside the delimiters, depend on the encoding of the script Inspiring people to share
  • 27. Let There Be Unicode! Escape Sequences Inside Unicode strings uXXXX and UXXXXXX escape sequences may be used to specify Unicode code points explicitly Characters can also be specified by name, using the C{..} escape sequence Inspiring people to share
  • 28. Conversions Inspiring people to share
  • 29. Conversions & Encoding Runtime Encoding Runtime Encoding Specifies which encoding to use when converting between Unicode and binary strings at runtime Also used when interfacing with functions that do not yet support Unicode type Inspiring people to share
  • 30. Conversions & Encoding Script/Source Encoding Currently, scripts may be written in a variety of encodings: ISO-8859-1, Shift-JIS, UTF-8, etc. The engine needs to know the encoding of a script in order to parse it correctly Encoding can be specified as an INI setting or with declare() pragma Affects how identifiers and string literals are interpreted Inspiring people to share
  • 31. Conversions & Encoding Script Encoding Whatever the encoding of the script, the resulting string literals are of Unicode type In both cases $uni is a Unicode string containing two codepoints: U+00F8 (ø) and U+006C (l) Inspiring people to share
  • 32. Conversions & Encoding Script Encoding Encoding can be also changed with a pragma • Has to be the very first statement in the script • Does not propagate to included files Inspiring people to share
  • 33. Conversions & Encoding Output Encoding Specifies the encoding for the standard output stream The script output is transcoded on the fly Affects only Unicode strings Inspiring people to share
  • 34. Conversions & Encoding “HTTP Input Encoding” With Unicode semantics switch enabled, we need to convert HTTP input to Unicode GET requests have no encoding at all and POST ones rarely come marked with the encoding Encoding detection is not reliable Correctly decoding HTTP input is somewhat of an unsolved problem Inspiring people to share
  • 35. Conversions & Encoding “HTTP Input Encoding” PHP will perform lazy decoding Delays decoding data in $_GET, $_POST, and $_REQUEST until the first time you access them Allows user to set expected encoding or just rely on a default one Allows decoding errors to be handled by the same mechanism Applications should also use filter extension to filter incoming data Inspiring people to share
  • 36. Conversions & Encoding Filesystem Encoding Specifies the encoding of the file and directory names on the filesystem Filesystem-related functions will do the conversion when accepting and returning filenames Inspiring people to share
  • 37. Conversions & Encoding Type Conversions Unicode and binary string types can be converted to one another explicitly or implicitly Conversions use unicode.runtime_encoding Explicit conversions: casting • (binary) casts to binary string type (unicode) casts to Unicode string type (string) casts to Unicode type if unicode.semantics is on and to binary otherwise Implicit conversions: concatenation, comparison, parameter passing Inspiring people to share
  • 38. Conversions & Encoding Generic Conversions Casting is just a shortcut for converting using runtime encoding For all other encodings, use provided functions Inspiring people to share
  • 39. Conversions & Encoding Conversion Issues Unicode is a superset of legacy character sets Many Unicode characters cannot be represented in legacy encodings Strings may also contain corrupt data or irregular byte sequences You can customize what PHP should do when it runs into a conversion error Global settings apply to all conversions by default Inspiring people to share
  • 40. Unicode Identifiers PHP will allow Unicode characters in identifiers You may start with something quite simple and old-fashioned Inspiring people to share
  • 41. Unicode Identifiers For beginners... Perhaps you feel that a few accented characters won’t hurt Inspiring people to share
  • 42. Unicode Identifiers ...and advanced Then you learn a couple more languages… …and the fun begins Inspiring people to share
  • 43. Text Iterator Get used to the idea of using TextIterator It is very fast and gives you access to various text units in a generic fashion You can iterate over code points, combining sequences, characters, words, lines, and sentences forward and backward Provides access to ICU’s boundary analysis API Inspiring people to share
  • 44. Text Iterator Truncate text at a word boundary Get the last 2 sentences of the text Inspiring people to share
  • 45. Locales in PHP6 Unicode support in PHP relies exclusively on ICU locales The legacy setlocale() should not be used Default locale can be accessed with: • locale_set_default() • locale_get_default() ICU locale IDs have a somewhat different format from POSIX locale IDs: • sr_Latn_YU_REVISED@currency=USD <language>[_<script>]_<country>[_<variant>][@<keywords>] Serbian (Latin, Yugoslavia, Revised Orthography, Currency=USInspiring people to Dollar) share
  • 46. Collation Collation is the process of ordering units of textual information Specific to a particular locale, language, and document: Inspiring people to share
  • 47. Collation Collators Languages may sort more than one way • German dictionary vs. phone book • Japanese stroke-radical vs. radical-stroke • Traditional vs. modern Spanish PHP comparison operators do not use collation But PHP6 provides collators to do the job Inspiring people to share
  • 48. Collation Using collators We can ignore accents if we want: • $coll->setStrength(Collator::PRIMARY); We can sort arrays as well: • $coll->sort(array(quot;cotequot;, quot;côtequot;, quot;Côtequot;, quot;cotéquot;)); Inspiring people to share
  • 49. Collation Default collators There is a default collator associated with the default locale Can be accessed with: • collator_get_default() • collator_set_default() When the default locale is changed, the default collator changes as well Inspiring people to share
  • 50. Collation Collation API Full collation API is very flexible and customizable You can change collation strength, make it use numeric ordering, ignore or respect case level or punctuation, and much more PHP will always update its collation algorithm and data with each version of Unicode, without breaking backwards compatibility Inspiring people to share
  • 51. Stream I/O PHP has a streams-based I/O system Generalized file, network, data compression, and other operations PHP cannot assume that data on the other end of the stream is in a particular encoding Need to apply encoding conversions Inspiring people to share
  • 52. Stream I/O By default, a stream is in binary mode and no encoding conversion is done Applications can convert data explicitly But ... we’re lazy so it’s easier to let the streams do it t mode - it’s not just for Windows line endings anymore! Uses encoding setting in default context, which is UTF-8 unless changed Inspiring people to share
  • 53. Stream I/O If you mainly work with files in an encoding other than UTF-8, change default context: • stream_default_encoding('Shift-JIS'); Or create a custom context and use it instead If you have a stream that was opened in binary mode, you can also automate encoding handling • stream_encoding($fp, 'utf-8'); fopen() can actually detect the encoding from the headers, if it’s available Inspiring people to share
  • 54. Text Transforms Powerful and flexible way to process Unicode text • script-to-script conversions • normalization • case mappings and full-/halfwidth conversions • accent removal • and more Allows chained transforms • [:Latin:]; NFKD; Lower; Latin-Katakana; Inspiring people to share
  • 55. Text Transforms Transliteration Here’s how to get (a fairly reliable) Japanese pronunciation of your name: And the result: • Buritenei Supearusu Inspiring people to share
  • 56. Text Transforms Transliteration Here’s how to get (a fairly reliable) Japanese pronunciation of your name: And the result: • Buritenei Supearusu Inspiring people to share
  • 57. Current Status Most of the described functionality has been implemented Development still underway, so minor feature tweaks are possible ext/standard and a number of other extensions are being upgraded first Overall about 61% (of 3047) of extension functions have been upgraded Download PHP 6 snapshots today! Inspiring people to share
  • 58. Current Status Finished Extensions XML extensions: dom, xml, xsl, simplexml, xmlreader/xmlwriter soap, json spl, reflection mysql, mysqli, sqlite, oci8 pcre curl, gd, session zlib, bz2, zip, hash, mcrypt, shmop, tidy Inspiring people to share
  • 59. Current Status What’s Coming? PHP 6 “Unicode Preview Release” • core functionality done • first-tier extensions upgraded Afterwards • Upgrade the rest of extensions • Expose more ICU services • Update PHP manual • Optimize performance Inspiring people to share