PHP Internationalization with ICU By Stas Malyshev, Zend Technologies
What and why? ICU (IBM): http://icu-project.org/ and Unicode CLDR: http://cldr.unicode.org/
Intl extension: Locale, Collator, Number & Currency formatter, Date & Time formatter, Message & Choice formatter, Normalizer, Graphemes, IDN, Calendars, Resources
Intl extension: dual API, OO and procedural, with the same implementation underneath:
    collator_create() == new Collator()
    numfmt_format() == NumberFormatter::format()
    locale_get_default() == Locale::getDefault()
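A minimal sketch of the dual API, assuming the intl extension is loaded. The sample strings and variable names are illustrative and not from the slides; the point is only that both call styles reach the same underlying collator and locale code.

```php
<?php
// Object-oriented and procedural handles for the same locale.
$oo   = new Collator('en_US');
$proc = collator_create('en_US');

// Both calls return -1, 0 or 1 and should agree with each other.
$ooResult   = $oo->compare('apple', 'banana');
$procResult = collator_compare($proc, 'apple', 'banana');
var_dump($ooResult === $procResult);

// The same holds for the default locale accessors.
$ooDefault   = Locale::getDefault();
$procDefault = locale_get_default();
var_dump($ooDefault === $procDefault);
```

Which style to use is a matter of taste; mixing them in one code base works but tends to hurt readability.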
Locale: relies on ICU locale IDs of the form <language>[_<script>][_<country>][_<variant>][@<keywords>]. Default locale: Locale::setDefault(), Locale::getDefault(); to request it explicitly use new Collator(Locale::DEFAULT_LOCALE). You can also pass null.
Locale pieces: getPrimaryLanguage($locale), getScript($locale), getRegion($locale), getAllVariants($locale), getKeywords($locale)
Locale display pieces: getDisplayName($locale, $in_locale = null), getDisplayLanguage($locale, $in_locale = null), getDisplayScript($locale, $in_locale = null), getDisplayRegion($locale, $in_locale = null)
    Example: Locale::getDisplayScript("zh-Hant-TW", "en-US") returns the English name of the Hant script ("Traditional Chinese" on the original slide; the exact wording depends on the ICU data version). Note that the first argument is the full locale, not the script subtag.
Locale building blocks: parseLocale() returns an array composed of locale subtags; composeLocale() creates a locale ID out of subtags.
    Locale::parseLocale('sr-Latn-RS') returns array('language'=>'sr', 'script'=>'Latn', 'region'=>'RS')
    Locale::composeLocale(array('language'=>'sr', 'script'=>'Latn', 'region'=>'RS')) returns 'sr_Latn_RS'
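The locale accessors and building blocks above can be sketched together as follows. This is an illustration that assumes the intl extension is available; the variable names are made up for the example.

```php
<?php
$locale = 'sr-Latn-RS';

// Individual subtags.
$language = Locale::getPrimaryLanguage($locale); // 'sr'
$script   = Locale::getScript($locale);          // 'Latn'
$region   = Locale::getRegion($locale);          // 'RS'

// Split into an array of subtags, then rebuild a locale ID from it.
$parts   = Locale::parseLocale($locale);
$rebuilt = Locale::composeLocale($parts);

// Human-readable region name, shown in English.
$regionName = Locale::getDisplayRegion($locale, 'en');

echo $language, ' ', $script, ' ', $region, "\n";
echo $rebuilt, "\n";
echo $regionName, "\n";
```

The round trip through parseLocale() and composeLocale() is handy when one subtag has to be swapped, for example replacing the region while keeping language and script.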
Locale guessing: acceptFromHttp() turns an Accept-Language header into a locale; lookup() finds the best match in a list; filterMatches() checks whether a language tag matches a language range.
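A hedged sketch of locale guessing. The header value and the list of supported locales are hypothetical; a real application would read $_SERVER['HTTP_ACCEPT_LANGUAGE'] and use its own list.

```php
<?php
// Hypothetical Accept-Language header and application locale list.
$header    = 'de-DE,de;q=0.9,en;q=0.5';
$supported = array('en_US', 'fr_FR', 'de');

// Best locale known to ICU for this header.
$requested = Locale::acceptFromHttp($header);

// Best entry of $supported for that locale, falling back to en_US.
$best = Locale::lookup($supported, $requested, false, 'en_US');

// Basic filtering (RFC 4647): does the tag de-DE fall under the range de?
$matches = Locale::filterMatches('de-DE', 'de', false);

echo $requested, "\n", $best, "\n";
var_dump($matches);
```

lookup() progressively truncates the requested locale (de_DE, then de), so listing the bare language in the supported list gives region-specific requests something to fall back to.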
Collator: comparing and sorting strings. Collation level (strength). All ICU collator attributes, including numeric collation and ignoring punctuation. Not yet: custom "tailoring" rules.
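The attributes mentioned above might be used as in this sketch. The sample strings are invented for illustration, and the behaviour shown assumes standard ICU root collation rules for en_US.

```php
<?php
$coll = new Collator('en_US');

// Numeric collation: digit runs compare by numeric value.
$files = array('file10', 'file9', 'file100');
$coll->setAttribute(Collator::NUMERIC_COLLATION, Collator::ON);
$coll->sort($files);
print_r($files);

// Strength: PRIMARY ignores accents and case, TERTIARY does not.
$coll->setStrength(Collator::PRIMARY);
$primary = $coll->compare('a', 'À');
$coll->setStrength(Collator::TERTIARY);
$tertiary = $coll->compare('a', 'À');

// Ignoring punctuation: SHIFTED makes punctuation irrelevant up to tertiary strength.
$coll->setAttribute(Collator::ALTERNATE_HANDLING, Collator::SHIFTED);
$ignoringPunctuation = $coll->compare('black-bird', 'blackbird');

var_dump($primary, $tertiary, $ignoringPunctuation);
```

Attributes stay set on the collator object, so a collator configured once can be reused for many comparisons.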
Collator
    $coll = new Collator("fr_CA");
    if ($coll->compare("côte", "coté") < 0) {
        echo "less\n";
    } else {
        echo "greater\n";
    }
    // côte < coté
Collator
    $strings = array("cote", "côte", "Côte", "coté", "Coté", "côté", "Côté", "coter");
    $coll = new Collator("fr_CA");
    $coll->sort($strings);
    // cote côte Côte coté Coté côté Côté coter
    Sorting methods: sort($array, $flags), asort($array, $flags), sortWithSortKeys($array)
NumberFormatter: formatting and parsing numbers and currency.
    numfmt_create($locale, $style, $pattern = null)
    Styles: NumberFormatter::PATTERN_DECIMAL, NumberFormatter::DECIMAL, NumberFormatter::CURRENCY, NumberFormatter::PERCENT, NumberFormatter::SCIENTIFIC, NumberFormatter::SPELLOUT, NumberFormatter::ORDINAL, NumberFormatter::DURATION
NumberFormatter: formatting
    $fmt = new NumberFormatter('en_US', NumberFormatter::DECIMAL);
    echo $fmt->format(1234); // result is 1,234
    $fmt = new NumberFormatter('de_CH', NumberFormatter::DECIMAL);
    echo $fmt->format(1234); // result is 1'234
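The other formatter styles can be tried the same way. The sketch below uses en_US and invented sample values; the pattern '#,##0.00' is an assumption chosen for illustration, and exact output for other locales depends on the ICU data installed.

```php
<?php
// Spelled-out, percent and ordinal styles.
$fmt      = new NumberFormatter('en_US', NumberFormatter::SPELLOUT);
$spellout = $fmt->format(42);            // forty-two

$fmt     = new NumberFormatter('en_US', NumberFormatter::PERCENT);
$percent = $fmt->format(0.25);           // 25%

$fmt     = new NumberFormatter('en_US', NumberFormatter::ORDINAL);
$ordinal = $fmt->format(3);              // 3rd

// Currency style takes the ISO 4217 code at format time.
$fmt      = new NumberFormatter('en_US', NumberFormatter::CURRENCY);
$currency = $fmt->formatCurrency(1234.5, 'USD'); // $1,234.50

// PATTERN_DECIMAL uses an explicit ICU decimal pattern.
$fmt     = new NumberFormatter('en_US', NumberFormatter::PATTERN_DECIMAL, '#,##0.00');
$pattern = $fmt->format(1234.5);         // 1,234.50

echo implode("\n", array($spellout, $percent, $ordinal, $currency, $pattern)), "\n";
```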
NumberFormatter: parsing
    $fmt = new NumberFormatter('de_DE', NumberFormatter::DECIMAL);
    $num = '1.234,567 min';
    $pos = 0; // start offset; updated to where parsing stopped
    $fmt->parse($num, NumberFormatter::TYPE_DOUBLE, $pos); // result is 1234.567, $pos = 9
    $fmt->parse($num, NumberFormatter::TYPE_INT32); // result is 1234
MessageFormatter: formatting and parsing whole messages, including the data inside. Also allows a choice between things printed: 0≤are no files|1≤is one file|1<are many files
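A small sketch of the choice syntax inside a full message. The message text is an illustrative assumption; it uses '#' as the ASCII spelling of '≤' in the choice pattern.

```php
<?php
// Choice argument: picks a sub-message based on the numeric value of {0}.
$fmt = new MessageFormatter(
    'en_US',
    'There {0,choice,0#are no files|1#is one file|1<are {0,number,integer} files}.'
);

$none = $fmt->format(array(0));    // There are no files.
$one  = $fmt->format(array(1));    // There is one file.
$many = $fmt->format(array(1273)); // There are 1,273 files.

echo $none, "\n", $one, "\n", $many, "\n";
```

Keeping the whole sentence in one pattern lets translators reorder or reword each variant without touching the calling code.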
MessageFormatter
    $fmt = new MessageFormatter("en_US", "{0,number,integer} monkeys on {1,number,integer} trees make {2,number} monkeys per tree");
    echo $fmt->format(array(4560, 123, 4560 / 123));
    $fmt = new MessageFormatter("de", "{0,number,integer} Affen über {1,number,integer} Bäume um {2,number} Affen pro Baum");
    echo $fmt->format(array(4560, 123, 4560 / 123));
IntlDateFormatter: allows using locale-dependent canned patterns for date and time (short, medium, long, full):
    Full: Tuesday, April 12, 1952 AD or 3:30:42pm PST
    Long: January 12, 1952 or 3:30:32pm
    Medium: Jan 12, 1952
    Short: 12/13/52 or 3:30pm
    Also allows free-form patterns: "yyyy.MM.dd G 'at' HH:mm:ss vvvv" gives 1996.07.10 AD at 15:08:56 Pacific Time
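The free-form pattern from the slide could be applied as in this sketch, which passes the pattern as the optional sixth constructor argument. The timestamp 0 and the time zone are illustrative choices.

```php
<?php
// The pattern overrides the date/time style constants given before it.
$fmt = new IntlDateFormatter(
    'en_US',
    IntlDateFormatter::FULL,
    IntlDateFormatter::FULL,
    'America/Los_Angeles',
    IntlDateFormatter::GREGORIAN,
    "yyyy.MM.dd G 'at' HH:mm:ss vvvv"
);

// Unix timestamp 0 is 1969-12-31 16:00:00 in Los Angeles.
$formatted = $fmt->format(0);
echo $formatted, "\n"; // begins with: 1969.12.31 AD at 16:00:00
```

Free-form patterns are fixed across locales, so the canned styles are usually the better choice for text shown to users; patterns fit logs and machine-readable output.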
IntlDateFormatter
    $fmt = new IntlDateFormatter("en_US", IntlDateFormatter::FULL, IntlDateFormatter::FULL,
        'America/Los_Angeles', IntlDateFormatter::GREGORIAN);
    echo $fmt->format(0); // Wednesday, December 31, 1969 4:00:00 PM PT
    $fmt = new IntlDateFormatter("de-DE", IntlDateFormatter::FULL, IntlDateFormatter::FULL,
        'America/Los_Angeles', IntlDateFormatter::GREGORIAN);
    echo $fmt->format(0); // Mittwoch, 31. Dezember 1969 16:00 Uhr GMT-08:00
Normalizer: brings Unicode text to one of the normal forms NFC, NFD, NFKC, NFKD. Methods: normalize(), isNormalized()
    $combining_ring_above = "\xCC\x8A"; // 'COMBINING RING ABOVE' (U+030A)
    $chars = Normalizer::normalize('A' . $combining_ring_above, Normalizer::FORM_C);
    echo urlencode($chars); // %C3%85, i.e. 'LATIN CAPITAL LETTER A WITH RING ABOVE' (U+00C5)
Grapheme functions: graphemes are multi-char entities, like a letter plus accent mark(s). Same as the string functions, but operating on grapheme units: grapheme_strlen, grapheme_substr, grapheme_strpos, grapheme_strstr. Extraction function (grapheme_extract): extracts enough to fill a limited buffer, but always keeps graphemes whole.
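A minimal sketch contrasting byte-based and grapheme-based functions. The sample string is an assumption for illustration: "été" written with decomposed accents (e + U+0301), so each accented letter takes three bytes.

```php
<?php
// "e" + COMBINING ACUTE ACCENT, "t", "e" + COMBINING ACUTE ACCENT
$s = "e\xCC\x81te\xCC\x81";

$bytes     = strlen($s);                 // 7 bytes
$graphemes = grapheme_strlen($s);        // 3 user-visible characters
$first     = grapheme_substr($s, 0, 1);  // "e" with its accent, 3 bytes
$posT      = grapheme_strpos($s, 't');   // 1, counted in graphemes

// At most 5 bytes, but never split a grapheme: only "é" + "t" (4 bytes) fit.
$chunk = grapheme_extract($s, 5, GRAPHEME_EXTR_MAXBYTES);

var_dump($bytes, $graphemes, $posT, strlen($chunk));
```

A plain substr($s, 0, 5) would cut the second accent in half and leave invalid UTF-8, which is the case grapheme_extract() is meant to avoid.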
IDN
    עברית.idn.icann.org ↔ xn--5dbqzzl.idn.icann.org
    русский.idn.icann.org ↔ xn--h1acbxfam.idn.icann.org
    Functions: idn_to_ascii(), idn_to_utf8()
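A sketch of the round trip between the Unicode and ASCII forms of a domain, using one of the names above. Passing INTL_IDNA_VARIANT_UTS46 explicitly is an assumption about the PHP version in use: the constant appeared after the PHP 5.3 release this deck targets, and the source file must be saved as UTF-8.

```php
<?php
$unicode = 'русский.idn.icann.org';

// Unicode labels to Punycode ("xn--" prefixed) labels.
$ascii = idn_to_ascii($unicode, IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46);

// And back again.
$back = idn_to_utf8($ascii, IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46);

echo $ascii, "\n", $back, "\n";
```

Only labels containing non-ASCII characters are converted; the idn.icann.org part passes through unchanged.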
TODO: ResourceHandler, Transliteration, StringSearch, tighter integration with other modules in 6.0
Thanks! See http://php.net/intl for further information.

I18n with PHP 5.3


Editor's Notes

  • #3 Globalization. Formats, names, rules, algorithms: complexity & volume. Keeping it all up-to-date.
  • #11 Strength determines which character properties matter (a vs. à, a vs. A) and which characters matter (space, punctuation). Attributes: which case sorts first, which characters are considered space/visible, whether to use normalization, and whether two representations of the same text (e.g. Katakana/Hiragana) are treated as different.