
Using Unicode with PHP



Published in: Technology
  • It is Laruence, not Laurence, btw :)


  1. USING UNICODE WITH PHP: Translation, localization, and 100% less mojibake guaranteed, or your users won’t come back!
  2. The whole world uses the internet
  3. Why is internationalization important?
     • Content language of websites
     • Percentage of Internet users by language
  4. Worse than no internationalization? Mojibake
  5. Unicode is the solution! Well, kind of
     1. Different encodings
     2. OSes have different default implementations
     3. All software encodings have to match or convert
     Unicode idea == simple
     Unicode implementation == hard
  6. WHAT IS UNICODE? Back to basics
  7. U·ni·code /ˈyo͞oniˈkōd/ (noun, computing): an international encoding standard for use with different languages and scripts, by which each letter, digit, or symbol is assigned a unique numeric value that applies across different platforms and programs.
  8. In the beginning, there was ASCII
  9. Code Pages: in which things get really weird…
  10. Representing characters differently
      • ASCII: one character maps directly to bits in memory (A -> 0100 0001)
      • Unicode: one character maps to an abstract code point (A -> U+0041)
      But how do we represent this in memory?
  11. Encoding Madness: UTF (Unicode Transformation Format) maps a code point to a byte sequence
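To make "maps a code point to a byte sequence" concrete, here is a minimal sketch of the UTF-8 mapping written by hand. The function name `codepoint_to_utf8` is hypothetical (it is not a PHP built-in), and the type declarations assume PHP 7 or later.

```php
<?php
// Hypothetical helper for illustration: encode one Unicode code point as UTF-8.
function codepoint_to_utf8(int $cp): string
{
    // Reject values outside Unicode and the UTF-16 surrogate range.
    if ($cp < 0 || $cp > 0x10FFFF || ($cp >= 0xD800 && $cp <= 0xDFFF)) {
        throw new InvalidArgumentException('Not a Unicode scalar value');
    }
    if ($cp < 0x80) {           // 1 byte: 0xxxxxxx (identical to ASCII)
        return chr($cp);
    }
    if ($cp < 0x800) {          // 2 bytes: 110xxxxx 10xxxxxx
        return chr(0xC0 | ($cp >> 6))
             . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

echo bin2hex(codepoint_to_utf8(0x41)), "\n";   // 41     (A)
echo bin2hex(codepoint_to_utf8(0xE5)), "\n";   // c3a5   (å)
echo bin2hex(codepoint_to_utf8(0x20AC)), "\n"; // e282ac (€)
```

The same code point always yields the same bytes in UTF-8, which is why no byte-order question comes up for it.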
  12. What is a character? å (A + COMBINING RING, or A-RING). How long is the string?
      1. In bytes?
      2. In code units?
      3. In code points?
      4. In graphemes?
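The four answers can be compared with a short sketch that relies only on PCRE's UTF-8 mode, so it does not need intl or mbstring. The helper name `utf8_lengths` is hypothetical, the input is assumed to be valid UTF-8, and "code units" is taken here to mean UTF-16 code units (in UTF-8 itself, code units are simply bytes).

```php
<?php
// Hypothetical helper for illustration: measure a UTF-8 string four ways.
function utf8_lengths(string $s): array
{
    // Each match of "." under /u is one code point (/s lets it match newlines).
    $codePoints = (int) preg_match_all('/./su', $s);
    // Code points above U+FFFF need two UTF-16 code units (a surrogate pair).
    $astral = (int) preg_match_all('/[\x{10000}-\x{10FFFF}]/u', $s);
    // \X matches one extended grapheme cluster.
    $graphemes = (int) preg_match_all('/\X/u', $s);

    return [
        'bytes'            => strlen($s),
        'utf16_code_units' => $codePoints + $astral,
        'code_points'      => $codePoints,
        'graphemes'        => $graphemes,
    ];
}

// "a" followed by U+030A COMBINING RING ABOVE, as raw UTF-8 bytes.
$decomposedLengths = utf8_lengths("a\xCC\x8A");
// U+00E5, the precomposed a-ring, as raw UTF-8 bytes.
$precomposedLengths = utf8_lengths("\xC3\xA5");

print_r($decomposedLengths);  // bytes 3, code units 2, code points 2, graphemes 1
print_r($precomposedLengths); // bytes 2, code units 1, code points 1, graphemes 1
```

Both strings display as the same single character, yet only the grapheme count agrees between them.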
  13. Crash course in computer memory
      • Big-endian systems: the most significant byte of a number comes first (the upper left corner of a memory diagram), in decreasing significance.
      • Little-endian systems: the most significant byte comes last (the lower right), in increasing significance.
  14. Big Endian? Little Endian? You’re hurting my brain
      Hello -> U+0048 U+0065 U+006C U+006C U+006F
      00 48 00 65 00 6C 00 6C 00 6F (big endian)
      48 00 65 00 6C 00 6C 00 6F 00 (little endian)
      But… it’s the same way to encode Unicode… Now I have a headache!
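The two byte orders can be reproduced with `pack()`, whose `n` and `v` formats write unsigned 16-bit integers in big-endian and little-endian order. This is a sketch that only holds for code points up to U+FFFF, since anything higher needs surrogate pairs in UTF-16; the variable names are illustrative.

```php
<?php
// Code points for "Hello", all inside the Basic Multilingual Plane.
$codePoints = [0x48, 0x65, 0x6C, 0x6C, 0x6F];

$utf16be = pack('n*', ...$codePoints); // 16-bit units, most significant byte first
$utf16le = pack('v*', ...$codePoints); // 16-bit units, least significant byte first

// Render the raw bytes as space-separated hex pairs.
$hexBe = strtoupper(implode(' ', str_split(bin2hex($utf16be), 2)));
$hexLe = strtoupper(implode(' ', str_split(bin2hex($utf16le), 2)));

echo "big endian:    $hexBe\n"; // 00 48 00 65 00 6C 00 6C 00 6F
echo "little endian: $hexLe\n"; // 48 00 65 00 6C 00 6C 00 6F 00
```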
  15. UTF-8 to the rescue!
      Hello in ANSI  -> 48 65 6C 6C 6F
      Hello in UTF-8 -> 48 65 6C 6C 6F
  16. Moral of the story
      • Unicode is a standard, not an implementation
      • Text is never plain
      • Every string has an encoding
        • From a file
        • From a db
        • From an HTTP POST or GET (or PUT or file upload…)
      • No encoding? Start praying to the Mojibake gods…
      • If you do web, use UTF-8
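Since every incoming string has an encoding, one cheap safeguard is to check that input really is UTF-8 before using it. The sketch below uses a common PCRE idiom; the function name `is_valid_utf8` is hypothetical, and what to do with rejected input (convert, refuse, log) is left to the application.

```php
<?php
// Hypothetical helper for illustration: true when $bytes is well-formed UTF-8.
function is_valid_utf8(string $bytes): bool
{
    // With the /u modifier, preg_match() returns false for subjects that are
    // not valid UTF-8; the empty pattern matches any valid subject.
    return preg_match('//u', $bytes) === 1;
}

var_dump(is_valid_utf8("\xC3\x9Eormar")); // "Þormar" in UTF-8: bool(true)
var_dump(is_valid_utf8("\xDEormar"));     // "Þormar" in ISO-8859-1: bool(false)
```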
  17. WHY DO YOU NEED UNICODE? Mojibake on rye with swiss.
  18. Helgi Þormar Þorbjörnsson
  19. Laruence
  20. BEYOND STRINGS: More than just UTF-8
  21. I18N and L10N
      • Internationalization: adaptation of products for potential use virtually everywhere
      • Localization: addition of special features for use in a specific locale
  22. Date and Time Formats
      • fr_FR: 30 juin 2009
      • de_DE: 30.06.2009
      • en_US: Jun 30, 2009
      And don’t forget the time zones!
  23. Currency and Numbers
      • fr_FR: 123 456 and 345 987,246
      • de_DE: 123.456 and 345.987,246
      • en_US: 123,456 and 345,987.246
      • French (France), Euro: 9 876 543,21 €
      • German (Germany), Euro: 9.876.543,21 €
      • English (United States), US Dollar: $9,876,543.21
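Locale-specific output like the figures above can be produced with intl's `NumberFormatter`. This is a sketch that assumes the intl extension is loaded; the helper name `format_samples` is hypothetical, and the exact whitespace characters in the French output depend on the ICU version, so they are not asserted here.

```php
<?php
// Hypothetical helper for illustration: format the slide's sample values
// for one locale. Requires the intl extension.
function format_samples(string $locale, string $currencyCode): array
{
    $decimal  = new NumberFormatter($locale, NumberFormatter::DECIMAL);
    $currency = new NumberFormatter($locale, NumberFormatter::CURRENCY);

    return [
        'decimal'  => $decimal->format(345987.246),
        'currency' => $currency->formatCurrency(9876543.21, $currencyCode),
    ];
}

if (class_exists('NumberFormatter')) {
    $samples = ['fr_FR' => 'EUR', 'de_DE' => 'EUR', 'en_US' => 'USD'];
    foreach ($samples as $locale => $currencyCode) {
        $out = format_samples($locale, $currencyCode);
        printf("%s: %s | %s\n", $locale, $out['decimal'], $out['currency']);
    }
} else {
    echo "intl is not loaded; skipping the NumberFormatter sketch.\n";
}
```

Passing the locale to each formatter, rather than calling setlocale, keeps the choice local to the object instead of the whole process.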
  24. Collation (Sorting)
      • The letters A-Z can be sorted in a different order than in English. For example, in Lithuanian, "y" is sorted between "i" and "k".
      • Combinations of letters can be treated as if they were one letter. For example, in traditional Spanish "ch" is treated as a single letter, and sorted between "c" and "d".
      • Accented letters can be treated as minor variants of the unaccented letter. For example, "é" can be treated as equivalent to "e".
      • Accented letters can be treated as distinct letters. For example, "Å" in Danish is treated as a separate letter that sorts after "Z", at the end of the alphabet.
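The difference between byte order and locale-aware order can be seen with intl's `Collator`. This is a sketch that assumes the intl extension is loaded; the word list is invented for illustration and is written as raw UTF-8 bytes so the result does not depend on the encoding of the source file.

```php
<?php
// Invented sample words: "zebra", "Åse", "apple", "éclair", "egg".
$words = ['zebra', "\xC3\x85se", 'apple', "\xC3\xA9clair", 'egg'];

// Plain byte-wise sort: every non-ASCII letter lands after "z".
$byBytes = $words;
sort($byBytes, SORT_STRING);

$danish  = $words;
$english = $words;
if (class_exists('Collator')) {
    // Danish rules treat Å as its own letter at the end of the alphabet.
    (new Collator('da_DK'))->sort($danish);
    // English rules treat Å and é as variants of a and e.
    (new Collator('en_US'))->sort($english);
}

echo 'bytes:   ', implode(', ', $byBytes), "\n";
echo 'Danish:  ', implode(', ', $danish), "\n";
echo 'English: ', implode(', ', $english), "\n";
```

If intl is missing, the last two lines simply print the unsorted list, which makes the dependency easy to spot.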
  25. String Translation
      • Translation is never one to one, especially when inserting items like numbers
      • Some languages have different grammars and formats for the strangest things
      • Usually translated strings are separated into “messages” and stored, then mapped depending on the locale
      • Large amounts of text need even more: different tables in a database, files in directories, or more
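A sketch of the "messages mapped by locale" idea, using intl's `MessageFormatter` so that plural forms and number formatting follow each locale. It assumes the intl extension is loaded; the catalog array, the message text, and the `format_inbox` helper are all invented for illustration, and a real application would load its messages from files or a database.

```php
<?php
// Invented message catalog: one ICU message pattern per locale.
$catalog = [
    'en_US' => '{count, plural, =0{No new messages} one{# new message} other{# new messages}} for {name}',
    'de_DE' => "{count, plural, =0{Keine neuen Nachrichten} one{# neue Nachricht} other{# neue Nachrichten}} f\xC3\xBCr {name}",
];

// Hypothetical helper for illustration: look up the pattern and fill it in.
function format_inbox(array $catalog, string $locale, int $count, string $name): string
{
    $text = MessageFormatter::formatMessage(
        $locale,
        $catalog[$locale],
        ['count' => $count, 'name' => $name]
    );
    if ($text === false) {
        throw new RuntimeException("Could not format message for $locale");
    }
    return $text;
}

if (class_exists('MessageFormatter')) {
    echo format_inbox($catalog, 'en_US', 1, 'Helgi'), "\n";
    echo format_inbox($catalog, 'en_US', 1234, 'Helgi'), "\n";
    echo format_inbox($catalog, 'de_DE', 1234, 'Helgi'), "\n";
}
```

Keeping the number inside the pattern lets each translation decide where it goes and how it is pluralized, which string concatenation cannot do.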
  26. Layout and Design
      • Reading order
        • Right to left
        • Left to right
        • Top to bottom
      • Word order
      • Cultural taboos (human images, for example)
  27. HOW TO UNICODE WITH PHP: 3.5 extensions for triple the pain!
  28. Upgrade to at least 5.3
      • No, really, I’m entirely serious
      • If you’re not on 5.3 you’re not ready for Unicode
      • At all
      • You have far bigger issues to deal with, like no security updates
      • (Oh, and the extensions and APIs you need either don’t exist or won’t work right)
  29. Install the bare minimum
      • intl extension (bundled since PHP 5.3)
      • mbstring (if you need zend_multibyte support or on-the-fly conversion; most anything else it can do, intl does better)
      • iconv extension (optional but excellent for dealing with files)
      • pcre MUST have UTF-8 support (CHECK!)
  30. PHP strings 101
  31. C strings and encoding
      • char: 1 byte (usually 8 bits)
      • char *: a pointer to an array of chars stored in memory
        • Can handle code page encodings, although you generally need special APIs for dealing with multibyte code pages
        • Usually null terminated… well, unless it’s a binary string
        • Unix cleverly supports UTF-8 with its APIs
        • Windows… does not
  32. Introducing a new type
      • wchar_t: C90 standard (horribly ambiguous)
        • Windows set it at 16 bits, and defined A and W versions of everything
        • Unix set it at 32 bits
      • C11 and C++11 add char16_t and char32_t to fix the craziness
      • Non-portable, and API support is sketchy
        • Libraries to fix this exist
        • Few are cross-platform
        • Except for ICU, which just rocks
  33. Why do we care?
      • PHP talks ONLY to ANSI APIs on Windows
      • PHP functions assume ASCII or binary encodings (except for a few special ones)
      • Although most functions are now marked “binary safe” and don’t flip out on null bytes within a string, some still assume a null-terminated string
      • String handling functions treat strings as a sequence of single-byte characters
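The single-byte assumption is easy to demonstrate with core functions alone. In this sketch the name from the earlier slide is written as raw UTF-8 bytes, and PCRE's `/u` modifier stands in as one UTF-8 aware alternative; the variable names are illustrative.

```php
<?php
// "Þormar" as raw UTF-8 bytes: Þ (U+00DE) occupies two bytes, C3 9E.
$name = "\xC3\x9Eormar";

// Byte-oriented core functions count and cut bytes, not characters.
$byteLength = strlen($name);       // 7 bytes for 6 characters
$firstByte  = substr($name, 0, 1); // "\xC3": only half of Þ

// A UTF-8 aware alternative using PCRE's /u modifier.
preg_match('/^./su', $name, $match);
$firstChar = $match[0];            // "\xC3\x9E": the whole Þ

// preg_match('//u', ...) returns false when the subject is not valid UTF-8.
$stillValid = preg_match('//u', $firstByte) === 1;

printf(
    "bytes=%d firstByte=%s firstChar=%s valid=%s\n",
    $byteLength,
    bin2hex($firstByte),
    bin2hex($firstChar),
    $stillValid ? 'yes' : 'no'
); // bytes=7 firstByte=c3 firstChar=c39e valid=no
```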
  34. Non-stupid PHP functionality
      • utf8_encode (only ISO-8859-1 to UTF-8)
      • utf8_decode (only UTF-8 to ISO-8859-1)
      • html_entity_decode
      • htmlentities
      • htmlspecialchars_decode
      • htmlspecialchars
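The HTML escaping functions in this list accept an explicit charset argument, and passing it avoids depending on a default that has differed between PHP versions. A minimal sketch follows; the wrapper name `e` is hypothetical, and `ENT_SUBSTITUTE` requires PHP 5.4 or later.

```php
<?php
// Hypothetical wrapper for illustration: escape text for HTML output as UTF-8.
function e(string $text): string
{
    // ENT_QUOTES escapes both quote styles. ENT_SUBSTITUTE replaces invalid
    // byte sequences with U+FFFD instead of returning an empty string.
    return htmlspecialchars($text, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
}

// Valid UTF-8 passes through; only the HTML metacharacters are escaped.
echo e("<b>\xC3\x9Eorbj\xC3\xB6rnsson & co</b>"), "\n";

// A stray ISO-8859-1 byte is replaced by U+FFFD (EF BF BD in UTF-8).
echo bin2hex(e("\xDE")), "\n"; // efbfbd
```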
  35. C locales, or how to make servers cry
      • setlocale is per process
      • I will repeat that: setlocale sets PER PROCESS
      • Locales are slightly different on different OSes
      • Windows does not support UTF-8 properly
  36. What setlocale will break
      • gettext extension
      • strtoupper
      • strtolower
      • number_format
      • money_format
      • ucfirst
      • ucwords
      • strftime
  37. INTL to the rescue!
      • Wrapper around the excellent ICU library
      • Standardized locales, set default locale per script
      • Number formatting
      • Currency formatting
      • Message formatting (replaces gettext)
      • Calendars, dates, timezones and time
      • Transliterator
      • Spoofchecker
      • Resource Bundles
      • Converters
      • IDN support
      • Graphemes
      • Collation
      • Iterators
  38. Some intl caveats
      • New stuff is only in newer PHP versions
      • All strings in and out must be UTF-8, except for UConverter
      • intl doesn’t yet support zend_multibyte
      • intl doesn’t support HTTP input/output conversion
      • intl doesn’t support function “overloading”
  39. mbstring
      • Enables zend_multibyte support
      • Supports transparent HTTP input and output encoding
      • Provides some wrappers for functionality such as strtoupper (including overloading the PHP function version…)
  40. Iconv
      • Primarily for charset conversion
      • Output buffer handler
      • MIME encoding functionality
      • Conversion
      • Some string helpers: iconv_strlen, iconv_substr, iconv_strpos, iconv_strrpos
      • Stream filter: stream_filter_append($fp, 'convert.iconv.ISO-2022-JP/EUC-JP');
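The stream filter can be tried without any files by using an in-memory stream. This is a sketch that assumes the iconv extension is loaded; it swaps the slide's Japanese encodings for ISO-8859-1 to UTF-8 so the bytes are easy to check by eye, and `php://temp` stands in for a real file handle.

```php
<?php
// "café" plus a newline, encoded as ISO-8859-1 (é is the single byte E9).
$latin1 = "caf\xE9\n";
$utf8 = null;

if (extension_loaded('iconv')) {
    // php://temp stands in for a real file; note the b flag for binary mode.
    $fp = fopen('php://temp', 'w+b');
    fwrite($fp, $latin1);
    rewind($fp);

    // Convert on the fly while reading: source encoding / target encoding.
    stream_filter_append($fp, 'convert.iconv.ISO-8859-1/UTF-8', STREAM_FILTER_READ);
    $utf8 = stream_get_contents($fp);
    fclose($fp);

    echo bin2hex($utf8), "\n"; // é should now be the two bytes c3 a9
}
```

Attaching the filter with `STREAM_FILTER_READ` leaves writes untouched, so the stored bytes stay in their original encoding.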
  41. BEYOND THE CODE: What do you mean mysql is giving me garbage?
  42. Browser Considerations
      • Set Content-Type AND charset
      • Use HTTP headers AND meta tags (not just meta)
      • Use accept-charset on forms to make sure your data is coming in right
      • JavaScript: string literals, regular expression literals, and any code unit can also be expressed via a Unicode escape sequence \uHHHH
      • Specify Content-Type AND charset headers for JavaScript!!
  43. Databases: table/schema encoding and connection
      • MySQL: you need to set the charset right on the table AND
        • Set the charset right on the connection (NOT with SET NAMES, it does not do enough) AND
        • Don’t use the mysql extension; use mysqli or PDO
      • PostgreSQL: pg_set_client_encoding
      • Oracle: passed in the connect call
      • SQLite(3): make sure it was compiled with Unicode support and the intl extension is available
      • sqlsrv/pdo_sqlsrv: CharacterSet in options
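For MySQL, setting the charset on the connection can look like the sketch below. The helper names, host, database name, and credentials are all placeholders, and `utf8mb4` is an assumption about the charset you want (it is MySQL's full four-byte UTF-8). The connection helpers are defined but not called, since they need a running server.

```php
<?php
// Hypothetical helper for illustration: build a PDO DSN that sets the
// connection charset. The charset option is honored from PHP 5.3.6 on.
function mysql_utf8_dsn(string $host, string $database): string
{
    return sprintf('mysql:host=%s;dbname=%s;charset=utf8mb4', $host, $database);
}

// Hypothetical helper: open a PDO connection using the DSN above.
function open_pdo_utf8(string $host, string $database, string $user, string $password): PDO
{
    return new PDO(mysql_utf8_dsn($host, $database), $user, $password, [
        PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
    ]);
}

// Hypothetical helper: the mysqli equivalent, using set_charset()
// rather than issuing a SET NAMES query.
function open_mysqli_utf8(string $host, string $database, string $user, string $password): mysqli
{
    $db = new mysqli($host, $user, $password, $database);
    $db->set_charset('utf8mb4');
    return $db;
}

echo mysql_utf8_dsn('localhost', 'app'), "\n";
// mysql:host=localhost;dbname=app;charset=utf8mb4
```

The table and column charsets still have to match on the server side; the connection setting only covers the data in transit.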
  44. Other gotchas
      • Plain text is not plain text: files will have encodings
      • Files will be loaded as binary if you add the b flag to fopen (here’s a hint: always use the b flag)
      • You can convert files on the fly with the iconv filter
      • You cannot use Unicode file names with PHP on Windows at all (no, not even UTF-8) unless you find a 3rd-party PHP extension
      • Beware of sending anything but ASCII to exec, proc_open, and other command-line calls
  45. CASE STUDIES: The best and worst in PHP apps
  46. Applications
      • WordPress
        • gettext (sigh)
      • Drupal
        • gettext files but NOT the gettext API
  47. Frameworks
      • ZF and ZF2
        • multiple adapters
        • “gettext” allows using fast .po files, but doesn’t use setlocale or the gettext extension
      • Symfony 1 and 2
        • multiple formats to hold translations
        • doesn’t use gettext
  48. Resources
      • the-ugly-what-happened-to-unicode-and-php-6
  49. My Little Project
      • Get everything needed into intl from mbstring and iconv so you need only 1 solution
        • stream filter from iconv
        • output handler from iconv
        • zend_multibyte support from mbstring
        • HTTP input and output conversion from mbstring
      • Some simplified APIs to make “overloading” doable
  50. Contact
      • @auroraeosrose
      • Freenode
        • #phpwomen
        • #phpmentoring
        • #php-gtk