Using unicode with php

Your application is great – and popular. You have translation efforts underway, everything is going well – and wait a minute, what’s this report of strange question mark characters all over the page? Unicode is pain. UTF-32, UTF-16, UTF-8 and then something else is thrown in the mix… Multibyte and code points, it all sounds like Greek. But it doesn’t have to be so scary. PHP support for Unicode has been improving, even without native Unicode string support. Learn the basics of what Unicode is and how it works, why you would add support for it in your application, how to deal with issues, and the pain points of implementation.

Published in: Technology
  1. USING UNICODE WITH PHP
     Translation, localization, and 100% less mojibake guaranteed or your users won’t come back!
  2. The whole world uses the internet
  3. Why is internationalization important?
       Content language of websites
       Percentage of Internet users by language
  4. Worse than no internationalization? Mojibake
  5. Unicode is the solution! Well – kind of
       1. Different encodings
       2. OS’s have different default implementations
       3. All software encodings have to match or convert
     Unicode Idea == simple
     Unicode Implementation == hard
  6. Back to Basics
     WHAT IS UNICODE?
  7. U·ni·code /ˈyo͞oniˌkōd/ Noun, COMPUTING
       1. an international encoding standard for use with different languages and scripts, by which each letter, digit, or symbol is assigned a unique numeric value that applies across different platforms and programs.
  8. In the Beginning, there was ASCII
  9. Code Pages
     In which things get really weird…
  10. Representing characters differently
        ASCII                              Unicode
        One character to bits in memory    Code point
        A -> 100 0001                      A -> U+0041
        Direct                             Abstract
      But how do we represent this in memory?
  11. Encoding Madness
        UTF – Unicode Transformation Format
        Maps a Code Point to a Byte Sequence
  12. What is a character? Å
        U+212B ANGSTROM SIGN
        U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE
        U+0041 LATIN CAPITAL LETTER A + U+030A COMBINING RING ABOVE
      How long is the string?
        1. In bytes?
        2. In code units?
        3. In code points?
        4. In graphemes?
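
The "how long is the string?" question can be made concrete with a small sketch. It assumes PHP 5.4 or later; `countCodePoints()` and `countGraphemes()` are hypothetical helper names, and the grapheme count falls back to PCRE's `\X` when the intl extension is not loaded. In UTF-8 the code unit is the byte, so those two counts coincide here.

```php
<?php
// Three spellings of the same visible character, written as UTF-8 byte
// escapes so the example does not depend on this file's own encoding.
$forms = [
    'U+212B ANGSTROM SIGN'       => "\xE2\x84\xAB",
    'U+00C5 precomposed'         => "\xC3\x85",
    'U+0041 + U+030A decomposed' => "A\xCC\x8A",
];

// Hypothetical helper: count code points using PCRE's UTF-8 (/u) mode.
function countCodePoints($s)
{
    return preg_match_all('/./su', $s);
}

// Hypothetical helper: count graphemes, preferring intl when it is loaded
// and falling back to PCRE's \X (one extended grapheme cluster) otherwise.
function countGraphemes($s)
{
    return function_exists('grapheme_strlen')
        ? grapheme_strlen($s)
        : preg_match_all('/\X/u', $s);
}

foreach ($forms as $label => $s) {
    printf(
        "%-28s bytes=%d code points=%d graphemes=%d\n",
        $label,
        strlen($s),            // strlen() always counts bytes
        countCodePoints($s),
        countGraphemes($s)
    );
}
```

All three forms count as one grapheme, yet they take 3, 2 and 3 bytes and 1, 1 and 2 code points respectively.
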
  13. Crash course in Computer Memory
        Big endian systems – most significant bytes of a number in the upper left corner. Decreasing significance.
        Little endian systems – most significant bytes of a number in the lower right. Increasing significance.
  14. Big Endian? Little Endian? You’re hurting my brain
        Hello -> U+0048 U+0065 U+006C U+006C U+006F
        00 48 00 65 00 6C 00 6C 00 6F – Big Endian
        48 00 65 00 6C 00 6C 00 6F 00 – Little Endian
      But… it’s the same way to encode Unicode… Now I have a headache!
  15. UTF-8 to the rescue!
        Hello in ANSI -> 48 65 6C 6C 6F
        Hello in UTF8 -> 48 65 6C 6C 6F
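
The byte sequences for "Hello" can be reproduced with core PHP alone. This is a sketch assuming PHP 5.6+ (for argument unpacking); it builds UTF-16 by packing each code point as one 16-bit unit, which is only valid because every code point in "Hello" is below U+10000.

```php
<?php
// Code points for "Hello"; all are in the Basic Multilingual Plane, so each
// maps to exactly one 16-bit UTF-16 code unit (no surrogate pairs needed).
$codePoints = [0x0048, 0x0065, 0x006C, 0x006C, 0x006F];

$utf16be = pack('n*', ...$codePoints); // 'n' = unsigned 16-bit, big endian
$utf16le = pack('v*', ...$codePoints); // 'v' = unsigned 16-bit, little endian
$utf8    = 'Hello';                    // pure ASCII, so the UTF-8 bytes are the same

echo bin2hex($utf16be), "\n"; // 00480065006c006c006f
echo bin2hex($utf16le), "\n"; // 480065006c006c006f00
echo bin2hex($utf8), "\n";    // 48656c6c6f
```

UTF-8 has no byte-order variants because its code unit is a single byte, which is why the ASCII and UTF-8 dumps match.
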
  16. Moral of the story
        • Unicode is a standard, not an implementation
        • Text is never plain
        • Every string has an encoding
            • From a file
            • From a db
            • From an HTTP POST or GET (or PUT or file upload…)
        • Even Binary is an encoding!
        • No encoding? Start praying to the Mojibake gods…
        • If you do web – use UTF-8
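
One way to act on "every string has an encoding" is to check incoming bytes at the boundary. The sketch below is an illustration, not a recipe from the slides: `ensureUtf8()` is a hypothetical helper, it needs the iconv extension, and the ISO-8859-1 fallback is an assumption you would replace with whatever your legacy input actually uses.

```php
<?php
// Hypothetical helper: return $bytes as UTF-8, converting from a fallback
// encoding when the input is not already well-formed UTF-8.
function ensureUtf8($bytes, $fallbackEncoding = 'ISO-8859-1')
{
    // An empty pattern in /u mode matches only if the subject is valid UTF-8.
    if (preg_match('//u', $bytes) === 1) {
        return $bytes;
    }
    // Assumption: anything that is not valid UTF-8 arrived in $fallbackEncoding.
    return iconv($fallbackEncoding, 'UTF-8', $bytes);
}

echo bin2hex(ensureUtf8("caf\xC3\xA9")), "\n"; // already UTF-8, returned unchanged
echo bin2hex(ensureUtf8("caf\xE9")), "\n";     // Latin-1 input, converted to UTF-8
```

Both calls print the same bytes. Guessing an encoding is still a guess, so a declared charset (header, database setting, file convention) is the more reliable input.
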
  17. Mojibake on rye with swiss.
      WHY DO YOU NEED UNICODE?
  18. Helgi Þormar Þorbjörnsson
  19. Laruence
  20. More than just UTF8
      BEYOND STRINGS
  21. I18n and L10N
        • Internationalization – adaptation of products for potential use virtually everywhere
        • Localization – addition of special features for use in a specific locale
  22. Date and Time Formats
        30 juin 2009    fr_FR
        30.06.2009      de_DE
        Jun 30, 2009    en_US
      And don’t forget the time zones!
  23. Currency and Numbers
        • 123 456        fr_FR
        • 345 987,246    fr_FR
        • 123.456        de_DE
        • 345.987,246    de_DE
        • 123,456        en_US
        • 345,987.246    en_US
        • French (France), Euro: 9 876 543,21 €
        • German (Germany), Euro: 9.876.543,21 €
        • English (United States), US Dollar: $9,876,543.21
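
Locale-specific number and currency output like this can be produced with intl's `NumberFormatter`. This is a minimal sketch: `formatForLocale()` is a hypothetical helper that returns `null` when intl is not loaded, and the exact separator characters (for example the kind of space used for fr_FR) depend on the ICU version installed.

```php
<?php
// Hypothetical helper: format one number and one amount for a locale.
// Returns null when the intl extension is not loaded.
function formatForLocale($locale, $number, $amount, $currency)
{
    if (!class_exists('NumberFormatter')) {
        return null;
    }
    $decimal = new NumberFormatter($locale, NumberFormatter::DECIMAL);
    $money   = new NumberFormatter($locale, NumberFormatter::CURRENCY);

    return [
        'number'   => $decimal->format($number),
        'currency' => $money->formatCurrency($amount, $currency),
    ];
}

$locales = ['fr_FR' => 'EUR', 'de_DE' => 'EUR', 'en_US' => 'USD'];

foreach ($locales as $locale => $currency) {
    $out = formatForLocale($locale, 345987.246, 9876543.21, $currency);
    if ($out === null) {
        echo "intl extension not loaded\n";
        break;
    }
    echo $locale, ': ', $out['number'], ' | ', $out['currency'], "\n";
}
```
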
  24. Collation (Sorting)
        • The letters A-Z can be sorted in a different order than in English. For example, in Lithuanian, "y" is sorted between "i" and "k".
        • Combinations of letters can be treated as if they were one letter. For example, in traditional Spanish "ch" is treated as a single letter, and sorted between "c" and "d".
        • Accented letters can be treated as minor variants of the unaccented letter. For example, "é" can be treated equivalent to "e".
        • Accented letters can be treated as distinct letters. For example, "Å" in Danish is treated as a separate letter that sorts just after "Z".
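
The Danish "Å" rule can be checked with intl's `Collator`. This is a sketch under the assumption that intl (and its ICU locale data) is installed; `sortForLocale()` is a hypothetical helper and the word list is made up for illustration.

```php
<?php
// Hypothetical helper: sort a copy of $words by the rules of $locale.
// Returns null when the intl extension is not loaded.
function sortForLocale(array $words, $locale)
{
    if (!class_exists('Collator')) {
        return null;
    }
    $collator = new Collator($locale);
    $collator->sort($words); // sorts in place and reindexes from 0
    return $words;
}

// "Å" is written as UTF-8 bytes so the example does not depend on the
// encoding of this source file.
$words = ['Zebra', "\xC3\x85ben", 'Apple'];

var_export(sortForLocale($words, 'en_US')); // Å is treated as a variant of A
echo "\n";
var_export(sortForLocale($words, 'da_DK')); // Å is a separate letter after Z
echo "\n";
```

A plain `sort()` compares bytes, so it would put "Åben" last in every locale simply because its first byte is 0xC3.
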
  25. String Translation
        • Translation is never one to one, especially when inserting items like numbers
        • Some languages have different grammars and formats for the strangest things
        • Usually translated strings are separated into “messages” and stored, then mapped depending on the locale
        • Large amounts of text need even more – different tables in a database, files in directories, or more
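
Inserting numbers into translated messages is where intl's `MessageFormatter` helps, because the plural rule lives in the message pattern rather than in PHP code. The sketch below assumes intl is available; `renderMessage()` is a hypothetical helper and the pattern is invented for the example.

```php
<?php
// Hypothetical helper: render an ICU message pattern for a locale.
// Returns null when the intl extension is not loaded.
function renderMessage($locale, $pattern, array $args)
{
    if (!class_exists('MessageFormatter')) {
        return null;
    }
    return MessageFormatter::formatMessage($locale, $pattern, $args);
}

// Made-up English pattern; a real catalog would store one pattern per locale,
// each with the plural categories that language needs.
$pattern = '{0, plural, one{# file was uploaded} other{# files were uploaded}}';

foreach ([1, 3] as $count) {
    $text = renderMessage('en_US', $pattern, [$count]);
    echo $text === null ? "intl extension not loaded\n" : $text . "\n";
}
```
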
  26. Layout and Design
        • Reading order
            • Right to left
            • Left to right
            • Top to bottom
        • Word order
        • Cultural taboos (human images, for example)
  27. 3.5 extensions for triple the pain!
      HOW TO UNICODE WITH PHP
  28. Upgrade to at least 5.3
        • No, really, I’m entirely serious
        • If you’re not on 5.3 you’re not ready for unicode
        • At all
        • You have far bigger issues to deal with – like no security updates
        • (oh, and the extensions and apis you need either don’t exist or won’t work right)
  29. Install the bare minimum
        • intl extension (bundled since PHP 5.3)
        • mbstring (if you need zend_multibyte support or on-the-fly conversion, but most anything else it can do, intl does better)
        • iconv extension (optional but excellent for dealing with files)
        • pcre MUST have utf8 support (CHECK!)
  30. PHP strings 101
  31. C strings and encoding
        char – 1 byte (usually 8 bit)
        char * – a pointer to an array of chars stored in memory
        • Can handle Code Page encodings, although generally need special APIs for dealing with multibyte code pages
        • Usually null terminated… well unless it’s a binary string
        • Unix cleverly supports utf8 with apis
        • Windows … does not
  32. Introducing a new type
        wchar_t – C90 standard (horribly ambiguous)
        • Windows set it at 16 – and defined A and W versions of everything
        • Unix set it at 32
        C11 and C++11 add char16_t and char32_t to fix the craziness
        Non-portable and api support sketchy
        • Libraries to fix this exist
        • Few are cross-platform
        • Except for ICU – which just rocks
  33. Why do we care?
        • PHP talks ONLY to ansi apis on windows
        • PHP functions assume ascii or binary encodings (except for a few special ones)
        • Although most functions are now marked “binary safe” and don’t flip out on null bytes within a string, some still assume a null terminated string
        • string handling functions treat strings as a sequence of single-byte characters.
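
The single-byte assumption is easy to see with `strlen()` and `substr()`. This sketch uses only core PHP and PCRE with UTF-8 support; `isValidUtf8()` and `firstCodePoint()` are hypothetical helpers written for the example.

```php
<?php
// "Þormar" as UTF-8 bytes; Þ (U+00DE) takes two bytes: C3 9E.
$name = "\xC3\x9Eormar";

// Hypothetical helper: true when $s is well-formed UTF-8.
function isValidUtf8($s)
{
    return preg_match('//u', $s) === 1;
}

// Hypothetical helper: the first code point of $s, via PCRE's /u mode.
function firstCodePoint($s)
{
    return preg_match('/^./su', $s, $m) === 1 ? $m[0] : '';
}

$byteCut = substr($name, 0, 1);   // cuts Þ in half, leaving a lone C3 byte
$safeCut = firstCodePoint($name); // keeps both bytes of Þ

echo strlen($name), "\n";         // 7: bytes, not the 6 characters a reader sees
var_dump(isValidUtf8($byteCut));  // bool(false)
var_dump(isValidUtf8($safeCut));  // bool(true)
echo bin2hex($safeCut), "\n";     // c39e
```

The half-character left by `substr()` is exactly the kind of fragment that shows up as a question mark or replacement character in the browser.
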
  34. Non-stupid PHP functionality (kinda)
        • utf8_encode (only ISO-8859-1 to UTF8)
        • utf8_decode (only UTF8 to ISO-8859-1)
        • html_entity_decode
        • htmlentities
        • htmlspecialchars_decode
        • htmlspecialchars
  35. C locales or how to make servers cry
        • Setlocale is Per process
        • I will repeat that – setlocale sets PER PROCESS
        • Locales are slightly different on different OS’s
        • Windows does not support utf8 properly
  36. What setlocale will break
        • gettext extension
        • strtoupper
        • strtolower
        • number_format
        • money_format
        • ucfirst
        • ucwords
        • strftime
  37. INTL to the rescue!
        • Wrapper around the excellent ICU library
        • Standardized locales, set default locale per script
        • Number formatting
        • Currency formatting
        • Message formatting (replaces gettext)
        • Calendars, dates, timezones and time
        • Transliterator
        • Spoofchecker
        • Resource Bundles
        • Converters
        • IDN support
        • Graphemes
        • Collation
        • Iterators
  38. Some intl caveats
        • New stuff is only in newer PHP versions
        • All strings in and out must be UTF-8 except for UConverter
        • Intl doesn’t yet support zend_multibyte
        • Intl doesn’t support HTTP input/output conversion
        • Intl doesn’t support function “overloading”
  39. mbstring
        • enables zend_multibyte support
        • supports transparent http in and out encoding
        • provides some wrappers for functionality such as strtoupper (including overloading the php function version…)
  40. Iconv
        • Primarily for charset conversion
        • output buffer handler
        • mime encoding functionality
        • conversion
        • some string helpers
            • len
            • substr
            • strpos
            • strrpos
        • stream filter
            stream_filter_append($fp, 'convert.iconv.ISO-2022-JP/EUC-JP');
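
The stream filter call on this slide can be tried end to end. The sketch below swaps in ISO-8859-1 to UTF-8 (an assumption made so the result is easy to inspect) and reads a temporary file through the filter; `readConverted()` is a hypothetical helper that returns `null` when the iconv filters are not registered.

```php
<?php
// Hypothetical helper: read a file through the convert.iconv stream filter.
// Returns null when the iconv filters are not registered.
function readConverted($path, $from, $to)
{
    if (!in_array('convert.iconv.*', stream_get_filters(), true)) {
        return null;
    }
    $fp = fopen($path, 'rb'); // binary mode, so PHP leaves the bytes alone
    stream_filter_append($fp, "convert.iconv.$from/$to", STREAM_FILTER_READ);
    $contents = stream_get_contents($fp);
    fclose($fp);
    return $contents;
}

// Write "café" in ISO-8859-1 (é is the single byte E9) to a temporary file.
$path = tempnam(sys_get_temp_dir(), 'enc');
file_put_contents($path, "caf\xE9\n");

$utf8 = readConverted($path, 'ISO-8859-1', 'UTF-8');
unlink($path);

echo $utf8 === null ? "iconv filters unavailable\n" : bin2hex($utf8) . "\n";
```

Because the conversion happens as the stream is read, the same approach works for files too large to load and convert in one call.
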
  41. Stay away from:
        • ctype (all of it)
        • filter extension with string functionality
            • FILTER_VALIDATE_EMAIL
            • FILTER_VALIDATE_URL
            • FILTER_VALIDATE_REGEXP
            • FILTER_SANITIZE_*
        • some string functionality
            • str_pad
            • wordwrap
            • others that might work only by looking at single bytes
  42. What do you mean mysql is giving me garbage?
      BEYOND THE CODE
  43. Browser Considerations
        • Set Content-type AND charset
        • use HTTP headers AND meta tags (not just meta)
        • use accept-charset on forms to make sure your data is coming in right
        • Javascript: string literals, regular expression literals and any code unit can also be expressed via a Unicode escape sequence \uHHHH
        • Specify content-type AND charset headers for javascript!!
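
A minimal page that follows these points might look like the sketch below. `renderPage()` is a hypothetical helper and the form field name is made up; the header is only sent when no output has started yet.

```php
<?php
// Hypothetical helper: build a small page that declares UTF-8 in the markup
// and escapes user-supplied text with an explicit charset.
function renderPage($userText)
{
    $safe = htmlspecialchars($userText, ENT_QUOTES, 'UTF-8');

    return "<!DOCTYPE html>\n"
        . "<html><head><meta charset=\"UTF-8\"><title>Demo</title></head>\n"
        . "<body>\n"
        . "<form method=\"post\" accept-charset=\"UTF-8\">\n"
        . "<input type=\"text\" name=\"comment\" value=\"$safe\">\n"
        . "</form>\n"
        . "</body></html>\n";
}

// Declare the charset in the HTTP header as well, before any output is sent.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=UTF-8');
}

echo renderPage("\xC3\x9Eorbj\xC3\xB6rnsson <b>");
```

Passing `'UTF-8'` to `htmlspecialchars()` explicitly matters on older PHP versions, where the default charset for that function was not UTF-8.
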
  44. Databases
      Table/Schema encoding and connection
        • Mysql: you need to set the charset right on the table
          AND
        • Set the charset right on the connection (NOT set names, it does not do enough)
          AND
        • Don’t use the mysql extension – use mysqli or pdo
        • postgresql – pg_set_client_encoding
        • oracle – passed in the connect
        • sqlite(3) – make sure it was compiled with unicode and intl extension is available
        • sqlsrv/pdo_sqlsrv – CharacterSet in options
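
For MySQL, setting the charset on the connection could look like the following sketch. The helper names, host and database names are hypothetical, and `utf8mb4` is an assumption (the slide does not name a charset); nothing here connects unless you call the connect functions yourself.

```php
<?php
// Hypothetical helper: build a PDO MySQL DSN that names the connection charset.
function mysqlDsn($host, $dbname, $charset = 'utf8mb4')
{
    return "mysql:host=$host;dbname=$dbname;charset=$charset";
}

// Hypothetical helper: open a PDO connection with the charset set in the DSN.
function connectPdo($host, $dbname, $user, $password)
{
    return new PDO(mysqlDsn($host, $dbname), $user, $password, [
        PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
    ]);
}

// Hypothetical helper: the mysqli equivalent, using set_charset() rather than
// a "SET NAMES" query so the client library also knows the encoding.
function connectMysqli($host, $dbname, $user, $password)
{
    $db = new mysqli($host, $user, $password, $dbname);
    $db->set_charset('utf8mb4');
    return $db;
}

echo mysqlDsn('localhost', 'app'), "\n";
```

The table and column character sets still have to match; the connection setting only controls how bytes are interpreted on the wire.
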
  45. Other gotchas
        • Plain text is not plain text, files will have encodings
        • Files will be loaded as binary if you add the b flag to fopen (here’s a hint, always use the b flag)
        • You can convert files on the fly with the iconv filter
        • You cannot use unicode file names with PHP and windows at all (no, not even utf8) – unless you find a 3rd-party php extension
        • Beware of sending anything but ascii to exec, proc_open and other command line calls
  46. The best and worst in PHP apps
      CASE STUDIES
  47. Applications
        • Wordpress
            • gettext (sigh)
        • Drupal
            • gettext files but NOT gettext api
  48. Frameworks
        • ZF and ZF2
            • http://framework.zend.com/manual/1.12/en/performance.localization.html
            • multiple adapters
            • “gettext” allows using fast .po files, but doesn’t use setlocale/gettext extension
        • Symfony 1 and 2
            • http://symfony.com/doc/current/book/translation.html
            • multiple formats to hold translations
            • doesn’t use gettext
  49. Resources
        • http://www.joelonsoftware.com/articles/Unicode.html
        • http://unicode.org
        • http://www.slideshare.net/andreizm/the-good-the-bad-andthe-ugly-what-happened-to-unicode-and-php-6
        • http://php.net
        • http://www.2ality.com/2013/09/javascript-unicode.html
        • http://htmlpurifier.org/docs/enduser-utf8.html
  50. My Little Project
        • Get everything needed into intl from mbstring and iconv so you need only 1 solution
            • stream filter from iconv
            • output handler from iconv
            • zend_multibyte support from mbstring
            • http in and output conversion from mbstring
            • Some simplified apis to make “overloading” doable
  51. Contact
        • auroraeosrose@gmail.com
        • @auroraeosrose
        • http://emsmith.net
        • http://github.com/auroraeosrose
        • Freenode
            • #phpwomen
            • #phpmentoring
            • #php-gtk
