Multibyte string handling in PHP

Multibyte string handling in PHP with the mbstring extension

Published in: Technology
3 Comments
  • So, with PHP 5.6, you won't really need to set mbstring.internal_encoding with ini_set() [or mb_internal_encoding()]. Rather, mbstring will pick up the default_charset directive value. For more details see http://php.net/manual/en/ini.core.php#ini.default-charset
  • Note that PHP 5.6 deprecates a bunch of mbstring (and iconv) configuration directives to do with character encodings: http://php.net/manual/en/migration56.deprecated.php
  • Just to let you all know that this slideshow is quite old now so please check http://docs.php.net/manual/en/book.mbstring.php for the current state of play with PHP's mbstring.

Multibyte string handling in PHP with the mbstring extension
By Daniel Rhodes of Warp Asylum (www.warpasylum.co.uk). As seen on Zend.com!

What is mbstring for?
  • Multibyte string handling
  • Supports many character encodings, including Unicode
  • Supports some different national languages *
  • Character encoding conversion
  • Some Japanese-specific functions / settings

Mbstring is NOT...
  • A magic way to get the internals of the PHP interpreter itself to suddenly operate natively with Unicode (you'll have to wait and follow the development of PHP itself for that!)

How to get mbstring
  • It's a regular (but not “built-in”) extension for PHP
  • On most PHP servers it's already there, so...
  • ...just switch it on!
  • Present and switched on out-of-the-box in Zend Server (CE and upwards)
  • If it isn't present, download it; you shouldn't need to compile anything

Some key directives for mbstring
  • mbstring.internal_encoding
  • mbstring.language
  • See http://php.net/manual/en/mbstring.configuration.php

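
The two directives above also have runtime counterparts. The following is a minimal sketch, assuming UTF-8 is the encoding you want; mb_internal_encoding() and mb_language() are real mbstring functions, while the values 'UTF-8' and 'uni' are chosen purely for illustration.

```php
<?php
// Runtime counterpart of the mbstring.internal_encoding directive.
// 'UTF-8' is an assumption for this sketch.
mb_internal_encoding('UTF-8');

// Runtime counterpart of the mbstring.language directive.
// 'uni' selects the universal (UTF-8) profile.
mb_language('uni');

// Called without arguments, both functions report the current setting.
echo mb_internal_encoding(), "\n"; // UTF-8
echo mb_language(), "\n";
```

As the comments on this deck point out, from PHP 5.6 onwards mbstring can pick up default_charset instead, so check the current manual before relying on the directive itself.
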
Easy peasy in Zend Server

Enough now, let's rock and roll!
  • Mbstring gives us multibyte-safe versions of the “core” string handling functions
  • For example, we all know strlen()...
  • ...so let's have a look at mb_strlen()

mb_strlen()
More mb_strlen()
Even more mb_strlen()

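
The difference between strlen() and mb_strlen() can be sketched as follows. This is not the original slide code; it assumes the script file is saved as UTF-8, and the sample word is an arbitrary choice.

```php
<?php
// Three characters, nine bytes in UTF-8 (script file assumed to be UTF-8).
$word = '日本語';

echo strlen($word), "\n";             // 9: strlen() counts bytes
echo mb_strlen($word, 'UTF-8'), "\n"; // 3: mb_strlen() counts characters

// With the internal encoding set, the encoding argument can be omitted.
mb_internal_encoding('UTF-8');
echo mb_strlen($word), "\n";          // 3
```
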
Still rocking and rolling...
  • Mbstring gives us multibyte-safe versions of the “core” string handling functions
  • For example, we all know strpos()...
  • ...so let's have a look at mb_strpos()

mb_strpos()
More mb_strpos()

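
A similar sketch for strpos() versus mb_strpos(), again assuming a UTF-8 script file and an arbitrary sample string (not the original slide code):

```php
<?php
$text = '日本語のテキスト'; // UTF-8 assumed

// strpos() reports a byte offset: three 3-byte characters come first.
echo strpos($text, 'の'), "\n";             // 9

// mb_strpos() reports a character offset.
echo mb_strpos($text, 'の', 0, 'UTF-8'), "\n"; // 3

// Both return false when the needle is absent, so compare with ===.
var_dump(mb_strpos($text, 'X', 0, 'UTF-8') === false); // bool(true)
```
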
Wrapping up and moving on
  • Mbstring gives us multibyte-safe versions of the “core” string handling functions
  • There are LOTS of these multibyte-safe versions of “core” string handling functions, so please have a look
  • BE CAREFUL: you can make calls to strlen() (etc.) automatically call mb_strlen(); this is the mbstring.func_overload directive

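
A few more of the multibyte-safe counterparts, called explicitly rather than through mbstring.func_overload (that directive was deprecated in PHP 7.2 and removed in PHP 8.0, so explicit calls are the safer habit). This is a minimal sketch with arbitrary sample strings, assuming a UTF-8 script file.

```php
<?php
$word = '日本語'; // UTF-8 assumed

// substr() cuts by bytes and can split a character in half;
// mb_substr() cuts by characters.
echo mb_substr($word, 0, 2, 'UTF-8'), "\n";          // 日本

// Case conversion that understands non-ASCII letters.
echo mb_strtoupper('café', 'UTF-8'), "\n";           // CAFÉ

// Counting substring occurrences by characters.
echo mb_substr_count('日本語日本', '日本', 'UTF-8'), "\n"; // 2
```
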
Mbstring-specific functions
  Let's look at character encodings first:
  • mb_detect_encoding()
  • mb_convert_encoding()
  • LOTS of supported encodings (http://php.net/manual/en/mbstring.supported-encodings.php)
  • The mbstring.detect_order directive comes into play here

mb_detect_encoding()
mb_detect_order()
More mb_detect_order()

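
Encoding detection can be sketched like this. The candidate lists are assumptions chosen for illustration, and detection is only ever a best guess among the candidates you supply.

```php
<?php
$utf8 = '日本語'; // script file assumed to be UTF-8

// Explicit candidate list, strict mode on.
echo mb_detect_encoding($utf8, ['ASCII', 'UTF-8'], true), "\n"; // UTF-8

// mb_detect_order() is the runtime counterpart of mbstring.detect_order:
// it sets the default candidate list used when none is passed.
mb_detect_order(['ASCII', 'UTF-8']);
print_r(mb_detect_order());
echo mb_detect_encoding($utf8), "\n";

// When no candidate fits, the result is false rather than a name.
var_dump(mb_detect_encoding("\xFF\xFE\xFD", ['ASCII', 'UTF-8'], true)); // bool(false)
```
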
Mbstring-specific functions
  Still looking at character encodings...
  • mb_detect_encoding()
  • mb_convert_encoding()
  • LOTS of supported encodings (http://php.net/manual/en/mbstring.supported-encodings.php)
  • The mbstring.detect_order directive comes into play here

mb_convert_encoding()
More mb_convert_encoding()

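
A minimal round-trip sketch for mb_convert_encoding(), assuming a UTF-8 script file; the choice of SJIS as the target is just an example.

```php
<?php
$utf8 = '日本語'; // UTF-8 assumed

// Arguments are: string, target encoding, source encoding.
$sjis = mb_convert_encoding($utf8, 'SJIS', 'UTF-8');

echo strlen($utf8), "\n"; // 9 bytes in UTF-8
echo strlen($sjis), "\n"; // 6 bytes in SJIS

// Converting back gives the original bytes again.
$back = mb_convert_encoding($sjis, 'UTF-8', 'SJIS');
var_dump($back === $utf8); // bool(true)
```
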
Regular expressions on multibyte strings
  • mb_regex_encoding(), but note that the encodings supported for regex purposes are actually a SUBSET of those supported by mbstring itself!
  • mb_ereg()
  • mb_ereg_match()
  • mb_ereg_replace()
  • ...and many more!
  • Note: PHP's regular preg_*() functions can also “do” UTF-8 with the /u pattern modifier!

mb_ereg()
More mb_ereg()

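
The regex functions can be sketched as follows, with a preg_match() comparison showing what the /u modifier changes. The patterns and sample text are illustrative assumptions, and the script file is assumed to be UTF-8.

```php
<?php
mb_regex_encoding('UTF-8');
$text = '日本語のテキスト';

// mb_ereg() fills $regs with the whole match and any groups.
if (mb_ereg('(日本)語', $text, $regs)) {
    echo $regs[0], "\n"; // 日本語
    echo $regs[1], "\n"; // 日本
}

// Multibyte-aware replacement.
echo mb_ereg_replace('テキスト', '文章', $text), "\n"; // 日本語の文章

// Without /u, "." matches single bytes, so three characters (nine bytes) fail.
var_dump(preg_match('/^.{3}$/', '日本語'));  // int(0)
// With /u, "." matches whole UTF-8 characters.
var_dump(preg_match('/^.{3}$/u', '日本語')); // int(1)
```
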
Summary of mbstring functions
  • Directive-setting functions
  • Multibyte versions of regular string functions
  • Regex functions
  • Encoding detection / conversion
  • Japanese-specific functions / settings
  • Other miscellaneous stuff

Putting it all together
  • Mbstring gets PHP working with multibyte
  • BUT... don't forget your:
      • PHP script files (best to save them in the same encoding as mbstring.internal_encoding)
      • Database
      • Output (i.e. probably HTML)
      • Input (i.e. form submissions etc.)

Multibyting your database
  • Oracle: I'm no expert, but look at NCHAR as opposed to CHAR ('N' for 'national language')
  • PostgreSQL: I'm no expert, but IIRC Postgres automagically understands and converts input / output character encodings
  • MySQL: you can choose a “collation” for the server, each schema, each table, each column!
  • MySQL: collation means “charset + sort order” (for example, CS means a case-sensitive sort order)

More multibyting your database
  • MySQL: it's easiest to put everything on 'utf8_unicode_ci' or 'utf8_general_ci' (but note that these two collations differ when sorting and doing LIKE etc.! See http://forums.mysql.com/read.php?103,187048,188748#msg-188748)
  • You'll need to run an SQL query of SET NAMES utf8 and / or SET CHARACTER SET utf8 after connecting and before reading / writing (otherwise characters will become garbled)

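
One way the connection step might look with PDO is sketched below. The helper names, host, database name, and credentials are all hypothetical; only the PDO class, its exec() method, and the charset DSN option are real. The connect function is defined but not called here, since it needs a live MySQL server.

```php
<?php
// Hypothetical helper: builds a PDO MySQL DSN that names the connection charset.
function build_mysql_dsn(string $host, string $dbname, string $charset = 'utf8'): string
{
    return "mysql:host={$host};dbname={$dbname};charset={$charset}";
}

// Hypothetical helper: connects, then sets the connection charset
// before any reading / writing happens.
function connect_utf8(string $host, string $dbname, string $user, string $password): PDO
{
    $pdo = new PDO(build_mysql_dsn($host, $dbname), $user, $password);
    // The query from the slide; redundant with the DSN charset, but harmless.
    $pdo->exec('SET NAMES utf8');
    return $pdo;
}

echo build_mysql_dsn('localhost', 'example_db'), "\n";
// mysql:host=localhost;dbname=example_db;charset=utf8
```

Note that MySQL's utf8 charset stores at most three bytes per character; utf8mb4 is the variant that covers all of Unicode, so it may be the better choice on newer servers.
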
Multibyting your output HTML
  • For example, for UTF-8, we need to output this kind of HTTP header:
      Content-Type: text/html; charset=UTF-8
    i.e. header("Content-Type: text/html; charset=UTF-8");
  • It's possible, but less desirable, to output it as a meta tag in the HTML <head>:
      <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
    (or simply <meta charset="UTF-8"> for HTML5)
  • Don't forget lang="xy" or xml:lang="xy" where needed

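
Putting the header, the meta tag, and the lang attribute together might look like the sketch below. The render_page() helper and the sample strings are hypothetical; header(), headers_sent(), and htmlspecialchars() are standard PHP functions.

```php
<?php
// Hypothetical helper: renders a tiny HTML5 page as a UTF-8 string.
function render_page(string $title, string $body): string
{
    // Tell htmlspecialchars() which encoding the strings are in.
    $t = htmlspecialchars($title, ENT_QUOTES, 'UTF-8');
    $b = htmlspecialchars($body, ENT_QUOTES, 'UTF-8');

    return "<!DOCTYPE html>\n"
         . "<html lang=\"ja\">\n"
         . "<head>\n<meta charset=\"UTF-8\">\n<title>{$t}</title>\n</head>\n"
         . "<body><p>{$b}</p></body>\n"
         . "</html>\n";
}

// The HTTP header is the preferred place to declare the encoding.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=UTF-8');
}

echo render_page('日本語', 'こんにちは <世界>');
```
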
Multibyting your input
  • It's theoretically possible, but unusual, to have a <form> with a different encoding to its host page
  • Out-of-the-box, form data on a SJIS host page comes in as SJIS, form data on an EUC-JP host page comes in as EUC-JP, and so on
  • Or have I just been very lucky?
  • Look at the mbstring.http_input directive if you're struggling

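
If you'd rather check incoming data than trust the host page's encoding, a defensive sketch could look like this. The to_utf8() helper and the SJIS fallback are assumptions for illustration; mb_check_encoding() and mb_convert_encoding() are real mbstring functions.

```php
<?php
// Hypothetical helper: returns $input as UTF-8, converting from
// $fallback when the bytes are not already valid UTF-8.
function to_utf8(string $input, string $fallback = 'SJIS'): string
{
    if (mb_check_encoding($input, 'UTF-8')) {
        return $input;
    }
    return mb_convert_encoding($input, 'UTF-8', $fallback);
}

// Simulate a form field submitted from a SJIS page.
$sjisInput = mb_convert_encoding('日本語', 'SJIS', 'UTF-8');

var_dump(to_utf8('日本語') === '日本語');   // bool(true): already UTF-8
var_dump(to_utf8($sjisInput) === '日本語'); // bool(true): converted from SJIS
```
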
That's all folks!
  I'll leave you with some things to think about:
  • Iconv (a built-in extension) might be better if all you need is to detect / change encodings
  • Previous examples of preg_match() failing will probably work with the /u pattern modifier (to enable UTF-8)
  • There's no mb version of trim() or preg_match_all()
  • Mbstring in action: http://twitter.com/japxlate http://mapanese.info
  • Questions welcome at daniel.rhodes@warpasylum.co.uk

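
For the missing trim() and preg_match_all() counterparts, the /u modifier can fill the gap. This is a minimal sketch, assuming UTF-8 input; utf8_trim() is a hypothetical userland helper, not an mbstring function (newer PHP releases, from 8.4, ship their own mb_trim()).

```php
<?php
// Hypothetical helper: trims ASCII whitespace and Unicode separators
// (including the full-width space U+3000) from both ends.
function utf8_trim(string $s): string
{
    $out = preg_replace('/^[\p{Z}\s]+|[\p{Z}\s]+$/u', '', $s);
    // preg_replace() returns null on failure (e.g. invalid UTF-8).
    return $out === null ? $s : $out;
}

$padded = "\u{3000} 日本語 \u{3000}";

var_dump(trim($padded) === '日本語');      // bool(false): trim() ignores U+3000
var_dump(utf8_trim($padded) === '日本語'); // bool(true)

// preg_match_all() with /u counts whole characters, here the kanji only.
echo preg_match_all('/\p{Han}/u', '日本語のテキスト', $matches), "\n"; // 3
```
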
