Character sets and iconv

All about character sets, converting character sets with the iconv library and the iconv extension for PHP

Published in: Technology
4 Comments
  • So, with PHP 5.6, instead of setting the slide 13 directives with ini_set() [or with iconv_set_encoding()], iconv picks up the value from the default_charset directive as described here: http://php.net/manual/en/ini.core.php#ini.default-charset
  • Note that PHP 5.6 deprecates a bunch of iconv (and mbstring) configuration directives to do with character encodings: http://php.net/manual/en/migration56.deprecated.php
  • Just to let you know that this presentation is quite old now.
  • More food for thought: no out-of-the-box character set *detection*

Character sets and iconv
This presentation is about character sets and the iconv library (with usage examples in PHP). By Daniel Rhodes of Warp Asylum, http://www.warpasylum.co.uk

What is a character set?
  • A mapping: character x in human language y is stored as value z
  • Western European languages often use 8-bit ISO 8859-1
  • English is possible in 7-bit ASCII!
  • Some languages have complex or numerous characters and need 2, 3 or even 4 bytes to represent one character!
  • So, many different character sets exist
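
To make the "character x is value z" idea concrete, here is a minimal PHP sketch (not taken from the original slides) that encodes the same pound sign in two character sets and prints the resulting byte values. The variable names are illustrative only.

```php
<?php
// The pound sign (U+00A3) written out as its UTF-8 byte sequence.
$poundUtf8 = "\xC2\xA3";

// Re-encode the same character into ISO 8859-1.
$poundLatin1 = iconv('UTF-8', 'ISO-8859-1', $poundUtf8);

// Same character, different byte values in each character set.
$utf8Hex   = bin2hex($poundUtf8);   // "c2a3" (two bytes)
$latin1Hex = bin2hex($poundLatin1); // "a3"   (one byte)

echo "UTF-8: $utf8Hex, ISO 8859-1: $latin1Hex\n";
```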

More about character sets
  • Even the same language may have many different character sets
  • Character sets tend not to be compatible
  • So, conversion is necessary and useful
  • But Unicode is coming through as a modernising, unifying character set
  • Unicode is one HUGE character set that aims to represent any character from any language!

Character sets? Who cares!
  • Anglophones are very lucky, as everything seems to just work (even if different character sets are interacting in the background)
  • English is not the only language!
  • An app expecting character set x but getting y (or an incorrect character set conversion) will result in mojibake

Mojibake? What's that?
  • A great Japanese word for garbled text: moji (characters) + bake (transformed)
  • Often encountered in Japanese computing, which has two traditional character sets (commonly Shift_JIS and EUC-JP), Unicode, and a separate character set for emails (ISO-2022-JP)!
  • Shouldn't really happen at all in modern computing
  • But it still does, mostly due to lack of implementation knowledge

Mojibake in English
  • A slight case of mojibake here: the pound symbols (£) have been garbled

Mojibake in German
  • More severe now: the umlauted vowels (ä, ö and ü) have been garbled

Mojibake in Japanese
  • Ouch!
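
One common way mojibake arises can be reproduced in a few lines. The sketch below (an illustration, not the presenter's original example) takes correct UTF-8 bytes, wrongly treats them as ISO 8859-1, and converts them to UTF-8 again, so each byte of a multi-byte character turns into a separate character.

```php
<?php
// Correct UTF-8 bytes for "£5" and "für".
$price = "\xC2\xA3" . "5";
$word  = "f\xC3\xBCr";

// Wrongly declare the bytes to be ISO 8859-1 and convert them "to" UTF-8.
// Every byte of a multi-byte sequence becomes its own character.
$garbledPrice = iconv('ISO-8859-1', 'UTF-8', $price);
$garbledWord  = iconv('ISO-8859-1', 'UTF-8', $word);

echo $garbledPrice, "\n"; // Â£5
echo $garbledWord, "\n";  // fÃ¼r
```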

What is the iconv library?
  • An API to convert between character sets
  • Works on strings
  • Some support for transliteration (substituting characters from the source character set that don't exist in the target character set)
  • Your implementation may vary, but a HUGE number of character sets are supported

Some iconv use cases
  • Convert legacy character set ↔ Unicode
  • Convert backend ↔ frontend character sets
  • Convert a file's character set for import / export
  • Transliterate to remove unwanted characters
  • Transliterate to make safe for URL / filename
  • Let's look at some iconv usage examples in PHP...
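
As an illustration of the URL / filename use case, the following sketch defines a hypothetical `slugify()` helper (the name and the exact rules are assumptions, not part of iconv or of the original slides). It relies on //TRANSLIT, whose output for non-ASCII input depends on the iconv implementation and locale, so only the ASCII case has a predictable result.

```php
<?php
/**
 * Hypothetical helper: turn a UTF-8 string into an ASCII-only slug.
 * Returns an empty string if the conversion fails.
 */
function slugify(string $text): string
{
    // Approximate non-ASCII characters; the result depends on the iconv
    // implementation and the current locale.
    $ascii = @iconv('UTF-8', 'ASCII//TRANSLIT', $text);
    if ($ascii === false) {
        return '';
    }

    // Lower-case, then collapse every run of other characters into "-".
    $slug = preg_replace('/[^a-z0-9]+/', '-', strtolower($ascii));

    return trim($slug, '-');
}

echo slugify('Hello, World!'), "\n"; // hello-world
```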

What is PHP's iconv extension?
  • An interface to the iconv library
  • See http://uk.php.net/manual/en/book.iconv.php
  • The iconv library should already be on your OS
  • If not, you need to install it before using the PHP extension
  • See http://www.gnu.org/software/libiconv

iconv extension presence
  • phpinfo() will include an iconv section when the extension is present
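
Besides reading the phpinfo() page, the extension's presence can be checked from code. A minimal sketch, assuming nothing beyond the standard `extension_loaded()` function and the `ICONV_IMPL` / `ICONV_VERSION` constants that the extension defines:

```php
<?php
// Is the iconv extension compiled in or loaded?
$hasIconv = extension_loaded('iconv');

if ($hasIconv) {
    // ICONV_IMPL and ICONV_VERSION are constants defined by the extension.
    echo 'iconv implementation: ', ICONV_IMPL, "\n";
    echo 'iconv version: ', ICONV_VERSION, "\n";
} else {
    echo "The iconv extension is not available.\n";
}
```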

A few directives
  • iconv.input_encoding: currently unused
  • iconv.output_encoding: for ob_iconv_handler() [iconv handler for PHP's output buffering]
  • iconv.internal_encoding: for ob_iconv_handler(), iconv_mime_*() and iconv's string utility functions (which are present from PHP 5)
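
The values currently in effect can be inspected at runtime. This is a small sketch using `iconv_get_encoding()`; the final line reflects the note in the comments above that, from PHP 5.6, iconv falls back to `default_charset` when the iconv.* directives are left unset.

```php
<?php
// Encodings the iconv extension is currently using.
$encodings = iconv_get_encoding('all');

foreach ($encodings as $name => $value) {
    echo $name, ' = ', $value, "\n";
}

// From PHP 5.6 the iconv.* directives are deprecated; default_charset is
// used when they are left unset.
echo 'default_charset = ', ini_get('default_charset'), "\n";
```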

First play

Basic usage
  • iconv() is the conversion function
  • Pass it the input string's character set,
  • the desired output character set
  • and the input string
  • BUT within reason...
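
The argument order described above can be sketched as follows. This is an illustrative example rather than the presenter's original code; the sample string is an assumption.

```php
<?php
// "café" encoded in ISO 8859-1: the é is the single byte 0xE9.
$latin1 = "caf\xE9";

// Argument order: input character set, output character set, input string.
$utf8 = iconv('ISO-8859-1', 'UTF-8', $latin1);

echo bin2hex($latin1), "\n"; // 636166e9
echo bin2hex($utf8), "\n";   // 636166c3a9
```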

Within reason

Character mapping
  • You might not get every character from set x present in set y
  • So what to do if a character is absent? Bomb out and return an empty string?
  • NO! iconv gives us a few options
  • Let's look at transliteration first...
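
To see what happens when a character has no mapping and no option is given, consider this sketch. With common iconv builds, PHP's iconv() emits a notice and returns false in this situation, but behaviour can differ between implementations, so the code checks the result instead of assuming it.

```php
<?php
// The euro sign (U+20AC) in UTF-8; ISO 8859-1 has no euro sign.
$euro = "\xE2\x82\xAC";

// Without a suffix such as //TRANSLIT or //IGNORE, PHP's iconv() typically
// emits a notice and returns false here. "@" hides the notice in this sketch.
$result = @iconv('UTF-8', 'ISO-8859-1', "Price: 5 " . $euro);

if ($result === false) {
    echo "Conversion failed: a character has no mapping in the target set.\n";
} else {
    echo "Converted to ", strlen($result), " bytes.\n";
}
```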

First transliteration

Transliteration
  • Append //TRANSLIT to the output character set as passed to iconv()
  • Approximates characters not present in the output character set with the closest equivalent
  • The closest equivalent might simply be '?' for wildly different character sets

More realistic transliteration
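
A minimal sketch of //TRANSLIT in use (not the presenter's original code; the sample text is an assumption). Because the substitutions depend on the iconv implementation and locale, the example prints whatever the platform produces and does not claim a specific result.

```php
<?php
// UTF-8 input: "€5 for a café crème"
$input = "\xE2\x82\xAC5 for a caf\xC3\xA9 cr\xC3\xA8me";

// //TRANSLIT asks iconv to approximate characters missing from ASCII.
$ascii = @iconv('UTF-8', 'ASCII//TRANSLIT', $input);

// The exact substitutions (for example "EUR" or "?") are not fixed, so the
// sketch just prints whatever this platform produces.
var_dump($ascii);
```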

Ignore option
  • We can also append //IGNORE to the output character set as passed to iconv()
  • This will simply skip over any characters that are absent from the output character set

Ignore example
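
A minimal sketch of //IGNORE (again an illustration, with an assumed sample string). PHP may still raise a notice, and some builds return false instead of the shortened string, so the result is checked before use.

```php
<?php
// UTF-8 input: "naïve €"
$input = "na\xC3\xAFve \xE2\x82\xAC";

// //IGNORE asks iconv to drop characters missing from the target set.
// A notice may still be raised, and some builds return false instead of
// the shortened string, so check the result.
$ascii = @iconv('UTF-8', 'ASCII//IGNORE', $input);

if ($ascii === false) {
    echo "//IGNORE was not honoured by this iconv build.\n";
} else {
    echo $ascii, "\n"; // characters without an ASCII mapping are skipped
}
```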

Transliterate and ignore
  • You may (or may not!) be able to combine the //TRANSLIT and //IGNORE behaviours
  • This will transliterate the characters that can be transliterated and ignore the rest
  • Action it by appending //TRANSLIT//IGNORE to the output character set as passed to iconv()
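
Since the combined suffix may or may not be supported, one cautious approach is to try it first and fall back to the single options. The `convertLeniently()` helper below is hypothetical (its name and fallback order are assumptions made for this sketch).

```php
<?php
/**
 * Hypothetical helper: try each suffix in turn and return the first
 * conversion that succeeds, or null if none does.
 */
function convertLeniently(string $from, string $to, string $text): ?string
{
    foreach (['//TRANSLIT//IGNORE', '//TRANSLIT', '//IGNORE'] as $suffix) {
        $converted = @iconv($from, $to . $suffix, $text);
        if ($converted !== false) {
            return $converted;
        }
    }

    return null;
}

// UTF-8 input: "Ærøskøbing 5€"
var_dump(convertLeniently('UTF-8', 'ASCII', "\xC3\x86r\xC3\xB8sk\xC3\xB8bing 5\xE2\x82\xAC"));
```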

Output buffer handler
  • We also get a handler for PHP's output buffering
  • Allows us to, for example, output everything to the browser as ISO-8859-1 even though our PHP scripts etc. are using UTF-8
  • An automatic way to convert character sets for output without necessarily touching anything internally
  • Let's take a look...

ob_iconv_handler
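
A minimal sketch of the handler in use, assuming PHP 5.6 or later, where the core `internal_encoding` and `output_encoding` directives stand in for the deprecated iconv.* ones. On older versions the iconv.* directives from the earlier slide would be set instead. The sample text is an assumption.

```php
<?php
// Assumption: PHP 5.6+, where these core directives replace the deprecated
// iconv.internal_encoding and iconv.output_encoding.
ini_set('internal_encoding', 'UTF-8');      // encoding used by the script
ini_set('output_encoding', 'ISO-8859-1');   // encoding sent to the client

// Everything echoed from here on passes through the iconv handler.
ob_start('ob_iconv_handler');

echo "Gr\xC3\xBC\xC3\x9Fe aus K\xC3\xB6ln\n"; // UTF-8 for "Grüße aus Köln"

// Flush the buffer: the handler converts the text on the way out.
ob_end_flush();
```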

Utility functions
  • As of PHP 5, we also get some non-conversion utility functions
  • iconv_strlen()
  • iconv_strpos(), iconv_strrpos()
  • iconv_substr()
  • Character equivalents of core strlen(), strpos(), strrpos() and substr() [which are really byte functions]
  • Quite trivial, so we'll look at only one, iconv_strlen()...

iconv_strlen()
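
The byte versus character distinction can be sketched like this (an illustration with an assumed sample string, not the presenter's original code):

```php
<?php
// UTF-8 for "Zoë": three characters, four bytes.
$name = "Zo\xC3\xAB";

$bytes      = strlen($name);                // 4, counts bytes
$characters = iconv_strlen($name, 'UTF-8'); // 3, counts characters

// iconv_substr() likewise works in characters rather than bytes.
$lastChar = iconv_substr($name, 2, 1, 'UTF-8'); // "ë"

echo "$bytes bytes, $characters characters, last: $lastChar\n";
```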

Food for thought
  • Unicode is the character set of the future
  • The PHP iconv extension uses the system locale [setlocale()] for transliteration
  • The PHP iconv extension issues a notice even when //IGNORE is used
  • The iconv library has no mechanism for custom character maps
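
The locale dependence can be explored with a short experiment. The locale names below are assumptions and may not be installed on a given system; the sketch reports which ones could be set and prints whatever transliteration results, without predicting it.

```php
<?php
// UTF-8 for "Müller"
$name = "M\xC3\xBCller";

// Assumption: these locale names are installed; setlocale() returns false
// for any that are not, and the transliteration rules then stay unchanged.
foreach (['C', 'en_GB.UTF-8', 'de_DE.UTF-8'] as $locale) {
    $applied = setlocale(LC_CTYPE, $locale);
    $ascii   = @iconv('UTF-8', 'ASCII//TRANSLIT', $name);

    printf(
        "%-12s %-10s %s\n",
        $locale,
        $applied === false ? '(missing)' : '(set)',
        $ascii === false ? '(conversion failed)' : $ascii
    );
}
```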

Summary
  • The iconv library can be accessed on the command line
  • But there are also extensions for PHP (and many other languages!)
  • Many character sets are supported
  • All-or-nothing conversion, or softer transliteration

Links
  • You should be able to get a PHP source code pack from wherever you got this presentation
  • http://spin.atomicobject.com/2011/07/13/some-useful-iconv-functionality
  • http://developer.loftdigital.com/blog/php-utf-8-cheatsheet
  • http://blog.grayproductions.net/articles/encoding_conversion_with_iconv
  • http://czyborra.com/charsets/iso8859.html
