Character Encoding issue with PHP


Published in: Technology

  1. Character Encoding issue with PHP
       $customer = array(
           'id'    => 'á é í ó ú, ñ, Ñ',
           'name'  => 'Iñtërnâtiônàlizætiøn',
           'notes' => 'raviraj from infoEdge india Ltd.'
       );
       $var = "I ♥ Unicode, You ♥ Unicode.";
  2. Main problem with using Unicode
       • It is only partially supported by some parts of any given tool chain.
       • Sometimes it works great. At other times, because of a piece of software's missing implementation (or worse, a partial implementation), human error, or outright bugs, the chain's weakest link shatters in a non-spectacular way.
  3. Let's Take a Complex Case
       • Create file, edit file, commit file to svn; other developers edit the file and commit to svn.
       • A release is rolled from svn.
       • The visitor's browser requests a page; httpd parses the request and delivers it to PHP.
       • PHP processes the request.
       • PHP (client) calls a service to fulfill the back-end portions of the request, encoding the request in an envelope (we use JSON most of the time).
       • PHP (service) receives the request; the service retrieves and/or stores data in the database.
       • The service returns data to the PHP client.
       • The PHP client processes the returned data and delivers it to httpd, which returns it to the browser.
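
The JSON envelope step in the chain above is one place where encoding assumptions surface. The following is a minimal sketch, assuming PHP 7 or later with the bundled json extension and a source file saved as UTF-8; the variable names and the stray Latin-1 byte are made up for illustration.

```php
<?php
// Sample payload with non-ASCII characters (UTF-8 source file assumed).
$customer = ['name' => 'Iñtërnâtiônàlizætiøn'];

// By default json_encode() escapes non-ASCII characters as \uXXXX,
// so the envelope itself is plain ASCII.
$envelope = json_encode($customer);

// JSON_UNESCAPED_UNICODE keeps the UTF-8 bytes as they are.
$readable = json_encode($customer, JSON_UNESCAPED_UNICODE);

// Decoding the escaped form gives back the original UTF-8 string.
$roundTrip = json_decode($envelope, true);

// A lone 0xF1 byte is "ñ" in ISO-8859-1 but is not valid UTF-8.
$badInput = ['name' => "\xF1"];
$failed = json_encode($badInput);   // false: encoding is refused
$error = json_last_error();         // JSON_ERROR_UTF8
```

Checking the return value of json_encode() at this boundary is one way to catch a non-UTF-8 string before it travels further down the chain.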
  4. Let's Take a Complex Case (continued)
       Any (one or more!) of the following could fail when handling Unicode:
       • Development: developers' editors, developers' transport (either upload or version control)
       • Client side, on the way in: user's browser, user's HTTP proxy, client-side httpd, client-side PHP, client-side encoder (JSON)
       • Service side: service-side httpd (especially HTTP headers), service-side decoder, service-side PHP, service-side database client, database protocol character set mismatch, database table charset, database server, service-side encoder
       • Client side, on the way out: client-side decoder, client-side PHP (again), client-side httpd (including HTTP headers, again), user's proxy (again), and user's browser (again)
       I've probably even left some out.
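
One defensive habit that follows from this list is to validate strings wherever they cross one of these boundaries. The sketch below assumes the mbstring extension is loaded; `ensureUtf8()` is a hypothetical helper, and the fallback source encoding (ISO-8859-1) is a guess that will not be right for every system.

```php
<?php
// Hypothetical helper: return $input unchanged when it is already valid
// UTF-8, otherwise convert it from an ASSUMED legacy encoding.
function ensureUtf8(string $input, string $assumedSource = 'ISO-8859-1'): string
{
    if (mb_check_encoding($input, 'UTF-8')) {
        return $input;
    }
    // The source encoding cannot be detected reliably; this is a guess.
    return mb_convert_encoding($input, 'UTF-8', $assumedSource);
}

$alreadyUtf8 = ensureUtf8("\u{F1}"); // "ñ" as UTF-8, left untouched
$fromLatin1  = ensureUtf8("\xF1");   // "ñ" as one Latin-1 byte, converted
```

Rejecting invalid input outright is a reasonable alternative to converting it; which one fits depends on where in the chain the check sits.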
  5. Understand the Basics
       • A character is the smallest component of written language that has a semantic value. Examples of characters are letters, ideographs (e.g. Chinese characters), punctuation marks, and digits.
       • A character set is a group of characters without associated numerical values. Examples of character sets are the Latin alphabet and the Cyrillic alphabet.
       • Coded character sets are character sets in which each character is associated with a scalar value: a code point. For example, in ASCII, the uppercase letter “A” has the value 65. Examples of coded character sets are ASCII and Unicode. A coded character set is meant to be encoded, i.e. converted into a digital representation so that the characters can be serialized in files, databases, or strings. This is done through a character encoding scheme, or encoding. The encoding method maps each character value to a given sequence of bytes.
       • In many cases, the encoding is just a direct projection of the scalar values, and there is no real distinction between the coded character set and its serialized representation. For example, in ISO 8859-1 (Latin 1), the character “A” (code point 65) is encoded as the byte 0x41 (i.e. 65). In other cases, the encoding method is more complex. For example, in UTF-8, an encoding of Unicode, the character “á” (code point 225) is encoded as two bytes: 0xC3 and 0xA1.
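
The difference between a code point and its encoded bytes can be inspected directly. This is a small sketch, assuming PHP 7.2 or later with the mbstring extension; the variable names are illustrative only.

```php
<?php
// "\u{E1}" is the PHP 7+ escape for code point U+00E1 ("á");
// PHP stores the resulting string as UTF-8 bytes.
$aAcute = "\u{E1}";

$codePoint = mb_ord($aAcute, 'UTF-8');   // 225, the scalar value
$utf8Hex   = bin2hex($aAcute);           // "c3a1": two bytes in UTF-8

// The same character in ISO 8859-1 is a direct projection: one byte.
$latin1    = mb_convert_encoding($aAcute, 'ISO-8859-1', 'UTF-8');
$latin1Hex = bin2hex($latin1);           // "e1"

// "A" is the byte 0x41 in ASCII, Latin 1, and UTF-8 alike.
$asciiHex  = bin2hex('A');               // "41"
```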
  6. Unicode: Universal Character Set
       • UTF-8 is a multibyte 8-bit encoding in which each Unicode scalar value is mapped to a sequence of one to four bytes. One of the main advantages of UTF-8 is its compatibility with ASCII. If no extended characters are present, there is no difference between a document encoded in ASCII and one encoded in UTF-8.
       • One thing to take into consideration when using UTF-8 with PHP is that characters are represented with a varying number of bytes. Some PHP functions do not take this into account and will not work as expected.
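
To illustrate the byte-oriented behaviour mentioned above, here is a short sketch comparing `substr()` with its mbstring counterpart. It assumes PHP 7 or later with mbstring loaded; the sample word is arbitrary.

```php
<?php
// "ñandú", written with escapes so the bytes do not depend on the
// encoding of this source file.
$word = "\u{F1}and\u{FA}";

// substr() works on bytes: it cuts the two-byte "ñ" in half.
$firstByte = substr($word, 0, 1);
$stillValid = mb_check_encoding($firstByte, 'UTF-8'); // false

// mb_substr() works on characters when told the encoding.
$firstChar = mb_substr($word, 0, 1, 'UTF-8');         // "ñ"

// mb_strtoupper() also maps the accented letters.
$upper = mb_strtoupper($word, 'UTF-8');               // "ÑANDÚ"
```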
  7. PHP's Problem
       <?php
       echo strlen('Iñtërnâtiônàlizætiøn');
       ?>
       • It prints 27. That is because strlen counts bytes, not characters: the string, encoded as UTF-8, contains seven two-byte characters, and each of them is counted twice.
       • The correct answer is 20 characters!
       • So it is a good time to switch over to UTF-8 ...
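
A minimal sketch of the usual fix, assuming the mbstring extension is available and the source file is saved as UTF-8 with precomposed characters:

```php
<?php
$name = 'Iñtërnâtiônàlizætiøn';

// strlen() counts bytes: 13 ASCII letters + 7 two-byte characters.
$byteCount = strlen($name);

// mb_strlen() counts characters once it knows the encoding.
$charCount = mb_strlen($name, 'UTF-8');

echo $byteCount, ' bytes, ', $charCount, ' characters', "\n";
```

Passing the encoding explicitly avoids depending on the `mbstring.internal_encoding` or `default_charset` settings of a particular installation.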
  8. Why UTF-8?
       • First, it is an encoding of Unicode; second, it is backwards compatible with ASCII.
       • Character codes less than 128 (effectively, the ASCII repertoire) are presented “as such”, using one octet for each code (character). All other codes are presented, according to a relatively complicated method, so that one code (character) is presented as a sequence of two to four octets, each of which is in the range 128 - 255. This means that in a sequence of octets, octets in the range 0 - 127 (“bytes with most significant bit set to 0”) directly represent ASCII characters, whereas octets in the range 128 - 255 (“bytes with most significant bit set to 1”) are to be interpreted as parts of encoded presentations of characters.
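
The octet ranges described above can be read off the leading byte of each sequence. The helper below is a simplified sketch with a hypothetical name, `utf8SequenceLength()`; it only looks at the bit pattern of the first byte and does not reject every invalid lead byte (for example 0xC0 or 0xF5 and above).

```php
<?php
// Hypothetical helper: how many octets does the sequence starting with
// this lead byte occupy? Returns 0 for a continuation byte (10xxxxxx)
// or a byte that cannot start a sequence.
function utf8SequenceLength(int $leadByte): int
{
    if ($leadByte < 0x80) {
        return 1;                       // 0xxxxxxx: plain ASCII
    }
    if (($leadByte & 0xE0) === 0xC0) {
        return 2;                       // 110xxxxx
    }
    if (($leadByte & 0xF0) === 0xE0) {
        return 3;                       // 1110xxxx
    }
    if (($leadByte & 0xF8) === 0xF0) {
        return 4;                       // 11110xxx
    }
    return 0;
}

$asciiLen = utf8SequenceLength(ord('A'));            // 1
$enyeLen  = utf8SequenceLength(ord("\u{F1}"[0]));    // 2, "ñ"
$heartLen = utf8SequenceLength(ord("\u{2665}"[0]));  // 3, "♥"
$emojiLen = utf8SequenceLength(ord("\u{1F600}"[0])); // 4
$contLen  = utf8SequenceLength(ord("\u{F1}"[1]));    // 0, continuation
```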
  9. UTF-8 and CodeIgniter
       • HTML forms should declare UTF-8:
           <form accept-charset="utf-8" ...>
       • The HTML meta tag should declare UTF-8:
           <?php echo meta('Content-type', 'text/html; charset='.config_item('charset'), 'equiv'); ?>
       • Put this in index.php:
           header('Content-Type: text/html; charset=utf-8');
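
Outside CodeIgniter, the same three declarations (HTTP header, meta tag, form attribute) can be sketched in plain PHP. `renderUtf8Page()` is a hypothetical helper written for this illustration, not part of any framework.

```php
<?php
// Hypothetical helper that builds a small page declaring UTF-8 in the
// meta tag and on the form.
function renderUtf8Page(string $greeting): string
{
    // Tell htmlspecialchars() which encoding the input uses.
    $safe = htmlspecialchars($greeting, ENT_QUOTES, 'UTF-8');

    return "<!DOCTYPE html>\n"
        . "<html><head>\n"
        . "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\">\n"
        . "</head><body>\n"
        . "<p>" . $safe . "</p>\n"
        . "<form accept-charset=\"utf-8\" method=\"post\" action=\"\"></form>\n"
        . "</body></html>\n";
}

// The HTTP header must be sent before any output.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=utf-8');
}

$html = renderUtf8Page('I ♥ Unicode, You ♥ Unicode.');
echo $html;
```

When the HTTP header and the meta tag disagree, browsers generally give priority to the header, so keeping the two in sync matters.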
  10. UTF-8 & CodeIgniter
       • Change the config.php file:
           $config['charset'] = "UTF-8";
       • Configure the DB settings:
           $db['default']['char_set'] = "utf8";
           $db['default']['dbcollat'] = "utf8_unicode_ci";
  11. UTF-8 & CI
       ALTER DATABASE mydatabase
           DEFAULT CHARACTER SET utf8
           DEFAULT COLLATE utf8_general_ci;
       ALTER TABLE mytable
           DEFAULT CHARACTER SET utf8
           COLLATE utf8_general_ci;
  12. End ...
       • Universal Unicode support is a long battle.
       • I'm sure you are ready for it now :D
       • RIGHT ?? :-)
  13. THANKS