Character sets and iconv
 


All about character sets, converting character sets with the iconv library and the iconv extension for PHP


Usage Rights

© All Rights Reserved

  • More food for thought: no out-of-the-box character set *detection*

    Presentation Transcript

    • Character sets and iconv: this presentation is about character sets and the iconv library (with usage examples in PHP). By Daniel Rhodes of Warp Asylum, http://www.warpasylum.co.uk
    • What is a character set?
      • A mapping: character x in human language y is stored as value z
      • Western European languages often use 8-bit ISO 8859-1
      • English possible in 7-bit ASCII!
      • Some languages have complex / numerous characters and need 2, 3 or even 4 bytes to represent one character!
      • So, many different character sets exist
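To make the "character x is value z" idea concrete, here is a minimal sketch in Python (not the PHP used later in the deck). It encodes the same characters under several character sets and prints the resulting bytes. The sample characters and the helper name `encoded_bytes` are illustrative assumptions, not part of the original slides.

```python
# Hypothetical samples: each character is encoded under several character sets.
samples = {
    "A": ["ascii", "iso-8859-1", "utf-8"],   # English fits in 7-bit ASCII
    "é": ["iso-8859-1", "utf-8"],            # one byte vs. two bytes
    "日": ["shift_jis", "euc_jp", "utf-8"],   # two or three bytes
    "😀": ["utf-8"],                          # four bytes in UTF-8
}


def encoded_bytes(char: str, charset: str) -> bytes:
    """Return the bytes that represent `char` in the given character set."""
    return char.encode(charset)


for char, charsets in samples.items():
    for charset in charsets:
        raw = encoded_bytes(char, charset)
        print(f"{char!r} in {charset}: {raw.hex()} ({len(raw)} byte(s))")
```

The same character maps to different byte values, and different byte counts, depending on the character set, which is why data written in one set cannot simply be read as another.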
    • More about character sets
      • Even same language may have many different character sets
      • Character sets tend not to be compatible
      • So, conversion is necessary and useful
      • But Unicode is coming through as a modernising, unifying character set
      • Unicode is one HUGE character set that aims to represent every character from every language!
    • Character sets? Who cares!
      • Anglophones very lucky as everything seems to just work (even if in the background different character sets are interacting)
      • English is not the only language!
      • An app expecting character set x but getting y (or an incorrect character set conversion) will result in mojibake
    • Mojibake? What's that?
      • A great Japanese word meaning garbled (bake) characters (moji)
      • Often encountered in Japanese computing, which juggles two traditional character sets (Shift_JIS and EUC-JP), Unicode, and a separate character set for email (ISO-2022-JP)!
      • Shouldn't really happen at all in modern computing
      • But it still does, mostly due to lack of implementation knowledge
    • Mojibake in English
      • A slight case of mojibake here: the pound symbols (£) have been garbled
    • Mojibake in German
      • More severe now: the umlauted vowels (ä, ö and ü) have been garbled
    • Mojibake in Japanese
      • Ouch!
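The screenshots for these slides are not in the transcript, so the following Python sketch reproduces the effect under a simple assumption: text is written as UTF-8 but read back as a different character set. The helper name `garble` and the sample strings are hypothetical.

```python
def garble(text: str, written_as: str, read_as: str) -> str:
    """Encode text in one character set, then wrongly decode it as another."""
    # errors="replace" keeps the demo from raising on undecodable bytes.
    return text.encode(written_as).decode(read_as, errors="replace")


# English: the pound sign picks up a stray character.
print(garble("£5", "utf-8", "iso-8859-1"))    # Â£5

# German: each umlauted vowel turns into two wrong characters.
print(garble("für", "utf-8", "iso-8859-1"))   # fÃ¼r

# Japanese: every character is affected, so nothing readable survives.
print(garble("文字化け", "utf-8", "shift_jis"))
```

The more characters a language has outside ASCII, the worse the damage, which matches the progression from English to German to Japanese above.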
    • What is the iconv library?
      • API to convert between character sets
      • Works on strings
      • Some support for transliteration (changing / substituting characters in source character set that don't exist in target character set)
      • Your implementation may vary, but a HUGE number of character sets are supported
    • Some iconv use cases
      • Convert legacy character set ↔ Unicode
      • Convert backend ↔ frontend character sets
      • Convert file's character set for import / export
      • Transliterate to remove unwanted characters
      • Transliterate to make safe for URL / filename
      • Let's look at some iconv usage examples in PHP...
    • What is PHP's iconv extension?
      • Interface to iconv library
      • See http://uk.php.net/manual/en/book.iconv.php
      • iconv library should be on your OS
      • If not, you need to install it before using the PHP extension
      • See http://www.gnu.org/software/libiconv
    • iconv extension presence
      • phpinfo() will look something like:
    • A few directives
      • iconv.input_encoding – currently unused
      • iconv.output_encoding – for ob_iconv_handler() [iconv handler for PHP's output buffering]
      • iconv.internal_encoding – for ob_iconv_handler(), iconv_mime_*() and iconv's string utility functions (which are present from PHP 5)
    • First play
    • Basic usage
      • iconv() is the conversion function
      • Pass it the input string's character set,
      • the desired output character set
      • and the input string
      • BUT within reason...
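As a language-neutral illustration of the same call shape (input character set, output character set, input string), here is a minimal Python sketch. It uses Python's built-in codecs as a stand-in for the iconv library, so it is not the PHP API itself; the helper name `convert` is an assumption made for this example.

```python
def convert(data: bytes, from_charset: str, to_charset: str) -> bytes:
    """Decode bytes from one character set and re-encode them in another.

    Strict by design: raises UnicodeError if the input is not valid in
    from_charset, or if a character has no equivalent in to_charset.
    """
    return data.decode(from_charset).encode(to_charset)


utf8_price = "£5".encode("utf-8")
latin1_price = convert(utf8_price, "utf-8", "iso-8859-1")
print(latin1_price)  # b'\xa35'

# "Within reason": Japanese text has no representation in ISO 8859-1,
# so a strict conversion fails instead of silently producing garbage.
try:
    convert("日本語".encode("utf-8"), "utf-8", "iso-8859-1")
except UnicodeEncodeError as err:
    print("cannot convert:", err.reason)
```

A strict conversion like this is all-or-nothing: either every character maps across, or the whole call fails.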
    • Within reason
    • Character mapping
      • You might not get every character from set x present in set y
      • So what to do if character absent? Bomb out and return an empty string?
      • NO! iconv gives us a few options
      • Let's look at transliteration first...
    • First transliteration
    • Transliteration
      • Append //TRANSLIT to output character set as passed to iconv()
      • Approximates characters not present in output character set with closest equivalent
      • Closest equivalent might simply be '?' for wildly different character sets
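Python has no `//TRANSLIT` flag, so the sketch below swaps in a different technique to show the idea: characters missing from the target set are decomposed with Unicode NFKD, their accents are dropped, and anything still unmappable becomes '?'. This is only an approximation of the behaviour described above, not iconv's actual transliteration tables, and the helper name `approximate` is hypothetical.

```python
import unicodedata


def approximate(text: str, to_charset: str = "ascii") -> bytes:
    """Rough stand-in for transliteration, one character at a time.

    Works character by character, so it suits simple stateless target
    sets such as ASCII or ISO 8859-1.
    """
    out = bytearray()
    for ch in text:
        try:
            # Character exists in the target set: keep it unchanged.
            out += ch.encode(to_charset)
        except UnicodeEncodeError:
            # Otherwise strip accents; fall back to '?' if still unmappable.
            decomposed = unicodedata.normalize("NFKD", ch)
            base = "".join(c for c in decomposed if not unicodedata.combining(c))
            out += base.encode(to_charset, errors="replace")
    return bytes(out)


print(approximate("café über"))           # b'cafe uber'
print(approximate("café", "iso-8859-1"))  # b'caf\xe9'
print(approximate("日本"))                 # b'??'
```

The last line shows the "wildly different character sets" case: with no close equivalent available, every character degrades to '?'.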
    • More realistic transliteration
    • Ignore option
      • We can also append //IGNORE to the output character set as passed to iconv()
      • This will simply skip over any characters that are absent from the output character set
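The skipping behaviour can be sketched in Python with the `errors="ignore"` codec option, used here as an analogue of `//IGNORE` rather than the iconv flag itself. The helper name `convert_ignoring` is an assumption for illustration.

```python
def convert_ignoring(text: str, to_charset: str) -> bytes:
    """Encode text, silently dropping characters absent from to_charset."""
    return text.encode(to_charset, errors="ignore")


print(convert_ignoring("Tokyo (東京) 2020", "ascii"))  # b'Tokyo () 2020'
print(convert_ignoring("naïve", "ascii"))              # b'nave'
```

Nothing marks where characters were removed, so this option loses information without leaving a trace in the output.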
    • Ignore example
    • Transliterate and ignore
      • You may (or may not!) be able to combine the //TRANSLIT and //IGNORE behaviours
      • This will transliterate transliteratable characters and ignore the rest
      • Do this by appending //TRANSLIT//IGNORE to the output character set as passed to iconv()
    • Output buffer handler
      • We also get a handler for PHP's output buffering
      • Allows us to, for example, output everything to the browser as ISO-8859-1 though our PHP scripts etc are using UTF-8
      • An automatic way to convert character sets for output without necessarily touching anything internally
      • Let's take a look...
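The idea of converting on output, without touching the strings used internally, can be sketched in Python by wrapping a byte stream in a text layer that encodes everything written to it. This is an analogy to an output-buffer handler, not PHP's `ob_iconv_handler()` itself; the names `make_converting_output` and `browser` are hypothetical.

```python
import io


def make_converting_output(raw_stream, charset: str = "iso-8859-1"):
    """Wrap a byte stream so that all text written to it is encoded as charset."""
    # newline="" stops the wrapper from rewriting line endings;
    # errors="replace" substitutes '?' for characters the charset lacks.
    return io.TextIOWrapper(raw_stream, encoding=charset, errors="replace", newline="")


browser = io.BytesIO()                # stands in for the HTTP response body
out = make_converting_output(browser)

# Application code keeps working with ordinary (Unicode) strings.
out.write("Grüße! That will be £5.\n")
out.flush()

print(browser.getvalue())             # the bytes the client would receive
```

The conversion happens in one place, at the output boundary, so the rest of the application never needs to know which character set the client receives.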
    • ob_iconv_handler
    • Utility functions
      • As of PHP 5, we also get some non-conversion utility functions
      • iconv_strlen()
      • iconv_strpos(), iconv_strrpos()
      • iconv_substr()
      • Character equivalents of core strlen(), strpos(), strrpos() and substr() [which are really byte functions]
      • Quite trivial so we'll look only at one, iconv_strlen()...
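The difference between byte functions and character functions is easy to see in a short Python sketch, where `bytes` plays the role of PHP's byte-oriented strings and `str` plays the role of the character-aware iconv utilities. The helper name `char_length` and the sample text are assumptions for illustration.

```python
def char_length(data: bytes, charset: str) -> int:
    """Count characters, not bytes, in data encoded as charset."""
    return len(data.decode(charset))


text = "日本語 text"
raw = text.encode("utf-8")

print(len(raw))                     # 14 (bytes)
print(char_length(raw, "utf-8"))    # 8  (characters)

# Positions differ in the same way.
print(raw.find(b"text"))            # 10 (byte offset)
print(text.find("text"))            # 4  (character offset)

# Slicing by bytes can cut through the middle of a character.
print(raw[:3].decode("utf-8"))      # 日
print(text[:3])                     # 日本語
```

Length, position and substring all give different answers once a character can occupy more than one byte, which is why the character-aware variants exist.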
    • iconv_strlen()
    • Food for thought
      • Unicode is the character set of the future
      • PHP iconv extension uses system locale [setlocale()] for transliteration
      • PHP iconv extension issues a notice even when //IGNORE is used
      • iconv library has no mechanism for custom character maps
    • Summary
      • The iconv library can be accessed on the command line
      • But there are also extensions for PHP (and many other languages!)
      • Many character sets supported
      • All or nothing conversion or softer transliteration
    • Links
      • Should be able to get a PHP source code pack from wherever you got this presentation
      • http://spin.atomicobject.com/2011/07/13/some-useful-iconv-functionality
      • http://developer.loftdigital.com/blog/php-utf-8-cheatsheet
      • http://blog.grayproductions.net/articles/encoding_conversion_with_iconv
      • http://czyborra.com/charsets/iso8859.html