UTF-8: The Secret of Character Encoding

Published in: Technology

  1. UTF-8: The Secret of Character Encoding
  2. Bert? Web developer at Netlash. Dextro. http://dextrose.be http://twitter.com/dextro
  3. Some history... sorry. ASCII = 7 bits → 2^7 = 128 possibilities. 1 byte = 8 bits → 1 bit left over, used as a parity bit. Extended ASCII (ISO 8859). Asia: DBCS (double-byte character sets).
  4. Unicode. NOT: 1 char = 16 bits. A letter maps to a code point (e.g. U+0234). "Hello" = U+0048 U+0065 U+006C U+006C U+006F
  5. Unicode in memory. 00 48 00 65 00 6C 00 6C 00 6F (big endian). 48 00 65 00 6C 00 6C 00 6F 00 (little endian). Byte Order Mark: FF FE (little endian), FE FF (big endian).
  6. Ways of encoding Unicode. UCS-4 (UTF-32): big endian or little endian. UCS-2: big endian or little endian.
  7. Ways of encoding Unicode. UTF-16: 2 or 4 bytes. UTF-8: 1, 2, 3 or 4 bytes. UTF-7: for SMTP mail traffic.
  8. UTF-8. U+0000 to U+007F (0 to 127) → 1 byte. Above that → 2, 3 or 4 bytes (the original design allowed up to 6). ASCII → UTF-8 = no difference.
  9. UTF-8. Sorting: byte-oriented sorting = sorting by code point. Standard for XML (XHTML) documents. Easily recognized by an algorithm.
  10. Sidenote. What if a character is not known in the encoding?
  11. In practice. Which encoding to choose?
  12. Questions. Which characters am I going to use? In which encodings can my editor save files? Which encodings are supported by the various components in my publishing chain? Which encodings are supported by browsers?
  13. 1: Character range. Single language or multilanguage? (Curly) quotation marks, dashes and other special punctuation? Mathematical or other special symbols?
  14. 2: Text editor. Fixed or not? Zend Studio for Eclipse: ISO-8859-1, US-ASCII, UTF-16, UTF-16BE, UTF-16LE, UTF-8.
  15. 3: Other components. Web server, programming (or scripting) language, database, ...
  16. 4: Browser support. No problem: US-ASCII, ISO 8859 series and UTF-8. Avoid the others (and US-ASCII...).
  17. Character not available? Entity: &copy; &euml; &aacute;. NCR: &#169; or &#xA9;. Downsides: more bytes, difficult to read, SEO?
  18. Biggest problem: PHP5. Full support in PHP6 at the earliest.
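
The encoding slides (code points, byte order, byte order marks and UTF-8 byte lengths) can be checked with a few lines of Python. This is only a minimal sketch and not part of the original deck: the sample characters and all variable names are chosen here for illustration, and it assumes Python 3.8 or later for `bytes.hex(" ")`.

```python
import codecs

text = "Hello"

# Code points in the U+XXXX notation used on the slides.
code_points = [f"U+{ord(ch):04X}" for ch in text]

# The same string as UTF-16 in both byte orders (no BOM is added
# when the byte order is named explicitly).
utf16_be = text.encode("utf-16-be").hex(" ")
utf16_le = text.encode("utf-16-le").hex(" ")

# Byte order marks as defined in the standard library.
bom_be = codecs.BOM_UTF16_BE.hex(" ")
bom_le = codecs.BOM_UTF16_LE.hex(" ")

# UTF-8 length per character: A, e-acute, euro sign, an emoji.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]
utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]

# Pure ASCII text is byte-for-byte identical in UTF-8.
ascii_same = text.encode("ascii") == text.encode("utf-8")

# Sorting by raw UTF-8 bytes gives the same order as sorting by
# code point (Python compares str values by code point).
words = ["z", "\u00e9", "a", "\u20ac", "\U0001F600"]
byte_sorted = sorted(words, key=lambda w: w.encode("utf-8"))
sort_matches = byte_sorted == sorted(words)

print(code_points)
print("big endian:   ", utf16_be, "| BOM", bom_be)
print("little endian:", utf16_le, "| BOM", bom_le)
print(utf8_lengths, ascii_same, sort_matches)
```

Note that `hex()` prints lowercase digits, so `6C` from the slides appears as `6c`.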

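The fallback for characters that the target encoding cannot represent (named entities or numeric character references) can be sketched the same way. This is an illustration under assumptions, not the author's code: the sample string and variable names are made up here, and US-ASCII is picked as the restricted target encoding.

```python
from html.entities import codepoint2name

# Copyright sign, e-diaeresis, a-acute: none of them exist in US-ASCII.
text = "\u00a9 \u00eb \u00e1"

# Decimal numeric character references, produced by the codec's
# built-in "xmlcharrefreplace" error handler.
ncr_decimal = text.encode("ascii", "xmlcharrefreplace").decode("ascii")

# Hexadecimal numeric character references, built by hand.
ncr_hex = "".join(
    ch if ord(ch) < 128 else f"&#x{ord(ch):X};" for ch in text
)

# Named entities, where HTML defines a name for the code point.
named = "".join(
    f"&{codepoint2name[ord(ch)]};" if ord(ch) in codepoint2name else ch
    for ch in text
)

# The "more bytes" downside: compare against plain UTF-8.
utf8_size = len(text.encode("utf-8"))
ncr_size = len(ncr_decimal.encode("ascii"))

print(ncr_decimal)
print(ncr_hex)
print(named)
print(utf8_size, "bytes as UTF-8 versus", ncr_size, "bytes as NCRs")
```

Saving the page as UTF-8 avoids the escapes altogether, which is the direction the deck argues for.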