Character sets

Learn a bit about character sets. Aimed at the LAMP stack.

Published in: Technology

  1. 1. Character Sets Suck
  2. 2. About Us • Ligaya Turmelle: Senior Technical Support Engineer, MySQL Now (AKA Oracle), .15 decades • Raymond DeRoo: Data Architect, Nimbuzz BV, ~3 years
  3. 3. A very brief history of character sets
  4. 4. Character sets • Earliest form was Morse Code. • Baudot Code came next • ITA2 • ASCII and EBCDIC then came along in 1963
  8. 8. What exactly is a Character Set?
  9. 9. It is a set of symbols and encodings
  10. 10. Example • Q Y q y = the symbols • Q=1 Y=2 q=3 y=4 = the encoding
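The four-symbol example can be sketched in a few lines of Python. This is only an illustration of the idea, not anything from the talk: the names `TOY_CHARSET`, `toy_encode` and `toy_decode` are made up here, and the comparison with ASCII is added just to show that the same symbols get different numbers in a different character set.

```python
# Toy character set from the example: four symbols, each mapped to a number.
# All names in this block are hypothetical and exist only for illustration.
TOY_CHARSET = {"Q": 1, "Y": 2, "q": 3, "y": 4}
TOY_DECODE = {code: symbol for symbol, code in TOY_CHARSET.items()}


def toy_encode(text):
    """Turn a string of toy symbols into a list of code numbers."""
    return [TOY_CHARSET[ch] for ch in text]


def toy_decode(codes):
    """Turn a list of code numbers back into the string of symbols."""
    return "".join(TOY_DECODE[code] for code in codes)


encoded = toy_encode("QyYq")
print(encoded)              # [1, 4, 2, 3]
print(toy_decode(encoded))  # QyYq

# The same symbols are numbered differently in ASCII.
print([ord(ch) for ch in "QYqy"])  # [81, 89, 113, 121]
```

The symbols stay the same in both cases; only the numbers assigned to them change, which is why both ends of a conversation have to agree on the character set.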
  14. 14. In real life most character sets have many characters
  15. 15. Now that we know a bit about character sets - what the heck is a collation?
  16. 16. Collations deal with how we order the character set
  17. 17. Collations • Q=1 Y=2 q=3 y=4 • Q < Y < q < y
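As a rough sketch of what a collation does, the same four symbols can be sorted under two different ordering rules. The first rule compares the code numbers directly and gives Q < Y < q < y. The second is a hypothetical case-insensitive rule, added here for contrast and not taken from the talk. The helper names are assumptions for this illustration.

```python
# Same toy character set as in the example: symbol -> code number.
TOY_CHARSET = {"Q": 1, "Y": 2, "q": 3, "y": 4}


def by_code(symbol):
    """Collation 1: order symbols by their code number."""
    return TOY_CHARSET[symbol]


def ignoring_case(symbol):
    """Collation 2 (hypothetical): treat upper and lower case as equal."""
    return TOY_CHARSET[symbol.upper()]


symbols = ["y", "Q", "q", "Y"]

print(sorted(symbols, key=by_code))        # ['Q', 'Y', 'q', 'y']
# Python's sort is stable, so symbols that compare equal keep input order.
print(sorted(symbols, key=ignoring_case))  # ['Q', 'q', 'y', 'Y']
```

The character set is identical in both runs; only the ordering rule differs, which is the point of keeping character set and collation as separate settings.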
  19. 19. So the answer - or at least OUR answer - to Character Sets Sucking is...
  20. 20. Use Latin1 if you can - otherwise just use UTF8
  21. 21. What is UTF8?
  22. 22. A multi-byte (variable-width, 1 to 4 bytes per character) encoding for the Unicode character set.
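To make "multi-byte" concrete, here is a small Python sketch that encodes a few sample characters as UTF-8 and counts the bytes. The sample characters are chosen here for illustration and are not from the talk. The Latin1 comparison at the end is a guess at why the advice is "Latin1 if you can": it is one byte per character, but it covers far fewer characters.

```python
# Sample characters (chosen for illustration): A, e-acute, euro sign, an emoji.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

byte_lengths = []
for ch in samples:
    data = ch.encode("utf-8")
    byte_lengths.append(len(data))
    print(f"U+{ord(ch):04X} -> {len(data)} byte(s): {data.hex()}")

print(byte_lengths)  # [1, 2, 3, 4]

# Latin1 always uses one byte per character...
print(len("\u00e9".encode("latin-1")))  # 1

# ...but cannot represent characters outside its 256-entry table.
try:
    "\u20ac".encode("latin-1")
    latin1_has_euro = True
except UnicodeEncodeError:
    latin1_has_euro = False
print(latin1_has_euro)  # False
```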
  23. 23. Handling Character Sets
  24. 24. Be consistent
  25. 25. Check everything along the way
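A quick way to see why consistency matters is to write text in one encoding and read it back in another. The Python sketch below assumes a typical mismatch (UTF-8 bytes read as Latin1) with a made-up sample string. It shows the kind of garbage that results, and that the damage can be undone only if you know exactly which wrong step happened.

```python
# Hypothetical sample text containing one non-ASCII character (e-acute).
original = "caf\u00e9"

# One layer (say, the browser) sends UTF-8 bytes...
wire_bytes = original.encode("utf-8")
print(wire_bytes)  # b'caf\xc3\xa9'

# ...and another layer wrongly assumes the bytes are Latin1.
misread = wire_bytes.decode("latin-1")
print(misread == original)  # False
print(len(original), len(misread))  # 4 5

# Knowing the exact mistake, the text can be repaired by reversing it.
repaired = misread.encode("latin-1").decode("utf-8")
print(repaired == original)  # True
```

Every hop (form, web server, PHP, connection, table, column) is a place where this kind of mismatch can creep in, so each one is worth checking.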
  26. 26. Some specific Settings to consider
  27. 27. For the webpage • Content-Type: text/html; charset=utf-8 • Apache virtual host setting: AddDefaultCharset utf-8 • Web form: accept-charset
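One way to check the web page end is to look at the Content-Type header the server actually sends and see which charset it declares. The Python sketch below parses such a header value with the standard library's `email.message.Message`. The function names and the fallback charset are assumptions made for this example, not something the talk prescribes.

```python
from email.message import Message

# Assumed fallback when the header names no charset (a hypothetical choice).
FALLBACK_CHARSET = "iso-8859-1"


def charset_from_content_type(header_value, default=FALLBACK_CHARSET):
    """Return the charset named in a Content-Type header value, lowercased."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset() or default


def decode_body(body_bytes, header_value):
    """Decode a response body using the charset its header declares."""
    return body_bytes.decode(charset_from_content_type(header_value))


text = "na\u00efve caf\u00e9"
page = text.encode("utf-8")

print(charset_from_content_type("text/html; charset=utf-8"))  # utf-8
print(charset_from_content_type("text/html"))                 # iso-8859-1
print(decode_body(page, "text/html; charset=utf-8") == text)  # True
print(decode_body(page, "text/html") == text)                 # False
```

The last line shows what a missing charset costs: the same bytes decode to the wrong text once the reader has to fall back to a guess.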
  30. 30. PHP • Not here to discuss Unicode and PHP6 • mbstring • default_charset • SET NAMES
  34. 34. MySQL • Has fine-grained control if you really want it • Can handle character set *and* collation settings at 4 levels within MySQL (server, database, table, column) • Plus at the client/server connection level
  37. 37. MySQL (server level) • --character-set-server • --collation-server
  38. 38. MySQL (database level) • Can be set in the CREATE/ALTER DATABASE statement: CREATE DATABASE test CHARACTER SET utf8 COLLATE utf8_general_ci;
  39. 39. Character set (table and column level) • Save yourself the headache: don’t mix and match the values for these unless you must. • Mismatches between different table and column character sets in your queries can kill performance
  40. 40. SET NAMES • Again - save the headache and just always send it. • Sets the character set that the client will use to send SQL to the server, and that the server will use to send results back. • Equivalent to: SET character_set_client = X; SET character_set_results = X; SET character_set_connection = X;
  41. 41. Collation names explained • Conventions used in MySQL: • _ci = case insensitive • _cs = case sensitive • _bin = binary (compares the raw character codes)
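The practical difference between a _ci and a _bin collation can be approximated in Python. This is only an emulation with made-up sample words, not how MySQL implements collations: `str.casefold` stands in for case-insensitive comparison and the raw UTF-8 bytes stand in for binary comparison. A _cs collation, which is case sensitive but still follows language rules, is not covered by this sketch.

```python
# Hypothetical sample data for the comparison.
words = ["banana", "Apple", "apple", "Banana"]

# Rough stand-in for a _bin collation: compare the encoded bytes.
bin_order = sorted(words, key=lambda w: w.encode("utf-8"))
print(bin_order)  # ['Apple', 'Banana', 'apple', 'banana']

# Rough stand-in for a _ci collation: compare with case folded away.
# The sort is stable, so words that compare equal keep their input order.
ci_order = sorted(words, key=str.casefold)
print(ci_order)   # ['Apple', 'apple', 'banana', 'Banana']

# Equality follows the same rules, which matters for WHERE clauses
# and unique keys.
equal_bin = "Apple".encode("utf-8") == "apple".encode("utf-8")
equal_ci = "Apple".casefold() == "apple".casefold()
print(equal_bin, equal_ci)  # False True
```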
  42. 42. So to wrap this all up...
  43. 43. Use Latin1 if you can - otherwise just use UTF8
  44. 44. Questions?
  45. 45. References
      - http://codeworks.gnomedia.com/archives/2003/programming/a-short-history-of-character-sets/
      - http://en.wikipedia.org/wiki/Character_Set
      - http://en.wikipedia.org/wiki/Extended_Binary_Coded_Decimal_Interchange_Code
      - http://en.wikipedia.org/wiki/ASCII
      - http://en.wikipedia.org/wiki/Baudot_code
      - http://dev.mysql.com/doc/refman/5.0/en/charset-general.html
      - http://unicode.org/standard/WhatIsUnicode.html
      - http://en.wikipedia.org/wiki/UTF-8
      - http://httpd.apache.org/docs/2.0/mod/core.html
      - http://www.w3schools.com/TAGS/att_form_accept_charset.asp
      - http://us3.php.net/mbstring
      - http://php.net/manual/en/ini.core.php
      - http://dev.mysql.com/doc/refman/5.0/en/charset-connection.html
      - http://dev.mysql.com/doc/refman/5.0/en/charset-server.html
      - http://dev.mysql.com/doc/refman/5.0/en/charset-database.html
