Unicode and character sets

Published in: Technology

  1. Unicode and Character Sets
  2. The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
     - Joel Spolsky
     Co-founder of Stack Overflow
     Author of "More Joel on Software"
  3. A: what a person sees
     0100 0001: what the computer sees
  4. ASCII: 7-bit codes 0~127 (printable characters are 32~126), stored one per 8-bit octet
     ISO-8859-1, ISO-8859-2, ISO-8859-3, ... up to ISO-8859-16: 8-bit extensions of ASCII
     In ISO-8859-1, 0xC0 is À
     In ISO-8859-7, 0xC0 is ΐ
     The same octet has different meanings in different charsets!
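
The 0xC0 example can be checked directly. This is a minimal sketch that assumes the Encode module bundled with Perl; the variable names are only illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode);

# One octet, two charsets: the resulting code point depends on the charset name.
my $octet  = "\xC0";
my $latin1 = decode('ISO-8859-1', $octet);   # Western European
my $greek  = decode('ISO-8859-7', $octet);   # Greek

printf "ISO-8859-1: U+%04X\n", ord $latin1;  # U+00C0, the letter À
printf "ISO-8859-7: U+%04X\n", ord $greek;   # U+0390, the letter ΐ
```

The octet never changes; only the charset used to interpret it does.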
  5. Unicode
     Not a charset (encoding) by itself: it does not say how characters are stored as octets
     It assigns a code point to every character in the world's writing systems
     A -> U+0041
     http://www.unicode.org/charts/
  6. How is Unicode used in a computer?
  7. UCS-2 (the fixed-width predecessor of UTF-16)
     A -> U+0041 -> 0x00 0x41 (big-endian)
     PROS:
     Maps code points U+0000~U+FFFF directly to two octets
     CONS:
     Incompatible with ASCII
     Wastes memory when the code point is <= U+007F
     Cannot represent code points > U+FFFF (UTF-16 uses surrogate pairs for these)
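
The U+FFFF limit applies to UCS-2 as a fixed-width format. UTF-16 reaches higher code points by using two 16-bit units, a surrogate pair. The sketch below works through that arithmetic and cross-checks it against the Encode module bundled with Perl. U+1F600 is only an example code point, and the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode);

# U+1F600 lies above U+FFFF, so it does not fit in a single 16-bit unit.
my $cp = 0x1F600;

# Surrogate pair arithmetic as defined for UTF-16.
my $offset = $cp - 0x10000;
my $high   = 0xD800 + ($offset >> 10);     # high (lead) surrogate
my $low    = 0xDC00 + ($offset & 0x3FF);   # low (trail) surrogate
my $manual = sprintf '%04x%04x', $high, $low;

# Cross-check against the Encode module.
my $via_encode = unpack 'H*', encode('UTF-16BE', chr($cp));

print "manual: $manual\n";
print "Encode: $via_encode\n";
```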
  8. UCS-4 (UTF-32)
     A -> U+0041 -> 0x00 0x00 0x00 0x41 (big-endian)
     PROS:
     Maps every code point (U+0000~U+10FFFF) directly to four octets
     CONS:
     Incompatible with ASCII
     Wastes a lot of memory
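
The memory cost of each format can be compared by encoding the same characters several ways. This is a rough sketch that assumes the Encode module bundled with Perl; the `%octets` hash and its key format are made up for this illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

my %octets;   # "U+XXXX ENCODING" => hex string of the encoded octets

for my $cp (0x41, 0x795E) {                     # "A" and "神"
    for my $enc (qw(UTF-16BE UTF-32BE UTF-8)) {
        my $bytes = encode($enc, chr($cp));
        my $hex   = unpack 'H*', $bytes;
        $octets{ sprintf 'U+%04X %s', $cp, $enc } = $hex;
        printf "U+%04X %-8s %d octets: %s\n", $cp, $enc, length($bytes), $hex;
    }
}
```

For "A", UTF-8 needs one octet where UTF-32 needs four. For "神", UTF-8 needs three octets and UTF-16 needs two.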
  9. UTF-8
     Code point range    Octets
     0000 ~ 007F         0xxxxxxx
     0080 ~ 07FF         110xxxxx 10xxxxxx
     0800 ~ FFFF         1110xxxx 10xxxxxx 10xxxxxx
     10000 ~ 10FFFF      11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
     A => U+0041 => 1000001 => 01000001 => 0x41
     神 => U+795E => 01111001 01011110 => 11100111 10100101 10011110 => 0xE7 0xA5 0x9E
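
The bit layout above can be sketched as a small function. `utf8_octets` is a hypothetical helper name, not part of Perl or any module. It does no validation (surrogates and values above U+10FFFF are not rejected), and its four-octet branch follows the standard UTF-8 definition for code points above U+FFFF.

```perl
use strict;
use warnings;

# Hypothetical helper: return the UTF-8 octets (as integers) for one code point.
sub utf8_octets {
    my ($cp) = @_;
    return ($cp) if $cp <= 0x7F;                          # 0xxxxxxx
    return (0xC0 | ($cp >> 6),                            # 110xxxxx
            0x80 | ($cp & 0x3F)) if $cp <= 0x7FF;         # 10xxxxxx
    return (0xE0 | ($cp >> 12),                           # 1110xxxx
            0x80 | (($cp >> 6) & 0x3F),
            0x80 | ($cp & 0x3F)) if $cp <= 0xFFFF;
    return (0xF0 | ($cp >> 18),                           # 11110xxx
            0x80 | (($cp >> 12) & 0x3F),
            0x80 | (($cp >> 6) & 0x3F),
            0x80 | ($cp & 0x3F));
}

my $a_hex    = join ' ', map { sprintf '0x%02X', $_ } utf8_octets(0x0041);
my $shen_hex = join ' ', map { sprintf '0x%02X', $_ } utf8_octets(0x795E);

print "U+0041 => $a_hex\n";      # 0x41
print "U+795E => $shen_hex\n";   # 0xE7 0xA5 0x9E
```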
  10. UTF-8
      PROS:
      Compatible with ASCII
      Can map every code point to octets
      CONS:
      The algorithm is a little complicated
  11. It does not make sense to have a string without knowing what encoding it uses.
      - Joel Spolsky
      Programs communicate with each other through octet streams
      A sends B the octets E7 A5 9E E9 A9 AC 3F
      A should tell B that the octets are encoded in UTF-8.
      Then B can understand that the received message is "神马?"
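
What B recovers depends entirely on the charset B assumes. A minimal sketch, assuming the Encode module bundled with Perl; `code_points` is a hypothetical helper written for this example.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The seven octets that A sends.
my $octets = pack 'H*', 'e7a59ee9a9ac3f';

# Hypothetical helper: render a character string as a list of code points.
sub code_points {
    my ($str) = @_;
    return join ' ', map { sprintf 'U+%04X', ord } split //, $str;
}

my $as_utf8   = decode('UTF-8',      $octets);   # what A meant
my $as_latin1 = decode('ISO-8859-1', $octets);   # what B gets if it guesses wrong

print 'UTF-8:      ', code_points($as_utf8),   "\n";
print 'ISO-8859-1: ', code_points($as_latin1), "\n";
```

Decoded as UTF-8 the message is three characters. Decoded as ISO-8859-1 it is seven unrelated characters.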
  12. Charsets in Perl
  13. Two ways to get a string in Perl
      Literal string
      From I/O
      Literal string: depends on the encoding of your source code
      # encoding UTF-8
      my $a1 = "神马?";
      my $a2 = "\xE7\xA5\x9E\xE9\xA9\xAC\x3F";
      my $a3 = <FH>;
      Either way, Perl sees a string of 7 octets.
      Is it ISO-8859-1 or UTF-8? Perl cannot tell.
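
For the I/O case, one common approach is to declare the charset on the filehandle so that decoding happens while reading. The sketch below uses an in-memory filehandle in place of a real file so that it runs on its own; the variable names are illustrative.

```perl
use strict;
use warnings;

# Stands in for the contents of a UTF-8 encoded file on disk.
my $file_contents = "\xE7\xA5\x9E\xE9\xA9\xAC\x3F";

# Raw read: Perl hands back octets, one "character" per octet.
open my $raw_fh, '<:raw', \$file_contents or die "open failed: $!";
my $raw = <$raw_fh>;
close $raw_fh;

# Read through a decoding layer: Perl hands back characters.
open my $utf8_fh, '<:encoding(UTF-8)', \$file_contents or die "open failed: $!";
my $decoded = <$utf8_fh>;
close $utf8_fh;

printf "raw length: %d, decoded length: %d\n", length($raw), length($decoded);
```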
  14. By default, Perl treats it just as a sequence of octets
      # encoding UTF-8
      my $a1 = "神马?";
      print length($a1); # output is 7
      How to make Perl treat it as a sequence of characters? Use any one of:
      # encoding UTF-8
      use Encode;
      my $a1 = "神马?";
      $a1 = Encode::decode_utf8($a1);      # returns the decoded string
      $a1 = Encode::decode("utf8", $a1);   # same, with an explicit encoding name
      Encode::_utf8_on($a1);               # in place; internal function, does not validate the octets
      print length($a1); # output is 3
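
A version that runs regardless of how the source file is saved writes the octets as escapes. This sketch assumes the Encode module bundled with Perl and keeps the decoded result in a separate variable, since `decode` returns a new string.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The UTF-8 octets of "神马?", written as escapes so the source encoding does not matter.
my $octets = "\xE7\xA5\x9E\xE9\xA9\xAC\x3F";

# decode() returns a new character string; $octets is still the octet string.
my $chars = decode('UTF-8', $octets);

print length($octets), "\n";   # 7 (octets)
print length($chars),  "\n";   # 3 (characters)
```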
  15. What happens inside?
      The sequence of octets is decoded to code points as UTF-8 (or another charset)
      The code points are encoded into Perl's internal format (utf8)
      The string's UTF8 flag is turned ON
      Because the UTF8 flag is on, Perl treats the string as a sequence of characters
      UTF-8? utf8? UTF8?
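
The flag can be observed from Perl code with the built-in `utf8::is_utf8`. The sketch below is for inspection only; application code normally should not branch on the flag. It assumes the Encode module bundled with Perl.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $octets  = "\xE7\xA5\x9E";                # UTF-8 octets of "神"
my $chars   = decode('UTF-8', $octets);      # character string, flag on
my $encoded = encode('UTF-8', $chars);       # back to octets, flag off

for my $pair ([octets => $octets], [chars => $chars], [encoded => $encoded]) {
    my ($name, $str) = @$pair;
    printf "%-8s length %d, UTF8 flag %s\n",
        $name, length($str), utf8::is_utf8($str) ? 'on' : 'off';
}
```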
  16. UTF-8
      The standard encoding, designed by Ken Thompson and Rob Pike
      utf8
      Perl's internal encoding
      A lax superset of UTF-8
      UTF8
      The name of the flag that indicates whether Perl should treat the string as a sequence of characters
  17. More Examples
  18. # encoding UTF-8
      use Devel::Peek;   # Dump() writes to STDERR
      use Encode;
      Dump("神"); Dump("\xE7\xA5\x9E");
      FLAGS = (PADMY,POK,pPOK)
      PV = 0x16189d8 "\347\245\236"\0
      Dump("\x{795E}"); Dump(Encode::decode_utf8("\xE7\xA5\x9E"));
      FLAGS = (PADMY,POK,pPOK,UTF8)
      PV = 0x2e7478 "\347\245\236"\0 [UTF8 "\x{795e}"]
      Dump("\x{795E}" . "神");
      FLAGS = (PADMY,POK,pPOK,UTF8)
      PV = 0x2e74d8 "\347\245\236\303\247\302\245\302\236"\0
      [UTF8 "\x{795e}\x{e7}\x{a5}\x{9e}"]
      \303\247 = 11000011 10100111
      \x{e7} = 11100111
      The octet string is upgraded as if it were ISO-8859-1, so the octet 0xE7 becomes the two octets 0xC3 0xA7.
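
The last dump shows what happens when a decoded string is concatenated with an undecoded one. The same effect can be reproduced without Devel::Peek. This is a minimal sketch assuming the Encode module bundled with Perl; the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $octets = "\xE7\xA5\x9E";              # undecoded UTF-8 octets of "神"
my $chars  = decode('UTF-8', $octets);    # the single character U+795E

# Mixing the two: each octet is treated as one ISO-8859-1 character.
my $mixed = $chars . $octets;

# Encoding the mixed string shows the second "神" encoded twice.
my $hex = unpack 'H*', encode('UTF-8', $mixed);

print length($mixed), "\n";   # 4 characters: U+795E, U+00E7, U+00A5, U+009E
print "$hex\n";               # e7a59e followed by c3a7 c2a5 c29e
```

Decoding all input before combining strings, and encoding only on output, avoids this kind of double encoding.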
  19. Convert "神" from UTF-8 to GBK
      神 = E7 A5 9E (UTF-8 encoded), UTF8 flag = off
        | decode
      神 = U+795E (Unicode code point), stored internally as E7 A5 9E (utf8 encoded), UTF8 flag = on
        | encode
      神 = C9 F1 (GBK encoded), UTF8 flag = off
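
The decode and encode steps in this conversion might look like the following. This is a sketch that assumes the Encode module bundled with Perl and its `gbk` encoding; the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $utf8_octets = "\xE7\xA5\x9E";                   # "神", UTF-8 encoded

# Step 1: octets -> characters.
my $chars = decode('UTF-8', $utf8_octets);

# Step 2: characters -> octets in the target charset.
my $gbk_octets = encode('gbk', $chars);

printf "code point: U+%04X\n", ord $chars;          # U+795E
printf "GBK octets: %s\n", unpack 'H*', $gbk_octets;
```

`Encode::from_to` combines both steps in place for cases where the intermediate character string is not needed.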
  20. Charsets in MySQL
  21. Defaults are inherited: server -> database -> table
      CREATE TABLE XXX
      ……
      ……
      ……
      DEFAULT CHARSET = utf8
      (MySQL spells the charset utf8, not UTF-8; utf8mb4 is needed for code points above U+FFFF)
  22. SET NAMES X
      is equivalent to:
      SET CHARACTER_SET_CLIENT = X
      SET CHARACTER_SET_CONNECTION = X
      SET CHARACTER_SET_RESULTS = X
  23. Connection_charset = shiftJIS, data stored in MySQL as UTF-8
      Shell (UTF-8)
      Client_charset = UTF-8
      Query: UTF-8 -> shiftJIS (client to connection), then shiftJIS -> UTF-8 (connection to storage)
      Results_charset = UTF-8
      Result: UTF-8 <- UTF-8
      Perl (euc-jp)
      Client_charset = euc-jp
      Query: euc-jp -> shiftJIS (client to connection), then shiftJIS -> UTF-8 (connection to storage)
      Results_charset = euc-jp
      Result: euc-jp <- UTF-8
  24. Q & A
  25. Thank you!
