Unicode and character sets
Published in: Technology

Transcript

  • 1. Unicode and Character Sets
  • 2. The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
    - Joel Spolsky
    Co-founder of Stack Overflow
    Author of "More Joel on Software"
  • 3. A
    As a person sees it
    0100 0001
    As the computer sees it
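  • 4. ASCII: 7 bits, codes 0~127 (printable characters 32~126), stored in 8-bit octets
    ISO-8859-1, ISO-8859-2, ISO-8859-3, ... (numbered up to ISO-8859-16) use the spare upper half, 0x80~0xFF
    In ISO-8859-1, 0xC0 is À
    In ISO-8859-7, 0xC0 is ΐ
    The same octet has different meanings in different charsets!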
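
    The point of this slide can be checked with Python's standard codecs. This is a minimal sketch; the variable names are illustrative and only the code points are printed so that it runs on any terminal.

```python
# The single octet 0xC0, decoded under two different ISO-8859 parts.
octet = b"\xc0"

as_latin1 = octet.decode("iso8859_1")  # Western European part
as_greek = octet.decode("iso8859_7")   # Greek part

# Same octet, two different code points.
print(f"ISO-8859-1: U+{ord(as_latin1):04X}")
print(f"ISO-8859-7: U+{ord(as_greek):04X}")
```

    The first result is U+00C0 (À) and the second is U+0390 (ΐ), so the octet alone does not identify the character.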
  • 5. Unicode
    A character set, not an encoding
    Assigns a code point to every character in every writing system
    A -> U+0041
    http://www.unicode.org/charts/
  • 6. How to use Unicode in computer?
  • 7. UCS-2 (the fixed-width predecessor of UTF-16)
    A -> U+0041 -> 0x00 0x41
    PROS:
    Maps code points (U+0000~U+FFFF) directly to two octets
    CONS:
    Incompatible with ASCII
    Wastes memory when the code point is <= U+007F
    Cannot represent code points > U+FFFF (UTF-16 adds surrogate pairs for those)
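
    A short sketch of the two-octet mapping and of its U+FFFF ceiling, using Python's "utf-16-be" codec (big-endian, no byte order mark). The choice of U+1F600 as the sample code point above U+FFFF is an assumption made for illustration.

```python
# "A" (U+0041) as one big-endian 16-bit unit, matching the slide.
a_units = "A".encode("utf-16-be")

# U+1F600 lies above U+FFFF, so a single 16-bit unit cannot hold it.
# UTF-16 spends two units (a surrogate pair) on it instead.
high_units = "\U0001F600".encode("utf-16-be")

print(a_units.hex(" "))     # 00 41
print(high_units.hex(" "))  # d8 3d de 00
```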
  • 8. UCS-4 (UTF-32)
    A -> U+0041 -> 0x00 0x00 0x00 0x41
    PROS:
    Maps every code point (U+0000~U+10FFFF) directly to four octets
    CONS:
    Incompatible with ASCII
    Wastes a lot of memory
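
    To make the memory cost concrete, the sketch below encodes the same ASCII-only text three ways and counts octets. The sample text "Hello" is an assumption chosen for illustration.

```python
text = "Hello"  # five code points, all <= U+007F

# Octets needed for the same text under each encoding form.
sizes = {
    name: len(text.encode(name))
    for name in ("utf-8", "utf-16-be", "utf-32-be")
}

print(sizes)                              # {'utf-8': 5, 'utf-16-be': 10, 'utf-32-be': 20}
print("A".encode("utf-32-be").hex(" "))   # 00 00 00 41
```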
  • 9. UTF-8
    0000 ~ 007F      0xxxxxxx
    0080 ~ 07FF      110xxxxx 10xxxxxx
    0800 ~ FFFF      1110xxxx 10xxxxxx 10xxxxxx
    10000 ~ 10FFFF   11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    A => U+0041 => 1000001 => 01000001 => 0x41
    神 => U+795E => 01111001 01011110 =>
    11100111 10100101 10011110 => 0xE7 0xA5 0x9E
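
    The bit patterns above can be written out as a small encoder. This is a sketch for illustration, not a replacement for a real codec: the function name `utf8_encode` is hypothetical, surrogate code points are not rejected, and the four-octet range for code points above U+FFFF is included for completeness.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point by filling the x bits of the UTF-8 patterns."""
    if code_point <= 0x7F:
        return bytes([code_point])                       # 0xxxxxxx
    if code_point <= 0x7FF:
        return bytes([0xC0 | (code_point >> 6),          # 110xxxxx
                      0x80 | (code_point & 0x3F)])       # 10xxxxxx
    if code_point <= 0xFFFF:
        return bytes([0xE0 | (code_point >> 12),         # 1110xxxx
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0x10FFFF:
        return bytes([0xF0 | (code_point >> 18),         # 11110xxx
                      0x80 | ((code_point >> 12) & 0x3F),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    raise ValueError("code point out of Unicode range")


print(utf8_encode(0x0041).hex(" "))  # 41
print(utf8_encode(0x795E).hex(" "))  # e7 a5 9e
```

    Both results match the worked examples on the slide, and they agree with Python's built-in "utf-8" codec.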
  • 10. UTF-8
    PROS:
    Compatible with ASCII
    Can map every code point to octets
    CONS:
    The algorithm is a little complicated
  • 11. "It does not make sense to have a string without knowing what
    encoding it uses."
    - Joel Spolsky
    Programs communicate with each other through octet streams
    A sends B the octets E7 A5 9E E9 A9 AC 3F
    A should also tell B that the octets are encoded as UTF-8.
    Then B can understand that the received message is "神马?"
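
    A sketch of what happens on B's side, assuming B receives exactly the seven octets from the slide. Decoding with the declared charset recovers the message; guessing ISO-8859-1 instead succeeds without error but yields seven unrelated characters.

```python
received = bytes.fromhex("E7 A5 9E E9 A9 AC 3F")

# B trusts the declared charset.
as_utf8 = received.decode("utf-8")

# B guesses wrong: every octet is valid ISO-8859-1, so no error is raised.
as_latin1 = received.decode("iso8859_1")

print([f"U+{ord(c):04X}" for c in as_utf8])    # ['U+795E', 'U+9A6C', 'U+003F']
print([f"U+{ord(c):04X}" for c in as_latin1])
```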
  • 12. Charsets in Perl
  • 13. Two ways to get a string in Perl
    Literal string
    From I/O
    Literal string: depends on the encoding of your source code
    # source file saved as UTF-8
    my $a1 = "神马?";
    my $a2 = "\xE7\xA5\x9E\xE9\xA9\xAC\x3F";
    my $a3 = <FH>;
    Either way, in Perl's eyes it is a string of 7 octets.
    ISO-8859-1 or UTF-8?
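  • 14. By default, Perl treats it as just a sequence of octets
    # source file saved as UTF-8
    my $a1 = "神马?";
    print length($a1);    # output is 7
    How to make Perl treat it as a sequence of characters? Use any one of:
    # source file saved as UTF-8
    use Encode;
    my $a1 = "神马?";
    $a1 = Encode::decode_utf8($a1);        # option 1: returns the decoded string
    # $a1 = Encode::decode("utf8", $a1);   # option 2: same, with the encoding named
    # Encode::_utf8_on($a1);               # option 3: flips the flag in place, no validation
    print length($a1);    # output is 3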
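
    For readers who do not use Perl, a rough Python analogue of the same distinction: `bytes` plays the role of the octet string and `str` the role of the decoded character string. Python has no UTF8 flag, so this only mirrors the two lengths, not Perl's internals.

```python
octets = b"\xe7\xa5\x9e\xe9\xa9\xac\x3f"  # the same seven octets as $a1

# Decoding turns the octet sequence into a character sequence.
chars = octets.decode("utf-8")

print(len(octets))  # 7
print(len(chars))   # 3
```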
  • 15. What has happened inside?
    Decode the sequence of octets into code points, as UTF-8 (or another charset)
    Encode the code points into Perl's internal format (utf8)
    Turn the string's UTF8 flag ON
    Because the UTF8 flag is on, Perl treats the string as a sequence of characters
    UTF-8? utf8? UTF8?
  • 16. UTF-8
    The standard encoding designed by Ken Thompson and Rob Pike
    utf8
    Perl's internal encoding
    A superset of UTF-8
    UTF8
    The name of the flag that indicates whether
    Perl should treat the string as a sequence of characters
  • 17. More Examples
  • 18. # source file saved as UTF-8
    use Devel::Peek;
    use Encode;
    # Dump writes to STDERR itself, so no print is needed
    Dump("神"); Dump("\xE7\xA5\x9E");                                # octets, UTF8 flag off
    Dump("\x{795E}"); Dump(Encode::decode_utf8("\xE7\xA5\x9E"));    # characters, UTF8 flag on
    Dump("\x{795E}" . "神");                                        # mixing upgrades the octets as ISO-8859-1
    FLAGS = <PADMY,POK,Ppok>
    PV = 0x16189d8 "\347\245\236"\0
    FLAGS = <PADMY,POK,Ppok,UTF8>
    PV = 0x2e7478 "\347\245\236"\0 [UTF8 "\x{795e}"]
    FLAGS = <PADMY,POK,Ppok,UTF8>
    PV = 0x2e74d8 "\347\245\236\303\247\302\245\302\236"\0
    [UTF8 "\x{795e}\x{e7}\x{a5}\x{9e}"]
    \303\247 = 11000011 10100111
    \x{e7} = 11100111
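
    The last dump shows undecoded octets being treated as three ISO-8859-1 characters and then stored as UTF-8. Python refuses to concatenate `bytes` and `str` implicitly, so the sketch below performs that ISO-8859-1 step explicitly to reproduce the same octets. It is an analogue of the effect, not a description of Perl's implementation.

```python
raw = b"\xe7\xa5\x9e"                 # UTF-8 octets that were never decoded
upgraded = raw.decode("iso8859_1")    # each octet becomes its own character

combined = "\u795e" + upgraded        # one real character plus three stray ones
stored = combined.encode("utf-8")

print(len(combined))    # 4
print(stored.hex(" "))  # e7 a5 9e c3 a7 c2 a5 c2 9e
```

    The trailing six octets are the double-encoded form of the original three, which is why the character appears garbled when the result is later decoded as UTF-8.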
  • 19. Convert "神" from UTF-8 to GBK
    E7A59E (UTF-8 encoded), UTF8 flag = off
      -> decode ->
    U+795E (Unicode), stored as E7A59E (utf8 encoded), UTF8 flag = on
      -> encode ->
    C9F1 (GBK encoded), UTF8 flag = off
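
    The decode-then-encode pipeline on this slide, sketched with Python's standard codecs. The helper name `transcode` is hypothetical.

```python
def transcode(octets: bytes, source: str, target: str) -> bytes:
    """Decode octets to code points, then encode them in the target charset."""
    return octets.decode(source).encode(target)


gbk_octets = transcode(b"\xe7\xa5\x9e", "utf-8", "gbk")
print(gbk_octets.hex().upper())  # C9F1
```

    Going through code points is what makes the conversion possible: there is no direct octet-to-octet rule between the two charsets.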
  • 20. Charsets in MySQL
  • 21. Server -> database -> table
    CREATE TABLE XXX
    ……
    ……
    ……
    DEFAULT CHARSET = utf8
  • 22. SET NAMES X is equivalent to:
    SET CHARACTER_SET_CLIENT = X
    SET CHARACTER_SET_CONNECTION = X
    SET CHARACTER_SET_RESULTS = X
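  • 23. Connection_charset = shiftJIS, MySQL storage = UTF-8
    Shell (UTF-8): Client_charset = UTF-8, Results_charset = UTF-8
      request:  UTF-8 -> shiftJIS (client to connection), shiftJIS -> UTF-8 (connection to storage)
      response: UTF-8 <- UTF-8 (storage to results)
    Perl (euc-jp): Client_charset = euc-jp, Results_charset = euc-jp
      request:  euc-jp -> shiftJIS (client to connection), shiftJIS -> UTF-8 (connection to storage)
      response: euc-jp <- UTF-8 (storage to results)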
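
    The conversion chain for the euc-jp client can be simulated with Python codecs. This only imitates the sequence of conversions shown on the slide and does not talk to MySQL; the sample text and variable names are assumptions.

```python
text = "\u65e5\u672c\u8a9e"  # sample text: the word for "Japanese language"

# Request path: client charset -> connection charset -> storage charset.
sent_by_client = text.encode("euc_jp")
on_connection = sent_by_client.decode("euc_jp").encode("shift_jis")
stored = on_connection.decode("shift_jis").encode("utf-8")

# Response path: storage charset -> results charset.
returned = stored.decode("utf-8").encode("euc_jp")

print(sent_by_client.hex(" "))
print(stored.hex(" "))
print(returned == sent_by_client)  # True
```

    The round trip is lossless here because every character in the sample exists in all three charsets. A character missing from the connection charset would be lost at the middle step.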
  • 24. Q & A
  • 25. Thank you!