The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
- Joel Spolsky
Co-founder of Stack Overflow
Author of "More Joel on Software"
UCS-2 (the fixed-width forerunner of UTF-16)
A -> U+0041 -> 0x00 0x41 (big-endian)
PROS:
    Maps code points (U+0000~U+FFFF) directly to two octets
CONS:
    Incompatible with ASCII
    Wastes memory when the code point is <= U+007F
    Cannot represent code points > U+FFFF (UTF-16 adds surrogate pairs for those)
UCS-4 (UTF-32)
A -> U+0041 -> 0x00 0x00 0x00 0x41 (big-endian)
PROS:
    Maps every code point (U+0000~U+10FFFF) directly to four octets
CONS:
    Incompatible with ASCII
    Wastes a huge amount of memory
UTF-8
PROS:
    Compatible with ASCII
    Can map every code point to octets
CONS:
    The algorithm is a little complicated (variable width, 1 to 4 octets per code point)
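To make the three slides above concrete, here is a minimal sketch that encodes "A" and U+795E (神) with Perl's core Encode module. The helper name `hex_octets` is made up for this illustration, and big-endian byte order is assumed for the two fixed-width encodings.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: render a byte string as space-separated hex pairs.
sub hex_octets {
    my ($octets) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $octets;
}

my $shen = "\x{795E}";   # the character U+795E

for my $enc (qw(UTF-16BE UTF-32BE UTF-8)) {
    printf "%-8s A => %-11s  U+795E => %s\n",
        $enc,
        hex_octets(encode($enc, 'A')),
        hex_octets(encode($enc, $shen));
}
# UTF-16BE A => 00 41        U+795E => 79 5E
# UTF-32BE A => 00 00 00 41  U+795E => 00 00 79 5E
# UTF-8    A => 41           U+795E => E7 A5 9E
```

Only UTF-8 leaves "A" as the single ASCII octet 0x41, which is the compatibility point made in the PROS list.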
"It does not make sense to have a string without knowing what encoding it uses."
- Joel Spolsky
Software communicates through octet streams.
A sends B: E7 A5 9E E9 A9 AC 3F
A must also tell B that the octets are encoded as UTF-8. Only then can B understand that the received message is "神马?".
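A small sketch of why B needs to know the charset: the same seven octets are decoded once as UTF-8 and once as ISO-8859-1 with the core Encode module. The helper name `code_points` is invented for this example.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: list the code points of a character string.
sub code_points {
    my ($string) = @_;
    return join ' ', map { sprintf 'U+%04X', ord($_) } split //, $string;
}

my $octets = "\xE7\xA5\x9E\xE9\xA9\xAC\x3F";   # the seven octets A sends

my $as_utf8   = decode('UTF-8',      $octets);
my $as_latin1 = decode('ISO-8859-1', $octets);

printf "UTF-8:      %d characters: %s\n", length($as_utf8),   code_points($as_utf8);
printf "ISO-8859-1: %d characters: %s\n", length($as_latin1), code_points($as_latin1);
# UTF-8:      3 characters: U+795E U+9A6C U+003F
# ISO-8859-1: 7 characters: U+00E7 U+00A5 U+009E U+00E9 U+00A9 U+00AC U+003F
```

Both decodings succeed, so nothing in the octets themselves tells B which reading A intended.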
Two ways to get a string in Perl:
    A string literal
    From I/O
A string literal depends on the encoding of your source code.

    # source file saved as UTF-8
    my $a1 = "神马?";
    my $a2 = "\xE7\xA5\x9E\xE9\xA9\xAC\x3F";
    my $a3 = <FH>;

Either way, in Perl's eyes it is a string of 7 octets. ISO-8859-1 or UTF-8?
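For the I/O case, the sketch below reads the same octets twice through in-memory filehandles, once raw and once through a PerlIO `:encoding(UTF-8)` layer. The in-memory handle stands in for the slide's `FH`, and the layer is one possible way to decode on input, not something the slides prescribe.

```perl
use strict;
use warnings;

my $octets = "\xE7\xA5\x9E\xE9\xA9\xAC\x3F";   # stands in for a file's contents

# Raw read: Perl hands back the octets untouched.
open(my $raw_fh, '<:raw', \$octets) or die "open failed: $!";
my $raw_line = <$raw_fh>;
close $raw_fh;

# Read through an encoding layer: the octets are decoded to characters.
open(my $utf8_fh, '<:encoding(UTF-8)', \$octets) or die "open failed: $!";
my $decoded_line = <$utf8_fh>;
close $utf8_fh;

print length($raw_line), "\n";       # 7
print length($decoded_line), "\n";   # 3
```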
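By default, Perl treats it as just a sequence of octets.

    # source file saved as UTF-8
    my $a1 = "神马?";
    print length($a1);   # output is 7

How do we make Perl treat it as a sequence of characters? Pick one of the three options below. The decode functions return the decoded string, so the result must be assigned back.

    # source file saved as UTF-8
    use Encode;
    my $a1 = "神马?";
    $a1 = Encode::decode_utf8($a1);         # option 1
    # $a1 = Encode::decode("utf8", $a1);    # option 2, equivalent
    # Encode::_utf8_on($a1);                # option 3, internal: flips the flag in place without validating
    print length($a1);   # output is 3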
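A self-contained version of the slide above, as a sketch. It uses `\x` escapes instead of a Chinese literal so the result does not depend on how the script file itself is saved.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);

my $octets = "\xE7\xA5\x9E\xE9\xA9\xAC\x3F";   # "神马?" as UTF-8 octets
my $chars  = decode_utf8($octets);             # returns a new string; $octets is unchanged

print  "octet length:     ", length($octets), "\n";              # 7
print  "character length: ", length($chars),  "\n";              # 3
printf "first octet:      0x%02X\n", ord(substr($octets, 0, 1)); # 0xE7
printf "first character:  U+%04X\n", ord(substr($chars,  0, 1)); # U+795E
```

After decoding, `length` and `substr` count characters rather than octets.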
What has happened inside?
    1. Decode the sequence of octets to code points as UTF-8 (or another charset)
    2. Encode the code points in Perl's internal format (utf8)
    3. Turn the string's UTF8 flag ON
According to the UTF8 flag, Perl treats the string as a sequence of characters.
UTF-8? utf8? UTF8?
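The effect of those steps can be observed from Perl code. This sketch uses `utf8::is_utf8`, which is always available without loading a module, to inspect the flag before and after decoding.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $octets = "\xE7\xA5\x9E\xE9\xA9\xAC\x3F";
my $chars  = decode('UTF-8', $octets);

printf "octets: length=%d UTF8 flag=%s\n",
    length($octets), utf8::is_utf8($octets) ? 'on' : 'off';
printf "chars:  length=%d UTF8 flag=%s\n",
    length($chars),  utf8::is_utf8($chars)  ? 'on' : 'off';
# octets: length=7 UTF8 flag=off
# chars:  length=3 UTF8 flag=on
```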
UTF-8
    The standard encoding designed by Ken Thompson and Rob Pike
utf8
    Perl's internal encoding, a lax superset of UTF-8
UTF8
    The name of the flag that indicates whether Perl should treat a string as a sequence of characters
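The Encode module keeps the first two names apart as well. As a brief sketch, `find_encoding` shows which encoding object each label resolves to.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

for my $label ('UTF-8', 'utf8') {
    printf "%-5s resolves to %s\n", $label, find_encoding($label)->name;
}
# UTF-8 resolves to utf-8-strict
# utf8  resolves to utf8
```

So `decode("UTF-8", ...)` selects the strict standard encoding, while `decode("utf8", ...)` selects Perl's laxer variant.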