Overview of Character Encoding
Duy Lam – Dec 2010
Agenda
  • Character Encoding
  • Unicode
  • Encoding problem
Character encoding
Definition
Character Encoding (character set or charset, character map, or code page) is a system that specifies:
  • The set of codes (natural numbers or electrical pulses) that represent characters
  • How to persist characters (such as “hello”) onto disk as a sequence of bytes
Common Encodings
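As a minimal sketch of the second point, the snippet below turns the string “hello” into the bytes that would be written to disk. The variable names are illustrative only, and the two encodings chosen (ASCII and UTF-16 little-endian) are assumptions picked to show that the same characters can produce different byte sequences.

```python
# Sketch: the same text persisted under two different encodings.
text = "hello"

# ASCII stores each of these characters in one byte.
ascii_bytes = text.encode("ascii")
print(list(ascii_bytes))        # [104, 101, 108, 108, 111]

# UTF-16 (little-endian, no byte order mark) uses two bytes per character here.
utf16_bytes = text.encode("utf-16-le")
print(len(ascii_bytes), len(utf16_bytes))   # 5 10

# Decoding with the matching encoding recovers the original characters.
restored = utf16_bytes.decode("utf-16-le")
print(restored == text)         # True
```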
Unicode
Life was perfect
a = 01100001
ấ = ????????
ä = ????????
a ? ä
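The question marks on this slide can be reproduced with a short sketch: ASCII has a bit pattern for “a” but none for “ấ” or “ä”. The helper name `to_bits` is hypothetical, and Latin-1 (ISO-8859-1) is used here only as one example of a single-byte encoding that covers “ä” but still not “ấ”.

```python
# Sketch: which single-byte encodings can represent which characters.
def to_bits(text, encoding):
    """Return the encoded bytes of `text` as space-separated 8-bit strings."""
    return " ".join(format(b, "08b") for b in text.encode(encoding))

print(to_bits("a", "ascii"))          # 01100001
print(to_bits("\u00e4", "latin-1"))   # 11100100  (the character ä)

# ấ (U+1EA5) has no code in ASCII or Latin-1, so encoding fails.
failures = []
for encoding in ("ascii", "latin-1"):
    try:
        to_bits("\u1ea5", encoding)
    except UnicodeEncodeError:
        failures.append(encoding)
print(failures)                       # ['ascii', 'latin-1']
```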
Unicode
Unicode is a computing industry standard that aims to map every known character to a number (code point)
Unicode is one character set that can be encoded in several different ways. Common Unicode encoding methods (Unicode Transformation Format and Universal Character Set):
  • UTF-8 (one to four bytes): maximized compatibility with ASCII
  • UTF-16: variable-width encoding (one or two 16-bit code units); its predecessor UCS-2 is fixed-width and covers only the BMP
  • UTF-32 (UCS-4): fixed-width encoding
Unicode mapping table
Unicode charts
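A small sketch of how the three encodings differ in size for the same code point. The function name `encoded_lengths` and the sample characters are illustrative assumptions; the little-endian variants are used so that no byte order mark is added to the counts.

```python
# Sketch: bytes needed per character in UTF-8, UTF-16, and UTF-32.
def encoded_lengths(ch):
    """Return (utf8_len, utf16_len, utf32_len) in bytes for one character."""
    return tuple(len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le"))

print(encoded_lengths("a"))            # (1, 2, 4)  ASCII range
print(encoded_lengths("\u00e4"))       # (2, 2, 4)  ä, U+00E4
print(encoded_lengths("\u1ea5"))       # (3, 2, 4)  ấ, U+1EA5, inside the BMP
print(encoded_lengths("\U0001F600"))   # (4, 4, 4)  outside the BMP
```

UTF-32 is always four bytes, UTF-16 switches from one to two 16-bit code units once a character falls outside the BMP, and UTF-8 grows from one to four bytes as the code point increases.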
Encoding problem
Missing understanding
[Diagram labels: Application, UTF-16 encoding, UTF-8 encoding, UTF-8 encoding]
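One way to picture this problem is a sketch in which one side writes bytes in UTF-16 and the other side reads them as UTF-8. The helper `round_trip` is hypothetical and is not taken from the demo; `errors="replace"` is used so that undecodable bytes show up as U+FFFD instead of raising an exception.

```python
# Sketch: text written with one encoding and read back with another.
def round_trip(text, write_encoding, read_encoding):
    """Encode `text`, then decode the bytes with a possibly different encoding."""
    data = text.encode(write_encoding)
    return data.decode(read_encoding, errors="replace")

# Matching encodings: the text survives.
same = round_trip("hello", "utf-8", "utf-8")
print(same == "hello")      # True

# Mismatch: every UTF-16 code unit leaves a stray NUL character behind.
mixed = round_trip("hello", "utf-16-le", "utf-8")
print(repr(mixed))          # 'h\x00e\x00l\x00l\x00o\x00'

# Mismatch with a non-ASCII character: the bytes are not valid UTF-8 at all.
broken = round_trip("h\u00e4", "utf-16-le", "utf-8")
print("\ufffd" in broken)   # True
```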
Demo
End

Editor's Notes

  • #6 ASCII encoding specifies how to store English characters in a single byte each, using the values 0-127 and leaving 128-255 unused. Microsoft code pages were used in pre-Windows NT systems. (Windows NT is a family of operating systems produced by Microsoft, the first version of which was released in July 1993. NT was the first fully 32-bit version of Windows; Windows 3.1x and Windows 9x were 16-bit/32-bit hybrids. Windows 2000, Windows XP, Windows Server 2003, Windows Vista, Windows Home Server, Windows Server 2008 and Windows 7 are based on Windows NT, although they are not branded as Windows NT.) Reference: http://www.nadcomm.com/fiveunit/fiveunits.htm
  • #8 In the early days of computing, ASCII covered everything you would find on an English keyboard: letters in upper and lower case, numbers, and some common symbols. There was even some room left in the 128-character ASCII mapping for control characters. But the entire world can't get by on just these characters, so an encoding system was needed to cover the characters of other languages. In addition, many characters in the different existing Chinese, Japanese and Korean (CJK) character sets actually represent the same character, so an effort was needed to identify them.
  • #9 A character has to be stored in a computer as some number. Unicode tries to unify characters from different encodings that represent the same character. For instance, the A in ASCII, the A in ISO-8859-1, and the A in the Japanese encoding SHIFT-JIS all map to the same Unicode character. A character set and a character encoding aren't necessarily the same thing: Unicode is one character set and has multiple character encodings. UTF-8 is the most efficient for strings containing mostly ASCII characters (as in Western countries). UTF-8 and UTF-16 are approximately equivalent for strings containing mostly characters outside ASCII but inside the BMP (characters for almost all modern languages, and a large number of special characters): UTF-8 needs two or three bytes for each such character and UTF-16 needs two. For strings containing mostly characters outside the BMP, UTF-8, UTF-16, and UTF-32 are approximately equivalent, since each needs four bytes per character.
  • #10 Go to Start Menu > Accessories > System Tools > Character Map. Use this tool to show characters and their numbers in the Unicode charts.
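
The same lookup that the Character Map tool offers can be sketched in a few lines using Python's standard `unicodedata` module. The function name `describe` is illustrative, and the sample characters are the ones used on the earlier slide.

```python
# Sketch: show a character's Unicode code point and official name.
import unicodedata

def describe(ch):
    """Return the code point in U+XXXX form followed by the Unicode name."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

print(describe("a"))        # U+0061 LATIN SMALL LETTER A
print(describe("\u00e4"))   # U+00E4 LATIN SMALL LETTER A WITH DIAERESIS
print(describe("\u1ea5"))   # U+1EA5 LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE
```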