Overview of Character Encoding
Duy Lam – Dec 2010
Provides a summary of character encoding and Unicode for developers.
  • ASCII encoding specifies how to store English characters in a single byte each, using the values 0-127 and leaving 128-255 unused. Microsoft code pages were used in pre-Windows NT systems. (Windows NT is a family of operating systems produced by Microsoft, the first version of which was released in July 1993. NT was the first fully 32-bit version of Windows; Windows 3.1x and Windows 9x were 16-bit/32-bit hybrids. Windows 2000, Windows XP, Windows Server 2003, Windows Vista, Windows Home Server, Windows Server 2008 and Windows 7 are based on Windows NT, although they are not branded as Windows NT.) Reference: http://www.nadcomm.com/fiveunit/fiveunits.htm
  • At the beginning of the computer age, ASCII covered everything you would find on an English keyboard: letters in upper and lower case, numbers, and some common symbols. There was even some room left in the 128-character ASCII mapping for some control characters. But the entire world can't get by on just these characters, so an encoding system was needed to cover the characters of other languages. In addition, many characters in the different existing Chinese, Japanese and Korean (CJK) character sets actually represent the same character, so an effort was needed to identify them.
  • A character has to be stored in a computer as some number. Unicode tries to unify characters from different encodings that represent the same character. For instance, the A in ASCII, the A in ISO-8859-1, and the A in the Japanese encoding SHIFT-JIS all map to the same Unicode character. A character set and a character encoding aren't necessarily the same thing: Unicode is one character set, and it has multiple character encodings. UTF-8 is the most efficient for strings containing mostly ASCII characters (as in Western countries), because each ASCII character takes one byte. UTF-8 and UTF-16 are approximately equivalent for strings containing mostly characters outside ASCII but inside the BMP (the Basic Multilingual Plane, which holds characters for almost all modern languages and a large number of special characters): UTF-8 needs two or three bytes per character there, and UTF-16 needs two. For strings containing mostly characters outside the BMP, UTF-8, UTF-16, and UTF-32 are equivalent, since each needs four bytes per character.
  • Go to Start Menu > Accessories > System Tools > Character Map. Use this tool to show characters and their numbers (code points) in the Unicode charts.
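The size comparison in the notes can be checked with a few lines of Python. This is only a minimal sketch, not part of the original deck: the helper name `encoded_sizes` and the sample characters are chosen here for illustration, and the little-endian codec variants are used so that no byte order mark is counted.

```python
def encoded_sizes(text):
    """Return how many bytes `text` occupies in UTF-8, UTF-16 and UTF-32."""
    # The "-le" codec variants do not prepend a byte order mark,
    # so the counts reflect the characters only.
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


# Illustrative sample characters, written as escapes to keep this file ASCII-only.
samples = {
    "ASCII letter a": "a",
    "a with diaeresis": "\u00e4",
    "a with circumflex and acute": "\u1ea5",
    "character outside the BMP": "\U0001F600",
}

for label, ch in samples.items():
    code_point = ord(ch)  # the Unicode number assigned to the character
    print(f"{label}: U+{code_point:04X} -> {encoded_sizes(ch)}")

# The ASCII letter keeps its familiar 8-bit pattern.
print(format(ord("a"), "08b"))
```

For these samples, UTF-8 takes 1, 2, 3 and 4 bytes respectively, UTF-16 takes 2, 2, 2 and 4, and UTF-32 always takes 4. The last line prints 01100001, the bit pattern shown for "a" on slide 7.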

    1. Overview of Character Encoding
       Duy Lam – Dec 2010
    2. Agenda
       • Character Encoding
       • Unicode
       • Encoding problem
    3. Character encoding
    4. Definition
       Character Encoding (character set - charset, character map or code page) is a system to specify:
       • A set of codes (natural numbers or electrical pulses) that represent characters
       • How to persist characters (such as “hello”) onto disk as a sequence of bytes
    5. Common Encodings
    6. Unicode
    7. Life was perfect
       a = 01100001
       ấ = ????????
       ä = ????????
       a ?ä
    8. Unicode
       Unicode is a computing industry standard to map every known character to a number (code point).
       Unicode is one character set that can be encoded several different ways. Common Unicode encoding methods (Unicode Transformation Format and Universal Character Set):
       • UTF-8 (one to four bytes): maximized compatibility with ASCII
       • UTF-16 (extends the fixed-width UCS-2): variable-width encoding (one or two 16-bit code units)
       • UTF-32 (UCS-4): fixed-width encoding
    9. Unicode mapping table
       Unicode charts
    10. Encoding problem
    11. Application
        Missing understanding
        UTF-16 encoding
        UTF-8 encoding
        UTF-8 encoding
    12. Demo
    13. End
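Slide 11 only lists labels, so the following is one plausible reading of the "encoding problem" rather than the author's demo: bytes written in one encoding are read back in another. The helper `read_as` and the sample character are assumptions made for this sketch.

```python
def read_as(data, encoding):
    """Decode bytes with the given encoding, substituting U+FFFD for invalid input."""
    return data.decode(encoding, errors="replace")


text = "\u1ea5"  # the letter a with circumflex and acute, from slide 7

utf8_bytes = text.encode("utf-8")       # three bytes: E1 BA A5
utf16_bytes = text.encode("utf-16-le")  # two bytes: A5 1E

# Reader and writer agree on the encoding: the text round-trips.
same = read_as(utf8_bytes, "utf-8")

# UTF-8 bytes read as a single-byte Windows code page: every byte
# becomes its own (wrong) character.
as_cp1252 = read_as(utf8_bytes, "cp1252")

# UTF-16 bytes read as UTF-8: the first byte is not a valid UTF-8
# start byte, so the decoder substitutes the replacement character.
as_utf8 = read_as(utf16_bytes, "utf-8")

print(same == text)             # True
print(len(as_cp1252))           # 3 characters instead of 1
print("\ufffd" in as_utf8)      # True
```

The bytes on disk are identical in each case; only the reader's assumption about the encoding changes, which is why the encoding has to be stored or agreed on alongside the data.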
