Üɳîḉỗḋę
ᨐЉⰖ닖ぼຢഐဩᚠඐ༃ꘐ
Character Encoding
Maps characters to numbers that can be
represented in binary form.
Character Encoding Humor
Terms
• Repertoire
o Full set of abstract characters that a system supports
• Coded Character Set
o Assigns code points (integers) to characters
• Character Encoding Form
o Maps code points to code values that can be represented in binary in
a limited number of bits
• Character Encoding Scheme
o Maps code values to octets
In the beginning, there was ASCII.
ISO-8859-1 (Latin 1)
ISO-8859-2 (Central Europe)
ISO-8859-6 (Arabic)
History of Unicode
Not very interesting.
Fun Facts About Unicode
• 1.1 million code points, of which over 110,000 are currently assigned.
• Codespace is divided into 17 planes, each with 2^16 (65,536) code points.
o Basic Multilingual Plane
 Almost all modern languages
 Most code points are CJK
o Supplementary Multilingual Plane
 Historic scripts, hieroglyphs, emoji, card suit symbols, etc.
o Supplementary Ideographic Plane
 CJK
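The plane arithmetic can be checked with a small sketch. The function name `plane_of` and the constants are illustrative assumptions, not part of any library; the only fact relied on is that each plane holds 2^16 code points, so the plane number is the code point shifted right by 16 bits.

```python
PLANE_COUNT = 17
CODESPACE_SIZE = PLANE_COUNT * 2**16  # 1,114,112 code points in total


def plane_of(code_point: int) -> int:
    """Return the plane number (0 to 16) that contains the code point."""
    if not 0 <= code_point < CODESPACE_SIZE:
        raise ValueError("outside the Unicode codespace")
    # Each plane spans 2**16 code points, so the plane is the high bits.
    return code_point >> 16


print(plane_of(0x062C))   # plane 0: Basic Multilingual Plane
print(plane_of(0x1F600))  # plane 1: Supplementary Multilingual Plane
print(plane_of(0x20000))  # plane 2: Supplementary Ideographic Plane
```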
Basic Multilingual Plane
Character Mapping
ج is U+062C (ARABIC LETTER JEEM)
http://en.wikibooks.org/wiki/Unicode/Character_reference
pâté
U+0070 U+00E2 U+0074 U+00E9
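The mapping from characters to code points can be reproduced with Python's standard library. This is a minimal sketch; `describe` is a hypothetical helper name, and the string is written with escapes so that it is certain to contain the precomposed forms of â and é.

```python
import unicodedata


def describe(text: str) -> list[str]:
    """List each character's code point (U+XXXX) and its Unicode name."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


# "pâté" with precomposed â (U+00E2) and é (U+00E9)
for line in describe("p\u00e2t\u00e9"):
    print(line)

print(describe("\u062c"))  # the Arabic letter from the slide above
```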
UTF-32
• Every code point is represented with 32 bits
• Direct representation of a code point.
U+0070 U+00E2 U+0074 U+00E9
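As a quick illustration, here are the same four code points in UTF-32. The sketch assumes big-endian byte order (`utf-32-be`) so that no byte order mark is emitted; each code point shows up unchanged, padded to 4 octets.

```python
text = "p\u00e2t\u00e9"  # pâté

utf32_be = text.encode("utf-32-be")

# Group the hex dump into 4-octet code units, one per code point.
print(utf32_be.hex(" ", 4))  # 00000070 000000e2 00000074 000000e9
print(len(utf32_be))         # 16 octets for 4 characters
```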
UTF-16
• Code points in BMP are mapped to single
16-bit values
• Code points from other planes use surrogate
pairs
U+0070 U+00E2 U+0074 U+00E9
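The surrogate-pair rule can be sketched in a few lines. `utf16_units` is a hypothetical helper written for illustration: BMP code points map to one 16-bit value, while code points from other planes have 0x10000 subtracted and the remaining 20 bits split across a high and a low surrogate.

```python
def utf16_units(code_point: int) -> list[int]:
    """Return the UTF-16 code units (16-bit values) for one code point."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not an encodable code point")
    if code_point < 0x10000:
        return [code_point]  # BMP: a single 16-bit value
    offset = code_point - 0x10000        # 20 bits remain
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return [high, low]


# pâté stays at one unit per character.
print([f"{u:04X}" for ch in "p\u00e2t\u00e9" for u in utf16_units(ord(ch))])

# U+1F600 lies outside the BMP and needs a surrogate pair.
print([f"{u:04X}" for u in utf16_units(0x1F600)])  # ['D83D', 'DE00']
```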
UTF-8
• Highly recommended and widely adopted for
internet use.
• Uses 1 octet for ASCII characters, between
2 and 4 octets for other code points.
UTF-8 and ISO-8859-1
http://www.unicode.org/Public/MAPPINGS/ISO8859/8859-1.TXT
UTF-8 and ISO-8859-1 (cont.)
Characters in the Latin-1 Supplement (the upper half of
ISO-8859-1) take 2 octets each in UTF-8.
When UTF-8 data is interpreted as ISO-8859-1,
a Latin 1 supplement character will appear as
 or à followed by another character.
Here is my résumé (UTF-8) becomes Here is my rÃ©sumÃ© (read as ISO-8859-1)
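The effect is easy to reproduce. This sketch encodes the sentence as UTF-8 and then deliberately decodes the bytes with the wrong charset; the variable names are illustrative only.

```python
original = "Here is my r\u00e9sum\u00e9"  # "Here is my résumé"

utf8_bytes = original.encode("utf-8")      # é becomes the octets C3 A9
misread = utf8_bytes.decode("iso-8859-1")  # each octet read as one character

print(misread)  # Here is my rÃ©sumÃ©

# As long as nothing was lost along the way, the mistake is reversible.
repaired = misread.encode("iso-8859-1").decode("utf-8")
print(repaired == original)  # True
```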
Encoding UTF-8
1. The Unicode code point for "€" is U+20AC.
2. According to the scheme table above, this will take three bytes to encode, since it is between U+0800 and U+FFFF.
3. Hexadecimal 20AC is binary 0010000010101100. The two leading zeros are added because, as the scheme table shows,
a three-byte encoding needs exactly sixteen bits from the code point.
4. Because it is a three-byte encoding, the leading byte starts with three 1s, then a 0 (1110...)
5. The remaining bits of this byte are taken from the code point (11100010), leaving ...000010101100.
6. Each of the continuation bytes starts with 10 and takes six bits of the code point (so 10000010, then 10101100).
The three bytes 11100010 10000010 10101100 can be more concisely written in hexadecimal, as E2 82 AC.
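The steps above can be sketched in Python. This is a minimal illustration, not the code behind the link that follows; `utf8_encode` is a hypothetical name, and the result is checked against Python's built-in encoder.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode a single code point as UTF-8 by hand."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not an encodable code point")
    if code_point < 0x80:      # 1 octet: 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:     # 2 octets: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:   # 3 octets: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 4 octets: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


euro = utf8_encode(0x20AC)
print(euro.hex(" ").upper())                # E2 82 AC
print(euro == "\u20ac".encode("utf-8"))     # True
```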
https://github.com/dhumbert/Unicode/blob/master/utf8.c
UTF-8 URL Encoding
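A small sketch of what this looks like in practice, using `urllib.parse` from the Python standard library: each non-ASCII character is first encoded as UTF-8, and every resulting octet is then written as `%XX`.

```python
from urllib.parse import quote, unquote

encoded = quote("p\u00e2t\u00e9")  # quote() encodes as UTF-8 by default
print(encoded)                     # p%C3%A2t%C3%A9

decoded = unquote(encoded)         # unquote() decodes as UTF-8 by default
print(decoded == "p\u00e2t\u00e9")  # True
```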
Byte Order Mark
• U+FEFF
• Little endian vs. big endian
• Commonly used for UTF-16 and UTF-32
• Unnecessary in UTF-8
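A hedged sketch of BOM sniffing, using the BOM constants from Python's `codecs` module. `sniff_bom` is a hypothetical helper. The UTF-32 marks are checked first because the UTF-32 little-endian BOM (FF FE 00 00) begins with the UTF-16 little-endian BOM (FF FE); UTF-16 little-endian text that starts with a BOM followed by U+0000 would still be misidentified, so this is a heuristic only.

```python
import codecs
from typing import Optional

# Order matters: longer marks that share a prefix must be tried first.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_bom(data: bytes) -> Optional[str]:
    """Guess the encoding from a leading byte order mark, if any."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


sample = "\ufeffhi"  # U+FEFF followed by text
print(sample.encode("utf-16-be").hex(" "))  # fe ff 00 68 00 69
print(sample.encode("utf-16-le").hex(" "))  # ff fe 68 00 69 00
print(sniff_bom(sample.encode("utf-16-le")))
```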
Heuristics for Detecting Unicode
Collation
Unicode collation algorithm
http://www.unicode.org/reports/tr10/
Criticisms of Unicode
Too complex
Criticisms of Unicode
Inefficient compared to
single-byte encodings
Criticisms of Unicode
Klingon script not present

Unicode

Editor's Notes

  • #12 Work on Unicode began in 1987 with engineers from Xerox and Apple
  • #21 Unicode is a superset of ISO-8859-1. The code points of the Latin-1 Supplement are the same as the byte values in ISO-8859-1. However, these characters require 2 bytes to encode in UTF-8.
  • #22 The reason is that the first octet of the encoded form is 11000010 or 11000011 in binary, C2 or C3 in hexadecimal, which means  or à in ISO-8859-1. The second octet has "10" as the first 2 bits, so it would be interpreted as some Latin 1 Supplement character.
  • #25 Browser encodes into UTF-8
  • #26 UTF-8 byte order is always the same
  • #27 However, use of a BOM, especially for UTF-8, is discouraged by the Unicode Consortium
  • #29 But human writing is complex. For example, is a letter with a diacritic one character or two?
  • #31 Rejected in 2001 by representatives of Adobe, Apple, IBM, Microsoft, and Sun. Boycott appropriately