Unicode

Üɳîḉỗḋę
ᨐЉⰖ닖ぼຢഐဩᚠඐ༃ꘐ

Character Encoding
Maps characters to numbers that can be
represented in binary form.

Terms
• Repertoire
o Full set of abstract characters that a system supports
• Coded Character Set
o Assigns code points (integers) to characters
• Character Encoding Form
o Maps code points to code values that can be represented in binary in
a limited number of bits
• Character Encoding Scheme
o Maps code values to octets
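
The four terms above can be traced for a single character. This is a minimal sketch, assuming UTF-16 as the encoding form and U+1F600 as an arbitrary example character (neither choice comes from the slides):

```python
import struct

ch = "\U0001F600"  # example character, chosen only for illustration

# Coded character set: character -> code point (an integer)
code_point = ord(ch)

# Character encoding form (UTF-16 here): code point -> 16-bit code values
raw = ch.encode("utf-16-be")
code_values = list(struct.unpack(f">{len(raw) // 2}H", raw))

# Character encoding scheme: code values -> octets in a chosen byte order
octets_be = ch.encode("utf-16-be")
octets_le = ch.encode("utf-16-le")

print(f"U+{code_point:04X}")              # U+1F600
print([f"{v:04X}" for v in code_values])  # ['D83D', 'DE00']
print(octets_be.hex(" "))                 # d8 3d de 00
print(octets_le.hex(" "))                 # 3d d8 00 de
```

The code values are the same in both cases; only the octet order differs between the two schemes.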

In the beginning, there was ASCII.

History of Unicode
Not very interesting.

Fun Facts About Unicode
• 1.1 million code points, of which over 110,000 are currently assigned.
• Codespace is divided into 17 planes, each with 2^16 (65,536) code points.
o Basic Multilingual Plane
 Almost all modern languages
 Most code points are CJK
o Supplementary Multilingual Plane
 Historic scripts, hieroglyphs, emoji, card suit symbols, etc.
o Supplementary Ideographic Plane
 CJK
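
Because each plane holds 2^16 code points, the plane number is simply the code point shifted right by 16 bits. A small sketch (the helper name `plane_of` is made up for this example):

```python
def plane_of(code_point: int) -> int:
    """Return the plane number (0-16) that contains the given code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point is outside the Unicode codespace")
    return code_point >> 16  # each plane spans 2**16 code points

print(plane_of(ord("A")))  # 0  (Basic Multilingual Plane)
print(plane_of(0x1F600))   # 1  (Supplementary Multilingual Plane)
print(plane_of(0x20000))   # 2  (Supplementary Ideographic Plane)
print(17 * 2**16)          # 1114112 code points in total
```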

Character Mapping
ج is U+062C (ARABIC LETTER JEEM)
http://en.wikibooks.org/wiki/Unicode/Character_reference

pâté
U+0070 U+00E2 U+0074 U+00E9
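
The mapping above can be checked with Python's built-in `ord` and the standard `unicodedata` module. The word is written with escapes here so that the example does not depend on how the source file itself is encoded:

```python
import unicodedata

word = "p\u00e2t\u00e9"  # "pâté" in precomposed form

# Character -> code point, formatted in the usual U+XXXX notation
code_points = [f"U+{ord(ch):04X}" for ch in word]
print(" ".join(code_points))       # U+0070 U+00E2 U+0074 U+00E9

# Code point -> official character name
print(unicodedata.name("\u062c"))  # ARABIC LETTER JEEM
```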

UTF-32
• Every code point is represented with 32 bits
• Direct representation of a code point.
U+0070 U+00E2 U+0074 U+00E9
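
As a quick check, the same word can be encoded with Python's `utf-32-be` codec (big-endian, no byte order mark). Each code point occupies exactly four octets, and the value is the code point itself:

```python
word = "p\u00e2t\u00e9"  # "pâté"

utf32_be = word.encode("utf-32-be")

# Group the hex output four octets at a time, one group per code point
print(utf32_be.hex(" ", 4))  # 00000070 000000e2 00000074 000000e9
print(len(utf32_be))         # 16
```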

UTF-16
• Code points in the BMP are mapped to single
16-bit values
• Code points from other planes use surrogate
pairs (two 16-bit values)
U+0070 U+00E2 U+0074 U+00E9
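
A sketch of how a surrogate pair is computed for a code point outside the BMP. The helper name `to_surrogates` and the example character U+1F600 are assumptions made for illustration; the result is compared against Python's own `utf-16-be` codec:

```python
def to_surrogates(code_point: int) -> tuple[int, int]:
    """Split a supplementary-plane code point into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above the BMP need surrogates")
    offset = code_point - 0x10000     # 20-bit value
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return high, low

high, low = to_surrogates(0x1F600)
print(f"{high:04X} {low:04X}")                       # D83D DE00
print("\U0001F600".encode("utf-16-be").hex(" ", 2))  # d83d de00

# BMP characters need only one 16-bit value each
print("p\u00e2t\u00e9".encode("utf-16-be").hex(" ", 2))  # 0070 00e2 0074 00e9
```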

UTF-8
• Highly recommended and widely adopted for
internet use.
• Uses 1 octet for ASCII characters, between
2 and 4 octets for other code points.
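
The variable width is easy to observe. The four sample characters below are chosen for this example, one for each possible length:

```python
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]  # A, é, €, and an emoji

lengths = [len(ch.encode("utf-8")) for ch in samples]

for ch, n in zip(samples, lengths):
    print(f"U+{ord(ch):04X} -> {n} octet(s)")
# U+0041 -> 1 octet(s)
# U+00E9 -> 2 octet(s)
# U+20AC -> 3 octet(s)
# U+1F600 -> 4 octet(s)
```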

UTF-8 and ISO-8859-1
http://www.unicode.org/Public/MAPPINGS/ISO8859/8859-1.TXT

UTF-8 and ISO-8859-1 (cont.)
Characters in the Latin-1 Supplement (the upper half of
ISO-8859-1) are each encoded as 2 octets in UTF-8.
When UTF-8 data is interpreted as ISO-8859-1,
each Latin-1 Supplement character appears as
Â or Ã followed by another character.
Here is my résumé (UTF-8) becomes Here is my rÃ©sumÃ© (read as ISO-8859-1)
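
The garbling can be reproduced by encoding as UTF-8 and decoding as ISO-8859-1. As long as nothing else has altered the bytes, reversing the two steps recovers the original text:

```python
text = "Here is my r\u00e9sum\u00e9"  # "Here is my résumé"

# Wrong: UTF-8 octets read back as if they were ISO-8859-1
garbled = text.encode("utf-8").decode("iso-8859-1")
print(garbled)   # Here is my rÃ©sumÃ©

# Repair: undo the wrong decode, then decode correctly
repaired = garbled.encode("iso-8859-1").decode("utf-8")
print(repaired)  # Here is my résumé
```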

Encoding UTF-8
1. The Unicode code point for "€" is U+20AC.
2. According to the scheme table above, this will take three bytes to encode, since it is between U+0800 and U+FFFF.
3. Hexadecimal 20AC is binary 0010000010101100. The two leading zeros are added because, as the scheme table shows,
a three-byte encoding needs exactly sixteen bits from the code point.
4. Because it is a three-byte encoding, the leading byte starts with three 1s, then a 0 (1110...)
5. The remaining bits of this byte are taken from the code point (11100010), leaving ...000010101100.
6. Each of the continuation bytes starts with 10 and takes six bits of the code point (so 10000010, then 10101100).
The three bytes 11100010 10000010 10101100 can be more concisely written in hexadecimal, as E2 82 AC.
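
The steps above can be sketched as a small hand-written encoder. This is an illustrative Python version, not the linked C implementation; the function name `utf8_encode` is made up here. Its output is compared against Python's built-in codec:

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point as UTF-8 by explicit bit manipulation."""
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded in UTF-8")
    if cp < 0x80:        # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp < 0x800:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp <= 0x10FFFF:   # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("code point is outside the Unicode codespace")

euro = utf8_encode(0x20AC)
print(euro.hex(" ").upper())              # E2 82 AC
print(" ".join(f"{b:08b}" for b in euro)) # 11100010 10000010 10101100
print(euro == "\u20ac".encode("utf-8"))   # True
```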

https://github.com/dhumbert/Unicode/blob/master/utf8.c

Byte Order Mark
• U+FEFF
• Little endian vs. big endian
• Commonly used for UTF-16 and UTF-32
• Unnecessary in UTF-8
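
A sketch of BOM sniffing using the constants in Python's standard `codecs` module. The function name `sniff_bom` and the returned labels are assumptions for this example. UTF-32 is checked before UTF-16 because the UTF-32 little-endian BOM begins with the same two octets as the UTF-16 little-endian BOM:

```python
import codecs

# Longest BOMs first, so UTF-32-LE is not mistaken for UTF-16-LE
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(data: bytes):
    """Return an encoding label if data starts with a known BOM, else None."""
    for bom, label in BOMS:
        if data.startswith(bom):
            return label
    return None

# U+FEFF serialized in each byte order
print("\ufeff".encode("utf-16-be").hex())  # feff
print("\ufeff".encode("utf-16-le").hex())  # fffe

print(sniff_bom(codecs.BOM_UTF16_BE + "hi".encode("utf-16-be")))  # utf-16-be
print(sniff_bom(b"plain ASCII"))                                  # None
```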

Heuristics for Detecting Unicode
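
One common heuristic, offered here only as an example since the slide lists none, is to attempt a strict UTF-8 decode. Text in a single-byte encoding that contains non-ASCII characters rarely happens to form valid UTF-8 sequences, so a successful decode is decent (though not conclusive) evidence. The helper name `looks_like_utf8` is made up for this sketch:

```python
def looks_like_utf8(data: bytes) -> bool:
    """Heuristic: True if the bytes decode cleanly as strict UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

utf8_bytes = "r\u00e9sum\u00e9".encode("utf-8")          # b'r\xc3\xa9sum\xc3\xa9'
latin1_bytes = "r\u00e9sum\u00e9".encode("iso-8859-1")   # b'r\xe9sum\xe9'

print(looks_like_utf8(utf8_bytes))    # True
print(looks_like_utf8(latin1_bytes))  # False
print(looks_like_utf8(b"plain"))      # True (pure ASCII is also valid UTF-8)
```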

Collation
Unicode collation algorithm
http://www.unicode.org/reports/tr10/
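
To see why a collation algorithm is needed, compare plain code point order with a rough accent- and case-insensitive key. The key function below is a crude stand-in written for this example; it is not the Unicode collation algorithm and ignores language-specific rules:

```python
import unicodedata

words = ["zebra", "\u00c4pfel", "apple"]  # zebra, Äpfel, apple

def crude_key(s: str) -> str:
    """Strip combining marks and case; NOT a real UCA sort key."""
    decomposed = unicodedata.normalize("NFD", s)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.casefold()

by_code_point = sorted(words)
by_crude_key = sorted(words, key=crude_key)

print(by_code_point)  # ['apple', 'zebra', 'Äpfel']
print(by_crude_key)   # ['Äpfel', 'apple', 'zebra']
```

Sorting by raw code point puts Äpfel after zebra because U+00C4 is numerically greater than any ASCII letter.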

Criticisms of Unicode
Too complex

Inefficient compared to
single-byte encodings

Klingon script not present
