Notes on a Standard: Unicode

Elena-Oana Tabaranu
elena.tabaranu@info.uaic.ro
UAIC, Iasi

Plan
● Introduction
● Design Principles
● Code Points and Characters
● Encoding Forms, UTF-32, UTF-16, UTF-8
● Conclusion

Introduction
● Unicode = UNIversal character enCODing: a universal character encoding scheme for written characters and text
● Advantages
➢ Consistent way of encoding multilingual text
➢ Data stability instead of proliferating character sets
➢ Aims to encode all characters used in the world's written languages (the codespace holds more than 1 million code points)
➢ Creates a foundation for global software

Design Principles

Characters, not Glyphs
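● The Unicode Standard draws a distinction between characters and glyphs.
● Characters are the abstract representations of the smallest components of written language that have semantic value.
● Glyphs are the shapes that characters can have when they are rendered or displayed; Unicode encodes characters, not glyphs.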

Logical Order
● The order in which Unicode text is stored in the memory representation is called logical order
● The Unicode Standard includes characters to explicitly specify changes in direction when necessary
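
As a rough illustration of logical order, the Python sketch below stores mixed Latin and Hebrew text in the order it would be read or typed and looks up each character's bidirectional class with the standard `unicodedata` module. The sample string and the names `text`, `RLM` and `RLO` are assumptions chosen for this example, not part of the slides.

```python
import unicodedata

# Mixed-direction text kept in logical order: Latin "abc", a space,
# then the Hebrew letters alef, bet, gimel (sample string is an assumption).
text = "abc \u05d0\u05d1\u05d2"

for ch in text:
    # bidirectional() returns the Bidi_Class of the character ("L", "R", "WS", ...)
    print(f"U+{ord(ch):04X} {unicodedata.bidirectional(ch):>3} {unicodedata.name(ch)}")

# Explicit directional characters are ordinary code points stored in the text.
RLM = "\u200f"  # RIGHT-TO-LEFT MARK
RLO = "\u202e"  # RIGHT-TO-LEFT OVERRIDE
print(unicodedata.name(RLM), unicodedata.name(RLO), sep=" / ")
```

The stored sequence never changes; a renderer uses these classes to decide the visual order at display time.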

Code Points and Characters
● Abstract characters are encoded internally as numbers
● Codespace: 0 to 10FFFF (hexadecimal) => 1,114,112 code points available
● Abstract character -> code point
● Example: U+0061 LATIN SMALL LETTER A
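
The mapping from character to code point can be checked with a few lines of Python. This is a minimal sketch; the helper name `describe` and the constant `MAX_CODE_POINT` are made up for the example.

```python
import unicodedata

MAX_CODE_POINT = 0x10FFFF  # upper end of the Unicode codespace

def describe(ch):
    """Return the U+XXXX notation and the character name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

# The codespace 0..10FFFF holds 1,114,112 code points.
print(MAX_CODE_POINT + 1)
print(describe("a"))
```

`ord` gives the code point of a character and `chr` goes the other way, so `chr(0x61)` yields the same letter.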

Encoding Forms
● Encoding forms specify how each code point is to be expressed as a sequence of one or more code units (8-bit, 16-bit or 32-bit units)
● Encoding forms for Unicode characters: UTF-8, UTF-16, UTF-32
● Each form can be efficiently transformed into either of the other two without any loss of data
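
The sketch below, which relies only on Python's built-in codecs, counts the code units each encoding form needs for three sample characters and converts a string through all three forms and back. The helpers `code_units` and `round_trip` and the choice of samples are assumptions made for illustration.

```python
def code_units(text, encoding, unit_bytes):
    """Split the encoded bytes into code units, shown as uppercase hex strings."""
    data = text.encode(encoding)
    return [data[i:i + unit_bytes].hex().upper() for i in range(0, len(data), unit_bytes)]

def round_trip(text):
    """Convert UTF-8 -> UTF-16 -> UTF-32 and decode again."""
    as_utf8 = text.encode("utf-8")
    as_utf16 = as_utf8.decode("utf-8").encode("utf-16-be")
    as_utf32 = as_utf16.decode("utf-16-be").encode("utf-32-be")
    return as_utf32.decode("utf-32-be")

# Sample characters (assumed for the example): U+0061, U+20AC, U+10000
samples = ["\u0061", "\u20ac", "\U00010000"]
for ch in samples:
    print(f"U+{ord(ch):04X}",
          code_units(ch, "utf-8", 1),
          code_units(ch, "utf-16-be", 2),
          code_units(ch, "utf-32-be", 4))

# No data is lost when moving between the three forms.
print(round_trip("".join(samples)) == "".join(samples))
```

The big-endian codec names are used so that no byte order mark is prepended to the output.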

UTF-32
● The simplest Unicode encoding form
● Each Unicode code point is represented directly by a single 32-bit code unit (fixed-width)
● Restricted to representation of code points in the range 0..10FFFF (hexadecimal)
● Example: U+10000 is represented as <00010000>
● Preferred encoding form for processing characters on most Unix platforms
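
Because UTF-32 is fixed-width, the code unit is simply the code point written as a 32-bit number. A minimal sketch follows, assuming a hypothetical helper `utf32_code_unit`; it checks only the codespace range and does not reject surrogate code points.

```python
def utf32_code_unit(code_point):
    """Return the single UTF-32 code unit for a code point as 8 hex digits."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside the range 0..10FFFF")
    return f"{code_point:08X}"

print(utf32_code_unit(0x10000))

# Cross-check against Python's built-in big-endian UTF-32 codec.
print(chr(0x10000).encode("utf-32-be").hex().upper())
```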

UTF-16
● Code unit values often differ from the code point value, so conversion is required
● Variable-width encoding:
➢ U+0000..U+FFFF (excluding the surrogate range) are represented as a single 16-bit code unit
➢ U+10000..U+10FFFF are represented as pairs of 16-bit code units (surrogate pairs)
● Optimized for the BMP (Basic Multilingual Plane) = majority of common-use characters for all modern scripts of the world
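
The surrogate-pair conversion can be sketched as follows. The function name `utf16_code_units` is hypothetical, and the sketch assumes its input is a valid code point outside the surrogate range.

```python
def utf16_code_units(code_point):
    """Return the UTF-16 code units (as integers) for one code point."""
    if code_point < 0x10000:
        # BMP code points map to a single code unit with the same value.
        return [code_point]
    # Supplementary code points: subtract 0x10000, then split the remaining
    # 20 bits into a high (lead) and a low (trail) surrogate.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)
    low = 0xDC00 + (offset & 0x3FF)
    return [high, low]

for cp in (0x0061, 0x10000, 0x10FFFF):
    units = " ".join(f"{u:04X}" for u in utf16_code_units(cp))
    print(f"U+{cp:04X} -> {units}")
```

For U+10000 the offset is zero, so the pair is the first high surrogate followed by the first low surrogate.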

UTF-8
● UTF-8 encodes each character (code point) in 1 to 4 octets (8-bit bytes), with the single-octet encoding used only for the 128 US-ASCII characters
➢ U+0000 to U+007F → 1 byte
➢ above → 2, 3 or 4 bytes
● Backwards compatible with ASCII
● Default encoding for XML (and XHTML) documents
● Example: U+10000 is represented as <F0 90 80 80>
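
The byte layout can be worked through by hand. The sketch below is an illustrative encoder, not a replacement for Python's built-in codec; the name `utf8_encode` is made up, and surrogate code points are not rejected.

```python
def utf8_encode(code_point):
    """Encode one code point as UTF-8 bytes using explicit bit operations."""
    if code_point < 0x80:          # 1 byte:  0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:         # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:       # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0x10FFFF:     # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (code_point >> 18),
                      0x80 | ((code_point >> 12) & 0x3F),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    raise ValueError("code point outside the range 0..10FFFF")

for cp in (0x0061, 0x20AC, 0x10000):
    print(f"U+{cp:04X} ->", utf8_encode(cp).hex(" ").upper())
```

ASCII characters come out as their own byte values, which is what makes UTF-8 backwards compatible with ASCII.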

Conclusion
● The Unicode Standard is a superset of all characters in widespread use today.
● It contains characters from major international and national standards (e.g. the SGML standard) as well as prominent industry character sets (e.g. industry code sets from Apple, Adobe, Fujitsu, etc.).
● Responds to changing industry demands by encoding important new characters (e.g. the € sign)

Questions?
● Thank You!
