Notes on a Standard:
    UNICODE
     Elena-Oana Tabaranu
  elena.tabaranu@info.uaic.ro
           UAIC, Iasi
Plan
●   Introduction
●   Design Principles
●   Code Points and Characters
●   Encoding Forms, UTF-32, UTF-16, UTF-8
●   Conclusion




Introduction
●   UNIversal character enCODing system
●   Unicode = universal character encoding scheme
    for written characters and text
●   Advantages
    ●   Consistent way of encoding multilingual text
    ●   Data stability instead of proliferating character sets
    ●   Encodes all characters used in the world's written languages
        (more than 1 million code points are available)
    ●   Creates a foundation for global software

Design Principles




Characters, not Glyphs
●   The Unicode Standard draws a distinction between
    characters and glyphs.
●   Characters are the abstract representations of the
    smallest components of written language that have
    semantic value.
●   Glyphs are the shapes that characters take when they
    are rendered or displayed.




Logical Order
       ●   The order in which
           Unicode text is stored
           in memory is called
           logical order
       ●   The Unicode Standard
           includes characters that
           explicitly specify
           changes in direction
           when necessary
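A minimal sketch of how logical order and directional characters can be inspected, assuming Python's standard `unicodedata` module. The sample string and the helper name `bidi_classes` are illustrative only. The string is stored in logical order (Latin first, Hebrew second), and each character carries a bidirectional class that a renderer uses to work out the display order.

```python
import unicodedata

def bidi_classes(text: str) -> list[tuple[str, str]]:
    """Pair each code point (as U+XXXX) with its bidirectional class."""
    return [(f"U+{ord(ch):04X}", unicodedata.bidirectional(ch)) for ch in text]

# Stored in logical order: Latin "abc", a space, then Hebrew alef, bet, gimel.
sample = "abc \u05d0\u05d1\u05d2"
for code_point, bidi_class in bidi_classes(sample):
    print(code_point, bidi_class)

# Two of the characters that explicitly specify direction:
for ch in ("\u200f", "\u202e"):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.bidirectional(ch))
```

The memory order never changes; only the rendering step reorders the right-to-left run for display.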
Code Points and Characters
●   Abstract characters are
    encoded internally as
    numbers
●   Codespace: 0 to 10FFFF (hexadecimal)
    => 1,114,112 code points
    available
●   Abstract character -> code
    point
●   Example:
    U+0061 LATIN SMALL LETTER A
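The mapping between an abstract character and its code point can be sketched with Python's built-in `ord` and `chr` plus the standard `unicodedata` module. The helper name `describe` is made up for this illustration.

```python
import unicodedata

def describe(ch: str) -> str:
    """Format a single character as 'U+XXXX NAME'."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

print(describe("a"))      # U+0061 LATIN SMALL LETTER A
print(chr(0x0061))        # a

# The codespace runs from 0 to 0x10FFFF inclusive.
codespace_size = 0x10FFFF + 1
print(f"{codespace_size:,}")  # 1,114,112
```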

Encoding Forms
●   Encoding forms specify how
    each code point is to be
    expressed as a sequence of
    one or more code units (8-bit,
    16-bit, or 32-bit units)
●   Encoding forms for Unicode
    characters: UTF-8, UTF-16,
    UTF-32
●   Each form can be efficiently
    transformed into either of the
    other two without any loss of
    data
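A small sketch of the lossless conversion between the three encoding forms, assuming Python's built-in codecs (`utf-8`, `utf-16-be`, `utf-32-be`). The helper `transcode` and the sample text are illustrative only.

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Convert encoded bytes from one Unicode encoding form to another."""
    return data.decode(source).encode(target)

# U+0061, U+20AC (euro sign), U+10000
text = "a\u20ac\U00010000"

utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-be")
utf32 = text.encode("utf-32-be")

# Number of code units per form: 8-bit, 16-bit, and 32-bit units.
print(len(utf8), len(utf16) // 2, len(utf32) // 4)  # 8 4 3

# Round trip through all three forms returns the original bytes.
round_trip = transcode(transcode(transcode(utf8, "utf-8", "utf-16-be"),
                                 "utf-16-be", "utf-32-be"),
                       "utf-32-be", "utf-8")
print(round_trip == utf8)  # True
```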

UTF-32
●   The simplest Unicode encoding form
●   Each Unicode code point is represented directly
    by a single 32-bit code unit (fixed-width)
●   Restricted to representation of code points in
    the range 0..10FFFF (hexadecimal)
●   Example:
    U+10000 is represented as <00010000>
●   Preferred encoding form for processing
    characters on most Unix platforms
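As a quick check of the example above, the following sketch encodes U+10000 with Python's built-in `utf-32-be` codec (big-endian, no byte order mark). The helper name `utf32_hex` is an assumption made for this illustration.

```python
def utf32_hex(code_point: int) -> str:
    """Return the single 32-bit UTF-32 code unit as 8 hex digits."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside the Unicode codespace")
    return chr(code_point).encode("utf-32-be").hex().upper()

print(utf32_hex(0x10000))  # 00010000
print(utf32_hex(0x61))     # 00000061
```

The code unit is simply the code point value padded to 32 bits, which is why no conversion step is needed.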
UTF-16
●   Code unit values often differ from the code
    point value => conversion required
●   Variable-width encoding:
    ➢   U+0000..U+FFFF are represented as a single 16-bit
        code unit
    ➢   U+10000..U+10FFFF are represented as pairs of
        16-bit code units (surrogate pairs)
●   Optimized for the BMP (Basic Multilingual Plane) =
    majority of common-use characters for all
    modern scripts of the world
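The surrogate-pair conversion can be sketched as follows. The function name `utf16_code_units` is hypothetical; the arithmetic is the standard UTF-16 rule (subtract 0x10000, then split the remaining 20 bits into two 10-bit halves), and the result is compared with Python's built-in `utf-16-be` codec.

```python
def utf16_code_units(code_point: int) -> list[int]:
    """Return the UTF-16 code units (one or two) for a code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside the Unicode codespace")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded")
    if code_point < 0x10000:
        return [code_point]                 # BMP: a single 16-bit code unit
    offset = code_point - 0x10000           # 20-bit value
    high = 0xD800 + (offset >> 10)          # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)         # bottom 10 bits
    return [high, low]

def as_hex(units: list[int]) -> str:
    return " ".join(f"{u:04X}" for u in units)

print(as_hex(utf16_code_units(0x0061)))   # 0061
print(as_hex(utf16_code_units(0x10000)))  # D800 DC00
print(as_hex(utf16_code_units(0x1F600)))  # D83D DE00
```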
UTF-8
●   UTF-8 encodes each character (code point) in 1 to 4
    octets (8-bit bytes), with the single-octet encoding
    used only for the 128 US-ASCII characters
    ●   U+0000 to U+007F → 1 byte
    ●   above → 2, 3, up to 4 bytes
●   Backwards compatible with ASCII
●   Default encoding for XML (and XHTML) documents
●   Example:
    U+10000 is represented as <F0 90 80 80>
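A sketch of the byte layout behind the 1-to-4-octet rule. The function name `utf8_bytes` is hypothetical, and the output is checked against Python's built-in `utf-8` codec rather than taken as authoritative.

```python
def utf8_bytes(code_point: int) -> bytes:
    """Encode one code point as 1 to 4 UTF-8 octets."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    if code_point < 0x80:                      # U+0000..U+007F: 1 byte (ASCII)
        return bytes([code_point])
    if code_point < 0x800:                     # U+0080..U+07FF: 2 bytes
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:                   # U+0800..U+FFFF: 3 bytes
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    return bytes([0xF0 | (code_point >> 18),   # U+10000..U+10FFFF: 4 bytes
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])

def as_hex(data: bytes) -> str:
    return " ".join(f"{b:02X}" for b in data)

print(as_hex(utf8_bytes(0x61)))     # 61
print(as_hex(utf8_bytes(0x20AC)))   # E2 82 AC
print(as_hex(utf8_bytes(0x10000)))  # F0 90 80 80
```

Because every ASCII character keeps its single-byte value, plain ASCII text is already valid UTF-8.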

Conclusion
●   The Unicode Standard is a superset of all
    characters in widespread use today.
●   It contains characters from major international
    and national standards (e.g. the SGML
    standard) as well as prominent industry
    character sets (e.g. industry code sets from Apple,
    Adobe, Fujitsu, etc.).
●   It responds to changing industry demands by
    encoding important new characters (e.g. the €
    sign).
Questions?
●   Thank You!




