Notes on a Standard:
    UNICODE
     Elena-Oana Tabaranu
  elena.tabaranu@info.uaic.ro
           UAIC, Iasi
  1. Notes on a Standard: UNICODE
     Elena-Oana Tabaranu, elena.tabaranu@info.uaic.ro, UAIC, Iasi
  2. Plan
     ● Introduction
     ● Design Goals
     ● Code Points and Characters
     ● Encoding Forms, UTF-32, UTF-16, UTF-8
     ● Conclusion
  3. Introduction
     ● UNIversal character enCODing system
     ● Unicode = universal character encoding scheme for written characters and text
     ● Advantages
       ● Consistent way of encoding multilingual text
       ● Data stability instead of proliferating character sets
       ● Encodes all characters used in written languages (more than 1 million code points are available)
       ● Creates a foundation for global software
  4. Design Principles
  5. Characters, not Glyphs
     ● The Unicode Standard draws a distinction between characters and glyphs.
     ● Characters are the abstract representations of the smallest components of written language that have semantic value.
  6. Logical Order
     ● The order in which Unicode text is stored in memory is called logical order
     ● The Unicode Standard includes characters that explicitly specify changes in direction when necessary
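To illustrate logical order, here is a minimal Python sketch using the standard `unicodedata` module. The sample string is a hypothetical example, not taken from the slides. It shows that indexing follows storage order whatever the display direction, and it looks up one of the explicit directional characters.

```python
import unicodedata

# Hypothetical sample: Latin "abc", a space, then the Hebrew letters
# alef, bet, gimel (a right-to-left run).
text = "abc \u05d0\u05d1\u05d2"

# Indexing follows logical (storage) order, regardless of how a renderer
# lays out the right-to-left run on screen.
bidi_classes = [unicodedata.bidirectional(ch) for ch in text]
print(bidi_classes)

# One explicit directional formatting character defined by Unicode.
rlo = "\u202e"
rlo_name = unicodedata.name(rlo)
print(rlo_name, unicodedata.bidirectional(rlo))
```

The bidirectional classes (`L`, `R`, `WS`, ...) are what a display algorithm would consult. The stored order of the characters does not change.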
  7. Code Points and Characters
     ● Abstract characters are encoded internally as numbers
     ● Codespace: 0 to 10FFFF₁₆ => 1,114,112 code points available
     ● Abstract character -> code point
     ● Example: U+0061 LATIN SMALL LETTER A
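The arithmetic on this slide can be checked with a short Python sketch. The variable names are illustrative only. `ord`, `chr` and `unicodedata.name` are standard Python, not part of the Unicode Standard itself.

```python
import unicodedata

# The codespace runs from 0 to 0x10FFFF inclusive.
CODESPACE_MAX = 0x10FFFF
codespace_size = CODESPACE_MAX + 1
print(codespace_size)

# Map an abstract character to its code point and back.
ch = "a"
code_point = ord(ch)
label = f"U+{code_point:04X} {unicodedata.name(ch)}"
print(label)
assert chr(code_point) == ch

# Values beyond the codespace are rejected.
try:
    chr(CODESPACE_MAX + 1)
    beyond_codespace_rejected = False
except ValueError:
    beyond_codespace_rejected = True
print(beyond_codespace_rejected)
```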
  8. Encoding Forms
     ● Encoding forms specify how each code point is to be expressed as a sequence of one or more code units (8-bit, 16-bit, 32-bit units)
     ● Encoding forms for Unicode characters: UTF-8, UTF-16, UTF-32
     ● Each form can be efficiently transformed into either of the other two without any loss of data
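As a rough illustration, the sketch below counts code units for the same text in each encoding form and round-trips it through all three. It relies on Python's built-in codecs. The helper `code_units` and the sample string are assumptions made for this example.

```python
def code_units(text: str, encoding: str, unit_bytes: int) -> int:
    """Number of code units needed for text (hypothetical helper)."""
    return len(text.encode(encoding)) // unit_bytes

# Sample: "a" (U+0061), the euro sign (U+20AC), and U+10000.
sample = "a\u20ac\U00010000"

counts = {
    "UTF-8": code_units(sample, "utf-8", 1),
    "UTF-16": code_units(sample, "utf-16-be", 2),
    "UTF-32": code_units(sample, "utf-32-be", 4),
}
print(counts)

# Convert UTF-8 -> UTF-16 -> UTF-32 and back to text: nothing is lost.
as_utf8 = sample.encode("utf-8")
as_utf16 = as_utf8.decode("utf-8").encode("utf-16-be")
as_utf32 = as_utf16.decode("utf-16-be").encode("utf-32-be")
round_trip = as_utf32.decode("utf-32-be")
print(round_trip == sample)
```

The `-be` codec variants are used so that no byte order mark is added to the output, which would distort the counts.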
  9. UTF-32
     ● The simplest Unicode encoding form
     ● Each Unicode code point is represented directly by a single 32-bit code unit (fixed-width)
     ● Restricted to representation of code points in the range 0..10FFFF₁₆
     ● Example: U+10000 is represented as <00010000>
     ● Preferred encoding form for processing characters on most Unix platforms
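The U+10000 example can be reproduced with a minimal sketch. The function `utf32_code_unit` is a hypothetical helper written for this note, checked against Python's built-in `utf-32-be` codec.

```python
def utf32_code_unit(code_point: int) -> str:
    """Return the single 32-bit code unit for a code point, as 8 hex digits."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode codespace")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded")
    # The code unit is numerically equal to the code point itself.
    return f"{code_point:08X}"

unit = utf32_code_unit(0x10000)
print(f"<{unit}>")

# Cross-check against the built-in big-endian UTF-32 codec.
codec_hex = "\U00010000".encode("utf-32-be").hex().upper()
print(codec_hex == unit)
```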
  10. UTF-16
      ● Code unit values often differ from the code point value => conversion required
      ● Variable-width encoding:
        ➢ U+0000..U+FFFF are represented as a single 16-bit code unit
        ➢ U+10000..U+10FFFF are represented as pairs of 16-bit code units (surrogate pairs)
      ● Optimized for the BMP (Basic Multilingual Plane) = majority of common-use characters for all modern scripts of the world
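The conversion to surrogate pairs can be sketched as follows. The function `utf16_code_units` is a hypothetical helper, not code from the slides, and it is compared against Python's built-in `utf-16-be` codec.

```python
def utf16_code_units(code_point: int) -> list[int]:
    """Return the UTF-16 code units (one or two) for a code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode codespace")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded")
    if code_point < 0x10000:
        # BMP: the code unit equals the code point.
        return [code_point]
    # Supplementary planes: split the 20-bit offset into two 10-bit halves.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # high (lead) surrogate
    low = 0xDC00 + (offset & 0x3FF)   # low (trail) surrogate
    return [high, low]

for cp in (0x0061, 0x20AC, 0x10000, 0x10FFFF):
    units = utf16_code_units(cp)
    print(f"U+{cp:04X} ->", " ".join(f"{u:04X}" for u in units))

# Cross-check one supplementary code point against the built-in codec.
pair = utf16_code_units(0x1F600)
pair_bytes = b"".join(u.to_bytes(2, "big") for u in pair)
print(pair_bytes == "\U0001F600".encode("utf-16-be"))
```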
  11. UTF-8
      ● UTF-8 encodes each character (code point) in 1 to 4 octets (8-bit bytes), with the single-octet encoding used only for the 128 US-ASCII characters
      ● U+0000 to U+007F → 1 byte
      ● above → 2, 3, up to 4 bytes
      ● Backwards compatible with ASCII
      ● Default encoding for XML (XHTML) documents
      ● Example: U+10000 is represented as <F0 90 80 80>
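A minimal sketch of the byte layout behind these ranges is shown below. The function `utf8_encode` is a hypothetical helper written for illustration. In practice Python's `str.encode("utf-8")` does this work, and the sketch is checked against it.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point as 1 to 4 UTF-8 octets."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable code point")
    if cp < 0x80:        # 1 byte:  0xxxxxxx (US-ASCII)
        return bytes([cp])
    if cp < 0x800:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

encoded = utf8_encode(0x10000)
encoded_hex = " ".join(f"{b:02X}" for b in encoded)
print(f"<{encoded_hex}>")

# ASCII text is unchanged byte for byte, which is the backwards compatibility.
ascii_same = b"".join(utf8_encode(ord(c)) for c in "ASCII") == b"ASCII"
print(ascii_same)

# Cross-check a few code points against the built-in codec.
matches_codec = all(utf8_encode(cp) == chr(cp).encode("utf-8")
                    for cp in (0x61, 0xE9, 0x20AC, 0x10000, 0x10FFFF))
print(matches_codec)
```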
  12. Conclusion
      ● The Unicode Standard is a superset of all characters in widespread use today.
      ● It contains characters from major international and national standards (e.g. the SGML standard) as well as prominent industry character sets (e.g. industry code sets from Apple, Adobe, Fujitsu, etc.).
      ● Responds to changing industry demands by encoding important new characters (e.g. the € sign)
  13. Questions?
      ● Thank You!