Notes on a Standard: Unicode
Usage Rights

© All Rights Reserved

    Presentation Transcript

    • Notes on a Standard: UNICODE Elena-Oana Tabaranu elena.tabaranu@info.uaic.ro UAIC, Iasi
    • Plan ● Introduction ● Design Goals ● Code Points and Characters ● Encoding Forms, UTF-32, UTF-16, UTF-8 ● Conclusion 2
    • Introduction ● UNIversal character enCODing system ● Unicode = universal character encoding scheme for written characters and text ● Advantages ● Consistent way of encoding multilingual text ● Data stability instead of proliferating character sets ● Encode ALL characters used for the written languages (> 1 million characters can be encoded) ● Creates a foundation for global software 3
    • Design Principles 4
    • Characters, not Glyphs ● The Unicode Standard draws a distinction between characters and glyphs. ● Characters are the abstract representations of the smallest components of written language that have semantic value. 5
    • Logical Order ● The order in which Unicode text is stored in the memory representation is called logical order ● Unicode Standard includes characters to explicitly specify changes in direction when necessary 6
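To make the logical-order slide concrete, here is a minimal Python sketch using the standard `unicodedata` module. The sample string and the variable names are assumptions chosen for illustration and are not from the slides. It shows that mixed-direction text is stored in reading order, that each character carries a directional class, and that dedicated characters exist to state direction explicitly.

```python
import unicodedata

# Latin "abc", a space, then Hebrew alef, bet, gimel.
# The string holds the characters in logical (reading) order;
# reordering for display is left to the rendering layer.
text = "abc \u05d0\u05d1\u05d2"

# Bidirectional class of each stored character:
# 'L' = left-to-right, 'R' = right-to-left, 'WS' = whitespace.
classes = [unicodedata.bidirectional(ch) for ch in text]
print(classes)

# Examples of characters that specify direction explicitly.
for cp in (0x200E, 0x200F, 0x202E):
    ch = chr(cp)
    print(f"U+{cp:04X} {unicodedata.name(ch)}")
```

The stored order never changes; only the display order does, which is why searching and sorting can operate on the logical sequence.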
    • Code Points and Characters ● Abstract characters are encoded internally as numbers ● Codespace: 0 to 10FFFF₁₆ => 1,114,112 code points available ● Abstract character -> code point ● Example: U+0061 LATIN SMALL LETTER A 7
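The mapping from abstract character to code point can be checked with a short Python sketch. The helper name `describe` and the constant `CODESPACE_SIZE` are assumptions made for this example; `ord`, `chr`, and `unicodedata.name` are standard library calls.

```python
import unicodedata

# The codespace runs from 0 to 10FFFF (hexadecimal), inclusive.
CODESPACE_SIZE = 0x10FFFF + 1

def describe(ch: str) -> str:
    """Return 'U+XXXX NAME' for a single character (illustrative helper)."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

print(CODESPACE_SIZE)   # 1114112
print(describe("a"))    # U+0061 LATIN SMALL LETTER A
print(chr(0x0061))      # a
```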
    • Encoding Forms ● Encoding forms specify how each code point is expressed as a sequence of one or more code units (8-bit, 16-bit, or 32-bit units) ● Encoding forms for Unicode characters: UTF-8, UTF-16, UTF-32 ● Each form can be efficiently transformed into either of the other two without any loss of data 8
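As a rough illustration of the three encoding forms, the sketch below counts code units for the same text in each form and then converts between forms to confirm that nothing is lost. The sample string and the helper `code_units` are assumptions for this example; the codec names are the ones Python's standard library provides.

```python
# U+0061 (a), U+20AC (euro sign), U+10000 (first code point beyond the BMP)
text = "a\u20ac\U00010000"

def code_units(s: str, encoding: str, unit_bytes: int) -> int:
    """Number of code units needed for s (illustrative helper)."""
    return len(s.encode(encoding)) // unit_bytes

counts = {
    "UTF-8": code_units(text, "utf-8", 1),       # 8-bit units
    "UTF-16": code_units(text, "utf-16-be", 2),  # 16-bit units
    "UTF-32": code_units(text, "utf-32-be", 4),  # 32-bit units
}
print(counts)  # {'UTF-8': 8, 'UTF-16': 4, 'UTF-32': 3}

# Convert UTF-8 -> UTF-16 -> UTF-32 and back to a string.
as_utf16 = text.encode("utf-8").decode("utf-8").encode("utf-16-be")
as_utf32 = as_utf16.decode("utf-16-be").encode("utf-32-be")
round_tripped = as_utf32.decode("utf-32-be")
print(round_tripped == text)  # True
```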
    • UTF-32 ● The simplest Unicode encoding form ● Each Unicode code point is represented directly by a single 32-bit code unit (fixed-width) ● Restricted to representation of code points in the range 0..10FFFF₁₆ ● Example: U+10000 is represented as <00010000> ● Preferred encoding form for processing characters on most Unix platforms 9
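Because UTF-32 is fixed-width, the code unit is simply the code point value written as a 32-bit number. A minimal sketch follows, assuming a hypothetical helper `utf32_code_unit` and comparing its result with Python's built-in `utf-32-be` codec.

```python
def utf32_code_unit(cp: int) -> str:
    """Return the single 32-bit code unit for cp as 8 hex digits (illustrative helper)."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("code point outside the range 0..10FFFF")
    return f"{cp:08X}"

unit = utf32_code_unit(0x10000)
print(unit)  # 00010000

# Cross-check against the standard library codec (big-endian byte order).
codec_hex = "\U00010000".encode("utf-32-be").hex().upper()
print(codec_hex == unit)  # True
```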
    • UTF-16 ● Code unit values often differ from the code point value => conversion required ● Variable-width encoding: ➢ U+0000..U+FFFF are represented as a single 16-bit code unit ➢ U+10000..U+10FFFF are represented as pairs of 16-bit code units (surrogate pairs) ● Optimized for the BMP (Basic Multilingual Plane) = majority of common-use characters for all modern scripts of the world 10
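The conversion mentioned on the UTF-16 slide can be sketched as follows. The function name `utf16_code_units` is an assumption for this example; the arithmetic is the usual surrogate-pair computation, in which 10000 (hex) is subtracted and the remaining 20 bits are split into two 10-bit halves.

```python
def utf16_code_units(cp: int) -> list[int]:
    """Return the UTF-16 code units for one code point (illustrative helper)."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("code point outside the range 0..10FFFF")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded")
    if cp < 0x10000:
        return [cp]                      # BMP: one 16-bit code unit
    offset = cp - 0x10000                # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits -> low surrogate
    return [high, low]

print([f"{u:04X}" for u in utf16_code_units(0x0061)])   # ['0061']
print([f"{u:04X}" for u in utf16_code_units(0x10000)])  # ['D800', 'DC00']
print([f"{u:04X}" for u in utf16_code_units(0x1F600)])  # ['D83D', 'DE00']
```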
    • UTF-8 ● UTF-8 encodes each character (code point) in 1 to 4 octets (8-bit bytes), with the single-octet encoding used only for the 128 US-ASCII characters ● U+0000 to U+007F → 1 byte ● U+0080 and above → 2, 3, or 4 bytes ● Backwards compatible with ASCII ● Default encoding for XML (XHTML) documents ● Example: U+10000 is represented as <F0 90 80 80> 11
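The byte counts on the UTF-8 slide follow from how the bits of the code point are distributed over lead and continuation bytes. Below is a minimal sketch, assuming a hypothetical helper `utf8_bytes`; real programs would normally call `str.encode("utf-8")`, which is used here only as a cross-check.

```python
def utf8_bytes(cp: int) -> bytes:
    """Encode one code point as UTF-8 by hand (illustrative helper)."""
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded")
    if cp < 0:
        raise ValueError("code point must be non-negative")
    if cp < 0x80:        # 1 byte: 0xxxxxxx (same as US-ASCII)
        return bytes([cp])
    if cp < 0x800:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp <= 0x10FFFF:   # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("code point outside the range 0..10FFFF")

def hex_bytes(data: bytes) -> str:
    return " ".join(f"{b:02X}" for b in data)

print(hex_bytes(utf8_bytes(0x0061)))   # 61
print(hex_bytes(utf8_bytes(0x20AC)))   # E2 82 AC
print(hex_bytes(utf8_bytes(0x10000)))  # F0 90 80 80
```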
    • Conclusion ● The Unicode Standard is a superset of all characters in widespread use today. ● It contains characters from major international and national standards (e.g. the SGML standard) as well as prominent industry character sets (e.g. industry codes from Apple, Adobe, Fujitsu, etc.). ● Responds to changing industry demands by encoding important new characters (e.g. the € sign) 12
    • Questions? ● Thank You! 13