Your SlideShare is downloading. ×
Notes on a Standard: Unicode
Upcoming SlideShare
Loading in...5

Thanks for flagging this SlideShare!

Oops! An error has occurred.


Saving this for later?

Get the SlideShare app to save on your phone or tablet. Read anywhere, anytime - even offline.

Text the download link to your phone

Standard text messaging rates apply

Notes on a Standard: Unicode


Published on

Published in: Technology

  • Be the first to comment

  • Be the first to like this

No Downloads
Total Views
On Slideshare
From Embeds
Number of Embeds
Embeds 0
No embeds

Report content
Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

No notes for slide


  • 1. Notes on a Standard: UNICODE Elena-Oana Tabaranu UAIC, Iasi
  • 2. Plan ● Introduction ● Design Goals ● Code Points and Characters ● Encoding Forms, UTF-32, UTF-16, UTF-8 ● Conclusion 2
  • 3. Introduction ● UNIversal character enCODing system ● Unicode = universal character encoding scheme for written characters and text ● Advantages ● Consistent way of encoding multilingual text ● Data stability instead of proliferating character sets ● Encode ALL characters used for the written languages (> 1 million characters can be encoded) ● Creates a foundation for global software 3
  • 4. Design Principles 4
  • 5. Characters, not Glyphs ● The Unicode Standard draws a distinction between characters and glyphs. ● Characters are the abstract representations of the smallest components of written language that have semantic value. 5
  • 6. Logical Order ● The order in which Unicode text is stored in the memory representation is called logical order ● Unicode Standard includes characters to explicitly specify changes in direction when necessary 6
  • 7. Code Points and Characters ● Abstract characters are encoded internally as numbers ● Codespace: 0 to 10FFFF16 => 1,114,112 code points available ● Abstract character -> code point ● Example: U+0061 latin small letter a 7
  • 8. Encoding Forms ● Encoding forms specify how each code point is to be expressed as a sequence of one or more code unit (8-bit, 16-bit, 32-bit units) ● Encoding forms for Unicode characters: UTF-8, UTF-16, UTF-32 ● Each form can be efficiently transformed into either of the other two without any loss of data 8
  • 9. UTF-32 ● The simplest Unicode encoding form ● Each Unicode code point is represented directly by a single 32-bit code unit (fixed-width) ● restricted to representation of code points in the range 0..10FFFF16 ● Example: U+10000 is represented as <00010000> ● preferred encoding form for processing characters on most Unix platforms 9
  • 10. UTF-16 ● Code unit values often change from the code point value => conversion required ● Variable-width encoding: ➢ U+0000..U+FFFF are represented as a single 16-bit code unit ➢ U+10000..U+10FFFF are represented as pairs of 16-bit code units (surrogate pairs) ● Optimized for BMP (Basic Multilingual Plain) = majority of common-use characters for all modern scripts of the world 10
  • 11. UTF-8 ● UTF-8 encodes each character (code point) in 1 to 4 octets (8-bit bytes), with the single–octet encoding used only for the 128 US-ASCII characters ● U+0000 to U+007F → 1 byte ● above → 2, 3, up to 4 bytes ● Backwards compatible with ASCII ● Standard for XML (XHTML) documents ● Example: U+10000 is represented as <F0 90 80 80> 11
  • 12. Conclusion ● The Unicode Standard is a superset of all characters in widespread use today. ● It contains characters from major international and national standards (e.g. the SGML standard) as well as prominient industry character sets (e.g. industy code from Apple, Adobe, Fujitsu, etc). ● Responds to changing industry demands by encoding important new characters (e.g. the € sign ) 12
  • 13. Questions? ● Thank You! 13