UNICODE TRANSFORMATION FORMAT
By ANKIT SHARMA
INTRODUCTION
• Computers at their most basic level just
  deal with numbers. They store letters,
  numerals and other characters by
  assigning a number to each one.
• In the pre-Unicode environment, we
  had single-byte (8-bit) character sets, which
  limited us to 256 characters at most. No
  single encoding could contain enough
  characters to cover all languages.
• So hundreds of different encoding
  systems were developed for assigning
  numbers to characters.
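
The idea of "a number for each character" can be tried directly in Python, whose strings are sequences of Unicode code points. This is a minimal sketch; the helper name code_points is made up for illustration.

```python
def code_points(text: str) -> list[int]:
    """Return the number assigned to each character in text."""
    return [ord(ch) for ch in text]


for ch in ["A", "a", "7", "€", "क"]:
    number = ord(ch)          # character -> number
    assert chr(number) == ch  # number -> character (round trip)
    print(f"{ch!r} is stored as the number {number}")
```

ord() and chr() are Python built-ins; the numbers they use are Unicode code points, which agree with ASCII for the first 128 characters.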
INTRODUCTION (continued)

• As a result, these encoding systems
  conflict with each other. That is, two
  encodings can use the same number
  for two different characters, or different
  numbers for the same character.
• Any given computer needs to support
  many different encodings.
• Yet whenever data is passed
  between different encodings or
  platforms, it runs the
  risk of corruption.
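
The conflict described above can be reproduced with a few legacy code pages that ship with Python. The sketch below is illustrative only; the choice of code pages (latin-1, cp1251, cp437) and the helper name mojibake are assumptions made for the example.

```python
# Same number, different characters: byte 0xE9 under three legacy encodings.
raw = bytes([0xE9])
decoded = {name: raw.decode(name) for name in ("latin-1", "cp1251", "cp437")}
for name, ch in decoded.items():
    print(f"0xE9 read as {name}: {ch!r}")

# Same character, different numbers: one letter under two encodings.
same_char = {name: "é".encode(name) for name in ("latin-1", "cp437")}
print(same_char)


def mojibake(text: str) -> str:
    """Write text as UTF-8, then wrongly read it back as Latin-1."""
    return text.encode("utf-8").decode("latin-1")


# Corruption when sender and receiver disagree about the encoding.
print(mojibake("café"))
```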

Examples of character encoding systems
• Morse code
• Baudot code
• The American Standard Code for
  Information Interchange (ASCII)
• Unicode


WHAT IS UNICODE?



 Unicode provides a unique number for
 every character,
 no matter what the platform,
 no matter what the program,
 no matter what the language.



 The Unicode Standard is a character coding
 system designed to support the worldwide
 interchange, processing, and display of the
 written texts of the diverse languages of the modern world.
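
As a small illustration of "a unique number for every character", the sketch below prints the code point and official character name using Python's standard unicodedata module. The helper name describe is hypothetical.

```python
import unicodedata


def describe(ch: str) -> str:
    """Format one character as 'U+XXXX NAME'."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


for ch in ["A", "€", "क", "中"]:
    print(ch, "->", describe(ch))
```

The same code point identifies the character on every platform and in every program; only the byte-level encoding (UTF-8, UTF-16, UTF-32) varies.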


From ASCII to Unicode
• Most character sets and encodings of the
  1970s and 1980s were modifications or
  extensions of ASCII.
• Most of these legacy encodings
  use a single byte per character (SBCS).
• They are therefore all limited to 256 characters.
• Because of that, no single one of them can
  cover the letters of all
  European languages, let alone the world's scripts.
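
The 256-character limit is easy to observe: a single-byte character set that holds Western European letters has no room for Greek, and vice versa. This is a rough sketch using two ISO 8859 code pages bundled with Python; the helper name can_encode is made up for the example.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


for sample in ["café", "Ω"]:
    for encoding in ["latin-1", "iso8859-7", "utf-8"]:
        print(f"{sample!r} in {encoding}: {can_encode(sample, encoding)}")
```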




Where is Unicode Used?
• The Unicode Standard has been
  adopted by many software and hardware
  vendors.
• Most operating systems support Unicode.
• Unicode is required for international
  document and data interchange, the
  Internet and the WWW, and is therefore
  used by modern languages and standards such as:
• Java, C#, Perl, Python
• Markup languages such as XML,
  HTML, XHTML
• JavaScript, LDAP, CORBA, etc.
UTF-8
• UTF-8 is the encoding of Unicode that uses 8-bit code units.
• It is a variable-width encoding and also
  a strict superset of ASCII.
• “Strict superset” means that every
  character in ASCII is available in UTF-8
  with the same code point
  value and the same single-byte encoding.
• 1 character = 1 to 4 bytes in the
  encoding
• Characters from most European scripts (Latin, Greek, Cyrillic):
  either 1 or 2 bytes
• Most Asian scripts (for example Devanagari, Chinese, Japanese, Korean): 3 bytes;
  supplementary characters such as emoji: 4 bytes
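
The variable width follows from the bit layout UTF-8 uses for each code point range. The sketch below is a hand-written encoder for a single code point, checked against Python's built-in codec. The function name utf8_encode is illustrative, and the sketch deliberately skips validation of surrogates and out-of-range values.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point as UTF-8 (no validation of surrogates or range)."""
    if code_point < 0x80:        # up to 7 bits:  0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:       # up to 11 bits: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:     # up to 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # up to 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


for ch in ["A", "ß", "Ж", "क", "中", "😀"]:
    encoded = utf8_encode(ord(ch))
    assert encoded == ch.encode("utf-8")  # agrees with the built-in codec
    print(f"{ch!r}: {len(encoded)} byte(s) -> {encoded.hex(' ')}")
```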

• UTF-8 is used on UNIX platforms, in HTML,
  and by most Internet browsers.
• Main benefits of UTF-8:
• Compact storage for
  European scripts: in general they occupy
  less space on disk and in memory than in UTF-16 or UTF-32.
• Ease of migration: since 7-bit ASCII
  data remains the same in UTF-8, the
  conversion effort between ASCII-based
  character sets and UTF-8 is reduced
  significantly.
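
The migration benefit can be checked in a couple of lines: pure 7-bit ASCII data is already valid UTF-8, byte for byte. This is a minimal sketch; the helper name is_unchanged_in_utf8 and the sample strings are invented for the example.

```python
def is_unchanged_in_utf8(text: str) -> bool:
    """True if text has the same bytes in ASCII and in UTF-8."""
    try:
        return text.encode("ascii") == text.encode("utf-8")
    except UnicodeEncodeError:
        return False  # text contains non-ASCII characters


legacy_bytes = b"plain 7-bit ASCII log line"
# An old ASCII file can be read as UTF-8 without any conversion step.
assert legacy_bytes.decode("utf-8") == legacy_bytes.decode("ascii")

print(is_unchanged_in_utf8("Hello, UTF-8!"))
print(is_unchanged_in_utf8("Grüße"))
```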
UTF-16
• UTF-16 is the encoding of
  Unicode that uses 16-bit code units.
• It is essentially an extension of UCS-2.
• One Unicode character takes 2 or 4
  bytes in the encoding.
• Characters from
  European and most Asian scripts are
  represented in 2 bytes.
• Supplementary characters are
  represented in 4 bytes (a surrogate pair).
• UTF-16 has been the main Unicode encoding
  on Windows since Windows 2000.
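
The "2 or 4 bytes" rule comes from how UTF-16 splits supplementary code points (above U+FFFF) into two 16-bit units. The sketch below computes the code units by hand and compares them with Python's built-in utf-16-be codec. The function names are illustrative, and input validation is omitted.

```python
def utf16_code_units(code_point: int) -> list[int]:
    """Return the 16-bit code unit(s) for one code point (no validation)."""
    if code_point < 0x10000:
        return [code_point]                  # one unit = 2 bytes
    offset = code_point - 0x10000            # 20 bits remain
    high = 0xD800 + (offset >> 10)           # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)          # bottom 10 bits
    return [high, low]                       # two units = 4 bytes


def builtin_code_units(ch: str) -> list[int]:
    """Same result obtained from Python's own UTF-16 encoder."""
    data = ch.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


for ch in ["A", "Ж", "中", "😀"]:
    units = utf16_code_units(ord(ch))
    assert units == builtin_code_units(ch)
    print(f"{ch!r}: {[f'{u:04X}' for u in units]} ({2 * len(units)} bytes)")
```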

• Main benefits of UTF-16:
• More compact storage for
  Asian scripts (2 bytes for commonly used
  characters, against 3 bytes in UTF-8).
• A reasonable choice when European and Asian scripts are
  used together and the Asian part dominates: such text will occupy
  less storage on disk and in memory than in UTF-8.
• A balance of efficient
  access to characters and economical
  use of storage.
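
Which encoding is more compact depends on the mix of scripts, and that can be measured directly. The sketch below is illustrative: the helper name encoded_sizes and the sample strings are assumptions, and the "-le" codec variants are used so that no byte order mark is counted.

```python
def encoded_sizes(text: str) -> dict[str, int]:
    """Return the number of bytes text needs in each encoding (no BOM)."""
    return {enc: len(text.encode(enc))
            for enc in ("utf-8", "utf-16-le", "utf-32-le")}


for sample in ["hello", "日本語", "hello 日本語"]:
    print(f"{sample!r}: {encoded_sizes(sample)}")
```

For the purely Asian sample UTF-16 is smaller than UTF-8, for the purely ASCII sample UTF-8 is smaller, and for mixed text the result depends on the proportions.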

UTF-32


• 32-bit encoding
• Popular when memory space is not a
  concern
• Fixed width (4 bytes per character)
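
Fixed width means the n-th character always sits at byte offset 4 × n, so it can be found without scanning the text. This is a minimal sketch assuming little-endian UTF-32 without a byte order mark; the helper name code_point_at is made up for the example.

```python
def code_point_at(utf32_le_bytes: bytes, index: int) -> int:
    """Read the code point at position index from UTF-32LE data."""
    start = 4 * index  # every code point occupies exactly 4 bytes
    return int.from_bytes(utf32_le_bytes[start:start + 4], "little")


text = "aß中😀"
data = text.encode("utf-32-le")
assert len(data) == 4 * len(text)

for i, ch in enumerate(text):
    assert chr(code_point_at(data, i)) == ch
    print(f"offset {4 * i:2d}: U+{code_point_at(data, i):04X} {ch!r}")
```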




Unicode @ the Library



•   Display all scripts and characters
•   Record data in all languages
•   Exchange bibliographic data
•   Search in all languages …




THANK YOU


Editor's Notes

  • #3 ANKIT & SUSHEEL