This document discusses Unicode transformation formats. It explains that computers assign numbers to characters and that older 8-bit encoding systems were limited, causing conflicts when different encodings were used. Unicode provides a unique number for every character to allow for worldwide text interchange. It describes common encoding schemes like UTF-8, UTF-16 and UTF-32 that are used to encode Unicode, along with their characteristics and benefits. The document also lists some examples of where Unicode is used.
INTRODUCTION
• Computers at their most basic level just deal with numbers. They store letters, numerals and other characters by assigning a number to each one.
• In the pre-Unicode environment, we had single-byte (8-bit) character sets, which limited us to 256 characters at most. No single encoding could contain enough characters to cover all languages.
• So hundreds of different encoding systems were developed for assigning numbers to characters.
• As a result, these encoding systems conflict with one another. That is, two encodings can use the same number for two different characters, or different numbers for the same character.
• Any given computer needs to support many different encodings.
• Yet whenever data is passed between different encodings or platforms, that data runs the risk of corruption.
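The conflicts described above can be reproduced with a short Python sketch. The specific encodings chosen here (Latin-1, Windows-1251, Windows-1252, ISO 8859-15) are illustrative picks, not ones named in the slides.

```python
# Same number, different characters: byte 0xE9 in two 8-bit encodings.
raw = bytes([0xE9])
as_latin1 = raw.decode("latin-1")   # Western European: e with acute accent
as_cp1251 = raw.decode("cp1251")    # Cyrillic: short i

# Same character, different numbers: the euro sign in two encodings.
euro_cp1252 = "\u20ac".encode("cp1252")      # b'\x80'
euro_latin9 = "\u20ac".encode("iso8859_15")  # b'\xa4'

# Corruption: bytes written in one encoding, read back in another.
garbled = "caf\u00e9".encode("utf-8").decode("latin-1")

print(ascii(as_latin1), ascii(as_cp1251))
print(euro_cp1252.hex(), euro_latin9.hex())
print(ascii(garbled))
```

The last line is the familiar "mojibake" effect: one accented letter turns into two unrelated characters.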
Examples of character encoding systems
• Morse code
• Baudot code
• The American Standard Code for Information Interchange (ASCII)
• Unicode
WHAT IS UNICODE?
Unicode provides a unique number for every character,
no matter what the platform,
no matter what the program,
no matter what the language.
The Unicode Standard is a character coding system designed to support the worldwide interchange, processing, and display of the written texts of the world's diverse languages.
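As a small illustration of "a unique number for every character", Python exposes that number (the code point) through the built-in `ord` and `chr`. The sample characters are arbitrary choices for this sketch.

```python
import unicodedata

# Each character maps to exactly one code point, on every platform.
samples = ["A", "\u00e9", "\u0939", "\u4e2d"]  # Latin, accented Latin, Devanagari, CJK
code_points = [ord(ch) for ch in samples]

for ch, cp in zip(samples, code_points):
    # Conventional notation is U+ followed by at least four hex digits.
    print(f"U+{cp:04X}", unicodedata.name(ch))

# chr() is the inverse of ord().
round_trip = [chr(cp) for cp in code_points]
```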
From ASCII to Unicode
• Most character sets and encodings of the 1970s and 1980s were modifications or extensions of ASCII
• Most of these encodings use a single byte per character (SBCS)
• They are all limited to 256 characters
• Because of that, no single one of them can cover the letters of all European languages, let alone the rest of the world's scripts
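The 256-character limit can be seen by trying to encode text in two single-byte character sets. This is a minimal sketch; `can_encode` is a hypothetical helper written for this example, and ISO 8859-1 and ISO 8859-7 are illustrative choices.

```python
def can_encode(text: str, codec: str) -> bool:
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

french = "caf\u00e9"
greek = "\u0395\u03bb\u03bb\u03b7\u03bd\u03b9\u03ba\u03ac"  # "Ellinika" in Greek letters
mixed = french + " " + greek

for codec in ("latin-1", "iso8859_7", "utf-8"):
    print(codec, can_encode(french, codec), can_encode(greek, codec), can_encode(mixed, codec))
```

Each single-byte set handles its own language but not the other, so only a Unicode encoding can hold the mixed string.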
Where is Unicode Used?
• The Unicode Standard has been adopted by many software and hardware vendors.
• Most operating systems support Unicode.
• Unicode is required for international document and data interchange, the Internet and the WWW, and is therefore supported by modern languages and standards such as:
• Java, C#, Perl, Python
• Markup languages such as XML, HTML, XHTML
• JavaScript, LDAP, CORBA, etc.
UTF-8
• UTF-8 is the encoding of Unicode based on 8-bit code units
• It is a variable-width encoding and also a strict superset of ASCII
• "Strict superset" means that every character in ASCII is available in UTF-8 with the same code point value, stored as the same single byte
• 1 character = 1 to 4 bytes in the encoding
• Characters from European scripts: mostly 1 or 2 bytes
• Most Asian scripts: 3 or 4 bytes
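The byte counts above can be checked directly in Python. The sample characters are arbitrary picks for this sketch.

```python
# Number of UTF-8 bytes needed for characters from different ranges.
samples = {
    "A": "ASCII letter",
    "\u00e9": "Latin letter with accent",
    "\u042f": "Cyrillic letter",
    "\u4e2d": "CJK ideograph",
    "\U0001F600": "emoji (supplementary character)",
}

utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, label in samples.items():
    print(f"U+{ord(ch):04X} {label}: {utf8_lengths[ch]} byte(s)")
```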
• UTF-8 is used on UNIX platforms, in HTML and by most Internet browsers
• Main benefits of UTF-8:
• Compact storage requirements for European scripts
• In general, European scripts will occupy less storage on disk and in memory
• Ease of migration: since 7-bit ASCII data remains the same in UTF-8, the data conversion effort between ASCII-based character sets and UTF-8 is reduced significantly.
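The migration point can be sketched as follows: pure ASCII bytes are already valid UTF-8, while bytes above 0x7F from a legacy character set still need an explicit conversion. Latin-1 stands in here for "an ASCII-based character set"; that choice is an assumption of this example.

```python
# 7-bit ASCII data is byte-for-byte identical in UTF-8: nothing to convert.
ascii_bytes = b"Hello, Unicode!"
same_text = ascii_bytes.decode("ascii") == ascii_bytes.decode("utf-8")
same_bytes = "Hello, Unicode!".encode("utf-8") == ascii_bytes

# Only the non-ASCII part of legacy data has to be transcoded.
legacy = b"caf\xe9"                                # Latin-1 bytes
migrated = legacy.decode("latin-1").encode("utf-8")

print(same_text, same_bytes)
print(migrated)
```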
UTF-16
• UTF-16 is the encoding of Unicode based on 16-bit code units
• It is essentially an extension of UCS-2
• One Unicode character takes 2 or 4 bytes in the encoding
• Characters from European and most Asian scripts are represented in 2 bytes
• Supplementary characters are represented in 4 bytes
• UTF-16 has been the main Unicode encoding on Windows since Windows 2000
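The 2-byte and 4-byte cases can be sketched as below. `to_utf16_units` is a hypothetical helper written for this example; it follows the standard surrogate-pair arithmetic and is checked against Python's built-in codec.

```python
def to_utf16_units(code_point: int) -> list[int]:
    """Return the 16-bit code units UTF-16 uses for one code point.

    Assumes a valid scalar value (not itself in the surrogate range
    0xD800-0xDFFF, and no larger than 0x10FFFF).
    """
    if code_point < 0x10000:
        return [code_point]                 # one unit = 2 bytes
    offset = code_point - 0x10000           # 20 bits left to store
    high = 0xD800 + (offset >> 10)          # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)         # bottom 10 bits
    return [high, low]                      # surrogate pair = 4 bytes

bmp_units = to_utf16_units(0x4E2D)          # a CJK ideograph
supp_units = to_utf16_units(0x1F600)        # a supplementary character

# Cross-check against the built-in big-endian UTF-16 codec.
builtin_bytes = "\U0001F600".encode("utf-16-be")
manual_bytes = b"".join(u.to_bytes(2, "big") for u in supp_units)

print([hex(u) for u in bmp_units], [hex(u) for u in supp_units])
```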
• Main benefits of UTF-16:
• More compact storage requirements for Asian scripts (2 bytes for commonly used characters)
• Well suited to text in which European and Asian scripts are used together
• For text dominated by Asian characters, UTF-16 will occupy less storage on disk and in memory than UTF-8, which needs 3 bytes for each such character
• A balance between efficient access to characters and economical use of storage
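The storage trade-off can be measured with a small sketch. `encoded_sizes` is a hypothetical helper for this example, and the two sample strings are arbitrary; the little-endian codec is used so that no byte order mark is counted.

```python
def encoded_sizes(text: str) -> dict[str, int]:
    """Return the number of bytes text occupies in UTF-8 and in UTF-16."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),  # -le: no BOM added
    }

japanese = "\u3053\u3093\u306b\u3061\u306f\u4e16\u754c"  # 7 characters
english = "Hello, world"                                  # 12 characters

japanese_sizes = encoded_sizes(japanese)
english_sizes = encoded_sizes(english)

print("Japanese:", japanese_sizes)
print("English: ", english_sizes)
```

The Japanese sample is smaller in UTF-16, while the ASCII-only sample doubles in size, which is why the better choice depends on the mix of scripts in the data.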
Unicode @ the Library
• Display all scripts and characters
• Record data in all languages
• Exchange bibliographic data
• Search in all languages …