This article is a part of Lingoport.com; the original article can be found at
http://www.lingoport.com/software-internationalization-articles/unicode-primer-for-the-uninitiated/

Unicode Primer for the Uninitiated
Among our friends and clients at Lingoport, we regularly see everything from mild confusion to a complete
lack of awareness of what Unicode is. For the less- or under-informed, perhaps this article will help.
Unicode is a key underpinning of global software applications and websites, allowing them to support
language scripts from around the world. It is a very important standard to be aware of, whether you’re in
localization, an engineer or a business manager.
Firstly, Unicode is a character set standard used for displaying and processing language data in computer
applications. The Unicode character set aims to cover the entire world’s set of characters, including
letters, numbers, currencies, symbols and the like, and it defines a number of character encodings to make
that all happen. Before your eyes glaze over, let me explain what character encoding means. You have to
remember that for a computer, all information is represented in zeros and ones (i.e. binary values). So if
you think of the letter A in the ASCII standard of zeros and ones, it would look like this: 1000001. That is,
a 1, then five zeros and a 1, to make a total of 7 bits. The number behind this pattern (65 in decimal) is
A’s code point, and the mapping of characters to zeros and ones is called the character encoding. In the
early days of computing, unless you did something very special, ASCII (7 bits per character) was how your
data got managed. The problem is that ASCII doesn’t leave you enough zeros and ones to represent extended
characters, like accents and characters specific to non-English alphabets, such as you find in European
languages. You certainly can’t support the complex characters that make up Chinese, Korean and Japanese.
European languages require 8-bit (single-byte) character encodings, and Chinese, Korean and Japanese
require 16-bit (double-byte) character encodings. One important note on these single- and double-byte
encodings is that virtually all of them are a superset of 7-bit ASCII, which means that unaccented English
letters, digits and basic punctuation are encoded the same way regardless of the encoding.
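As a small illustration of the A example above, the following Python sketch prints the 7-bit ASCII pattern of a character and checks that the byte is the same in a few ASCII-compatible encodings. The helper name ascii_bits is made up for this illustration.

```python
def ascii_bits(ch: str) -> str:
    """Return the 7-bit ASCII pattern for a single character (hypothetical helper)."""
    code = ord(ch)  # numeric code of the character, 65 for "A"
    if code > 0x7F:
        raise ValueError(f"{ch!r} is outside the 7-bit ASCII range")
    return format(code, "07b")  # zero-padded to 7 binary digits


print(ord("A"))         # 65
print(ascii_bits("A"))  # 1000001

# The same letter is the same single byte in several ASCII-compatible encodings.
print("A".encode("ascii") == "A".encode("latin-1") == "A".encode("utf-8"))  # True
```

Calling ascii_bits with an accented letter such as é raises an error, which is exactly the limitation of ASCII described above.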
The Bad Old Days
In the early computing days, specific single- and double-byte character encodings were developed to
support various languages. That was very bad, as it meant that software developers needed to build a
version of their application for every language they wanted to support that used a different encoding.
You’d have the Japanese version, the Western European language version, the English-only version and so
on. You’d end up with a horde of individual software code bases, each needing its own testing, updating
and ongoing maintenance and support. That is very expensive, and it is nearly impossible for businesses to
support realistically without serious divergence among the various language versions over time. You don’t
see this problem very often in newly developed applications, but there are plenty of holdovers. We
typically see it when a new client has turned over their source code to a particular country partner or
marketing agent, which was then responsible for adapting the code to multiple languages. The worst case I
saw was in 2004, when a particular client, whom I will leave unmentioned, had a legacy product with 18
separate language versions and no longer had any real idea how functionality varied from language to
language. That’s no way to grow a corporate empire!

[Embedded media: An Introduction to Unicode and Character Encoding]

ISO Latin
A single-byte character set that we often see in applications is ISO Latin 1, standardized as ISO-8859-1
and common on UNIX. Windows-1252 on Windows and MacRoman on guess what platform cover a very similar
repertoire, although they are not byte-for-byte identical to it. This character set supports characters
used in Western European languages such as French, Spanish, German, and U.K. English. Since each character
requires only a single byte, this character set provides support for multiple languages while avoiding the
work required to support either Unicode or a double-byte encoding. The trouble is that this still leaves out
much of the world.

[Embedded media: Unicode: The Movie]

For example, to support Eastern European languages you need to use a different character set, often
referred to as Latin 2, which provides the characters that are uniquely needed for these languages. There
are also separate character sets for Baltic languages, Turkish, Arabic, Hebrew, and on and on. When
internationalizing software for the first time, companies will sometimes start by supporting only ISO
Latin 1 if it meets their immediate marketing requirements, and deal with the more extensive work of
supporting other languages later. The reason is that these applications will likely need major reworking of
the encoding support in their database and in the functions, methods and classes within their source code
to go beyond ISO Latin support, which means more time and more money, often cascading into later releases
and foregone revenues. However, if the software company has truly global ambitions, it will need to take
that plunge and provide Unicode support. I’ll argue that if companies are supporting global customers, even
without doing a bit of translation/localization for the interface, they still need to support Unicode so
they can process their customers’ global data.
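To make the coverage gaps concrete, here is a minimal Python sketch that tries to encode a few sample strings in ISO-8859-1 (Latin 1), ISO-8859-2 (Latin 2) and UTF-8. The helper can_encode and the sample strings are illustrative assumptions, not part of any library.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Report whether every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Illustrative samples: Western European, Eastern European, and Japanese text.
samples = {
    "French": "déjà vu",
    "Polish": "łódź",
    "Japanese": "日本語",
}

for language, text in samples.items():
    coverage = {
        enc: can_encode(text, enc)
        for enc in ("iso-8859-1", "iso-8859-2", "utf-8")
    }
    print(language, coverage)
```

The Polish sample fits Latin 2 but not Latin 1, the Japanese sample fits neither, and only UTF-8 handles all three, which is the argument for moving to Unicode rather than adding one single-byte character set at a time.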

Unicode
We come back to Unicode, which, as mentioned above, is a character set created to enable support of any
written language worldwide. You might still find a language or two whose script lacks Unicode support,
but that is becoming extremely rare. At the time of writing, for instance, Javanese, Loma, and Tai Viet
were among the scripts not yet supported (Javanese and Tai Viet have since been added). Arcane until you
need them, I suppose. I remember a few years ago when we were developing a multilingual site which needed
support for Khmer and Armenian, and we were thankful that Unicode had added their support just a few
months prior. If you have a marketing requirement for your software to support Japanese or Chinese, think
Unicode. That’s because you will need to move to a double-byte encoding at the very least, and as soon as
you go to the trouble of doing that, you might as well support Unicode and get the added benefit of
support for all languages.

UTF-8
Once you’ve chosen to support Unicode, you must decide on the specific character encoding you want to
use, which will depend on the application requirements and technologies. UTF-8 is one of the commonly
used character encodings defined within the Unicode Standard. It uses a single byte for each character
unless it needs more, in which case it can expand up to 4 bytes. People sometimes refer to this as a
variable-width encoding, since the width of the character in bytes varies depending upon the character.
The advantage of this character encoding is that all English (ASCII) characters remain single bytes,
saving data space. This is especially desirable for web content, since the underlying HTML markup remains
in single-byte ASCII. In general, UNIX platforms are optimized for UTF-8 character encoding. Concerning
databases, where large amounts of application data are integral to the application, a developer may
choose a UTF-8 encoding to save space if most of the data in the database does not need translation and
so can remain in English (which requires only a single byte per character in UTF-8). Note that some
databases have not supported UTF-8; Microsoft’s SQL Server, specifically, did not until SQL Server 2019.
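The variable width is easy to see by counting bytes. The sketch below is only an illustration; the helper utf8_width and the sample characters are assumptions chosen to show one character from each width class.

```python
def utf8_width(ch: str) -> int:
    """Number of bytes UTF-8 uses for a single character."""
    return len(ch.encode("utf-8"))


# One sample per width: ASCII letter, accented Latin letter, CJK ideograph, emoji.
for ch in ["A", "é", "日", "😀"]:
    print(f"U+{ord(ch):04X} takes {utf8_width(ch)} byte(s) in UTF-8")

# ASCII-only markup costs exactly one byte per character.
markup = "<p>Hello</p>"
print(len(markup), len(markup.encode("utf-8")))  # 12 12
```

Under these assumptions the four samples take 1, 2, 3 and 4 bytes respectively, while the HTML fragment stays at one byte per character.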

UTF-16
UTF-16 is another widely adopted encoding within the Unicode standard. It assigns at least two bytes to
each character, whether the character needs them or not. So the letter A is 00000000 01000001, or 9 zeros,
a one, followed by 5 zeros and a one. If more than 2 bytes are needed for a character, two 2-byte units
are combined into four bytes; however, you must adapt your software to be capable of handling this
four-byte combination. Java and .NET internally process strings (text and messages) as UTF-16.
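A short Python sketch of the two cases described above: a character that fits in one 16-bit unit, and one that needs the four-byte combination (a surrogate pair). The helper utf16_units is a hypothetical name, and big-endian byte order without a byte order mark is assumed for readability.

```python
def utf16_units(ch: str) -> list[int]:
    """Split a character into its 16-bit UTF-16 code units (big-endian, no BOM)."""
    data = ch.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


# "A" fits in one 16-bit unit: 9 zeros, a one, 5 zeros, a one.
print([format(u, "016b") for u in utf16_units("A")])  # ['0000000001000001']

# U+1F600 lies beyond the 16-bit range, so it becomes two units (four bytes).
print([f"{u:04X}" for u in utf16_units("\U0001F600")])  # ['D83D', 'DE00']
```

Code that assumes one 16-bit unit per character would treat the second example as two characters, which is the adaptation work the paragraph above refers to.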
For many applications, you can actually support multiple Unicode encodings, so that, for example, your
data is stored in your database as UTF-8 but is handled within your code as UTF-16, or vice versa. There
are various reasons to do this, such as software limitations (different software components supporting
different Unicode encodings) or storage and performance advantages. But whether that’s a good idea is one
of those “it depends” kinds of questions. Implementing it can be tricky, and clients pay us good money to
solve this.
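As a rough sketch of the storage-versus-processing split, the Python functions below transcode between UTF-8 bytes (as they might sit in a database) and UTF-16 bytes (as a UTF-16 based component might expect). The function names and the little-endian choice are assumptions for illustration; a real system would follow whatever its database driver and runtime require.

```python
def utf8_to_utf16(data: bytes) -> bytes:
    """Transcode UTF-8 bytes to UTF-16 (little-endian, no BOM)."""
    return data.decode("utf-8").encode("utf-16-le")


def utf16_to_utf8(data: bytes) -> bytes:
    """Transcode UTF-16 (little-endian, no BOM) bytes back to UTF-8."""
    return data.decode("utf-16-le").encode("utf-8")


stored = "na\u00efve \u65e5\u672c\u8a9e".encode("utf-8")  # bytes as stored
in_memory = utf8_to_utf16(stored)                          # bytes as processed

print(len(stored), len(in_memory))         # 16 18
print(utf16_to_utf8(in_memory) == stored)  # True: the round trip is lossless
```

The conversion itself is lossless because both encodings cover the same characters; the tricky parts in practice tend to be the places where a component silently assumes the wrong encoding.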
Microsoft’s SQL Server is a bit of a special case, in that it has traditionally supported UCS-2, which is
like UTF-16 but without the 4-byte characters (only the 16-bit characters are supported).
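One way to picture the UCS-2 limitation is a check for whether a string stays within the 16-bit range. This is a hypothetical helper written for illustration, not a SQL Server API.

```python
def fits_in_ucs2(text: str) -> bool:
    """True if every character is a 16-bit code point, i.e. needs no surrogate pair."""
    return all(ord(ch) <= 0xFFFF for ch in text)


print(fits_in_ucs2("Unicode 日本語"))  # True: all characters are in the 16-bit range
print(fits_in_ucs2("\U00020000"))     # False: a CJK ideograph beyond the 16-bit range
```

A check like this could flag data that a UCS-2-only component would not handle as whole characters.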

GB 18030
There’s also a special-case character set for software intended for sale in China (PRC), where support for
it is required by the Chinese government. This character set is GB 18030, and it covers the full Unicode
range, supporting both simplified and traditional Chinese. Similarly to UTF-16, the GB 18030 character
encoding allows 4 bytes per character to support characters beyond Unicode’s “basic” (16-bit) range, and in
practice supporting UTF-16 (or UTF-8) is considered an acceptable approach to supporting GB 18030 (the
UCS-2 encoding just mentioned is not, however).
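Python ships a gb18030 codec, so the byte widths can be inspected directly. The sketch below is only an illustration with arbitrarily chosen sample characters; it says nothing about what certification for the Chinese market actually requires.

```python
def gb18030_width(ch: str) -> int:
    """Number of bytes the GB 18030 encoding uses for a single character."""
    return len(ch.encode("gb18030"))


# An ASCII letter, a common Chinese character, and a character beyond the 16-bit range.
for ch in ["A", "中", "\U00020000"]:
    print(f"U+{ord(ch):04X} takes {gb18030_width(ch)} byte(s) in GB 18030")

# Text survives a round trip from Unicode to GB 18030 and back.
text = "A中\U00020000"
print(text.encode("gb18030").decode("gb18030") == text)  # True
```

The three samples take 1, 2 and 4 bytes, and the lossless round trip is why a UTF-8 or UTF-16 based application can convert to GB 18030 at its boundaries.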
Now, all of this considered, a converse question might be: what happens when you try to make your
application support complex scripts that need Unicode, and the support isn’t there? Depending upon your
system, you get anything from garbled, meaningless gibberish, where data or messages become corrupted
characters or weird square boxes, to an application crash that forces a restart. Not good.
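The failure modes above can be reproduced in a few lines of Python. This is a sketch of three common cases (wrong decoding, lossy replacement, and an outright error); the function names are made up for the illustration.

```python
def garble(text: str) -> str:
    """Simulate UTF-8 data being read by code that assumes Latin 1."""
    return text.encode("utf-8").decode("latin-1")


def squash_to_ascii(text: str) -> bytes:
    """Simulate an ASCII-only component that replaces what it cannot store."""
    return text.encode("ascii", errors="replace")


print(garble("café"))             # cafÃ©  (gibberish instead of the accent)
print(squash_to_ascii("日本語"))   # b'???' (the data is lost)

try:
    "日本語".encode("ascii")       # strict mode: the operation fails outright
except UnicodeEncodeError as exc:
    print("encoding failed:", exc.reason)
```

In an application without error handling around the last case, the exception is what surfaces to the user as a crash.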

If your application supports Unicode, you are ready to take on the world.

About Lingoport
Founded in 2001, Lingoport provides extensive software localization and internationalization consulting
services. Lingoport’s Globalyzer software, a market-leading software internationalization tool, helps entire
enterprises and development teams to effectively internationalize existing and newly developed source
code and to prepare their applications for localization.

[Embedded media: An Introduction to Lingoport’s Globalyzer]

Internationalization (i18n) and Localization (l10n) - Partners in Successful ...Internationalization (i18n) and Localization (l10n) - Partners in Successful ...
Internationalization (i18n) and Localization (l10n) - Partners in Successful ...
 

Recently uploaded

GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Thierry Lestable
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
Alison B. Lowndes
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
Elena Simperl
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
Product School
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
DianaGray10
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
Cheryl Hung
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
Dorra BARTAGUIZ
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Aggregage
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
Ana-Maria Mihalceanu
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
Thijs Feryn
 
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Nexer Digital
 
Quantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIsQuantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIs
Vlad Stirbu
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
sonjaschweigert1
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 

Recently uploaded (20)

GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?
 
Quantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIsQuantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIs
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 

Unicode Primer for the Uninitiated

This article is a part of Lingoport.com; the original article can be found at http://www.lingoport.com/software-internationalization-articles/unicode-primer-for-the-uninitiated/

Among our friends and clients at Lingoport, we regularly see everything from confusion to a complete lack of awareness of what Unicode is. So for the less- or under-informed, perhaps this article will help. The advent of Unicode is a key underpinning for global software applications and websites so that they can support worldwide language scripts. So it’s a very important standard to be aware of, whether you’re in localization, an engineer or a business manager.

Firstly, Unicode is a character set standard used for displaying and processing language data in computer applications. The Unicode character set is the entire world’s set of characters, including letters, numbers, currencies, symbols and the like, supporting a number of character encodings to make that all happen. Before your eyes glaze over, let me explain what character encoding means. You have to remember that for a computer, all information is represented in zeros and ones (i.e. binary values). So if you think of the letter A in the ASCII standard of zeros and ones it would look like this: 1000001. That is, a 1, then five zeros and a 1, to make a total of 7 bits. The number behind this pattern (65 in decimal) is called A’s code point, and the mapping between characters and the zeros and ones used to store them is called the character encoding.

In the early days of computing, unless you did something very special, ASCII (7 bits per character) was how your data got managed. The problem is that ASCII doesn’t leave you enough zeros and ones to represent extended characters, like accents and characters specific to non-English alphabets, such as you find in European languages. You certainly can’t support the complex characters that make up Chinese, Korean and Japanese languages. These languages require 8-bit (single-byte) or 16-bit (double-byte) character encodings. One important note on all of these single- and double-byte encodings is that they are supersets of 7-bit ASCII, which means that English code points will always be the same regardless of the encoding.
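
The ASCII example above can be checked directly. The following is a minimal Python sketch, not part of the original article; the sample word and the particular list of encodings are assumptions chosen for illustration. It prints the code point and 7-bit pattern for the letter A, then confirms that plain English text produces identical bytes under several ASCII-compatible encodings.

```python
# Illustrative sketch: look at the number and bit pattern behind the letter "A".
letter = "A"
code_point = ord(letter)               # the character's number: 65
bits = format(code_point, "07b")       # the same number as 7 binary digits

# Plain English text encodes to the same bytes under these
# ASCII-compatible encodings (the list is an arbitrary sample).
text = "Hello"
encodings = ["ascii", "latin-1", "cp1252", "shift_jis", "utf-8"]
encoded = {name: text.encode(name) for name in encodings}
all_same = len(set(encoded.values())) == 1

print(code_point, bits, all_same)
```

Encodings that are not ASCII-compatible, such as UTF-16 discussed later, would not pass this check, which is why the superset property is worth confirming for whichever encodings an application actually uses.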

The Bad Old Days

In the early computing days, specific single- and double-byte character encodings were developed to support various languages. That was very bad, as it meant that software developers needed to build a version of their application for every language they wanted to support that used a different encoding. You’d have the Japanese version, the Western European language version, the English-only version and so on. You’d end up with a horde of individual software code bases, each needing its own testing, updating and ongoing maintenance and support, which is very expensive, and pretty near impossible for businesses to realistically support without the various language versions seriously diverging over time.

[Embedded media: An Introduction to Unicode and Character Encoding]

You don’t see this problem very often for newly developed applications, but there are plenty of holdovers. We see it typically when a new client has turned over their source code to a particular country partner or marketing agent which was responsible for adapting the code to multiple languages. The worst case I saw was in 2004 when a particular client, who I will leave unmentioned, had a legacy product with 18 separate language versions and no longer had any real idea how much the functionality varied from language to language. That’s no way to grow a corporate empire!

ISO Latin

A single-byte character set that we often see in applications is ISO Latin 1, which appears in closely related encodings such as ISO-8859-1 on UNIX, Windows-1252 on Windows and MacRoman on (guess what) the Mac. This character set supports characters used in Western European languages such as French, Spanish, German, and U.K. English. Since each character requires only a single byte, this character set provides support for multiple languages, while avoiding the work required to support either Unicode or a double-byte encoding. Trouble is that still leaves out much of the world.

[Embedded media: Unicode: The Movie]
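
To make the limits of a single-byte character set concrete, here is a small Python sketch. It is not from the original article; the helper name `try_encode` and the sample characters are assumptions made for illustration. It shows that the three Latin 1 style encodings each store é in one byte (though not always the same byte), and that a Polish letter has no slot in Latin 1 at all.

```python
def try_encode(text, encoding):
    """Return the encoded bytes, or None if the encoding cannot represent the text."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None

# A Western European letter fits in one byte in each encoding,
# but MacRoman assigns it a different byte value than the other two.
e_acute = "é"
latin1_bytes = try_encode(e_acute, "latin-1")
cp1252_bytes = try_encode(e_acute, "cp1252")
macroman_bytes = try_encode(e_acute, "mac_roman")

# A Polish letter is missing from Latin 1 but present in Latin 2 (ISO-8859-2).
polish = "ł"
polish_latin1 = try_encode(polish, "latin-1")
polish_latin2 = try_encode(polish, "iso8859_2")

print(latin1_bytes, cp1252_bytes, macroman_bytes, polish_latin1, polish_latin2)
```

The differing byte for é under MacRoman is a reminder that data written with one single-byte encoding and read with another can silently change characters.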

For example, to support Eastern European languages you need to use a different character set, often referred to as Latin 2, which provides the characters that are uniquely needed for these languages. There are also separate character sets for Baltic languages, Turkish, Arabic, Hebrew, and on and on.

When having to internationalize software for the first time, companies will sometimes start by supporting just ISO Latin 1 if it meets their immediate marketing requirements, and deal with the more extensive work of supporting other languages later. The reason is that going beyond ISO Latin support will likely require major reworking of the encoding support in the database and in the functions, methods and classes within the source code, which means more time and more money, often cascading into later releases and foregone revenues. However, if the software company has truly global ambitions, they will need to take that plunge and provide Unicode support. I’ll argue that if companies are supporting global customers, even if they are not doing a bit of translation or localization of the interface, they still need to support Unicode so they can process their customers’ global data.

Unicode

We come back to Unicode, which as we mentioned above, is a character set created to enable support of any written language worldwide. Now you might find a language or two lacking Unicode support for its script, but that is becoming extremely rare. For instance, when this article was written, Javanese, Loma, and Tai Viet were among the scripts not yet supported (Javanese and Tai Viet have since been added). Arcane until you need them, I suppose. I remember a few years ago when we were developing a multi-lingual site which needed support for Khmer and Armenian, and we were thankful that Unicode had just added their support a few months prior.

If you have a marketing requirement for your software to support Japanese or Chinese, think Unicode. That’s because you will need to move to a double-byte encoding at the very least, and as soon as you go through the trouble to do that, you might as well support Unicode and get the added benefit of support for all languages.

UTF-8

Once you’ve chosen to support Unicode, you must decide on the specific character encoding you want to use, which will depend on the application requirements and technologies. UTF-8 is one of the commonly used character encodings defined within the Unicode Standard. It uses a single byte for each character unless it needs more, in which case it can expand up to 4 bytes. People sometimes refer to this as a variable-width encoding, since the width of the character in bytes varies depending upon the character. The advantage of this character encoding is that all English (ASCII) characters remain single bytes, saving data space. This is especially desirable for web content, since the underlying HTML markup will remain in single-byte ASCII. In general, UNIX platforms are optimized for UTF-8 character encoding.
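
The variable-width behavior described above is easy to observe. This is a brief Python sketch added for illustration, not part of the original article; the sample characters and the HTML snippet are assumptions. It counts how many bytes UTF-8 needs for characters from different scripts, and shows that the markup around non-English content still costs one byte per character.

```python
# Count the bytes UTF-8 uses for characters from different ranges.
samples = ["A", "é", "€", "日", "😀"]
widths = {ch: len(ch.encode("utf-8")) for ch in samples}

# In a web page, the ASCII markup stays at one byte per character
# even when the content itself needs three bytes per character.
html = "<p>こんにちは</p>"
markup_bytes = len("<p></p>".encode("utf-8"))
total_bytes = len(html.encode("utf-8"))

print(widths)
print(markup_bytes, total_bytes)
```

Because a character count and a byte count can differ, code that sizes buffers or database columns for UTF-8 data needs to be clear about which of the two it is measuring.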

Concerning databases, where large amounts of application data are integral to the application, a developer may choose a UTF-8 encoding to save space if most of the data in the database does not need translation and so can remain in English (which requires only a single byte per character in UTF-8). Note that some databases will not support UTF-8; Microsoft’s SQL Server was the notable example when this was written (UTF-8 collations arrived only with SQL Server 2019).

UTF-16

UTF-16 is another widely adopted encoding within the Unicode standard. It assigns two bytes for each character whether you need them or not. So the letter A is 00000000 01000001, or 9 zeros, a one, followed by 5 zeros and a one. If more than 2 bytes are needed for a character, two 2-byte units can be combined into a four-byte sequence (a surrogate pair); however, you must adapt your software to be capable of handling this four-byte combination. Java and .NET internally process strings (text and messages) as UTF-16.

For many applications, you can actually support multiple Unicode encodings so that, for example, your data is stored in your database as UTF-8 but is handled within your code as UTF-16, or vice versa. There are various reasons to do this, such as software limitations (different software components supporting different Unicode encodings), storage or performance advantages, etc. But whether that’s a good idea is one of those “it depends” kinds of questions. Implementing this can be tricky, and clients pay us good money to solve it.
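
The two-byte and four-byte cases, and the idea of storing in one Unicode encoding while processing in another, can be sketched as follows. This Python example is an illustration only and not part of the original article; the helper name `utf16_units` and the sample strings are assumptions, and big-endian byte order is chosen simply so the bytes read left to right.

```python
def utf16_units(text):
    """Return the UTF-16 code units of text as 4-digit hex strings (big-endian)."""
    data = text.encode("utf-16-be")
    return [data[i:i + 2].hex() for i in range(0, len(data), 2)]

# "A" takes a single 2-byte unit; its bits match the pattern quoted above.
a_units = utf16_units("A")
a_bits = " ".join(format(b, "08b") for b in "A".encode("utf-16-be"))

# A character beyond the 16-bit range takes two units (four bytes in total).
emoji_units = utf16_units("😀")

# Store as UTF-8, process as UTF-16, and convert back without loss.
stored = "日本語".encode("utf-8")
as_utf16 = stored.decode("utf-8").encode("utf-16-le")
round_trip_ok = as_utf16.decode("utf-16-le").encode("utf-8") == stored

print(a_units, a_bits, emoji_units, round_trip_ok)
```

Software that assumes every character is exactly one 2-byte unit will miscount or split the second case, which is the adaptation the text above warns about.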

Microsoft’s SQL Server is a bit of a special case, in that it supports UCS-2, which is like UTF-16 but without the 4-byte characters (only the 16-bit characters are supported).

GB 18030

There’s also a special-case character set when it comes to engineering software intended for sale in China (PRC), which is required by the Chinese Government. This character set is GB 18030, and it covers the full range of Unicode, supporting both simplified and traditional Chinese. Similarly to UTF-16, the GB 18030 character encoding allows 4 bytes per character to support characters beyond Unicode’s “basic” (16-bit) range, and in practice supporting UTF-16 (or UTF-8) is considered an acceptable approach to supporting GB 18030 (the UCS-2 encoding just mentioned is not, however).

Now all of this considered, a converse question might be: what happens when you try to make your application support complex scripts that need Unicode, and the support isn’t there? Depending upon your system, you get anything from meaningless gibberish, where data or messages turn into corrupted characters or weird square boxes, to an application crash that forces a restart. Not good.

If your application supports Unicode, you are ready to take on the world.

About Lingoport

Founded in 2001, Lingoport provides extensive software localization and internationalization consulting services. Lingoport’s Globalyzer software, a market-leading software internationalization tool, helps entire enterprises and development teams to effectively internationalize existing and newly developed source code and to prepare their applications for localization.

[Embedded media: An Introduction to Lingoport’s Globalyzer]