This article is a part of Lingoport.com; the original article can be found at http://www.lingoport.com/software-internationalization-articles/unicode-primer-for-the-uninitiated/

Unicode Primer for the Uninitiated

Among our friends and clients at Lingoport, we regularly see anything from confusion to a complete lack of awareness of what Unicode is. So for the less- or under-informed, perhaps this article will help. The advent of Unicode is a key underpinning for global software applications and websites, allowing them to support the world’s language scripts. So it’s a very important standard to be aware of, whether you’re in localization, an engineer or a business manager.

Firstly, Unicode is a character set standard used for displaying and processing language data in computer applications. The Unicode character set is the entire world’s set of characters, including letters, numbers, currencies, symbols and the like, supporting a number of character encodings to make that all happen. Before your eyes glaze over, let me explain what character encoding means. You have to remember that for a computer, all information is represented in zeros and ones (i.e. binary values). So if you think of the letter A in the ASCII standard of zeros and ones, it would look like this: 1000001. That is, a 1, then five zeros, and a 1, to make a total of 7 bits. The number behind this binary representation (65 in decimal) is called A’s code point, and the mapping of zeros and ones to characters is called the character encoding. In the early days of computing, unless you did something very special, ASCII (7 bits per character) was how your data got managed. The problem is that ASCII doesn’t leave you enough zeros and ones to represent extended characters, like accents and characters specific to non-English alphabets, such as you find in European languages. You certainly can’t support the complex characters that make up the Chinese, Korean and Japanese languages. European languages need 8-bit (single-byte) character encodings, while Chinese, Korean and Japanese need 16-bit (double-byte) character encodings.
One important note on these single- and double-byte encodings is that they are, for practical purposes, supersets of 7-bit ASCII, which means that English code points stay the same regardless of the encoding.
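
The bit pattern for A, and the point about ASCII surviving inside other encodings, can be checked with a few lines of Python. This is only a minimal sketch: the variable names are illustrative, and the list of encodings is an arbitrary sample of Python codec names chosen for this example, not an exhaustive survey.

```python
# Show the 7-bit ASCII pattern for "A" and check that the letter keeps the
# same byte value in several single- and double-byte encodings.
letter = "A"
code_point = ord(letter)            # the numeric code point of "A"
bits = format(code_point, "07b")    # the same number as seven binary digits

# Python codec names, picked purely for illustration.
encodings = ["ascii", "latin-1", "cp1252", "iso-8859-2", "euc_jp", "big5", "utf-8"]
encoded = {name: letter.encode(name) for name in encodings}

# An accented letter has a code point above 127, so 7 bits cannot hold it.
e_acute = ord("\u00e9")

print(code_point, bits)
for name, data in encoded.items():
    print(name, data.hex())
print(e_acute, e_acute > 127)
```

In every encoding listed, the letter A comes out as the single byte 41 (hexadecimal), which is the practical meaning of "superset of ASCII" here.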

The Bad Old Days

In the early computing days, specific single- and double-byte character encodings were developed to support various languages. That was very bad, as it meant that software developers needed to build a version of their application for every language they wanted to support that used a different encoding. You’d have the Japanese version, the Western European language version, the English-only version and so on. You’d end up with a horde of individual software code bases, each needing its own testing, updating and ongoing maintenance and support, which is very expensive, and pretty near impossible for businesses to realistically support without serious divergence among the various language versions over time. You don’t see this problem very often in newly developed applications, but there are plenty of holdovers. We typically see it when a new client has turned over their source code to a particular country partner or marketing agent which was responsible for adapting the code to multiple languages. The worst case I saw was in 2004, when a particular client, who I will leave unmentioned, had a legacy product with 18 separate language versions and no longer had any real idea how functionality varied from language to language. That’s no way to grow a corporate empire!

ISO Latin

A single-byte character set that we often see in applications is ISO Latin 1, which is represented in various encoding standards such as ISO-8859-1 for UNIX, Windows-1252 for Windows and MacRoman on guess what platform. This character set supports characters used in Western European languages such as French, Spanish, German, and U.K. English. Since each character requires only a single byte, this character set provides support for multiple languages, while avoiding the work required to support either Unicode or a double-byte encoding.
The trouble is that this still leaves out much of the world. For example, to support Eastern European languages you need to use a different character set, often referred to as Latin 2, which provides the characters that are uniquely needed for these languages. There are also separate character sets for Baltic languages, Turkish, Arabic, Hebrew, and on and on. When having to internationalize software for the first time, companies will sometimes start by supporting just ISO Latin 1 if it meets their immediate marketing requirements, and deal with the more extensive work of supporting other languages later. The reason is that these software applications will likely need major reworking of the encoding support in their database and in the functions, methods and classes within their source code to go beyond ISO Latin support, which means more time and more money, often cascading into later releases and foregone revenues. However, if the software company has truly global ambitions, it will need to take that plunge and provide Unicode support. I’ll argue that if companies are supporting global customers, even without doing a bit of translation or localization of the interface, they still need to support Unicode so they can process their customers’ global data.

Unicode

We come back to Unicode, which, as we mentioned above, is a character set created to enable support of any written language worldwide. Now you might find a language or two lacking Unicode support for its script, but that is becoming extremely rare. At the time of writing, for instance, Javanese, Loma, and Tai Viet were among the scripts not yet supported (Javanese and Tai Viet have since been added). Arcane until you need them, I suppose. I remember a few years ago when we were developing a multi-lingual site which needed support for Khmer and Armenian, and we were thankful that Unicode had just added their support a few months prior. If you have a marketing requirement for your software to support Japanese or Chinese, think Unicode. That’s because you will need to move to a double-byte encoding at the very least, and as soon as you go through the trouble to do that, you might as well support Unicode and get the added benefit of support for all languages.

UTF-8

Once you’ve chosen to support Unicode, you must decide on the specific character encoding you want to use, which will depend on the application requirements and technologies. UTF-8 is one of the commonly used character encodings defined within the Unicode Standard. It uses a single byte for each character unless it needs more, in which case it can expand up to 4 bytes. People sometimes refer to this as a variable-width encoding, since the width of the character in bytes varies depending upon the character. The advantage of this character encoding is that all English (ASCII) characters remain single bytes, saving data space. This is especially desirable for web content, since the underlying HTML markup will remain in single-byte ASCII. In general, UNIX platforms are optimized for UTF-8 character encoding.
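
To make the variable-width idea concrete, here is a small Python sketch that counts the bytes UTF-8 needs for a few characters. The sample characters, the names `samples`, `utf8_lengths` and `html`, and the tiny HTML fragment are all assumptions made for illustration.

```python
# Count how many bytes UTF-8 uses for characters from different ranges.
samples = {
    "A": "Latin letter (ASCII)",
    "\u00e9": "e with acute accent",
    "\u65e5": "CJK ideograph",
    "\U0001F600": "emoji beyond the 16-bit range",
}
utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, label in samples.items():
    print(f"U+{ord(ch):04X} {label}: {utf8_lengths[ch]} byte(s)")

# A hypothetical HTML fragment: the markup is ASCII, the content is Japanese.
html = "<p>\u65e5\u672c\u8a9e</p>"
utf8_size = len(html.encode("utf-8"))
utf16_size = len(html.encode("utf-16-le"))   # two bytes per code unit, no BOM
print(utf8_size, utf16_size)
```

The markup characters cost one byte each in UTF-8, which is why the fragment as a whole is smaller in UTF-8 even though each of its Japanese characters takes three bytes there and only two in UTF-16.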

Concerning databases, where large amounts of application data are integral to the application, a developer may choose a UTF-8 encoding to save space if most of the data in the database does not need translation and so can remain in English (which requires only a single byte per character in UTF-8 encoding). Note that some databases have historically not supported UTF-8; Microsoft’s SQL Server, for example, did not offer it before SQL Server 2019.

UTF-16

UTF-16 is another widely adopted encoding within the Unicode standard. It assigns two bytes to each character whether you need them or not. So the letter A is 00000000 01000001, or 9 zeros, a one, followed by 5 zeros and a one. If a character does not fit in 2 bytes, two 2-byte units (a surrogate pair) are combined into a four-byte sequence; however, you must adapt your software to be capable of handling this four-byte combination. Java and .NET internally process strings (text and messages) as UTF-16.

For many applications, you can actually support multiple Unicode encodings so that, for example, your data is stored in your database as UTF-8 but is handled within your code as UTF-16, or vice versa. There are various reasons to do this, such as software limitations (different software components supporting different Unicode encodings), storage or performance advantages, etc. But whether that’s a good idea is one of those “it depends” kinds of questions. Implementing it can be tricky, and clients pay us good money to solve this.
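
The UTF-16 bit pattern for A, the four-byte combination, and the store-as-UTF-8, process-as-UTF-16 arrangement can all be sketched in Python. This is an illustration only: `utf16_code_units` is a hypothetical helper written for this example, and the sample string stands in for whatever your database would actually hold.

```python
def utf16_code_units(text):
    """Return the 16-bit code units of text in UTF-16 (big-endian, no BOM)."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

# The letter A is one 16-bit unit: nine zeros, a one, five zeros, a one.
a_units = utf16_code_units("A")
a_bits = format(a_units[0], "016b")

# A character beyond the 16-bit range becomes two units (four bytes).
emoji_units = utf16_code_units("\U0001F600")

# Data stored as UTF-8, decoded for processing, then re-encoded as UTF-16.
stored = "Unicode \u65e5\u672c\u8a9e".encode("utf-8")
in_memory = stored.decode("utf-8")
as_utf16 = in_memory.encode("utf-16-le")
round_trip_ok = as_utf16.decode("utf-16-le") == in_memory

print(a_bits)
print([hex(unit) for unit in emoji_units])
print(len(stored), len(as_utf16), round_trip_ok)
```

Code that assumes "one character equals one 16-bit unit" would miscount the emoji above as two characters, which is the kind of adaptation the four-byte case calls for.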

Microsoft’s SQL Server is a bit of a special case, in that it has historically supported UCS-2, which is like UTF-16 but without the 4-byte characters (only the 16-bit characters are supported).

GB 18030

There’s also a special-case character set when it comes to engineering software intended for sale in China (PRC), where its support is required by the Chinese Government. This character set is GB 18030, and its encoding covers the full Unicode repertoire, supporting both simplified and traditional Chinese. Similarly to UTF-16, the GB 18030 character encoding allows up to 4 bytes per character to support characters beyond Unicode’s “basic” (16-bit) range, and in practice supporting UTF-16 (or UTF-8) is considered an acceptable approach to supporting GB 18030 (the UCS-2 encoding just mentioned is not, however).

Now, all of this considered, a converse question might be: what happens when you try to make your application support complex scripts that need Unicode, and the support isn’t there? Depending upon your system, you get anything from data or messages garbled into meaningless gibberish, corrupted characters or weird square boxes, to an application crash that forces a restart. Not good.

If your application supports Unicode, you are ready to take on the world.

About Lingoport

Founded in 2001, Lingoport provides extensive software localization and internationalization consulting services. Lingoport’s Globalyzer software, a market-leading software internationalization tool, helps entire enterprises and development teams to effectively internationalize existing and newly developed source code and to prepare their applications for localization.

An Introduction to Lingoport’s Globalyzer: