This document provides an introduction to internationalization best practices. It discusses internationalization vocabulary, common problems encountered like encoding issues, and internationalization support in Java and PHP. Key topics covered include Unicode, locales, formatting numbers and dates, and using resource bundles to externalize strings.
3. Training Outline Internationalisation Vocabulary Typical Problems Outline of the common problems found across the web Java and Internationalisation The level of Internationalisation support available in Java Resource Bundles Formatting messages the correct way PHP and Internationalisation The level of Internationalisation support available in PHP
5. Unicode International standard for representing written language in computers Latest version 5.2 adds 6648 new characters including support for Vedic Sanskrit Maintained in sync with ISO 10646 Three main encodings: UTF-8, UTF-16 and UTF-32 Address space of 21 bits
6. Unicode (contd.) UTF-8 is a multi-byte encoding that uses 8-bit code units An encoded character can take one, two, three or four bytes UTF-8 is backward compatible with US-ASCII Default encoding for PHP6?
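The one-to-four byte rule can be checked directly. The following is a minimal Java sketch, not part of the original slides; the class name `Utf8Lengths` and the sample characters are chosen only for illustration.

```java
import java.nio.charset.StandardCharsets;

final class Utf8Lengths {
    // Number of bytes UTF-8 needs to encode the given text.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String[] samples = {
            "A",            // U+0041 LATIN CAPITAL LETTER A (US-ASCII range)
            "\u00E9",       // U+00E9 LATIN SMALL LETTER E WITH ACUTE
            "\u0928",       // U+0928 DEVANAGARI LETTER NA
            "\uD83D\uDE00"  // U+1F600 GRINNING FACE, written as a surrogate pair
        };
        for (String s : samples) {
            System.out.printf("U+%04X -> %d byte(s)%n", s.codePointAt(0), utf8Length(s));
        }
    }
}
```

The ASCII letter takes a single byte, which is why UTF-8 stays backward compatible with US-ASCII; everything else needs two or more bytes.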
7. Unicode (contd.) UTF-16 uses 16-bit code units Cannot address the complete set, so uses surrogates Default encoding for strings in Java and JavaScript
8. Unicode (contd.) UTF-32 uses 32-bit code units Every Unicode character is addressed within a single code unit
9. Internationalisation Design and development of a product, application or document content that enables easy localization for target audiences that vary in culture, region, or language Abbreviated as I18n as there are eighteen characters between “I” and “n”
10. Localisation Adaptation of a product, application or document content to meet the language, cultural and other requirements of a specific target market (a “locale”) Translation is one aspect of localisation Abbreviated as L10n as there are ten characters between “L” and “n”
17. The Solution Determine the user environment Format dates, times, currencies as per the locale Understand the Internationalisation support available with your implementation language Use the ICU/Internationalisation libraries rather than rolling your own functions
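As an illustration of formatting as per the locale with a library rather than hand-written code, here is a small Java sketch using the standard `java.text.NumberFormat`. The class name `LocaleNumbers` and the sample value are assumptions made for this example; the exact output for some locales (for instance the digit grouping for `hi-IN`) depends on the locale data shipped with the JDK.

```java
import java.text.NumberFormat;
import java.util.Locale;

final class LocaleNumbers {
    // Format a number with exactly two fraction digits using the locale's conventions.
    static String formatNumber(double value, Locale locale) {
        NumberFormat nf = NumberFormat.getNumberInstance(locale);
        nf.setMinimumFractionDigits(2);
        nf.setMaximumFractionDigits(2);
        return nf.format(value);
    }

    public static void main(String[] args) {
        double value = 1234567.89;
        Locale[] locales = {
            Locale.US,
            Locale.GERMANY,
            Locale.forLanguageTag("hi-IN") // grouping shown here varies with the JDK's locale data
        };
        for (Locale locale : locales) {
            System.out.println(locale.toLanguageTag() + ": " + formatNumber(value, locale));
        }
    }
}
```

The same value is grouped and punctuated differently for each locale without any locale-specific logic in the calling code.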
19. Tofu characters – Black hollow boxes Shown as a black hollow box, typically one per character Indicates font problem i.e., the system doesn’t have the right fonts to display the glyph(s) Tofu isn’t always a software problem – not a bug but really annoying
21. Question Marks – Incorrect conversion “???” usually displayed when converting text from one encoding to another Means there is no equivalent character in the target encoding for the corresponding source character Not always a bug, though it sometimes occurs when an incorrect encoding is specified
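The question marks are easy to reproduce. This is a hedged Java sketch (the class name `QuestionMarks` is hypothetical) that encodes the Devanagari word “Namaskar” into ISO-8859-1, which has no Devanagari characters, and decodes it again.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class QuestionMarks {
    // Encode text into the target charset and decode it back with the same charset.
    // String.getBytes(Charset) substitutes the charset's replacement byte for
    // every character the charset cannot represent.
    static String roundTrip(String text, Charset target) {
        byte[] bytes = text.getBytes(target);
        return new String(bytes, target);
    }

    public static void main(String[] args) {
        String namaskar = "\u0928\u092E\u0938\u094D\u0915\u093E\u0930"; // Devanagari "Namaskar"
        System.out.println(roundTrip(namaskar, StandardCharsets.ISO_8859_1));
        System.out.println(roundTrip(namaskar, StandardCharsets.UTF_8).equals(namaskar));
    }
}
```

The information is lost at encoding time: once a character has been replaced by “?”, no later step can recover it.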
23. Mojibake –文字化け Pronounced as “Moh-jee-baa-kay” is a Japanese word meaning “garbled characters” Occurs when text in one encoding is “interpreted” as some other encoding Most often caused by mixing up Latin-1 and UTF-8, for example reading UTF-8 bytes as Latin-1 UTF-8 is byte-compatible only with US-ASCII Characters outside the ASCII range are encoded differently in UTF-8 than in Latin-1 and cause Mojibake when the wrong encoding is assumed
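A minimal Java sketch of how Mojibake arises, assuming a hypothetical helper class `Mojibake`: the text is encoded with one charset and then decoded with another.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class Mojibake {
    // Encode with the charset the text is really in, then decode with the
    // charset the reader wrongly assumes.
    static String misread(String text, Charset actual, Charset assumed) {
        return new String(text.getBytes(actual), assumed);
    }

    public static void main(String[] args) {
        String eAcute = "\u00E9"; // U+00E9 LATIN SMALL LETTER E WITH ACUTE
        // The two UTF-8 bytes 0xC3 0xA9 are read as two separate Latin-1 characters.
        System.out.println(misread(eAcute, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1));
        // Plain ASCII survives, because both encodings agree on that range.
        System.out.println(misread("abc", StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1));
    }
}
```

Unlike the “???” case, the bytes here are intact; only their interpretation is wrong, so decoding again with the right charset fixes the text.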
26. Unicode support in Java™ Java™ has always supported Unicode Java™ strings are UTF-16 A “char” in Java™ is a UTF-16 code unit, not a code point By default the input and output streams use the OS native charset On Windows™ this is Windows-1252 On most Unices and Unix-like OS this is UTF-8
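The difference between a UTF-16 code unit and a code point shows up as soon as a character outside the 16-bit range is used. The sketch below is illustrative only; the class name `CodeUnits` is an assumption.

```java
final class CodeUnits {
    // Number of UTF-16 code units, which is what String.length() reports.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String namaskar = "\u0928\u092E\u0938\u094D\u0915\u093E\u0930"; // Devanagari "Namaskar"
        String emoji = "\uD83D\uDE00"; // U+1F600, stored as a surrogate pair
        System.out.println(codeUnits(namaskar) + " code units, " + codePoints(namaskar) + " code points");
        System.out.println(codeUnits(emoji) + " code units, " + codePoints(emoji) + " code points");
    }
}
```

Code that loops over `char` values works for the Devanagari word but splits the surrogate pair, so `codePointAt` and related methods are the safer choice when any Unicode character may appear.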
39. That was broken! It's generally a bad idea to concatenate strings Does not work for all languages since the grammar is different! Always use string substitution using positional parameters
41. Correct Way (contd.) SumI18n.properties SUM = Sum of {0} and {1} is {2} SumI18n_hi.properties SUM = {0} अतिरिक्त {1} {2} के बराबर है SumI18n_ta.properties SUM = {0} மற்றும் {1} கூட்டினால் {2}
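Patterns like the ones above can be filled in with `java.text.MessageFormat`. The following sketch passes the pattern in as a string instead of loading the `SumI18n` bundles, so that it stays self-contained; the class name `SumMessage` and the reordered test pattern are assumptions made for illustration.

```java
import java.text.MessageFormat;
import java.util.Locale;

final class SumMessage {
    // Substitute the two operands and their sum into a positional pattern.
    static String sum(String pattern, Locale locale, int a, int b) {
        return new MessageFormat(pattern, locale).format(new Object[] {a, b, a + b});
    }

    public static void main(String[] args) {
        // Patterns copied from the property files on the slide.
        String english = "Sum of {0} and {1} is {2}";
        String hindi = "{0} \u0905\u0924\u093F\u0930\u093F\u0915\u094D\u0924 {1} {2} "
                + "\u0915\u0947 \u092C\u0930\u093E\u092C\u0930 \u0939\u0948";
        System.out.println(sum(english, Locale.ENGLISH, 2, 3));
        System.out.println(sum(hindi, Locale.forLanguageTag("hi-IN"), 2, 3));
    }
}
```

Because the placeholders are positional, a translator is free to move {2} to the front of the sentence without any change to the calling code.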
42. Oops! Java 1.5 property files are read as ISO-8859-1 (Latin-1) Use the “native2ascii” tool to convert Unicode files to \uXXXX escape sequences native2ascii -encoding UTF-8 SumI18n_hi.properties native2ascii -encoding UTF-8 SumI18n_ta.properties
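To show what the escaped form looks like to Java, here is a small sketch using `java.util.Properties`. The key `GREETING`, its value and the class name `PropertiesEncoding` are hypothetical. The second method uses `Properties.load(Reader)`, which is available from Java 6 onwards and lets the caller choose the encoding, as an alternative to escaping.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

final class PropertiesEncoding {
    // Byte-stream loading: bytes are read as ISO-8859-1 and \\uXXXX escapes are decoded.
    static String loadEscaped(String fileContent, String key) {
        try {
            Properties props = new Properties();
            props.load(new ByteArrayInputStream(fileContent.getBytes(StandardCharsets.ISO_8859_1)));
            return props.getProperty(key);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reader-based loading (Java 6+): the characters are used as they are.
    static String loadFromReader(String fileContent, String key) {
        try {
            Properties props = new Properties();
            props.load(new StringReader(fileContent));
            return props.getProperty(key);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String escaped = "GREETING = \\u0928\\u092E"; // what native2ascii would produce
        String direct = "GREETING = \u0928\u092E";    // the same text, unescaped
        System.out.println(loadEscaped(escaped, "GREETING").equals(loadFromReader(direct, "GREETING")));
    }
}
```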
45. Challenges PHP 5 (and earlier) does not understand characters and encodings The multi-byte extension (mbstring) in PHP works only for a few encodings (primarily CJK) PHP has very limited functions for formatting date, time, currencies, etc. PHP doesn’t provide linguistic sorting!
46. The Good News – Intl extension Open source – http://pecl.php.net/intl Designed for PHP 5.x, part of PHP 5.3 Configure using “--enable-intl” Leverages ICU and CLDR Available as OO and procedural APIs Collator::sort() vs. collator_sort() Yahoo! is a key contributor
50. Resource Bundles Externalize strings in your application Similar to how desktop applications are built One binary and additional language packs Similar to Windows™ resource files and Unix® message files Structure is different, see ICU resource bundles Key/value pairs Key is used by the application at run time to display the value
51. Additional Things Change the “default_charset” in php.ini to “utf-8” While “mbstring” works well enough for Indic languages, use the more precise “grapheme_*” functions from the Intl library “echo” is encoding agnostic
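The idea behind grapheme-aware functions is that what a user sees as one character may be stored as several code points. The sketch below shows the same idea in Java with the standard `java.text.BreakIterator`; it is a stand-in for illustration, not the PHP `grapheme_*` API, and the class name `Graphemes` is an assumption. How Indic conjuncts are segmented depends on the JDK version, so no result is claimed for the Devanagari sample.

```java
import java.text.BreakIterator;

final class Graphemes {
    // Count user-perceived characters by walking character boundaries.
    static int count(String text) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(text);
        int n = 0;
        while (it.next() != BreakIterator.DONE) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String eAcute = "e\u0301"; // "e" followed by U+0301 COMBINING ACUTE ACCENT
        String namaskar = "\u0928\u092E\u0938\u094D\u0915\u093E\u0930"; // Devanagari "Namaskar"
        System.out.println(eAcute.length() + " code units, " + count(eAcute) + " grapheme(s)");
        System.out.println(namaskar.length() + " code units, " + count(namaskar) + " grapheme(s)");
    }
}
```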
Typically “enabling” might involve designing and developing a product that does not have any country/region specific business logic. Additionally, it should externalise all country/region specific logic so that it can be customised for a country/region. For example, displaying the date as “dd/mm/yyyy” by default is bad; instead it should be displayed as per the user’s locale.
Localisation involves not only translation, but additional customisation including numbers, dates, times, currency, sorting, icons, colours, etc.
What counts as a “First” or “Last” name depends entirely upon region and culture. Instead, “Given name” and “Surname” should be used.
Do not validate names and e-mails on the client-side as JavaScript does a bad job when it comes to I18n.
Most of the Indic fonts are cursive and need a minimum font size, different from the minimum size used for English, to be clearly legible.
PHP doesn’t understand Unicode by default!
Each 16-bit code unit is treated as one character. This is not true for all languages, Japanese for example, but fortunately every character in the Indian languages is contained within a single 16-bit code unit.
Without the “virama”, the program would print “namasakara”, which is incorrect. The “virama” is needed for the rendering engine to display either the half consonant, or add the consonant at the appropriate position. Note that Unicode doesn’t care about how glyphs are rendered; it is the job of the software to do this.
The question marks (as explained before) denote incorrect character conversion.
The default code page for the Windows™ command prompt is the original IBM PC code page (437). The “chcp” program can be used to display or switch the code page. Windows™ also defines several other code pages, of which the popular ones are 1252 (Western European) and 65001 (UTF-8).
The Emacs shell (eshell) is a wonderful terminal emulation program that runs within the Emacs editing environment. It is very useful because it supports Unicode.
In the second case, we force Java™ to assume that the encoding is UTF-8, so it outputs the correct bytes and the Devanagari “Namaskar” renders correctly. Even though it works, this is a non-portable and bad way of doing things.
On GNU/Linux, Java™ typically uses UTF-8 as the default charset (if the locale is set to a UTF-8 one).
If the default charset is overridden with an incorrect one, the results vary from incorrect conversion (???) to Mojibake (garbled characters), depending upon the output charset.
The “tofu” characters mean that the font isn’t available. Yes, Windows™ doesn’t have a console font to display Devanagari.
Collator provides locale-dependent collation and sorting.
Formatter modules provide locale-dependent formatting of numbers, dates, currencies, messages, etc.
Normalizer provides methods for normalising and checking text in normalised form.
Locale provides access to locale-dependent resources.
Grapheme provides a linguistically correct way of parsing strings, breaking a string into tokens, etc.
IDN provides Internationalized Domain Name support.
ResourceBundle provides methods for customising messages depending on the locale.
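To make “locale-dependent collation” concrete, the sketch below contrasts a plain code-point sort with a linguistic sort. It uses Java's standard `java.text.Collator` as an analogue of the Intl Collator module, not the PHP API itself; the class name `LinguisticSort` and the word list are assumptions for illustration.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

final class LinguisticSort {
    // Sort a copy of the words using the collation rules of the given locale.
    static List<String> sorted(List<String> words, Locale locale) {
        Collator collator = Collator.getInstance(locale);
        List<String> copy = new ArrayList<>(words);
        copy.sort(collator);
        return copy;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("banana", "Zebra", "apple");

        List<String> binary = new ArrayList<>(words);
        Collections.sort(binary); // compares raw UTF-16 code units, so "Z" sorts before "a"
        System.out.println("binary:     " + binary);
        System.out.println("linguistic: " + sorted(words, Locale.ENGLISH));
    }
}
```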
The “Rs.” isn’t hard-coded in our program, which makes things easy when Unicode starts supporting the new Rupee symbol. There is no need to change code; the program will start displaying the new symbol as soon as it is supported. The actual work involves installing the new Intl library that is compiled against the newer version of the ICU libraries and installing the new fonts that have the glyph for the corresponding code point.