W3C I18n Best Practices and Localisation in Java & PHP

Introduction to W3C I18n Best Practices Presented by Gopal Venkatesan <g13n@ymail.com>

नमस्कार নমস্কার ನಮಸ್ಕಾರ ନମସ୍କର୍ வணக்கம் ਸਤਿ ਸ੍ਰੀ ਅਕਾਲ నమస్కారం നമസ്കാരം السلام علیکم નમસ્કાર

Training Outline
- Internationalisation vocabulary
- Typical problems: the common problems found across the web
- Java and internationalisation
  - The level of internationalisation support available in Java
  - Resource bundles
  - Formatting messages the correct way
- PHP and internationalisation
  - The level of internationalisation support available in PHP

Unicode
- International standard for representing written language in computers
- Latest version 5.2 adds 6,648 new characters, including support for Vedic Sanskrit
- Maintained in sync with ISO 10646
- Three main encodings: UTF-8, UTF-16 and UTF-32
- Address space of 21 bits

Unicode (contd.)
- UTF-8 is a multi-byte encoding built from 8-bit code units
- An encoded character takes one, two, three or four bytes
- UTF-8 is backward compatible with US-ASCII
- Default encoding for PHP6?

Unicode (contd.)
- UTF-16 uses 16-bit code units
- A single code unit cannot address the complete set, so characters beyond U+FFFF are encoded as surrogate pairs
- Default encoding for strings in Java and JavaScript

Unicode (contd.)
- UTF-32 uses 32-bit code units
- Every Unicode character is addressed within a single code unit
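The size differences between the encodings are easy to check from Java. The following is a minimal sketch, not part of the original slides; the class name EncodingSizes and its helper methods are made up for illustration. It uses UTF-16BE so that no byte order mark is counted.

```java
import java.nio.charset.StandardCharsets;

public class EncodingSizes {
    // Number of bytes the text occupies when encoded as UTF-8.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes as UTF-16 (big endian, so no byte order mark is added).
    static int utf16Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    // Number of Unicode code points, i.e. the number of UTF-32 code units.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String latinA = "A";                                   // U+0041
        String devanagariNa = "\u0928";                        // U+0928
        String clef = new String(Character.toChars(0x1D11E));  // U+1D11E, outside the BMP

        for (String s : new String[] {latinA, devanagariNa, clef}) {
            System.out.printf("U+%04X: utf8=%d bytes, utf16=%d bytes, chars=%d, codePoints=%d%n",
                    s.codePointAt(0), utf8Bytes(s), utf16Bytes(s), s.length(), codePoints(s));
        }
    }
}
```

The last string shows the surrogate pair: one code point, but two Java "char" values.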

Internationalisation
- Design and development of a product, application or document content that enables easy localization for target audiences that vary in culture, region, or language
- Abbreviated as I18n, as there are eighteen characters between “I” and “n”

Localisation
- Adaptation of a product, application or document content to meet the language, cultural and other requirements of a specific target market (a “locale”)
- Translation is one aspect of localisation
- Abbreviated as L10n, as there are ten characters between “L” and “n”

The Solution
- Determine the user environment
- Format dates, times and currencies as per the locale
- Understand the internationalisation support available in your implementation language
- Use ICU or other internationalisation libraries rather than rolling your own functions

Tofu characters – Black hollow boxes
- A missing glyph is shown as a black hollow box, typically one per character
- Indicates a font problem, i.e. the system doesn’t have the right fonts to display the glyph(s)
- Tofu isn’t always a software problem: not a bug, but really annoying


Question Marks – Incorrect conversion
- “???” is usually displayed after text has been converted from one encoding to another
- It means the target encoding has no equivalent for the corresponding source character
- Not always a bug, though it also occurs when an incorrect encoding is specified
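The question-mark effect can be reproduced in a few lines of Java. This is a rough sketch rather than anything from the original deck; QuestionMarks and roundTrip are hypothetical names. String.getBytes(Charset) replaces every character the target charset cannot represent with that charset's replacement byte, which is "?" for ISO-8859-1.

```java
import java.nio.charset.StandardCharsets;

public class QuestionMarks {
    // Encode to Latin-1 and decode again; unmappable characters are lost on the way.
    static String roundTrip(String text) {
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(latin1, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String french = "caf\u00e9";                        // fits into Latin-1
        String tamil = "\u0BA8\u0BA9\u0BCD\u0BB1\u0BBF";     // Tamil text, five code units

        System.out.println(roundTrip(french).equals(french)); // survives the round trip
        System.out.println(roundTrip(tamil));                 // one '?' per lost code unit
    }
}
```

Once the text has been converted this way the original characters cannot be recovered, which is why the target encoding should be chosen before any data is stored.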


Mojibake – 文字化け
- Pronounced “Moh-jee-baa-kay”; a Japanese word meaning “garbled characters”
- Occurs when text in one encoding is interpreted as some other encoding
- Most of the time caused by interpreting Latin-1 text as UTF-8, or the reverse
- UTF-8 is byte-compatible only with US-ASCII
- Characters outside the ASCII range are encoded differently in UTF-8 and turn into mojibake when decoded with the wrong charset
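Both directions of the Latin-1/UTF-8 confusion can be demonstrated with a small Java sketch. It is an illustration only; MojibakeDemo and misread are invented names, not part of the talk.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Encode the text with one charset, then (wrongly) decode the bytes with another.
    static String misread(String text, Charset writtenAs, Charset readAs) {
        return new String(text.getBytes(writtenAs), readAs);
    }

    public static void main(String[] args) {
        String eAcute = "\u00e9"; // LATIN SMALL LETTER E WITH ACUTE

        // UTF-8 bytes C3 A9 read as Latin-1 become two unrelated characters.
        String garbled = misread(eAcute, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        System.out.println(garbled.length()); // 2

        // Latin-1 byte E9 is not valid UTF-8 on its own, so the decoder
        // substitutes the replacement character U+FFFD.
        String replaced = misread(eAcute, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        System.out.println(replaced.equals("\uFFFD")); // true
    }
}
```

Pure ASCII text passes through either misreading unchanged, which is why this class of bug often goes unnoticed until the first accented or non-Latin character appears.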

Unicode support in Java™
- Java™ has always supported Unicode
- Java™ strings are UTF-16
- A “char” in Java™ is a UTF-16 code unit, not a code point
- By default the input and output streams use the OS native charset
  - On Windows™ this is typically Windows-1252
  - On most Unices and Unix-like OSes this is UTF-8
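Because the default charset depends on the platform, output that must be portable is usually written through a stream with an explicit charset. The sketch below is not the "Hello, world" program from the slides (its code did not survive in this transcript); HelloUtf8 and render are hypothetical names used only to show the idea.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;

public class HelloUtf8 {
    // Write the text through a PrintStream that uses the named charset
    // and return the bytes that were produced.
    static byte[] render(String text, String charsetName) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            PrintStream out = new PrintStream(buffer, true, charsetName);
            out.print(text);
            out.flush();
            return buffer.toByteArray();
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // What the JVM would use if no charset were given.
        System.out.println("Default charset: " + Charset.defaultCharset());

        // Wrap standard output so that it always emits UTF-8.
        PrintStream utf8Out = new PrintStream(System.out, true, "UTF-8");
        utf8Out.println("\u0928\u092E\u0938\u094D\u0915\u093E\u0930, world");
    }
}
```

Whether the greeting is displayed correctly still depends on the terminal decoding UTF-8 and having a suitable font.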

A “Hello, world” example (contd.)

“Hello, world” on GNU/Linux

Externalising Strings: Resource Bundles

The Need
- Allows a single code base to display strings in multiple languages
- No need to refactor code to support new languages

Beginning (Sum.properties)
    SUM_OF = Sum of
    AND = and
    IS = is

That was broken!
- It’s generally a bad idea to build messages by concatenating strings
- It does not work for all languages, since word order and grammar differ
- Always use string substitution with positional parameters

Correct Way (contd.)
SumI18n.properties
    SUM = Sum of {0} and {1} is {2}
SumI18n_hi.properties
    SUM = {0} अतिरिक्त {1} {2} के बराबर है
SumI18n_ta.properties
    SUM = {0} மற்றும் {1} கூட்டினால் {2}
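Patterns like these are meant to be filled in with java.text.MessageFormat. The following is a minimal sketch under some assumptions: the class SumMessage is invented, the patterns are passed in directly instead of being loaded with ResourceBundle.getBundle, and the second pattern is a made-up one that only demonstrates argument reordering.

```java
import java.text.MessageFormat;
import java.util.Locale;

public class SumMessage {
    // Substitute the positional arguments {0}, {1}, ... into the pattern,
    // formatting numbers according to the given locale.
    static String format(String pattern, Locale locale, Object... args) {
        return new MessageFormat(pattern, locale).format(args);
    }

    public static void main(String[] args) {
        int a = 2, b = 3;

        // Pattern as found under the SUM key of SumI18n.properties.
        String english = "Sum of {0} and {1} is {2}";
        System.out.println(format(english, Locale.ENGLISH, a, b, a + b));

        // Hypothetical pattern for a language that puts the result first;
        // the calling code stays exactly the same.
        String reordered = "{2} = {0} + {1}";
        System.out.println(format(reordered, Locale.ENGLISH, a, b, a + b));
    }
}
```

The translator controls where each argument lands, which is precisely what string concatenation cannot offer.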

Oops!
- In Java 1.5, property files are read as ISO-8859-1 (Latin-1)
- Use the “native2ascii” tool to convert Unicode files to \uXXXX escape sequences
    native2ascii -encoding UTF-8 SumI18n_hi.properties
    native2ascii -encoding UTF-8 SumI18n_ta.properties
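Where Java 6 or later is available, an alternative to escaping the file is to read it through a Reader with an explicit charset, using Properties.load(Reader). This is a sketch of that alternative, not something the slides describe; the class PropertiesUtf8 and its load helper are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

public class PropertiesUtf8 {
    // Parse property-file content that is stored as UTF-8 bytes.
    static Properties load(byte[] utf8Content) {
        Properties props = new Properties();
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(utf8Content), StandardCharsets.UTF_8)) {
            props.load(reader); // the Reader decides the charset, not Properties
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        // Stand-in for the bytes of a UTF-8 encoded .properties file.
        byte[] file = "SUM = {0} \u0B95 {1}".getBytes(StandardCharsets.UTF_8);
        String pattern = load(file).getProperty("SUM");
        System.out.println(pattern.length()); // 9 characters, Tamil letter intact
    }
}
```

In a real application the byte array would be replaced by a FileInputStream or a class-path resource stream.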

Challenges
- PHP 5 (and earlier) treats strings as byte sequences and has no notion of characters or encodings
- The multi-byte extension (mbstring) handles a fixed set of encodings and was designed primarily for CJK text
- PHP has very limited functions for formatting dates, times, currencies, etc.
- PHP doesn’t provide linguistic sorting!

The Good News – Intl extension
- Open source: http://pecl.php.net/intl
- Designed for PHP 5.x, bundled with PHP since 5.3
- Configure using “--enable-intl”
- Leverages ICU and CLDR
- Available as OO and procedural APIs: Collator::sort() vs. collator_sort()
- Yahoo! is a key contributor
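To show what linguistic sorting buys, here is a sketch in Java, used only to stay consistent with the other examples in this transcript. It relies on java.text.Collator, which plays a role comparable to the Intl Collator, and is not the PHP API. LinguisticSort and its methods are hypothetical names.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

public class LinguisticSort {
    // Plain String ordering: compares UTF-16 code units, so "Z" sorts before "a".
    static List<String> byCodeUnit(List<String> words) {
        List<String> copy = new ArrayList<>(words);
        Collections.sort(copy);
        return copy;
    }

    // Locale-aware ordering using the collation rules for the given locale.
    static List<String> byLocale(List<String> words, Locale locale) {
        List<String> copy = new ArrayList<>(words);
        copy.sort(Collator.getInstance(locale));
        return copy;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("Zebra", "apple", "Mango");
        System.out.println(byCodeUnit(words));               // [Mango, Zebra, apple]
        System.out.println(byLocale(words, Locale.ENGLISH)); // [apple, Mango, Zebra]
    }
}
```

The code-unit order is what a byte-oriented sort produces; the collator order is what a reader of the language expects.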

The PHP Intl Library
- Collator
- IDN
- NumberFormatter
- Grapheme
- Locale
- ResourceBundle
- Normalizer
- IntlDateFormatter
- MessageFormatter

Corrected substring implementation

Resource Bundles
- Externalize strings in your application
- Similar to how desktop applications are built: one binary and additional language packs
- Similar to Windows™ resource files and Unix® message files
- Structure is different, see ICU resource bundles
- Key/value pairs: the key is used by the application at run time to display the value

Additional Things
- Set “default_charset” in php.ini to “utf-8”
- While “mbstring” works well enough for Indic languages, use the more precise “grapheme_*” functions from the Intl library
- “echo” is encoding agnostic
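The difference between counting code units and counting user-perceived characters (graphemes) can be sketched in Java with java.text.BreakIterator. This is an analogue of what the grapheme_* functions do, written in Java to match the other examples, and not the Intl API itself; Graphemes, count and prefix are hypothetical names.

```java
import java.text.BreakIterator;
import java.util.Locale;

public class Graphemes {
    // Count user-perceived characters by walking character boundaries.
    static int count(String text) {
        BreakIterator it = BreakIterator.getCharacterInstance(Locale.ROOT);
        it.setText(text);
        int n = 0;
        while (it.next() != BreakIterator.DONE) {
            n++;
        }
        return n;
    }

    // Return the first `graphemes` user-perceived characters without
    // cutting a base letter away from its combining marks.
    static String prefix(String text, int graphemes) {
        BreakIterator it = BreakIterator.getCharacterInstance(Locale.ROOT);
        it.setText(text);
        int end = 0;
        for (int i = 0; i < graphemes; i++) {
            int next = it.next();
            if (next == BreakIterator.DONE) {
                break;
            }
            end = next;
        }
        return text.substring(0, end);
    }

    public static void main(String[] args) {
        String text = "e\u0301a"; // 'e' + COMBINING ACUTE ACCENT, then 'a'
        System.out.println(text.length());             // 3 code units
        System.out.println(count(text));               // 2 graphemes
        System.out.println(prefix(text, 1).length());  // 2: the accent stays with its letter
    }
}
```

A naive substring of one code unit would return a bare "e" and leave the accent to attach to whatever follows, which is the kind of bug grapheme-aware functions avoid.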

Why Intl is better than mbstring?

Why Intl is better than mbstring? (contd.)

Resources
- http://www.w3.org/International/
- http://unicode.org/
- http://java.sun.com/javase/technologies/core/basic/intl/faq.jsp
- http://pecl.php.net/intl
- http://php.net/manual/en/refs.international.php
