Introduction to W3C I18N Best Practices

A tutorial on Internationalisation, the typical issues found across the web, and how to go about solving them.

Published in: Technology

  • Typically, “enabling” involves designing and developing a product that has no country- or region-specific business logic. All such logic should be externalised so that it can be customised per country or region. For example, displaying the date as “dd/mm/yyyy” by default is bad; it should be displayed according to the user’s locale.
  • Localisation involves not only translation, but additional customisation including numbers, dates, times, currency, sorting, icons, colours, etc.
  • Whether a name counts as “first” or “last” depends entirely on region and culture. Use “Given name” and “Surname” instead.
  • Do not validate names and e-mail addresses on the client side, as JavaScript does a bad job when it comes to I18n.
  • Most of the Indic fonts are cursive and need a minimum font size, different from the minimum used for English, to be clearly legible.
  • PHP doesn’t understand Unicode by default!
  • In this example every character occupies a single 16-bit code unit. This is not true for every script (some Japanese characters, for example, need more than one code unit), but fortunately the characters of all Indian languages fit within one 16-bit code unit. Without the “virama”, the program would print “namasakara”, which is incorrect. The “virama” is needed for the rendering engine to display the half consonant, or to add the consonant at the appropriate position. Note that Unicode doesn’t care about how glyphs are rendered; that is the job of the software.
  • The question marks (as explained before) denote incorrect character conversion. The default code page for the Windows™ command prompt is the original IBM PC code page (437). The “chcp” program can be used to display or switch the code page. Windows™ also defines several other code pages, of which the popular ones are 1252 (Western European) and 65001 (UTF-8).
  • The Emacs shell (eshell) is a wonderful terminal emulation program that runs within the Emacs editing environment. It is very useful because it supports Unicode. In the second case, we force Java™ to assume that the encoding is UTF-8, so it outputs the correct bytes, resulting in correct rendering of the Devanagari “Namaskar”. Even though it works, this is a non-portable and bad way of doing things.
  • On GNU/Linux, Java™ typically uses UTF-8 as the default charset (if the locale is set to UTF-8).
  • If the default charset is overridden with an incorrect one, the results vary from incorrect conversion (???) to Mojibake (garbled characters), depending on the output charset.
  • The “tofu” characters mean that the font isn’t available. Yes, Windows™ doesn’t have a console font to display Devanagari.
  • The Intl modules:
      • Collator provides locale-dependent collation and sorting
      • Formatter modules provide locale-dependent formatting of numbers, dates, currencies, messages, etc.
      • Normalizer provides methods for normalising text and checking whether text is in normalised form
      • Locale provides access to locale-dependent resources
      • Grapheme provides a linguistically correct way of parsing strings, breaking a string into tokens, etc.
      • IDN provides Internationalized Domain Name support
      • ResourceBundle provides methods for customising messages depending on the locale
  • The “Rs.” isn’t hard-coded in our program, which makes things easy once Unicode starts supporting the new Rupee symbol. There is no need to change code; the program will start displaying the new symbol as soon as it is supported. The actual work involves installing a new Intl library compiled against the newer version of the ICU libraries, and installing new fonts that have the glyph for the corresponding code point.
  • Introduction to W3C I18N Best Practices

    1. Introduction to W3C I18n Best Practices
        Presented by Gopal Venkatesan <g13n@ymail.com>
    2. Greetings
        • नमस्कार
        • নমস্কার
        • ನಮಸ್ಕಾರ
        • ନମସ୍କର୍
        • வணக்கம்
        • ਸਤਿ ਸ੍ਰੀ ਅਕਾਲ
        • నమస్కారం
        • നമസ്കാരം
        • السلام علیکم
        • નમસ્કાર
    3. Training Outline
        • Internationalisation Vocabulary
        • Typical Problems
            • Outline the common problems found across the web
        • Java and Internationalisation
            • The level of Internationalisation support available in Java
        • Resource Bundles
            • Formatting messages the correct way
        • PHP and Internationalisation
            • The level of Internationalisation support available in PHP
    4. Vocabulary
    5. Unicode
        • International standard for representing written language in computers
        • Latest version 5.2 adds 6648 new characters including support for Vedic Sanskrit
        • Maintained in sync with ISO 10646
        • Three main encodings: UTF-8, UTF-16 and UTF-32
        • Address space of 21 bits
    6. Unicode (contd.)
        • UTF-8 is a multi-byte encoding that uses 8-bit code units
        • An encoded character can take one, two, three or four bytes
        • UTF-8 is backward compatible with US-ASCII
        • Default encoding for PHP6?
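
The variable width of UTF-8 is easy to observe from Java. The following is a minimal sketch; the class name `Utf8Lengths` and the sample characters are chosen here for illustration and are not taken from the slides.

```java
import java.nio.charset.StandardCharsets;

class Utf8Lengths {
    // Number of bytes the given text occupies when encoded as UTF-8.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // Escapes are used so the encoding of this source file does not matter.
        String[] samples = {
            "A",            // U+0041, inside the US-ASCII range
            "\u00E9",       // U+00E9, Latin small letter e with acute
            "\u0928",       // U+0928, Devanagari letter NA
            "\uD83D\uDE00"  // U+1F600, outside the Basic Multilingual Plane
        };
        for (String s : samples) {
            System.out.printf("U+%04X -> %d byte(s)%n", s.codePointAt(0), utf8Length(s));
        }
    }
}
```

The four samples take one, two, three and four bytes respectively, and the ASCII letter is encoded as the same single byte it has in US-ASCII.
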
    7. Unicode (contd.)
        • UTF-16 uses 16-bit code units
        • A single code unit cannot address the complete set, so surrogate pairs are used
        • Default encoding for strings in Java and JavaScript
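
The difference between code units and code points can be checked directly on a Java string. This is a small sketch under the assumption that the Devanagari greeting from slide 2 and one supplementary character (U+1F600) make a useful pair of samples; the class and method names are illustrative.

```java
class CodeUnits {
    // Number of UTF-16 code units, which is what String.length() reports.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points, counting a surrogate pair once.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // The Devanagari greeting from slide 2, written with escapes.
        String namaskar = "\u0928\u092E\u0938\u094D\u0915\u093E\u0930";
        // U+1F600 needs a surrogate pair in UTF-16.
        String supplementary = "\uD83D\uDE00";

        System.out.println(codeUnits(namaskar) + " units, " + codePoints(namaskar) + " code points");
        System.out.println(codeUnits(supplementary) + " units, " + codePoints(supplementary) + " code point");
        System.out.println("surrogate pair: "
                + Character.isSurrogatePair(supplementary.charAt(0), supplementary.charAt(1)));
    }
}
```

For the Devanagari text both counts are 7 (the fourth code unit is the virama, U+094D), while the supplementary character occupies two code units but is a single code point.
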
    8. Unicode (contd.)
        • UTF-32 uses 32-bit code units
        • Every Unicode character is addressed within a single code unit
    9. Internationalisation
        • Design and development of a product, application or document content that enables easy localization for target audiences that vary in culture, region, or language
        • Abbreviated as I18n as there are eighteen characters between “I” and “n”
    10. Localisation
        • Adaptation of a product, application or document content to meet the language, cultural and other requirements of a specific target market (a “locale”)
        • Translation is one aspect of localisation
        • Abbreviated as L10n as there are ten characters between “L” and “n”
    11. Typical Problems
    12. Typical Problem
    13. Typical Problem (Contd.)
    14. Typical Problem (Contd.)
    15. Typical Problem (Contd.)
    16. Typical Problem (Contd.)
    17. The Solution
        • Determine the user environment
        • Format dates, times, currencies as per the locale
        • Understand the Internationalisation support available with your implementation language
        • Use the ICU/Internationalisation libraries rather than rolling out your own functions
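
As a rough illustration of formatting as per the locale with library functions instead of hand-written ones, the sketch below uses the standard `java.text` formatters. The class name, the sample amount and the choice of locales are assumptions made for this example; the exact output strings depend on the locale data shipped with the Java runtime, so none are shown.

```java
import java.text.DateFormat;
import java.text.NumberFormat;
import java.util.Date;
import java.util.Locale;

class LocaleFormatting {
    // Formats an amount using the currency conventions of the given locale.
    static String currency(double amount, Locale locale) {
        return NumberFormat.getCurrencyInstance(locale).format(amount);
    }

    // Formats a date in the "long" style of the given locale.
    static String longDate(Date date, Locale locale) {
        return DateFormat.getDateInstance(DateFormat.LONG, locale).format(date);
    }

    public static void main(String[] args) {
        Locale[] locales = {
            Locale.US,
            Locale.GERMANY,
            Locale.forLanguageTag("hi-IN") // Hindi as used in India
        };
        Date now = new Date();
        for (Locale locale : locales) {
            System.out.println(locale.toLanguageTag()
                    + ": " + currency(1234.5, locale)
                    + " | " + longDate(now, locale));
        }
    }
}
```

The same two calls produce different grouping separators, decimal marks, currency symbols and date layouts for each locale, with no locale-specific branches in the calling code.
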
    18. Common Encoding Problems
    19. Tofu characters – Black hollow boxes
        • Shown as a black hollow box, typically one per character
        • Indicates font problem i.e., the system doesn’t have the right fonts to display the glyph(s)
        • Tofu isn’t always a software problem – not a bug but really annoying
    20. Tofu characters – Black hollow boxes
    21. Question Marks – Incorrect conversion
        • “???” is usually displayed when converting text from one encoding to another
        • Means there is no equivalent character in the target encoding for the corresponding source character
        • Not always a bug, though it sometimes occurs when an incorrect encoding is specified
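
The question marks are easy to reproduce in Java by encoding text into a charset that has no equivalent characters. This is a minimal sketch; the class and method names are illustrative.

```java
import java.nio.charset.StandardCharsets;

class LossyConversion {
    // Encodes the text as US-ASCII and decodes it again. Characters with
    // no US-ASCII equivalent are replaced by '?' during encoding.
    static String roundTripAscii(String text) {
        byte[] ascii = text.getBytes(StandardCharsets.US_ASCII);
        return new String(ascii, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // The Devanagari greeting from slide 2, written with escapes.
        String namaskar = "\u0928\u092E\u0938\u094D\u0915\u093E\u0930";
        System.out.println(roundTripAscii("Hello"));
        System.out.println(roundTripAscii(namaskar));
    }
}
```

The ASCII text survives unchanged, while each of the seven Devanagari code units comes back as a question mark. The original characters cannot be recovered from that output.
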
    22. Question Marks – Incorrect conversion
    23. Mojibake – 文字化け
        • Pronounced “Moh-jee-baa-kay”; a Japanese word meaning “garbled characters”
        • Occurs when text in one encoding is “interpreted” as some other encoding
        • Most often caused by mixing up Latin-1 and UTF-8
        • UTF-8 is compatible only with US-ASCII
        • Bytes outside the ASCII range mean different things in UTF-8 and in other encodings, which causes Mojibake
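
One direction of this mix-up, UTF-8 bytes decoded as Latin-1, can be sketched in a few lines of Java. The class and method names are made up for this example, and the repair step only works because ISO-8859-1 maps every byte value to a character.

```java
import java.nio.charset.StandardCharsets;

class Mojibake {
    // Encodes text as UTF-8, then wrongly decodes the bytes as Latin-1.
    static String utf8ReadAsLatin1(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Reverses the mistake above: recover the bytes, decode them as UTF-8.
    static String repair(String garbled) {
        byte[] original = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(original, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String eAcute = "\u00E9"; // Latin small letter e with acute
        String garbled = utf8ReadAsLatin1(eAcute);
        System.out.println(garbled.length() + " characters after the wrong decode");
        System.out.println("repaired correctly: " + repair(garbled).equals(eAcute));
        System.out.println("ASCII unaffected: " + utf8ReadAsLatin1("abc").equals("abc"));
    }
}
```

The single character U+00E9 turns into the two characters U+00C3 and U+00A9, the familiar garbled pair, while plain ASCII text passes through untouched because the two encodings agree in that range.
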
    24. Mojibake – 文字化け
    25. Java™ And Unicode
    26. Unicode support in Java™
        • Java™ has always supported Unicode
        • Java™ strings are UTF-16
        • A “char” in Java™ is a UTF-16 code unit, not a code point
        • By default the input and output streams use the OS native charset
            • On Windows™ this is Windows-1252
            • On most Unices and Unix-like OS this is UTF-8
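
The slide images with the original “Hello, world” code are not part of this transcript, so the following is only a sketch of one common approach: inspect the default charset and wrap the output stream in a `PrintStream` with an explicit encoding. The class and helper names are assumptions made here, and whether the text renders correctly still depends on the terminal and its fonts.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;

class ExplicitEncoding {
    // Returns the bytes a UTF-8 PrintStream writes for the given text.
    static byte[] printAsUtf8(String text) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            PrintStream out = new PrintStream(buffer, true, "UTF-8");
            out.print(text);
            out.flush();
            return buffer.toByteArray();
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be present on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // The charset used when none is specified; it varies by OS and locale.
        System.out.println("Default charset: " + Charset.defaultCharset());

        // Write the greeting as UTF-8 regardless of the default charset.
        String namaskar = "\u0928\u092E\u0938\u094D\u0915\u093E\u0930";
        PrintStream utf8Out = new PrintStream(System.out, true, "UTF-8");
        utf8Out.println(namaskar);
    }
}
```

Naming the encoding at the point where the stream is created keeps the behaviour the same across platforms, instead of relying on whatever default the runtime picked up.
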
    27. A “Hello, world” example
    28. A “Hello, world” example (contd.)
    29. A “Hello, world” example (contd.)
    30. “Hello, world” on GNU/Linux
    31. Garbage In, Garbage Out!
    32. “Hello, world” Corrected!
    33. Oops!
    34. “Hello, world” Corrected!
    35. Externalising Strings
        • Resource Bundles
    36. The Need
        • Allows a single code base to display strings in multiple languages
        • No need to refactor code to support new languages
    37. Beginning
    38. Beginning (Sum.properties)
            SUM_OF = Sum of
            AND = and
            IS = is
    39. That was broken!
        • It’s generally a bad idea to concatenate strings
        • Does not work for all languages since the grammar is different!
        • Always use string substitution with positional parameters
    40. Correct Way
    41. Correct Way (contd.)
        • SumI18n.properties
            SUM = Sum of {0} and {1} is {2}
        • SumI18n_hi.properties
            SUM = {0} अतिरिक्त {1} {2} के बराबर है
        • SumI18n_ta.properties
            SUM = {0} மற்றும் {1} கூட்டினால் {2}
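
The code on the “Correct Way” slide is not included in this transcript. The sketch below shows the general technique with `java.text.MessageFormat`, under two assumptions: a plain map stands in for the property files, and the second pattern (key `reordered`) is a hypothetical one invented here to show arguments appearing in a different order.

```java
import java.text.MessageFormat;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class SumMessages {
    // Stand-in for the SumI18n property files; a real application would
    // load these patterns through ResourceBundle instead.
    private static final Map<String, String> PATTERNS = new HashMap<>();
    static {
        PATTERNS.put("en", "Sum of {0} and {1} is {2}");
        // Hypothetical pattern, used only to show reordered parameters.
        PATTERNS.put("reordered", "{2} is the sum of {0} and {1}");
    }

    // Substitutes the operands and their sum into the pattern for the key.
    static String describeSum(String key, int a, int b) {
        MessageFormat format = new MessageFormat(PATTERNS.get(key), Locale.ENGLISH);
        return format.format(new Object[] {a, b, a + b});
    }

    public static void main(String[] args) {
        System.out.println(describeSum("en", 2, 3));
        System.out.println(describeSum("reordered", 2, 3));
    }
}
```

Because the translator controls where `{0}`, `{1}` and `{2}` appear, the calling code stays identical even when the word order of the target language differs.
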
    42. Oops!
        • Java 1.5 property files are read as ISO-8859-1 (Latin-1)
        • Use the “native2ascii” tool to convert Unicode files to escape sequences (\uXXXX)
            native2ascii -encoding UTF-8 SumI18n_hi.properties
            native2ascii -encoding UTF-8 SumI18n_ta.properties
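
To make the escaping step concrete, here is a small sketch of the kind of conversion involved, together with a check that `java.util.Properties` decodes the escapes again. It is an illustration written for this transcript, not the implementation of the native2ascii tool; the class and method names are assumptions.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

class AsciiEscapes {
    // Replaces every non-ASCII UTF-16 code unit with a backslash-u escape
    // followed by four hex digits, leaving ASCII characters untouched.
    static String escapeNonAscii(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 128) {
                out.append(c);
            } else {
                out.append(String.format("\\u%04x", (int) c));
            }
        }
        return out.toString();
    }

    // Parses properties-file text and returns the value stored under key.
    static String loadValue(String propertiesText, String key) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return props.getProperty(key);
    }

    public static void main(String[] args) {
        String original = "\u0928\u092E"; // two Devanagari letters
        String line = "SUM = " + escapeNonAscii(original);
        System.out.println(line);
        System.out.println("round trip ok: " + loadValue(line, "SUM").equals(original));
    }
}
```

The escaped line contains only ASCII characters, so it survives being read as ISO-8859-1, and the properties parser turns the escapes back into the original characters.
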
    43. It’s working!
    44. Internationalisation in PHP
    45. Challenges
        • PHP 5 (and earlier) does not understand characters and encodings
        • The multi-byte extension (mbstring) in PHP works only for a few encodings (primarily CJK)
        • PHP has very limited functions for formatting date, time, currencies, etc.
        • PHP doesn’t provide linguistic sorting!
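
To show what “linguistic sorting” means in practice, the sketch below contrasts plain code-unit ordering with locale-aware collation. It is written in Java with `java.text.Collator`, purely to stay consistent with the earlier examples; it is not PHP code and says nothing about the PHP API. The word list and names are invented for this illustration.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

class LinguisticSort {
    // Sorts by raw UTF-16 code unit values (String's natural order).
    static String[] sortByCodeUnit(String[] words) {
        String[] copy = words.clone();
        Arrays.sort(copy);
        return copy;
    }

    // Sorts using the collation rules of the given locale.
    static String[] sortForLocale(String[] words, Locale locale) {
        String[] copy = words.clone();
        Arrays.sort(copy, Collator.getInstance(locale));
        return copy;
    }

    public static void main(String[] args) {
        // "\u00C4rger" starts with A-umlaut (U+00C4).
        String[] words = {"Zug", "\u00C4rger", "Apfel"};
        System.out.println(String.join(", ", sortByCodeUnit(words)));
        System.out.println(String.join(", ", sortForLocale(words, Locale.GERMAN)));
    }
}
```

Code-unit ordering pushes the word starting with U+00C4 after “Zug”, because its numeric value is higher than that of “Z”. The German collator places it between “Apfel” and “Zug”, where a reader would expect it.
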
    46. The Good News – Intl extension
        • Open source – http://pecl.php.net/intl
        • Designed for PHP 5.x, part of PHP 5.3
        • Configure using “--enable-intl”
        • Leverages ICU and CLDR
        • Available as OO and procedural APIs
            • Collator::sort() vs. collator_sort()
        • Yahoo! is a key contributor
    47. The PHP Intl Library
        • Intl
            • Collator
            • IDN
            • NumberFormatter
            • Grapheme
            • Locale
            • ResourceBundle
            • Normalizer
            • IntlDateFormatter
            • MessageFormatter
    48. Corrected substring implementation
    49. Formatting Numbers
    50. Resource Bundles
        • Externalize strings in your application
        • Similar to how desktop applications are built
            • One binary and additional language packs
        • Similar to Windows™ resource files and Unix® message files
            • Structure is different, see ICU resource bundles
        • Key/value pairs
            • Key is used by the application at run time to display the value
    51. Additional Things
        • Change the “default_charset” in php.ini to “utf-8”
        • While “mbstring” works well enough for Indic languages, use the more precise “grapheme_*” functions from the Intl library
        • “echo” is encoding agnostic
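
The idea behind grapheme-aware functions is that one user-perceived character can span several code points. The sketch below illustrates this in Java with `java.text.BreakIterator`, used here as a stand-in for the concept; it is not the PHP `grapheme_*` API, and the class and method names are assumptions.

```java
import java.text.BreakIterator;
import java.util.Locale;

class Graphemes {
    // Counts user-perceived characters by walking character boundaries.
    static int graphemeCount(String text) {
        BreakIterator boundaries = BreakIterator.getCharacterInstance(Locale.ROOT);
        boundaries.setText(text);
        int count = 0;
        while (boundaries.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Letter e followed by U+0301, the combining acute accent.
        String accented = "e\u0301";
        System.out.println("code units: " + accented.length());
        System.out.println("graphemes:  " + graphemeCount(accented));
    }
}
```

A substring routine that cuts by code unit could split the accent from its base letter here, whereas one that cuts at grapheme boundaries keeps the pair together.
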
    52. Why Intl is better than mbstring?
    53. Why Intl is better than mbstring? (contd.)
    54. Resources
        • http://www.w3.org/International/
        • http://unicode.org/
        • http://java.sun.com/javase/technologies/core/basic/intl/faq.jsp
        • http://pecl.php.net/intl
        • http://php.net/manual/en/refs.international.php
