Introduction to W3C I18N Best Practices
 


A tutorial on Internationalisation, typical issues found across the web and how to go about solving them.


Usage Rights

CC Attribution-ShareAlike License

  • Typically, “enabling” involves designing and developing a product that has no hard-coded country- or region-specific business logic. All such logic should be externalised so that it can be customised per country or region. For example, always displaying the date as “dd/mm/yyyy” is bad; it should instead be displayed as per the user’s locale.
  • Localisation involves not only translation, but additional customisation including numbers, dates, times, currency, sorting, icons, colours, etc.
  • What counts as a “first” or “last” name depends entirely upon region and culture. “Given name” and “Surname” should be used instead.
  • Do not validate names and e-mail addresses on the client side, as JavaScript does a bad job when it comes to I18n.
  • Most of the Indic fonts are cursive and need a minimum font size, different from the minimum size used for English, to be clearly legible.
  • PHP doesn’t understand Unicode by default!
  • In UTF-16, every character in the Basic Multilingual Plane occupies a single 16-bit code unit. This is not true for every script (some rarer ideographs used in Japanese need two code units), but fortunately the characters of the major modern Indian scripts each fit within a single 16-bit code unit. Without the “virama”, the program would print “namasakara”, which is incorrect. The “virama” is needed for the rendering engine to display either the half consonant or to add the consonant at the appropriate position. Note that Unicode doesn’t specify how glyphs are rendered; that is the job of the software.
  • The question marks (as explained before) denote incorrect character conversion. The default code page for the Windows™ command prompt is the original IBM PC code page (437). The “chcp” program can be used to display or switch the code page. Windows™ also defines several other code pages, of which the popular ones are 1252 (Western European) and 65001 (UTF-8).
  • The Emacs shell (eshell) is a wonderful terminal emulation program that runs within the Emacs editing environment. It is very useful because it supports Unicode. In the second case, we force Java™ to assume that the encoding is UTF-8, so it outputs the correct bytes, resulting in correct rendering of the Devanagari “Namaskar”. Even though it works, this is a non-portable and bad way of doing things.
  • On GNU/Linux, Java™ typically uses UTF-8 as the default charset (if the locale is set to UTF-8).
  • If the default charset is overridden with an incorrect one, the results vary from incorrect conversion (???) to Mojibake (garbled characters), depending upon the output charset.
  • The “tofu” characters mean that the font isn’t available. Yes, Windows™ doesn’t have a console font to display Devanagari.
  • Collator provides locale-dependent collation and sorting
    Formatter modules provide locale-dependent formatting of numbers, dates, currencies, messages, etc.
    Normalizer provides methods for normalising text and checking whether text is in normalised form
    Locale provides access to locale-dependent resources
    Grapheme provides a linguistically correct way of parsing strings, breaking a string into tokens, etc.
    IDN provides Internationalized Domain Name support
    ResourceBundle provides methods for customising messages depending on the locale
  • The “Rs.” isn’t hard-coded in our program, which makes things easy when Unicode starts supporting the new Rupee symbol. There is no need to change code; the program will start displaying the new symbol whenever it is supported. The actual work involves installing a new Intl library that is compiled against the newer version of the ICU libraries, and installing new fonts that have the glyph for the corresponding code point.

Presentation Transcript

  • Introduction to W3C I18n Best Practices
    Presented by Gopal Venkatesan
    <g13n@ymail.com>
  • नमस्कार
    নমস্কার
    ನಮಸ್ಕಾರ
    ନମସ୍କର୍
    வணக்கம்
    ਸਤਿ ਸ੍ਰੀ ਅਕਾਲ
    నమస్కారం
    നമസ്കാരം
    السلام علیکم
    નમસ્કાર
  • Training Outline
    Internationalisation Vocabulary
    Typical Problems
    Outline the common problems found across the web
    Java and Internationalisation
    The level of Internationalisation support available in Java
    Resource Bundles
    Formatting messages the correct way
    PHP and Internationalisation
    The level of Internationalisation support available in PHP
  • Vocabulary
  • Unicode
    International standard for representing written language in computers
    Latest version 5.2 adds 6648 new characters including support for Vedic Sanskrit
    Maintained in sync with ISO 10646
    Three main encodings: UTF-8, UTF-16 and UTF-32
    Address space of 21 bits
  • Unicode (contd.)
    UTF-8 is a multi-byte encoding that uses 8-bit code units
    An encoded character can take one, two, three or four bytes
    UTF-8 is backward compatible with US-ASCII
    Default encoding for PHP6?
  • Unicode (contd.)
    UTF-16 uses 16-bit code units
    Cannot address the complete set with a single code unit, so uses surrogate pairs for characters outside the Basic Multilingual Plane
    Default encoding for strings in Java and JavaScript
  • Unicode (contd.)
    UTF-32 uses 32-bit code units
    Every Unicode character is addressed within a single code unit
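
The size differences between the three encodings can be checked directly from Java. The following is a minimal sketch, not part of the original slides: the class and method names are made up for illustration, and it assumes the JDK ships the "UTF-32BE" charset (common JDKs do, but it is not one of the charsets guaranteed by `StandardCharsets`).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class EncodingSizes {
    // "Namaskar" in Devanagari, written with escapes so the source file stays ASCII.
    static final String NAMASKAR = "\u0928\u092E\u0938\u094D\u0915\u093E\u0930";
    // U+1F600 lies outside the Basic Multilingual Plane, so UTF-16 needs a surrogate pair.
    static final String GRINNING_FACE = new String(Character.toChars(0x1F600));

    // Number of bytes the text occupies in the given encoding.
    static int byteLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    // Number of Unicode code points, as opposed to UTF-16 code units (String.length()).
    static int codePoints(String text) {
        return text.codePointCount(0, text.length());
    }

    public static void main(String[] args) {
        Charset utf32 = Charset.forName("UTF-32BE"); // assumed to be available in this JDK
        for (String text : new String[] {"hello", NAMASKAR, GRINNING_FACE}) {
            System.out.printf("code points=%d, UTF-16 code units=%d, UTF-8 bytes=%d, UTF-16 bytes=%d, UTF-32 bytes=%d%n",
                    codePoints(text), text.length(),
                    byteLength(text, StandardCharsets.UTF_8),
                    byteLength(text, StandardCharsets.UTF_16BE),
                    byteLength(text, utf32));
        }
    }
}
```

Each Devanagari character here takes three bytes in UTF-8 but one code unit in UTF-16, while the emoji takes four bytes in UTF-8 and two code units (a surrogate pair) in UTF-16.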
  • Internationalisation
    Design and development of a product, application or document content that enables easy localization for target audiences that vary in culture, region, or language
    Abbreviated as I18n as there are eighteen characters between “I” and “n”
  • Localisation
    Adaptation of a product, application or document content to meet the language, cultural and other requirements of a specific target market (a “locale”)
    Translation is one aspect of localisation
    Abbreviated as L10n as there are ten characters between “L” and “n”
  • Typical Problems
  • Typical Problem
  • Typical Problem (Contd.)
  • Typical Problem (Contd.)
  • Typical Problem (Contd.)
  • Typical Problem (Contd.)
  • The Solution
    Determine the user environment
    Format dates, times, currencies as per the locale
    Understand the Internationalisation support available with your implementation language
    Use the ICU/Internationalisation libraries rather than rolling out your own functions
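
As a rough illustration of "format as per the locale", the sketch below uses only the standard Java formatting classes. The class name, method names and the sample values are assumptions made for this example; the exact output for dates and currencies depends on the locale data bundled with the JDK, so none is shown.

```java
import java.text.NumberFormat;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.FormatStyle;
import java.util.Locale;

// Hypothetical helper class, for illustration only.
class LocaleFormatting {
    // Plain number with locale-specific grouping and decimal separators.
    static String number(double value, Locale locale) {
        return NumberFormat.getNumberInstance(locale).format(value);
    }

    // Amount formatted with the locale's own currency symbol and conventions.
    static String currency(double amount, Locale locale) {
        return NumberFormat.getCurrencyInstance(locale).format(amount);
    }

    // Date in the locale's medium style instead of a hard-coded "dd/mm/yyyy".
    static String date(LocalDate date, Locale locale) {
        return DateTimeFormatter.ofLocalizedDate(FormatStyle.MEDIUM)
                .withLocale(locale)
                .format(date);
    }

    public static void main(String[] args) {
        LocalDate sampleDate = LocalDate.of(2010, 1, 26); // arbitrary sample date
        Locale[] locales = {Locale.US, Locale.GERMANY, Locale.forLanguageTag("hi-IN")};
        for (Locale locale : locales) {
            System.out.println(locale.toLanguageTag() + ": "
                    + number(1234567.89, locale) + " | "
                    + currency(1234.5, locale) + " | "
                    + date(sampleDate, locale));
        }
    }
}
```

The application code never mentions a separator, a currency symbol or a date pattern; all of that comes from the locale data.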
  • Common Encoding Problems
  • Tofu characters – Black hollow boxes
    Shown as a black hollow box, typically one per character
    Indicates font problem i.e., the system doesn’t have the right fonts to display the glyph(s)
    Tofu isn’t always a software problem – not a bug but really annoying
  • Tofu characters – Black hollow boxes
  • Question Marks – Incorrect conversion
    “???” usually displayed when converting text from one encoding to another
    Means there is no equivalent character in the target encoding for the corresponding source
    Not always a bug, though it sometimes occurs when an incorrect encoding is specified
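
The question-mark effect is easy to reproduce in Java, since `String.getBytes` substitutes the target charset's replacement byte for anything it cannot represent. This is a minimal sketch with made-up class and method names, not code from the slides.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class LossyConversion {
    // "Namaskar" in Devanagari, written with escapes so the source file stays ASCII.
    static final String NAMASKAR = "\u0928\u092E\u0938\u094D\u0915\u093E\u0930";

    // Encodes the text into the target charset and decodes it again.
    // Characters with no equivalent in the target come back as '?'.
    static String roundTrip(String text, Charset target) {
        byte[] encoded = text.getBytes(target);
        return new String(encoded, target);
    }

    // Checks up front whether the conversion would lose information.
    static boolean isLossless(String text, Charset target) {
        return target.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        System.out.println(isLossless(NAMASKAR, StandardCharsets.US_ASCII));
        System.out.println(roundTrip(NAMASKAR, StandardCharsets.US_ASCII));
        System.out.println(isLossless(NAMASKAR, StandardCharsets.UTF_8));
    }
}
```

Checking `isLossless` before converting is one way to detect the problem instead of silently emitting "???".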
  • Question Marks – Incorrect conversion
  • Mojibake – 文字化け
    Pronounced “Moh-jee-baa-kay”, a Japanese word meaning “garbled characters”
    Occurs when text in one encoding is “interpreted” as some other encoding
    Most often caused by mixing up Latin-1 and UTF-8, e.g. UTF-8 text interpreted as Latin-1
    UTF-8 is byte-compatible only with US-ASCII
    Bytes outside the ASCII range mean different things in UTF-8 and in legacy encodings, and cause Mojibake when misinterpreted
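
Mojibake can be reproduced by encoding text with one charset and decoding the bytes with another. The sketch below is illustrative only; the class name, method name and sample word are assumptions, not taken from the slides.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class MojibakeDemo {
    // Encodes text with its actual charset, then decodes the bytes
    // as if they had been written in a different, assumed charset.
    static String misread(String text, Charset actual, Charset assumed) {
        return new String(text.getBytes(actual), assumed);
    }

    public static void main(String[] args) {
        String word = "caf\u00E9"; // "cafe" with an acute accent on the final e

        // UTF-8 bytes read as Latin-1: the two-byte sequence for U+00E9
        // turns into two separate Latin-1 characters.
        System.out.println(misread(word, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1));

        // Latin-1 bytes read as UTF-8: the lone byte 0xE9 is not valid UTF-8,
        // so the decoder substitutes the replacement character U+FFFD.
        System.out.println(misread(word, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8));

        // Pure ASCII text survives either mix-up unchanged.
        System.out.println(misread("cafe", StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1));
    }
}
```

Note that what the terminal shows for the first two lines also depends on the console's own encoding, which is exactly the kind of layering that makes these bugs confusing.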
  • Mojibake – 文字化け
  • Java™ And Unicode
  • Unicode support in Java™
    Java™ has always supported Unicode
    Java™ strings are UTF-16
    A “char” in Java™ is a UTF-16 code unit, not a code point
    By default the input and output streams use the OS native charset
    On Windows™ this is typically Windows-1252 (for Western European locales)
    On most Unices and Unix-like OS this is UTF-8
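
Because the default stream charset differs between platforms, one common approach is to name the output encoding explicitly rather than rely on the default. The following is a minimal sketch of that idea, not the code shown on the slides; the class and helper names are made up, and whether the text then renders correctly still depends on the terminal and its fonts.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;

// Hypothetical helper class, for illustration only.
class ExplicitOutputEncoding {
    // "Namaskar" in Devanagari, written with escapes so the source file stays ASCII.
    static final String NAMASKAR = "\u0928\u092E\u0938\u094D\u0915\u093E\u0930";

    // Wraps a stream in a PrintStream that encodes with the named charset.
    static PrintStream streamFor(OutputStream target, String charsetName) {
        try {
            return new PrintStream(target, true, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    // Returns the bytes that printing the text with the named charset would produce.
    static byte[] printedBytes(String text, String charsetName) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        PrintStream out = streamFor(sink, charsetName);
        out.print(text);
        out.flush();
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println("Default charset: " + Charset.defaultCharset());
        System.out.println("UTF-8 byte count: " + printedBytes(NAMASKAR, "UTF-8").length);

        // Print through a stream that always encodes as UTF-8.
        PrintStream utf8Out = streamFor(System.out, "UTF-8");
        utf8Out.println(NAMASKAR);
    }
}
```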
  • A “Hello, world” example
  • A “Hello, world” example (contd.)
  • A “Hello, world” example (contd.)
  • “Hello, world” on GNU/Linux
  • Garbage In, Garbage Out!
  • “Hello, world” Corrected!
  • Oops!
  • “Hello, world” Corrected!
  • Externalising Strings
    Resource Bundles
  • The Need
    Allows a single code base to display strings in multiple languages
    No need to refactor code to support new languages
  • Beginning
  • Beginning (Sum.properties)
    SUM_OF = Sum of
    AND = and
    IS = is
  • That was broken!
    It’s generally a bad idea to concatenate strings
    Does not work for all languages since the grammar is different!
    Always use string substitution using positional parameters
  • Correct Way
  • Correct Way (contd.)
    SumI18n.properties
    SUM = Sum of {0} and {1} is {2}
    SumI18n_hi.properties
    SUM = {0} अतिरिक्त {1} {2} के बराबर है
    SumI18n_ta.properties
    SUM = {0} மற்றும் {1} கூட்டினால் {2}
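
The positional-parameter approach can be sketched with `java.text.MessageFormat`. In this hedged example the patterns live in an in-memory map instead of `.properties` files so that the block is self-contained; the map, the class name and the second pattern (an invented English word order standing in for a translation) are assumptions for illustration.

```java
import java.text.MessageFormat;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper class, for illustration only.
class PositionalMessages {
    // Stand-in for resource bundles: one pattern per "language".
    static final Map<String, String> PATTERNS = new LinkedHashMap<>();
    static {
        PATTERNS.put("en", "Sum of {0} and {1} is {2}");
        // Invented alternative word order, playing the role of a translation
        // whose grammar places the result first.
        PATTERNS.put("reordered", "{2} is the sum of {0} and {1}");
    }

    // Fills the positional placeholders of the chosen pattern.
    static String sumMessage(String patternKey, int a, int b) {
        MessageFormat format = new MessageFormat(PATTERNS.get(patternKey), Locale.ENGLISH);
        return format.format(new Object[] {a, b, a + b});
    }

    public static void main(String[] args) {
        for (String key : PATTERNS.keySet()) {
            System.out.println(sumMessage(key, 2, 3));
        }
    }
}
```

The calling code passes the same three arguments in the same order every time; only the pattern decides where each value appears, which is what concatenation cannot offer.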
  • Oops!
    Java 1.5 property files are read as ISO-8859-1 (Latin-1)
    Use the “native2ascii” tool to convert Unicode files to \uXXXX escape sequences
    native2ascii -encoding UTF-8 SumI18n_hi.properties
    native2ascii -encoding UTF-8 SumI18n_ta.properties
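
The Latin-1 pitfall and both workarounds can be demonstrated without any files. The sketch below is an illustration under stated assumptions: the class, the method names and the `GREETING` key are made up, the escaping helper only mimics what native2ascii does for characters in the Basic Multilingual Plane, and the `Reader`-based `Properties.load` requires Java 6 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

// Hypothetical helper class, for illustration only.
class Utf8Properties {
    // "Namaskar" in Devanagari, written with escapes so the source file stays ASCII.
    static final String NAMASKAR = "\u0928\u092E\u0938\u094D\u0915\u093E\u0930";

    // Loads properties from raw bytes, either through a UTF-8 Reader
    // or through the InputStream overload, which always assumes ISO-8859-1.
    static String loadValue(byte[] fileBytes, String key, boolean decodeAsUtf8) {
        Properties props = new Properties();
        try {
            if (decodeAsUtf8) {
                props.load(new InputStreamReader(new ByteArrayInputStream(fileBytes), StandardCharsets.UTF_8));
            } else {
                props.load(new ByteArrayInputStream(fileBytes));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty(key);
    }

    // Replaces every non-ASCII UTF-16 code unit with a backslash, the letter u
    // and four hex digits, similar in spirit to what native2ascii produces.
    static String toAsciiEscapes(String text) {
        StringBuilder escaped = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (c < 128) {
                escaped.append(c);
            } else {
                escaped.append(String.format("\\u%04x", (int) c));
            }
        }
        return escaped.toString();
    }

    public static void main(String[] args) {
        String line = "GREETING = " + NAMASKAR + "\n";
        byte[] utf8File = line.getBytes(StandardCharsets.UTF_8);
        byte[] escapedFile = toAsciiEscapes(line).getBytes(StandardCharsets.US_ASCII);

        System.out.println(NAMASKAR.equals(loadValue(utf8File, "GREETING", true)));
        System.out.println(NAMASKAR.equals(loadValue(utf8File, "GREETING", false)));
        System.out.println(NAMASKAR.equals(loadValue(escapedFile, "GREETING", false)));
    }
}
```

Either route works: keep the file in UTF-8 and load it through a `Reader`, or keep the default loader and feed it an escaped, ASCII-only file.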
  • It’s working!
  • Internationalisation in PHP
  • Challenges
    PHP 5 (and earlier) does not understand characters and encodings
    The multi-byte extension (mbstring) in PHP works only for a few encodings (primarily CJK)
    PHP has very limited functions for formatting date, time, currencies, etc.
    PHP doesn’t provide linguistic sorting!
  • The Good News – Intl extension
    Open source – http://pecl.php.net/intl
    Designed for PHP 5.x, part of PHP 5.3
    Configure using “--enable-intl”
    Leverages ICU and CLDR
    Available as OO and procedural APIs
    Collator::sort() vs. collator_sort()
    Yahoo! is a key contributor
  • The PHP Intl Library
    Intl
    Collator
    IDN
    NumberFormatter
    Grapheme
    Locale
    ResourceBundle
    Normalizer
    IntlDateFormatter
    MessageFormatter
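
To show what locale-dependent collation buys over plain string comparison, here is a small sketch using Java's `java.text.Collator`, which plays a role similar to the Intl `Collator` listed above. It is written in Java rather than PHP, and the class name, method names and word list are assumptions made for this example.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class, for illustration only.
class LinguisticSort {
    // Sorts by raw UTF-16 code unit values, which is what String.compareTo does.
    static List<String> sortByCodeUnits(List<String> words) {
        List<String> copy = new ArrayList<>(words);
        Collections.sort(copy);
        return copy;
    }

    // Sorts using the collation rules of the given locale.
    static List<String> sortForLocale(List<String> words, Locale locale) {
        List<String> copy = new ArrayList<>(words);
        copy.sort(Collator.getInstance(locale));
        return copy;
    }

    public static void main(String[] args) {
        // The second word starts with U+00E9 (e with acute accent).
        List<String> words = Arrays.asList("zoo", "\u00E9tude", "apple");
        System.out.println(sortByCodeUnits(words));
        System.out.println(sortForLocale(words, Locale.ENGLISH));
    }
}
```

With code-unit ordering the accented word lands after "zoo" because U+00E9 is numerically larger than "z"; the collator places it between "apple" and "zoo", where a reader would expect it.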
  • Corrected substring implementation
  • Formatting Numbers
  • Resource Bundles
    Externalize strings in your application
    Similar to how desktop applications are built
    One binary and additional language packs
    Similar to Windows™ resource files and Unix® message files
    Structure is different, see ICU resource bundles
    Key/value pairs
    Key is used by the application at run time to display the value
  • Additional Things
    Change the “default_charset” in php.ini to “utf-8”
    While “mbstring” works well enough for Indic languages, use the more precise “grapheme_*” functions from the Intl library
    “echo” is encoding agnostic
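
The idea behind the “grapheme_*” functions, cutting strings at user-perceived character boundaries rather than at code units, can be sketched in Java with `java.text.BreakIterator`. This is an analogue written for illustration, not the PHP implementation from the slides; the class and method names are made up, and boundary rules for complex scripts can differ between JDK versions, so the example sticks to a simple combining accent.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical helper class, for illustration only.
class GraphemeSubstring {
    // Counts user-perceived characters instead of UTF-16 code units.
    static int graphemeCount(String text) {
        BreakIterator boundaries = BreakIterator.getCharacterInstance(Locale.ROOT);
        boundaries.setText(text);
        int count = 0;
        while (boundaries.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    // Returns the first n user-perceived characters without splitting
    // a base letter from its combining marks.
    static String firstGraphemes(String text, int n) {
        BreakIterator boundaries = BreakIterator.getCharacterInstance(Locale.ROOT);
        boundaries.setText(text);
        int end = 0;
        for (int i = 0; i < n; i++) {
            int next = boundaries.next();
            if (next == BreakIterator.DONE) {
                break; // fewer than n graphemes: keep what we have
            }
            end = next;
        }
        return text.substring(0, end);
    }

    public static void main(String[] args) {
        // "cafe" followed by U+0301 (combining acute accent): 5 code units, 4 graphemes.
        String word = "cafe\u0301";
        System.out.println(word.length() + " code units, " + graphemeCount(word) + " graphemes");
        System.out.println(word.substring(0, 4));     // naive cut drops the accent
        System.out.println(firstGraphemes(word, 4));  // grapheme-aware cut keeps it
    }
}
```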
  • Why Intl is better than mbstring?
  • Why Intl is better than mbstring? (contd.)
  • Resources
    http://www.w3.org/International/
    http://unicode.org/
    http://java.sun.com/javase/technologies/core/basic/intl/faq.jsp
    http://pecl.php.net/intl
    http://php.net/manual/en/refs.international.php