Software Internationalization Crash Course


Crash course in software internationalization and localization. Based on Java, but most of the information applies to all web development.

Published in: Technology
  1. 1. Internationalization, or I18N: Crash Course & Plan of Attack. Will Iverson
  2. 2. Contents <ul><li>What is I18N? </li></ul><ul><li>Goals </li></ul><ul><li>Demonstration </li></ul><ul><li>Unicode & UTF-8 </li></ul><ul><li>Configuration </li></ul><ul><li>Java & I18N </li></ul><ul><ul><li>Property Files </li></ul></ul><ul><ul><li>Locales </li></ul></ul><ul><ul><li>Web UI </li></ul></ul><ul><ul><li>AWT/Swing UI </li></ul></ul><ul><li>Example Code Review </li></ul><ul><li>Phases To Support </li></ul><ul><li>Cheat Sheet </li></ul>
  3. 3. What is I18N? <ul><li>The process of making an application localizable </li></ul><ul><li>Not the actual localization itself! </li></ul>
  4. 4. Goals <ul><li>Isolate culturally dependent information from the application </li></ul><ul><ul><li>Messages </li></ul></ul><ul><ul><li>GUI component labels </li></ul></ul><ul><ul><li>Online help </li></ul></ul><ul><ul><li>Sounds </li></ul></ul><ul><ul><li>Colors </li></ul></ul><ul><ul><li>Graphics </li></ul></ul><ul><ul><li>Icons </li></ul></ul><ul><ul><li>Dates </li></ul></ul><ul><ul><li>Times </li></ul></ul><ul><ul><li>Numbers </li></ul></ul><ul><ul><li>Currencies </li></ul></ul><ul><ul><li>Measurements </li></ul></ul><ul><ul><li>Phone numbers </li></ul></ul><ul><ul><li>Honorifics & personal titles </li></ul></ul><ul><ul><li>Postal addresses </li></ul></ul><ul><ul><li>Page layouts </li></ul></ul>
  5. 5. Contest <ul><li>Guess how many hard-coded, user-displayed Strings I estimate there are in the com.yourcompany.* packages? </li></ul><ul><li>A bit unscientific </li></ul><ul><ul><li>excludes SQL </li></ul></ul><ul><ul><li>excludes other Strings that are obviously not user-displayed… </li></ul></ul><ul><ul><ul><li>but does include some ambiguous Strings </li></ul></ul></ul><ul><li>Prize: Drink of choice at Starbucks </li></ul>
  6. 6. Demonstration <ul><li>Simple JSP page </li></ul><ul><ul><li>sample_i18n.jsp </li></ul></ul><ul><li>Localization files </li></ul><ul><ul><li> </li></ul></ul><ul><ul><li> </li></ul></ul><ul><ul><li> </li></ul></ul><ul><li>Generating Japanese Properties </li></ul><ul><ul><li>native2ascii </li></ul></ul><ul><ul><li> </li></ul></ul>
  7. 7. Unicode <ul><li>Unicode provides a unique number for every character, </li></ul><ul><ul><li>no matter what the platform, </li></ul></ul><ul><ul><li>no matter what the program, </li></ul></ul><ul><ul><li>no matter what the language </li></ul></ul><ul><ul><li>… from the Unicode website </li></ul></ul>
  8. 8. UTF-8 <ul><li>An encoding standard for Unicode </li></ul><ul><ul><li>If a character is character number 16,434, how is that written out? </li></ul></ul><ul><ul><li>UTF-8, UTF-16, UTF-32 </li></ul></ul><ul><ul><ul><li>UTF-8 is popular for HTML and similar protocols. UTF-8 is a way of transforming all Unicode characters into a variable length encoding of bytes. It has the advantages that the Unicode characters corresponding to the familiar ASCII set have the same byte values as ASCII, and that Unicode characters transformed into UTF-8 can be used with much existing software without extensive software rewrites. </li></ul></ul></ul>
  9. 9. UTF-8 cont’d <ul><li>8-bit code units </li></ul><ul><li>Multi-byte: the number of bytes depends on the code point </li></ul><ul><ul><li>0x00..0x7F ==> 1 byte </li></ul></ul><ul><ul><li>0x80..0x7FF ==> 2 bytes </li></ul></ul><ul><ul><li>0x800..0xD7FF, 0xE000..0xFFFF ==> 3 bytes </li></ul></ul><ul><ul><li>0x10000..0x10FFFF ==> 4 bytes </li></ul></ul>
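The range table above can be checked mechanically. The following is a minimal sketch, assuming a hypothetical helper class named `Utf8Length` (not part of the original slides): one method applies the table directly, the other asks the JDK to encode the code point and counts the resulting bytes.

```java
import java.nio.charset.StandardCharsets;

/** Sketch: how many bytes UTF-8 needs for a single Unicode code point. */
final class Utf8Length {

    /** Applies the range table from the slide. Lone surrogates are rejected. */
    static int byRange(int codePoint) {
        if (codePoint < 0 || codePoint > 0x10FFFF
                || (codePoint >= 0xD800 && codePoint <= 0xDFFF)) {
            throw new IllegalArgumentException("not an encodable code point: " + codePoint);
        }
        if (codePoint <= 0x7F) return 1;    // ASCII range
        if (codePoint <= 0x7FF) return 2;
        if (codePoint <= 0xFFFF) return 3;  // rest of the Basic Multilingual Plane
        return 4;                           // supplementary planes
    }

    /** Cross-check: let the JDK encode the code point and count the bytes. */
    static int byEncoding(int codePoint) {
        String text = new String(Character.toChars(codePoint));
        return text.getBytes(StandardCharsets.UTF_8).length;
    }
}
```

For the "character number 16,434" from the earlier slide (0x4032), both methods report 3 bytes.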
  10. 10. UTF-8 & Unsupported Apps <ul><li>Note that raw UTF-8 bytes outside 7-bit ASCII will display as high-bit characters in non-Unicode aware applications </li></ul><ul><ul><li>e.g. 歐洲 (UTF-8 bytes E6 AD 90 E6 B4 B2) would display as one Latin-1 character per byte: æ, two non-printing characters, then æ´² </li></ul></ul><ul><li>A Unicode aware app displaying in a font without the specific character will display a “box” </li></ul><ul><ul><li>歐洲 will display as ⁍⁍ </li></ul></ul><ul><li>This presentation done with Arial Unicode MS! </li></ul><ul><li>This is all bad news for an ASCII-editor-wielding developer </li></ul><ul><ul><li>Easy to munge stuff! </li></ul></ul>
  11. 11. Java & UTF8 <ul><li>Correct name is UTF-8 </li></ul><ul><li>Java calls UTF-8 “UTF8” </li></ul><ul><li>Java uses Universal Character Set (UCS-2) internally (UTF-16 as of J2SE 5.0) </li></ul><ul><ul><li>Fixed width (under UTF-16, supplementary characters take two 16-bit units) </li></ul></ul><ul><ul><li>Two bytes wide </li></ul></ul><ul><ul><li>So you’re already paying for the encode/decode memory & processing “hit” today </li></ul></ul><ul><li>Automatic conversion of UCS-2 to the default encoding is done throughout the JVM </li></ul><ul><ul><li>Have to go through an encode/decode regardless </li></ul></ul>
  12. 12. Java & UTF8 cont’d <ul><li>Java supports “escaping” characters outside the 7-bit ASCII range with the \uXXXX format. </li></ul><ul><ul><li>This is a very useful Java feature! </li></ul></ul><ul><ul><li>Allows viewing/editing text files with US ASCII editors </li></ul></ul><ul><ul><li>Needs JVM to translate back into “real” Unicode at runtime </li></ul></ul><ul><li>Use the native2ascii tool to translate back and forth between \uXXXX, “real” Unicode in UTF-8 encoding, and other encodings </li></ul>
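To make the escaping concrete, here is a rough sketch of what such a conversion does. The class `UnicodeEscaper` and its methods are hypothetical names for illustration only; this is not the native2ascii implementation, and it ignores edge cases such as backslashes already present in the input. The second method relies on the fact that `java.util.Properties.load` decodes these escapes when a properties file is read.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Locale;
import java.util.Properties;

/** Sketch: ASCII-safe escaping of text, in the spirit of native2ascii. */
final class UnicodeEscaper {

    /** Replaces every char outside 7-bit ASCII with a backslash-u hex escape. */
    static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 0x80) {
                out.append(c);
            } else {
                // Four lowercase hex digits per UTF-16 code unit.
                out.append(String.format(Locale.ROOT, "\\u%04x", (int) c));
            }
        }
        return out.toString();
    }

    /** Shows the runtime side: Properties.load turns the escapes back into text. */
    static String unescapeViaProperties(String escaped) {
        Properties props = new Properties();
        try {
            props.load(new StringReader("value=" + escaped));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty("value");
    }
}
```

Escaping 歐洲 this way yields twelve ASCII characters that any editor can display, and loading them through `Properties` restores the two original characters.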
  13. 13. Alternate Encodings <ul><li>ISO 8859-1 </li></ul><ul><li>ISO 8859-8 (Hebrew) </li></ul><ul><li>CP1251 (Russian) </li></ul><ul><li>Shift-JIS </li></ul><ul><ul><li>“ zillions” more… </li></ul></ul><ul><ul><li>a single language may have multiple encoding schemes </li></ul></ul><ul><li>Standardization on Unicode & UTF-8 </li></ul>
  14. 14. System Configuration <ul><li>Server JVM </li></ul><ul><li>Client JVM </li></ul><ul><li>Oracle </li></ul><ul><li>Apache </li></ul>
  15. 15. Server JVM configuration <ul><li>Set the default encoding to UTF8 </li></ul><ul><ul><li>-Dfile.encoding=UTF8 </li></ul></ul><ul><li>Default I/O to UTF8 </li></ul><ul><li>Helps minimize encoding mistakes </li></ul><ul><li>Otherwise defaults to platform default </li></ul><ul><ul><li>Cp1252 on Windows 2000 US </li></ul></ul>
  16. 16. Client JVM configuration <ul><li>May want to allow user to set the default Locale </li></ul><ul><ul><li>Locale.setDefault()…depending on security… </li></ul></ul><ul><li>May want to send configuration data on user’s preferred Locale back to server </li></ul><ul><li>May need to pay extra attention to character set encodings </li></ul><ul><ul><li>Manually ensure that UTF8 is used for relevant sections </li></ul></ul><ul><ul><li>byte[] bytes = text.getBytes("UTF8"); </li></ul></ul><ul><ul><li>String text = new String(bytes, "UTF8"); </li></ul></ul>
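A minimal sketch of naming the encoding explicitly on both sides of a conversion follows. The class name `ExplicitCharset` is an assumption for illustration. It uses `StandardCharsets.UTF_8` (available since Java 7) rather than the string name "UTF8" from the slide, which avoids the checked `UnsupportedEncodingException`.

```java
import java.nio.charset.StandardCharsets;

/** Sketch: never rely on the platform default when converting text to bytes. */
final class ExplicitCharset {

    /** Encodes text as UTF-8 regardless of the JVM's file.encoding setting. */
    static byte[] encode(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    /** Decodes bytes that are known to be UTF-8. */
    static String decode(byte[] utf8Bytes) {
        return new String(utf8Bytes, StandardCharsets.UTF_8);
    }
}
```

Calling `text.getBytes()` or `new String(bytes)` without a charset argument uses the platform default instead, which is where mismatches between client and server tend to creep in.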
  17. 17. Oracle configuration <ul><li>UNIX </li></ul><ul><ul><li>NLS_LANG set as an environment variable. </li></ul></ul><ul><li>Windows </li></ul><ul><ul><li>NLS_LANG set globally in registry </li></ul></ul><ul><ul><ul><li>override by setting NLS_LANG local environment in command prompt </li></ul></ul></ul><ul><li>Set the NLS_LANG character set to UTF8 (the full value has the form language_territory.charset, e.g. AMERICAN_AMERICA.UTF8) </li></ul><ul><li>Minimizes driver level translations </li></ul><ul><ul><li>Driver handles conversion from UTF8 to UCS-2 </li></ul></ul>
  18. 18. Apache Configuration <ul><li>Ensure that the content types in the config are sent with UTF-8 as the encoding </li></ul><ul><li>Required to work with Netscape </li></ul><ul><ul><li>MSIE, Opera respect meta tag </li></ul></ul><ul><ul><li>“ Mostly” works in Netscape regardless </li></ul></ul><ul><li>E.g. AddDefaultCharset directive </li></ul><ul><ul><li>Syntax: AddDefaultCharset On|Off|charset </li></ul></ul><ul><ul><li>Default: AddDefaultCharset Off </li></ul></ul><ul><ul><li>This directive specifies the name of the character set that will be added to any response that does not have any parameter on the content type in the HTTP headers. This will override any character set specified in the body of the document via a META tag. A setting of AddDefaultCharset Off disables this functionality. AddDefaultCharset On enables Apache's internal default charset of iso-8859-1 as required by the directive. You can also specify an alternate charset to be used; e.g. AddDefaultCharset utf-8. </li></ul></ul>
  19. 19. Java Property Files <ul><li>Use to break out hard-coded Strings </li></ul><ul><li>Go through the application code and identify the user-visible Strings </li></ul><ul><li>Use a tool to do this for the first pass! </li></ul><ul><ul><li>ResourceBundle labels = ResourceBundle.getBundle("client", currentLocale); </li></ul></ul><ul><ul><li>String value = labels.getString(key); </li></ul></ul>
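In a real application `ResourceBundle.getBundle` locates the property files on the classpath. To keep the lookup behaviour visible without any files, the sketch below builds bundles from in-memory text instead. The class names `BundleSketch` and `ChainedBundle` and the sample keys are assumptions for illustration; the point is only that a localized bundle falls back to its parent for keys it does not define.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.PropertyResourceBundle;
import java.util.ResourceBundle;

/** Sketch: properties-backed bundles with fallback to a base bundle. */
final class BundleSketch {

    /** A bundle parsed from properties text that defers missing keys to a parent. */
    static final class ChainedBundle extends PropertyResourceBundle {
        ChainedBundle(String propertiesText, ResourceBundle parent) throws IOException {
            super(new StringReader(propertiesText));
            if (parent != null) {
                setParent(parent); // protected in ResourceBundle, so set it from a subclass
            }
        }
    }

    /** Builds a bundle from text; pass null as parent for the base bundle. */
    static ResourceBundle bundle(String propertiesText, ResourceBundle parent) {
        try {
            return new ChainedBundle(propertiesText, parent);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With a base bundle defining `save` and `cancel` and a German bundle defining only `save`, `getString("cancel")` on the German bundle returns the base value rather than throwing `MissingResourceException`.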
  20. 20. Generating Property Files <ul><li>Generate the master property file[s] </li></ul><ul><ul><li>e.g. “” </li></ul></ul><ul><li>Clone off the file for the locale </li></ul><ul><ul><li>“” = German </li></ul></ul><ul><ul><li>“” = Japanese </li></ul></ul>
  21. 21. Localizing Property Files <ul><li>Open file in text editor that supports alternative encoding </li></ul><ul><ul><li>change the encoding to native (e.g. EmEditor) </li></ul></ul><ul><li>Use the editor to input localized messages </li></ul><ul><li>Save the file using the proper native encoding </li></ul><ul><li>Use native2ascii tool (included with JDK) to convert native encoding to ASCII </li></ul>
  22. 22. Property File Management <ul><li>Note that you are maintaining… </li></ul><ul><ul><li>Base (default) Locale </li></ul></ul><ul><ul><ul><li> </li></ul></ul></ul><ul><ul><li>Source (native) encoding file </li></ul></ul><ul><ul><ul><li> </li></ul></ul></ul><ul><ul><li>native2ascii encoded </li></ul></ul><ul><ul><ul><li> </li></ul></ul></ul><ul><ul><li>Need strategies for ensuring all Strings defined in the base file also appear in each localized file </li></ul></ul><ul><ul><ul><li>Tools can help here </li></ul></ul></ul>
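One simple strategy for keeping the files in sync is to compare key sets. The following is a sketch of such a check, with `MissingKeys` as a hypothetical class name; a build step or unit test could run it against each localized file.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;
import java.util.Set;
import java.util.TreeSet;

/** Sketch: report keys present in the base properties but absent from a localized copy. */
final class MissingKeys {

    /** Parses properties text (in a real build this would read the file instead). */
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    /** Returns the sorted set of keys that still need a translation. */
    static Set<String> missingFrom(Properties base, Properties localized) {
        Set<String> missing = new TreeSet<>(base.stringPropertyNames());
        missing.removeAll(localized.stringPropertyNames());
        return missing;
    }
}
```

Swapping the arguments reports the opposite problem: keys left behind in a localized file after they were removed from the base.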
  23. 23. Initial Property File Groups <ul><li>applet </li></ul><ul><li>Swing client </li></ul><ul><li>web user interface[s] </li></ul><ul><li>Others TBD </li></ul><ul><ul><li>Depends on deployment configuration </li></ul></ul>
  24. 24. Note: Compound Messages <ul><li>“ There are 1 file[s] on disk” </li></ul><ul><ul><li>“ There are ” + x + “ file[s] on disk” </li></ul></ul><ul><li>Concatenation hard-codes English word order, which may not be appropriate for other languages </li></ul><ul><li>Instead try… </li></ul><ul><ul><ul><li>“ There are {1} file[s] on disk” </li></ul></ul></ul><ul><ul><ul><li>“ 디스크에 1 파일 있다” </li></ul></ul></ul><ul><ul><li>Dynamic replace of the {1} </li></ul></ul>
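In the JDK, `java.text.MessageFormat` is one way to do this kind of placeholder replacement. The sketch below is an illustration, not the deck's own code: the class name `FileCountMessage` and the English pattern are assumptions. Note that `MessageFormat` arguments are zero-based, so the first argument is `{0}` rather than the `{1}` shown on the slide. The pattern also uses a choice sub-format to avoid the "file[s]" workaround.

```java
import java.text.MessageFormat;
import java.util.Locale;

/** Sketch: a compound message whose wording and word order live in the pattern. */
final class FileCountMessage {

    /** English pattern; a translator would supply a different pattern per locale. */
    static final String EN_PATTERN =
        "There {0,choice,0#are no files|1#is one file|1<are {0,number,integer} files} on disk.";

    /** Formats the count with the given pattern, using the locale's number conventions. */
    static String format(String pattern, Locale locale, int count) {
        return new MessageFormat(pattern, locale).format(new Object[] { count });
    }
}
```

A Korean properties file could carry a pattern such as `디스크에 {0} 파일 있다`, placing the argument wherever the language needs it, while the calling code stays unchanged.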
  25. 25. Java Locale <ul><li>A Java Locale object defines many details </li></ul><ul><ul><li>What the Properties search order is for Strings </li></ul></ul><ul><ul><ul><li>Which in turn describes which graphics to display </li></ul></ul></ul><ul><ul><li>The proper Date & Time display </li></ul></ul><ul><ul><li>The proper Currency display </li></ul></ul><ul><ul><li>The proper sorting (Collator) routines </li></ul></ul><ul><ul><li>Text boundaries (BreakIterator) </li></ul></ul><ul><ul><li>Preferred direction for typing! </li></ul></ul><ul><ul><ul><li>left to right </li></ul></ul></ul><ul><ul><ul><li>right to left </li></ul></ul></ul><ul><ul><ul><ul><li>Split mode not supported… </li></ul></ul></ul></ul>
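Two of these Locale-driven behaviours, number display and sorting, are easy to see side by side. The sketch below uses the standard `NumberFormat` and `Collator` classes; the wrapper class `LocaleSensitive` and the sample values are assumptions for illustration.

```java
import java.text.Collator;
import java.text.NumberFormat;
import java.util.Arrays;
import java.util.Locale;

/** Sketch: the same data rendered and ordered according to a Locale. */
final class LocaleSensitive {

    /** Formats a number with the locale's grouping and decimal separators. */
    static String number(double value, Locale locale) {
        return NumberFormat.getNumberInstance(locale).format(value);
    }

    /** Returns a copy sorted by the locale's collation rules, not by char value. */
    static String[] sorted(String[] words, Locale locale) {
        String[] copy = words.clone();
        Arrays.sort(copy, Collator.getInstance(locale));
        return copy;
    }
}
```

For example, 1234.5 comes out as `1,234.5` for `Locale.US` and `1.234,5` for `Locale.GERMANY`, and a `Collator` places "apple" before "Mango" and "Zebra" where plain `String` ordering would put the capitalized words first.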
  26. 26. Additional Locale Information <ul><li>A user’s locale may have the following additional details </li></ul><ul><ul><li>The proper font for the browser </li></ul></ul><ul><ul><ul><li>contains the correct character set </li></ul></ul></ul><ul><ul><li>The proper font for the AWT/Swing UI </li></ul></ul>
  27. 27. Locale & User Preference <ul><li>A user may wish to shift Locales </li></ul><ul><ul><li>E.g. in Europe, a German may be using a French or English machine, but would prefer to access a site in German </li></ul></ul><ul><li>Therefore, a browser may be able to “hint” on a Locale… </li></ul><ul><ul><li>but may not be definitive </li></ul></ul><ul><li>Mechanism for setting a user preferred Locale on a session/persistent basis </li></ul>
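One possible shape for such a mechanism is sketched below, under stated assumptions: the class `LocaleResolver`, its parameters, and the order of precedence (stored user preference first, then the browser's hint, then a fallback) are illustrative choices, not something the slides prescribe. The browser hint is taken to be an HTTP `Accept-Language` header value, parsed with `Locale.LanguageRange` and matched with `Locale.lookup` (both available since Java 8).

```java
import java.util.List;
import java.util.Locale;

/** Sketch: choose a Locale from a stored preference, a browser hint, or a fallback. */
final class LocaleResolver {

    /**
     * @param userPreference locale the user chose explicitly, or null if none is stored
     * @param acceptLanguage raw Accept-Language header value, or null if absent
     * @param supported      locales the application actually has bundles for
     * @param fallback       locale to use when nothing else matches
     */
    static Locale resolve(Locale userPreference, String acceptLanguage,
                          List<Locale> supported, Locale fallback) {
        // An explicit choice wins over whatever the machine's browser suggests.
        if (userPreference != null && supported.contains(userPreference)) {
            return userPreference;
        }
        // The browser only hints; take the best supported match, if any.
        if (acceptLanguage != null && !acceptLanguage.trim().isEmpty()) {
            List<Locale.LanguageRange> ranges = Locale.LanguageRange.parse(acceptLanguage);
            Locale hinted = Locale.lookup(ranges, supported);
            if (hinted != null) {
                return hinted;
            }
        }
        return fallback;
    }
}
```

In the German-user-on-a-French-machine case, a stored preference of German overrides a header that starts with `fr-FR`. `LanguageRange.parse` throws `IllegalArgumentException` on a malformed header, which a caller may want to catch.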
  28. 28. Web User Interface <ul><li>Servlets/JSP pages </li></ul><ul><ul><li>Suggest using the following default style… </li></ul></ul><ul><ul><ul><li>font-family: Arial Unicode MS, Arial, Helvetica, sans-serif </li></ul></ul></ul><ul><li>Set encoding for pages to UTF-8 </li></ul><ul><li>Oddity of HTML is that form results are posted in 8859_1 regardless </li></ul><ul><ul><li>but can be translated back to Unicode </li></ul></ul><ul><ul><li>Works well in MSIE, Opera </li></ul></ul><ul><ul><li>Works but “less well” in Netscape </li></ul></ul>
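The "translated back to Unicode" step can be sketched as follows. This is an illustration under an assumption: the form data really was UTF-8 and was decoded as ISO 8859-1 somewhere along the way. The class `FormDataRepair` and both method names are hypothetical; `misdecode` only simulates the problem so the repair can be shown without a servlet container.

```java
import java.nio.charset.StandardCharsets;

/** Sketch: undoing an ISO 8859-1 decode of bytes that were actually UTF-8. */
final class FormDataRepair {

    /** Simulates a container that reads UTF-8 request bytes as ISO 8859-1. */
    static String misdecode(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    /** Recovers the text: get the raw bytes back, then decode them as UTF-8. */
    static String repair(String misdecoded) {
        byte[] raw = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }
}
```

The repair is lossless because ISO 8859-1 maps every byte value to exactly one character. In a servlet, calling `request.setCharacterEncoding("UTF-8")` before reading any parameters usually avoids the need for it.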
  29. 29. AWT/Swing User Interface <ul><li>UI </li></ul><ul><ul><li>Window.applyResourceBundle(java.util.ResourceBundle) </li></ul></ul><ul><li>Arial Unicode is a 23.0 MB file (uncompressed) or 14MB (zipped)! </li></ul><ul><ul><li>Users may not have installed </li></ul></ul><ul><ul><li>Has “all” the Unicode characters! </li></ul></ul><ul><ul><li>May be required for CSR? </li></ul></ul><ul><li>Provide options for multi-lingual user preferences </li></ul><ul><ul><li>E.g. English/Japanese speaker may wish to switch Locale </li></ul></ul>
  30. 30. Design Question <ul><li>Is a Locale attached to a Session alone, or to a user preference? </li></ul><ul><ul><li>Regardless, best first time guess is made when a user logs in </li></ul></ul><ul><ul><li>Session object has Locale attached </li></ul></ul><ul><li>Can this be overridden by the user on a per session basis? </li></ul><ul><ul><li>Is this override persistent from one session to the next? </li></ul></ul><ul><li>Consider: </li></ul><ul><ul><li>European in Dublin supports both French and English customers… and logs in with a computer running a German OS & browser. </li></ul></ul>
  31. 31. Example Code Analysis <ul><li>Contest winner… </li></ul>
  32. 32. com.yourcompany.* <ul><li>Hard-coded user displayable Strings </li></ul><ul><ul><li>xxx (est, apx xxxK English) </li></ul></ul><ul><ul><ul><li>Of these, apx xxx are in applet (apx. 37K English) </li></ul></ul></ul><ul><ul><li>Excludes SQL </li></ul></ul><ul><li>Apx xxx classes use java.util.Date, xxx classes use java.sql.Timestamp </li></ul><ul><ul><li>Display or internal? </li></ul></ul><ul><li>Apx xxx instances of sorting </li></ul><ul><ul><li>Most in applet </li></ul></ul><ul><li>Need to validate bit-shifting operations, mostly in the custom protocol </li></ul><ul><ul><li>Appears to be OK </li></ul></ul><ul><ul><li>uses non-valid Unicode codes for bracketing text </li></ul></ul><ul><li>Initial Coverage Report.xls </li></ul>
  33. 33. Servlets <ul><li>Hard-coded user displayable Strings </li></ul><ul><ul><li>xxx (est.) </li></ul></ul><ul><li>HTML files contain text, but also a lot of layout & JavaScripts </li></ul><ul><li>Graphics need to be evaluated </li></ul><ul><ul><li>possibly broken out on a per locale basis </li></ul></ul>
  34. 34. Design Question <ul><li>JSPs contain JavaScript (code), layout, and user-readable text </li></ul><ul><li>Difficult to maintain JavaScript AND localization </li></ul><ul><ul><li>1 JSP per locale </li></ul></ul><ul><ul><ul><li>Difficult to maintain </li></ul></ul></ul><ul><ul><ul><li>Especially hard to synchronize JavaScript and formatting changes </li></ul></ul></ul>
  35. 35. Note: Resource/Locale Cache <ul><li>Probably worth writing an in-memory preloading cache for supported ResourceBundles </li></ul><ul><ul><li>Preload supported bundles </li></ul></ul><ul><ul><li>Manager to identify… </li></ul></ul><ul><ul><ul><li>supported bundles for user </li></ul></ul></ul><ul><ul><ul><li>supported bundles for client </li></ul></ul></ul><ul><ul><ul><ul><li>Won’t do much good to have a button for Japanese if there is no other Japanese </li></ul></ul></ul></ul>
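A minimal sketch of such a preloading cache follows. Everything here is an assumption for illustration: the class name `BundleCache`, the loader function passed to the constructor, and the fallback behaviour in `get`. In a real system the loader would typically call `ResourceBundle.getBundle`.

```java
import java.util.Collection;
import java.util.Locale;
import java.util.Map;
import java.util.ResourceBundle;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Sketch: load every supported bundle once, up front, and answer support queries. */
final class BundleCache {

    private final Map<Locale, ResourceBundle> bundles = new ConcurrentHashMap<>();

    /** Preloads one bundle per supported locale using the supplied loader. */
    BundleCache(Collection<Locale> supported, Function<Locale, ResourceBundle> loader) {
        for (Locale locale : supported) {
            bundles.put(locale, loader.apply(locale));
        }
    }

    /** Lets the UI decide whether to offer a language at all. */
    boolean isSupported(Locale locale) {
        return bundles.containsKey(locale);
    }

    /** Returns the bundle for the locale, or the fallback locale's bundle if unsupported. */
    ResourceBundle get(Locale locale, Locale fallback) {
        ResourceBundle bundle = bundles.get(locale);
        return bundle != null ? bundle : bundles.get(fallback);
    }
}
```

The `isSupported` check addresses the last bullet above: a "Japanese" button is only shown when a Japanese bundle was actually loaded.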
  36. 36. Whew <ul><li>Lots of material </li></ul><ul><li>Lots of changes </li></ul><ul><ul><li>Processing, not process </li></ul></ul><ul><ul><ul><li>(except user prefs for sessions) </li></ul></ul></ul><ul><li>Phased approach </li></ul><ul><ul><li>Phase 1 : Oracle </li></ul></ul><ul><ul><li>Phase 2 : JRun </li></ul></ul><ul><ul><li>Phase 3 : Browsers & Applets </li></ul></ul><ul><ul><li>Phase 4 : Localization </li></ul></ul>
  37. 37. Phase 1 - Oracle <ul><li>Data store & persistence </li></ul><ul><ul><li>Write Java class[es] to test round trip access to database </li></ul></ul><ul><ul><li>Allows testing for… </li></ul></ul><ul><ul><ul><li>Driver connectivity </li></ul></ul></ul><ul><ul><ul><li>specific SQL queries (ASC/DESC) </li></ul></ul></ul><ul><ul><ul><li>specific Datatypes (SQL Date, TimeStamp) </li></ul></ul></ul><ul><ul><ul><li>specific Stored Procedures </li></ul></ul></ul><ul><ul><ul><li>NLS setting & conversion of core store to UTF-8 </li></ul></ul></ul><ul><ul><ul><li>Performance, memory, and disk impact data </li></ul></ul></ul><ul><li>Write today for 8 </li></ul><ul><ul><li>Serves as regression test for 9 </li></ul></ul>
  38. 38. Phase 2 - JRun <ul><li>Ensure that all Date, Sorting, TimeStamps, etc are configured properly </li></ul><ul><ul><li>Everything internal to system is internal </li></ul></ul><ul><ul><li>User displayed Dates are set according to user Locale </li></ul></ul><ul><ul><ul><li>May require Locale/Session Java method signature changes </li></ul></ul></ul><ul><ul><ul><li>Side effect – user specified time zones </li></ul></ul></ul><ul><li>Conversion of hard coded Strings to property lists & resource bundles </li></ul><ul><ul><li>Write cache </li></ul></ul><ul><ul><li>Write Locale subclass with additional prefs </li></ul></ul><ul><li>Initial mockup of testing </li></ul><ul><ul><li>Japanese, German mechanical translations of properties file </li></ul></ul><ul><ul><li>Batch conversion of “test” graphics </li></ul></ul>
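The point about user-displayed dates and the time zone side effect can be illustrated with a short sketch. The class `UserDates` and its method signature are assumptions; the idea is that the stored instant stays the same, and only the display depends on the user's Locale and time zone.

```java
import java.text.DateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

/** Sketch: one internal instant, displayed per user Locale and time zone. */
final class UserDates {

    /** Formats the instant as a long date in the user's language and zone. */
    static String longDate(Date instant, Locale locale, TimeZone zone) {
        DateFormat format = DateFormat.getDateInstance(DateFormat.LONG, locale);
        format.setTimeZone(zone);
        return format.format(instant);
    }
}
```

An instant of 02:00 UTC on 1 January 2001 displays as a January date for a user in UTC but as a December 2000 date for a user in Los Angeles, which is why a method that formats dates for display needs access to the session's Locale and zone.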
  39. 39. Phase 3 – Browsers & Applets <ul><li>Validate user selection of Locale </li></ul><ul><ul><li>affects display </li></ul></ul><ul><ul><li>tagging text blocks as L2R or R2L </li></ul></ul><ul><ul><li>round-trip data processing is correct </li></ul></ul><ul><li>Select recommended/required browsers, fonts, OSes </li></ul>
  40. 40. Phase 4 – Localization <ul><li>Native language speakers perform translations </li></ul><ul><ul><li>Text </li></ul></ul><ul><ul><li>Graphics </li></ul></ul><ul><li>Validation of user experience with native speakers </li></ul><ul><li>Revenue opportunity </li></ul><ul><ul><li>additional fees for additional languages </li></ul></ul>
  41. 41. Note: Email <ul><li>Content Type Header </li></ul><ul><ul><li>Content-type: text/plain; charset="utf-8" </li></ul></ul><ul><li>Works with Outlook, Yahoo mail… </li></ul>
  42. 42. Note: Further Discussion <ul><li>Glyphs vs. Characters </li></ul><ul><li>Input methods </li></ul><ul><li>Bidirectional systems </li></ul><ul><ul><li>left-to-right & right-to-left </li></ul></ul>
  43. 43. Cheat Sheet <ul><li>UTF-8 </li></ul><ul><ul><li>Multi-byte (variable-length) encoding </li></ul></ul><ul><li>UTF8 </li></ul><ul><ul><li>Java term for UTF-8 </li></ul></ul><ul><li>UCS-2 </li></ul><ul><ul><li>Fixed width internal Java Unicode </li></ul></ul><ul><li>8859_1 </li></ul><ul><ul><li>Latin-1 character set (vs. UTF-8) </li></ul></ul><ul><li>Cp1252 </li></ul><ul><ul><li>Default character set for a JVM on US Windows </li></ul></ul>
  44. 44. Cheat Sheet (cont’d) [Diagram of encodings across the stack. Recoverable labels: Oracle (UTF-8); JDBC Driver; JVM (UCS-2 in memory, fixed width); Client Web Browser & Applets (Unicode font, input method); connections labeled “UTF-8 encoding” (twice), “8859_1 encoding”, “8859_1 POST”, and “UTF-8 GET”.]