Introduction to W3C I18n Best Practices. Presented by Gopal Venkatesan <g13n@ymail.com>
नमस्कार নমস্কার ನಮಸ್ಕಾರ ନମସ୍କର୍ வணக்கம் ਸਤਿ ਸ੍ਰੀ ਅਕਾਲ నమస్కారం നമസ്കാരം السلام علیکم નમસ્કાર
Training Outline: Internationalisation vocabulary. Typical problems: an outline of the common problems found across the web. Java and Internationalisation: the level of Internationalisation support available in Java. Resource bundles: formatting messages the correct way. PHP and Internationalisation: the level of Internationalisation support available in PHP.
Vocabulary
Unicode: International standard for representing written language in computers. Latest version 5.2 adds 6648 new characters, including support for Vedic Sanskrit. Maintained in sync with ISO 10646. Three main encodings: UTF-8, UTF-16 and UTF-32. Address space of 21 bits.
Unicode (contd.): UTF-8 is a multi-byte encoding built from 8-bit code units. An encoded character can take one, two, three or four bytes. UTF-8 is backward compatible with US-ASCII. Default encoding for PHP6?
Unicode (contd.): UTF-16 uses 16-bit code units. A single code unit cannot address the complete set, so characters beyond U+FFFF are encoded as surrogate pairs. Default encoding for strings in Java and JavaScript.
Unicode (contd.): UTF-32 uses 32-bit code units. Every Unicode code point is addressed within a single code unit.
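The size differences between the three encodings can be checked with a few lines of Java. This is a minimal sketch, not part of the original slides; the class and method names are made up for illustration, and the UTF-32 count is derived from the code point count rather than from an actual UTF-32 encoder.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: counts storage units per encoding for a string.
class EncodingSizes {
    // Number of bytes (8-bit code units) in the UTF-8 form.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of 16-bit code units; Java strings are already UTF-16.
    static int utf16Units(String s) {
        return s.length();
    }

    // Number of 32-bit code units, which equals the number of code points.
    static int utf32Units(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+0061, U+00E9, U+20AC and U+1F600 (written as a surrogate pair).
        String[] samples = {"a", "\u00E9", "\u20AC", "\uD83D\uDE00"};
        for (String s : samples) {
            System.out.printf("U+%04X: utf8=%d utf16=%d utf32=%d%n",
                    s.codePointAt(0), utf8Bytes(s), utf16Units(s), utf32Units(s));
        }
    }
}
```

The four samples take one, two, three and four bytes in UTF-8; only the last one needs two UTF-16 code units.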
Internationalisation: Design and development of a product, application or document content that enables easy localisation for target audiences that vary in culture, region, or language. Abbreviated as I18n, as there are eighteen characters between “I” and “n”.
Localisation: Adaptation of a product, application or document content to meet the language, cultural and other requirements of a specific target market (a “locale”). Translation is one aspect of localisation. Abbreviated as L10n, as there are ten characters between “L” and “n”.
Typical Problems
Typical Problem
The Solution: Determine the user environment. Format dates, times and currencies as per the locale. Understand the Internationalisation support available in your implementation language. Use the ICU/Internationalisation libraries rather than rolling your own functions.
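As an illustration of formatting per locale with library support instead of hand-written code, here is a minimal Java sketch. The class and method names are hypothetical; the calls are standard `java.text` and `java.time` APIs. The exact date output depends on the locale data shipped with the JDK, so no expected date output is shown.

```java
import java.text.NumberFormat;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.FormatStyle;
import java.util.Locale;

// Hypothetical helper showing locale-driven formatting.
class LocaleFormatting {
    // Formats a number with the grouping and decimal separators of the locale.
    static String formatNumber(double value, Locale locale) {
        return NumberFormat.getNumberInstance(locale).format(value);
    }

    // Formats a date in the short style preferred by the locale.
    static String formatDate(LocalDate date, Locale locale) {
        return DateTimeFormatter.ofLocalizedDate(FormatStyle.SHORT)
                .withLocale(locale)
                .format(date);
    }

    public static void main(String[] args) {
        Locale[] locales = {Locale.US, Locale.GERMANY, Locale.FRANCE};
        LocalDate date = LocalDate.of(2010, 3, 4);
        for (Locale locale : locales) {
            System.out.println(locale + ": " + formatNumber(1234567.5, locale)
                    + " | " + formatDate(date, locale));
        }
    }
}
```

In a real application the locale would come from the user environment (for example an HTTP `Accept-Language` header or a user preference) rather than from a fixed list.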
Common Encoding Problems
Tofu characters – Black hollow boxes: Shown as a black hollow box, typically one per character. Indicates a font problem, i.e., the system doesn’t have the right fonts to display the glyph(s). Tofu isn’t always a software problem: not a bug, but really annoying.
Question Marks – Incorrect conversion: “???” is usually displayed when converting text from one encoding to another. It means there is no equivalent character in the target encoding for the corresponding source character. May not always be a bug, though it sometimes occurs when an incorrect encoding is specified.
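The question-mark effect is easy to reproduce in Java, where `String.getBytes(Charset)` silently substitutes the charset's replacement byte for characters it cannot map. The sketch below uses hypothetical class and method names and is only meant to show the mechanism and one way to detect the problem up front.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo of lossy conversion to a smaller character set.
class QuestionMarks {
    // Encodes the text in the target charset and decodes it again.
    // Unmappable characters come back as '?'.
    static String roundTrip(String text, Charset target) {
        return new String(text.getBytes(target), target);
    }

    // Reports whether every character of the text exists in the target charset.
    static boolean fits(String text, Charset target) {
        return target.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String cafe = "caf\u00E9"; // ends in U+00E9, which US-ASCII lacks
        System.out.println(roundTrip(cafe, StandardCharsets.US_ASCII));
        System.out.println(fits(cafe, StandardCharsets.US_ASCII));
        System.out.println(fits(cafe, StandardCharsets.ISO_8859_1));
    }
}
```

Checking `canEncode` before converting lets the application pick a different target encoding or report an error rather than silently losing characters.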
Mojibake – 文字化け: Pronounced “Moh-jee-baa-kay”, a Japanese word meaning “garbled characters”. Occurs when text in one encoding is interpreted as some other encoding. Most often caused by mixing up Latin-1 and UTF-8, for example UTF-8 bytes read as Latin-1. UTF-8 is byte-compatible only with US-ASCII; characters outside the ASCII range are encoded differently in UTF-8 and turn into Mojibake when misread.
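A minimal Java sketch of how Mojibake arises, assuming the common case of UTF-8 bytes being decoded as Latin-1 (ISO-8859-1). The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo: decode bytes with the wrong charset, then undo it.
class MojibakeDemo {
    // Encodes as UTF-8 but decodes as ISO-8859-1: each UTF-8 byte
    // becomes its own (wrong) character.
    static String misread(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Reverses the mistake. This works only because ISO-8859-1 maps
    // every byte value to a character, so no information was lost.
    static String repair(String garbled) {
        byte[] original = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(original, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String word = "caf\u00E9";
        String garbled = misread(word);
        System.out.println(garbled.length());             // one character longer than the input
        System.out.println(repair(garbled).equals(word)); // the repair restores the text
    }
}
```

Pure ASCII text passes through unchanged, which is why such bugs often go unnoticed until the first non-ASCII character appears.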
Java™ And Unicode
Unicode support in Java™: Java™ has always supported Unicode. Java™ strings are UTF-16. A “char” in Java™ is a UTF-16 code unit, not a code point. By default the input and output streams use the OS native charset: on Windows™ this is typically Windows-1252 (for Western European locales); on most Unices and Unix-like OSes it is UTF-8.
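The difference between a `char` (code unit) and a code point shows up as soon as a string contains a character outside the Basic Multilingual Plane. A short sketch with hypothetical class and method names:

```java
// Hypothetical demo of code units versus code points in Java strings.
class CharVersusCodePoint {
    // Number of UTF-16 code units (what String.length() reports).
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // The code points of the string, one int per character.
    static int[] toCodePoints(String s) {
        return s.codePoints().toArray();
    }

    public static void main(String[] args) {
        // 'a' followed by U+1F600, which needs a surrogate pair in UTF-16.
        String text = "a\uD83D\uDE00";
        System.out.println(codeUnits(text));
        System.out.println(codePoints(text));
        for (int cp : toCodePoints(text)) {
            System.out.printf("U+%04X%n", cp);
        }
    }
}
```

Code that indexes with `charAt` can therefore land in the middle of a surrogate pair; iterating with `codePoints()` avoids that.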
A “Hello, world” example
“Hello, world” on GNU/Linux
Garbage In, Garbage Out!
“Hello, world” Corrected!
Oops!
“Hello, world” Corrected!
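The code from these slides is not preserved in the transcript. As a hedged reconstruction of the general idea, the sketch below writes a greeting through a writer with an explicit UTF-8 encoding instead of relying on the platform default charset. The class and method names are assumptions; the greeting is the Hindi word from the opening slide, written with Unicode escapes so the source file itself stays ASCII.

```java
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;

// Hypothetical "Hello, world" that does not depend on the default charset.
class HelloUtf8 {
    // Hindi greeting spelled with escapes (U+0928 U+092E U+0938 U+094D U+0915 U+093E U+0930).
    static final String GREETING = "\u0928\u092E\u0938\u094D\u0915\u093E\u0930";

    // Writes the greeting to the given stream, always encoded as UTF-8.
    static void greet(OutputStream out) {
        PrintWriter writer = new PrintWriter(
                new OutputStreamWriter(out, StandardCharsets.UTF_8));
        writer.print(GREETING);
        writer.flush();
    }

    public static void main(String[] args) {
        greet(System.out);
        System.out.println();
    }
}
```

Whether the text then displays correctly still depends on the terminal: it must interpret the bytes as UTF-8 and have a font with the required glyphs.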
Externalising Strings Resource Bundles
The Need: Allows a single code base to display strings in multiple languages. No need to refactor code to support new languages.
Beginning
Beginning (Sum.properties)
SUM_OF = Sum of
AND = and
IS = is
That was broken! It's generally a bad idea to concatenate strings. It does not work for all languages, since the grammar is different! Always use string substitution with positional parameters.
Correct Way
Correct Way (contd.)
SumI18n.properties: SUM = Sum of {0} and {1} is {2}
SumI18n_hi.properties: SUM = {0} अतिरिक्त {1} {2} के बराबर है
SumI18n_ta.properties: SUM = {0} மற்றும் {1} கூட்டினால் {2}
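The positional-parameter approach can be sketched in Java with `MessageFormat`. To keep the example self-contained, the bundle is loaded from an in-memory string through `PropertyResourceBundle` rather than from the `SumI18n*.properties` files on the classpath. The class name, method names and the reordered second pattern are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.text.MessageFormat;
import java.util.Locale;
import java.util.PropertyResourceBundle;
import java.util.ResourceBundle;

// Hypothetical demo of message formatting with positional parameters.
class SumMessages {
    // Builds a bundle from properties-file text held in memory.
    static ResourceBundle bundle(String propertiesText) {
        try {
            return new PropertyResourceBundle(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Looks up the SUM pattern and fills in {0}, {1} and {2}.
    static String sumMessage(ResourceBundle bundle, Locale locale, int a, int b) {
        MessageFormat format = new MessageFormat(bundle.getString("SUM"), locale);
        return format.format(new Object[] {a, b, a + b});
    }

    public static void main(String[] args) {
        ResourceBundle english = bundle("SUM = Sum of {0} and {1} is {2}\n");
        // Hypothetical pattern with a different word order, standing in
        // for a translation whose grammar differs from English.
        ResourceBundle reordered = bundle("SUM = {2} is the sum of {0} and {1}\n");
        System.out.println(sumMessage(english, Locale.ENGLISH, 2, 3));
        System.out.println(sumMessage(reordered, Locale.ENGLISH, 2, 3));
    }
}
```

Because each translation controls where the parameters appear, the calling code stays the same for every language. In a deployed application the bundle would normally be obtained with `ResourceBundle.getBundle("SumI18n", locale)`.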
Oops! Java 1.5 property files are read as ISO-8859-1 (Latin-1). Use the “native2ascii” tool to convert Unicode files to escape sequences (\uXXXX):
native2ascii -encoding UTF-8 SumI18n_hi.properties
native2ascii -encoding UTF-8 SumI18n_ta.properties
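The effect of the Latin-1 default can be reproduced with `java.util.Properties`. The sketch below (hypothetical names) loads the same value three ways: as escaped ASCII through an `InputStream`, as raw UTF-8 through an `InputStream` (which garbles it), and as raw UTF-8 through a `Reader` with an explicit charset, an alternative available since Java 6.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

// Hypothetical demo of how property files are decoded.
class PropertiesEncoding {
    // Properties.load(InputStream) decodes bytes as ISO-8859-1 and
    // expands backslash-u escape sequences.
    static String loadAsLatin1(byte[] data) {
        try {
            Properties properties = new Properties();
            properties.load(new ByteArrayInputStream(data));
            return properties.getProperty("SUM");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Properties.load(Reader) lets the caller choose the charset.
    static String loadAsUtf8(byte[] data) {
        try {
            Properties properties = new Properties();
            properties.load(new InputStreamReader(
                    new ByteArrayInputStream(data), StandardCharsets.UTF_8));
            return properties.getProperty("SUM");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String value = "\u0928\u092E"; // two Devanagari letters
        byte[] escaped = "SUM = \\u0928\\u092E".getBytes(StandardCharsets.ISO_8859_1);
        byte[] rawUtf8 = ("SUM = " + value).getBytes(StandardCharsets.UTF_8);
        System.out.println(loadAsLatin1(escaped).equals(value)); // escapes survive
        System.out.println(loadAsLatin1(rawUtf8).equals(value)); // raw UTF-8 is garbled
        System.out.println(loadAsUtf8(rawUtf8).equals(value));   // explicit charset works
    }
}
```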
It’s working!
Internationalisation in PHP
Challenges: PHP 5 (and earlier) does not understand characters and encodings. The multi-byte extension (mbstring) in PHP works only for a few encodings (primarily CJK). PHP has very limited functions for formatting date, time, currencies, etc. PHP doesn’t provide linguistic sorting!
The Good News – Intl extension: Open source (http://pecl.php.net/intl). Designed for PHP 5.x, part of PHP 5.3. Configure using “--enable-intl”. Leverages ICU and CLDR. Available as OO and procedural APIs: Collator::sort() vs. collator_sort(). Yahoo! is a key contributor.
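Linguistic sorting, which the Intl extension provides through `Collator`, has a direct counterpart in Java's `java.text.Collator`. The sketch below is in Java rather than PHP, to match the other examples, and is only an analogy for what a collator does; the class and method names are hypothetical.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical demo contrasting code point order with linguistic order.
class LinguisticSort {
    // Sorts by raw UTF-16 code unit values (String.compareTo).
    static List<String> sortByCodePoint(List<String> words) {
        List<String> copy = new ArrayList<>(words);
        copy.sort(null); // natural ordering
        return copy;
    }

    // Sorts using the collation rules of the given locale.
    static List<String> sortFor(Locale locale, List<String> words) {
        List<String> copy = new ArrayList<>(words);
        copy.sort(Collator.getInstance(locale));
        return copy;
    }

    public static void main(String[] args) {
        // The second word starts with U+00C4 (A with diaeresis).
        List<String> words = Arrays.asList("zebra", "\u00C4pfel", "apple");
        System.out.println(sortByCodePoint(words)); // accented word sorts last
        System.out.println(sortFor(Locale.GERMAN, words)); // accented word sorts among the a-words
    }
}
```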
The PHP Intl Library. Intl: Collator, IDN, NumberFormatter, Grapheme, Locale, ResourceBundle, Normalizer, IntlDateFormatter, MessageFormatter.
Corrected substring implementation
Formatting Numbers
Resource Bundles: Externalize strings in your application. Similar to how desktop applications are built: one binary and additional language packs. Similar to Windows™ resource files and Unix® message files. Structure is different, see ICU resource bundles. Key/value pairs: the key is used by the application at run time to display the value.
Additional Things: Change “default_charset” in php.ini to “utf-8”. While “mbstring” works well enough for Indic languages, use the more precise “grapheme_*” functions from the Intl library. “echo” is encoding agnostic.
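The idea behind grapheme-aware functions is to count and cut text by user-perceived characters, not by bytes or code points. The sketch below illustrates this in Java with `java.text.BreakIterator`, as an analogy to the PHP `grapheme_*` functions and not a description of them. The class and method names are hypothetical, and boundary rules for complex scripts can vary between JDK versions.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical demo of grapheme-aware counting and truncation.
class GraphemeText {
    // Counts user-perceived characters in the string.
    static int graphemes(String s) {
        BreakIterator iterator = BreakIterator.getCharacterInstance(Locale.ROOT);
        iterator.setText(s);
        int count = 0;
        while (iterator.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    // Returns the first n user-perceived characters without splitting
    // a base letter from its combining marks.
    static String firstGraphemes(String s, int n) {
        BreakIterator iterator = BreakIterator.getCharacterInstance(Locale.ROOT);
        iterator.setText(s);
        int end = iterator.first();
        for (int i = 0; i < n; i++) {
            int next = iterator.next();
            if (next == BreakIterator.DONE) {
                break;
            }
            end = next;
        }
        return s.substring(0, end);
    }

    public static void main(String[] args) {
        // 'e' + combining acute accent (U+0301), then 'a'.
        String text = "e\u0301a";
        System.out.println(text.length());
        System.out.println(graphemes(text));
        System.out.println(firstGraphemes(text, 1).length());
    }
}
```

A plain `substring(0, 1)` on the same text would keep the letter and drop its accent, which is the kind of bug a grapheme-aware substring avoids.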
Why Intl is better than mbstring?
Resources
http://www.w3.org/International/
http://unicode.org/
http://java.sun.com/javase/technologies/core/basic/intl/faq.jsp
http://pecl.php.net/intl
http://php.net/manual/en/refs.international.php

W3C I18n Best Practices and Localisation in Java & PHP

  • 1. Introduction to W3C I18n Best Practices Presented by Gopal Venkatesan <g13n@ymail.com>
  • 2. नमस्कार নমস্কার ನಮಸ್ಕಾರ ନମସ୍କର୍ வணக்கம் ਸਤਿ ਸ੍ਰੀ ਅਕਾਲ నమస్కారం നമസ്കാരം السلام علیکم નમસ્કાર
  • 3. Training Outline Internationalisation Vocabulary Typical Problems Outline the common problems found across the web Java and Internationalisation The level of Internationalisation support is available in Java Resource Bundles Formatting messages the correct way PHP and Internationalisation The level of Internationalisation support is available in PHP
  • 5. Unicode International standard for representing written language in computers Latest version 5.2 adds 6648 new characters including support for Vedic Sanskrit Maintained in sync with ISO 10646 Three main encodings: UTF-8, UTF-16 and UTF-32 Address space of 21 bits
  • 6. Unicode (contd.) UTF-8 is a multi-byte encoding and is eight bytes long An encoded character can take one, two, three or four bytes UTF-8 is backward compatible with US-ASCII Default encoding for PHP6?
  • 7. Unicode (contd.) UTF-16 uses 16-bit code units Cannot address the complete set, so uses surrogates Default encoding for strings in Java and JavaScript
  • 8. Unicode (contd.) UTF-32 uses 32-bit code units Every Unicode character is addressed within a single code unit
  • 9. Internationalisation Design and development of a product, application or document content that enables easy localization for target audiences that vary in culture, region, or language Abbreviated as I18n as there are eighteen characters between “I” and “n”
  • 10. Localisation Adaptation of a product, application or document content to meet the language, cultural and other requirements of a specific target market (a “locale”) Translation is one aspect of localisation Abbreviated as L10n as there are ten characters between “L” and “n”
  • 17. The Solution Determine the user environment Format dates, times, currencies as per the locale Understand the Internationalisation support available with your implementation language Use the ICU/Internationalisation libraries rather than rolling out your own functions
  • 19. Tofu characters – Black hollow boxes Shown as a black hollow box, typically one per character Indicates font problem i.e., the system doesn’t have the right fonts to display the glyph(s) Tofu isn’t always a software problem – not a bug but really annoying
  • 20. Tofu characters – Black hollow boxes
  • 21. Question Marks – Incorrect conversion “???” usually displayed when converting text from one encoding to another Means there is no equivalent character in the target encoding for the corresponding source May not be a bug always, though sometimes occurs when an incorrect encoding is specified
  • 22. Question Marks – Incorrect conversion
  • 23. Mojibake –文字化け Pronounced as “Moh-jee-baa-kay” is a Japanese word meaning “garbled characters” Occurs when text in one encoding is “interpreted” as some other encoding Most of the times caused by interpreting Latin-1 as UTF-8 UTF-8 is compatible only with US-ASCII Characters outside the ASCII range are incompatible with UTF-8 and cause Mojibake
  • 26. Unicode support in Java™ Java™ has always supported Unicode Java™ strings are UTF-16 A “char” in Java™ is a UTF-16 code unit, not a code point By default the input and output streams use the OS native charset On Windows™ this is Windows-1252 On most Unices and Unix-like OS this is UTF-8
  • 28. A “Hello, world” example (contd.)
  • 29. A “Hello, world” example (contd.)
  • 33. Oops!
  • 36. The Need Allows a single code base to display strings in multiple languages No need to refactor code to support new languages
  • 38. Beginning (Sum.properties) SUM_OF = Sum of AND = and IS = is
  • 39. That was broken! Its generally a bad idea to concatenate strings Does not work for all languages since the grammar is different! Always use string substitution using positional parameters
  • 41. Correct Way (contd.) SumI18n.properties SUM = Sum of {0} and {1} is {2} SumI18n_hi.properties SUM = {0} अतिरिक्त {1} {2} के बराबर है SumI18n_ta.properties SUM = {0} மற்றும் {1} கூட்டினால் {2}
  • 42. Oops! Java 1.5 property files are read as ISO-8859-1 (Latin-1) Use “native2ascii” tool to convert Unicode files to escape sequences (U+??) native2ascii –encoding UTF-8 SumI18n_hi.properties native2ascii –encoding UTF-8 SumI18n_ta.properties
  • 45. Challenges: PHP 5 (and earlier) does not understand characters and encodings. The multi-byte extension (mbstring) in PHP works only for a few encodings (primarily CJK). PHP has very limited functions for formatting date, time, currencies, etc. PHP doesn’t provide linguistic sorting!
  • 46. The Good News – Intl extension: Open source – http://pecl.php.net/intl. Designed for PHP 5.x, part of PHP 5.3. Configure using “--enable-intl”. Leverages ICU and CLDR. Available as OO and procedural APIs: Collator::sort() vs. collator_sort(). Yahoo! is a key contributor.
  • 47. The PHP Intl Library: Intl comprises Collator, IDN, NumberFormatter, Grapheme, Locale, ResourceBundle, Normalizer, IntlDateFormatter, MessageFormatter
  • 50. Resource Bundles: Externalize strings in your application. Similar to how desktop applications are built: one binary and additional language packs. Similar to Windows™ resource files and Unix® message files. The structure is different, see ICU resource bundles. Key/value pairs: the key is used by the application at run time to display the value.
  • 51. Additional Things: Change the “default_charset” in php.ini to “utf-8”. While “mbstring” works well enough for Indic languages, use the more precise “grapheme_*” functions from the Intl library. “echo” is encoding agnostic.
  • 52. Why Intl is better than mbstring?
  • 53. Why Intl is better than mbstring? (contd.)
  • 54. Resources http://www.w3.org/International/ http://unicode.org/ http://java.sun.com/javase/technologies/core/basic/intl/faq.jsp http://pecl.php.net/intl http://php.net/manual/en/refs.international.php
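
The positional-parameter approach from slides 39 to 41 can be sketched in Java roughly as follows. This is a minimal illustration, not the code from the slides: the class name SumMessages and the in-memory map standing in for the SumI18n*.properties files are assumptions made for the example, and the patterns are the ones shown on slide 41. Compile with a UTF-8 source encoding, since the patterns are written as literal text.

```java
import java.text.MessageFormat;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: formats the "sum" message for a given language.
final class SumMessages {
    // Stand-in for SumI18n.properties, SumI18n_hi.properties and SumI18n_ta.properties.
    private static final Map<String, String> PATTERNS = Map.of(
        "en", "Sum of {0} and {1} is {2}",
        "hi", "{0} अतिरिक्त {1} {2} के बराबर है",
        "ta", "{0} மற்றும் {1} கூட்டினால் {2}");

    // The whole sentence is one pattern per language, so each translation
    // can place the three numbers wherever its grammar needs them.
    public static String sum(String languageTag, int a, int b) {
        String pattern = PATTERNS.getOrDefault(languageTag, PATTERNS.get("en"));
        MessageFormat format = new MessageFormat(pattern, Locale.forLanguageTag(languageTag));
        return format.format(new Object[] {a, b, a + b});
    }

    public static void main(String[] args) {
        System.out.println(sum("en", 2, 3)); // Sum of 2 and 3 is 5
        System.out.println(sum("hi", 2, 3));
        System.out.println(sum("ta", 2, 3));
    }
}
```

In a real application the patterns would come from ResourceBundle.getBundle rather than a map; the point here is only that no sentence fragment is concatenated in code.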

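The question-mark and mojibake problems from slides 21 to 23 can be reproduced in a few lines of Java. This is a sketch under stated assumptions: the class and method names are hypothetical, the sample text is the Devanagari “Namaskar” written with escapes so the source file encoding does not matter, and the mojibake case shown is UTF-8 bytes read as Latin-1, which is only one of the possible mix-ups.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class for the two failure modes described in the slides.
final class EncodingPitfalls {
    // Devanagari "Namaskar": na, ma, sa, virama, ka, aa, ra.
    public static final String NAMASKAR = "\u0928\u092E\u0938\u094D\u0915\u093E\u0930";

    // Question marks: the target encoding has no equivalent character.
    // String.getBytes(Charset) substitutes the charset's replacement byte
    // for unmappable characters, which for US-ASCII is '?'.
    public static String toAsciiAndBack(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.US_ASCII);
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    // Mojibake: the bytes are correct UTF-8, but are decoded as Latin-1.
    public static String utf8ReadAsLatin1(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    // Correct handling: encode and decode with the same, explicit charset.
    public static String utf8RoundTrip(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(toAsciiAndBack(NAMASKAR));             // ???????
        System.out.println(utf8ReadAsLatin1(NAMASKAR).length());  // 21 (3 bytes per character)
        System.out.println(utf8RoundTrip(NAMASKAR).equals(NAMASKAR)); // true
    }
}
```

Naming the charset explicitly at every conversion avoids depending on the platform default that slide 26 describes.
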
Editor's Notes

  1. Typically “enabling” might involve designing and developing a product that does not have any country/region specific business logic. Additionally it should externalise all country/region specific logic so that they can be customised for a country/region. For example displaying the date as “dd/mm/yyyy” by default is bad, instead it should be displayed as per the user’s locale.
  2. Localisation involves not only translation, but additional customisation including numbers, dates, times, currency, sorting, icons, colours, etc.
  3. First and Last names entirely depend upon region and culture. Instead, “Given name” and “Surname” should be used.
  4. Do not validate names and e-mails on the client-side as JavaScript does a bad job when it comes to I18n.
  5. Most Indic fonts are cursive and need a minimum font size, different from the minimum size used for English, to be clearly legible.
  6. PHP doesn’t understand Unicode by default!
  7. Every character here is a single 16-bit code unit; each code unit makes up one character. This is not true for all languages (Japanese, for example), but fortunately characters in all Indian languages are contained within a 16-bit code unit. Without the “virama”, the program would print “namasakara”, which is incorrect. The “virama” is needed for the rendering engine to display either the half consonant, or add the consonant at the appropriate position. Note that Unicode doesn’t care about how glyphs are rendered; it is the job of the software to do this.
  8. The question marks (as explained before) denote incorrect character conversion. The default code page for the Windows™ command prompt is the original IBM PC code page (437). The “chcp” program can be used to display/switch the code page. Windows™ also defines several other code pages, of which the popular ones are 1252 (Western European) and 65001 (UTF-8).
  9. The Emacs shell (eshell) is a wonderful terminal emulation program that runs within the Emacs editing environment. It is very useful because it supports Unicode. In the second case, we force Java™ to assume that the encoding is UTF-8, and hence it outputs the correct bytes, resulting in correct rendering of the Devanagari “Namaskar”. Even though it works, this is a non-portable and bad way of doing things.
  10. On GNU/Linux, Java™ typically uses UTF-8 as the default charset (if the locale is set to a UTF-8 one).
  11. If the default charset is overridden with an incorrect one, the results vary from incorrect conversion (???) to mojibake (garbled characters), depending upon the output charset.
  12. The “tofu” characters mean that the font isn’t available. Yes, Windows™ doesn’t have a console font to display Devanagari.
  13. Collator provides locale-dependent collation and sorting. Formatter modules provide locale-dependent formatting of numbers, dates, currencies, messages, etc. Normalizer provides methods for normalising and checking text in normalised form. Locale provides access to locale-dependent resources. Grapheme provides a linguistically correct way of parsing strings, breaking a string into tokens, etc. IDN provides Internationalized Domain Name support. ResourceBundle provides methods for customising messages depending on the locale.
  14. The “Rs.” isn’t hard-coded in our program, which makes things easy when Unicode starts supporting the new Rupee symbol. There is no need to change code; the program will start displaying the new symbol as soon as it is supported. The actual work involves installing a new Intl library that is compiled against the newer version of the ICU libraries, and installing new fonts that have the glyph for the corresponding code point.
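
The distinction in note 7 between 16-bit code units, code points and the role of the “virama” can be checked with a short Java sketch. The class and method names below are hypothetical, and U+20BB7 (a CJK ideograph outside the Basic Multilingual Plane) is chosen here only as an example of a character that does not fit in one 16-bit code unit.

```java
// Hypothetical demo class: counts UTF-16 code units versus Unicode code points.
final class CodeUnits {
    // Devanagari "Namaskar": na, ma, sa, virama (U+094D), ka, aa, ra.
    public static final String NAMASKAR = "\u0928\u092E\u0938\u094D\u0915\u093E\u0930";

    // One code point that needs a surrogate pair (two code units) in UTF-16.
    public static final String SUPPLEMENTARY = new String(Character.toChars(0x20BB7));

    // String.length() counts UTF-16 code units, not characters.
    public static int codeUnits(String text) {
        return text.length();
    }

    // codePointCount counts Unicode code points, treating a surrogate pair as one.
    public static int codePoints(String text) {
        return text.codePointCount(0, text.length());
    }

    // Removing the virama changes the text itself: the result reads "namasakara".
    public static String withoutVirama(String text) {
        return text.replace("\u094D", "");
    }

    public static void main(String[] args) {
        System.out.println(codeUnits(NAMASKAR) + " " + codePoints(NAMASKAR));           // 7 7
        System.out.println(codeUnits(SUPPLEMENTARY) + " " + codePoints(SUPPLEMENTARY)); // 2 1
        System.out.println(codeUnits(withoutVirama(NAMASKAR)));                         // 6
    }
}
```

Neither count equals the number of user-perceived characters, since the virama and the vowel sign combine with neighbouring consonants when rendered; that is the gap the grapheme functions mentioned in slide 51 address on the PHP side.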