
Building Blocks for Accessing Multilingual Data: CLDR


Given June 27, 2015 in San Francisco

Published in: Technology

  1. Building Blocks for Accessing Multilingual Data: CLDR • Steven R. Loomis, IBM GFTT
  2. Access available handouts at ala.15.ala.org/sessions/handouts. About Me • Senior Software Engineer, IBM Global Foundations Technology Team • IBM’s technical lead for the ICU4C/C++ software library, and primary voting representative to Unicode • Member of CLDR-TC, lead of ULI-TC
  3. Agenda • About CLDR • Focus Areas: • Language Identification • Transliteration • Searching and Sorting • Keyboards/Entry • Q&A
  4. What is CLDR? • Common Locale Data Repository • Language- and region-specific data • Covers hundreds of language/region pairs • Open data (like Unicode itself), XML/JSON format • Community input, carefully curated
  5. Who is CLDR? • CLDR’s Technical Committee, the CLDR-TC, is part of the Unicode Consortium • Active participation by industry, academia, open source projects, national standards bodies, and individuals
  6. Who uses CLDR? • Apple, Google, IBM, Microsoft… • Wikimedia Foundation, jQuery, … • Java, Node.js, PHP, … • Many users via the ICU C/C++/Java library
  7. Locale Data • Data required for respecting the linguistic, cultural, and geopolitical requirements of specific users • Example: "What day is it?"
  8. XML / JSON • XML: “es-US” • <month type="6">Junio</month> • JSON: “es-US” • { … "6": "Junio", … }
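
The two fragments on the XML / JSON slide can be read with a few lines of standard-library Python. This is only a sketch: the fragments below are reduced to the single month entry shown on the slide, the `<months>` wrapper element and both helper names are assumptions made for illustration, and real CLDR files nest this data far more deeply.

```python
import json
import xml.etree.ElementTree as ET

# Fragments reduced to the one entry shown on the slide (not full CLDR files).
# The <months> wrapper is an assumption so the XML has a single root element.
XML_FRAGMENT = '<months><month type="6">Junio</month></months>'
JSON_FRAGMENT = '{"6": "Junio"}'


def month_name_from_xml(fragment, month_number):
    """Return the text of the <month> element whose type matches, or None."""
    root = ET.fromstring(fragment)
    for month in root.iter("month"):
        if month.get("type") == str(month_number):
            return month.text
    return None


def month_name_from_json(fragment, month_number):
    """Return the value stored under the month number (as a string key), or None."""
    return json.loads(fragment).get(str(month_number))


if __name__ == "__main__":
    print(month_name_from_xml(XML_FRAGMENT, 6))
    print(month_name_from_json(JSON_FRAGMENT, 6))
```

Both helpers return `None` for a month that is not present, so a caller can fall back to another locale.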
  9. CLDR Coverage • Coverage vs. number of languages
  10. CLDR site and SurveyTool (DEMO) • DEMO: • http://unicode.org/cldr • http://st.unicode.org/cldr-apps
  11. Locale Identifiers: BCP47 • Example: sr-Latn-RS • sr : ISO 639 "Serbian" • Latn : ISO 15924 "Latin Script" (vs Cyrillic) • RS : ISO 3166 / UN M.49 "Serbia"
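
The breakdown of `sr-Latn-RS` on the locale identifier slide can be sketched as a small parser. This is a simplified illustration, not a complete BCP 47 implementation: it handles only the language, script, and region subtags and stops at anything else (variants, extensions). The names `LocaleId` and `parse_locale_id` are hypothetical.

```python
from typing import NamedTuple, Optional


class LocaleId(NamedTuple):
    language: str
    script: Optional[str]
    region: Optional[str]


def parse_locale_id(tag: str) -> LocaleId:
    """Split a tag such as 'sr-Latn-RS' into language, script, and region."""
    subtags = tag.replace("_", "-").split("-")
    language = subtags[0].lower()
    script = None
    region = None
    for subtag in subtags[1:]:
        is_script = len(subtag) == 4 and subtag.isalpha()
        is_region = (len(subtag) == 2 and subtag.isalpha()) or (
            len(subtag) == 3 and subtag.isdigit()
        )
        if is_script and script is None and region is None:
            script = subtag.title()   # four letters, e.g. Latn
        elif is_region and region is None:
            region = subtag.upper()   # two letters (RS) or three digits (M.49)
        else:
            break  # variants and extensions are out of scope for this sketch
    return LocaleId(language, script, region)


if __name__ == "__main__":
    print(parse_locale_id("sr-Latn-RS"))
```

Normalizing case on the way in means `SR-latn-rs` and `sr-Latn-RS` produce the same result.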
  12. Language/Territory/Script info • Facts: • “The Cyrillic Script can be used to write Mongolian, Russian, Serbian…” • “Italian is spoken in Italy, San Marino, Switzerland…”
  13. Language Identification: Exemplars • English (Latin): a b c d e f g h i j k l m n o p q r s t u v w x y z • Serbian (Latin): a b c ć č d đ dž e f g h i j k l lj m n nj o p r s š t u v z ž • Serbian (Cyrillic): а б в г д ђ е ж з и ј к л љ м н њ о п р с т ћ у ф х ц ч џ ш • Russian (Cyrillic): а б в г д е ё ж з и й к л м н о п р с т у ф х ц ч ш щ ъ ы ь э ю я
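
One way to use the exemplar sets on the slide above for language identification is to keep every language whose exemplars cover all letters of the input. The sketch below uses only the four sets from the slide. Flattening digraphs such as "dž" into their component letters is a simplification, and the function name `candidate_languages` is hypothetical.

```python
import unicodedata

# Exemplar characters copied from the slide.
EXEMPLARS = {
    "English (Latin)": "a b c d e f g h i j k l m n o p q r s t u v w x y z",
    "Serbian (Latin)": "a b c ć č d đ dž e f g h i j k l lj m n nj o p r s š t u v z ž",
    "Serbian (Cyrillic)": "а б в г д ђ е ж з и ј к л љ м н њ о п р с т ћ у ф х ц ч џ ш",
    "Russian (Cyrillic)": "а б в г д е ё ж з и й к л м н о п р с т у ф х ц ч ш щ ъ ы ь э ю я",
}

# Simplification: digraphs (dž, lj, nj) are flattened into single letters.
EXEMPLAR_SETS = {
    name: set(unicodedata.normalize("NFC", chars).replace(" ", ""))
    for name, chars in EXEMPLARS.items()
}


def candidate_languages(text):
    """Return the languages whose exemplar set contains every letter in text."""
    normalized = unicodedata.normalize("NFC", text).lower()
    letters = {ch for ch in normalized if ch.isalpha()}
    if not letters:
        return []
    return [name for name, chars in EXEMPLAR_SETS.items() if letters <= chars]


if __name__ == "__main__":
    for sample in ["why", "čaj", "щи", "Beograd"]:
        print(sample, candidate_languages(sample))
```

Short inputs often match more than one language ("Beograd" fits both Latin sets), so exemplars narrow the candidates rather than decide on their own.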
  14. Transliteration • Existing data for rule sets. • ALA-LC format could be included. • Rule-based engine.
  15. Transliteration Rule Example: Greek • <tRule>Σ ↔ S ;</tRule> • <tRule>τ ↔ t ;</tRule> • <tRule>Τ ↔ T ;</tRule>
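
The three Greek rules above are bidirectional: each `↔` maps a source character to a target and back. A toy engine for such one-to-one rules might look like the following. It is a sketch that assumes single-character rules only; real rule sets also use context, variables, and multi-character mappings, none of which are handled here. `parse_rules` and `transliterate` are hypothetical names.

```python
# Rule text copied from the slide, without the <tRule> wrapper.
RULES = ["Σ ↔ S ;", "τ ↔ t ;", "Τ ↔ T ;"]


def parse_rules(rules):
    """Build forward and reverse lookup tables from 'source ↔ target ;' rules."""
    forward, reverse = {}, {}
    for rule in rules:
        source, target = (side.strip() for side in rule.rstrip(" ;").split("↔"))
        forward[source] = target
        reverse[target] = source
    return forward, reverse


def transliterate(text, table):
    """Replace each character found in the table; leave all others unchanged."""
    return "".join(table.get(ch, ch) for ch in text)


if __name__ == "__main__":
    forward, reverse = parse_rules(RULES)
    greek = "Στ"
    latin = transliterate(greek, forward)
    print(latin, transliterate(latin, reverse))
```

Characters without a rule pass through untouched, which makes gaps in a rule set easy to spot in the output.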
  16. Demo: ICU transliterator demo • http://demo.icu-project.org/icu-bin/translit
  17. Searching and Sorting • Unicode (UCA) provides base • CLDR “tailors”: English vs. Danish vs. French • German: Mueller = Müller = MUELLER • Multiple stages and options: • blackbird vs black-bird vs BlackBird
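
The equivalences on the searching and sorting slide (Mueller = Müller = MUELLER, and blackbird vs black-bird vs BlackBird) can be approximated with a hand-rolled comparison key. This is a rough sketch and not the UCA or a CLDR tailoring: it assumes that umlauts expand to a vowel plus "e", that case is ignored, and that punctuation is dropped. `primary_key` is a hypothetical name.

```python
import unicodedata

# Assumed German-style expansions; a real tailoring comes from CLDR data.
EXPANSIONS = {"ä": "ae", "ö": "oe", "ü": "ue"}


def primary_key(text):
    """Return a key that ignores case, punctuation, and umlaut spelling."""
    folded = unicodedata.normalize("NFC", text).casefold()
    expanded = "".join(EXPANSIONS.get(ch, ch) for ch in folded)
    return "".join(ch for ch in expanded if ch.isalnum())


if __name__ == "__main__":
    names = ["MUELLER", "Müller", "Mueller", "black-bird", "BlackBird", "blackbird"]
    for name in sorted(names, key=primary_key):
        print(name, "->", primary_key(name))
```

A fuller comparison would fall back to the differences ignored here (accents, then case, then punctuation) as later stages whenever two keys tie.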
  18. Demo: Collator • http://demo.icu-project.org/icu-bin/collation.html
  19. Keyboards / Entry • Standardized identifier for keyboard tables • Allows comparison between keyboard providers
  20. Demo: MARC processor • CLDR data: • Script: Armn (Armenian) • Exemplar text matches hy “Armenian” • Transliterate to Latin: “Hayastaneayc‘ ekeġec‘i” • Regions where spoken: Armenia, Russia, Georgia, Syria, Lebanon, Iran, Turkey, Cyprus • Uses: CLDR, ICU4J, MARC4J
  21. Thank You / Q&A • srloomis@us.ibm.com • @srl295 (Twitter, GitHub, Freenode) • ibm.biz/srloomis