02 c a306-phillips_langtags



  1. Language Tags and Locale Identifiers: A Status Report
  2. Presenter and Agenda
     • Addison Phillips
       – Internationalization Architect, Yahoo!
       – Co-Editor, Language Tag Registry Update (LTRU) Working Group (RFC 3066bis, draft-matching)
     • Language tags
     • Locale identifiers

     Addison Phillips is the co-editor of the recent Language Tag registry RFC and its associated matching draft. This presentation details the history of language tags and locale identifiers on the Internet, with a focus on the recent changes and updates to RFC 3066 and efforts to create standardized locales and locale identifiers for the Internet.
  3. Languages? Locales?
     • What’s a language tag?
     • What the #@&%$ is a locale?
     • Why do identifiers matter?

     If the Internet is anything, it is a means of communication. While there are many forms of communication, language and textual information in particular loom large in computer systems.

     The identification of human “natural language”, as a result, is important, since users expect their computer systems to interact with textual data in useful ways (be it searching, relaying, checking, formatting, or otherwise processing it).

     Alas, defining what a language is and what constitutes the difference between various forms of language is a complex problem. And, for computer systems, there is another kind of beast: the “locale”, which is even more difficult to grasp.

     What are these things? How do we identify them? Why do language and locale identifiers matter?
  4. Language Tags
     • Enable presentation, selection, and negotiation of content
     • Defined by BCP 47
       – Widely used! XML, HTML, RSS, MIME, SOAP, SMTP, LDAP, CSS, XSL, CCXML, Java, C#, ASP, perl…
       – Well understood (?)

     Natural language and especially written (that is, textual) information are a key and fundamental part of most computer systems. When computer systems were mostly isolated and not interconnected, they mostly dealt with a single language at a time and could be tuned to deal with the particular idiosyncrasies of that language. But the Internet (and other networking technologies) have changed that. Now textual data may be stored, processed, or viewed in many different contexts and many different languages simultaneously. And increasingly the boundaries between the “computer” and the world at large are becoming blurred: your “computer” today might equally be your TV, your telephone, your game player, your music player, your PDA, or your automobile. The digital content delivered to your “computer” is more important than the form factor the computer itself takes. As text, speech, and other content associated with language become pervasive and networked together, the selection, identification, and correct processing of the language become critical.

     Most people seem to believe that they have a relatively good grasp of languages and, thus, of language identification. If you ask your mother-in-law what language the folks in Germany or France speak, for example, she probably will have a ready answer. But the more one delves into languages and language identification, the more complex the problem seems to become.

     The standard for language identification on the Internet is something called “BCP 47”. It is widely used: the list above is a small fraction of the formats and technologies that implement it. What, never heard of “BCP 47”? BCP 47 is the official designation for the language tagging specification of the IETF. BCP stands for “best current practice”. The most recent document to be BCP 47 is (or, by the time you read this, “was”) RFC 3066, which was preceded by RFC 1766. You’re probably more familiar with the RFC numbers than the BCP number.
  5. Locale Identifiers
     • Different ideas:
       – Accept-Locale vs. Accept-Language
       – URIs/URNs, etc.
       – CLDR/LDML
     • And Requirements:
       – Operating environments and harmonization
       – App Servers
       – Web Services
     • New Solution? Cost of Adoption:
       – UTF-8 to the browser: 8 long years

     Locale identifiers, by contrast, are somewhat more difficult to grasp. Your mother-in-law (unless she’s a software engineer) probably has no idea what a locale is.

     One definition of a locale is: “a data structure or concept used by programmers to identify a particular collection of cultural, regional, or linguistic preferences.”

     Locales are tied to specific programming languages or operating environments. What they do and how they are identified are unique and usually proprietary.

     There is a relationship of sorts between language and locale: most locale identifiers include a language identifier. So if locale identifiers need to be exchanged on the Internet, as in Web services or between different application servers, how would these identifiers be defined?

     There are different ideas for how this might happen. One question is cost of adoption: new headers, identifiers, or data structures might take a long time to reach “critical mass” and become useful, while adaptation or cooption of existing structures might introduce problems for existing applications.
  6. In the Beginning
     • Received Wisdom from the Dark Ages
     • Locales:
       – japanese, french, german, C
       – ENU, FRA, JPN
       – ja_JP.PCK
       – AMERICAN_AMERICA.WE8ISO8859P1
     • Languages…
     • …looked a lot like locales (and vice versa)

     In the beginning, there was very little difference between language and locale in computer systems. Locale identifiers (some historical examples are shown above) usually included some kind of language identification.

     When the Internet became accessible to mere mortals in the early 1990s, language identification became an immediate concern. The Internet made content easy to exchange across boundaries and borders in ways that closed networks like CompuServe never could master. Identifying languages was necessary for applications such as email and HTTP, so Harald Alvestrand worked to create the first version of BCP 47, which was known as RFC 1766, to address the problem.

     These language tags became widely adopted, as we’ve noted. Locale identifiers were not created for the Internet, though, because of a lack of distributed applications.

     “Now, hold on!” you might say. “I’ve used distributed applications for years now: I’ve got client-server and I’ve bought books from Amazon or stocks from my broker or airline tickets on-line. What do you mean ‘there’s a lack of distributed applications’?!?”

     It is true that client-server architectures and Web applications are now quite commonplace. However, these are not truly distributed applications. In a Web application, for example, there is a host where all the logic is stored. This host and its associated programming language or operating environment completely encapsulate the overall locale model. Client-server architectures are similar: the client and server each have specific technology choices associated with them and the business logic lives in one or the other (and usually in the server).

     Truly distributed applications are the province of integration (EAI, B2B), Web services, and the idea of Service Oriented Architectures (SOA). You only need a shared concept of locale when your logic is being hosted in discrete chunks on multiple systems and when you cannot count on the systems using the same technology!

     Web apps are usually hosted in a single container or are written by people who have chosen a particular technology. The locale model associated with that technology becomes the locale model of the Web application. The whole point of Web services, by contrast, is to hide this technology decision.
  7. Locales and Language Tags meet
     • Conversations in Prague…
       – Language tags are being locale identifiers anyway…
       – Not going to need a big new thing…
       – Just a few things to fix…
     • …we can do this really fast

     In 2002, Mark Davis and I attended the Internationalization and Unicode Conference in Prague (so you can see that it pays to attend these events!), where I had a paper about locale identifiers. The basic problem was that language tags were widely distributed, and, since they looked an awful lot like POSIX locale identifiers, most Web application platforms were actually using them as locale identifiers already by mapping language tags to their local equivalent. Mark was working on the CLDR project and was concerned about problems involving script identification (especially for compatibility with Microsoft’s .NET Culture identifiers). It seemed that a few small fixes to BCP 47 (to allow some script subtags) and some documentation (“how to get a locale out of a language tag”) might solve several problems all at once.
  8. BCP 47 Basic Structure
     • Alphanumeric (ASCII only) subtags
     • Up to eight characters long
     • Separated by hyphens
     • Case not important (i.e. zh = ZH = zH = Zh)
     • 1*8alphanum *[ "-" 1*8alphanum ]

     The basic structure of language tags has been remarkably stable. Language tags are ASCII strings consisting of subtags separated by hyphens (and not underscores). The subtags may consist of either (ASCII) letters or digits.

     There exist suggested capitalization rules for some of the underlying standards used by language tags, but these do not apply to language tags and have no meaning in a language tag context. Language tags are case insensitive.

     At the bottom of the slide is the original “ABNF” which describes the language tag grammar.
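The grammar on this slide is small enough to check with a regular expression. The following is a minimal sketch rather than a reference implementation; the names `LANGTAG_RE`, `is_well_formed`, and `same_tag` are made up for illustration, and the pattern only covers the basic shape shown on the slide, not the later RFC 3066bis grammar.

```python
import re

# The slide's grammar: 1*8alphanum *[ "-" 1*8alphanum ]
# One or more hyphen-separated subtags, each 1 to 8 ASCII letters or digits.
LANGTAG_RE = re.compile(r"[A-Za-z0-9]{1,8}(?:-[A-Za-z0-9]{1,8})*")


def is_well_formed(tag: str) -> bool:
    """Return True if the tag has the basic hyphen-separated subtag shape."""
    return LANGTAG_RE.fullmatch(tag) is not None


def same_tag(a: str, b: str) -> bool:
    """Compare two tags ignoring case, since case carries no meaning."""
    return a.lower() == b.lower()
```

With this sketch, `is_well_formed("zh-TW")` is true, while a POSIX-style identifier such as `ja_JP` is rejected because of the underscore.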
  9. RFC 1766
     • zh-TW
       – zh: ISO 639-1 (alpha2)
       – TW: ISO 3166 (alpha2)
     • i-klingon
       – Registered value

     RFC 1766 defined language tags in two distinct ways.

     All language tags took the form of a sequence of subtags composed of the ASCII letters and digits separated by the hyphen character. The subtags could be, at most, eight characters long. RFC 1766 said that:
     • If the first subtag consisted of two letters, it was a language code from the ISO 639-1 standard.
     • If there was a second subtag (additional subtags are optional) and it consisted of two letters, it was a region code from the ISO 3166 standard.

     Otherwise, the interpretation of the tag (and its subtags) was defined by a registry maintained by IANA. If users needed a specific language tag, they could write to a mailing list (ietf-languages@iana.org) and request that a registration be created. Here is one such tag, for the Klingon language.
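The two rules above can be sketched as a small classifier. This is an illustration only: `describe_rfc1766` and the shape of its result are invented here, and the sketch does not check that the codes actually exist in ISO 639-1 or ISO 3166, nor does it consult the IANA registry.

```python
def describe_rfc1766(tag: str) -> dict:
    """Apply the two RFC 1766 rules from the notes to a well-formed tag."""
    subtags = tag.split("-")
    result = {"language": None, "region": None, "needs_registry": False}

    def two_letters(subtag: str) -> bool:
        return len(subtag) == 2 and subtag.isascii() and subtag.isalpha()

    # Rule 1: a two-letter first subtag is an ISO 639-1 language code.
    if not two_letters(subtags[0]):
        result["needs_registry"] = True  # e.g. "i-klingon"
        return result
    result["language"] = subtags[0].lower()

    # Rule 2: a two-letter second subtag is an ISO 3166 region code.
    if len(subtags) > 1:
        if two_letters(subtags[1]):
            result["region"] = subtags[1].upper()
        else:
            result["needs_registry"] = True
    return result
```

For `zh-TW` this yields language `zh` and region `TW`; for `i-klingon` neither rule applies, so the whole tag has to be looked up in the registry.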
  10. RFC 3066
     • sco-GB
       – sco: ISO 639-2 (alpha 3 codes)
     • But use alpha 2 codes when they exist
       – eng-GB is not allowed

     RFC 3066 expanded on RFC 1766, making a few minor additions and cleaning up a few problems that arose.

     The main change was the addition of ISO 639-2 codes for languages. The ISO 639-1 codes are two letters long and there are, necessarily, a limited number of these (about 650 total, given that some letters are reserved). Since there are at least several thousand languages that exist in modern times, this isn’t sufficient to encode the world’s languages. ISO 639-2 assigns three-letter codes, which allows for many more potential codes. This allows all of the languages to be represented by one code or another.

     RFC 3066 also mandated that if an ISO 639-1 code exists for a language, then that code must be used (and not the ISO 639-2 code). This prevents languages from being encoded using different tags. Thus the tag “eng-GB” is not legal, even though “eng” is a valid ISO 639-2 code: tags must use the “en” code for English.

     The IANA language tag registry remained the same as during the RFC 1766 era: a collection of isolated registrations.

     (‘sco’ is the code for ‘Scots’)
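The “use the alpha 2 code when one exists” rule amounts to a table lookup. The sketch below uses a tiny hand-picked sample of ISO 639 code pairs purely for illustration; the names `ALPHA3_TO_ALPHA2`, `preferred_language_code`, and `uses_preferred_code` are hypothetical, and a real implementation would need the complete ISO 639 tables.

```python
# A few ISO 639-2 codes that also have an ISO 639-1 (two-letter) code.
# Illustrative sample only; "sco" (Scots) is absent because it has no
# two-letter code.
ALPHA3_TO_ALPHA2 = {"eng": "en", "deu": "de", "fra": "fr", "zho": "zh", "jpn": "ja"}


def preferred_language_code(code: str) -> str:
    """Return the two-letter code if one is known, else the code itself."""
    code = code.lower()
    return ALPHA3_TO_ALPHA2.get(code, code)


def uses_preferred_code(tag: str) -> bool:
    """True if the tag's first subtag is already the preferred code."""
    language = tag.split("-")[0].lower()
    return preferred_language_code(language) == language
```

Here `uses_preferred_code("eng-GB")` is false because `en` exists, while `sco-GB` passes because Scots only has a three-letter code.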
  11. Problems
     • Script Variation:
       – zh-Hant/zh-Hans
       – (sr-Cyrl/sr-Latn, az-Arab/az-Latn/az-Cyrl, etc.)
     • Obsolescence of registrations:
       – art-lojban (now jbo), i-klingon (now tlh)
     • Instability in underlying standards:
       – sr-CS (CS used to be Czechoslovakia…)

     A variety of problems were associated with language tags, despite their success. The one Mark and I were primarily interested in was the problem of script variation. Most languages are customarily written in a single script. They may be transcribed in another script, but most native speakers and most content in that language use a single script.

     A few languages are written equally, or at least “commonly”, in more than one script. Some of the languages are undergoing transitions (Cyrillic script was imposed on several languages during the Soviet era, for example), while others are just naturally written in more than one script. For example, Serbian can be written in either Cyrillic or Latin script. Both traditions are historical to the language, not artificially imposed.

     The most notable example of script variation is in Chinese, where the traditional form of the script is used in some Chinese speaking regions (Taiwan, Hong Kong) while the simplified form of the script is used in others (the PRC, Singapore). These variations do not follow spoken variation in the language (Hong Kong, for example, speaks Cantonese while Taiwan speaks Mandarin)… which leads to vocabulary and other variations with the writing systems in question. And identifying “Traditional Chinese” using a region has other cultural sensitivity problems…

     Another problem was the relative ease of registration for language tags compared to the action of the various ISO maintenance and registration bodies. Many of the registered tags were later deprecated due to standards action.

     A last problem I’ll mention here was instability in ISO 3166 (the region codes). Codes in ISO 3166 are changing all the time, which is not a surprise, given that countries are changing name, boundaries, and organization with some regularity. Alas, ISO 3166 doesn’t just remove old codes: they sometimes give them to a new country or region. So the language tag that today means “Serbian for Serbia and Montenegro” would have meant “Serbian for Czechoslovakia” a couple of decades ago.
  12. And More Problems
     • Lack of scripts
     • Little support for registered values in software
     • Reassignment of values by ISO 3166
     • Lack of consistent tag formation (Chinese dialects?)
     • Standards not readily available, bad references
     • Bad implementation assumptions
       – 1*8alphanum *[ "-" 1*8alphanum ]
       – 2*3ALPHA [ "-" 2ALPHA ]
     • Many registrations to cover small variations
       – 8 German registrations to cover two variations

     There were a few other problems, which I’ve listed here…
  13. LTRU and “draft-registry”
     • Defines a generative syntax
       – machine readable
       – future proof, extensible
     • Defines a single source
       – Stable subtags, no conflicts
       – Machine readable
     • Defines when to use subtags
       – (sometimes)

     So Mark and I started writing Internet-Drafts. Eventually, a Working Group was formed at the IETF called the Language Tag Registry Update or LTRU working group.

     Out of this working group comes a new RFC, which is the new BCP 47. As I write this the RFC has not yet been assigned a number, so it is called RFC 3066bis informally. It changes language tags in a number of interesting ways, while maintaining full compatibility with all existing tags.
  14. RFC 3066bis and LTRU
     • sl-Latn-IT-rozaj-x-mine
       – sl: ISO 639-1/2 (alpha2/3)
       – Latn: ISO 15924 script codes (alpha 4)
       – IT: ISO 3166 (alpha2) or UN M49
       – rozaj: Registered variants (any number)
       – x-mine: Private Use and Extension

     Here is an illustration of a new-style language tag.
  15. More Examples
     • es-419 (Spanish for Americas)
     • en-US (English for USA)
     • de-CH-1996 (Old tags are all valid)
     • sl-rozaj-nedis (Multiple variants)
     • zh-t-wadegile (Extensions) *

     Here are some more examples of language tags showing some of the interesting variations.

     es-419 makes use of the UN M.49 region codes to describe a language for a larger area than a country.

     de-CH-1996 was registered in the old IANA Language Tag Registry. It is still a valid tag.

     sl-rozaj-nedis is probably not a good tag choice, but illustrates that you can have more than a single variant in a well-formed tag. In this case, both -rozaj and -nedis are dialects of Slovenian (sl), but -nedis doesn’t include sl-rozaj in its registered list of prefixes, so this tag is probably meaningless.

     zh-t-wadegile is a hypothetical tag: if there were an extension for transliterations and if it were assigned the letter ‘t’, then one valid subtag might be ‘wadegile’.

     * Several well-informed people have cast doubt on the idea of a transliteration extension, not to mention the “wadegile” example shown.
  16. Benefits
     • Subtag registry in one place: one source.
     • Subtags identified by length/content
     • Extensible
     • Compatible with RFC 3066 tags
     • Stable: subtags are forever

     There are several benefits to switching over to RFC 3066bis.

     For the first time there is a single, authoritative source for subtags. It contains date versioning information, as well as information on the formation of useful tags. Instead of having to hunt through various versions of ISO 639, ISO 3166, ISO 15924, UN M.49 and the IANA registry, there is one source.

     It is machine readable and the entries are dated. There is even a mechanism for canonicalizing tags as they evolve.

     Inside a language tag, the subtags can be identified by length and content. Parsers do not have to have a copy of the registry to extract most of the information in a tag.

     There are several extension mechanisms. In particular, private use subtags can be used in otherwise public tags.

     The tags are all backwards compatible with RFC 3066. Any new tag would have been valid to register under previous versions of BCP 47. And all of the old tags are forwards compatible (although a few are only compatible via fiat).

     Finally: tags and subtags are stable. Forever.
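To illustrate how subtags can be identified by length and content alone, here is a rough parser sketch. It is a simplification under stated assumptions: the function name `parse_langtag` and its result layout are invented for this example, it assumes the tag is already well-formed ASCII, it ignores grandfathered registrations such as `i-klingon` and extended language subtags, and it never consults the registry, so it cannot tell whether a subtag is actually registered.

```python
def parse_langtag(tag: str) -> dict:
    """Split a new-style tag into parts using only subtag length and content."""
    subtags = tag.split("-")
    parts = {
        "language": subtags[0].lower(),
        "script": None,
        "region": None,
        "variants": [],
        "extensions": {},
        "private_use": [],
    }
    i = 1
    # Script: exactly four letters (ISO 15924), e.g. "Latn".
    if i < len(subtags) and len(subtags[i]) == 4 and subtags[i].isalpha():
        parts["script"] = subtags[i].title()
        i += 1
    # Region: two letters (ISO 3166) or three digits (UN M.49).
    if i < len(subtags) and (
        (len(subtags[i]) == 2 and subtags[i].isalpha())
        or (len(subtags[i]) == 3 and subtags[i].isdigit())
    ):
        parts["region"] = subtags[i].upper()
        i += 1
    # Variants: five to eight characters, or four starting with a digit.
    while i < len(subtags) and (
        5 <= len(subtags[i]) <= 8
        or (len(subtags[i]) == 4 and subtags[i][0].isdigit())
    ):
        parts["variants"].append(subtags[i].lower())
        i += 1
    # Extensions and private use are introduced by a single character.
    while i < len(subtags) and len(subtags[i]) == 1:
        singleton = subtags[i].lower()
        i += 1
        if singleton == "x":
            # Everything after "x" is private use.
            parts["private_use"] = [s.lower() for s in subtags[i:]]
            i = len(subtags)
            break
        start = i
        while i < len(subtags) and len(subtags[i]) > 1:
            i += 1
        parts["extensions"][singleton] = [s.lower() for s in subtags[start:i]]
    if i != len(subtags):
        raise ValueError(f"cannot place subtag {subtags[i]!r} in {tag!r}")
    return parts
```

Applied to `sl-Latn-IT-rozaj-x-mine`, the sketch separates language `sl`, script `Latn`, region `IT`, the variant `rozaj`, and the private use subtag `mine` without any lookup table.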
  17. Problems
     • Matching
       – Does “en-US” match “en-Latn-US”?
     • Tag Choices
       – Users have more to choose from.
     • Implementations
       – More to do, more to think about
       – (easier to parse, process, support the good stuff)

     The creation of the new format does create a few problems for users and implementers, though.

     In particular, there are now more choices for how to form the generative language tags.

     Matching of tags is a particular issue we’ll cover in more depth in a second.

     Users have more choices available, so implementations and guidelines are going to be necessary to help people decide what’s best for them.

     Software implementations will have to do several things. Of course, they’ll have to be modified to be either well-formed or validating processors. The good news here is that the tag syntax is more deterministic and thus more amenable to parsing. And there is a data source that can easily be incorporated into code. The bad news is that some badly-written implementations are going to break and that developers need to go back and evaluate their software.
  18. Tag Matching
     • Uses “Language Ranges” to select sets of content according to the language tag
     • Four Schemes
       – Basic Filtering
       – Extended Filtering
       – Scored Filtering
       – Lookup

     The remaining work of LTRU relates to matching and selecting content based on language tags. This has some impact on implementations, which need to guide users in selection of the most appropriate tags.

     Tag matching depends on language ranges, which are identifiers that people use to specify what they are looking for or wish to match. Ranges select sets of tags.

     The current version of the Internet-Draft on matching describes four types of matching in two categories (filtering and lookup).
  19. Filtering
     • Ranges specify the least specific item
       – “en” matches “en”, “en-US”, “en-Brai”, “en-boont”
     • Basic matching uses plain prefixes
     • Extended matching can match “inside bits”
       – “en-*-US”

     Filtering is one type of matching. In filtering, the range specifies the least specific item that constitutes a match. For example, if I specify a range of “de-CH”, all content in the matching set must include the language “de” (German) and the region “CH” (Switzerland) in its tags.
     • “Basic filtering” is strict prefix matching. That is, range “de-CH” matches tags “de-CH” and “de-CH-1996” but not “de-Brai-CH”, “de”, or “de-Latn-CH-1996”.
     • In “extended filtering”, ranges can match missing elements. Thus “de-*-CH” would match all of the foregoing examples except “de”.
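Both kinds of filtering can be sketched in a few lines. The function names `basic_filter` and `extended_filter` are made up for this example. The extended version follows the subtag-by-subtag procedure as it was later published in RFC 4647; the matching draft discussed in this talk may differ in detail, so treat this as an assumption rather than the draft's exact algorithm.

```python
def basic_filter(language_range: str, tag: str) -> bool:
    """Basic filtering: the range must be a whole-subtag prefix of the tag."""
    r, t = language_range.lower(), tag.lower()
    return r == "*" or t == r or t.startswith(r + "-")


def extended_filter(language_range: str, tag: str) -> bool:
    """Extended filtering: range subtags may skip over subtags in the tag."""
    r = language_range.lower().split("-")
    t = tag.lower().split("-")
    # The first subtags must agree unless the range starts with a wildcard.
    if r[0] != "*" and r[0] != t[0]:
        return False
    ti = 1
    for sub in r[1:]:
        if sub == "*":
            continue  # a wildcard matches any run of subtags, even none
        while True:
            if ti >= len(t):
                return False  # ran out of tag subtags
            if t[ti] == sub:
                ti += 1
                break
            if len(t[ti]) == 1:
                return False  # never skip past an extension or private use
            ti += 1
    return True
```

With these definitions, the range `de-CH` under basic filtering accepts `de-CH` and `de-CH-1996` only, while `de-*-CH` under extended filtering also accepts `de-Brai-CH` and `de-Latn-CH-1996`, and both reject plain `de`.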
  20. Scored Filtering
     • Assigns a “weight” or “score” to each match
     • Result set is ordered by match quality
     • Postulated by John Cowan

     Scored filtering, which was first postulated by John Cowan, assigns a weight or score to each potential range-to-tag match. Unlike the other two forms of filtering, scored filtering results in an ordered set of matching tags. This might be useful with search results, for example.
  21. Lookup
     • Range specifies the most specific tag in a match.
       – “en-US” matches “en” and “en-US” but not “en-US-boont”
     • Mirrors the locale fallback mechanism and many language negotiation schemes.

     The other form of matching is called lookup. In lookup, the user specifies the most specific tag that represents a match. The lookup algorithm is for use in cases where the user wants exactly one item returned for each request. Loading software resources is an example of this kind of matching.

     (Demo of all matching types)
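Lookup can be sketched as progressive truncation of the range until one of the available tags matches. The function name `lookup` and its `default` parameter are assumptions for this example, and the truncation rule follows the approach later published in RFC 4647 rather than any particular draft revision.

```python
def lookup(language_range: str, available: list, default=None):
    """Return the single best available tag for the range, or the default."""
    by_lower = {tag.lower(): tag for tag in available}
    subtags = language_range.lower().split("-")
    while subtags:
        candidate = "-".join(subtags)
        if candidate in by_lower:
            return by_lower[candidate]
        # Fall back by dropping the last subtag, like locale fallback does.
        subtags.pop()
        # Do not leave a dangling single-character subtag such as "x".
        if subtags and len(subtags[-1]) == 1:
            subtags.pop()
    return default
```

For the range `en-US`, this returns `en-US` if it is available, falls back to `en` otherwise, and never selects the more specific `en-US-boont`.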
  22. What Do I Do (Content Author)?
     • Not much.
       – Existing tags are all still valid: tagging is mostly unchanged.
       – Resist temptation to (ab)use the private use subtags.
     • Unless your language has script variations:
       – Tag content with the appropriate script subtag(s)
     • Script subtags only apply to a small number of languages: “zh”, “sr”, “uz”, “az”, “mn”, and a very small number of others.
  23. What Do I Do (Programmer)?
     • Check code for compliance with 3066bis
       – Decide on well-formed or validating
       – Implement suppress-script
       – Change to using the registry
       – Bother infrastructure folks (Java, MS, Mozilla, etc.) to implement the standard
  24. What Do I Do (End-User)?
     • Check and update your language ranges.
     • Tag content wisely.
  25. LTRU Milestone Dates
     • (Done) RFC 3066bis
       – Registry went live in December 2005
     • Produce “Matching” RFC
       – Draft-04 available
     • (Anticipated) Produce RFC 3066ter
       – This includes ISO 639-3 support, extended language subtags, and possibly ISO 639-6
  26. Things to Read
     • Registry Draft
       – http://www.inter-locale.com
       – http://www.ietf.org/internet-drafts/draft-ietf-ltru-registry-12.txt
     • Matching Draft
       – http://www.inter-locale.com
     • LTRU Mailing List
       – https://www1.ietf.org/mailman/listinfo/ltru
  27. Things to Do (languages)
     • Get involved in LTRU
     • Get involved in W3C I18N Core WG!
     • Write implementations
     • Work on adoption of 3066bis: understand the impact
     • Then get involved with Locale identifiers…
  28. Back to Locales…
     • IUC 20 Round Table
     • Suzanne Topping’s Multilingual Article
     • Tex Texin and the Locales list…

     So we’ve done a deep dive into Language Tags, whereas my point of entry was locale identifiers. What’s going on with locales?

     Back at IUC20 (see, it pays to go to these events!) there was a round table in which there was a discussion of problems confronting the Web. Language tags and locale identifiers were among the key topics discussed at this round table, apparently. I say “apparently”, because I left the conference before the round table. I read about the results on the W3C website and in an article by Suzanne Topping in Multilingual magazine. What I read there surprised and dismayed me.

     A few weeks later, I found that others in the community were working on locales or, rather, on rubbishing locales. Tex Texin started a list (now defunct) for discussing the problem.

     I got involved in thinking about the problem.
  29. Locale Identifiers and Web Services

     Fundamentally, my interest stemmed from the fact that I was working on Web services. Web services are supposed to define a platform-agnostic way to expose logic or functionality in a distributed fashion. By using XML and HTTP, it was hoped that Web services could provide a standards-oriented way to accomplish what CORBA or EAI vendors had been providing in a proprietary fashion previously.

     The problem I was grappling with was: “how do you internationalize a Web service?”

     Web services have all the same requirements any distributed system has: they have messages, data, text, and potentially cultural, regional, or other issues in them. In our programming environments we have a ready solution for addressing these problems. These often hinge on the locale. And the locale hinges on the user’s preferences in the matter.

     We have standard language identifiers. We don’t have standard locale anything. What to do?

     There were (and are) three schools of thought.

     On the one hand are the identifier folks (such as myself) who think that if we had a general locale-and/or-international-preferences-ID-mechanism, each vendor would implement it in a manner consistent with their existing language/platform and everything would work pretty well.

     On the other hand are the locale definition folks (such as Mark Davis) who think that if we all agreed to use the same locale data and locale data structures, then we could exchange identifiers and get the same results because everything is the same.

     On the left foot are the folks who think locales are just a bad idea and ought to be placed in the nearest landfill or entombed in concrete, Chernobyl-style.
  30. W3C and Unicode
     • W3C
       – Identifiers and cross-over with language tags
       – Web services
       – XML, HTML
     • Unicode Consortium
       – LDML
       – CLDR
       – Standards for content

     Two standards organizations that are working in the area of locales and locale identifiers are the W3C (Internationalization Core Working Group) and the Unicode Consortium (the Common Locale Data Repository project).

     The W3C is, of course, directly concerned with the use and implementation of language tags in document formats and technologies. In addition, the need for locale identification for Web services is a specific work item for the I18n working group.

     The Unicode folks are working to build a standardized, comprehensive set of locale data.
  31. “Language Tags and Locale Identifiers” SPEC
     • First Working Draft coming soon
       – URIs?
       – Simple tags?

     The W3C is currently working on a pair of specifications (W3C-ese for “standards track documents”). The first is called “Language Tags and Locale Identifiers”, which, as its name says, has to do with actually creating locale identifiers, as well as providing implementation guidelines for RFC 3066bis and draft-matching.

     There are questions about how a locale identifier should be structured. Several ideas are currently floating around. For example, URIs might be used. Or 3066bis tags might be “extended” in some way.
  32. WS-I18N SPEC
     • First Working Draft now available:
       – http://www.w3.org/TR/ws-i18n

     The second spec that the W3C is working on is the WS-I18N spec, or “Web Services Internationalization”. This spec relies on the preceding document for locale identifiers and describes how to use locales with Web services technologies. Previous work by the W3C I18N WG in this area includes requirements and usage scenarios.
  33. Ideas?