Internationalizing Twitter
 

Matt Sanford's talk about Twitter's i18n/L10n at the International Multi-lingual User's Group on 2010-05-20. (imug.org)

Internationalizing Twitter: Presentation Transcript

  • Internationalizing Twitter ‣ Matt Sanford @ IMUG // 2010-05-20
  • Feedback ‣ Hashtag: #IMUG408 ‣ @Reply me: @mzsanford ‣ Email me: matt@twitter.com ‣ Or, talk to me afterward. ‣ "It's real, human interaction. It ain't gonna hurt you." - @mchammer
  • Agenda ‣ Twitter’s Non-US Popularity ‣ Growth & Localization ‣ Case Studies: Chile & Japan ‣ Community Translation ‣ The Good, The Bad & The Ugly ‣ Technical Hurdles
  • Mr. Popular (almost): Non-US Growth for Twitter.
  • International: 60+% of all accounts [Chart: share of accounts, 0% to 100%, June 2009 to March 2010]
  • Case Study: Chile We’re There When People Need Us.
  • Twitter Signups in Chile [Chart: signups, February 21st to March 2nd] ‣ Tweet, 10:50 AM Mar 2nd via web: “URGENTE en Constitución apareció IVAN LARA DE 8 AÑOS QUE ESTÁ ABANDONADO en esa ciudad...busca parientes en todo Chile favor copiar y pegar” ‣ Translation: “Urgent. In Constitucion an eight-year old boy named Ivan Lara showed up alone. He's looking for his family”
  • Case Study: Japan Not Godzilla Big, But We’re Working On It
  • Daily Tweeters in Japan [Chart: daily tweeters, July ‘09 to April ‘10]
  • Japanese Mobile Follow Me Localizing is more than translation
  • Mobile in Japan: Galapagos Phones ‣ We have a special mobile web site ‣ Emoji support ‣ No cookies ‣ Image conversion ‣ Designed with Japanese expectations in mind ‣ In addition to Android & iPhone clients, we’re working with carriers on integrated clients
  • Community Translation Why we chose it. How we do it. What works. What doesn’t.
  • Don’t Panic ‣ “If you know the enemy and know yourself, you need not fear the result of a hundred battles. If you know yourself but not the enemy, for every victory gained you will also suffer a defeat. If you know neither the enemy nor yourself, you will succumb in every battle.” - Sun Tzu, The Art of War
  • Why Community Translation? ‣ We didn’t have a budget ‣ We were/are a small, Open Source based business ‣ We had a large number of willing volunteers ‣ We’re committed to user involvement ‣ We have a very specific tone and vocabulary ‣ We had already tried direct translation and it didn’t mesh well with our release cycle ‣ Twitter.com is deployed several times a day
  • How We Do Community Translation
  • Community Translation Stats Translators: 2,600 Strings: 3,7000 Translations: 480,000 Average /Translator: 184
  • What Works Well? ‣ In-line translation (added context) ‣ Multi-level voting ‣ Discussion groups for user input ‣ French “follow” for example: ‣ Mouton – Sheep ‣ Suiveur – Stalkers ‣ Adepte – Followers
  • What Works Less Well? ‣ Turn around time ‣ Long, difficult strings are often skipped ‣ Inconsistent wording choices ‣ Sensitive content, such as email notices ‣ Pre-launch project disclosure ‣ Management of the groups takes some resources
  • Technical Hurdles Internationalizing is more than GetText* * but you already knew that.
  • Character Counting “If you base a product on a character count, you better get it right” – @mzsanford ‣ Don’t count bytes: U+5473 is 0xE5 0x91 0xB3 in UTF-8 (3 bytes) and 0x54 0x73 in UTF-16 (2 bytes), but 1 character to a human ‣ Don’t even count Unicode code points: é can be the pair {U+0065, U+0301} (e + combining acute accent) or the single code point U+00E9 ‣ We try to count the shortest representation (Unicode NFC form; see http://unicode.org/reports/tr15/)
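The counting rules on this slide can be sketched in a few lines of Python. This is only an illustration of counting code points after NFC normalization, using the standard `unicodedata` module; the function name `tweet_length` is an assumption made for the example, not Twitter's actual implementation.

```python
import unicodedata


def tweet_length(text: str) -> int:
    """Count code points after NFC normalization.

    NFC composes sequences such as e + U+0301 into U+00E9 where a
    precomposed form exists, so both spellings of "é" count as 1.
    """
    return len(unicodedata.normalize("NFC", text))


kanji = "\u5473"        # U+5473, the example from the slide
decomposed = "e\u0301"  # U+0065 followed by U+0301 (combining acute)
precomposed = "\u00e9"  # U+00E9

# Byte counts depend on the encoding, so they are the wrong unit.
utf8_bytes = len(kanji.encode("utf-8"))       # 3
utf16_bytes = len(kanji.encode("utf-16-be"))  # 2 (big-endian, no BOM)

# Raw code point counts differ between the two spellings of "é" ...
raw_lengths = (len(decomposed), len(precomposed))  # (2, 1)

# ... but the NFC-normalized counts agree.
nfc_lengths = (tweet_length(decomposed), tweet_length(precomposed))  # (1, 1)
```

NFC does not help for sequences that have no precomposed form, so this still counts code points rather than what a reader perceives as characters.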
  • Tweet Processing (part 1) ‣ Auto linking ‣ Japanese, for example, has no spaces. ‣ We’ve worked out a solution that balances how people use Twitter with complete correctness ‣ We’ve open sourced our solution ‣ Language identification ‣ Traditional methods rely on more text ‣ Tweets also have a vocabulary of their own (tw*)
  • Tweet Processing (part 2) ‣ Searching Tweets ‣ Per-language tokenizing is difficult given the language identification challenges ‣ Average Tweet length varies noticeably by language ‣ Trends ‣ Finding entities in Tweets requires either NLP (which is highly language dependent) or pure statistical analysis (which can produce poor quality trends) ‣ All of this is harder given the very-short nature of Tweets
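One common way to sidestep per-language tokenizing is to index overlapping character bigrams, which needs neither word boundaries nor a language guess. The sketch below shows that generic technique only; the talk does not say Twitter search works this way, and the names `char_bigrams` and `matches` are invented for the example.

```python
from typing import List


def char_bigrams(text: str) -> List[str]:
    """Split on whitespace, then emit overlapping 2-character tokens.

    Text with no spaces (Japanese, for example) becomes one chunk and
    is still broken into searchable units.
    """
    tokens: List[str] = []
    for chunk in text.split():
        if len(chunk) < 2:
            tokens.append(chunk)
        else:
            tokens.extend(chunk[i:i + 2] for i in range(len(chunk) - 1))
    return tokens


def matches(query: str, document: str) -> bool:
    """True if every bigram of the query also occurs in the document."""
    return set(char_bigrams(query)) <= set(char_bigrams(document))


tokens = char_bigrams("東京タワー")
found = matches("東京", "東京タワーに行った")
```

Bigram matching trades precision for coverage: it can return false positives, which is one reason language-aware tokenizing is still attractive when the language can be identified reliably.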
  • Other Technical Lessons ‣ Ruby 1.8 Unicode support is lacking ‣ MySQL before v6.0 doesn’t allow all Unicode characters ‣ And 6.0 died in Alpha ‣ Memcached keys only support a subset of characters ‣ You can either validate or encode ‣ Unicode security is a real thing ‣ Directional change spoofing attack
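Two of the lessons above lend themselves to a short sketch: encoding arbitrary text into a memcached-safe key, and flagging the directional control characters used in spoofing. Both functions are hypothetical helpers written for this example, not code from the talk. The 250-byte limit and the ban on whitespace and control characters come from memcached's text protocol; falling back to a SHA-1 digest for long keys is an assumption.

```python
import hashlib
from urllib.parse import quote

MAX_KEY_BYTES = 250  # memcached key length limit

# Explicit directional formatting characters (embeddings, overrides,
# isolates) plus the left-to-right and right-to-left marks.
BIDI_CONTROLS = frozenset(
    "\u202a\u202b\u202c\u202d\u202e"  # LRE, RLE, PDF, LRO, RLO
    "\u2066\u2067\u2068\u2069"        # LRI, RLI, FSI, PDI
    "\u200e\u200f"                    # LRM, RLM
)


def safe_cache_key(raw: str) -> str:
    """Encode any Unicode string as an ASCII-only memcached key.

    Percent-encoding the UTF-8 bytes removes spaces, control
    characters and non-ASCII text; overly long keys are hashed.
    """
    encoded = quote(raw, safe="")
    if len(encoded) > MAX_KEY_BYTES:
        encoded = "sha1:" + hashlib.sha1(raw.encode("utf-8")).hexdigest()
    return encoded


def has_bidi_controls(text: str) -> bool:
    """True if the text contains a directional control character."""
    return any(ch in BIDI_CONTROLS for ch in text)


key = safe_cache_key("user name 日本語")
suspicious = has_bidi_controls("photo\u202egpj.exe")
```

Encoding rather than validating keeps every input usable as a key. Whether to reject or strip text that contains directional controls is a product decision this sketch leaves open.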
  • Questions/Answers
  • Appendix Slides: There’s more data where that came from …
  • Our Translation Back-End ‣ Based on the ruby FastGettext library ‣ Custom back-end ‣ Re-loaded at process start-up (~2 hours) ‣ Data is stored in memcached ‣ Loaded into memcached from our database ‣ No engineer needed to deploy ‣ Completely self-managed
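The shape of such a back-end can be sketched as a lookup table that is loaded from the database into a cache and queried gettext-style, falling back to the source string when no translation exists. The slide describes a Ruby FastGettext back-end; the Python version below is only an analogy. The class name, the row format and the sample translation are assumptions, and a plain dict stands in for memcached.

```python
from typing import Dict, Iterable, Tuple

Row = Tuple[str, str, str]  # (locale, source string, translation)


class TranslationStore:
    """Hypothetical cache-backed translation lookup."""

    def __init__(self, database_rows: Iterable[Row]) -> None:
        self._rows = list(database_rows)               # stand-in for the database
        self._cache: Dict[Tuple[str, str], str] = {}   # stand-in for memcached

    def reload(self) -> None:
        """Copy every translation from the database into the cache."""
        self._cache = {
            (locale, msgid): text for locale, msgid, text in self._rows
        }

    def translate(self, locale: str, msgid: str) -> str:
        """Return the translation, or the source string if none exists."""
        return self._cache.get((locale, msgid), msgid)


# Illustrative data only.
store = TranslationStore([("es", "Settings", "Configuración")])
store.reload()

translated = store.translate("es", "Settings")
fallback = store.translate("ja", "Settings")
```

Because translations live in data rather than in deployed files, approved translations can go live on the next reload without an engineer, which matches the "completely self-managed" point on the slide.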
  • Twitter Text Libraries ‣ Provides extraction and auto-linking ‣ @user, @user/list, #hashtag, URLs ‣ Open Source* ‣ Available in Ruby and Java from Twitter ‣ Conformance Testing Data ‣ Modeled after the Unicode conformance suite ‣ YAML description of test cases for any language ‣ Assurance that you meet the same standards ‣ Many non-English test cases