Internationalizing Twitter

Matt Sanford's talk about Twitter's i18n/L10n at the International Multi-lingual User's Group on 2010-05-20. (imug.org)

Published in: Technology, Business

  • Internationalizing Twitter

    1. 1. Internationalizing Twitter ‣ Matt Sanford @ IMUG // 2010-05-20
    2. 2. Feedback ‣ Hashtag: #IMUG408 ‣ @Reply me: @mzsanford ‣ Email me: matt@twitter.com ‣ Or, talk to me afterward.
    3. 3. Feedback ‣ Hashtag: #IMUG408 ‣ @Reply me: @mzsanford ‣ Email me: matt@twitter.com ‣ Or, talk to me afterward. "It's real, human interaction. It ain't gonna hurt you." - @mchammer
    4. 4. Agenda ‣ Twitter’s Non-US Popularity ‣ Growth & Localization ‣ Case Studies: Chile & Japan ‣ Community Translation ‣ The Good, The Bad & The Ugly ‣ Technical Hurdles
    5. 5. Mr. Popular Non-US Growth for Twitter.
    6. 6. Mr. Popular (almost) Non-US Growth for Twitter.
    7. 7. International: 60+% of all accounts [Chart: y-axis 0% to 100%; x-axis June 2009 to March 2010]
    9. 9. Case Study: Chile We’re There When People Need Us.
    10. 10. Twitter Signups in Chile [Chart: x-axis February 21st to March 2nd]
    11. 11. Twitter Signups in Chile [Chart: x-axis February 21st to March 2nd] ‣ Tweet: “URGENTE en Constitución apareció IVAN LARA DE 8 AÑOS QUE ESTÁ ABANDONADO en esa ciudad...busca parientes en todo Chile favor copiar y pegar” (10:50 AM Mar 2nd via web) ‣ Translation: “Urgent. In Constitucion an eight-year old boy named Ivan Lara showed up alone. He's looking for his family” (10:50 AM Mar 2nd via web)
    12. 12. Case Study: Japan Not Godzilla Big, But We’re Working On It
    13. 13. Daily Tweeters in Japan [Chart: x-axis July ‘09 to April ‘10]
    14. 14. Japanese Mobile Follow Me Localizing is more than translation
    18. 18. Mobile in Japan: Galapagos Phones ‣ We have a special mobile web site ‣ Emoji support ‣ No cookies ‣ Image conversion ‣ Designed with Japanese expectations in mind ‣ In addition to Android & iPhone clients, we’re working with carriers on integrated clients
    21. 21. Community Translation Why we chose it. How we do it. What works. What doesn’t.
    22. 22. Don’t Panic
    23. 23. Don’t Panic “If you know the enemy and know yourself, you need not fear the result of a hundred battles. If you know yourself but not the enemy, for every victory gained you will also suffer a defeat. If you know neither the enemy nor yourself, you will succumb in every battle.” – Sun Tzu, The Art of War
    24. 24. Why Community Translation? ‣ We didn’t have a budget ‣ We were/are a small, Open Source based business ‣ We had a large number of willing volunteers ‣ We’re committed to user involvement ‣ We have a very specific tone and vocabulary ‣ We had already tried direct translation and it didn’t mesh well with our release cycle ‣ Twitter.com is deployed several times a day
    25. 25. How We Do Community Translation
    29. 29. Community Translation Stats ‣ Translators: 2,600 ‣ Strings: 3,7000 ‣ Translations: 480,000 ‣ Average/Translator: 184
    30. 30. What Works Well? ‣ In-line translation (added context) ‣ Multi-level voting ‣ Discussion groups for user input ‣ French “follow” for example: ‣ Mouton – Sheep ‣ Suiveur – Stalkers ‣ Adepte – Followers
    31. 31. What Works Less Well? ‣ Turn around time ‣ Long, difficult strings are often skipped ‣ Inconsistent wording choices ‣ Sensitive content, such as email notices ‣ Pre-launch project disclosure ‣ Management of the groups takes some resources
    32. 32. Technical Hurdles Internationalizing is more than GetText* * but you already knew that.
    33. 33. Character Counting “If you base a product on a character count, you better get it right” – @mzsanford
    34. 34. Character Counting “If you base a product on a character count, you better get it right” – @mzsanford ‣ Don’t count bytes ‣ Example: 味 (U+5473) ‣ UTF-8: 0xE5 0x91 0xB3 (3 bytes) ‣ UTF-16: 0x54 0x73 (2 bytes) ‣ Human: 1 character
    35. 35. Character Counting “If you base a product on a character count, you better get it right” – @mzsanford ‣ Don’t count bytes ‣ Example: 味 (U+5473) ‣ UTF-8: 0xE5 0x91 0xB3 (3 bytes) ‣ UTF-16: 0x54 0x73 (2 bytes) ‣ Human: 1 character ‣ Don’t even count Unicode code points ‣ e (U+0065) + combining acute accent (U+0301) = é {U+0065, U+0301} OR é (U+00E9)
    36. 36. Character Counting “If you base a product on a character count, you better get it right” – @mzsanford ‣ Don’t count bytes ‣ Example: 味 (U+5473) ‣ UTF-8: 0xE5 0x91 0xB3 (3 bytes) ‣ UTF-16: 0x54 0x73 (2 bytes) ‣ Human: 1 character ‣ Don’t even count Unicode code points ‣ e (U+0065) + combining acute accent (U+0301) = é {U+0065, U+0301} OR é (U+00E9) ‣ We try to count the shortest representation* ‣ * Unicode NFC form. See: http://unicode.org/reports/tr15/
    37. 37. Tweet Processing (part 1) ‣ Auto linking ‣ Japanese, for example, has no spaces. ‣ We’ve worked out a solution that balances how people use Twitter with complete correctness ‣ We’ve open-sourced our solution ‣ Language identification ‣ Traditional methods rely on more text ‣ Tweets also have a vocabulary of their own (tw*)
    38. 38. Tweet Processing (part 2) ‣ Searching Tweets ‣ Per-language tokenizing is difficult given the language identification challenges ‣ Average Tweet length varies noticeably by language ‣ Trends ‣ Finding entities in Tweets requires either NLP (which is highly language dependent) or pure statistical analysis (which can produce poor quality trends) ‣ All of this is harder given the very-short nature of Tweets
    39. 39. Other Technical Lessons ‣ Ruby 1.8 Unicode support is lacking ‣ MySQL before v6.0 doesn’t allow all Unicode characters ‣ And 6.0 died in Alpha ‣ Memcached keys only support a subset of characters ‣ You can either validate or encode ‣ Unicode security is a real thing ‣ Directional change spoofing attack
    40. 40. Questions/Answers
    41. 41. Appendix Slides There’s more data where that came from …
    42. 42. Our Translation Back-End ‣ Based on the Ruby FastGettext library ‣ Custom back-end ‣ Re-loaded at process start-up (~2 hours) ‣ Data is stored in memcached ‣ Loaded into memcached from our database ‣ No engineer needed to deploy ‣ Completely self-managed
    43. 43. Twitter Text Libraries ‣ Provides extraction and auto-linking ‣ @user, @user/list, #hashtag, URLs ‣ Open Source* ‣ Available in Ruby and Java from Twitter ‣ Conformance Testing Data ‣ Modeled after the Unicode conformance suite ‣ YAML description of test cases for any language ‣ Assurance that you meet the same standards ‣ Many non-English test cases
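
The Character Counting slides (33 to 36) say to count neither bytes nor raw code points, but the "shortest representation", with a footnote pointing at Unicode NFC. A minimal sketch of that idea in Python follows. The function name `count_characters` is hypothetical, and reading "shortest representation" as "code points after NFC normalization" is an assumption drawn from the footnote, not Twitter's actual implementation.

```python
import unicodedata


def count_characters(text: str) -> int:
    """Count code points after NFC normalization.

    Assumption: this is what the slide means by "shortest representation".
    """
    return len(unicodedata.normalize("NFC", text))


kanji = "\u5473"        # the slide's example character, U+5473
decomposed = "e\u0301"  # 'e' followed by a combining acute accent
composed = "\u00e9"     # precomposed e-acute

utf8_bytes = len(kanji.encode("utf-8"))       # 3, as on the slide
utf16_bytes = len(kanji.encode("utf-16-be"))  # 2, as on the slide

# Raw code point counts disagree for the two spellings of e-acute...
raw_counts = (len(decomposed), len(composed))
# ...but the normalized counts agree.
nfc_counts = (count_characters(decomposed), count_characters(composed))
```

NFC only helps where a precomposed form exists. A sequence such as `"q\u0301"` has no single-code-point equivalent, so this sketch still counts it as two.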
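
Slide 39 notes that memcached keys only support a subset of characters ("validate or encode") and names the directional change spoofing attack. The sketch below illustrates the "encode" option and a simple check for directional control characters. All names here (`encode_cache_key`, `has_bidi_controls`, the `sha256:` prefix) are hypothetical choices for illustration. The 250-byte key limit and the ban on whitespace and control characters in keys come from memcached's protocol documentation, not from the talk.

```python
import hashlib
from urllib.parse import quote

MAX_KEY_BYTES = 250  # memcached's documented key length limit


def encode_cache_key(raw: str) -> str:
    """Percent-encode a key to printable ASCII; hash it if it grows too long."""
    # Non-ASCII text becomes %XX escapes of its UTF-8 bytes; a space becomes %20.
    encoded = quote(raw, safe="")
    if len(encoded) > MAX_KEY_BYTES:
        # ':' never survives quote(..., safe=""), so this prefix cannot
        # collide with a percent-encoded key.
        encoded = "sha256:" + hashlib.sha256(raw.encode("utf-8")).hexdigest()
    return encoded


# Explicit directional formatting characters: embeddings, overrides, isolates.
BIDI_CONTROLS = set("\u202a\u202b\u202c\u202d\u202e\u2066\u2067\u2068\u2069")


def has_bidi_controls(text: str) -> bool:
    """Report whether text contains a character that can flip display order."""
    return any(ch in BIDI_CONTROLS for ch in text)
```

For example, `has_bidi_controls("photo\u202egpj.exe")` flags a name that some renderers would display with its tail reversed. Whether to reject, strip, or escape such input is a product decision the slides do not settle.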
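
Slide 42 describes a translation back-end whose data is loaded from a database into memcached and re-loaded without a deploy. A rough sketch of that flow follows, in Python rather than the Ruby the slide mentions. Plain dictionaries stand in for both the database and memcached, `TranslationStore` is a hypothetical name, and falling back to the source string when no translation exists is an assumption based on usual gettext behavior.

```python
from typing import Dict, Tuple

Key = Tuple[str, str]  # (locale, source string)


class TranslationStore:
    """Serve translations from a cache that is refilled from a database."""

    def __init__(self, database: Dict[Key, str]) -> None:
        self._database = database          # stand-in for the database table
        self._cache: Dict[Key, str] = {}   # stand-in for memcached

    def reload(self) -> int:
        """Copy every translation into the cache; return how many were loaded."""
        self._cache = dict(self._database)
        return len(self._cache)

    def translate(self, locale: str, msgid: str) -> str:
        """Return the cached translation, or the source string if none exists."""
        return self._cache.get((locale, msgid), msgid)


database = {("es", "Hello"): "Hola"}
store = TranslationStore(database)
store.reload()
greeting = store.translate("es", "Hello")
```

Because lookups only touch the cache, new translations become visible on the next `reload()`, which is one way to read "no engineer needed to deploy".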
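
Slide 43 lists what the Twitter Text libraries extract (@user, @user/list, #hashtag, URLs) and describes conformance data as per-case descriptions that any implementation can run. The sketch below shows the shape of that idea. The regular expressions are deliberately simplified and are NOT the patterns shipped in the libraries, URL extraction is left out, and the case layout (`suite`, `description`, `text`, `expected`) is a hypothetical stand-in for the YAML format.

```python
import re

# Simplified, illustrative patterns; the real libraries handle many more cases.
MENTION = re.compile(r"@([A-Za-z0-9_]{1,20})(/[A-Za-z][A-Za-z0-9_-]{0,24})?")
HASHTAG = re.compile(r"#(\w+)")  # \w is Unicode-aware for str patterns


def extract_mentions(text):
    return [m.group(1) for m in MENTION.finditer(text)]


def extract_lists(text):
    return [m.group(1) + m.group(2) for m in MENTION.finditer(text) if m.group(2)]


def extract_hashtags(text):
    return [m.group(1) for m in HASHTAG.finditer(text)]


EXTRACTORS = {
    "mentions": extract_mentions,
    "lists": extract_lists,
    "hashtags": extract_hashtags,
}

# Each case mirrors one entry of a conformance file.
CASES = [
    {"suite": "mentions", "description": "mention followed by Japanese text",
     "text": "@mzsanford\u3055\u3093\u3078", "expected": ["mzsanford"]},
    {"suite": "lists", "description": "user/list reference",
     "text": "see @twitter/team", "expected": ["twitter/team"]},
    {"suite": "hashtags", "description": "hashtag with no surrounding spaces",
     "text": "\u4eca\u65e5\u306f#\u65e5\u672c\u8a9e",
     "expected": ["\u65e5\u672c\u8a9e"]},
]


def run_cases(cases):
    """Return (description, actual) for every case whose output differs."""
    failures = []
    for case in cases:
        actual = EXTRACTORS[case["suite"]](case["text"])
        if actual != case["expected"]:
            failures.append((case["description"], actual))
    return failures
```

An empty list from `run_cases(CASES)` means the extractor agrees with the shared expectations. This sketch has known gaps: it would treat the domain in an email address as a mention, and in unspaced Japanese text a hashtag runs to the next non-word character.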
