2. Feedback
‣ Hashtag: #IMUG408
‣ @Reply me: @mzsanford
‣ Email me: matt@twitter.com
‣ Or, talk to me afterward.
"It's real, human interaction. It ain't gonna hurt you. "
- @mchammer
4. Agenda
‣ Twitter’s Non-US Popularity
‣ Growth & Localization
‣ Case Studies: Chile & Japan
‣ Community Translation
‣ The Good, The Bad & The Ugly
‣ Technical Hurdles
10. Twitter Signups in Chile
[Chart: Twitter signups in Chile; x-axis: February 21st, February 24th, February 27th, March 2nd]
Original Tweet: "URGENTE en Constitución apareció IVAN LARA DE 8 AÑOS QUE ESTÁ ABANDONADO en esa ciudad...busca parientes en todo Chile favor copiar y pegar"
10:50 AM Mar 2nd via web
Translation: "Urgent. In Constitucion an eight-year old boy named Ivan Lara showed up alone. He's looking for his family"
10:50 AM Mar 2nd via web
18. Mobile in Japan: Galapagos Phones
‣ We have a special mobile web site
‣ Emoji support
‣ No cookies
‣ Image conversion
‣ Designed with Japanese expectations in mind
‣ In addition to Android & iPhone clients, we’re
working with carriers on integrated clients
23. Don’t Panic
“If you know the enemy and know yourself, you need not
fear the result of a hundred battles. If you know yourself
but not the enemy, for every victory gained you will also
suffer a defeat. If you know neither the enemy nor yourself,
you will succumb in every battle.”
- Sun Tzu, The Art of War
24. Why Community Translation?
‣ We didn’t have a budget
‣ We were, and still are, a small business built on Open Source
‣ We had a large number of willing volunteers
‣ We’re committed to user involvement
‣ We have a very specific tone and vocabulary
‣ We had already tried direct translation and it didn’t mesh well
with our release cycle
‣ Twitter.com is deployed several times a day
29. Community Translation Stats
Translators: 2,600
Strings: 3,7000
Translations: 480,000
Average translations per translator: 184
30. What Works Well?
‣ In-line translation (added context)
‣ Multi-level voting
‣ Discussion groups for user input
‣ French “follow” for example:
‣ Mouton – Sheep
‣ Suiveur – Stalkers
‣ Adepte – Followers
31. What Works Less Well?
‣ Turnaround time
‣ Long, difficult strings are often skipped
‣ Inconsistent wording choices
‣ Sensitive content, such as email notices
‣ Pre-launch project disclosure
‣ Management of the groups takes some resources
34. Character Counting
“If you base a product on a character count, you better get it right”
– @mzsanford
Don’t count bytes
U+5473 (味)
UTF-8: 0xE5 0x91 0xB3 (3 bytes)
UTF-16: 0x54 0x73 (2 bytes)
Human: 1 character
Don’t even count Unicode code points
e (U+0065) + combining acute accent (U+0301) = é {U+0065, U+0301}
OR
é (U+00E9)
We try to count the shortest representation*
* Unicode NFC form. See: http://unicode.org/reports/tr15/
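The counting rule above can be sketched in a few lines. This is a minimal illustration using Python's standard `unicodedata` module, not Twitter's actual implementation; the function name `tweet_length` is an assumption made for the example.

```python
import unicodedata

def tweet_length(text: str) -> int:
    """Count code points after normalizing to Unicode NFC (hypothetical helper)."""
    return len(unicodedata.normalize("NFC", text))

kanji = "\u5473"         # 味
decomposed = "e\u0301"   # e + combining acute accent
precomposed = "\u00e9"   # é as a single code point

print(len(kanji.encode("utf-8")))         # 3 bytes
print(len(kanji.encode("utf-16-be")))     # 2 bytes
print(len(decomposed), len(precomposed))  # 2 1 (raw code points disagree)
print(tweet_length(decomposed), tweet_length(precomposed), tweet_length(kanji))  # 1 1 1
```

Both spellings of é count as one character once normalized, which matches what a person sees.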
37. Tweet Processing (part 1)
‣ Auto linking
‣ Japanese, for example, has no spaces.
‣ We’ve worked out a solution that balances how people use
Twitter with complete correctness
‣ We’ve open-sourced our solution
‣ Language identification
‣ Traditional methods rely on more text
‣ Tweets also have a vocabulary of their own (tw*)
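One cheap signal that still works on very short text is the Unicode script of the characters. The sketch below is an illustrative heuristic, not the method described in this talk; `script_hint` and its return codes are assumptions. It only helps for languages with a distinctive script and says nothing about, say, English versus French.

```python
import unicodedata

def script_hint(text: str) -> str:
    """Guess a language from Unicode scripts alone; 'und' means undetermined."""
    names = [unicodedata.name(ch, "") for ch in text]
    if any(n.startswith(("HIRAGANA", "KATAKANA")) for n in names):
        return "ja"   # kana is specific to Japanese
    if any(n.startswith("HANGUL") for n in names):
        return "ko"
    return "und"      # Latin text, or Han characters shared by several languages

print(script_hint("今日はいい天気"))  # ja
print(script_hint("안녕하세요"))      # ko
print(script_hint("味"))              # und
```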
38. Tweet Processing (part 2)
‣ Searching Tweets
‣ Per-language tokenizing is difficult given the language identification
challenges
‣ Average Tweet length varies noticeably by language
‣ Trends
‣ Finding entities in Tweets requires either NLP (which is highly
language dependent) or pure statistical analysis (which can
produce poor quality trends)
‣ All of this is harder given the very short nature of Tweets
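A common fallback when per-language tokenizing is not available is to index overlapping character bigrams. The sketch below shows that general technique and one of its known weaknesses; it is not a description of Twitter's search, and `char_bigrams` is a name invented for the example.

```python
from typing import List

def char_bigrams(text: str) -> List[str]:
    """Split on whitespace, then emit overlapping two-character tokens."""
    tokens: List[str] = []
    for chunk in text.split():
        if len(chunk) < 2:
            tokens.append(chunk)
        else:
            tokens.extend(chunk[i:i + 2] for i in range(len(chunk) - 1))
    return tokens

# 東京都庁 is the Tokyo Metropolitan Government building.
print(char_bigrams("東京都庁"))  # ['東京', '京都', '都庁']
```

The middle token, 京都 (Kyoto), is a false match: a search for Kyoto would hit a Tweet about Tokyo. This is the kind of quality trade-off that language-independent approaches bring.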
39. Other Technical Lessons
‣ Ruby 1.8 Unicode support is lacking
‣ MySQL before v6.0 doesn’t allow all Unicode characters
‣ And 6.0 died in Alpha
‣ Memcached keys only support a subset of characters
‣ You can either validate or encode
‣ Unicode security is a real thing
‣ Directional change spoofing attack
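The "validate or encode" choice for memcached keys, and a basic check for directional controls, might look like the following. This is a sketch under stated assumptions: the printable-ASCII rule is deliberately stricter than memcached requires, the `sha256:` prefix is an invented convention, and all function names are hypothetical.

```python
import hashlib
from urllib.parse import quote

MAX_KEY_BYTES = 250  # memcached's key length limit

def is_valid_key(key: str) -> bool:
    """Validate: accept only non-empty printable ASCII with no spaces."""
    raw = key.encode("utf-8")
    return 0 < len(raw) <= MAX_KEY_BYTES and all(0x21 <= b <= 0x7E for b in raw)

def encode_key(key: str) -> str:
    """Encode: percent-encode unsafe bytes; hash keys that would be too long."""
    encoded = quote(key, safe="")
    if len(encoded) > MAX_KEY_BYTES:
        encoded = "sha256:" + hashlib.sha256(key.encode("utf-8")).hexdigest()
    return encoded

# Explicit directional embedding, override, and isolate controls.
BIDI_CONTROLS = frozenset("\u202a\u202b\u202c\u202d\u202e\u2066\u2067\u2068\u2069")

def has_bidi_controls(text: str) -> bool:
    """Flag text containing characters that can reorder how it is displayed."""
    return any(ch in BIDI_CONTROLS for ch in text)

print(is_valid_key("user:味 1"))                 # False
print(encode_key("user:味 1"))                   # user%3A%E5%91%B3%201
print(has_bidi_controls("photo\u202egpj.exe"))   # True
```

Validation rejects bad input early; encoding accepts anything but makes keys harder to read when debugging.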
42. Our Translation Back-End
‣ Based on the Ruby FastGettext library
‣ Custom back-end
‣ Re-loaded at process start-up (~2 hours)
‣ Data is stored in memcached
‣ Loaded into memcached from our database
‣ No engineer needed to deploy
‣ Completely self-managed
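The data flow above (database rows loaded into a cache, lookups served from the cache) can be sketched as follows. This is not the FastGettext API or Twitter's back-end: a plain dict stands in for memcached, and the key scheme, function names, and sample row are all assumptions for illustration.

```python
import hashlib
from typing import Dict, Iterable, Tuple

def cache_key(locale: str, msgid: str) -> str:
    """Hash the source string so the key contains no spaces or non-ASCII text."""
    digest = hashlib.sha1(msgid.encode("utf-8")).hexdigest()
    return f"i18n:{locale}:{digest}"

def load_translations(cache: Dict[str, str],
                      rows: Iterable[Tuple[str, str, str]]) -> None:
    """Copy (locale, msgid, msgstr) rows from the database into the cache."""
    for locale, msgid, msgstr in rows:
        cache[cache_key(locale, msgid)] = msgstr

def translate(cache: Dict[str, str], locale: str, msgid: str) -> str:
    """Return the cached translation, falling back to the source string."""
    return cache.get(cache_key(locale, msgid)) or msgid

cache: Dict[str, str] = {}
load_translations(cache, [("fr", "Sign up", "Inscription")])  # invented sample row
print(translate(cache, "fr", "Sign up"))  # Inscription
print(translate(cache, "ja", "Sign up"))  # Sign up (no translation yet)
```

Falling back to the source string means a partially translated locale still renders every string.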
43. Twitter Text Libraries
‣ Provides extraction and auto-linking
‣ @user, @user/list, #hashtag, URLs
‣ Open Source*
‣ Available in Ruby and Java from Twitter
‣ Conformance Testing Data
‣ Modeled after the Unicode conformance suite
‣ YAML description of test cases for any language
‣ Assurance that you meet the same standards
‣ Many non-English test cases