Internationalizing Twitter

Matt Sanford @ IMUG // 2010-05-20

Feedback
‣ Hashtag: #IMUG408
‣ @Reply me: @mzsanford
‣ Email me: matt@twitter.com
‣ Or, talk to me afterward.

"It's real, human interaction. It ain't gonna hurt you."
- @mchammer

Agenda
‣ Twitter’s Non-US Popularity
‣ Growth & Localization
‣ Case Studies: Chile & Japan
‣ Community Translation
‣ The Good, The Bad & The Ugly
‣ Technical Hurdles

Mr. Popular
Non-US Growth for Twitter.

Mr. Popular (almost)
Non-US Growth for Twitter.

International: 60+% of all accounts
[Chart: international share of all accounts, 0% to 100%, June 2009 to March 2010]

Case Study: Chile
We’re There When People Need Us.

Twitter Signups in Chile
[Chart: signups in Chile, February 21st to March 2nd]

Original Tweet (10:50 AM Mar 2nd via web):
URGENTE en Constitución apareció IVAN LARA DE 8 AÑOS QUE ESTÁ ABANDONADO en esa ciudad...busca parientes en todo Chile favor copiar y pegar

English translation:
Urgent. In Constitucion an eight-year old boy named Ivan Lara showed up alone. He's looking for his family

Case Study: Japan
Not Godzilla Big, But We’re Working On It

Daily Tweeters in Japan
[Chart: daily Tweeters in Japan, July ‘09 to April ‘10]

Japanese Mobile Follow Me
Localizing is more than translation

Mobile in Japan: Galapagos Phones
‣ We have a special mobile web site
‣ Emoji support
‣ No cookies
‣ Image conversion
‣ Designed with Japanese expectations in mind
‣ In addition to Android & iPhone clients, we’re
working with carriers on integrated clients

Community Translation
Why we chose it. How we do it. What works. What doesn’t.

Don’t Panic
“If you know the enemy and know yourself, you need not
fear the result of a hundred battles. If you know yourself
but not the enemy, for every victory gained you will also
suffer a defeat. If you know neither the enemy nor yourself,
you will succumb in every battle.”
- Sun Tzu, The Art of War

Why Community Translation?
‣ We didn’t have a budget
‣ We were/are a small, Open Source based business
‣ We had a large number of willing volunteers
‣ We’re committed to user involvement
‣ We have a very specific tone and vocabulary
‣ We had already tried direct translation and it didn’t mesh well
with our release cycle
‣ Twitter.com is deployed several times a day

How We Do Community Translation

Community Translation Stats

Translators: 2,600
Strings: 3,7000
Translations: 480,000
Average per translator: 184

What Works Well?
‣ In-line translation (added context)
‣ Multi-level voting
‣ Discussion groups for user input
‣ French “follow” for example:
‣ Mouton – Sheep
‣ Suiveur – Stalkers
‣ Adepte – Followers

What Works Less Well?
‣ Turnaround time
‣ Long, difficult strings are often skipped
‣ Inconsistent wording choices
‣ Sensitive content, such as email notices
‣ Pre-launch project disclosure
‣ Management of the groups takes some resources

Technical Hurdles
Internationalizing is more than GetText*

* but you already knew that.

Character Counting
“If you base a product on a character count, you better get it right”
– @mzsanford

Character Counting

Don’t count bytes
‣ 味 (U+5473)
‣ UTF-8: 0xE5 0x91 0xB3 (3 bytes)
‣ UTF-16: 0x54 0x73 (2 bytes)
‣ Human: 1 character

Don’t even count Unicode code points
‣ e + combining acute accent = é: {U+0065, U+0301} (2 code points)
‣ OR é: U+00E9 (1 code point)

We try to count the shortest representation*

* Unicode NFC form. See: http://unicode.org/reports/tr15/
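
The counting rules on this slide can be sketched in a few lines of Python. This is only an illustration using the standard unicodedata module; the function name tweet_length is made up for this example and is not Twitter's actual implementation.

```python
import unicodedata


def tweet_length(text: str) -> int:
    """Count code points after NFC normalization (hypothetical helper)."""
    return len(unicodedata.normalize("NFC", text))


kanji = "\u5473"                        # the character from the slide
print(len(kanji.encode("utf-8")))       # byte count in UTF-8
print(len(kanji.encode("utf-16-be")))   # byte count in UTF-16 (no BOM)
print(tweet_length(kanji))              # what a human would call the length

decomposed = "e\u0301"                  # e + combining acute accent
precomposed = "\u00e9"                  # single precomposed code point
print(len(decomposed), len(precomposed))                    # raw code points differ
print(tweet_length(decomposed), tweet_length(precomposed))  # NFC makes them agree
```

Normalizing before counting means the same visible text gets the same length no matter which form the client sent.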

Tweet Processing (part 1)
‣ Auto linking
‣ Japanese, for example, has no spaces.
‣ We’ve worked out a solution that balances how people use
Twitter with complete correctness
‣ We’ve open sourced our solution
‣ Language identification
‣ Traditional methods rely on more text
‣ Tweets also have a vocabulary of their own (tw*)
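
To see why the missing spaces matter for auto linking, here is a deliberately naive hashtag extractor in Python. The pattern and the name extract_hashtags are assumptions made up for this sketch; this is not the approach used in Twitter's open source libraries, only a picture of the problem they have to balance.

```python
import re

# Naive assumption: a hashtag is '#' followed by "word" characters.
# In Python 3, \w matches Unicode letters and digits, including kana and kanji.
HASHTAG = re.compile(r"#(\w+)")


def extract_hashtags(text: str) -> list[str]:
    """Return hashtag bodies found in text (illustrative only)."""
    return HASHTAG.findall(text)


# Space-delimited text behaves as expected.
print(extract_hashtags("Talk feedback: #IMUG408"))

# With no spaces, the naive pattern swallows the text that follows the tag.
print(extract_hashtags("これは#日本語です"))

# The same tag is extracted cleanly only if the author adds spaces.
print(extract_hashtags("これは #日本語 です"))
```

A stricter rule would cut tags short, and a looser one over-captures as shown, which is the trade-off between usage and correctness mentioned above.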

Tweet Processing (part 2)
‣ Searching Tweets
‣ Per-language tokenizing is difficult given the language identification
challenges
‣ Average Tweet length varies noticeably by language
‣ Trends
‣ Finding entities in Tweets requires either NLP (which is highly
language dependent) or pure statistical analysis (which can
produce poor quality trends)
‣ All of this is harder given the very-short nature of Tweets

Other Technical Lessons
‣ Ruby 1.8 Unicode support is lacking
‣ MySQL before v6.0 doesn’t allow all Unicode characters
‣ And 6.0 died in Alpha
‣ Memcached keys only support a subset of characters
‣ You can either validate or encode
‣ Unicode security is a real thing
‣ Directional change spoofing attack

Questions/Answers

Appendix Slides
There’s more data where that came from …

Our Translation Back-End
‣ Based on the Ruby FastGettext library
‣ Custom back-end
‣ Re-loaded at process start-up (~2 hours)
‣ Data is stored in memcached
‣ Loaded into memcached from our database
‣ No engineer needed to deploy
‣ Completely self-managed

Twitter Text Libraries
‣ Provides extraction and auto-linking
‣ @user, @user/list, #hashtag, URLs
‣ Open Source*
‣ Available in Ruby and Java from Twitter
‣ Conformance Testing Data
‣ Modeled after the Unicode conformance suite
‣ YAML description of test cases for any language
‣ Assurance that you meet the same standards
‣ Many non-English test cases
