3. What’s on Tap
Agenda:
* Who am I
* Some business-y talk about popularity outside of the US
* Some quick notes on our translation process
* Technical details on what’s hard about non-English text
• Hashtag for Questions: #chirpintl
• Who’s this guy?
• Twitter’s popularity outside of the US
• Twitter’s Current & Future Translation Tools
• Non-English Tweet Handling
• Extraction and Auto-linking with Twitter Text
• Character Counting
• Invalid Tweet Text
4. Matt Sanford / @mzsanford
Short bio slide. Helpful when it comes to Q&A time.
• Joined Twitter from Summize (Twitter Search)
• Worked on Search and Platform
• Search by language, search refresh bar
• Original OAuth implementer at Twitter
• Now tech lead of the International team
• Working on translation tools and non-US features
• Standardized character counting
• Author of Open Source Twitter Text libraries
5. International Business.
Why Bother With International?
Before I cover any technical details I wanted to give a little information on why people using the Twitter Platform should be interested in International. The best way to do that is with numbers …
6. International: 60% & Growing
[Chart: share of Twitter accounts outside the US; y-axis: 0% to 100%; x-axis: June 2009, September 2009, December 2009, March 2010]
Bam. 60% of all Twitter accounts are non-US … We crossed the 50% mark in September of 2009.
A big part of this is Japan, where we’re quite popular. Another big part was the new translation efforts we launched. Spanish especially has been well received.
7. Attendees vs. Users
[Pie charts: Chirp Attendees (US 83%, Non-US 17%) and Twitter Accounts (US versus International)]
8. Case Study: Chile
We’re There When People Need Us.
A good example of Twitter International is Chile. Translating didn’t create an explosion in Twitter usage. What created an explosion was a need for faster information.
9. Twitter Signups in Chile
We’re There When People Need Us.
[Chart: Twitter signups in Chile; x-axis: February 21st, February 24th, February 27th, March 2nd]
10. Twitter Signups in Chile
Tweet (10:50 AM Mar 2nd via web):
URGENTE en Constitución apareció IVAN LARA DE 8 AÑOS QUE ESTÁ ABANDONADO en esa ciudad...busca parientes en todo Chile favor copiar y pegar
Translation: Urgent. In Constitucion an eight-year old boy named Ivan Lara showed up alone. He's looking for his family
11. Case Study: Japan
Not Godzilla Big, But We’re Working On It.
As opposed to the event inflection we saw in Chile, in Japan we’ve seen long term, sustained growth. We’ve also been dedicating resources to some local-specific features.
12. Daily Tweeters in Japan
More Users Are Good. More Engaged Users Are Better.
[Chart: daily Tweeters in Japan; x-axis: July ‘09, October ‘09, January ‘10, April ‘10]
15. Japanese Mobile Follow Me
Take Advantage of Existing Behavior
Photo: flickr.com/cogdog
Photo: flickr.com/netwalkerz
17. Translation Tool
Present & Future
Since translation is a big part of what we’re working on I want to cover that a little bit. Like all Twitter features we rely on user need to help define what we do. We could have paid translators but we felt like having users participate in the process was important. That led us to our current crowd-sourcing model …
22. Translation Tools
today
On context: Point out the arrow versus the list-view of other sites. Also: suggestions.
On deploy: unaided.
Post-slide note: We’ll be rolling out changes very soon that focus on consensus over new translations.
• Volunteer crowd-sourcing
• Augmented by in-house people
• Built-in to twitter.com
• Provides context during translation
• Significantly higher quality
• Social game dynamics
• Database backed and heavily cached
• Edits are launched in ~2 hours
• Multiple levels of voting
• Helps prevent abuse
• Built-in proofing system
23. Translation Tools
tomorrow
• We’ve released some common terms on the API wiki
• So you can benefit from our translation work
• To help with consistency across clients
• We’re hoping to provide even more data in the future
• More languages. More strings. More ease.
• New translation UI changes coming soon
On releasing translations: We made
this a goal and covered it in the
translation agreement. Let me know
after this talk what would help you.
24. Engineering Topics
Yeah, It’s Complicated.
Up until now we’ve covered more general Twitter topics. Now we’re going to talk about some of the more complicated topics. Most international issues boil down to things you think are simple turning out to be deceptively hard to get right. Things like:
* Parsing Tweets (and what’s so hard about it)
* Counting characters (and why it’s not that simple)
* Tweet text that we cannot accept (today)
25. Twitter Text Libraries
Rather than re-implement these common features we recommend using the Open Source libraries we help maintain.
If you’re not using Ruby or Java: We provide a cross-language test suite so you can implement the same rules in another language.
• Provides extraction and auto-linking
• @user, @user/list, #hashtag, URLs
• Open Source*
• Available in Ruby and Java from Twitter
• Conformance Testing Data
• Modeled after the Unicode conformance suite
• YAML description of test cases for any language
• Assurance that you meet the same standards
• Many non-English test cases
* http://twitter.com/about/opensource and on github
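The conformance idea on this slide can be sketched as a small data-driven test runner. This is only an illustration: the case fields ("description", "text", "expected"), the runner, and the toy extract_hashtags function are assumptions made for the sketch, and they are not the actual format of the conformance data or the rules in the Twitter Text libraries. The cases are written inline here; in practice they would be loaded from the shared YAML files.

```python
import re

# Hypothetical cases, shaped like entries one might load from a YAML file.
# The field names are assumptions made for this sketch.
CASES = [
    {"description": "ASCII hashtag",
     "text": "hello #chirpintl", "expected": ["chirpintl"]},
    {"description": "full-width hash sign",
     "text": "hello \uFF03chirpintl", "expected": ["chirpintl"]},
    {"description": "no hashtag",
     "text": "hello world", "expected": []},
]


def extract_hashtags(text):
    """Toy implementation under test; not the Twitter Text rule set."""
    # Accept both '#' and the full-width form U+FF03.
    return re.findall(r"[#\uFF03](\w+)", text)


def run_conformance(cases, implementation):
    """Return the descriptions of all cases whose output differs from 'expected'."""
    failures = []
    for case in cases:
        if implementation(case["text"]) != case["expected"]:
            failures.append(case["description"])
    return failures


if __name__ == "__main__":
    print(run_conformance(CASES, extract_hashtags))
```

Because the cases are plain data, the same file can drive an implementation in any language, which is the point of keeping the suite separate from the Ruby and Java code.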
26. Twitter Text: Japanese Linking
Quick tour of the issues the Twitter Text libraries handle in Japanese that many previous libraries didn’t handle.
Issues not encountered in English:
• Additional punctuation characters
• \s in many languages ignores U+3000 (‘　’)
• Full-width punctuation forms:
• @ versus ＠
• # versus ＃
• No spaces between words
The lack of word spaces is a fundamental issue when it comes to parsing Tweets.
My homepage is http://twitter.com
http://twitter.com
30. Character counting
Unicode FTW!
Don’t count bytes
味 (U+5473)
UTF-8: 0xE5 0x91 0xB3 (3 bytes)
UTF-16: 0x54 0x73 (2 bytes)
Human: 1 character
Don’t even count Unicode code points
e (U+0065) + combining acute accent (U+0301) = é {U+0065, U+0301}
OR
é (U+00E9)
We count the shortest representation*
* Unicode NFC form. See: http://unicode.org/reports/tr15/
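The difference between the three ways of counting can be checked with Python's standard unicodedata module. This is a minimal sketch that assumes the count is the number of code points after NFC normalization, as the slide describes; the function names are made up for the example.

```python
import unicodedata


def count_bytes_utf8(text):
    """Length in bytes when encoded as UTF-8."""
    return len(text.encode("utf-8"))


def count_bytes_utf16(text):
    """Length in bytes when encoded as UTF-16 (big-endian, no BOM)."""
    return len(text.encode("utf-16-be"))


def count_code_points(text):
    """Number of Unicode code points, with no normalization."""
    return len(text)


def count_nfc(text):
    """Number of code points after normalizing to NFC."""
    return len(unicodedata.normalize("NFC", text))


if __name__ == "__main__":
    kanji = "\u5473"          # U+5473
    decomposed = "e\u0301"    # U+0065 followed by U+0301
    composed = "\u00e9"       # U+00E9

    print(count_bytes_utf8(kanji), count_bytes_utf16(kanji), count_code_points(kanji))
    print(count_code_points(decomposed), count_nfc(decomposed))
    print(count_code_points(composed), count_nfc(composed))
```

Both spellings of é count as one character under NFC, so a client that normalizes before counting should agree with the limit regardless of how the user's keyboard produced the accent.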
32. Invalid Tweet Text
For a variety of reasons
Slide on characters that Twitter does not allow in a Tweet. We purposely disallow those that have no meaning in the context of a Tweet, or that have security implications. We also have a technical limitation in MySQL that disallows certain characters. It’s fixed in MySQL 6 but we’ll be moving to Cassandra.
Disallowed on Purpose
• Byte order Marks (not needed since we only accept UTF-8): U+FFFE & U+FEFF
• Reserved Unicode Special: U+FFFF
• Directional Change Characters (they allow complicated phishing attacks)*: U+202A, U+202B, U+202C, U+202D & U+202E
Disallowed Due to Technical Limitations
• Characters outside of the Basic Multilingual Plane (BMP)
• That means all Unicode code points above U+FFFF
• Some Unicode 5 Kanji, Many ancient writing systems and things like musical symbols.
• We’re actively working on the move from MySQL to Cassandra, which will solve this.
* Unicode Security Considerations: http://www.unicode.org/reports/tr36
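A client could pre-check Tweet text against the two lists on this slide before submitting it. The sketch below only encodes the code points named here; the function name, the reason strings, and the return shape are assumptions for illustration and do not describe how Twitter performs the check.

```python
# Code points the slide lists as disallowed on purpose.
DISALLOWED_ON_PURPOSE = {
    0xFFFE, 0xFEFF,                          # byte order marks
    0xFFFF,                                  # reserved Unicode special
    0x202A, 0x202B, 0x202C, 0x202D, 0x202E,  # directional change characters
}

# Highest code point in the Basic Multilingual Plane.
BMP_MAX = 0xFFFF


def find_invalid_characters(text):
    """Return (index, code point label, reason) for each disallowed character."""
    problems = []
    for index, ch in enumerate(text):
        cp = ord(ch)
        if cp in DISALLOWED_ON_PURPOSE:
            problems.append((index, f"U+{cp:04X}", "disallowed on purpose"))
        elif cp > BMP_MAX:
            problems.append((index, f"U+{cp:04X}", "outside the BMP"))
    return problems


if __name__ == "__main__":
    print(find_invalid_characters("hello"))
    print(find_invalid_characters("a\u202Eb"))
    print(find_invalid_characters("\U0001D11E"))  # a musical symbol
```

Reporting the index along with the reason lets a client point the user at the offending character instead of rejecting the whole Tweet without explanation.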
Title Slide.
• “let’s get to it”
• Who am I?
• Some business-y talk for the entrepreneurs in the group
• Notes on how we’ve gone about translation
• Engineering challenges for the coders in the group
• From Summize
• Search, platform (might remember API Group)
• Original OAuth (sorry)
• International (translation, char counting, twitter-text)
• Before the technical stuff a little info on why you should be interested.
• the main reason: Users
• You might have seen the blog post on international growth.
• Passed 50% not long after the team formed
• In large part: Japan, translation
• Take advantage of these markets.
• Due to a slew of factors dev is mainly US (Twitter, EN, etc) but that does not mean it’s not looking outward
• Case in point, Chile
• Translating alone was not a big jump
• But we had set the stage. When the need for faster information arose we were there
• You can see the Earthquake effect clearly. Not shown here is that signups have remained higher than pre-quake levels.
• What’s great isn’t the users, but the utility [click]
• This tweet for example.
• It’s not what someone had for breakfast, but solving a real communication problem.
• Unlike the event-driven growth in Chile, Japan has seen long-term, steady growth
• We’ve been dedicating resources and working on more local features
• Rather than users I want to highlight daily unique ‘Tweeters’ (people who tweet)
- We’ve been working as much on adding people as increasing the utility to those people
- Done this via a new mobile site matching Japanese expectations, along with email/photoposting
• The red dot here is the ‘follow me’ feature on the mobile site. It’s not the sole cause of the uptake but it’s helped.
• I’d like to take a moment and explain that … [next]
• We’ve done a bunch of features on the JP mobile site (Yoshi, Sean), one of those is the ‘follow me’ flow.
• This is something that people can learn from: We took advantage of existing user behavior, even though it’s not a behavior in the US. We use the QR-code.
• QR-codes are big in Japan [click] … like this one on a sign. Goes to the store site
• People are so used to this they use it for context like [click] these real estate listings
• We used this existing behavior [click] to let people share their ‘contact info’ in the form of their twitter profile.
- Like ‘Bump’ on the iPhone but it works on all handsets in Japan and is immediately evident to users.
• Translation is a big part of what we do, and we do it a little differently
• Like all features we turn to users for feedback. Could have paid, would have been cheaper, but would not have had community feedback
• Crowd-source, like open source for data. We had a great group … [next]
Of more than 2,600 translators.
- Soon to send out more invites. Planning to make it open to anyone later this year.
Twitter isn’t just 200 labels. Settings, about pages, features, etc.
[click] and more features every day.
Those 2,600 translators have been so passionate it just blows me away. As of today they’ve contributed 480k translations.
• We augmented with a wonderful group in-house (shoutout)
• Built the tool into twitter.com, provides context (see pointer) for quality, social game dynamic in jump-around prompt (see counter)
• DB backed with cache, no-deploy launching.
• Multi-level voting
We’ve released translations of the most common terms on the wiki so you can use them. We want to provide even more help, let us know how. New translation UI upcoming (not of too much interest, other than more data)
Engineering topics. Not complete but most i18n topics boil down to things that are easy 99% of the time and very hard 1% of the time. We’ll cover parsing tweets, counting characters, and invalid tweet text
Twitter-text libs.
- Extract, autolink
- Open Source Ruby and Java. Also following community ports to Python and PHP (though PHP could use some love). Look forward to more.
- Conformance data: Unicode, YAML, assurance, non-EN test cases
• A good example of the 1% issues we handle in the libs are Japanese Tweets …[next]
• Punctuation: \s sucks in most languages. Full-width @ and # (if you want more info on this let me know afterward.)
• No spaces between words. Turns out, we assume a lot [click]
- http://\S+ does not work.
• When your product is char based, it matters. Some issues are obvious, some not.
• [click] Don’t count bytes. You knew that.
• [click] Don’t count code points. That’s news to many people.
• We try to count what a person would call a char, where possible. So, we [click] use the shortest.
• Two types of things we don’t allow. On purpose, technical limitation.
• On Purpose: BOM (not utf-16), reserved, dir change (security, layout is not at home in a Tweet)
• Limitations of MySQL (<v6) prevent some chars. (small set of Kanji, musical symbols, ancient scripts)