Internationalizing Twitter

Matt Sanford @ IMUG // 2010-05-20
Feedback
 ‣   Hashtag: #IMUG408
 ‣   @Reply me: @mzsanford
 ‣   Email me: matt@twitter.com
 ‣   Or, talk to me afterward.
 "It's real, human interaction. It ain't gonna hurt you. "
                                             - @mchammer
Agenda
‣   Twitter’s Non-US Popularity
‣   Growth & Localization
    ‣   Case Studies: Chile & Japan
‣   Community Translation
    ‣   The Good, The Bad & The Ugly
‣   Technical Hurdles
Mr. Popular (almost)
Non-US Growth for Twitter.
International: 60+% of all accounts
[Chart: international share of all accounts, 0% to 100%, June 2009 to March 2010]
Case Study: Chile
We’re There When People Need Us.
Twitter Signups in Chile
[Chart: Twitter signups in Chile, February 21st to March 2nd]

Tweet, 10:50 AM Mar 2nd via web:
URGENTE en Constitución apareció IVAN LARA DE 8 AÑOS QUE ESTÁ ABANDONADO en esa ciudad...busca parientes en todo Chile favor copiar y pegar

Translation:
Urgent. In Constitución, an eight-year-old boy named Ivan Lara showed up alone. He's looking for his family.
Case Study: Japan
Not Godzilla Big, But We’re Working On It
Daily Tweeters in Japan
[Chart: daily Tweeters in Japan, July ‘09 to April ‘10]
Japanese Mobile Follow Me
           Localizing is more than translation
Mobile in Japan: Galapagos Phones
‣   We have a special mobile web site
    ‣   Emoji support
    ‣   No cookies
    ‣   Image conversion
    ‣   Designed with Japanese expectations in mind
‣   In addition to Android & iPhone clients, we’re
    working with carriers on integrated clients
Community Translation
Why we chose it. How we do it. What works. What doesn’t.
Don’t Panic
“If you know the enemy and know yourself, you need not
fear the result of a hundred battles. If you know yourself
but not the enemy, for every victory gained you will also
suffer a defeat. If you know neither the enemy nor yourself,
you will succumb in every battle.”
                                        - Sun Tzu, The Art of War
Why Community Translation?
 ‣   We didn’t have a budget
     ‣   We were/are a small, Open Source based business
 ‣   We had a large number of willing volunteers
 ‣   We’re committed to user involvement
 ‣   We have a very specific tone and vocabulary
 ‣   We had already tried direct translation and it didn’t mesh well
     with our release cycle
     ‣   Twitter.com is deployed several times a day
How We Do Community Translation
Community Translation Stats

           Translators: 2,600
              Strings: 3,7000
          Translations: 480,000
   Average /Translator: 184
What Works Well?
 ‣   In-line translation (added context)
 ‣   Multi-level voting
 ‣   Discussion groups for user input
     ‣   French “follow” for example:
         ‣   Mouton – Sheep
         ‣   Suiveur – Stalkers
         ‣   Adepte – Followers
What Works Less Well?
 ‣   Turnaround time
 ‣   Long, difficult strings are often skipped
 ‣   Inconsistent wording choices
 ‣   Sensitive content, such as email notices
 ‣   Pre-launch project disclosure
 ‣   Management of the groups takes some resources
Technical Hurdles
Internationalizing is more than GetText*
* but you already knew that.
Character Counting
“If you base a product on a character count, you better get it right”
                                                             – @mzsanford

Don’t count bytes
 ‣   Character: U+5473 (味)
 ‣   UTF-8: 0xE5 0x91 0xB3 (3 bytes)
 ‣   UTF-16: 0x54 0x73 (2 bytes)
 ‣   Human: 1 character
Don’t even count Unicode code points
 ‣   e + combining acute accent = é, as two code points: {U+0065, U+0301}
 ‣   OR é as a single code point: U+00E9
We try to count the shortest representation*
* Unicode NFC form. See: http://unicode.org/reports/tr15/
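
The three ways of counting can be compared in a few lines of Python. This is a minimal sketch using the standard `unicodedata` module; `tweet_length` is a hypothetical helper name chosen for illustration, not Twitter's actual implementation.

```python
import unicodedata


def tweet_length(text: str) -> int:
    """Count code points after NFC normalization (hypothetical helper)."""
    return len(unicodedata.normalize("NFC", text))


kanji = "\u5473"  # the single character U+5473 from the slide
print(len(kanji.encode("utf-8")))      # 3 bytes
print(len(kanji.encode("utf-16-be")))  # 2 bytes
print(tweet_length(kanji))             # 1 character

decomposed = "e\u0301"   # e followed by a combining acute accent
precomposed = "\u00e9"   # the same letter as one code point
print(len(decomposed), len(precomposed))                    # 2 1
print(tweet_length(decomposed), tweet_length(precomposed))  # 1 1
```

NFC only composes sequences that have a precomposed form, so some user-perceived characters may still count as more than one after normalization.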
Tweet Processing (part 1)
 ‣   Auto linking
     ‣   Japanese, for example, has no spaces.
     ‣   We’ve worked out a solution that balances how people use
         Twitter with complete correctness
         ‣   We’ve open-sourced our solution
 ‣   Language identification
     ‣   Traditional methods rely on more text
     ‣   Tweets also have a vocabulary of their own (tw*)
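
To make the auto-linking trade-off concrete, here is a rough Python sketch of entity extraction that cannot rely on spaces. The patterns and the `extract_entities` name are simplified assumptions for illustration; the open-sourced Twitter rules are considerably more detailed.

```python
import re

# Hypothetical, simplified patterns (not the real twitter-text rules).
# A mention must not directly follow an ASCII word character, which skips
# email addresses but still allows Japanese text with no space before "@".
MENTION = re.compile(r"(?<![A-Za-z0-9_])[@＠]([A-Za-z0-9_]+)")
# \w is Unicode-aware for str patterns, so kana and kanji are accepted.
HASHTAG = re.compile(r"[#＃](\w+)")


def extract_entities(text: str) -> dict:
    """Return the mentions and hashtags found in a Tweet-like string."""
    return {
        "mentions": MENTION.findall(text),
        "hashtags": HASHTAG.findall(text),
    }


print(extract_entities("こんにちは@mzsanford #IMUG408"))
print(extract_entities("Email me: matt@twitter.com"))
print(extract_entities("#寿司が好き"))
```

The last call shows why this is hard: with no space after the tag, the whole phrase 寿司が好き is captured as the hashtag, so any rule has to balance how people actually write against strict correctness.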
Tweet Processing (part 2)
 ‣   Searching Tweets
     ‣   Per-language tokenizing is difficult given the language identification
         challenges
     ‣   Average Tweet length varies noticeably by language
 ‣   Trends
     ‣   Finding entities in Tweets requires either NLP (which is highly
         language dependent) or pure statistical analysis (which can
         produce poor quality trends)
     ‣   All of this is harder given the very-short nature of Tweets
Other Technical Lessons
 ‣   Ruby 1.8 Unicode support is lacking
 ‣   MySQL before v6.0 doesn’t allow all Unicode characters
     ‣   And 6.0 died in Alpha
 ‣   Memcached keys only support a subset of characters
     ‣   You can either validate or encode
 ‣   Unicode security is a real thing
     ‣   Directional change spoofing attack
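
Two of the lessons above can be sketched in Python. The first part shows "validate or encode" for memcached keys, assuming a conservative rule of printable ASCII within memcached's 250-byte key limit. The second part flags the directional control characters used in spoofing. All function names here are hypothetical.

```python
import hashlib

MAX_KEY_BYTES = 250  # memcached's key length limit

# Explicit directional embedding, override and isolate controls.
BIDI_CONTROLS = {chr(cp) for cp in range(0x202A, 0x202F)} | {
    chr(cp) for cp in range(0x2066, 0x206A)
}


def is_valid_key(key: str) -> bool:
    """Validate: accept only printable ASCII with no spaces (conservative assumption)."""
    raw = key.encode("utf-8")
    return 0 < len(raw) <= MAX_KEY_BYTES and all(0x21 <= b <= 0x7E for b in raw)


def encode_key(key: str) -> str:
    """Encode: hash any string into a fixed-length ASCII key."""
    return "k:" + hashlib.sha256(key.encode("utf-8")).hexdigest()


def has_bidi_controls(text: str) -> bool:
    """Report whether the text contains a directional control character."""
    return any(ch in BIDI_CONTROLS for ch in text)


print(is_valid_key("user:42"), is_valid_key("ユーザー 42"))
print(is_valid_key(encode_key("ユーザー 42")))
print(has_bidi_controls("photo\u202egpj.exe"))
```

Hashing keeps keys short and safe but makes them unreadable when debugging. Validation keeps them readable but rejects non-ASCII input outright.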
Questions/Answers
Appendix Slides
There’s more data where that came from …
Our Translation Back-End
 ‣   Based on the Ruby FastGettext library
 ‣   Custom back-end
     ‣   Re-loaded at process start-up (~2 hours)
     ‣   Data is stored in memcached
     ‣   Loaded into memcached from our database
         ‣   No engineer needed to deploy
         ‣   Completely self-managed
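
The shape of such a back-end can be sketched as follows. This is an illustrative Python model under stated assumptions, not the Ruby FastGettext code: a plain dict stands in for the database, another for memcached, and `TranslationStore` and the sample strings are invented for the example.

```python
from typing import Dict, Tuple

# (locale, msgid) -> approved translation; stands in for the real database.
Database = Dict[Tuple[str, str], str]


class TranslationStore:
    """Gettext-style lookup served from a cache that is refreshed from a database."""

    def __init__(self, database: Database) -> None:
        self._database = database
        self._cache: Dict[Tuple[str, str], str] = {}
        self.reload()

    def reload(self) -> None:
        """Copy the current translations into the cache (the memcached stand-in)."""
        self._cache = dict(self._database)

    def translate(self, locale: str, msgid: str) -> str:
        """Return the cached translation, falling back to the source string."""
        return self._cache.get((locale, msgid), msgid)


db: Database = {("fr", "Follow"): "Suivre"}  # made-up sample data
store = TranslationStore(db)
print(store.translate("fr", "Follow"))  # Suivre
print(store.translate("ja", "Follow"))  # Follow (no translation yet)
```

Because lookups only touch the cache, new translations appear after the next reload rather than needing a code deploy.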
Twitter Text Libraries
 ‣   Provides extraction and auto-linking
     ‣   @user, @user/list, #hashtag, URLs
 ‣   Open Source*
 ‣   Available in Ruby and Java from Twitter
 ‣   Conformance Testing Data
     ‣   Modeled after the Unicode conformance suite
     ‣   YAML description of test cases for any language
     ‣   Assurance that you meet the same standards
 ‣   Many non-English test cases
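
A data-driven conformance check might look like the Python sketch below. The case layout is an assumption about how a YAML file could deserialize, and both `extract_hashtags` and `run_conformance` are hypothetical stand-ins rather than the library's API.

```python
import re

# Hypothetical cases, shaped the way a YAML test file might deserialize.
CASES = [
    {"description": "ASCII hashtag",
     "text": "Talk tagged #IMUG408", "expected": ["IMUG408"]},
    {"description": "Japanese hashtag",
     "text": "今日は #寿司 を食べた", "expected": ["寿司"]},
    {"description": "No hashtag",
     "text": "nothing to see", "expected": []},
]


def extract_hashtags(text):
    """Simplified extractor under test (not the real twitter-text rules)."""
    return re.findall(r"[#＃](\w+)", text)


def run_conformance(extract, cases):
    """Return the descriptions of every case the extractor fails."""
    return [c["description"] for c in cases if extract(c["text"]) != c["expected"]]


print(run_conformance(extract_hashtags, CASES))  # an empty list means all cases pass
```

Keeping the cases as data means an implementation in any language can run the same file and compare results.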
