International Web Application Development
ookina umi no youni gengo ga arimasu.

There is an ocean of language.


     http://www.flickr.com/photos/jimbrekke/4292920...
demo watashitachi wa webu apurike-shon o
            kaihatsu suru toki...

But when we develop web applications...
ido no nakano kaeru no youni

we are like the frog in the well,

    http://www.flickr.com/photos/clickykbd/2650909663/
jibuntachi no gengo dake o kangaemasu

only thinking about our own language

        http://www.flickr.com/photos/clickykbd...
demo webu wa ookina umi no youna mono desu.

      but the web is a big ocean.

         http://www.flickr.com/photos/jimbr...
Sarah Allen
             @ultrasaurus




                   Mightyverse
sara aren desu. mightyverse o kaihatsu shite imasu.
I am Sarah Allen. I develop Mightyverse.
San Francisco




            san furanshisuko ni sunde imasu.
            I live in San Francisco.
Mightyverse
mojibake
Character Encoding
UTF8      JIS
UTF16     Shift-JIS
UTF32     EUC
Encoding Vocabulary

Code point: the number Unicode assigns
to a single character; an encoding stores it
as one or more bytes
Unicode
UTF8 - variable length
       (1, 2, 3, or 4 bytes)
UTF16 - variable length
      (2 or 4 bytes)
UTF32 - fixed widt...
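The byte counts on this slide can be checked from Ruby 1.9 or later. This is a minimal sketch, not part of the original talk: the sample characters and the helper name `encoded_sizes` are assumptions chosen for illustration.

```ruby
# Byte counts for the same character in three Unicode encodings.
samples = {
  "a"         => "ASCII letter",
  "\u00E9"    => "accented Latin letter",
  "\u65E5"    => "kanji",
  "\u{1D11E}" => "musical G clef, outside the Basic Multilingual Plane"
}

# Hypothetical helper: re-encode the text and count the resulting bytes.
def encoded_sizes(text)
  {
    utf8:  text.encode("UTF-8").bytesize,
    utf16: text.encode("UTF-16BE").bytesize,
    utf32: text.encode("UTF-32BE").bytesize
  }
end

samples.each do |char, description|
  puts "#{description}: #{encoded_sizes(char)}"
end
```

Only the last sample needs 4 bytes in both UTF8 and UTF16; UTF32 uses 4 bytes for every sample.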
UTF8
U+0000 to U+007F (decimal 0 to 127): 1 byte
ASCII text is also valid UTF8
A set high bit indicates a multi-byte character
High bits are used to indicate how many bytes are used to
represent a specific character. Software can easily read a
      ...
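How the high bits mark sequence length can be sketched in a few lines of Ruby. The bit patterns follow RFC 3629; the method names `utf8_sequence_length` and `next_char_start` are hypothetical, and the code does no validation beyond the lead byte.

```ruby
# Number of bytes in the UTF-8 sequence that starts with lead_byte,
# or nil when the byte is a continuation byte (10xxxxxx) or not a valid lead.
def utf8_sequence_length(lead_byte)
  if    lead_byte < 0x80          then 1  # 0xxxxxxx
  elsif lead_byte >> 5 == 0b110   then 2  # 110xxxxx
  elsif lead_byte >> 4 == 0b1110  then 3  # 1110xxxx
  elsif lead_byte >> 3 == 0b11110 then 4  # 11110xxx
  end
end

# Index of the first character boundary at or after index:
# skip continuation bytes until a lead byte (or the end) is reached.
def next_char_start(bytes, index)
  index += 1 while index < bytes.size && (bytes[index] & 0xC0) == 0x80
  index
end

bytes = "\u65E5\u672C".bytes          # two kanji, three bytes each
puts utf8_sequence_length(bytes[0])   # length announced by the first lead byte
puts next_char_start(bytes, 1)        # resynchronize from the middle of a character
```

Because continuation bytes always start with the bits 10, a reader dropped into the middle of a stream only has to skip them to find the next character.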
UTF8
Common for internet and file system
format
• XML: default encoding
• Flash: only encoding
UTF8 Disadvantages
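UTF-8 encoded text may be larger than in a legacy encoding
(for example, 3 bytes per kanji instead of 2 in Shift-JIS)
Possible to split a string mid-character when cutting by bytes
Excessive unification (Han unification: similar Chinese,
Japanese, and Korean characters share code points)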
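Splitting mid-character is easy to reproduce in Ruby 2.1 or later. This is a sketch under assumptions: the sample text and the helper `truncate_bytes` are made up for illustration and are not from the talk.

```ruby
text = "\u65E5\u672C\u8A9E"            # three kanji: 3 characters, 9 bytes in UTF-8

by_chars = text[0, 2]                  # character-based slice: two whole kanji
by_bytes = text.byteslice(0, 4)        # byte-based slice: one kanji plus a stray byte

puts by_chars.valid_encoding?
puts by_bytes.valid_encoding?

# Hypothetical helper: cut to a byte budget, then drop the incomplete
# trailing character. Note that scrub also removes any other invalid bytes.
def truncate_bytes(str, max_bytes)
  str.byteslice(0, max_bytes).scrub("")
end

puts truncate_bytes(text, 4).length
```

Slicing by characters is the safer default; a byte budget, such as a fixed-width database column, needs a cleanup step like the one above.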
Caution

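Not all implementations are complete
For example, the MySQL 5 "utf8" character set
       stores at most 3 bytes per character
       (utf8mb4, added in MySQL 5.5.3, stores up to 4)
Most living languages fit in 3 bytes of UTF8: the code points
           of the "Basic Multilingual Plane" (U+0000 to U+FFFF)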
                           ...
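One way to guard a 3-byte store is to look for characters outside the Basic Multilingual Plane before writing. This is a minimal sketch; the constant `BMP_MAX` and the method `chars_outside_bmp` are hypothetical names, not an API of Ruby or MySQL.

```ruby
BMP_MAX = 0xFFFF   # last code point of the Basic Multilingual Plane

# Returns the characters that need 4 bytes in UTF-8.
def chars_outside_bmp(text)
  text.each_char.select { |char| char.ord > BMP_MAX }
end

puts chars_outside_bmp("\u65E5\u672C\u8A9E").inspect      # kanji are inside the BMP
puts chars_outside_bmp("G clef: \u{1D11E}").inspect       # U+1D11E is not
```

An application could reject or replace the reported characters, or the column could be moved to a character set that stores 4 bytes.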
http://globalmoxie.com/blog/klingon-not-spoken-here.shtml




In May 2001, the Unicode Technical Committee rejected the Kl...
The tengwar font has been proposed for the Unicode standard. The codepoints
    are subject to change; the range U+016080 ...
You need an appropriate font installed
              to display the characters.




            http://en.wikipedia.org/wiki/...
Web Application Story
1. HTML Form post
2. Ruby code
3. Write to Database
4. Output HTML for Display
HTML Form Post
HTTP headers

• You can specify what character set you
  want back when you send a form post
• This is informational for t...
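A form post with a character-set header can be built with Ruby's standard net/http library. This sketch assumes the header meant on this slide is Accept-Charset; the path "/people" and the field "name" are invented, and the request is built but never sent.

```ruby
require "net/http"

# Build (but do not send) a form post.
request = Net::HTTP::Post.new("/people")
request["Accept-Charset"] = "utf-8"                       # character set wanted in the reply
request.set_form_data("name" => "\u3086\u304D\u3072\u308D")  # "yukihiro" in hiragana

puts request["Accept-Charset"]
puts request["Content-Type"]
puts request.body   # the UTF-8 bytes of the value, percent-encoded
```

The body carries the bytes of the value but says nothing about which encoding produced them, so the server still has to know what to expect.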
Ruby code
Ruby code


Most web applications don’t parse text
If yours does, you will need to think about
Ruby 1.8 vs. Ruby 1.9
Ruby 1.8
>> name = "Yukihiro"
=> "Yukihiro"
>> name[4]        # Ruby 1.8 indexes by byte and returns the byte value
=> 104
>> name[4].chr
=> "h"

>> name = "        "
=>"34320122334...
Ruby 1.9
>> name = "yukihiro"
=> "yukihiro"
>> name[4]        # Ruby 1.9 indexes by character
=> "h"

>> name = "         "
=> "          "
>> name[2]
=> " "
>> nam...
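The difference between characters and bytes in Ruby 1.9 and later can be seen with any multi-byte string. The Japanese greeting below is an assumption chosen for this sketch, not necessarily the text used on the slide.

```ruby
name = "\u3053\u3093\u306B\u3061\u306F"   # "konnichiwa" in hiragana (sample text)

puts name.encoding     # encoding tag carried by the string
puts name.length       # number of characters
puts name.bytesize     # number of bytes
puts name[2]           # third character
puts name.getbyte(2)   # third byte, which is what Ruby 1.8 name[2] returned
```

Each hiragana character takes 3 bytes in UTF8, so the character count and the byte count differ by a factor of three here.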
Ruby

       Use Ruby 1.9


For Ruby 1.8 (if you must)...
        $KCODE = 'u'
        require 'jcode'
Database
Database
A) Character encoding
   i. client
   ii. connection
   iii. server
B) Collation
SQL client   connection   database
Check the database settings:
   always use the same character set for client, connection, and server
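A consistency check over the settings can be sketched in plain Ruby. The keys are MySQL's character_set_* system variables; the values are invented sample data standing in for what SHOW VARIABLES LIKE 'character_set%' might report, and `charset_mismatches` is a hypothetical helper.

```ruby
# Invented sample values for illustration; no database is queried here.
settings = {
  "character_set_client"     => "utf8",
  "character_set_connection" => "utf8",
  "character_set_results"    => "utf8",
  "character_set_database"   => "latin1",
  "character_set_server"     => "latin1"
}

# Returns the settings that differ from the expected character set.
def charset_mismatches(settings, expected)
  settings.reject { |_name, charset| charset == expected }
end

charset_mismatches(settings, "utf8").each do |name, charset|
  puts "#{name} is #{charset}, expected utf8"
end
```

A mismatch like the one in the sample data is a typical source of mojibake: the client sends UTF8 bytes and the server stores them as if they were Latin-1.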
Collation


 Different Languages
Alphabetize Differently
Collation
Swedish          German

Alingsås         Ägypten
Borgholm         Äthiopien
Eslöv            Afghanistan
Flen  ...
Collation


1. Sorting
2. Equality
e   é
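Both sides of collation can be illustrated in Ruby 2.2 or later. Ruby has no locale-aware collation in its core library, so this sketch only shows what happens without one; the word list is taken from the German column above, and `strip_accents` is a hypothetical, rough helper.

```ruby
# 1. Sorting: the default sort compares code points, so Ä and Ö land after Z.
countries = ["\u00D6sterreich", "Bolivien", "\u00C4gypten", "Venezuela", "Afghanistan"]
puts countries.sort.inspect

# 2. Equality: é can be one code point or e plus a combining accent.
composed   = "\u00E9"
decomposed = "e\u0301"
puts composed == decomposed
puts composed.unicode_normalize(:nfc) == decomposed.unicode_normalize(:nfc)

# Rough accent-insensitive comparison: decompose, then drop combining marks.
def strip_accents(text)
  text.unicode_normalize(:nfd).gsub(/\p{Mn}/, "")
end
puts strip_accents(composed) == "e"
```

Whether e and é should compare as equal is a collation decision, which is why the database setting matters for lookups as well as for ordering.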
4. Output HTML for Display
Content Type

• Setting the content-type tells the browser
  how to display the text
  • meta tag
  • http header
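Both places to declare the content type can be shown with a Rack-style response built from plain Ruby objects. This is a sketch under assumptions: no framework is loaded, `html_response` is a hypothetical method, and the body text is assumed to be HTML-escaped already.

```ruby
CHARSET = "utf-8"

# Returns a [status, headers, body] triple in the style Rack applications use.
def html_response(body_text)
  html = <<~HTML
    <!DOCTYPE html>
    <html>
      <head>
        <meta http-equiv="Content-Type" content="text/html; charset=#{CHARSET}">
      </head>
      <body>#{body_text}</body>
    </html>
  HTML
  headers = { "Content-Type" => "text/html; charset=#{CHARSET}" }  # http header
  [200, headers, [html.encode("UTF-8")]]
end

status, headers, body = html_response("\u3053\u3093\u306B\u3061\u306F")
puts status
puts headers["Content-Type"]
puts body.first
```

Declaring the same character set in the header and in the meta tag avoids the browser having to guess, as long as the bytes sent really are in that encoding.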
Questions?




http://www.flickr.com/photos/daswunderkind/2689195410/
Sarah Allen's talk at Ruby Kaigi 2010 about how to deal with text in multiple languages when building web applications in Ruby.

  • Mightyverse is for people interested in language. We are collecting videos of words, phrases, and sentences translated from one language to another.

  • International Web Application Development

    1. International Web Application Development
    2. ookina umi no youni gengo ga arimasu. There is an ocean of language. http://www.flickr.com/photos/jimbrekke/429292020/
    3. demo watashitachi wa webu apurike-shon o kaihatsu suru toki... But when we develop web applications...
    4. ido no nakano kaeru no youni we are like the frog in the well, http://www.flickr.com/photos/clickykbd/2650909663/
    5. jibuntachi no gengo dake o kangaemasu only thinking about our own language http://www.flickr.com/photos/clickykbd/2650909663/
    6. demo webu wa ookina umi no youna mono desu. but the web is a big ocean. http://www.flickr.com/photos/jimbrekke/429292020/
    7. Sarah Allen @ultrasaurus Mightyverse sara aren desu. mightyverse o kaihatsu shite imasu. I am Sarah Allen. I develop Mightyverse.
    8. San Francisco san furanshisuko ni sunde imasu. I live in San Francisco.
    9. Mightyverse
    10. mojibake
    11. Character Encoding: UTF8, UTF16, UTF32, JIS, Shift-JIS, EUC
    12. Encoding Vocabulary. Code point: the number Unicode assigns to a single character; an encoding stores it as one or more bytes
    13. Unicode. UTF8 - variable length (1, 2, 3, or 4 bytes). UTF16 - variable length (2 or 4 bytes). UTF32 - fixed width (4 bytes)
    14. UTF8. U+0000 to U+007F (decimal 0 to 127): 1 byte. ASCII text is also valid UTF8. A set high bit indicates a multi-byte character
    15. High bits are used to indicate how many bytes are used to represent a specific character. Software can easily read a UTF8 stream, even starting in the middle. http://tools.ietf.org/html/rfc3629#section-3
    16. UTF8. Common for internet and file system format • XML: default encoding • Flash: only encoding
    17. UTF8 Disadvantages. UTF-8 encoded text may be larger than in a legacy encoding (for example, 3 bytes per kanji instead of 2 in Shift-JIS). Possible to split a string mid-character when cutting by bytes. Excessive unification (Han unification: similar Chinese, Japanese, and Korean characters share code points)
    18. Caution. Not all implementations are complete. For example, the MySQL 5 "utf8" character set stores at most 3 bytes per character (utf8mb4, added in MySQL 5.5.3, stores up to 4)
    19. Most living languages fit in 3 bytes of UTF8: the code points of the "Basic Multilingual Plane" (U+0000 to U+FFFF) http://www.siriusict.com/2010/08/06/character-encoding-unicode-utf-8-and-a-bit-of-chauvinism-explained-for-the-masses-2/
    20. http://globalmoxie.com/blog/klingon-not-spoken-here.shtml In May 2001, the Unicode Technical Committee rejected the Klingon proposal; however, Michael Everson created a mapping of pIqaD into the Private Use Area of Unicode, which is listed in the ConScript Unicode Registry (U+F8D0 to U+F8FF). http://en.wikipedia.org/wiki/Klingon_writing_systems
    21. The tengwar font has been proposed for the Unicode standard. The codepoints are subject to change; the range U+016080 to U+0160FF in the SMP is tentatively allocated for tengwar according to the current Unicode roadmap. http://en.wikipedia.org/wiki/Tengwar
    22. You need an appropriate font installed to display the characters. http://en.wikipedia.org/wiki/Tengwar
    23. Web Application Story
    24. 1. HTML Form post 2. Ruby code 3. Write to Database 4. Output HTML for Display
    25. HTML Form Post
    26. HTTP headers • You can specify what character set you want back when you send a form post • This is informational for the server • Just setting these won’t change how your app behaves, unless your web app has code for that
    27. Ruby code
    28. Ruby code. Most web applications don’t parse text. If yours does, you will need to think about Ruby 1.8 vs. Ruby 1.9
    29. Ruby 1.8 >> name = "Yukihiro" => "Yukihiro" >> name[4] => 104 >> name[4].chr => "h" >> name = " " => "3432012233432022233432012043432012413432 257" >> name[2] => 147 >> name[2].chr => ?
    30. Ruby 1.9 >> name = "yukihiro" => "yukihiro" >> name[4] => "h" >> name = " " => " " >> name[2] => " " >> name[0] => " "
    31. Ruby. Use Ruby 1.9. For Ruby 1.8 (if you must)... require 'jcode'
    32. Database
    33. Database A) Character encoding i. client ii. connection iii. server B) Collation
    34. SQL client, connection, database
    35. Check the database settings: always use the same character set for client, connection, and server
    36. Collation. Different Languages Alphabetize Differently
    37. Collation. Swedish: Alingsås, Borgholm, Eslöv, Flen, Hässleholm, Tranås, Vetlanda, Växjö, Ängelholm, Örnsköldsvik, Östersund. German: Ägypten, Äthiopien, Afghanistan, Bolivien, Dänemark, Deutschland, Jamaika, Marokko, Österreich, Venezuela
    38. Collation 1. Sorting 2. Equality
    39. e é
    40. 4. Output HTML for Display
    41. Content Type • Setting the content-type tells the browser how to display the text • meta tag • http header
    42. Questions? http://www.flickr.com/photos/daswunderkind/2689195410/