Encodings - Ruby 1.8 and Ruby 1.9

- history of encodings
- encodings in ruby 1.8
- encodings in ruby 1.9

Published in: Technology

Transcript of "Encodings - Ruby 1.8 and Ruby 1.9"

  1. Encodings: Ruby 1.8 and 1.9
     Vlad ZLOTEANU (@vladzloteanu), Software Engineer @ Dimelo
     #ParisRB, December 12, 2001
     Copyright Dimelo SA, www.dimelo.com
  2. Motto: "There Ain't No Such Thing As Plain Text" (Joel Spolsky)
  3. ASCII (1963)
     - historically: derived from telegraphic codes
     - 7 bits to encode 128 chars
     - included: English alphabet, digits, punctuation marks, control chars
     - what about chars from other languages?

         "A".unpack("C*") # => [65]
         "a".unpack("C*") # => [97]
         "c".unpack("C*") # => [99]
  4. iso-8859-X
     - idea: use the 8th bit -> 128 new positions
     - 8-bit encoding -> 256 chars
     - iso-8859-1 (Latin-1) and the closely related windows-1252
       - slots 160 to 255 hold the additional chars
       - covers most Western European languages: French, German, etc.
       - default charset in many browsers
     - iso-8859-2: most Central and Eastern European languages
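A minimal sketch of what "128 new positions" means in practice, assuming Ruby 1.9 or later. The helper name `decode_byte` is made up for this illustration. The same high byte is read under two different 8-bit code pages and comes out as two different characters, while the ASCII half stays identical.

```ruby
# Hypothetical helper: interpret one byte under a given 8-bit code page
# and return the resulting character as UTF-8.
def decode_byte(byte, code_page)
  [byte].pack("C").force_encoding(code_page).encode("UTF-8")
end

latin1 = decode_byte(0xFE, "ISO-8859-1") # U+00FE, small letter thorn
latin2 = decode_byte(0xFE, "ISO-8859-2") # U+0163, small t with cedilla

puts latin1.ord.to_s(16) # prints fe
puts latin2.ord.to_s(16) # prints 163

# The lower 128 positions are plain ASCII in both code pages.
puts decode_byte(0x41, "ISO-8859-1") == decode_byte(0x41, "ISO-8859-2")
```

Without knowing which code page a byte was written in, there is no way to tell which of the two characters was meant.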
  5. Issues
     - can't combine 2 different languages from 2 different encodings
     - most Asian languages have more than 256 chars

         "café".encode("ISO-8859-1").unpack("C*") # => [99, 97, 102, 233]
         "Ionuţ".encode("ISO-8859-2").unpack("C*") # => [73, 111, 110, 117, 254]
         "Ionuţ aime le café".encode("ISO-8859-1").unpack("C*")
         # Encoding::UndefinedConversionError: U+0163 from UTF-8 to ISO-8859-1
  6. Unicode
     - the goal of Unicode was literally to provide a character set that includes all characters in use today
     - each letter maps to a code point (a theoretical symbol)
       - "A" is the same code point whatever the font or style, but differs from "a"
       - rules for uppercase, lowercase, normalization, decomposition, etc.
     - codespace of 1.1M code points (from 0 to 10FFFF), about 110k chars assigned
     - code points 0 to 255 match Latin-1 (we can think of Unicode as a superset of Latin-1)
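The Latin-1 superset claim and the limits of the codespace can be checked directly. This is a small sketch assuming Ruby 1.9 or later; `latin1_matches_unicode?` is a name invented for the example.

```ruby
# True when every Latin-1 byte value maps to the Unicode code point
# carrying the same number.
def latin1_matches_unicode?
  (0..255).all? do |byte|
    byte.chr(Encoding::ISO_8859_1).encode(Encoding::UTF_8).ord == byte
  end
end

puts latin1_matches_unicode? # prints true

# Upper and lower case are distinct code points.
puts "A".ord # prints 65
puts "a".ord # prints 97

# The codespace ends at U+10FFFF; the next value is rejected.
begin
  0x110000.chr(Encoding::UTF_8)
rescue RangeError => e
  puts "rejected: #{e.message}"
end
```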
  7. Unicode (2)
     - Unicode enables processing, storage and interchange of text data no matter what the platform, no matter what the program, no matter the language
     - ... but how should we store those magical "code points"?

         "café".codepoints.to_a # => [99, 97, 102, 233]
         "café".encode("ISO-8859-1").unpack("C*") # => [99, 97, 102, 233]
         "Ionuţ 愛して le καφές".codepoints.to_a
         # => [73, 111, 110, 117, 355, 32, 24859, 12375, 12390, 32, 108, 101, 32, 954, 945, 966, 941, 962]
  8. UTF-8
     - encoding scheme for Unicode
     - every code point from 0 to 127 is stored in a single byte
     - code points 128 and above are stored using 2 to 4 bytes

         "Café".unpack("U*") # => [67, 97, 102, 233]
         "Café".encode("UTF-8").unpack("C*") # => [67, 97, 102, 195, 169]
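To show where the extra bytes come from, here is a hand-rolled encoder for a single code point. It is only a sketch: `utf8_bytes` is a hypothetical helper, not a Ruby API, and it does not reject surrogate code points the way a strict encoder would.

```ruby
# Encode one Unicode code point into its UTF-8 byte values.
def utf8_bytes(cp)
  case cp
  when 0x0000..0x007F     # 1 byte:  0xxxxxxx
    [cp]
  when 0x0080..0x07FF     # 2 bytes: 110xxxxx 10xxxxxx
    [0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)]
  when 0x0800..0xFFFF     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    [0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)]
  when 0x10000..0x10FFFF  # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    [0xF0 | (cp >> 18), 0x80 | ((cp >> 12) & 0x3F),
     0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)]
  else
    raise ArgumentError, "code point out of range: #{cp}"
  end
end

p utf8_bytes(0x43)   # "C" => [67]
p utf8_bytes(0xE9)   # "é" => [195, 169]
p utf8_bytes(0x30B3) # "コ" => [227, 130, 179]
```

The results can be compared against the built-in encoder with `[cp].pack("U").unpack("C*")`.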
  9. UTF-8 pluses & minuses
     - ASCII extension
     - can encode any Unicode char
     - self-synchronising, efficient to search with byte-oriented algorithms, efficient to encode
     - rfc2277: (inet) protocols MUST declare (supported) charsets, protocols MUST support at least UTF-8

         "コーヒー".unpack("U*") # => [12467, 12540, 12498, 12540]
         "コーヒー".unpack("C*") # => [227, 130, 179, 227, 131, 188, 227, 131, 146, 227, 131, 188]
         # minus: Asian languages take 3 bytes per char, about 1.5x the space of a 2-byte encoding
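"Self-synchronising" can be illustrated with a few lines: continuation bytes always have the bit pattern 10xxxxxx, so a reader dropped at an arbitrary offset can find the next character start. The helper names below are invented for this sketch, which assumes Ruby 1.9 or later.

```ruby
# A UTF-8 continuation byte always matches the bit pattern 10xxxxxx.
def continuation_byte?(byte)
  (byte & 0xC0) == 0x80
end

# First offset >= start at which a character begins.
def next_boundary(bytes, start)
  i = start
  i += 1 while i < bytes.size && continuation_byte?(bytes[i])
  i
end

coffee = "\u30B3\u30FC\u30D2\u30FC" # "コーヒー"
bytes = coffee.unpack("C*")

p next_boundary(bytes, 0) # 0: already on a boundary
p next_boundary(bytes, 1) # 3: skips the two continuation bytes
p next_boundary(bytes, 5) # 6

# The size penalty for this text: 12 bytes in UTF-8, 8 in UTF-16.
puts coffee.bytesize
puts coffee.encode("UTF-16BE").bytesize
```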
  10. What you should remember
      - text CONTENT and ENCODING are two different concepts
      - Unicode is a map: "symbol" <-> integer code point
      - Latin-1 is a single-byte encoding for Western languages
      - UTF-8 is a multibyte encoding for Unicode
      - USE UTF-8!
  11. Ruby 1.8 Unicode Support
      - a string is just a collection of bytes --> dealing with encodings is left to the developer
      - issues: index retrieval, slicing, regexp, etc.
      - "".size will always count bytes (affects validates_size_of …)
      - limited Unicode support (/u modifier)

          "Café".size # => 5
          "Café".reverse # => "\251\303faC"
          "Café".scan(/./) # => ["C", "a", "f", "\303", "\251"]
          "Café".scan(/./u) # => ["C", "a", "f", "é"]
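The Ruby 1.8 behaviour can be reproduced on a newer interpreter by relabelling a string as binary. This is a sketch under that assumption (Ruby 1.9 or later), with `byte_view` as a made-up helper name.

```ruby
# Hypothetical helper: view a string the way Ruby 1.8 saw it, as raw bytes.
def byte_view(str)
  str.dup.force_encoding(Encoding::BINARY)
end

coffee = "Caf\u00E9" # "Café"
raw = byte_view(coffee)

p raw.size                 # 5: bytes, as in Ruby 1.8
p coffee.unpack("U*").size # 4: the 1.8-era way to count characters
p raw[0, 4]                # a 4-byte slice cuts "é" in half

# Reversing bytes instead of characters yields invalid UTF-8.
p raw.reverse.force_encoding("UTF-8").valid_encoding? # false
```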
  12. Ruby 1.8 Unicode Support (2)
      - regex: aware of 4 encodings: none, EUC, Shift_JIS, UTF-8
      - ways to set the source encoding:
        - command-line -K param
        - RUBYOPT

          ruby -e 'puts "Café".scan(/./).inspect'
          ["C", "a", "f", "\303", "\251"]
          ruby -Ku -e 'puts "Café".scan(/./).inspect'
          ["C", "a", "f", "é"]
          export RUBYOPT=-Ku
          ruby -e 'puts "Café".scan(/./).inspect'
          ["C", "a", "f", "é"]
  13. Ruby 1.8 - Transcoding
      - Iconv library: ships with Ruby, handles transcoding
      - TRANSLIT option: approximate chars that have no equivalent in the target encoding
      - IGNORE option: drop chars that cannot be converted

          require "iconv"
          utf8_coffee = "Café" # => "Café"
          utf8_to_latin1 = Iconv.new("LATIN1//TRANSLIT//IGNORE", "UTF8") # => #<Iconv:0x007f8ba1930060>
          utf8_to_latin1.iconv(utf8_coffee).size # => 4
          utf8_to_latin1.iconv("On and on… and on…") # => "On and on... and on..."
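Iconv was dropped from the standard library in Ruby 2.0, so on current interpreters the same job falls to `String#encode`. The sketch below is a rough approximation of the two Iconv options, not an exact equivalent: `translit_to_latin1` and `ignore_to_latin1` are invented names, and the replacement table covers only the one character used here.

```ruby
# Rough stand-in for //TRANSLIT: supply replacements by hand.
# Characters missing from the table still raise an error.
def translit_to_latin1(text)
  text.encode("ISO-8859-1", fallback: { "\u2026" => "..." })
end

# Rough stand-in for //IGNORE: drop whatever cannot be converted.
def ignore_to_latin1(text)
  text.encode("ISO-8859-1", invalid: :replace, undef: :replace, replace: "")
end

text = "On and on\u2026 and on\u2026" # contains U+2026, the ellipsis char

puts translit_to_latin1(text) # On and on... and on...
puts ignore_to_latin1(text)   # On and on and on
puts translit_to_latin1("Caf\u00E9").bytesize # 4
```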
  14. Ruby 1.9 & M17N
      - multilingualization (M17N): a CSI (Code Set Independent) approach
        - localization for more than one language should be available in a single piece of software
        - more than one language should be usable at the same time
      - this differs from conventional languages (Java, Python, Perl), which follow the UCS philosophy
      1. Source encoding: all source files have an encoding
         - new __ENCODING__ keyword

          # in irb
          __ENCODING__ # => #<Encoding:UTF-8>
  15. Ruby 1.9 – source encoding
      - new way to set the encoding: magic comment
      - priority:
        - .rb files: magic comment > command-line -K option > RUBYOPT -K > shebang -K > US-ASCII
        - command line / standard input: magic comment > command-line -K option > RUBYOPT -K > system locale

          # encoding: UTF-8
          puts __ENCODING__ # => UTF-8
  16. Ruby 1.9 – String class
      - String: a collection of encoded data
      - each String object has an encoding
      - size counts characters, not bytes (multibyte-aware)
      - 3 new enumerator methods: each_byte, each_char, each_codepoint

          "café".size # => 4
          "café".bytesize # => 5
          "café".each_byte.map { |byte| byte } # => [99, 97, 102, 195, 169]
          "café".each_char.map { |char| char } # => ["c", "a", "f", "é"]
          "café".each_codepoint.map { |codepoint| codepoint } # => [99, 97, 102, 233]
  17. Ruby 1.9 – String class (Transcoding)
      - Strings with different encodings can "coexist" in the same program, and can be merged when their encodings are compatible
      - new way to transcode: String#encode
      - force_encoding only relabels the bytes, and it modifies the receiver

          latin_1_coffee = "café".encode("ISO-8859-1") # => "caf\xE9"
          latin_1_coffee.bytesize # => 4
          wrong_encoded_coffee = latin_1_coffee.force_encoding("UTF-8") # => "caf\xE9"
          latin_1_coffee.encoding # => #<Encoding:UTF-8> (same object, relabelled in place)
          wrong_encoded_coffee.scan /./
          # ArgumentError: invalid byte sequence in UTF-8
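A short sketch of the difference between transcoding and relabelling, and of what "merging" strings involves. It assumes Ruby 1.9 or later; `safe_concat` is a hypothetical helper, and converting the right-hand string to the left-hand encoding is just one possible policy.

```ruby
utf8 = "caf\u00E9"                              # "café", 5 bytes
latin1 = utf8.encode("ISO-8859-1")              # transcoded: 4 bytes, valid
mislabeled = latin1.dup.force_encoding("UTF-8") # same 4 bytes, wrong label

p latin1.valid_encoding?     # true
p mislabeled.valid_encoding? # false: the bytes were never converted

# Concatenation only works when the encodings are compatible.
# This helper transcodes the second string when they are not.
def safe_concat(a, b)
  a + b
rescue Encoding::CompatibilityError
  a + b.encode(a.encoding)
end

p safe_concat(utf8, latin1)          # both halves end up as UTF-8
p safe_concat(utf8, latin1).encoding
```

Checking `valid_encoding?` before using data from an untrusted source avoids the `ArgumentError` shown on the slide.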
  18. Ruby 1.9 - Internal and external encoding
      - what about non-literal Strings (those that come from I/O)?
      2. Encoding.default_external: default for the external encoding
         - derived from LANG on Unix/Linux
         - derived from the legacy system encoding on Windows
      3. Encoding.default_internal: default for the internal encoding
         - undefined by default (≊ default external)

          > cat show_encodings.rb
          open(__FILE__, "r:UTF-8:UTF-32") do |file|
            puts file.external_encoding.name
            puts file.internal_encoding.name
            file.each do |line|
              p [line.encoding.name, line[0..3]]
            end
          end
          > ruby show_encodings.rb
          UTF-8
          UTF-32
          ["UTF-32", "\uFEFF"]
          ["UTF-32", "\x00\x00\x00\x20"]
          ["UTF-32", "\x00\x00\x00\x20"]
          ["UTF-32", "\x00\x00\x00\x20"]
          ["UTF-32", "\x00\x00\x00\x20"]
          ["UTF-32", "\x00\x00\x00\x20"]
          ["UTF-32", "\x00\x00\x00\x65"]
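The more common direction is reading a legacy file into UTF-8. The sketch below writes Latin-1 bytes to a temporary file and reads them back with an explicit external:internal pair. It assumes Ruby 1.9 or later; `read_latin1_as_utf8` and the temp file prefix are made up for the example.

```ruby
require "tempfile"

# Write text as Latin-1 bytes, then read it back, letting Ruby transcode
# from the external (on disk) to the internal (in program) encoding.
def read_latin1_as_utf8(text)
  tmp = Tempfile.new("encodings-demo")
  begin
    tmp.binmode
    tmp.write(text.encode("ISO-8859-1"))
    tmp.close
    File.open(tmp.path, "r:ISO-8859-1:UTF-8") do |file|
      [file.external_encoding, file.internal_encoding, file.read]
    end
  ensure
    tmp.close unless tmp.closed?
    tmp.unlink
  end
end

external, internal, content = read_latin1_as_utf8("caf\u00E9")
p external         # #<Encoding:ISO-8859-1>
p internal         # #<Encoding:UTF-8>
p content.encoding # #<Encoding:UTF-8>
p content.bytesize # 5: four bytes on disk became five in memory
```

Setting `Encoding.default_external` and `Encoding.default_internal` applies the same pair to every IO object that does not specify its own.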
  19. What you should remember
      - Ruby 1.8 has limited (regexp-only) support for Unicode
        - watch out for slices, sizes, reverse, etc.
        - transcode with Iconv
      - Ruby 1.9 is encoding-aware
        - each source file has an Encoding
        - each String has an Encoding
        - IO: internal and external encoding
        - new iterators on String
  20. HTML/HTTP – declare encoding
      - HTTP header
      - meta tags

          Content-Type: text/html; charset=ISO-8859-1 # HTTP header
          <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
          <meta charset="utf-8"/>
          <?xml version="1.0" encoding="ISO-8859-1"?>
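On the receiving side, the declared charset is what tells a program how to label the bytes it was given. A minimal sketch, assuming Ruby 1.9 or later: `charset_from` and `decode_body` are hypothetical helpers, the UTF-8 default is an assumption of this example, and real HTTP clients handle more cases than this regular expression does.

```ruby
# Pull the charset parameter out of a Content-Type header value.
def charset_from(content_type, default = "UTF-8")
  match = content_type.to_s.match(/charset\s*=\s*"?([^\s;"]+)"?/i)
  match ? match[1] : default
end

# Label raw body bytes with the declared charset, then convert to UTF-8.
def decode_body(raw_bytes, content_type)
  encoding = Encoding.find(charset_from(content_type)) # ArgumentError if unknown
  raw_bytes.dup.force_encoding(encoding).encode("UTF-8")
end

body = [99, 97, 102, 233].pack("C*") # "café" as Latin-1 bytes
text = decode_body(body, "text/html; charset=ISO-8859-1")

p text.encoding   # #<Encoding:UTF-8>
p text.codepoints.to_a # [99, 97, 102, 233]
```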
  21. HTML – Encoding chars
      - encoding types:
        - directly in the declared encoding: "é"
        - named char entities: "&eacute;"
        - numeric char entities: "&#233;"
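Numeric entities can be produced from Ruby without any library. The sketch below shows two ways, assuming Ruby 1.9 or later: the `xml: :text` option of `String#encode`, which emits hexadecimal references, and a hand-written decimal variant. Both helper names are invented for the example.

```ruby
# Hexadecimal references via String#encode. The xml: :text option also
# escapes &, < and >.
def hex_entities(text)
  text.encode("US-ASCII", xml: :text)
end

# Decimal references, built by hand for every non-ASCII character.
def decimal_entities(text)
  text.each_char.map { |ch| ch.ord < 128 ? ch : format("&#%d;", ch.ord) }.join
end

coffee = "caf\u00E9" # "café"

puts hex_entities(coffee)     # caf&#xE9;
puts decimal_entities(coffee) # caf&#233;
```

Either output is pure ASCII, so it survives whatever charset the page ends up being served with.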
  22. Conclusion
      - use UTF-8
      - document (declare) encodings
      - write encoding-safe code
  23. References
      - James Gray's Encodings series
      - Joel Spolsky's blog post about encodings
      - Design and implementation of Ruby M17N
      - Internationalization in Ruby 1.9
  24. .end
      Merci! Thank you! Mulţumesc! ありがとう!
      ?
