Encodings - Ruby 1.8 and Ruby 1.9

- history of encodings
- encodings in Ruby 1.8
- encodings in Ruby 1.9

Published in: Technology


Transcript

  • 1. Encodings: Ruby 1.8 and 1.9
      Vlad ZLOTEANU (@vladzloteanu), Software Engineer @ Dimelo
      #ParisRB, December 12, 2001
      Copyright Dimelo SA, www.dimelo.com
  • 2. Motto: "There Ain't No Such Thing As Plain Text" (Joel Spolsky)
  • 3. ASCII (1963)
      - historically derived from telegraphic codes
      - 7 bits encode 128 characters
      - included: English alphabet, digits, punctuation marks, control characters
      - what about characters from other languages?
      "A".unpack("C*") => [65]
      "a".unpack("C*") => [97]
      "c".unpack("C*") => [99]
  • 4. ISO-8859-X
      - idea: use the 8th bit, which gives 128 new positions
      - 8-bit encoding: 256 characters
      - ISO-8859-1 (Latin-1), windows-1252: slots 160 to 255 hold the additional characters
      - covers most Western European languages: French, German, etc.
      - default charset in many browsers
      - ISO-8859-2: most Eastern European languages
  • 5. Issues
      - you can't combine 2 different languages from 2 different encodings
      - most Asian languages have more than 256 characters
      "café".encode("ISO-8859-1").unpack("C*") => [99, 97, 102, 233]
      "Ionuţ".encode("ISO-8859-2").unpack("C*") => [73, 111, 110, 117, 254]
      "Ionuţ aime le café".encode("ISO-8859-1").unpack("C*")
      Encoding::UndefinedConversionError: U+0163 from UTF-8 to ISO-8859-1
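The failure on this slide can be handled explicitly. The sketch below shows the strict conversion raising and a lossy conversion that substitutes "?" for anything Latin-1 cannot represent. The variable names are illustrative, and the non-ASCII characters are written as \u escapes so the snippet does not depend on the encoding of the file it is pasted into.

```ruby
# "Ionut aime le cafe" with U+0163 (t with cedilla) and U+00E9 (e with acute).
text = "Ionu\u0163 aime le caf\u00E9"

# Strict conversion: U+0163 has no slot in ISO-8859-1, so encode raises.
strict_error = nil
begin
  text.encode("ISO-8859-1")
rescue Encoding::UndefinedConversionError => e
  strict_error = e.class
end

# Lossy conversion: replace every unrepresentable character with "?".
lossy = text.encode("ISO-8859-1", undef: :replace, replace: "?")
lossy_bytes = lossy.unpack("C*")

p strict_error
p lossy_bytes
```

Whether silently losing a character is acceptable depends on the application; raising is the safer default.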
  • 6. Unicode
      - the goal of Unicode is to provide a character set that includes all characters in use today
      - each letter maps to a code point (an abstract symbol)
      - "A" is the same code point whatever the font or style, but a different one from "a"
      - it also defines uppercase and lowercase mappings, rules for normalization, decomposition, etc.
      - codespace of about 1.1M code points (0 to 10FFFF), of which about 110k are characters
      - code points 0 to 255 match Latin-1, so Unicode can be thought of as a superset of Latin-1
  • 7. Unicode (2)
      - Unicode enables processing, storage and interchange of text data no matter what the platform, the program or the language
      - ... but how should we store those magical "code points"?
      "café".codepoints.to_a => [99, 97, 102, 233]
      "café".encode("ISO-8859-1").unpack("C*") => [99, 97, 102, 233]
      "Ionuţ 愛して le καφές".codepoints.to_a => [73, 111, 110, 117, 355, 32, 24859, 12375, 12390, 32, 108, 101, 32, 954, 945, 966, 941, 962]
  • 8. UTF-8
      - an encoding scheme for Unicode
      - every code point from 0 to 127 is stored in a single byte
      - code points 128 and above are stored using 2 to 4 bytes
      "Café".unpack("U*") => [67, 97, 102, 233]
      "Café".encode("UTF-8").unpack("C*") => [67, 97, 102, 195, 169]
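To make the variable-width rule concrete, here is a small sketch that encodes one code point from each UTF-8 length class and counts the resulting bytes. The sample code points are chosen for illustration only.

```ruby
# One code point per UTF-8 length class.
code_points = [
  0x41,    # "A", ASCII range              -> 1 byte
  0xE9,    # e with acute, Latin-1 range   -> 2 bytes
  0x30B3,  # katakana "ko"                 -> 3 bytes
  0x1F600  # an emoji outside the BMP      -> 4 bytes
]

# Array#pack("U") turns a code point into its UTF-8 encoded string.
utf8_lengths = code_points.map { |cp| [cp].pack("U").bytesize }

p utf8_lengths
```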
  • 9. UTF-8 pluses & minuses
      - ASCII extension
      - can encode any Unicode character
      - self-synchronising, efficient to search with byte-oriented algorithms, efficient to encode
      - RFC 2277: (Internet) protocols MUST declare the charsets they support, and MUST support at least UTF-8
      "コーヒー".unpack("U*") => [12467, 12540, 12498, 12540]
      "コーヒー".unpack("C*") => [227, 130, 179, 227, 131, 188, 227, 131, 146, 227, 131, 188]
      # Asian languages take about 1.5x more space than in a 2-byte encoding
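Two of the claims on this slide can be checked directly. The sketch below compares the byte size of the katakana word in UTF-8 and UTF-16 (used here as a representative 2-byte encoding, which is an assumption about what the 1.5x figure is measured against) and then shows self-synchronisation: continuation bytes always have the bit pattern 10xxxxxx, so character boundaries can be found from any position.

```ruby
# Katakana for "coffee" (the word on the slide), written with \u escapes.
coffee = "\u30B3\u30FC\u30D2\u30FC"

utf8_size  = coffee.bytesize                     # 3 bytes per character
utf16_size = coffee.encode("UTF-16LE").bytesize  # 2 bytes per character
ratio = utf8_size.to_f / utf16_size

# Self-synchronisation: drop continuation bytes (10xxxxxx) and
# exactly one lead byte per character remains.
lead_bytes = coffee.bytes.reject { |b| (b & 0b1100_0000) == 0b1000_0000 }

p [utf8_size, utf16_size, ratio]
p lead_bytes
```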
  • 10. What you should remember
      - text CONTENT and ENCODING are two different concepts
      - Unicode is a map: "symbol" <-> integer code point
      - Latin-1 is a single-byte encoding for Western languages
      - UTF-8 is a multibyte encoding for Unicode
      - USE UTF-8!
  • 11. Ruby 1.8 Unicode Support
      - a String is just a collection of bytes, so dealing with encodings is left to the developer
      - issues: index retrieval, slicing, regexp, etc.
      - "".size always counts bytes (this affects validates_size_of …)
      - limited Unicode support (/u regexp modifier)
      "Café".size => 5
      "Café".reverse => "\251\303faC"
      "Café".scan(/./) => ["C", "a", "f", "\303", "\251"]
      "Café".scan(/./u) => ["C", "a", "f", "é"]
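Ruby 1.8 is rarely available today, but its byte-oriented behaviour can be approximated on Ruby 1.9 or later by relabelling a string as binary. This is only a sketch of the effect under that assumption, not a faithful emulation of 1.8.

```ruby
cafe = "Caf\u00E9"  # "Cafe" with e-acute, UTF-8 encoded

# Relabel a copy as raw bytes: roughly how Ruby 1.8 saw every string.
raw = cafe.dup.force_encoding(Encoding::BINARY)

byte_count = raw.size   # counts bytes, like "Cafe".size on 1.8
char_count = cafe.size  # counts characters

# Reversing bytes splits the two-byte e-acute and yields invalid UTF-8.
reversed_bytes = raw.reverse.bytes
still_valid = raw.reverse.force_encoding("UTF-8").valid_encoding?

# Reversing the encoding-aware string keeps the character intact.
reversed_chars = cafe.reverse

p [byte_count, char_count, reversed_bytes, still_valid]
```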
  • 12. Ruby 1.8 Unicode Support (2)
      - regexps are aware of 4 encodings: none, EUC, Shift_JIS, UTF-8
      - ways to set the source encoding: the -K command-line option, or RUBYOPT
      ruby -e 'puts "Café".scan(/./).inspect'
      ["C", "a", "f", "\303", "\251"]
      ruby -Ku -e 'puts "Café".scan(/./).inspect'
      ["C", "a", "f", "é"]
      export RUBYOPT=-Ku
      ruby -e 'puts "Café".scan(/./).inspect'
      ["C", "a", "f", "é"]
  • 13. Ruby 1.8 - Transcoding
      - Iconv library: ships with Ruby, handles transcoding
      - //TRANSLIT option (transliterate characters missing from the target encoding)
      - //IGNORE option (drop characters that cannot be converted)
      require "iconv"
      utf8_coffee = "Café" => "Café"
      utf8_to_latin1 = Iconv.new("LATIN1//TRANSLIT//IGNORE", "UTF8") => #<Iconv:0x007f8ba1930060>
      utf8_to_latin1.iconv(utf8_coffee).size => 4
      utf8_to_latin1.iconv("On and on… and on…") => "On and on... and on..."
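Iconv was removed from the standard library in Ruby 2.0, so on current Rubies a similar result has to come from String#encode. The sketch below imitates the transliteration of the ellipsis with the `fallback:` option. The mapping table is a hand-written assumption covering only this one character; it is not a general replacement for //TRANSLIT.

```ruby
# "On and on... and on..." written with U+2026 (horizontal ellipsis).
utf8_text = "On and on\u2026 and on\u2026"

# Hand-written transliteration table (illustrative, deliberately tiny).
# Characters missing from both Latin-1 and this table still raise
# Encoding::UndefinedConversionError.
translit = { "\u2026" => "..." }

latin1_text = utf8_text.encode("ISO-8859-1", fallback: translit)

p latin1_text.encoding
p latin1_text.bytesize
```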
  • 14. Ruby 1.9 & M17N
      - multilingualization (M17N): a CSI (code set independent) approach
      - localization for more than one language should be available in a single piece of software
      - more than one language should be usable at the same time
      - this differs from conventional languages (Java, Python, Perl), which follow the UCS (universal character set) philosophy
      1. Source encoding: every source file has an encoding
      - new __ENCODING__ keyword
      In irb:
      __ENCODING__ => #<Encoding:UTF-8>
  • 15. Ruby 1.9 - source encoding
      - new way to set the encoding: the magic comment
      - priority for .rb files: magic comment > command-line -K option > RUBYOPT -K > shebang -K > US-ASCII
      - priority for command line / standard input: magic comment > command-line -K option > RUBYOPT -K > system locale
      # encoding: UTF-8
      puts __ENCODING__
      => UTF-8
  • 16. Ruby 1.9 - String class
      - a String is a collection of encoded data
      - each String object has an encoding
      - size counts characters, including multibyte ones
      - 3 new enumerator methods: each_byte, each_char, each_codepoint
      "café".size => 4
      "café".bytesize => 5
      "café".each_byte.map{|byte| byte} => [99, 97, 102, 195, 169]
      "café".each_char.map{|char| char} => ["c", "a", "f", "é"]
      "café".each_codepoint.map{|codepoint| codepoint} => [99, 97, 102, 233]
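The difference between characters and bytes matters most when slicing. A minimal sketch, with an illustrative sample string, contrasting character-based indexing with String#byteslice:

```ruby
word = "caf\u00E9 au lait"  # e-acute occupies two bytes in UTF-8

# Character-based slice: takes 4 whole characters.
first_four_chars = word[0, 4]

# Byte-based slice: 4 bytes cut the e-acute in half.
first_four_bytes = word.byteslice(0, 4)

p first_four_chars.bytes
p first_four_bytes.bytes
p first_four_bytes.valid_encoding?
```

Truncating to a byte budget (for example a database column limit) therefore needs a validity check or a character-aware loop.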
  • 17. Ruby 1.9 - String class (Transcoding)
      - Strings with different encodings can coexist in the same program, and can be merged when their encodings are compatible
      - new way to transcode: String#encode
      latin_1_coffee = "café".encode("ISO-8859-1") => "caf\xE9"
      latin_1_coffee.bytesize => 4
      wrong_encoded_coffee = latin_1_coffee.force_encoding("UTF-8") => "caf\xE9"
      latin_1_coffee.encoding => #<Encoding:UTF-8>   # force_encoding relabels the receiver in place
      wrong_encoded_coffee.scan /./
      ArgumentError: invalid byte sequence in UTF-8
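A mislabelled string like the one on this slide can be detected with String#valid_encoding? and repaired when the real encoding is known. The sketch below assumes the bytes are really Latin-1; in practice that knowledge has to come from outside the string.

```ruby
# Latin-1 bytes for "cafe" with e-acute, wrongly labelled as UTF-8.
mislabelled = [99, 97, 102, 233].pack("C*").force_encoding("UTF-8")
looks_valid = mislabelled.valid_encoding?

# force_encoding only changes the label; encode rewrites the bytes.
repaired = mislabelled.dup
                      .force_encoding("ISO-8859-1")  # correct the label
                      .encode("UTF-8")               # then transcode

p looks_valid
p repaired.bytes
p repaired.valid_encoding?
```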
  • 18. Ruby 1.9 - Internal and external encoding
      - what about non-literal Strings (those that come from I/O)?
      2. Encoding.default_external: default for the external encoding
      - derived from LANG on Unix/Linux
      - derived from the legacy system encoding on Windows
      3. Encoding.default_internal: default for the internal encoding (≊ default external)
      - undefined by default
      > cat show_encodings.rb
      open(__FILE__, "r:UTF-8:UTF-32") do |file|
        puts file.external_encoding.name
        puts file.internal_encoding.name
        file.each do |line|
          p [line.encoding.name, line[0..3]]
        end
      end
      > ruby show_encodings.rb
      UTF-8
      UTF-32
      ["UTF-32", "\uFEFF"]
      ["UTF-32", "\x00\x00\x00\x20"]
      ["UTF-32", "\x00\x00\x00\x20"]
      ["UTF-32", "\x00\x00\x00\x20"]
      ["UTF-32", "\x00\x00\x00\x20"]
      ["UTF-32", "\x00\x00\x00\x20"]
      ["UTF-32", "\x00\x00\x00\x65"]
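A more common use of the "r:external:internal" mode string is reading a legacy Latin-1 file and receiving UTF-8 strings. The sketch below writes a small temporary file to stay self-contained; the file name prefix and variable names are illustrative, and Tempfile.create assumes Ruby 2.1 or later.

```ruby
require "tempfile"

external = internal = line = nil

Tempfile.create("latin1_demo") do |tmp|
  # Write "cafe" with e-acute plus a newline as raw Latin-1 bytes.
  tmp.binmode
  tmp.write([99, 97, 102, 233, 10].pack("C*"))
  tmp.flush

  # External encoding: what is on disk. Internal: what the program sees.
  File.open(tmp.path, "r:ISO-8859-1:UTF-8") do |file|
    external = file.external_encoding
    internal = file.internal_encoding
    line = file.gets  # transcoded to UTF-8 while reading
  end
end

p [external.name, internal.name]
p line.bytes
```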
  • 19. What you should remember
      - Ruby 1.8 has limited (regexp-only) support for Unicode
      - watch out for slices, sizes, reverse, etc.
      - transcode with Iconv
      - Ruby 1.9 is encoding-aware
      - each source file has an Encoding
      - each String has an Encoding
      - IO: internal and external encoding
      - new iterators on String
  • 20. HTML/HTTP - declare the encoding
      - HTTP header:
      Content-Type: text/html; charset=ISO-8859-1
      - meta tags:
      <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
      <meta charset="utf-8"/>
      - XML declaration:
      <?xml version="1.0" encoding="ISO-8859-1"?>
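On the receiving side, a declared charset can be used to label the raw bytes before any processing. The sketch below uses a hypothetical helper, `label_body`, with a deliberately simple regexp; real Content-Type parsing has more cases, and falling back to UTF-8 is an assumption made here for illustration.

```ruby
# Hypothetical helper: label raw bytes with the charset named in a
# Content-Type header value, defaulting to UTF-8 when none is given.
def label_body(raw_body, content_type, default = "UTF-8")
  charset = content_type[/charset="?([\w\-]+)/i, 1] || default
  # Raises ArgumentError if Ruby does not know the charset name.
  raw_body.dup.force_encoding(charset)
end

raw = [99, 97, 102, 233].pack("C*")  # Latin-1 bytes off the wire
labelled = label_body(raw, "text/html; charset=ISO-8859-1")
as_utf8 = labelled.encode("UTF-8")

p labelled.encoding
p as_utf8.bytes
```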
  • 21. HTML - encoding characters
      - typed directly in the declared encoding: "é"
      - named character entities: "&eacute;"
      - numeric character entities: "&#233;"
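Numeric character entities are simply the Unicode code point in decimal, so they can be generated from any encoding-aware string. A minimal sketch with a hypothetical helper, `to_numeric_entities`; it leaves ASCII untouched and therefore does not escape "&" or "<", which a real HTML escaper must also handle.

```ruby
# Hypothetical helper: replace every non-ASCII character with its
# decimal numeric character entity.
def to_numeric_entities(text)
  text.each_char.map { |ch|
    ch.ascii_only? ? ch : format("&#%d;", ch.ord)
  }.join
end

entity_text = to_numeric_entities("caf\u00E9")

puts entity_text
```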
  • 22. Conclusion
      - use UTF-8
      - document (declare) encodings
      - write encoding-safe code
  • 23. References
      - James Gray's Encodings series
      - Joel Spolsky's blog post about encodings
      - Design and implementation of Ruby M17N
      - Internationalization in Ruby 1.9
  • 24. .end Merci! Thank you! Mulţumesc ありがとう ?
