Encodings - Ruby 1.8 and Ruby 1.9

- history of encodings
- encodings in ruby 1.8
- encodings in ruby 1.9

Encodings - Ruby 1.8 and Ruby 1.9: Presentation Transcript

  • Encodings: Ruby 1.8 and 1.9
    Vlad ZLOTEANU (@vladzloteanu), Software Engineer @ Dimelo
    #ParisRB, December 12, 2001
    Copyright Dimelo SA, www.dimelo.com
  • Motto: “There Ain't No Such Thing As Plain Text” (Joel Spolsky)
  • ASCII (1963)
    - historically derived from telegraphic codes
    - 7 bits encode 128 characters
    - includes the English alphabet, digits, punctuation marks and control characters
    - what about characters from other languages?
    "A".unpack("C*") => [65]
    "a".unpack("C*") => [97]
    "c".unpack("C*") => [99]
  • ISO-8859-X
    - idea: use the 8th bit -> 128 new positions
    - 8-bit encoding -> 256 characters
    - ISO-8859-1 (Latin-1), windows-1252: slots 160 to 255 hold the additional characters; covers most Western European languages (French, German, etc.); default charset in many browsers
    - ISO-8859-2: covers most Eastern European languages
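To make the "128 new positions" concrete, here is a minimal sketch showing that one and the same byte stands for different characters depending on which ISO-8859 variant is assumed. The helper name `decode_as` is made up for this illustration, and the code assumes Ruby 1.9 or later.

```ruby
# Hypothetical helper (not from the slides): interpret raw bytes in the
# named encoding and return the equivalent UTF-8 string.
def decode_as(bytes, encoding_name)
  # dup so the caller's string keeps its own encoding label
  bytes.dup.force_encoding(encoding_name).encode("UTF-8")
end

raw = [254].pack("C") # a single byte with value 0xFE

# The byte never changes; only the declared encoding does.
p decode_as(raw, "ISO-8859-1").codepoints.to_a # [254] (U+00FE, thorn)
p decode_as(raw, "ISO-8859-2").codepoints.to_a # [355] (U+0163, t with cedilla)
```

This is why a byte stream is not readable text until its encoding is known.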
  • Issues
    - can't combine 2 different languages from 2 different encodings
    - most Asian languages have more than 256 characters
    "café".encode("ISO-8859-1").unpack("C*") => [99, 97, 102, 233]
    "Ionuţ".encode("ISO-8859-2").unpack("C*") => [73, 111, 110, 117, 254]
    "Ionuţ aime le café".encode("ISO-8859-1").unpack("C*")
    Encoding::UndefinedConversionError: U+0163 from UTF-8 to ISO-8859-1
  • Unicode
    - the goal of Unicode is literally to provide a character set that includes all characters in use today
    - each letter maps to a code point (a theoretical symbol)
    - "A" is the same character whatever font or style it is drawn in, but it is different from "a"
    - defines uppercase, lowercase, rules for normalization, decomposition, etc.
    - codespace of 1.1M code points (from 0 to 10FFFF), of which about 110k are characters
    - code points 0 to 255 match Latin-1 (we can think of Unicode as a superset of Latin-1)
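The normalization and decomposition rules mentioned on this slide matter because the same visible text can be written with different code point sequences. The sketch below is an illustration only: it assumes Ruby 2.2 or later (for `String#unicode_normalize`, which does not exist in the 1.8 and 1.9 versions the talk covers), and `same_text?` is a hypothetical helper name.

```ruby
# Assumption: Ruby >= 2.2, where String#unicode_normalize is available.
# Hypothetical helper: compare two strings after NFC normalization.
def same_text?(a, b)
  a.unicode_normalize(:nfc) == b.unicode_normalize(:nfc)
end

composed   = "caf\u00E9"  # e-acute as a single code point
decomposed = "cafe\u0301" # plain "e" followed by COMBINING ACUTE ACCENT

p composed == decomposed           # false: the code point sequences differ
p composed.codepoints.to_a.length   # 4
p decomposed.codepoints.to_a.length # 5
p same_text?(composed, decomposed) # true once both are normalized
```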
  • Unicode (2)
    - Unicode enables processing, storage and interchange of text data, no matter the platform, the program or the language
    - ... but how should we store those magical "code points"?
    "café".codepoints.to_a => [99, 97, 102, 233]
    "café".encode("ISO-8859-1").unpack("C*") => [99, 97, 102, 233]
    "Ionuţ 愛して le καφές".codepoints.to_a => [73, 111, 110, 117, 355, 32, 24859, 12375, 12390, 32, 108, 101, 32, 954, 945, 966, 941, 962]
  • UTF-8
    - an encoding scheme for Unicode
    - every code point from 0 to 127 is stored in a single byte
    - code points 128 and above are stored using 2 to 4 bytes
    "Café".unpack("U*") => [67, 97, 102, 233]
    "Café".encode("UTF-8").unpack("C*") => [67, 97, 102, 195, 169]
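The byte layout behind these numbers can be sketched by hand. The method below follows the standard UTF-8 bit patterns; `utf8_bytes` is a name invented for this illustration. It does not reject surrogate code points, which a real encoder would.

```ruby
# Illustrative sketch: turn one code point into its UTF-8 bytes.
# Surrogates (0xD800..0xDFFF) are not rejected here.
def utf8_bytes(cp)
  if cp < 0x80
    [cp]                                       # 0xxxxxxx
  elsif cp < 0x800
    [0xC0 | (cp >> 6),                         # 110xxxxx
     0x80 | (cp & 0x3F)]                       # 10xxxxxx
  elsif cp < 0x10000
    [0xE0 | (cp >> 12),                        # 1110xxxx
     0x80 | ((cp >> 6) & 0x3F),
     0x80 | (cp & 0x3F)]
  elsif cp <= 0x10FFFF
    [0xF0 | (cp >> 18),                        # 11110xxx
     0x80 | ((cp >> 12) & 0x3F),
     0x80 | ((cp >> 6) & 0x3F),
     0x80 | (cp & 0x3F)]
  else
    raise ArgumentError, "code point out of range: #{cp}"
  end
end

p utf8_bytes(97)    # [97]
p utf8_bytes(233)   # [195, 169]
p utf8_bytes(12467) # [227, 130, 179]
```

The last two results match the byte values shown on the slides for "é" and for the first character of "コーヒー".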
  • UTF-8 pluses & minuses
    - an extension of ASCII
    - can encode any Unicode character
    - self-synchronising, efficient to search with byte-oriented algorithms, efficient to encode
    - RFC 2277: (Internet) protocols MUST declare the charsets they support, and MUST support at least UTF-8
    "コーヒー".unpack("U*") => [12467, 12540, 12498, 12540]
    "コーヒー".unpack("C*") => [227, 130, 179, 227, 131, 188, 227, 131, 146, 227, 131, 188]
    # Asian languages take 1.5x more space (3 bytes per character here, instead of 2 in a double-byte encoding)
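"Self-synchronising" means that from any byte offset you can find the start of the current character without decoding from the beginning, because continuation bytes always look like `10xxxxxx`. A minimal sketch, with helper names chosen for this example:

```ruby
# Continuation bytes in UTF-8 always have the bit pattern 10xxxxxx.
def continuation_byte?(byte)
  (byte & 0xC0) == 0x80
end

# Hypothetical helper: walk backwards from any byte index to the index
# where the enclosing character starts.
def char_start(bytes, index)
  index -= 1 while index > 0 && continuation_byte?(bytes[index])
  index
end

bytes = "\u30B3\u30FC\u30D2\u30FC".bytes.to_a # "コーヒー", 3 bytes per character

p char_start(bytes, 0)  # 0
p char_start(bytes, 4)  # 3
p char_start(bytes, 5)  # 3
p char_start(bytes, 11) # 9
```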
  • What you should remember
    - text CONTENT and ENCODING are two different concepts
    - Unicode maps each symbol to an integer code point
    - Latin-1 is a single-byte encoding for Western languages
    - UTF-8 is a multibyte encoding for Unicode
    - USE UTF-8!
  • Ruby 1.8 Unicode Support
    - a string is just a collection of bytes --> dealing with encodings is left to the developer
    - issues: index retrieval, slicing, regexp, etc.
    - "".size always counts bytes (this affects validates_size_of …)
    - limited Unicode support (/u modifier)
    "Café".size => 5
    "Café".reverse => "\251\303faC"
    "Café".scan(/./) => ["C", "a", "f", "\303", "\251"]
    "Café".scan(/./u) => ["C", "a", "f", "é"]
  • Ruby 1.8 Unicode Support (2)
    - regexps are aware of 4 encodings: none, EUC, Shift_JIS, UTF-8
    - ways to set the source encoding: the command-line -K option, or RUBYOPT
    ruby -e "puts 'Café'.scan(/./).inspect"
    ["C", "a", "f", "\303", "\251"]
    ruby -Ku -e "puts 'Café'.scan(/./).inspect"
    ["C", "a", "f", "é"]
    export RUBYOPT=-Ku
    ruby -e "puts 'Café'.scan(/./).inspect"
    ["C", "a", "f", "é"]
  • Ruby 1.8 - Transcoding
    - Iconv library: ships with Ruby, handles transcoding
    - //TRANSLIT and //IGNORE options
    require "iconv"
    utf8_coffee = "Café" => "Café"
    utf8_to_latin1 = Iconv.new("LATIN1//TRANSLIT//IGNORE", "UTF8") => #<Iconv:0x007f8ba1930060>
    utf8_to_latin1.iconv(utf8_coffee).size => 4
    utf8_to_latin1.iconv("On and on… and on…") => "On and on... and on..."
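Iconv was removed from the standard library in Ruby 2.0, so on later versions the closest built-in tool is `String#encode`. The sketch below is not what the slide shows: it approximates //TRANSLIT with a hand-written `fallback` table and //IGNORE-style lossy conversion with `undef: :replace`. The method names and the one-entry table are assumptions made for illustration.

```ruby
# Hypothetical transliteration table; a real one would need many entries.
ELLIPSIS_FALLBACK = { "\u2026" => "..." }

# Rough stand-in for LATIN1//TRANSLIT: substitute known characters,
# raise Encoding::UndefinedConversionError for anything else.
def to_latin1(text)
  text.encode("ISO-8859-1", fallback: ELLIPSIS_FALLBACK)
end

# Lossy variant: every unconvertible character becomes "?".
def to_latin1_lossy(text)
  text.encode("ISO-8859-1", undef: :replace, replace: "?")
end

p to_latin1("Caf\u00E9").bytesize          # 4
p to_latin1("On and on\u2026").bytes.to_a == "On and on...".bytes.to_a # true
p to_latin1_lossy("Ionu\u0163").bytes.to_a == "Ionu?".bytes.to_a       # true
```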
  • Ruby 1.9 & M17N
    - multilingualization (M17N): a CSI approach
    - a single piece of software should be localizable into more than one language
    - more than one language should be usable at the same time
    - this differs from conventional languages (Java, Python, Perl), which follow the UCS philosophy
    1. Source encoding: every source file has an encoding
    - new __ENCODING__ keyword
    In irb: __ENCODING__ => #<Encoding:UTF-8>
  • Ruby 1.9 - source encoding
    - new way to set the encoding: the magic comment
    - priority for .rb files: magic comment > command-line -K option > RUBYOPT -K > shebang -K > US-ASCII
    - priority for command line / standard input: magic comment > command-line -K option > RUBYOPT -K > system locale
    # encoding: UTF-8
    puts __ENCODING__ => UTF-8
  • Ruby 1.9 - String class
    - a String is a collection of encoded data
    - each String object has an encoding
    - size now counts characters, even multibyte ones
    - 3 new enumerator methods: each_byte, each_char, each_codepoint
    "café".size => 4
    "café".bytesize => 5
    "café".each_byte.map{|byte| byte} => [99, 97, 102, 195, 169]
    "café".each_char.map{|char| char} => ["c", "a", "f", "é"]
    "café".each_codepoint.map{|codepoint| codepoint} => [99, 97, 102, 233]
  • Ruby 1.9 - String class (Transcoding)
    - Strings with different encodings can coexist in the same program, and can be merged
    - new way to transcode: String#encode; String#force_encoding only changes the encoding label (in place) and leaves the bytes untouched
    latin_1_coffee = "café".encode("ISO-8859-1") => "caf\xE9"
    latin_1_coffee.bytesize => 4
    wrong_encoded_coffee = latin_1_coffee.force_encoding("UTF-8") => "caf\xE9"
    latin_1_coffee.encoding => #<Encoding:UTF-8>
    wrong_encoded_coffee.scan /./
    ArgumentError: invalid byte sequence in UTF-8
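A mislabelled string like the one above can be detected with `String#valid_encoding?` and repaired by first restoring the correct label and then transcoding. The following is a small sketch under the assumption that the bytes really are Latin-1; `relabel_latin1_as_utf8` is a name invented here.

```ruby
# Hypothetical repair helper: the bytes are assumed to be Latin-1 that was
# wrongly tagged; restore the label, then transcode to real UTF-8.
def relabel_latin1_as_utf8(text)
  text.dup.force_encoding("ISO-8859-1").encode("UTF-8")
end

# Latin-1 bytes for "café", deliberately tagged as UTF-8.
mislabelled = [99, 97, 102, 233].pack("C*").force_encoding("UTF-8")

p mislabelled.valid_encoding? # false: 0xE9 alone is not valid UTF-8

fixed = relabel_latin1_as_utf8(mislabelled)
p fixed.valid_encoding?       # true
p fixed.codepoints.to_a       # [99, 97, 102, 233]
p fixed.bytes.to_a            # [99, 97, 102, 195, 169]
```

Note that `force_encoding` never changes bytes, while `encode` does; mixing them up is the usual source of "invalid byte sequence" errors.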
  • Ruby 1.9 - Internal and external encoding
    What about non-literal Strings (that come from I/O)?
    2. Encoding.default_external: default for external encoding
       - derived from LANG on Unix/Linux
       - derived from the legacy system encoding on Windows
    3. Encoding.default_internal: default for internal encoding (≊ default external)
       - undefined by default
    > cat show_encodings.rb
    open(__FILE__, "r:UTF-8:UTF-32") do |file|
      puts file.external_encoding.name
      puts file.internal_encoding.name
      file.each do |line|
        p [line.encoding.name, line[0..3]]
      end
    end
    > ruby show_encodings.rb
    UTF-8
    UTF-32
    ["UTF-32", "\uFEFF"]
    ["UTF-32", "\x00\x00\x00\x20"]
    ["UTF-32", "\x00\x00\x00\x20"]
    ["UTF-32", "\x00\x00\x00\x20"]
    ["UTF-32", "\x00\x00\x00\x20"]
    ["UTF-32", "\x00\x00\x00\x20"]
    ["UTF-32", "\x00\x00\x00\x65"]
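A more everyday use of the `"r:external:internal"` mode string is reading a legacy Latin-1 file and getting UTF-8 strings back. The sketch below writes a small temporary file first so that it is self-contained; the helper names are invented for this example.

```ruby
require "tempfile"

# Hypothetical helper: create a temp file holding "café" as Latin-1 bytes.
def write_latin1_sample
  file = Tempfile.new("latin1_sample")
  file.binmode                              # write the bytes untouched
  file.write([99, 97, 102, 233].pack("C*"))
  file.close
  file
end

# External encoding: what is on disk. Internal encoding: what we get back.
def read_latin1_as_utf8(path)
  File.open(path, "r:ISO-8859-1:UTF-8") { |file| file.read }
end

sample = write_latin1_sample
text = read_latin1_as_utf8(sample.path)

p text.encoding.name # "UTF-8"
p text.bytes.to_a    # [99, 97, 102, 195, 169]

sample.unlink        # remove the temporary file
```

Ruby transcodes on read, so the rest of the program only ever sees UTF-8.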
  • What you should remember
    - Ruby 1.8 has limited (regexp-only) support for Unicode
      - watch out for slices, sizes, reverse, etc.
      - transcode with Iconv
    - Ruby 1.9 is encoding-aware
      - each source file has an Encoding
      - each String has an Encoding
      - IO: internal and external encoding
      - new iterators on String
  • HTML/HTTP - declare encoding
    - HTTP header
    - meta tags
    Content-Type: text/html; charset=ISO-8859-1 # HTTP header
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <meta charset="utf-8"/>
    <?xml version="1.0" encoding="ISO-8859-1"?>
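On the receiving side, a declared charset is only useful if the consumer applies it. The sketch below pulls the charset out of a Content-Type value with a deliberately simple regexp and labels the raw body accordingly. Both method names are hypothetical, the parsing is not a full implementation of the HTTP header grammar, and the UTF-8 default is an assumption of this example.

```ruby
# Simplified extraction of the charset parameter; returns nil when absent.
def charset_from_content_type(header)
  match = header.match(/;\s*charset\s*=\s*"?([^";\s]+)"?/i)
  match && match[1]
end

# Label the raw bytes with the declared encoding (assumed default: UTF-8).
# Encoding.find raises ArgumentError for an unknown encoding name.
def label_body(raw_bytes, content_type)
  name = charset_from_content_type(content_type) || "UTF-8"
  raw_bytes.dup.force_encoding(Encoding.find(name))
end

header = "text/html; charset=ISO-8859-1"
body   = [99, 97, 102, 233].pack("C*") # "café" as Latin-1 bytes

p charset_from_content_type(header)                       # "ISO-8859-1"
p label_body(body, header).encode("UTF-8").codepoints.to_a # [99, 97, 102, 233]
```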
  • HTML - Encoding chars
    Ways to encode a character:
    - directly, in the declared encoding: "é"
    - named character entity: "&eacute;"
    - numeric character entity: "&#233;"
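A numeric character entity is simply the decimal Unicode code point wrapped in `&#` and `;`, so it can be produced from `each_codepoint`. This is a minimal sketch with a made-up method name; it leaves ASCII untouched and does not escape `&`, `<` or `>`, which real HTML output would also need.

```ruby
# Hypothetical helper: replace every non-ASCII character with its
# decimal numeric character entity.
def to_numeric_entities(text)
  text.each_codepoint.map { |cp|
    cp < 128 ? cp.chr : format("&#%d;", cp)
  }.join
end

p to_numeric_entities("caf\u00E9") # "caf&#233;"
p to_numeric_entities("Ionu\u0163") # "Ionu&#355;"
```

The output is pure ASCII, so it survives any ASCII-compatible page encoding.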
  • Conclusion
    - use UTF-8
    - document (declare) encodings
    - write encoding-safe code
  • References
    - James Gray's Encodings series
    - Joel Spolsky's blog post about encodings
    - Design and implementation of Ruby M17N
    - Internationalization in Ruby 1.9
  • .end Merci! Thank you! Mulţumesc! ありがとう