Encodings - Ruby 1.8 and Ruby 1.9

Encodings
Ruby 1.8 and 1.9

Vlad ZLOTEANU
#ParisRB Software Engineer @ Dimelo
December 12, 2001
@vladzloteanu

Copyright Dimelo SA www.dimelo.com

Motto:
“ There Ain't No Such Thing
As Plain Text ”
Joel Spolsky


ASCII (1963)

historically: from telegraphic codes
7 bits to encode 128 chars
included: english alphabet, digits, punctuation
marks, control chars
what about chars from other languages?
"A".unpack("C*")
=> [65]

"a".unpack("C*")
=> [97]

"c".unpack("C*")
=> [99]
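
A quick way to check whether a piece of text fits in ASCII's 7 bits is to look at its bytes. This is a minimal sketch; `seven_bit?` is a hypothetical helper name, not a Ruby built-in (Ruby 1.9 also offers `String#ascii_only?` for this purpose):

```ruby
# Hypothetical helper: true when every byte fits in ASCII's 7 bits (0..127).
def seven_bit?(text)
  text.each_byte.all? { |byte| byte < 128 }
end

seven_bit?("Ruby 1.9")   # plain English text stays below 128
seven_bit?("caf\u00E9")  # "é" needs bytes with the 8th bit set
```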

iso-8859-X

idea: use the 8th bit -> 128 new positions
8-bit encoding -> 256 chars

iso-8859-1 (Latin-1), windows-1252
slots 160 to 255 for other chars
covers most Western European languages: French, German, etc
default charset in many browsers

iso-8859-2
most Eastern European languages

Issues

can't combine 2 different languages from 2
different encodings
most Asian languages have more than 256 chars

"café".encode('ISO-8859-1').unpack("C*")
=> [99, 97, 102, 233]

"Ionuţ".encode('ISO-8859-2').unpack("C*")
=> [73, 111, 110, 117, 254]

"Ionuţ aime le café".encode('ISO-8859-1').unpack("C*")
Encoding::UndefinedConversionError:
U+0163 from UTF-8 to ISO-8859-1
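
The failure above can be detected before it happens. The sketch below assumes Ruby 1.9's `String#encode`; `representable?` is a hypothetical helper name introduced only for illustration:

```ruby
# Hypothetical helper: can +text+ be stored in +encoding_name+ without loss?
def representable?(text, encoding_name)
  text.encode(encoding_name)
  true
rescue Encoding::UndefinedConversionError
  false
end

name = "Ionu\u0163"                 # "Ionuţ"
representable?(name, "ISO-8859-2")  # Latin-2 has a slot for ţ
representable?(name, "ISO-8859-1")  # Latin-1 does not
```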

Unicode

the goal of Unicode was literally to provide a
character set that includes all characters in use today

each letter maps to a code point (theoretical symbol)
A is the same as A and A (whatever the font or style), but different from a
uppercase, lowercase, rules for normalization,
decomposition, etc.
codespace of 1.1M code points (from 0 to 10FFFF), of which
about 110k are assigned to chars

from 0 to 255 -> same code points as Latin-1 (we can
think of it as a superset of Latin-1)

Unicode (2)

Unicode enables processing, storage and interchange
of text data no matter what the platform, no matter
what the program, no matter the language
.. but how should we store those magical ‘code
points’?
"café".codepoints.to_a
=> [99, 97, 102, 233]

"café".encode('ISO-8859-1').unpack("C*")
=> [99, 97, 102, 233]

"Ionuţ 愛して le καφές".codepoints.to_a
=> [73, 111, 110, 117, 355, 32, 24859, 12375, 12390, 32, 108, 101, 32, 954,
945, 966, 941, 962]

UTF-8

encoding scheme for Unicode
every code point from 0-127 is stored in a single byte.
code points 128 and above are stored using 2 to 4 bytes

"Café".unpack("U*")
=> [67, 97, 102, 233]

"Café".encode("UTF-8").unpack("C*")
=> [67, 97, 102, 195, 169]
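
To make the byte layout concrete, here is a sketch of the UTF-8 rules written out by hand. `utf8_bytes` is a hypothetical helper; in practice `String#encode` or `Array#pack("U")` does this for you:

```ruby
# Hypothetical helper: encode one Unicode code point as a list of UTF-8 bytes.
# Sketch only: it does not reject the surrogate range (0xD800..0xDFFF).
def utf8_bytes(codepoint)
  case codepoint
  when 0x0000..0x007F       # 1 byte:  0xxxxxxx
    [codepoint]
  when 0x0080..0x07FF       # 2 bytes: 110xxxxx 10xxxxxx
    [0xC0 | (codepoint >> 6),
     0x80 | (codepoint & 0x3F)]
  when 0x0800..0xFFFF       # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    [0xE0 | (codepoint >> 12),
     0x80 | ((codepoint >> 6) & 0x3F),
     0x80 | (codepoint & 0x3F)]
  when 0x10000..0x10FFFF    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    [0xF0 | (codepoint >> 18),
     0x80 | ((codepoint >> 12) & 0x3F),
     0x80 | ((codepoint >> 6) & 0x3F),
     0x80 | (codepoint & 0x3F)]
  else
    raise ArgumentError, "not a Unicode code point: #{codepoint}"
  end
end

utf8_bytes(233)    # "é" gives [195, 169]
utf8_bytes(12467)  # "コ" gives [227, 130, 179]
```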


UTF-8 pluses & minuses

ASCII extension
can encode any Unicode char
self-synchronising, efficient to search with byte-oriented
algorithms, efficient to encode
rfc2277: (inet) protocols MUST declare (supported)
charsets, protocols MUST support at least UTF-8

"コーヒー".unpack('U*')
=> [12467, 12540, 12498, 12540]

"コーヒー".unpack('C*')
=> [227, 130, 179, 227, 131, 188, 227, 131, 146,
227, 131, 188] # Asian languages take 1.5x more space (3 bytes per char instead of 2 in a 2-byte encoding)
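
"Self-synchronising" means that character boundaries can be found from any position in the byte stream, because continuation bytes always have the bit pattern 10xxxxxx. A minimal sketch, with hypothetical helper names:

```ruby
# A UTF-8 continuation byte always looks like 10xxxxxx.
def continuation_byte?(byte)
  (byte & 0xC0) == 0x80
end

# Hypothetical helper: byte offsets at which a character starts.
def char_starts(bytes)
  (0...bytes.size).reject { |index| continuation_byte?(bytes[index]) }
end

coffee_bytes = [227, 130, 179, 227, 131, 188, 227, 131, 146, 227, 131, 188]
char_starts(coffee_bytes)  # [0, 3, 6, 9]: four characters, three bytes each
```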


What you should remember

Text CONTENT and ENCODING are two different
concepts
Unicode is a map: “symbol” <-> ‘integer code point’
Latin-1 is a single byte encoding for Western
languages
UTF-8 is a multibyte encoding for Unicode

USE UTF-8!


Ruby 1.8 Unicode Support

string is just a collection of bytes --> dealing with
encodings is left to the developer
issues: index retrieval, slicing, regexp, etc
String#size will always count bytes (this affects validates_size_of …)
limited unicode support (/u modifier)
"Café".size
=> 5

"Café".reverse
=> "\251\303faC"

"Café".scan(/./)
=> ["C", "a", "f", "\303", "\251"]

"Café".scan(/./u)
=> ["C", "a", "f", "é"]
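
The 1.8 behaviour can be reproduced on an encoding-aware Ruby by treating the string as raw bytes. A sketch, assuming Ruby 1.9 or later; `bytewise_reverse` is a hypothetical helper used only to show why a byte-level reverse corrupts multibyte characters:

```ruby
# Hypothetical helper: reverse a string byte by byte, as Ruby 1.8 does.
def bytewise_reverse(text)
  text.dup.force_encoding("BINARY").reverse.force_encoding(text.encoding)
end

coffee = "Caf\u00E9"
broken = bytewise_reverse(coffee)
broken.valid_encoding?  # false: the two bytes of "é" are now swapped
coffee.reverse          # a character-aware reverse keeps "é" intact
```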

Ruby 1.8 Unicode Support (2)

regex - aware of 4 encodings: none, EUC, Shift_JIS,
UTF-8
ways to set source encoding:
command-line -K option
RUBYOPT

ruby -e "puts 'Café'.scan(/./).inspect"
["C", "a", "f", "\303", "\251"]

ruby -Ku -e "puts 'Café'.scan(/./).inspect"
["C", "a", "f", "é"]

export RUBYOPT='-Ku'
ruby -e "puts 'Café'.scan(/./).inspect"
["C", "a", "f", "é"]

Ruby 1.8 - Transcoding

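Iconv library – ships with Ruby, handles transcoding
//TRANSLIT option: approximate chars the target encoding can't represent
//IGNORE option: silently drop chars that can't be converted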
utf8_coffee = "Café"
=> "Café"

utf8_to_latin1 = Iconv.new("LATIN1//TRANSLIT//IGNORE", "UTF8")
=> #<Iconv:0x007f8ba1930060>

utf8_to_latin1.iconv(utf8_coffee).size
=> 4

utf8_to_latin1.iconv("On and on… and on…")
=> "On and on... and on..."
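
On Ruby 1.9, `String#encode` accepts a `:fallback` option that can play a role similar to //TRANSLIT. The sketch below is an assumption-laden illustration: the `APPROXIMATIONS` table and the `to_latin1` helper are hypothetical, and Iconv's own transliteration tables are not reproduced here.

```ruby
# Assumed, hand-picked approximations; Iconv's //TRANSLIT uses its own tables.
APPROXIMATIONS = { "\u2026" => "..." }

# Hypothetical helper: transcode to Latin-1, approximating the chars listed above.
def to_latin1(text)
  text.encode("ISO-8859-1", :fallback => APPROXIMATIONS)
end

to_latin1("Caf\u00E9").bytesize        # 4: "é" is a single byte in Latin-1
to_latin1("On and on\u2026").bytesize  # 12: the ellipsis became three dots
```

Characters that are neither in Latin-1 nor in the table are not handled by this sketch.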

Ruby 1.9 & M17N

multilingualization (M17N) - a CSI approach
Localization for more than one language should be
available in a single piece of software
More than one language should be usable at the
same time
differs from conventional languages (Java, Python,
Perl), which follow the UCS philosophy

1. Source encoding: all source files have an encoding
new __ENCODING__ keyword

irb:
__ENCODING__
=> #<Encoding:UTF-8>

Ruby 1.9 – source encoding

New way to set encoding: magic comment

Priority:
.rb files:
magic comment > command-line -K option > RUBYOPT -K >
shebang -K > US-ASCII

command line / standard input:
magic comment > command-line -K option > RUBYOPT -K >
system locale
# encoding: UTF-8
puts __ENCODING__
=> UTF-8
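
String literals take the encoding of the source file they appear in. A small sketch of that relationship; the constant names are hypothetical, and the magic comment only takes effect on the first line of a file (or the second, after a shebang):

```ruby
# encoding: UTF-8

SOURCE_ENCODING  = __ENCODING__     # encoding of this source file
LITERAL_ENCODING = "café".encoding  # literals inherit the source encoding

LITERAL_ENCODING == SOURCE_ENCODING
```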

Ruby 1.9 – String class

String – a collection of encoded data
each String object has an encoding
size method -> counts chars, not bytes (multibyte-aware)
3 new enumerator methods: each_byte, each_char, each_codepoint
"café".size
=> 4
"café".bytesize
=> 5

"café".each_byte.map{|byte| byte}
=> [99, 97, 102, 195, 169]

"café".each_char.map{|char| char}
=> ["c", "a", "f", "é"]

"café".each_codepoint.map{|codepoint| codepoint}
=> [99, 97, 102, 233]

Ruby 1.9 – String class (Transcoding)

Strings with different encoding can ‘coexist’ in
same program – and can be merged
New way to transcode
latin_1_coffee = "café".encode('ISO-8859-1')
=> "caf\xE9"

latin_1_coffee.bytesize
=> 4

# force_encoding only relabels the bytes, and it modifies the receiver in place
wrong_encoded_coffee = latin_1_coffee.force_encoding('UTF-8')
=> "caf\xE9"
latin_1_coffee.encoding
=> #<Encoding:UTF-8>
wrong_encoded_coffee.scan /./
ArgumentError: invalid byte sequence in UTF-8
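
A mislabeled string like the one above can be repaired by relabeling it with its real encoding and only then transcoding. A sketch under one loud assumption: the caller knows (or guesses) the real source encoding, here Latin-1. `to_utf8` is a hypothetical helper name:

```ruby
# Hypothetical helper: return a valid UTF-8 copy of +raw+.
# Assumption: bytes that are not valid UTF-8 are in +assumed_source+.
def to_utf8(raw, assumed_source = "ISO-8859-1")
  candidate = raw.dup.force_encoding("UTF-8")
  return candidate if candidate.valid_encoding?

  # Relabel with the assumed encoding, then transcode for real.
  raw.dup.force_encoding(assumed_source).encode("UTF-8")
end

latin_1_bytes = [99, 97, 102, 233].pack("C*")  # "café" in Latin-1
to_utf8(latin_1_bytes).bytesize                # 5: "é" now takes two bytes
```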

Ruby 1.9 - Internal and external encoding
What about non-literal Strings (that come from I/O)?

2. Encoding.default_external:
default for external encoding
derived from LANG on Unix/Linux
derived from legacy system encoding on Windows

3. Encoding.default_internal:
default for internal encoding
by default undefined (≊ default external)

> cat show_encodings.rb
open(__FILE__, "r:UTF-8:UTF-32") do |file|
  puts file.external_encoding.name
  puts file.internal_encoding.name
  file.each do |line|
    p [line.encoding.name, line[0..3]]
  end
end

> ruby show_encodings.rb
UTF-8
UTF-32
["UTF-32", "\uFEFF"]
["UTF-32", "\x00\x00\x00\x20"]
["UTF-32", "\x00\x00\x00\x20"]
["UTF-32", "\x00\x00\x00\x20"]
["UTF-32", "\x00\x00\x00\x20"]
["UTF-32", "\x00\x00\x00\x20"]
["UTF-32", "\x00\x00\x00\x65"]
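
A more everyday use of the same mode string is reading a legacy Latin-1 file and working with UTF-8 in memory. A minimal sketch; the helper names, the file name and the choice of encodings are assumptions made for the example:

```ruby
require "tmpdir"

# Hypothetical helper: strings are transcoded to Latin-1 when written.
def write_latin1(path, text)
  File.open(path, "w:ISO-8859-1") { |file| file.write(text) }
end

# Hypothetical helper: external encoding Latin-1, internal encoding UTF-8.
def read_latin1_as_utf8(path)
  File.open(path, "r:ISO-8859-1:UTF-8") { |file| file.read }
end

Dir.mktmpdir do |dir|
  path = File.join(dir, "coffee.txt")
  write_latin1(path, "caf\u00E9")
  File.size(path)                     # 4 bytes on disk
  read_latin1_as_utf8(path).encoding  # UTF-8 in memory
end
```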

What you should remember

Ruby 1.8 has limited (regexp-only) support for
Unicode
watch out for slices, sizes, reverse, etc.
transcode with Iconv

Ruby 1.9 is encoding-aware
each source file has an Encoding
each String has an Encoding
IO: internal and external encoding
New iterators on String

HTML/HTTP – declare encoding

HTML/HTTP
HTTP header
Meta tags

Content-Type: text/html; charset=ISO-8859-1 # HTTP Header

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

<meta charset="utf-8"/>

<?xml version="1.0" encoding="ISO-8859-1"?>
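
On the receiving side, the declared charset tells you how to label the bytes you were given. A rough sketch, not a full HTTP parser: both helper names are hypothetical, and the UTF-8 default used when no charset is declared is an assumption of this example.

```ruby
# Hypothetical helper: pull the charset out of a Content-Type header value.
def charset_from_content_type(header, default = "UTF-8")
  match = header.match(/charset\s*=\s*"?([^\s;"]+)"?/i)
  match ? match[1] : default
end

# Hypothetical helper: label the raw bytes as declared, then transcode to UTF-8.
# Encoding.find raises ArgumentError for a charset name Ruby does not know.
def decode_body(raw_bytes, content_type)
  declared = Encoding.find(charset_from_content_type(content_type))
  raw_bytes.dup.force_encoding(declared).encode("UTF-8")
end

body = [99, 97, 102, 233].pack("C*")  # "café" as sent in Latin-1
decode_body(body, "text/html; charset=ISO-8859-1")
```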

HTML – Encoding chars

Encoding types
directly in declared encoding
"é"
named char entities
"&eacute;"
numeric char entities
"&#233;"
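
Numeric entities are simply the decimal code point wrapped in `&#` and `;`, so they can be generated from any encoding-aware string. A sketch with a hypothetical helper name:

```ruby
# Hypothetical helper: replace every non-ASCII character with a numeric entity.
# It does not escape &, < or >; that is a separate step.
def numeric_entities(text)
  text.each_char.map { |char|
    char.ord < 128 ? char : format("&#%d;", char.ord)
  }.join
end

numeric_entities("caf\u00E9")  # "caf&#233;"
```

The result is pure ASCII, so it survives even when the page's declared encoding is wrong.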


Conclusion

Use UTF-8

Document (declare) encodings

Code encoding-safe


References

James Gray’s Encodings series

Joel Spolsky’s blog post about encodings

Design and implementation of Ruby M17N

Internationalization in Ruby 1.9


.end

Merci!
Thank you!
Mulţumesc
ありがとう

?
