Characters encoding and
how to handle it in Ruby
Paweł Cyło
Rzeszów Ruby User Group, 30.11.2016
Presented on the Rzeszow Ruby User Group meeting

  1. Characters encoding and how to handle it in Ruby
     Paweł Cyło
     Rzeszów Ruby User Group, 30.11.2016
  2. Characters encoding and how to handle it in Ruby
     • What is a string? (no, it is not what Google Image search suggests)
     • ASCII and ASCII-based 8-bit code pages
     • Unicode!
     • Working with character encodings in Ruby
  3. What is a String?
     01001000 01100101 01101100 01101100 01101111 00100000 01010111 01101111 01110010 01101100 01100100
     "Hello World"
     Data + Encoding
     There Ain't No Such Thing As Plain Text
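The "Data + Encoding" idea can be sketched in Ruby: the bits from the slide only become text once an encoding is attached. The helper name `bits_to_string` and the constant `BITS` are made up for this illustration.

```ruby
# Hypothetical helper: turn space-separated 8-bit groups into a String.
def bits_to_string(bits, encoding = Encoding::US_ASCII)
  bytes = bits.split.map { |group| group.to_i(2) } # "01001000" -> 72
  bytes.pack("C*").force_encoding(encoding)        # raw bytes + an encoding label
end

# The bit pattern shown on the slide.
BITS = "01001000 01100101 01101100 01101100 01101111 " \
       "00100000 01010111 01101111 01110010 01101100 01100100"

puts bits_to_string(BITS) # prints "Hello World"
```

Without the encoding label the same bytes are only a sequence of numbers between 72 and 111.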
  4. ASCII
     • Initially designed to work with telegraphs; first version released in 1963
     • 7 bits used in every byte
     • 95 human-readable characters (32-126) + 33 characters in the control codes block (0-31 and 127)
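A small sketch of the ASCII ranges above, using only core Ruby methods (`Integer#chr`, `String#ord`). The constant names are illustrative.

```ruby
# The two ASCII blocks described on the slide.
PRINTABLE = (32..126).map(&:chr) # " " through "~"
CONTROL   = (0..31).to_a + [127] # control codes, including DEL

puts PRINTABLE.size                # 95
puts CONTROL.size                  # 33
puts "A".ord                       # 65
puts 65.chr                        # A
puts "A".ord.to_s(2).rjust(7, "0") # 1000001 (fits in 7 bits)
```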
  5. ASCII-based 8-bit encodings
     • "Hey, look! There is one bit not being used in ASCII, let's put it to work!"
     • First 128 characters == ASCII
     • Total freedom for the last 128 characters
     • Unable to handle all existing characters in the world
     • Mixing languages in a single string is problematic
     Windows-1252, ISO-8859-1, etc.
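To illustrate the "total freedom for the last 128 characters", here is a sketch that decodes the same byte under two code pages. The helper `decode_byte` is hypothetical. ISO-8859-2 is not named on the slide and is used here only as a second example of an 8-bit code page.

```ruby
# Hypothetical helper: interpret one byte in a given code page and
# convert the resulting character to UTF-8.
def decode_byte(byte, code_page)
  [byte].pack("C").force_encoding(code_page).encode("UTF-8")
end

# Below 128 every ASCII-based code page agrees.
puts decode_byte(0x41, "ISO-8859-1") # A
puts decode_byte(0x41, "ISO-8859-2") # A

# Above 127 the same byte means different things.
puts decode_byte(0xB1, "ISO-8859-1") # plus-minus sign, U+00B1
puts decode_byte(0xB1, "ISO-8859-2") # a with ogonek, U+0105
```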
  6. The first Unicode version: UCS-2
     • Fixed two bytes per character (~65k characters maximum)
     • Bizarre Byte Order Mark (BOM) character at the beginning (FE FF / FF FE), which was not even always present
     • "Look at all these wasted bytes storing zeros! Unicode sucks…"
     "Hello": 00 48 00 65 00 6C 00 6C 00 6F
     ...but it could also be: 48 00 65 00 6C 00 6C 00 6F 00
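The two byte orders can be reproduced in Ruby. This sketch uses the UTF-16BE and UTF-16LE encodings as stand-ins, on the assumption that for characters such as those in "Hello" they give the same bytes as UCS-2. The helper `hex_bytes` is made up for the example.

```ruby
# Hypothetical helper: render a string's bytes as space-separated hex.
def hex_bytes(string)
  string.bytes.map { |byte| format("%02X", byte) }.join(" ")
end

puts hex_bytes("Hello".encode("UTF-16BE")) # 00 48 00 65 00 6C 00 6C 00 6F
puts hex_bytes("Hello".encode("UTF-16LE")) # 48 00 65 00 6C 00 6C 00 6F 00
```

Neither output carries a BOM, so a reader of the raw bytes has to know the byte order from somewhere else.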
  7. UTF-8
     • Variable-width encoding!
     • Code points 0-127 are stored in a single byte
     • Code points 128 and above are stored using 2, 3 or 4 bytes (the original design allowed up to 6, but RFC 3629 limits UTF-8 to 4)
     • Code point: every character is assigned a number by the Unicode Consortium, written like this: U+0639
     • Unicode currently defines 128 237 characters (and is still growing)
     • The BOM (EF BB BF) is optional and only signals that the encoding is UTF-8
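A sketch of the variable width in practice: each sample code point is printed with its UTF-8 byte count and bytes. The sample characters (other than U+0639 from the slide) and the helper `utf8_hex` are chosen for illustration and cover one- to four-byte sequences.

```ruby
# Hypothetical helper: the UTF-8 bytes of a character as hex.
def utf8_hex(char)
  char.encode("UTF-8").bytes.map { |byte| format("%02X", byte) }.join(" ")
end

# A, a with ogonek, Arabic letter ain, euro sign, grinning face emoji
SAMPLES = ["A", "\u0105", "\u0639", "\u20AC", "\u{1F600}"].freeze

SAMPLES.each do |char|
  code_point = format("U+%04X", char.ord)
  puts "#{code_point}: #{char.bytesize} byte(s): #{utf8_hex(char)}"
end
```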
  8. UTF-8
     • Full backward compatibility with ASCII: code points U+0000 to U+007F
     • Other variants:
       • UTF-32, a 32-bit, fixed-width encoding
       • UTF-16, a 16-bit, variable-width encoding
     "Hello": 48 65 6C 6C 6F
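The ASCII compatibility and the size difference between the variants can be checked with a short sketch. The helper `encoded_sizes` is hypothetical. The explicit big-endian variants are used so that no BOM is counted.

```ruby
# Hypothetical helper: byte size of the same text in several encodings.
def encoded_sizes(text)
  %w[US-ASCII UTF-8 UTF-16BE UTF-32BE].map do |name|
    [name, text.encode(name).bytesize]
  end.to_h
end

encoded_sizes("Hello").each do |name, size|
  puts "#{name}: #{size} bytes"
end

# ASCII text is byte-for-byte identical in UTF-8.
puts "Hello".encode("US-ASCII").bytes == "Hello".encode("UTF-8").bytes # true
```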
  9. Strings in Ruby
     • Unicode (UTF-8) by default
     • String literals allow inserting a byte directly ("\xXX") or a Unicode code point ("\uXXXX")
     • All strings are encoding-aware
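A minimal sketch of the two escape forms, writing the same character once as a code point and once as raw bytes. The constant names are illustrative, and the last line assumes the file uses Ruby's default source encoding.

```ruby
BY_CODE_POINT = "\u0105"   # a with ogonek, given as Unicode code point U+0105
BY_BYTES      = "\xC4\x85" # the same character, given as its two UTF-8 bytes

puts BY_CODE_POINT.bytes == BY_BYTES.bytes # true
puts BY_CODE_POINT.encoding                # UTF-8
puts BY_CODE_POINT.length                  # 1 (characters)
puts BY_CODE_POINT.bytesize                # 2 (bytes)

# Every string carries an encoding; a plain literal takes the source encoding.
puts "Hello".encoding
```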
  10. Strings in Ruby
      Is it UTF-8? Not really: a lone \x99 byte is not a valid UTF-8 sequence, but it is defined in some ASCII-based encodings. For example, in the Windows-1252 code page it stands for "™".
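The code shown on the original slide is not part of this transcript. The following is a sketch of the same situation, with a made-up sample string and hypothetical helper names: bytes containing \x99 are invalid when labelled as UTF-8 but convert cleanly from Windows-1252.

```ruby
# Sample data (an assumption, not the slide's example): "Ruby" plus byte 0x99.
RAW = "Ruby\x99".b # .b returns a copy tagged as binary (ASCII-8BIT)

# Label the bytes as UTF-8 without changing them.
def as_utf8(bytes)
  bytes.dup.force_encoding(Encoding::UTF_8)
end

# Label the bytes as Windows-1252, then transcode to UTF-8.
def from_windows_1252(bytes)
  bytes.dup.force_encoding("Windows-1252").encode("UTF-8")
end

puts as_utf8(RAW).valid_encoding? # false
puts from_windows_1252(RAW)       # "Ruby" followed by the trademark sign
```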
  11. Strings in Ruby
      Pure ASCII. But it can get more complex:
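The slide's own example is missing from the transcript. One way in which things get more complex, sketched here as an assumption about what is meant, is combining strings: pure-ASCII strings combine with anything, while non-ASCII strings in different encodings do not. The constants and the helper `joinable?` are hypothetical.

```ruby
ASCII_TEXT  = "Hello"
UTF8_TEXT   = "\u0105"                                # a with ogonek as UTF-8
LATIN2_TEXT = "\xB1".dup.force_encoding("ISO-8859-2") # a with ogonek as ISO-8859-2

# Hypothetical helper: can the two strings be concatenated?
def joinable?(left, right)
  left + right
  true
rescue Encoding::CompatibilityError
  false
end

puts ASCII_TEXT.ascii_only?             # true
puts UTF8_TEXT.ascii_only?              # false
puts joinable?(ASCII_TEXT, LATIN2_TEXT) # true
puts joinable?(UTF8_TEXT, LATIN2_TEXT)  # false
```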
  12. Ruby String methods to work with encodings:
      • bytes - shows you the bytes that make up a string
      • encoding - returns the current encoding of a string
      • encode - translates a string to another encoding (converting characters to their equivalents in the new encoding)
      • force_encoding - relabels a string with a different encoding without changing its bytes, so you can see what those bytes look like when interpreted in that encoding
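The four methods side by side, as a sketch on one sample character (chosen for illustration). The point to notice is that `encode` changes the bytes and keeps the character, while `force_encoding` keeps the bytes and changes their interpretation.

```ruby
TEXT = "\u00E9" # e with acute accent, stored as UTF-8

puts TEXT.bytes.inspect # [195, 169]
puts TEXT.encoding      # UTF-8

# encode: same character, different bytes.
CONVERTED = TEXT.encode("ISO-8859-1")
puts CONVERTED.bytes.inspect # [233]

# force_encoding: same bytes, different label (dup keeps TEXT unchanged).
RELABELED = TEXT.dup.force_encoding("ISO-8859-1")
puts RELABELED.bytes.inspect # [195, 169]
puts RELABELED.length        # 2, the two bytes now count as two characters

# Transcoding the relabeled string back to UTF-8 yields the classic
# two-character garbage (A with tilde, copyright sign).
puts RELABELED.encode("UTF-8")
```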
  13. Some characters may not be available in other encodings
      They can be replaced with a predefined default character
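A sketch of the replacement behaviour using the `invalid:`, `undef:` and `replace:` options of `String#encode`. The helper `to_ascii` and the sample text are assumptions for this illustration.

```ruby
# Hypothetical helper: convert to US-ASCII, substituting a placeholder
# for every character the target encoding cannot represent.
def to_ascii(text, placeholder = "?")
  text.encode("US-ASCII", invalid: :replace, undef: :replace, replace: placeholder)
end

NAME = "Pawe\u0142 Cy\u0142o" # contains two "l with stroke" characters

puts to_ascii(NAME)      # Pawe? Cy?o
puts to_ascii(NAME, "l") # Pawel Cylo

# Without the options the conversion raises instead of replacing.
begin
  NAME.encode("US-ASCII")
rescue Encoding::UndefinedConversionError => error
  puts "conversion failed: #{error.class}"
end
```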
  14. For those who want more!
      • http://www.justinweiss.com/articles/3-steps-to-fix-encoding-problems-in-ruby/
      • http://www.justinweiss.com/articles/how-to-get-from-theyre-to-theyre/
      • http://www.joelonsoftware.com/articles/Unicode.html
      • http://kunststube.net/encoding/
      • https://betterexplained.com/articles/unicode/
      • http://blog.gatunka.com/2014/04/25/character-encodings-for-modern-programmers/
      • https://ruby-doc.org/core-2.3.3/Encoding.html
      • http://yehudakatz.com/2010/05/05/ruby-1-9-encodings-a-primer-and-the-solution-for-rails/
      • http://nuclearsquid.com/writings/ruby-1-9-encodings/
  15. Paweł Cyło
      pawelcylo@gmail.com
      @PawelCylo
      Thank you for your attention!
