Characters encoding and how to handle it in Ruby


  1. Characters encoding and how to handle it in Ruby
     Paweł Cyło, Rzeszów Ruby User Group, 30.11.2016
  2. Characters encoding and how to handle it in Ruby
     • What is a string? (no, it is not what Google Image search suggests)
     • ASCII and ASCII-based 8-bit code pages
     • Unicode!
     • Working with character encodings in Ruby
  3. What is a String?
     01001000 01100101 01101100 01101100 01101111 00100000 01010111 01101111 01110010 01101100 01100100
     "Hello World"
     Data + Encoding
     There Ain't No Such Thing As Plain Text
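A minimal sketch of the "Data + Encoding" idea: the same string seen as raw bits and as an encoding label. The helper name `to_bits` is made up for this illustration.

```ruby
# Render each byte of a string as an 8-digit binary number.
# (to_bits is a hypothetical helper, not part of Ruby's String API.)
def to_bits(string)
  string.bytes.map { |byte| format("%08b", byte) }.join(" ")
end

text = "Hello World"
puts to_bits(text)   # the data: eleven bytes
puts text.encoding   # the encoding that tells Ruby how to read those bytes
```

Without the encoding label, the eleven bytes are just numbers; the label is what turns them into characters.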
  4. ASCII
     • Initially designed to work with telegraphs; the first version was released in 1963
     • 7 bits used in every byte
     • 95 human-readable characters (32-126) + 33 characters in the control codes block (0-31 and 127)
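The arithmetic on this slide can be checked with a few lines of Ruby. This is only a sketch; the method names are invented for the example.

```ruby
# Code ranges as listed on the slide (helper names are hypothetical).
def printable_codes
  (32..126).to_a
end

def control_codes
  (0..31).to_a + [127]
end

# True when every byte fits in 7 bits, i.e. the string is plain ASCII.
def seven_bit?(string)
  string.bytes.all? { |byte| byte < 128 }
end

puts printable_codes.size                 # 95
puts control_codes.size                   # 33
puts printable_codes.map(&:chr).join      # the printable ASCII characters
puts seven_bit?("Hello World")            # true
```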
  5. ASCII-based 8-bit encodings
     • "Hey, look! There is one bit not being used in ASCII, let's put it to work!"
     • First 128 characters == ASCII
     • Total freedom for the last 128 characters
     • Unable to handle all existing characters in the world
     • Mixing languages in a single string is problematic
     Windows-1252, ISO-8859-1, etc...
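To illustrate the "total freedom for the last 128 characters", the sketch below decodes the same byte under two code pages. `decode_byte` is a hypothetical helper; ISO-8859-2 is used here as an extra example code page that the slide does not name.

```ruby
# Interpret a single byte according to the given code page and
# return the resulting character as a UTF-8 string.
# (decode_byte is a made-up helper for this illustration.)
def decode_byte(byte, code_page)
  [byte].pack("C").force_encoding(code_page).encode("UTF-8")
end

puts decode_byte(0x41, "Windows-1252")  # "A": below 128, every code page agrees
puts decode_byte(0x41, "ISO-8859-2")    # "A"
puts decode_byte(0xB1, "Windows-1252")  # plus-minus sign
puts decode_byte(0xB1, "ISO-8859-2")    # Polish "a with ogonek"
```

The byte itself never changes; only the table used to look it up does.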
  6. The first Unicode version: UCS-2
     • Fixed two bytes per character (~65k characters maximum)
     • Bizarre Byte Order Mark (BOM) character at the beginning (FE FF / FF FE), which was not even always present
     • "Look at all these wasted bytes storing zeros! Unicode sucks…"
     "Hello": 00 48 00 65 00 6C 00 6C 00 6F
     ...but it could also be: 48 00 65 00 6C 00 6C 00 6F 00
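The two byte orders can be reproduced in Ruby. This sketch uses Ruby's UTF-16BE and UTF-16LE encodings as stand-ins for UCS-2, on the assumption that only characters below U+10000 are involved (where the two schemes produce the same bytes). `hex_bytes` is a hypothetical helper.

```ruby
# Format a string's bytes as space-separated uppercase hex pairs.
# (hex_bytes is a made-up helper for this illustration.)
def hex_bytes(string)
  string.bytes.map { |byte| format("%02X", byte) }.join(" ")
end

puts hex_bytes("Hello".encode("UTF-16BE"))   # 00 48 00 65 00 6C 00 6C 00 6F
puts hex_bytes("Hello".encode("UTF-16LE"))   # 48 00 65 00 6C 00 6C 00 6F 00

# The BOM is the code point U+FEFF; its byte order reveals the endianness.
puts hex_bytes("\uFEFF".encode("UTF-16BE"))  # FE FF
puts hex_bytes("\uFEFF".encode("UTF-16LE"))  # FF FE
```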
  7. UTF-8
     • Variable-width encoding!
     • Code points 0-127 are stored in a single byte
     • Code points 128 and above are stored using 2, 3 or 4 bytes (the original design allowed up to 6, but RFC 3629 limits UTF-8 to 4)
     • Code point: every character is assigned a magic number by the Unicode Consortium, written like this: U+0639
     • Unicode defined 128 237 unique characters at the time of this talk (and is still growing)
     • BOM (EF BB BF) is optional and only signals that the encoding is UTF-8
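A short sketch of the variable width in practice. The sample characters are chosen for this illustration only, and `code_point_label` is a hypothetical helper.

```ruby
# Format a character's code point in the usual U+XXXX notation.
# (code_point_label is a made-up helper for this illustration.)
def code_point_label(char)
  format("U+%04X", char.ord)
end

# One sample character for each UTF-8 length from 1 to 4 bytes.
samples = ["A", "\u0105", "\u20AC", "\u{1F600}"]

samples.each do |char|
  puts "#{code_point_label(char)}: #{char.bytesize} byte(s) in UTF-8"
end

puts code_point_label("\u0639")  # U+0639, the code point quoted on the slide
```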
  8. UTF-8
     • Full backward compatibility with ASCII: code points U+0000 to U+007F are stored as the same single bytes
     • Other variants:
       • UTF-32, a 32-bit, fixed-width encoding
       • UTF-16, a 16-bit, variable-width encoding
     "Hello": 48 65 6C 6C 6F
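The size difference between the variants can be sketched as follows. The big-endian flavours (`UTF-16BE`, `UTF-32BE`) are picked arbitrarily so that no BOM is added; `encoded_size` is a hypothetical helper.

```ruby
# Number of bytes a string occupies in the given encoding.
# (encoded_size is a made-up helper for this illustration.)
def encoded_size(string, encoding)
  string.encode(encoding).bytesize
end

%w[UTF-8 UTF-16BE UTF-32BE].each do |encoding|
  puts "#{encoding}: #{encoded_size('Hello', encoding)} bytes"
end

# ASCII compatibility: for ASCII text the UTF-8 bytes are the ASCII bytes.
puts "Hello".encode("UTF-8").bytes == "Hello".encode("US-ASCII").bytes  # true

# UTF-16 is variable-width: a character above U+FFFF needs two 16-bit units.
puts encoded_size("\u{1F600}", "UTF-16BE")  # 4
```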
  9. Strings in Ruby
     • Unicode (UTF-8) by default
     • String literals allow inserting a byte directly ("\xXX") or a Unicode code point ("\uXXXX")
     • All strings are encoding-aware
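A sketch of the two escape styles, assuming the source file uses Ruby's default UTF-8 source encoding. The trademark sign (U+2122) is used as the sample character; the method names are invented for the example.

```ruby
# The same character written byte by byte and as a code point.
# (Both helper names are hypothetical.)
def trademark_from_bytes
  "\xE2\x84\xA2"   # the three UTF-8 bytes of U+2122
end

def trademark_from_code_point
  "\u2122"         # the code point itself
end

puts trademark_from_code_point.encoding
puts trademark_from_bytes.bytes == trademark_from_code_point.bytes  # true
```

Either way the string ends up holding the same three bytes; the `\u` form just lets Ruby do the UTF-8 encoding for you.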
  10. Strings in Ruby
      Is it UTF-8? Not really: the \x99 byte on its own is not a valid UTF-8 sequence, but it is defined in some ASCII-based encodings. For example, in the Windows-1252 code page it stands for "™".
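The slide's original code is not part of this transcript; the following is a hedged reconstruction of the idea using standard `String` methods. `valid_utf8?` and `reinterpret` are hypothetical helpers.

```ruby
# Check whether the given bytes form valid UTF-8.
# (valid_utf8? and reinterpret are made-up helpers for this illustration.)
def valid_utf8?(bytes)
  bytes.dup.force_encoding("UTF-8").valid_encoding?
end

# Read the bytes as `from` and convert the result to UTF-8.
def reinterpret(bytes, from)
  bytes.dup.force_encoding(from).encode("UTF-8")
end

raw = "\x99"
puts valid_utf8?(raw)                  # false
puts reinterpret(raw, "Windows-1252")  # the trademark sign, U+2122
```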
  11. Strings in Ruby
      Pure ASCII. But it can get more complex:
  12. Ruby String methods to work with encodings:
      • bytes - shows you the bytes that make up a string
      • encoding - returns the current encoding of a string
      • encode - translates a string to another encoding (converting characters to their equivalents in the new encoding, so the bytes may change)
      • force_encoding - changes only the encoding label, leaving the bytes untouched, which shows you what those bytes look like when interpreted in a different encoding
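The four methods side by side, as a minimal sketch. The sample word and the `inspect_string` helper are invented for this illustration.

```ruby
# Summarise a string's encoding label, raw bytes and validity.
# (inspect_string is a made-up helper, not part of Ruby.)
def inspect_string(string)
  { encoding: string.encoding.name,
    bytes:    string.bytes,
    valid:    string.valid_encoding? }
end

utf8 = "caf\u00E9"
p inspect_string(utf8)

# encode converts the characters, so the bytes change.
latin1 = utf8.encode("ISO-8859-1")
p inspect_string(latin1)

# force_encoding only swaps the label: same bytes, now invalid as UTF-8.
mislabeled = latin1.dup.force_encoding("UTF-8")
p inspect_string(mislabeled)
```

A useful rule of thumb that follows from this: reach for `encode` when the label is right and you want different bytes, and for `force_encoding` when the bytes are right and the label is wrong.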
  13. Some characters may not be available in other encodings
      They can be replaced with a predefined default character
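The slide's code is missing from the transcript; one way to get this behaviour is with the `undef:`, `invalid:` and `replace:` options of `String#encode`. The sketch below is an illustration under that assumption, and `to_ascii` is a hypothetical helper.

```ruby
# Convert to US-ASCII, substituting anything that has no ASCII equivalent.
# (to_ascii is a made-up helper for this illustration.)
def to_ascii(string, replacement = "?")
  string.encode("US-ASCII", invalid: :replace, undef: :replace, replace: replacement)
end

city = "Rzesz\u00F3w"

begin
  city.encode("US-ASCII")        # no options: the conversion fails
rescue Encoding::UndefinedConversionError => e
  puts "cannot convert: #{e.message}"
end

puts to_ascii(city)              # Rzesz?w
puts to_ascii(city, "*")         # Rzesz*w
```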
  14. For those who want more!
      • http://www.justinweiss.com/articles/3-steps-to-fix-encoding-problems-in-ruby/
      • http://www.justinweiss.com/articles/how-to-get-from-theyre-to-theyre/
      • http://www.joelonsoftware.com/articles/Unicode.html
      • http://kunststube.net/encoding/
      • https://betterexplained.com/articles/unicode/
      • http://blog.gatunka.com/2014/04/25/character-encodings-for-modern-programmers/
      • https://ruby-doc.org/core-2.3.3/Encoding.html
      • http://yehudakatz.com/2010/05/05/ruby-1-9-encodings-a-primer-and-the-solution-for-rails/
      • http://nuclearsquid.com/writings/ruby-1-9-encodings/
  15. Paweł Cyło, pawelcylo@gmail.com, @PawelCylo
      Thank you for your attention!

Editor's Notes

  • The ASCII control codes include 7, which made your computer beep, and 12, which caused the current page of paper to go flying out of the printer and a new one to be fed in
  • ASCII chart from a 1972 printer manual