String Encodings

Published in: Technology
  • ASCII: 128 characters. English upper- and lowercase letters, digits and symbols, plus some control characters.
  • 128 characters = 7 bits, but 1 byte = 8 bits. That leaves a whole other 128 values to play with.
  • Everyone wanted a different 128 characters, so encodings were born. Latin-1 is one of the most popular; it uses the other 128 values mostly for accented letters.
  • It would be simple if we all spoke English! But there are Japanese, Chinese, etc.
  • The only way to deal with this is to build characters out of more than 1 byte. This is a big issue for programming, as we now have to be very careful not to split a string in the middle of a character.
  • Ideally we would only ever have 1 encoding to deal with. That isn’t the case, but Unicode is as close as it gets.
  • They’re different.
  • Character sets are just internal mappings from numbers to characters: hex - int - char.
  • Here we are writing a lot of null bytes.
  • UTF-8 is a Unicode encoding that is 100% compatible with US-ASCII. It uses variable-length characters to maintain that compatibility. It is the best current solution for compatibility, size and support.
  • As we’ve seen, multibyte character sets are common and useful. Ruby 1.8 doesn’t care about encodings; it just displays characters based on a mapping. reverse(), split() and size() all break multibyte strings very easily. But Ruby doesn’t check, so you’ll never really know!
  • Ruby 1.8 supports UTF-8, right? Kinda. With $KCODE set, Unicode support extends to understanding character boundaries, so, magically, reverse(), split() and match() all work. Unfortunately it’s very basic support, so transforms like upcase() will bite you in the arse. Also, $KCODE is global, so there is no way to deal with more than 1 encoding at a time.
  • Ruby 1.9 brings full string encoding support for over 80 encodings. String objects now reference both raw bytes and an Encoding object.
  • A string is aware of its encoding, character length and byte length.
  • force_encoding retags a string with a different encoding; the actual bytes are not modified. If the bytes are not valid in the encoding the string is tagged with, manipulating its characters can raise an ArgumentError.
  • Calling encode() actually transcodes the underlying bytes: encode() returns a new string, and encode!() modifies the string in place.
  • String#each is gone in 1.9, replaced with explicit each_* methods (each_byte, each_char, each_codepoint, each_line) and Enumerator functions.
  • See http://unicode-utils.rubyforge.org/ for locale management.
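The distinction the notes draw between a character set (number to character) and an encoding (number to bytes) can be sketched in Ruby 1.9 or later. This is only an illustration: `describe_char` is a hypothetical helper name, not something from the talk, and it assumes the input character is a UTF-8 string.

```ruby
# Report the code point of a single character (the character-set view)
# and the bytes it becomes under a given encoding (the encoding view).
# `describe_char` is a made-up helper for illustration only.
def describe_char(char, encoding)
  code_point = char.ord
  {
    :code_point => code_point,
    :hex        => format("U+%04X", code_point),
    :bytes      => char.encode(encoding).bytes.to_a
  }
end

# "a" is U+0061 = 97 in every case; only the bytes differ.
p describe_char("a", "US-ASCII")[:bytes]        # one byte: 97
p describe_char("a", "UTF-8")[:bytes]           # same single byte, so ASCII-compatible
p describe_char("a", "UTF-32BE")[:bytes]        # three null bytes, then 97

# "\u00E9" (e with acute accent) is one character but two bytes in UTF-8.
e = "\u00E9"
p describe_char(e, "UTF-8")[:bytes]             # 195, 169
p describe_char(e, "ISO-8859-1")[:bytes]        # 233
p [e.size, e.bytesize]                          # 1 character, 2 bytes
```

The UTF-32BE line shows where the "lot of null bytes" comes from, and the two US-ASCII and UTF-8 lines show why UTF-8 can stay compatible with ASCII.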

    1. STRING ENCODINGS
    2. I’m still @malditogeek
    3. We’re hiring!
    4. STRING ENCODINGS
    5. FIRST THERE WAS... ASCII
    6. BUT WHAT ABOUT THE 8TH BIT?
    7. UH OH!
    8. BUT MY LANGUAGE HAS 2000 CHARACTERS
    9. WELCOME TO MULTIBYTE ENCODINGS
    10. UNICODE
    11. CHARACTER SETS vs ENCODINGS
    12. CHARACTER SETS
        U+0061 = 97 = a
    13. ENCODINGS
        "abc".encode("UTF-32BE")
        "\x00\x00\x00a\x00\x00\x00b\x00\x00\x00c"
    14. UTF-8
    15. RUBY 1.8
        STRINGS ARE JUST BYTE ARRAYS!
    16. $KCODE = 'u'
    17. RUBY 1.9
    18. M17N
    19. STRING EXAMPLES
        e = "é"
        e.encoding.name # => UTF-8
        e.size # => 1
        e.bytesize # => 2
    20. FORCING ENCODINGS
        x.encoding.name # => ISO-8859-1
        x.bytesize # => 6
        x.valid_encoding? # => true
        x.force_encoding("UTF-8")
        x.encoding.name # => UTF-8
        x.bytesize # => 6
        x.valid_encoding? # => false
        x =~ /x/ # raises ArgumentError: invalid byte sequence in UTF-8
    21. TRANSCODING
        x.encoding.name # => ISO-8859-1
        x.bytesize # => 6
        x.valid_encoding? # => true
        x.encode!("UTF-8")
        x.encoding.name # => UTF-8
        x.bytesize # => 12
        x.valid_encoding? # => true
    22. COMPATIBILITY
        if Encoding.compatible?(ascii_string, utf8_string)
          new_string = ascii_string + utf8_string
          new_string.encoding.name # => UTF-8
        end
    23. ITERATION
        x.bytes
        x.each_byte {|b| puts b }
        x.codepoints
        x.each_codepoint {|c| puts c }
        x.chars
        x.each_char {|c| puts c }
        x.lines
        x.each_line {|l| puts l }
    24. SOURCE ENCODING
        # encoding: UTF-8
    25. INTERNAL ENCODING
        > ruby -E :UTF-8
        Encoding.default_internal = "UTF-8"
    26. EXTERNAL ENCODING
        > ruby -E UTF-8:
        Encoding.default_external = "UTF-8"
        File.open("file.txt", "w:UTF-8")
    27. WHAT DO?
        USE MAGIC COMMENTS
        DECLARE IO ENCODINGS
        CONVERT BEFORE COMPARISONS
        UNICODE-UTILS GEM
    28. C’EST FINI
    29. We’re hiring!
    30. Thanks!
        http://blog.grayproductions.net/articles/understanding_m17n
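The forcing, transcoding and compatibility slides never show how `x` was built, so the following is a minimal sketch under stated assumptions rather than the speaker's own code. It assumes a hypothetical sample of six accented Latin-1 characters (one byte each), which is consistent with the 6 and 12 byte sizes on the slides. The helper names `latin1_sample`, `summary`, `matchable?` and `safe_concat` are invented for illustration.

```ruby
# Hypothetical sample data: six accented characters as ISO-8859-1 bytes.
def latin1_sample
  [0xE9, 0xE8, 0xEA, 0xEB, 0xE0, 0xE2].pack("C*").force_encoding("ISO-8859-1")
end

# The three properties the slides inspect, gathered in one place.
def summary(str)
  [str.encoding.name, str.bytesize, str.valid_encoding?]
end

# True if the string can be matched against a regexp without raising.
def matchable?(str)
  str =~ /x/
  true
rescue ArgumentError
  false
end

# Concatenate only when Ruby considers the two strings compatible.
def safe_concat(a, b)
  Encoding.compatible?(a, b) ? a + b : nil
end

original   = latin1_sample
retagged   = latin1_sample.force_encoding("UTF-8") # same bytes, new label
transcoded = latin1_sample.encode("UTF-8")         # new bytes, same characters

p summary(original)    # ISO-8859-1, 6 bytes, valid
p summary(retagged)    # UTF-8, still 6 bytes, no longer valid
p summary(transcoded)  # UTF-8, 12 bytes, valid
p matchable?(retagged)   # false: the regexp match raises ArgumentError
p matchable?(transcoded) # true

ascii_string = "abc".encode("US-ASCII")
utf8_string  = "\u00E9"
p safe_concat(ascii_string, utf8_string).encoding.name # UTF-8
p safe_concat(original, utf8_string)                   # nil: incompatible
```

The retagged string keeps its byte count because force_encoding only changes the label, while the transcoded one doubles in size because each of these characters needs two bytes in UTF-8. Transcoding the Latin-1 string first, as the "convert before comparisons" advice suggests, is what makes the last concatenation possible.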
