How Unidecoder Transliterates UTF-8 to ASCII

Slides from my talk at Paris.rb on 2014-11-07. How does UTF-8 work? How can we leverage it to convert Chinese, Russian, or any other non-ASCII characters to ASCII? Here is what the Unidecoder gem does.

Published in: Technology

  1. Unidecoder. Simon Courtois - @happynoff
  2. Transliteration
  3. Ni Hao
  4. ПРИВЕТ → PRIVIeT
  5. How does it work?
  6. At the beginning there was ASCII
  7. A = 65, B = 66, C = 67; a = 97, b = 98, c = 99
  8. A = 65 = 10 00001; a = 97 = 11 00001 (bit weights: 64 32 16 8 4 2 1)
  9. B = 66 = 10 00010; b = 98 = 11 00010 (bit weights: 64 32 16 8 4 2 1)
  10. Then… 8-bit computers!
  11. So every country had its own encoding(s)!
  12. All was fine until…
  13. The World Wide Web
  14. UTF-8 to the rescue!
  15. Everything on 32 bits?
  16. Bad idea: c a f é
  17. Bad idea: c a f é
  18. Bad idea: 0
  19. A better idea: A = 65 = 010 00001; two bytes: 110 XXXXX 10 XXXXXX; three bytes: 1110 XXXX 10 XXXXXX 10 XXXXXX
  20. A better idea: 110 XXXXX 10 XXXXXX → 110 10000 10 011111
  21. A better idea: 10000 011111 → 10000011111 = 1055 = П
  22. So, how does unidecoder work?
  23. How do we go from П to P?
  24. Start from a string like “П”
  25. Unpack it: "П".unpack("U") → [1055] = 00000100 00011111 → 4 and 31
  26. 4 → x04 → x04.yml: Ie (0), Io (1), Dj (2), …, P (31)
  27. How to obtain 4 and 31?
  28. unpacked = 1055 = 0000010000011111; unpacked >> 8: 0000010000011111 → 00000100 = 4
  29. How to obtain 4 and 31?
  30. unpacked = 1055 = 0000010000011111; unpacked & 255: 0000010000011111 & 0000000011111111 → 31
  31. Brain fried yet? Advertising time!
  32. www.tinci.fr: Web Development, Software Development, Consulting, Support. @tincihq
  33. Resources. Characters, Symbols and the Unicode Miracle: bit.ly/why-utf8. Unidecoder: github.com/norman/unidecoder. Slides: bit.ly/unidecoder
  34. Thank you! Simon Courtois - @happynoff
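
The "A better idea" slides pack the code point 1055 (П) into the two-byte layout 110 XXXXX 10 XXXXXX. A minimal Ruby sketch of that packing is below. The helper name `utf8_two_bytes` is made up for illustration and is not part of Unidecoder; Ruby's own string encoding already does this for you.

```ruby
# Encode a code point from the two-byte range (U+0080..U+07FF) by hand,
# following the 110XXXXX 10XXXXXX layout from the slides.
# utf8_two_bytes is a hypothetical helper, not a Ruby or Unidecoder API.
def utf8_two_bytes(code_point)
  unless (0x80..0x7FF).cover?(code_point)
    raise ArgumentError, "code point is outside the two-byte range"
  end

  high = 0b11000000 | (code_point >> 6)        # 110 prefix + top 5 bits
  low  = 0b10000000 | (code_point & 0b111111)  # 10 prefix + low 6 bits
  [high, low]
end

bytes = utf8_two_bytes(1055)                   # 1055 is the code point of "П"
bits  = bytes.map { |b| b.to_s(2).rjust(8, "0") }

puts bits.join(" ")                            # the two bytes, in binary
puts bytes == [1055].pack("U").bytes           # compare with Ruby's own encoding
```

The last line checks the hand-built bytes against what Ruby produces for the same code point, so the sketch verifies itself.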

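
The lookup described on the last technical slides (unpack the character, take the high byte to pick the table file, take the low byte as the index) can be sketched in a few lines of Ruby. This is an illustration under stated assumptions, not the gem's actual code: `TABLES` is a hypothetical stand-in holding only the four x04.yml entries shown on the slides, and `transliterate_char` and its "?" fallback are invented for the example.

```ruby
# Hypothetical stand-in for x04.yml: only the entries shown on the slides.
# The real table file has an entry for each of the 256 positions.
TABLES = {
  "x04" => { 0 => "Ie", 1 => "Io", 2 => "Dj", 31 => "P" }
}.freeze

# transliterate_char is an illustrative name, not Unidecoder's API.
def transliterate_char(char)
  unpacked = char.unpack("U").first          # "П" -> 1055
  group    = format("x%02x", unpacked >> 8)  # high byte 4 -> "x04"
  index    = unpacked & 255                  # low byte -> 31
  # The "?" fallback for unknown characters is an assumption of this sketch.
  TABLES.fetch(group, {}).fetch(index, "?")
end

puts transliterate_char("П")
```

Splitting the code point this way keeps each table small: one file per block of 256 code points, loaded only when a character from that block shows up.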