The str/bytes nightmare before python2 EOL

Note: downloading the slides will give you a better experience.

This talk was given at PyConTW 2019.
https://tw.pycon.org/2019/en-us/events/talk/839036452602904785/

The YouTube video will be released later.



  1. 1. The str/bytes nightmare before python2 EOL. Kir Chou, A9 (Amazon Product Search)
  2. 2. Kir Chou (note35, kir.chou)
  3. 3. 3
  4. 4. Google Trends: encoding > recursion (REF)
  5. 5. Outline
      1. Objective
      2. Background
      3. Python String 101 – What is a string?
      4. Python String 101 – Why this design?
      5. 5 + 1 Treatments
  6. 6. 🚨 Confusing Terms 🚨
      • Str: str() of Python
      • Bytes: bytes() of Python
      • Text: unicode() in Python2 or str() in Python3
      • String: Text ∪ Bytes
      You just need to know that these terms mean different things in this talk.
  7. 7. Objective
  8. 8. Dunning–Kruger effect curve 😃 🤨 🥺 🧐 😯 ✋😌👌 🤯
  9. 9. Dunning–Kruger effect curve: What is it? Why this design? Treatment?
  10. 10. Background
  11. 11. Python2 End-Of-Life https://pythonclock.org/
  12. 12. Optimistic Numbers. Sources: JetBrains 2018 survey; Victor Stinner, "Python 3: ten years later" (PyCon 2018)
  13. 13. Python2.7 Legacy Dependencies
  14. 14. 14
  15. 15. https://www.youtube.com/watch?v=BS-HyV3V7GI
  16. 16. Migration – Pain Points (painful to easy)
  17. 17. Python String 101: text (unicode) vs bytes
  18. 18. Text: How to present info in memory? ℙ ƴ ☂ ℌ ø ἤ (e2 84 99 c6 b4 e2 98 82 e2 84 8c c3 b8 e1 bc a4)
  19. 19. Bytes: How to store info into memory? ℙ ƴ ☂ ℌ ø ἤ (e2 84 99 c6 b4 e2 98 82 e2 84 8c c3 b8 e1 bc a4)
  20. 20. Python3. Default: text
      >>> 'python'
      'python'
      >>> 'ℙƴ☂ℌøἤ'
      'ℙƴ☂ℌøἤ'
  21. 21. Python3. b' ': ASCII code (0~127); text.encode(encoding)
      >>> b'python'
      b'python'
      >>> 'python'.encode(encoding='ascii')
      b'python'
      >>> b'ℙƴ☂ℌøἤ'
      SyntaxError: bytes can only contain ASCII literal characters.
      >>> 'ℙƴ☂ℌøἤ'.encode(encoding='utf-8')
      b'\xe2\x84\x99\xc6\xb4\xe2\x98\x82\xe2\x84\x8c\xc3\xb8\xe1\xbc\xa4'
  22. 22. Python2. Default: auto-encoded bytes
      >>> 'python'
      'python'
      >>> 'ℙƴ☂ℌøἤ'
      '\xe2\x84\x99\xc6\xb4\xe2\x98\x82\xe2\x84\x8c\xc3\xb8\xe1\xbc\xa4'
  23. 23. Python2. u' ': text
      >>> u'ℙƴ☂ℌøἤ'
      u'\u2119\u01b4\u2602\u210c\xf8\u1f24'
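The text/bytes distinction on slides 18 to 23 can be sketched in Python 3 as follows. This is a minimal illustration using the sample string from the slides; the variable names are assumptions made for the example.

```python
# The sample text from the slides, written with escapes: six code points.
text = "\u2119\u01b4\u2602\u210c\u00f8\u1f24"

# Text is a sequence of Unicode code points.
code_points = ["U+{:04X}".format(ord(ch)) for ch in text]

# Bytes are what gets stored or transmitted; UTF-8 is one possible encoding.
utf8_bytes = text.encode("utf-8")
utf8_hex = " ".join("{:02x}".format(b) for b in utf8_bytes)

print(code_points)
print(utf8_hex)
```

Six characters of text become sixteen bytes under UTF-8; a different encoding would produce a different byte sequence for the same text.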
  24. 24. Python String 101: encode vs decode. Text -> Bytes: encode(encoding). Bytes -> Text: decode(encoding)
  25. 25. Python3. Text -> Bytes: encode(encoding). Bytes -> Text: decode(encoding)
      >>> dir(str())
      […, 'encode', …]
      >>> dir(bytes())
      […, 'decode', …]
  26. 26. Python2
      >>> dir(str())
      […, 'decode', 'encode', …]
      >>> dir(bytes())
      […, 'decode', 'encode', …]
      >>> dir(unicode())
      […, 'decode', 'encode', …]
  27. 27. Python2
      >>> dir(str())
      […, 'decode', 'encode', …]
      >>> dir(bytes())
      […, 'decode', 'encode', …]
      >>> dir(unicode())
      […, 'decode', 'encode', …]
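A short Python 3 sketch of the one-way design: text only encodes, bytes only decode. The variable names are illustrative.

```python
text = "\u2119\u01b4\u2602\u210c\u00f8\u1f24"

data = text.encode("utf-8")        # text -> bytes
roundtrip = data.decode("utf-8")   # bytes -> text

# In Python 3 each type offers only one direction.
str_can_decode = hasattr(str, "decode")
bytes_can_encode = hasattr(bytes, "encode")

print(roundtrip == text, str_can_decode, bytes_can_encode)
```

On Python 2 both attribute checks would succeed, which is the two-way behavior the following slides discuss.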
  28. 28. Python String 101: Python2 Design Philosophy
  29. 29. 29
  30. 30. Python2 is well designed for ASCII, BUT…
      • Most encodings only support one direction: utf-8, latin-1
      >>> 'ℙƴ☂ℌøἤ'.encode()
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
      UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)
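Why does an encode() call raise a *decode* error? In Python 2, calling encode() on a byte string first decodes it implicitly with the default 'ascii' codec and only then encodes. The sketch below imitates those two steps in Python 3; `py2_style_encode` is a hypothetical helper written for this illustration, not a real API.

```python
def py2_style_encode(raw, encoding="ascii"):
    """Hypothetical helper imitating Python 2's str.encode() on a byte string."""
    # Step 1 (implicit in Python 2): bytes -> text with the ASCII default.
    text = raw.decode("ascii")
    # Step 2 (what the caller asked for): text -> bytes.
    return text.encode(encoding)


# Pure-ASCII input survives both steps.
ascii_result = py2_style_encode(b"python")

# Non-ASCII bytes fail in the hidden decode step.
non_ascii = "\u2119\u01b4\u2602\u210c\u00f8\u1f24".encode("utf-8")
try:
    py2_style_encode(non_ascii)
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

print(ascii_result)
print(error_message)
```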
  31. 31. But why?
      >>> dir(str())
      […, 'decode', 'encode', …]
      >>> dir(bytes())
      […, 'decode', 'encode', …]
      >>> dir(unicode())
      […, 'decode', 'encode', …]
  32. 32. Exceptions
      • base64, rot13
      Python2:
      >>> 'python'.encode('base64')
      'cHl0aG9u\n'
      >>> 'cHl0aG9u\n'.decode('base64')
      'python'
      Python3:
      >>> import base64
      >>> base64.b64encode('python'.encode('utf-8'))
      b'cHl0aG9u'
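In Python 3 these transforms are no longer methods on str; a sketch of the equivalents using the standard `base64` and `codecs` modules (variable names are illustrative):

```python
import base64
import codecs

# base64 works on bytes, so text must be encoded first.
encoded = base64.b64encode("python".encode("utf-8"))
decoded = base64.b64decode(encoded).decode("utf-8")

# rot13 is a text-to-text transform, reachable through the codecs module.
rot = codecs.encode("python", "rot13")
back = codecs.decode(rot, "rot13")

print(encoded, decoded, rot, back)
```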
  33. 33. Python String 101: Fundamental Difference - Encoding
  34. 34. "You must be told, or you have to guess" (Ned Batchelder)
  35. 35. ASCII encoding behaves similarly, BUT…
      Python3:
      >>> b'python' == u'python'
      False
      Python2:
      >>> b'python' == u'python'
      True
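A small Python 3 sketch of the comparison above: bytes and text never compare equal until one side is converted with a known encoding. Variable names are illustrative.

```python
raw = b"python"
text = "python"

# Comparing bytes with text is always False in Python 3.
same_without_decoding = (raw == text)

# Decoding first makes the comparison meaningful.
same_after_decoding = (raw.decode("ascii") == text)

print(same_without_decoding, same_after_decoding)
```

Running the interpreter with the `-b` flag turns such mixed comparisons into warnings, which can help find them during a migration.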
  36. 36. Treatment 1: Knowing the Fact of String
  37. 37. The Fact of String (by Ned Batchelder)
      • Python3: string = unicode
        • Encoding needs to be handled manually
        • One-way encode/decode behind str/bytes (Good)
      • Python2: string = auto-encoded bytes
        • ASCII is handled perfectly
        • Two-way encode/decode behind str/bytes (broke down once Python became widely used with many different human languages)
  38. 38. Treatment 2: Explicit Declaration (from the community approach to supporting python3)
  39. 39. Consistent IO
      • Standard I/O (bytes)
        • Python2: sys.stdin / sys.stdout
        • Python3: sys.stdin.buffer / sys.stdout.buffer
      • File IO
        import io  # consistent API in both versions
        io.open('path/to/file', 'wt')  # 't' for text, 'b' for bytes
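A minimal sketch of explicit file IO with `io.open`, assuming UTF-8 as the file encoding. The file name and variables are made up for the example, and the temporary-directory helper used here is Python 3 only.

```python
import io
import os
import tempfile

text = "\u2119\u01b4\u2602"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.txt")

    # Text mode: pass text in and name the encoding explicitly.
    with io.open(path, "wt", encoding="utf-8") as f:
        f.write(text)

    # Binary mode: the same file read back as raw bytes.
    with io.open(path, "rb") as f:
        raw = f.read()

    # Text mode again: the bytes are decoded with the stated encoding.
    with io.open(path, "rt", encoding="utf-8") as f:
        read_back = f.read()

print(raw)
print(read_back == text)
```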
  40. 40. Bytes or Text (with Encoding)
      • u'' or b''?
        • No more unprefixed string ''
      • Which encoding is used for the text?
        • No more guessing, always provide the encoding: latin-1, utf-8…
      Python 2 and 3:
      def my_encrypt(text, encoding='utf-8'): …
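One way to follow the "always provide the encoding" advice is to funnel conversions through small helpers that take the encoding as a parameter. `to_bytes` and `to_text` below are hypothetical names chosen for this sketch, not functions from the talk.

```python
def to_bytes(value, encoding="utf-8"):
    """Return bytes; text is encoded with the encoding the caller names."""
    if isinstance(value, bytes):
        return value
    return value.encode(encoding)


def to_text(value, encoding="utf-8"):
    """Return text; bytes are decoded with the encoding the caller names."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# The same character yields different bytes under different encodings.
utf8 = to_bytes("\u00f8")
latin1 = to_bytes("\u00f8", "latin-1")

print(utf8, latin1)
```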
  41. 41. Python 2 and 3:
      • # -*- coding: utf-8 -*-
      • sys.getdefaultencoding()
      • Encoding of your Operating System
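The defaults mentioned on this slide can be inspected at runtime. A sketch for Python 3, where the interpreter default is 'utf-8' (on Python 2 it is 'ascii'); the locale and filesystem values depend on the operating system.

```python
import locale
import sys

# Interpreter-wide default used for implicit str/bytes conversions.
default_encoding = sys.getdefaultencoding()

# Encoding suggested by the OS locale; open() uses it when none is given.
locale_encoding = locale.getpreferredencoding(False)

# Encoding used to convert file names to and from bytes.
filesystem_encoding = sys.getfilesystemencoding()

print(default_encoding, locale_encoding, filesystem_encoding)
```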
  42. 42. Treatment 3: Unicode Sandwich (from the community approach to supporting python3)
  43. 43. Unicode Sandwich: Decode ASAP, Encode ALAP. Don’t care about encoding!
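A minimal sketch of the sandwich, assuming UTF-8 at both edges: decode on input, work only with text in the middle, encode on output. The function `shout` is a made-up example.

```python
def shout(raw_input, encoding="utf-8"):
    text = raw_input.decode(encoding)   # decode as soon as possible
    result = text.upper()               # the middle layer sees text only
    return result.encode(encoding)      # encode as late as possible


source = "\u00f8re".encode("utf-8")

# Text-aware processing upper-cases the non-ASCII letter too.
sandwiched = shout(source)

# Operating directly on bytes only touches the ASCII letters.
bytes_only = source.upper()

print(sandwiched, bytes_only)
```

The difference between the two results shows why the processing step belongs in the text layer.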
  44. 44. Treatment 4 (minor): Python2/3 compatibility (from the community approach to supporting python3)
  45. 45. Python2:
      >>> from __future__ import unicode_literals
      Add unicode_literals after porting to python3
  46. 46. Treatment 5: Typing (learned from the Dropbox migration notes)
  47. 47. Typing + [mypy] [pyre]
      Untyped:
      def encrypt(data, key):
          return cipher
      Annotations (Python 3 only):
      def encrypt(data: bytes, key: bytes) -> bytes:
          return cipher
      Type comments (Python 2 and 3):
      def encrypt(data, key):
          # type: (bytes, bytes) -> bytes
          return cipher
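A runnable sketch of a bytes-typed signature. The XOR cipher here is a toy stand-in chosen for illustration, not the `encrypt` function from the slide.

```python
def xor_cipher(data: bytes, key: bytes) -> bytes:
    """Toy XOR cipher, used only to illustrate bytes-typed signatures."""
    if not key:
        raise ValueError("key must not be empty")
    # Iterating over bytes yields integers in Python 3.
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


cipher = xor_cipher(b"python", b"k")
plain = xor_cipher(cipher, b"k")

print(cipher, plain)
```

With these annotations, a type checker such as mypy would reject a call like `xor_cipher("python", "k")` before the code runs, because text is passed where bytes are declared.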
  48. 48. Treatment SP: C-API
  49. 49. C-API. Example @ pyconjp 2019 sprint: https://github.com/note35/SupportingPython3-notes
      #if PY_MAJOR_VERSION >= 3
      #define Py2str_FromStringAndSize(x, y) PyBytes_FromStringAndSize(x, y)
      #else
      #define Py2str_FromStringAndSize(x, y) PyString_FromStringAndSize(x, y)
      #endif
  50. 50. Conclusion
  51. 51. Q: Supporting python3? A: ✋😌👌
  52. 52. Take-Home Messages
      • Write text/bytes explicitly with typing
      • Always provide encoding for bytes
      • Apply Unicode Sandwich if possible
      • Copy encode/decode from StackOverflow
      Python String is no longer a nightmare! 🎉 (Even if you only write Python3)
  53. 53. Verify Your Understanding! https://stackoverflow.com/search?q=UnicodeEncodeError
  54. 54. 54
  55. 55. Reference: Docs
      • Python2 unicode
      • Python3 unicode
      • pyporting
      • python-future.org/compatible_idioms.pdf
      • 2018 Jetbrains Survey
      • Dropbox Migration Notes
  56. 56. Reference: Talks
      • Ned Batchelder - Pragmatic Unicode, or, How do I stop the pain?
      • Guido van Rossum - BDFL Python 3 retrospective
      • Brett Cannon - How to make your code Python 2/3 compatible - PyCon 2015
      • Edward Schofield - Writing Python 2/3 compatible code
      • Victor Stinner - Python 3: ten years later - PyCon 2018
