Character Encoding & Unicode - How to (╯°□°）╯︵ ┻━┻ with dignity

1.
ODE TO A SHIPPING LABEL
by Carlos Bueno

Once there was a little o,
with an accent on top like só.

It started out as UTF8,
(universal since '98),
but the program only knew latin1,
and changed little ó to "Ã³" for fun.

A second program saw the "Ã³"
and said "I know HTML entity!"
So "Ã³" was smartened to "&ATILDE;&SUP3;"
and passed on through happily.

Another program saw the tangle
(more precisely, ampersands to mangle)
and thus the humble "&ATILDE;&SUP3;"
became "&AMP;AMP;ATILDE;&AMP;AMP;SUP3;"
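The chain of misreadings in the poem can be reproduced in a few lines. This is a minimal sketch, not code from the talk: the helper names `to_entities` and `escape_ampersands` are hypothetical, and the entity names come out in their standard mixed case (`&Atilde;`), not the capitals shown on the slide. It is written to run unchanged on Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
# Sketch only: the helper names below are illustrative, not from the talk.
try:
    from html.entities import codepoint2name   # Python 3
except ImportError:
    from htmlentitydefs import codepoint2name  # Python 2.7


def to_entities(text):
    """Replace each non-ASCII character that has a named HTML entity."""
    out = []
    for ch in text:
        name = codepoint2name.get(ord(ch))
        out.append(u'&%s;' % name if ord(ch) > 127 and name else ch)
    return u''.join(out)


def escape_ampersands(text):
    """Naively escape every ampersand, even one that starts an entity."""
    return text.replace(u'&', u'&amp;')


little_o = u'\u00f3'                                 # o with acute accent
step1 = little_o.encode('utf-8').decode('latin-1')   # UTF-8 bytes read as latin1
step2 = to_entities(step1)                           # "smartened" to entities
step3 = escape_ampersands(step2)                     # ampersands mangled
```

Each further pass through `escape_ampersands` adds another `amp;`, which is how a single character grows into the tangle at the end of the poem.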

2.
Character Encoding & Unicode
How to (╯°□°）╯︵ ┻━┻ with dignity
Esther Nam & Travis Fischer
PyCon US 2014, Montréal

6.
Uni-wat?!

7.
┻━┻ ︵ヽﾉ︵┻━┻

8.
How to (╯°□°）╯︵┻━┻ with dignity

9.
“You'll be pleased to know that your talk title crashed our meeting robot, which is a great argument for the relevance of this talk. :-) ...”
– Luke Sneeringer | Program Committee Chair

10.
Python 3 is out of scope

11.
The Fundamentals of Unicode

12.
Humans use text. Computers speak bytes.

13.
a -> 01100001

14.
     ASCII      ISO-8859-15 (latin-9)   CP-1252 (Windows 1252)   UTF-8
a    01100001   01100001                01100001                 01100001
€    NA         10100100                10000000                 11100010 10000010 10101100
¤    NA         NA                      10100100                 11000010 10100100
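The cells of this table can be regenerated with Python's own codecs. The sketch below is an illustration, not part of the original slides; `bits` is a hypothetical helper, and `iso-8859-15` and `cp1252` are the codec names Python uses for the two middle columns.

```python
# -*- coding: utf-8 -*-
# Sketch: regenerate the table's cells; "NA" marks characters an encoding lacks.
ENCODINGS = ['ascii', 'iso-8859-15', 'cp1252', 'utf-8']


def bits(char, encoding):
    """Return the character's encoded bytes as space-separated binary octets."""
    try:
        encoded = char.encode(encoding)
    except UnicodeEncodeError:
        return 'NA'
    # bytearray yields integers on both Python 2.7 and Python 3
    return ' '.join(format(byte, '08b') for byte in bytearray(encoded))


for char in [u'a', u'\u20ac', u'\u00a4']:   # a, euro sign, currency sign
    row = [bits(char, enc) for enc in ENCODINGS]
    print(u'U+%04X: %s' % (ord(char), u' | '.join(row)))
```

The same byte, 10100100, means € in ISO-8859-15 and ¤ in CP-1252, which is why a byte value alone does not identify a character.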

15.

16.

17.

18.
π — ‽☠ ☁ ☂ ☃ ☄ ★ ☆ ☇ ☈ ☉ ☊ ☋ ☌ ☍ ☎ ☏ ☐ ☑ ☒ ☓ ☘ ☙ ☚ ☛ ☜ ☝ ☞ ☟ ☠ ☡ ☢ ☣ ☤ ☥ ☦ ☧ ☨ ☩ ☪ ☫ ☬ ☭ ☮ ☯ ☸ ☹ ☺ ☻ ☼ ☽ ☾ ☿ ♀ ♁ ♂ ♃ ♄ ♅ ♆ ♇ ♔ ♕ ♖ ♗ ♘ ♙ ♚ ♛ ♜ ♝ ♞ ♟ ♠ ♡ ♢ ♣ ♤ ♥ ♦ ♧ ♨ ♩ ♪ ♫ ♬ ♭ ♯ ♰ ♾ ⚀ ⚁ ⚂ ⚃ ⚄ ⚅ ⚆ ⚇ ⚈

21.
a -> U+0061
Character -> Unicode Code Point

22.
Unicode
a -> U+0061
Character -> Unicode Code Point

23.
Unicode
a -> U+0061
Character -> LATIN SMALL LETTER A
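Python can report both the code point and the official character name. A minimal sketch using the standard `unicodedata` module; `describe` is a hypothetical helper name.

```python
import unicodedata


def describe(char):
    """Return the 'U+XXXX NAME' label for a single Unicode character."""
    return u'U+%04X %s' % (ord(char), unicodedata.name(char))


label_a = describe(u'a')
label_euro = describe(u'\u20ac')
label_snowman = describe(u'\u2603')
```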

24.
Computers speak bytes.

25.
Unicode
a
U+0061 -> 01100001
Unicode Code Point -> Binary Encoding

26.
Unicode
a
U+0061 -> 01100001
Unicode Code Point -> Binary Encoding

27.
UTF-8 Unicode Transformation Format

28.
Unicode != UTF-8
Code Points   Binary Encoding
U+0061        01100001
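One way to see that a code point and its binary encoding are separate things: the same code point produces different bytes under different Unicode encodings. A short sketch (the variable names are illustrative):

```python
euro = u'\u20ac'   # one code point: U+20AC EURO SIGN

utf8_bytes = euro.encode('utf-8')        # three bytes
utf16_bytes = euro.encode('utf-16-be')   # two bytes
utf32_bytes = euro.encode('utf-32-be')   # four bytes

# Every encoding decodes back to the same single code point.
same_text = (utf8_bytes.decode('utf-8')
             == utf16_bytes.decode('utf-16-be')
             == utf32_bytes.decode('utf-32-be')
             == euro)
```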

29.
Layers of Abstraction

30.
• Display (Glyphs | Fonts): Let them eat cake!

31.
• Display (Glyphs | Fonts): Let them eat cake!
• Text (Unicode | Code Points): U+0061

32.
• Display (Glyphs | Fonts): Let them eat cake!
• Text (Unicode | Code Points): U+0061
• Storage (Binary | UTF-8): 01100001

33.
Unicode & Python [Python 2.7]

34.
str type
>>> euro_bytestring = '€'
>>> type(euro_bytestring)
<type 'str'>
[Python 2.7]

35.
unicode type
# € code point
>>> euro_unicode = u'\u20ac'
>>> type(euro_unicode)
<type 'unicode'>
[Python 2.7]

36.
Unicode code points: u'\u20ac'
Bytes (UTF-8): '\xe2\x82\xac'
[Python 2.7]

37.
Unicode code points: u'\u20ac'
'\xe2\x82\xac'.decode('utf8')   (bytes -> Unicode)
Bytes (UTF-8): '\xe2\x82\xac'
[Python 2.7]

39.
Unicode code points: u'\u20ac'
'\xe2\x82\xac'.decode('utf8')   (bytes -> Unicode)
u'\u20ac'.encode('utf8')        (Unicode -> bytes)
Bytes (UTF-8): '\xe2\x82\xac'
[Python 2.7]

40.
Unicode code points: u'\u20ac'
'\xe2\x82\xac'.becode('utf8')
u'\u20ac'.uncode('utf8')
Bytes (UTF-8): '\xe2\x82\xac'
[Python 2.7]

41.
You CANNOT infer an encoding from a bytestring
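A short illustration of why: the same three bytes decode without error under several encodings, each time to different text. This is a sketch with illustrative variable names; nothing in the bytes says which reading is the intended one.

```python
mystery = b'\xe2\x82\xac'   # three bytes with no label attached

as_utf8 = mystery.decode('utf-8')      # one character: the euro sign
as_cp1252 = mystery.decode('cp1252')   # three printable characters
as_latin1 = mystery.decode('latin-1')  # three characters, one a control code
```

All three calls succeed, so a successful decode is not evidence that the encoding was right.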

42.
#! /usr/bin/python
# -*- coding: utf8 -*-

# Opened file should be latin-1 encoded
# If it’s not, call tech support ASAP
with open("input_file.csv") as input_file:

Date: Wed, 11 Apr 2014 11:15:55 -0600
To: foo@bar.com
From: bar@foo.com
Subject: Character encoding
MIME-Version: 1.0
Content-Type: text/html; charset="UTF-8"

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD …>
<html xmlns="http://www.w3.org/1999/xhtml" …>

43.
Best Practices

44.
Example Application

45.
Author          Review
G. van Rossum   If you decide to design your own car there are thousands sort of car…
R. Ebert        Every great car should feel new every time you drive it.
L. Torvalds     Volvo isn’t evil, they just make really crappy cars.

46.
Author          Review
G. van Rossum   If you decide to design your own car there are thousands sort of car…
R. Ebert        Every great car should feel new every time you drive it.
L. Torvalds     Volvo isn’t evil, they just make really crappy cars.

-> Application (processes text)

47.
Author          Review
G. van Rossum   If you decide to design your own car there are thousands sort of car…
R. Ebert        Every great car should feel new every time you drive it.
L. Torvalds     Volvo isn’t evil, they just make really crappy cars.

-> Application (processes text)
-> PSQL

48.

49.
Encoding: Windows 1252 (CP-1252)

50.
Montreal -> Montréal

51.
psql=# set server_encoding to "utf-8";

52.
My friend said: “I cannot believe this is a Volvo! I had a car just like this when I lived in Montreal.” He told me he had paid 9400€ for his.
Sample Review Text

53.

54.

55.
# -*- coding: utf-8 -*-

reviews_file = open("reviews_file.csv")

for row_text in reviews_file:
    author, date, review_text = row_text.split(",")
    converted_review = review_text.replace("Montreal", "Montréal")
    DB.insert(ReviewTable, author, date, converted_review)
[Python 2.7]

56.

57.

58.

59.

60.

61.

63.
My friend said: �I cannot believe this is a Volvo! I had a car just like this when I lived in Montréal.� He told me he had paid 9400� for his.
Output from UTF-8 encoded PSQL database

64.

65.

66.

68.

69.

70.

71.

72.
My friend said: “I cannot believe this is a Volvo! I had a car just like this when I lived in Montreal.” He told me he had paid 9400€ for his.
Original CP-1252 Data

73.
My friend said: “I cannot believe this is a Volvo! I had a car just like this when I lived in Montréal.” He told me he had paid 9400€ for his.
Mixed CP-1252 & UTF-8

74.
My friend said: �I cannot believe this is a Volvo! I had a car just like this when I lived in Montréal.� He told me he had paid 9400� for his.
Interpreted as UTF-8 by database
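The three stages above can be reproduced at the byte level. The sketch below is an illustration with a shortened review text and no real database: decoding with the `replace` error handler stands in for a reader that treats every byte as UTF-8.

```python
# -*- coding: utf-8 -*-
# Sketch of the failure: CP-1252 input, a UTF-8 replacement, a UTF-8 reader.
original = u'\u201cI lived in Montreal.\u201d He paid 9400\u20ac'.encode('cp1252')

# The script's source file is UTF-8, so its "Montréal" literal is UTF-8 bytes.
mixed = original.replace(b'Montreal', u'Montr\u00e9al'.encode('utf-8'))

# Reading the mixed bytes as UTF-8 turns each invalid byte into U+FFFD.
shown = mixed.decode('utf-8', 'replace')
```

The é survives because it really is UTF-8, while the CP-1252 quotes and euro sign (bytes 0x93, 0x94 and 0x80) are not valid UTF-8 and come out as replacement characters.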

75.
Know your encodings
Best Practice #1

76.

77.

78.
# -*- coding: utf-8 -*-

reviews_file = open("reviews_file.csv")

for row_text in reviews_file:
    unicode_row = row_text.decode()
    author, date, review_text = unicode_row.split(",")
    converted_review = review_text.replace("Montreal", "Montréal")
    DB.insert(ReviewTable, author, date, converted_review)
[Python 2.7]

79.
# -*- coding: utf-8 -*-

reviews_file = open("reviews_file.csv")

for row_text in reviews_file:
    unicode_row = row_text.decode()
    author, date, review_text = unicode_row.split(u",")
    converted_review = review_text.replace(u"Montreal", u"Montréal")
    DB.insert(ReviewTable, author, date, converted_review)
[Python 2.7]

81.
Traceback (most recent call last):
  File "...", line ..., in <module>
    unicode_row = row_text.decode()
UnicodeDecodeError: 'ascii' codec can't decode byte 0x93 in position 31: ordinal not in range(128)
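In Python 2.7, calling `decode()` on a bytestring with no argument falls back to the default ASCII codec, which is why the traceback names 'ascii'. The sketch below passes 'ascii' explicitly so that it behaves the same on Python 3, where the default is UTF-8. The row contents are made up for illustration.

```python
# Sample CP-1252 row; the field values are invented for this sketch.
row_text = u'R. Ebert,2014-04-11,\u201cGreat car\u201d'.encode('cp1252')

try:
    # Python 2.7's row_text.decode() is equivalent to this call
    row_text.decode('ascii')
    error = None
except UnicodeDecodeError as exc:
    error = exc

# The exception records which codec failed, where, and on which bytes.
bad_byte = bytearray(error.object)[error.start]
```

Byte 0x93 is the CP-1252 left curly quote, the first non-ASCII byte in the row, so the error position points straight at the character that gave the encoding away.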

85.
# -*- coding: utf-8 -*-

reviews_file = open("reviews_file.csv")

for row_text in reviews_file:
    unicode_row = row_text.decode("cp1252")
    author, date, review_text = unicode_row.split(u",")
    converted_review = review_text.replace(u"Montreal", u"Montréal")
    DB.insert(ReviewTable, author, date, converted_review)
[Python 2.7]

87.
# -*- coding: utf-8 -*-

reviews_file = open("reviews_file.csv")

for row_text in reviews_file:
    unicode_row = row_text.decode("cp1252")
    author, date, review_text = unicode_row.split(u",")
    converted_review = review_text.replace(u"Montreal", u"Montréal")
    DB.insert(ReviewTable,
              author.encode("utf8"),
              date.encode("utf8"),
              converted_review.encode("utf8"))
[Python 2.7]

89.
My friend said: “I cannot believe this is a Volvo! I had a car just like this when I lived in Montréal.” He told me he had paid 9400€ for his.

90.
Use the Unicode Sandwich
Best Practice #2

91.
Decode as early as possible.
Unicode everywhere in the middle.
Encode as late as possible.
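The three layers of the sandwich map onto three lines of code. A minimal sketch, assuming CP-1252 input and UTF-8 output as in the example application; `convert_reviews` is a hypothetical function name and no database is involved.

```python
# -*- coding: utf-8 -*-
def convert_reviews(raw_bytes, source_encoding='cp1252', target_encoding='utf-8'):
    """Decode early, work on Unicode, encode late."""
    text = raw_bytes.decode(source_encoding)             # top slice: bytes -> Unicode
    text = text.replace(u'Montreal', u'Montr\u00e9al')   # filling: Unicode only
    return text.encode(target_encoding)                  # bottom slice: Unicode -> bytes


cp1252_input = u'\u201cI lived in Montreal.\u201d 9400\u20ac'.encode('cp1252')
utf8_output = convert_reviews(cp1252_input)
```

When the bytes come from a file, `io.open(path, encoding="cp1252")` performs the top slice for you on both Python 2.7 and Python 3, handing back Unicode text on every read.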

92.

93.

94.
# -*- coding: utf-8 -*-

reviews_file = open("reviews_file.csv")

for row_text in reviews_file:
    unicode_row = row_text.decode("cp1252")
    author, date, review_text = unicode_row.split(u",")
    converted_review = review_text.replace(u"Montreal", u"Montréal")
    DB.insert(ReviewTable,
              author.encode("utf8"),
              date.encode("utf8"),
              converted_review.encode("utf8"))
[Python 2.7]

95.
Test Your (Text Related) Code
Best Practice #3

96.
Test encoding ranges & boundaries

test_strings = ['Hello Montreal!',
                '¡‫ן‬ɐǝɹʇuoɯ o‫ןן‬ǝɥ',
                'ђєɭɭ๏ ๓๏ภՇгєค !']

func_under_test(test_strings)

97.
Test interfaces against both Python text types

test_bytes = 'I am a bytestring mwahaha'
test_unicode = u'ι αм υηι¢σ∂є!'

i_expect_unicode(test_bytes)
i_expect_bytes(test_unicode)

98.
Test handling of incorrect encoding

def ascii_handling_function(ascii_str):
    ...
    ascii_str.decode('ascii')
    ...

99.
Test handling of incorrect encoding

utf8_str = u'UՇF-8ՇєsՇ'.encode('utf8')

# inside a unittest.TestCase method
with self.assertRaises(UnicodeDecodeError):
    line = ascii_handling_function(utf8_str)
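Put together as a complete test case, the idea might look like the sketch below. The body of `ascii_handling_function` is a stand-in (the slides elide the real one), and the class and method names are hypothetical.

```python
# -*- coding: utf-8 -*-
import unittest


def ascii_handling_function(ascii_str):
    """Stand-in for the function under test: accepts ASCII bytes only."""
    return ascii_str.decode('ascii')


class AsciiHandlingTest(unittest.TestCase):
    def test_rejects_utf8_bytes(self):
        # Non-ASCII text encoded as UTF-8 must not slip through silently.
        utf8_str = u'U\u0547F-8 t\u0454st'.encode('utf8')
        with self.assertRaises(UnicodeDecodeError):
            ascii_handling_function(utf8_str)

    def test_accepts_plain_ascii(self):
        self.assertEqual(ascii_handling_function(b'Hello Montreal!'),
                         u'Hello Montreal!')
```

Saved as a module, the tests run with `python -m unittest <module name>`.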

100.
Best Practices
1. Know your encodings
2. Use the Unicode sandwich
3. Test your (text related) code

101.
Issues We Can’t Control

102.
Incorrect encoding

103.

104.
Declared as “CP-1252”
Is actually “UTF-8”

105.
# -*- coding: utf-8 -*-

reviews_file = open("reviews_file.csv")

for row_text in reviews_file:
    unicode_row = row_text.decode("cp1252")
    author, date, review_text = unicode_row.split(u",")
    converted_review = review_text.replace(u"Montreal", u"Montréal")
    DB.insert(ReviewTable,
              author.encode("utf8"),
              date.encode("utf8"),
              converted_review.encode("utf8"))

106.
UnicodeDecodeError

107.
How to Deal
• Ask

108.
How to Deal
• Ask
• Guess (with chardet library)

109.
How to Deal
• Ask
• Guess (with chardet library)
• You wrote tests, right?
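When asking is not possible, guessing comes down to trying candidates. The sketch below is a crude standard-library stand-in for that idea, not chardet's API or algorithm: `guess_decode` is a hypothetical helper that returns the first candidate encoding that decodes without error.

```python
def guess_decode(raw_bytes, candidates=('ascii', 'utf-8', 'cp1252')):
    """Return (text, encoding) for the first candidate that decodes cleanly.

    A clean decode only proves the bytes are *valid* in that encoding,
    not that the guess is right, so order candidates from strict to lax.
    """
    for encoding in candidates:
        try:
            return raw_bytes.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError('none of the candidate encodings fit')


text, encoding = guess_decode(b'\x93Hi\x94')   # CP-1252 curly quotes
```

A statistical detector such as chardet looks at byte frequencies as well as validity, so it can tell apart encodings that this trial-and-error approach cannot. Either way the result is a guess, which is why the tests in the last bullet matter.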

110.
Mixed encodings or corrupted bytes

111.
John Smithâ€™s Autoplex
Broken text… it’s fantastic!
Hello ^[[30m; World

112.
John Smithâ€™s Autoplex
Broken text… it’s fantastic!
Hello ^[[30m; World
MOJIBAKE

113.
u"John Smithâ€™s Autoplex"

114.
u"John Smithâ€™s Autoplex"
>>> u'John Smithâ€™s Autoplex'.encode('cp1252')

115.
u"John Smithâ€™s Autoplex"
>>> u'John Smithâ€™s Autoplex'.encode('cp1252')
'John Smith\xe2\x80\x99s Autoplex' (bytestring)

118.
'John Smith\xe2\x80\x99s Autoplex' (bytestring)

119.
'John Smith\xe2\x80\x99s Autoplex' (bytestring)
>>> 'John Smith\xe2\x80\x99s Autoplex'.decode('utf8')
u'John Smith’s Autoplex'
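The two steps, encoding back to the wrongly assumed CP-1252 and then decoding as the real UTF-8, fit in one small function. A sketch with a hypothetical helper name; it only handles this specific kind of mojibake.

```python
def undo_cp1252_mojibake(garbled):
    """Reverse 'UTF-8 bytes decoded as CP-1252' for a Unicode string."""
    return garbled.encode('cp1252').decode('utf-8')


# a-circumflex, euro sign, trade mark sign: the garbled right single quote
garbled = u'John Smith\u00e2\u20ac\u2122s Autoplex'
fixed = undo_cp1252_mojibake(garbled)
```

If the input was not produced by that exact mistake, one of the two steps will usually raise a UnicodeEncodeError or UnicodeDecodeError, so the call should be guarded before being applied to arbitrary text.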

120.
UTF8:   U+2019 ’

121.
UTF8:   \xe2\x80\x99 -> U+2019 ’

122.
UTF8:   \xe2\x80\x99 -> U+2019 ’
CP1252: \xe2\x80\x99 -> U+00e2 â, U+20ac €, U+2122 ™

123.
str_dealer = u"John Smithâ€™s Autoplex"

def manually_convert_encoding(str_dealer):
    """
    Manually replace incorrect, UTF8-encoded bytes
    with CP1252 bytes for the same character
    """
    # Work on the raw bytes; replace() returns a new string each time
    raw = str_dealer.encode('cp1252')
    raw = raw.replace('\xe2\x80\x98', '\x91')  # ‘
    raw = raw.replace('\xe2\x80\x99', '\x92')  # ’
    raw = raw.replace('\xe2\x80\x9c', '\x93')  # “
    raw = raw.replace('\xe2\x80\x9d', '\x94')  # ”
    raw = raw.replace('\xe2\x80\x94', '\x97')  # —
    raw = raw.replace('\xe2\x84\xa2', '\x99')  # ™
    raw = raw.replace('\xe2\x82\xac', '\x80')  # €
    return raw

124.
python-ftfy fixes mojibake

>>> dealer_name = u"John Smithâ€™s Autoplex"
>>> from ftfy import fix_text
>>> fix_text(dealer_name)
u"John Smith's Autoplex"

125.
Target encoding can’t handle source data

126.
Source Data (UTF-8) Target Application Data (CP-1252) ?

127.
>>> u'☃ Brrrr!'.encode('cp1252', 'strict')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/esther/ENV/lib/python2.7/encodings/cp1252.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u2603' in position 0: character maps to <undefined>
[Python 2.7]

128.
>>> u'☃ Brrrr!'.encode('cp1252', 'ignore')
' Brrrr!'
[Python 2.7]

129.
>>> u'☃ Brrrr!'.encode('cp1252', 'replace')
'? Brrrr!'
[Python 2.7]
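The error handlers can be compared side by side. The sketch below covers 'ignore' and 'replace' from the slides and adds two other standard handlers, 'xmlcharrefreplace' and 'backslashreplace', which keep enough information to recover the original character. Variable names are illustrative.

```python
snowman = u'\u2603 Brrrr!'   # SNOWMAN has no CP-1252 equivalent

ignored = snowman.encode('cp1252', 'ignore')               # drops the character
replaced = snowman.encode('cp1252', 'replace')             # substitutes "?"
as_xml_ref = snowman.encode('cp1252', 'xmlcharrefreplace') # numeric character reference
escaped = snowman.encode('cp1252', 'backslashreplace')     # Python-style escape
```

'ignore' and 'replace' lose data permanently. Which loss is acceptable, if any, depends on what the target application does with the text.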

130.
U+0004 END OF TRANSMISSION

131.
Thank you ツ

Cars.com / NewCars.com Tech Team
SoCal Piggies
Ned Batchelder (for his Pragmatic Unicode talk)

132.
The fundamentals

Pragmatic Unicode
http://nedbatchelder.com/text/unipain.html

The Absolute Minimum You Must Know
http://www.joelonsoftware.com/articles/Unicode.html

Chapter on Strings in “Dive into Python” by Mark Pilgrim
http://getpython3.com/diveintopython3/strings.html

General questions, relating to UTF or Encoding Form
http://www.unicode.org/faq/utf_bom.html

Unicode HOWTO (Python 2.7)
http://docs.python.org/2/howto/unicode.html

133.
Further reading

“Just what the dickens is ‘Unicode’?”
https://pythonhosted.org/kitchen/unicode-frustrations.html

Differences between these commonly confused encodings
http://www.i18nqa.com/debug/table-iso8859-1-vs-windows-1252.html

“Latin-1” in MySQL is more like “CP-1252”
https://dev.mysql.com/doc/refman/5.0/en/charset-we-sets.html

Why it's important to write tests with character boundary values
http://labs.spotify.com/2013/06/18/creative-usernames/

134.
Tools

chardet
https://pypi.python.org/pypi/chardet

python-ftfy
https://github.com/LuminosoInsight/python-ftfy

135.
@estherbester @travisfischer
Slides at http://bit.ly/flip_tables
IRC
