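ODE TO A SHIPPING LABEL
by Carlos Bueno

Once there was a little o,
with an accent on top like só.

It started out as UTF8,
(universal since '98),
but the program only knew latin1,
and changed little ó to "Ã³" for fun.

A second program saw the "Ã³"
and said "I know HTML entity!"
So "Ã³" was smartened to "&Atilde;&sup3;"
and passed on through happily.

Another program saw the tangle
(more precisely, ampersands to mangle)
and thus the humble "&Atilde;&sup3;"
became "&amp;amp;Atilde;&amp;amp;sup3;"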
Character Encoding
& Unicode
How to (╯°□°)╯︵ ┻━┻ with dignity
Esther Nam & Travis Fischer
PyCon US 2014, Montréal
Uni-wat?!
┻━┻ ︵ヽ ノ︵ ┻━┻
How to (╯°□°)╯︵ ┻━┻
with dignity
“You'll be pleased to know that your talk title
crashed our meeting robot, which is a great argument
for the relevance of this talk. :-) ...”
– Luke Sneeringer | Program Committee Chair
Python 3
is out of scope
The Fundamentals
of Unicode
Humans use text.
Computers speak bytes.
a -> 01100001
Character  ASCII     ISO-8859-15 (latin-9)  CP-1252 (Windows 1252)  UTF-8
a          01100001  01100001               01100001                01100001
€          NA        10100100               10000000                11100010 10000010 10101100
¤          NA        NA                     10100100                11000010 10100100
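The bit patterns in the table can be reproduced with Python's built-in codecs. The following is a minimal sketch, not code from the talk: the helper names `bits` and `encoded_bits` are made up for illustration, and the snippet is written so that it should run on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals

ENCODINGS = ["ascii", "iso8859-15", "cp1252", "utf-8"]


def bits(data):
    """Render a byte string as space-separated 8-bit groups."""
    return " ".join(format(byte, "08b") for byte in bytearray(data))


def encoded_bits(char, encoding):
    """Return the bit pattern of `char` in `encoding`, or 'NA' if it cannot be encoded."""
    try:
        return bits(char.encode(encoding))
    except UnicodeEncodeError:
        return "NA"


for char in ["a", "\u20ac", "\u00a4"]:  # a, EURO SIGN, CURRENCY SIGN
    print(repr(char), [encoded_bits(char, enc) for enc in ENCODINGS])
```

The same byte value (10100100) shows up as € in ISO-8859-15 and as ¤ in CP-1252, which is why the byte alone does not tell you what character was meant.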
π — ‽ ☠ ☁ ☂ ☃ ☄ ★ ☆ ☇ ☈ ☉ ☊
☋ ☌ ☍ ☎ ☏ ☐ ☑ ☒ ☓ ☘ ☙
☚ ☛ ☜ ☝ ☞ ☟ ☠ ☡ ☢ ☣ ☤ ☥ ☦ ☧
☨ ☩ ☪ ☫ ☬ ☭ ☮ ☯ ☸ ☹ ☺ ☻ ☼ ☽ ☾
☿ ♀ ♁ ♂ ♃ ♄ ♅ ...
a -> U+0061
Character -> Unicode Code Point
Unicode
a -> U+0061
Character: LATIN SMALL LETTER A
Computers speak bytes.
Unicode
a -> U+0061 -> 01100001
Character -> Unicode Code Point -> Binary Encoding
UTF-8
Unicode Transformation Format
Unicode != UTF-8
Code Points: U+0061
Binary Encoding: 01100001
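To make the distinction between a code point and its binary encoding concrete, here is a small sketch using the standard `unicodedata` module. The `describe` helper is a hypothetical name introduced for this example; it is not part of the talk.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals
import unicodedata


def hex_bytes(data):
    """Format a byte string as space-separated hex pairs."""
    return " ".join("{:02x}".format(byte) for byte in bytearray(data))


def describe(char):
    """Return (code point, Unicode name, UTF-8 bytes, UTF-16-BE bytes) for one character."""
    return (
        "U+{:04X}".format(ord(char)),
        unicodedata.name(char),
        hex_bytes(char.encode("utf-8")),
        hex_bytes(char.encode("utf-16-be")),
    )


print(describe("a"))
print(describe("\u20ac"))
```

One code point, two different byte sequences: the code point is the Unicode layer, and UTF-8 or UTF-16 are only ways of storing it.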
Layers of Abstraction
• Display (Glyphs | Fonts)
Let them eat cake!
• Text (Unicode | Code Points)
U+0061
• Storage (Binary | UTF-8)
01100001
Unicode & Python
[Python 2.7]
str type
>>> euro_bytestring = '€'
>>> type(euro_bytestring)
<type 'str'>
[Python 2.7]
unicode type
# € code point
>>> euro_unicode = u'\u20ac'
>>> type(euro_unicode)
<type 'unicode'>
[Python 2.7]
Unicode
Code points
u'\u20ac'
'\xe2\x82\xac'.decode('utf8')
u'\u20ac'.encode('utf8')
Bytes
UTF-8
'\xe2\x82\xac'
'\xe2\x82\xac'.becode('utf8')
u'\u20ac'.uncode('utf8')
[Python 2.7]
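The round trip between the two types can be checked directly. This is a minimal sketch, assuming nothing beyond the built-in string types; the `b"..."` and `u"..."` prefixes are used so that the same lines should behave identically on Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

euro_bytes = b"\xe2\x82\xac"   # UTF-8 bytes (a `str` in Python 2.7)
euro_text = u"\u20ac"          # one code point (a `unicode` in Python 2.7)

# Bytes -> text: decode, naming the encoding explicitly.
assert euro_bytes.decode("utf8") == euro_text
# Text -> bytes: encode, again naming the encoding.
assert euro_text.encode("utf8") == euro_bytes

# Three bytes on disk, one character in memory.
print(len(euro_bytes), len(euro_text))
```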
You CANNOT infer an
encoding from a bytestring
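A short sketch of why the encoding cannot be inferred: the same unlabeled byte is valid in more than one encoding and means something different in each. The variable names are illustrative only.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

mystery = b"\xa4"  # a single byte with no label attached

interpretations = {
    "iso8859-15": mystery.decode("iso8859-15"),  # EURO SIGN
    "cp1252": mystery.decode("cp1252"),          # CURRENCY SIGN
}
for encoding, text in sorted(interpretations.items()):
    print(encoding, repr(text))

# Some encodings reject the byte outright, but a successful decode
# never proves that the guess was the right one.
try:
    mystery.decode("utf-8")
    utf8_accepts = True
except UnicodeDecodeError:
    utf8_accepts = False
print("utf-8 accepts it:", utf8_accepts)
```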
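#! /usr/bin/python
# -*- coding: utf8 -*-
# Opened file should be latin-1 encoded
# If it’s not, call tech support ASAP
with open("input_file.csv") as input_file:
Date: Wed, 11 Apr 2014 11:15:55 -0600
To: foo@bar.com
From: bar@foo.com
Subject: Character encoding
MIME-Version: 1.0
Content-Type: text/html; charset="UTF-8"
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE html PUBLIC “-//W3C//DTD …>
<html xmlns="http://www.w3.org/1999/xhtml" …>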
Best Practices
Example Application
Author         Review
G. van Rossum  If you decide to design your own car there are thousands sort of car…
R. Ebert       Every great car should feel new every time you drive it.
L. Torvalds    Volvo isn’t evil, they just make really crappy cars.
Application Processes Text
PSQL
Encoding: Windows 1252 (CP-1252)
Montreal -> Montréal
psql=# set server_encoding
to "utf-8";
My friend said: “I cannot
believe this is a Volvo! I
had a car just like this
when I lived in Montreal.”
He told me he had paid 9400€ for his.
Sample Review Text
# -*- coding: utf-8 -*-

reviews_file = open("reviews_file.csv")

for row_text in reviews_file:
    author, date, review_text = row_text.split(",")
    converted_review = review_text.replace("Montreal", "Montréal")
    DB.insert(ReviewTable, author, date, converted_review)
[Python 2.7]
My friend said: �I cannot
believe this is a Volvo! I
had a car just like this
when I lived in Montréal.�
He told me he had paid 9400� for his.
Output from UTF-8 encoded PSQL database
My friend said: “I cannot
believe this is a Volvo! I
had a car just like this
when I lived in Montreal.”
He told me he had paid 9400€ for his.
Original CP-1252 Data
My friend said: “I cannot
believe this is a Volvo! I
had a car just like this
when I lived in Montréal.”
He told me he had paid 9400€ for his.
Mixed CP-1252 & UTF-8
My friend said: �I cannot
believe this is a Volvo! I
had a car just like this
when I lived in Montréal.�
He told me he had paid 9400� for his.
Interpreted as UTF-8 by database
Know your encodings
Best Practice #1
# -*- coding: utf-8 -*-

reviews_file = open("reviews_file.csv")

for row_text in reviews_file:
    unicode_row = row_text.decode()
    author, date, review_text = unicode_row.split(u",")
    converted_review = review_text.replace(u"Montreal", u"Montréal")
    DB.insert(ReviewTable, author, date, converted_review)
[Python 2.7]
Traceback (most recent call last):
  File "...", line ..., in <module>
    unicode_row = row_text.decode()
UnicodeDecodeError: 'ascii' codec can't decode byte 0x93 in position 31: ordinal not in range(128)
# -*- coding: utf-8 -*-

reviews_file = open("reviews_file.csv")

for row_text in reviews_file:
    unicode_row = row_text.decode("cp1252")
    author, date, review_text = unicode_row.split(u",")
    converted_review = review_text.replace(u"Montreal", u"Montréal")
    DB.insert(ReviewTable,
              author.encode("utf8"),
              date.encode("utf8"),
              converted_review.encode("utf8"))
[Python 2.7]
My friend said: “I cannot
believe this is a Volvo! I
had a car just like this
when I lived in Montréal.”
He told me he had paid 9400€ for his.
Use the Unicode
Sandwich
Best Practice #2
Decode as early as possible.
Unicode everywhere in the middle.
Encode as late as possible.
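One way to build the sandwich without calling `.decode()` and `.encode()` by hand is to let `io.open` do the conversion at the file boundary. This is a sketch under stated assumptions: the function name `convert_reviews`, the file names, and the sample row (including its date) are invented for the example, and the database insert from the slides is replaced by writing a UTF-8 file.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals
import io
import os
import tempfile


def convert_reviews(src_path, dst_path, src_encoding="cp1252", dst_encoding="utf-8"):
    """Decode on read, work on text in the middle, encode on write."""
    with io.open(src_path, encoding=src_encoding) as src, \
            io.open(dst_path, "w", encoding=dst_encoding) as dst:
        for row in src:  # `row` is already decoded text here
            dst.write(row.replace("Montreal", "Montr\u00e9al"))  # encoded on write


# Build a small CP-1252 input file (made-up sample data).
workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "reviews_cp1252.csv")
dst_path = os.path.join(workdir, "reviews_utf8.csv")
with io.open(src_path, "wb") as handle:
    handle.write("R. Ebert,2014-04-11,\u201cMontreal\u201d 9400\u20ac\n".encode("cp1252"))

convert_reviews(src_path, dst_path)

with io.open(dst_path, "rb") as handle:
    result = handle.read().decode("utf-8")
print(repr(result))
```

The encodings appear only at the two edges of the function; everything between them operates on text.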
Test Your
(Text Related) Code
Best Practice #3
Test encoding ranges
& boundaries
test_strings = ['Hello Montreal!',
'¡‫ן‬ɐǝɹʇuoɯ o‫ןן‬ǝɥ',
'ђєɭɭ๏ ๓๏ภՇгєค !']
func_under_test(test_strings)
Test interfaces against both Python text types
test_bytes = 'I am a bytestring mwahaha'
test_unicode = u'ι αм υηι¢σ∂є!'
i_expect_unicode(test_bytes)
i_expect_bytes(test_unicode)
Test handling of
incorrect encoding
def ascii_handling_function(ascii_str):
    ...
    ascii_str.decode('ascii')
    ...
utf8_str = u'UՇF-8 ՇєsՇ'.encode('utf8')
with assertRaises(UnicodeDecodeError):
    line = ascii_handling_function(utf8_str)
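The slide fragments above can be turned into a runnable `unittest` case along these lines. The body of `ascii_handling_function` is a stand-in (the slides elide it), and the test class name is an assumption made for this sketch.

```python
# -*- coding: utf-8 -*-
import unittest


def ascii_handling_function(ascii_bytes):
    """Stand-in for code that assumes ASCII input (hypothetical body)."""
    return ascii_bytes.decode("ascii")


class EncodingTests(unittest.TestCase):
    def test_plain_ascii_is_accepted(self):
        self.assertEqual(ascii_handling_function(b"Hello Montreal!"),
                         u"Hello Montreal!")

    def test_utf8_input_is_rejected(self):
        # Non-ASCII bytes must fail loudly instead of being silently mangled.
        utf8_bytes = u"Montr\u00e9al".encode("utf8")
        with self.assertRaises(UnicodeDecodeError):
            ascii_handling_function(utf8_bytes)


suite = unittest.TestLoader().loadTestsFromTestCase(EncodingTests)
outcome = unittest.TextTestRunner(verbosity=0).run(suite)
```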
Best Practices
1. Know your encodings
2. Use the Unicode sandwich
3. Test your (text related) code
Issues We Can’t
Control
Incorrect encoding
Declared as “CP-1252”
Is actually “UTF-8”
unicode_row = row_text.decode("cp1252")
UnicodeDecodeError
How to Deal
• Ask
• Guess (with chardet library)
• You wrote tests, right?
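The slides recommend chardet for guessing. As a standard-library-only illustration of what "guessing" means, the sketch below uses a much cruder technique, trial decoding against a fixed candidate list. It is not how chardet works, and `guess_encoding` is a hypothetical helper written for this example.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function


def guess_encoding(data, candidates=("ascii", "utf-8", "cp1252")):
    """Return the first candidate codec that decodes `data` without error.

    A successful decode only shows the bytes are *valid* in that codec,
    not that it is the codec the author actually used.
    """
    for encoding in candidates:
        try:
            data.decode(encoding)
        except UnicodeDecodeError:
            continue
        return encoding
    return None


print(guess_encoding(b"Hello Montreal!"))
print(guess_encoding(u"Montr\u00e9al".encode("utf-8")))
print(guess_encoding(u"Montr\u00e9al".encode("cp1252")))
```

Candidate order matters: permissive single-byte codecs accept almost anything, so they belong at the end of the list, and the result should still be confirmed by asking the data's owner where possible.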
Mixed encodings or
corrupted bytes
John Smithâ€™s Autoplex
Broken text&hellip; itâ€™s fantastic!
Hello ^[[30m; World
MOJIBAKE
u"John Smithâ€™s Autoplex"
>>> u'John Smithâ€™s Autoplex'.encode('cp1252')
'John Smith\xe2\x80\x99s Autoplex'
(bytestring)
>>> 'John Smith\xe2\x80\x99s Autoplex'.decode('utf8')
u'John Smith’s Autoplex'
Bytes:  \xe2\x80\x99
UTF8:   U+2019 ’
CP1252: U+00e2 â, U+20ac €, U+2122 ™
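The mapping above can be reproduced, and undone, in a few lines. This is a minimal sketch of the single case shown on the slides (UTF-8 bytes read as CP-1252); real mojibake can involve more layers, which is the problem tools such as ftfy address.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

apostrophe = u"\u2019"                   # RIGHT SINGLE QUOTATION MARK
utf8_bytes = apostrophe.encode("utf-8")  # the three bytes e2 80 99

# Wrong decoder: each of the three bytes becomes its own CP-1252 character.
mojibake = utf8_bytes.decode("cp1252")

# Undo it by reversing the two steps in the opposite order.
repaired = mojibake.encode("cp1252").decode("utf-8")

print(repr(mojibake), repr(repaired))
```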
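# Bytestring holding UTF-8 bytes for ’ in data that should be CP-1252
str_dealer = 'John Smith\xe2\x80\x99s Autoplex'

def manually_convert_encoding(str_dealer):
    """
    Manually replace incorrect, UTF8-encoded bytes
    with CP1252 bytes for the same character
    """
    # str.replace returns a new string, so keep each result
    str_dealer = str_dealer.replace('\xe2\x80\x98', '\x91')  # ‘
    str_dealer = str_dealer.replace('\xe2\x80\x99', '\x92')  # ’
    str_dealer = str_dealer.replace('\xe2\x80\x9c', '\x93')  # “
    str_dealer = str_dealer.replace('\xe2\x80\x9d', '\x94')  # ”
    str_dealer = str_dealer.replace('\xe2\x80\x94', '\x97')  # —
    str_dealer = str_dealer.replace('\xe2\x84\xa2', '\x99')  # ™
    str_dealer = str_dealer.replace('\xe2\x82\xac', '\x80')  # €
    return str_dealer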
dealer_name = u"John Smithâ€™s Autoplex"
>>> from ftfy import fix_text
>>> fix_text(dealer_name)
u"John Smith's Autoplex"
python-ftfy fixes mojibake
Target encoding
can’t handle
source data
Source
Data
(UTF-8)
Target
Application
Data
(CP-1252)
?
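>>> u'☃ Brrrr!'.encode('cp1252', 'strict')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/esther/ENV/lib/python2.7/encodings/cp1252.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u2603' in position 0: character maps to <undefined>
[Python 2.7]
>>> u'☃ Brrrr!'.encode('cp1252', 'ignore')
' Brrrr!'
[Python 2.7]
>>> u'☃ Brrrr!'.encode('cp1252', 'replace')
'? Brrrr!'
[Python 2.7]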
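The error handlers can be compared side by side. Besides `ignore` and `replace` from the slides, this sketch also tries `xmlcharrefreplace` and `backslashreplace`, two further standard handlers that the talk does not show; they keep the lost character recoverable instead of discarding it.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

snowman = u"\u2603 Brrrr!"

results = {}
for handler in ("ignore", "replace", "xmlcharrefreplace", "backslashreplace"):
    # Each handler decides what to emit for the unencodable snowman.
    results[handler] = snowman.encode("cp1252", handler)
    print(handler, repr(results[handler]))
```

Which handler is appropriate depends on the target: silently dropping characters is rarely acceptable for user data, while a numeric character reference may be fine for HTML output.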
U+0004
END OF TRANSMISSION
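Cars.com / NewCars.com Tech Team
SoCal Piggies
Ned Batchelder
(for his Pragmatic Unicode talk)
Thank you ツ
The fundamentals
Pragmatic Unicode
http://nedbatchelder.com/text/unipain.html
The Absolute Minimum You Must Know
http://www.joelonsoftware.com/articles/Unicode.html
Chapter on Strings in “Dive into Python” by Mark Pilgrim
http://getpython3.com/diveintopython3/strings.html
General questions, relating to UTF or Encoding Form
http://www.unicode.org/faq/utf_bom.html
Unicode HOWTO (Python 2.7)
http://docs.python.org/2/howto/unicode.html
Further reading
“Just what the dickens is ‘Unicode’?”
https://pythonhosted.org/kitchen/unicode-frustrations.html
Differences between these commonly confused encodings
http://www.i18nqa.com/debug/table-iso8859-1-vs-windows-1252.html
“Latin-1” in MySQL is more like “CP-1252”
https://dev.mysql.com/doc/refman/5.0/en/charset-we-sets.html
Why it's important to write tests with character boundary values
http://labs.spotify.com/2013/06/18/creative-usernames/
Tools
chardet
https://pypi.python.org/pypi/chardet
python-ftfy
https://github.com/LuminosoInsight/python-ftfy
@estherbester @travisfischer
Slides at http://bit.ly/flip_tables
IRC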
Character Encoding & Unicode - How to (╯°□°)╯︵ ┻━┻ with dignity


Every developer will inevitably feel the pain of character encoding issues. We will cover the fundamentals every Python developer should know on character encoding and Unicode. We will teach you how to identify the types of problems that occur when dealing with character encoding and outline a set of best practices and useful libraries which can be used to avoid and fix character encoding issues.

Character Encoding & Unicode - How to (╯°□°)╯︵ ┻━┻ with dignity

  1. 1. ODE TO A SHIPPING LABEL by Carlos Bueno / Once there was a little o, / with an accent on top like só. / It started out as UTF8, / (universal since '98), / but the program only knew latin1, / and changed little ó to "Ã³" for fun. / A second program saw the "Ã³" / and said "I know HTML entity!" / So "Ã³" was smartened to "&Atilde;&sup3;" / and passed on through happily. / Another program saw the tangle / (more precisely, ampersands to mangle) / and thus the humble "&Atilde;&sup3;" / became "&amp;amp;Atilde;&amp;amp;sup3;"
  2. 2. Character Encoding & Unicode How to (╯°□°)╯︵ ┻━┻ with dignity Esther Nam & Travis Fischer! PyCon US 2014, Montréal
  3. 3. Uni-wat?!
  4. 4. ┻━┻ ︵ヽ ノ︵ ┻━┻
  5. 5. How to (╯°□°)╯︵ ┻━┻ with dignity
  6. 6. – Luke Sneeringer | Program Committee Chair “You'll be pleased to know that your talk title crashed our meeting robot, which is a great argument for the relevance of this talk. :-) ...”
  7. 7. Python 3 is out of scope
  8. 8. The Fundamentals of Unicode
  9. 9. Humans use text. Computers speak bytes.
  10. 10. a -> 01100001
  11. 11. ASCII ISO-8859-15! (latin-9) CP-1252! (Windows 1252) UTF-8 a 01100001 01100001 01100001 01100001 € NA 10100100 10000000 11100010 10000010 10101100 ¤ NA NA 10100100 11000010 10100100
  15. 15. π — ‽ ☠ ☁ ☂ ☃ ☄ ★ ☆ ☇ ☈ ☉ ☊ ☋ ☌ ☍ ☎ ☏ ☐ ☑ ☒ ☓ ☘ ☙ ☚ ☛ ☜ ☝ ☞ ☟ ☠ ☡ ☢ ☣ ☤ ☥ ☦ ☧ ☨ ☩ ☪ ☫ ☬ ☭ ☮ ☯ ☸ ☹ ☺ ☻ ☼ ☽ ☾ ☿ ♀ ♁ ♂ ♃ ♄ ♅ ♆ ♇ ♔ ♕ ♖ ♗ ♘ ♙ ♚ ♛ ♜ ♝ ♞ ♟ ♠ ♡ ♢ ♣ ♤ ♥ ♦ ♧ ♨ ♩ ♪ ♫ ♬ ♭ ♯ ♰ ♾ ⚀ ⚁ ⚂ ⚃ ⚄ ⚅ ⚆ ⚇ ⚈
  16. 16. a -> U+0061 Character Unicode Code Point
  17. 17. ! Unicode a -> U+0061 Character Unicode Code Point
  18. 18. ! Unicode a -> U+0061 Character LATIN SMALL LETTER A
  19. 19. Computers speak bytes.
  20. 20. ! Unicode a ! U+0061 -> 01100001 Unicode Code Point Binary Encoding
  21. 21. ! Unicode U+0061 -> 01100001 Unicode Code Point Binary Encodinga
  22. 22. UTF-8 Unicode Transformation Format
  23. 23. Unicode != UTF-8 Code Points Binary Encoding U+0061 01100001
  24. 24. Layers of Abstraction
  25. 25. • Display (Glyphs | Fonts) Let them eat cake!
  26. 26. • Display (Glyphs | Fonts) Let them eat cake! ! • Text (Unicode | Code Points) U+0061
  27. 27. • Display (Glyphs | Fonts) Let them eat cake! ! • Text (Unicode | Code Points) U+0061 ! • Storage (Binary | UTF-8) 01100001
  28. 28. Unicode & Python [Python 2.7]
  29. 29. str type >>>euro_bytestring = '€' ! >>>type(euro_bytestring) <type 'str'> [Python 2.7]
  30. 30. unicode type # € code point >>> euro_unicode = u'\u20ac' >>> type(euro_unicode) <type 'unicode'> [Python 2.7]
  31. 31. Unicode Code points u'\u20ac' Bytes UTF-8 '\xe2\x82\xac' [Python 2.7]
  32. 32. Unicode Code points u'\u20ac' '\xe2\x82\xac'.decode('utf8') Bytes UTF-8 '\xe2\x82\xac' [Python 2.7]
  34. 34. Unicode Code points u'\u20ac' '\xe2\x82\xac'.decode('utf8') u'\u20ac'.encode('utf8') Bytes UTF-8 '\xe2\x82\xac' [Python 2.7]
  35. 35. Unicode Code points u'\u20ac' '\xe2\x82\xac'.becode('utf8') u'\u20ac'.uncode('utf8') Bytes UTF-8 '\xe2\x82\xac' [Python 2.7]
  36. 36. You CANNOT infer an encoding from a bytestring
  37. 37. #! /usr/bin/python # -*- coding: utf8 -*- # Opened file should be latin-1 encoded # If it’s not, call tech support ASAP with open("input_file.csv") as input_file: Date: Wed, 11 Apr 2014 11:15:55 -0600 To: foo@bar.com From: bar@foo.com Subject: Character encoding MIME-Version: 1.0 Content-Type: text/html; charset="UTF-8" <?xml version="1.0" encoding="UTF-8" ?> <!DOCTYPE html PUBLIC “-//W3C//DTD …> <html xmlns="http://www.w3.org/1999/xhtml" …>
  38. 38. Best Practices
  39. 39. Example Application
  43. 43. Author Review G. van Rossum If you decide to design your own car there are thousands sort of car… R. Ebert Every great car should feel new every time you drive it. L. Torvalds Volvo isn’t evil, they just make really crappy cars. Application Processes Text PSQL
  44. 44. Encoding: Windows 1252 (CP-1252)
  45. 45. Montreal -> Montréal
  46. 46. psql=# set server_encoding to "utf-8";
  47. 47. My friend said: “I cannot believe this is a Volvo! I had a car just like this when I lived in Montreal.” He told me he had paid 9400€ for his. Sample Review Text
  50. 50. # -*- coding: utf-8 -*- ! reviews_file = open("reviews_file.csv") ! for row_text in reviews_file: ! author, date, review_text = row_text.split(",") converted_review = review_text.replace("Montreal", "Montréal") DB.insert(ReviewTable, author, date, converted_review) [Python 2.7]
  58. 58. My friend said: �I cannot believe this is a Volvo! I had a car just like this when I lived in Montréal.� He told me he had paid 9400� for his. Output from UTF-8 encoded PSQL database
  67. 67. My friend said: “I cannot believe this is a Volvo! I had a car just like this when I lived in Montreal.” He told me he had paid 9400€ for his. Original CP-1252 Data
  68. 68. My friend said: “I cannot believe this is a Volvo! I had a car just like this when I lived in Montréal.” He told me he had paid 9400€ for his. Mixed CP-1252 & UTF-8
  69. 69. My friend said: �I cannot believe this is a Volvo! I had a car just like this when I lived in Montréal.� He told me he had paid 9400� for his. Interpreted as UTF-8 by database
  70. 70. Know your encodings Best Practice #1
  73. 73. # -*- coding: utf-8 -*- ! reviews_file = open("reviews_file.csv") ! for row_text in reviews_file: unicode_row = row_text.decode() author, date, review_text = unicode_row.split(",") converted_review = review_text.replace("Montreal", "Montréal") DB.insert(ReviewTable, author, date, converted_review) [Python 2.7]
  74. 74. # -*- coding: utf-8 -*- ! reviews_file = open("reviews_file.csv") ! for row_text in reviews_file: unicode_row = row_text.decode() author, date, review_text = unicode_row.split(u",") converted_review = review_text.replace(u"Montreal", u"Montréal") DB.insert(ReviewTable, author, date, converted_review) [Python 2.7]
  75. 75. # -*- coding: utf-8 -*- ! reviews_file = open("reviews_file.csv") ! for row_text in reviews_file: unicode_row = row_text.decode() author, date, review_text = unicode_row.split(u",") converted_review = review_text.replace(u"Montreal", u"Montréal") DB.insert(ReviewTable, author, date, converted_review)
  76. 76. Traceback (most recent call last): File "...", line ..., in <module> unicode_row = row_text.decode() UnicodeDecodeError: 'ascii' codec can't decode byte 0x93 in position 31: ordinal not in range(128)
  78. 78. # -*- coding: utf-8 -*- ! reviews_file = open("reviews_file.csv") ! for row_text in reviews_file: unicode_row = row_text.decode() author, date, review_text = unicode_row.split(u",") converted_review = review_text.replace(u"Montreal", u"Montréal") DB.insert(ReviewTable, author, date, converted_review) [Python 2.7]
  79. 79. # -*- coding: utf-8 -*- ! reviews_file = open("reviews_file.csv") ! for row_text in reviews_file: unicode_row = row_text.decode() author, date, review_text = unicode_row.split(u",") converted_review = review_text.replace(u"Montreal", u"Montréal") DB.insert(ReviewTable, author, date, converted_review)
  80. 80. # -*- coding: utf-8 -*- ! reviews_file = open("reviews_file.csv") ! for row_text in reviews_file: unicode_row = row_text.decode("cp1252") author, date, review_text = unicode_row.split(u",") converted_review = review_text.replace(u"Montreal", u"Montréal") DB.insert(ReviewTable, author, date, converted_review) [Python 2.7]
  81. 81. # -*- coding: utf-8 -*- ! reviews_file = open("reviews_file.csv") ! for row_text in reviews_file: unicode_row = row_text.decode("cp1252") author, date, review_text = unicode_row.split(u",") converted_review = review_text.replace(u"Montreal", u"Montréal") DB.insert(ReviewTable, author, date, converted_review) [Python 2.7]
  82. 82. # -*- coding: utf-8 -*- ! reviews_file = open("reviews_file.csv") ! for row_text in reviews_file: unicode_row = row_text.decode("cp1252") author, date, review_text = unicode_row.split(u",") converted_review = review_text.replace(u"Montreal", u"Montréal") DB.insert(ReviewTable, author.encode("utf8"), date.encode("utf8"), converted_review.encode("utf8")) [Python 2.7]
  84. 84. My friend said: “I cannot believe this is a Volvo! I had a car just like this when I lived in Montréal.” He told me he had paid 9400€ for his.
  85. 85. Use the Unicode Sandwich Best Practice #2
  86. 86. Decode as early as possible.! Unicode everywhere in the middle.! Encode as late as possible.
  90. 90. Test Your (Text Related) Code Best Practice #3
  91. 91. Test encoding ranges & boundaries test_strings = ['Hello Montreal!', '¡‫ן‬ɐǝɹʇuoɯ o‫ןן‬ǝɥ', 'ђєɭɭ๏ ๓๏ภՇгєค !'] ! func_under_test(test_strings)
  92. 92. test_bytes = 'I am a bytestring mwahaha' ! test_unicode = u'ι αм υηι¢σ∂є!' ! ! i_expect_unicode(test_bytes) ! i_expect_bytes(test_unicode) Test interfaces against both Python text types
  93. 93. def ascii_handling_function(ascii_str): ... ascii_str.decode('ascii') ... Test handling of incorrect encoding
  94. 94. utf8_str = u'UՇF-8 ՇєsՇ'.encode('utf8') ! with assertRaises(UnicodeDecodeError): line = ascii_handling_function(utf8_str) Test handling of incorrect encoding
  95. 95. Best Practices 1. Know your encodings 2. Use the Unicode sandwich 3. Test your (text related) code
  96. 96. Issues We Can’t Control
  97. 97. Incorrect encoding
  98. 98. Author Review G. van Rossum If you decide to design your own car there are thousands sort of car… R. Ebert Every great car should feel new every time you drive it. L. Torvalds Volvo isn’t evil, they just make really crappy cars. Application Processes Text PSQL
  99. 99. Declared as “CP-1252”! ! ! ! ! Is actually “UTF-8”
  100. 100. # -*- coding: utf-8 -*- reviews_file = open("reviews_file.csv") for row_text in reviews_file: unicode_row = row_text.decode("cp1252") author, date, review_text = unicode_row.split(u",") converted_review = review_text.replace(u"Montreal", u"Montréal") DB.insert(ReviewTable, author.encode("utf8"), date.encode("utf8"), converted_review.encode("utf8"))
  101. 101. UnicodeDecodeError
  104. 104. How to Deal • Ask • Guess (with chardet library) • You wrote tests, right?
  105. 105. Mixed encodings or corrupted bytes
  107. 107. John Smithâ€™s Autoplex Broken text&hellip; itâ€™s fantastic! Hello ^[[30m; World MOJIBAKE
  110. 110. u"John Smithâ€™s Autoplex" >>> u'John Smithâ€™s Autoplex'.encode('cp1252') 'John Smith\xe2\x80\x99s Autoplex' (bytestring)
  112. 112. 'John Smith\xe2\x80\x99s Autoplex' (bytestring) >>> 'John Smith\xe2\x80\x99s Autoplex'.decode('utf8') u'John Smith’s Autoplex'
  115. 115. Bytes \xe2\x80\x99 UTF8: U+2019 ’ CP1252: U+00e2 â, U+20ac €, U+2122 ™
  116. 116. str_dealer = u"John Smithâ€™s Autoplex" def manually_convert_encoding(str_dealer): """ Manually replace incorrect, UTF8-encoded bytes with CP1252 bytes for the same character """ str_dealer.replace('\xe2\x80\x98', '\x91') # ‘ str_dealer.replace('\xe2\x80\x99', '\x92') # ’ str_dealer.replace('\xe2\x80\x9c', '\x93') # “ str_dealer.replace('\xe2\x80\x9d', '\x94') # ” str_dealer.replace('\xe2\x80\x94', '\x97') # — str_dealer.replace('\xe2\x84\xa2', '\x99') # ™ str_dealer.replace('\xe2\x82\xac', '\x80') # €
  117. 117. dealer_name = u"John Smithâ€™s Autoplex" >>> from ftfy import fix_text >>> fix_text(dealer_name) u"John Smith's Autoplex" python-ftfy fixes mojibake
  118. 118. Target encoding can’t handle source data
  119. 119. Source Data (UTF-8) Target Application Data (CP-1252) ?
  120. 120. >>> u'☃ Brrrr!'.encode('cp1252', 'strict') Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/Users/esther/ENV/lib/python2.7/encodings/cp1252.py", line 12, in encode return codecs.charmap_encode(input,errors,encoding_table) UnicodeEncodeError: 'charmap' codec can't encode character u'\u2603' in position 0: character maps to <undefined> [Python 2.7]
  121. 121. >>>u'☃ Brrrr!'.encode('cp1252', 'ignore') ! ' Brrrr!' [Python 2.7]
  122. 122. >>>u'☃ Brrrr!'.encode('cp1252', 'replace') ! '? Brrrr!' [Python 2.7]
  123. 123. ! ! U+0004 END OF TRANSMISSION
  124. 124. Cars.com / NewCars.com Tech Team ! SoCal Piggies ! Ned Batchelder (for his Pragmatic Unicode talk) Thank you ツ
  125. 125. Pragmatic Unicode http://nedbatchelder.com/text/unipain.html ! The Absolute Minimum You Must Know http://www.joelonsoftware.com/articles/Unicode.html ! Chapter on Strings in “Dive into Python” by Mark Pilgrim http://getpython3.com/diveintopython3/strings.html ! General questions, relating to UTF or Encoding Form http://www.unicode.org/faq/utf_bom.html ! Unicode HOWTO (Python 2.7) http://docs.python.org/2/howto/unicode.html The fundamentals
  126. 126. “Just what the dickens is ‘Unicode’?” https://pythonhosted.org/kitchen/unicode-frustrations.html
 Differences between these commonly confused encodings http://www.i18nqa.com/debug/table-iso8859-1-vs- windows-1252.html ! “Latin-1” in MySQL is more like “CP-1252” https://dev.mysql.com/doc/refman/5.0/en/charset-we-sets.html ! Why it's important to write tests with character boundary values http://labs.spotify.com/2013/06/18/creative-usernames/ Further reading
  127. 127. chardet https://pypi.python.org/pypi/chardet ! python-ftfy https://github.com/LuminosoInsight/python-ftfy Tools
  128. 128. @estherbester @travisfischer Slides at http://bit.ly/flip_tables IRC
