Unicode 101
How to avoid corrupting
international text
ß
�!
David Foster
Goal
Learn just enough to:
– Avoid corrupting international text in your code
Out of Scope
• Internationalization (i18n)
– Extending a program to emit messages in
multiple languages
• Localization (l10n)
– Extending a program to emit messages in a
specific language, such as German
• Manipulating Unicode characters within strings
Problems
• Customer A writes some text to a file or app.
Customer B reads it back, but it is different.
In particular it has a bunch of ??? or ���.
– ß ➔ �
• UnicodeEncodeError: 'ascii' codec can't encode character
'\ua000' in position 0: ordinal not in range(128)
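The error above is easy to reproduce. A minimal sketch in Python 3; the helper name try_ascii is made up for illustration:

```python
def try_ascii(text):
    """Encode text as ASCII; return the bytes, or the error message on failure."""
    try:
        return text.encode("ascii")
    except UnicodeEncodeError as exc:
        return str(exc)

print(try_ascii("Mein Fuss"))   # plain ASCII encodes fine
print(try_ascii("\ua000"))      # U+A000 is outside ASCII's 0-127 range
```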
Bytes vs. Characters
Byte stream: 77 101 105 110 32 70 117 195 159
↓ Decode utf-8 (the character encoding: often forgotten!)
Character stream: M e i n (space) F u ß (ß is multiple bytes wide: 195 159)
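The same picture as a short Python 3 sketch (the variable names are illustrative):

```python
text = "Mein Fuß"
data = text.encode("utf-8")          # characters -> bytes

print(list(data))                    # the byte stream, as integers
print(len(text), len(data))          # 8 characters, but 9 bytes
print(data.decode("utf-8") == text)  # bytes -> characters round-trips
```

The last two byte values, 195 and 159, together encode the single character ß.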
What is the character encoding?
• There is usually some signal (sometimes out-of-band) that
specifies the encoding that should be used to interpret a
byte stream as characters.
– HTTP: Content-Type: text/html; charset=UTF-8
– HTML: <meta charset="UTF-8"/>
– XML: <?xml version="1.0" encoding="UTF-8"?>
– Python: # -*- coding: utf-8 -*-
– POSIX: LANG=en_US.UTF-8
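As one example of reading such a signal, here is a sketch that pulls the charset out of an HTTP Content-Type value using Python's standard email.message module. The function name charset_from_content_type is hypothetical, and a real HTTP client would normally do this for you:

```python
from email.message import Message

def charset_from_content_type(header_value, default=None):
    """Return the charset parameter of a Content-Type value, lowercased."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset(default)

print(charset_from_content_type("text/html; charset=UTF-8"))
print(charset_from_content_type("text/html"))   # no signal: falls back to default
```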
What is the character encoding?
• Unfortunately some types of files don't contain any
information about their encoding.
– Text files (*.txt)
• Usually the OS default character encoding is assumed,
which depends on its locale. Yikes.
– JSON files (*.json)
• Usually UTF-8 is assumed. RFC 4627 also permitted UTF-16 and
UTF-32; its successor, RFC 8259, requires UTF-8 for JSON
exchanged between systems outside a closed ecosystem.
– Java source files (*.java)
• Encoding is derived from the -encoding compiler flag.
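For the JSON case, a small sketch assuming Python 3.6 or later, where json.loads accepts bytes and works out whether they are UTF-8, UTF-16 or UTF-32. The sample document is invented for illustration:

```python
import json

doc = {"status": "Mein Fuß tut weh!"}
as_utf8 = json.dumps(doc, ensure_ascii=False).encode("utf-8")
as_utf16 = json.dumps(doc, ensure_ascii=False).encode("utf-16")  # includes a BOM

# Different byte streams, same characters once parsed.
print(as_utf8 == as_utf16)
print(json.loads(as_utf8) == json.loads(as_utf16) == doc)
```

Plain text files get no such help: the reader has to be told the encoding.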
Big Mistake #1
You cannot interpret a
byte sequence as a
character sequence
without knowing the
character encoding.
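A sketch of what goes wrong when the encoding is guessed, in Python 3. The two byte strings stand in for files written by two different customers:

```python
utf8_bytes = "Fuß".encode("utf-8")       # b'Fu\xc3\x9f'
cp1252_bytes = "Fuß".encode("cp1252")    # b'Fu\xdf'

print(utf8_bytes.decode("utf-8"))                      # correct guess
print(utf8_bytes.decode("cp1252"))                     # wrong guess: two junk characters
print(cp1252_bytes.decode("utf-8", errors="replace"))  # wrong guess: U+FFFD replacement
```

The bytes never changed; only the assumed encoding did.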
What's wrong with this code? (A1)
#!/usr/bin/python2.7
with open("names.txt", "r") as f:
    for name in f:
        print('Hello ' + name.strip())
• No character encoding is specified!
– The bytes in the file are never decoded, so they end up
interpreted in the OS default character encoding,
which depends on its locale.
– Therefore a customer running this program on a Japanese OS
will see different text than a customer on an English OS!
• Reads byte strings instead of character strings!
What's wrong with this code? (A1)
#!/usr/bin/python2.7
import codecs
with codecs.open("names.txt", "r", "utf-8") as f:
    for name in f:
        print(u'Hello ' + name.strip())
• Fixed. Will always read character strings, and as UTF-8.
What's wrong with this code? (A2)
#!/usr/bin/python3.4
with open("names.txt", "r") as f:
    for name in f:
        print('Hello ' + name.strip())
• No character encoding is specified!
What's wrong with this code? (A2)
#!/usr/bin/python3.4
with open("names.txt", "r", encoding="utf-8") as f:
    for name in f:
        print('Hello ' + name.strip())
• Fixed. Will always read as UTF-8.
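The same rule applies to the writer. A self-contained sketch in Python 3 that writes and reads a temporary names.txt with the encoding named on both sides; the sample names are invented:

```python
import os
import tempfile

names = ["Jürgen", "Zoë", "Weiß"]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "names.txt")
    # Writer and reader name the same encoding explicitly.
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(names) + "\n")
    with open(path, "r", encoding="utf-8") as f:
        greetings = ["Hello " + line.strip() for line in f]

print(greetings)
```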
What's wrong with this code? (B)
<!DOCTYPE html>
<html>
<head>
<title>Krankenzimmer</title>
</head>
<body>Mein Fuß tut weh!</body>
</html>
• No character encoding is specified!
What's wrong with this code? (B)
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8"/>
<title>Krankenzimmer</title>
</head>
<body>Mein Fuß tut weh!</body>
</html>
• Fixed. Declares self as UTF-8 encoded.
What's wrong with this code? (C)
<?xml version="1.0"?>
<messages>
<message>Mein Fuß tut weh!</message>
</messages>
• No character encoding is specified!
What's wrong with this code? (C)
<?xml version="1.0" encoding="UTF-8"?>
<messages>
<message>Mein Fuß tut weh!</message>
</messages>
• Fixed. Declares self as UTF-8 encoded.
What's wrong with this code? (D)
// C#
// TextReader is a character stream
// OpenText always assumes UTF-8 encoding
using (TextReader r = File.OpenText("names.xml"))
{
    XmlDocument doc = new XmlDocument();
    doc.Load(r);
    ...
}
• The encoding declaration in the XML is ignored!
UTF-8 is always forced.
What's wrong with this code? (D)
// C#
// Stream is a byte stream
using (Stream s = File.OpenRead("names.xml"))
{
    XmlDocument doc = new XmlDocument();
    doc.Load(s);
    ...
}
• Fixed. XmlDocument will internally determine the
encoding based on the declaration in the byte stream.
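The same idea sketched in Python 3 with the standard xml.etree.ElementTree module: hand the parser bytes and it reads the declaration itself. The ISO-8859-1 document here is made up to show that a non-UTF-8 declaration is honored:

```python
import xml.etree.ElementTree as ET

xml_text = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>'
    "<messages><message>Mein Fuß tut weh!</message></messages>"
)
xml_bytes = xml_text.encode("iso-8859-1")   # the bytes match the declaration

root = ET.fromstring(xml_bytes)             # parser reads the declaration itself
print(root.find("message").text)
```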
Big Mistake #2
Bytes and characters
are not the same thing.
Do not mix them.
Unfortunately many languages blur the line
between byte strings and character strings.
– Python 2.x
• All strings are byte strings by default.
• Byte and ASCII character strings are implicitly convertible.
– C / C++
• String functions in the C standard library manipulate
byte strings by default.
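Python 3, by contrast, keeps the two types apart and refuses to mix them. A minimal sketch with illustrative values:

```python
greeting = "Schädigung: "               # character string (str)
status = "gebrochen".encode("utf-8")    # byte string (bytes)

try:
    message = greeting + status         # mixing the two is refused
except TypeError as exc:
    message = None
    print("TypeError:", exc)

fixed = greeting + status.decode("utf-8")   # decode first, then combine
print(fixed)
```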
What's wrong with this code? (E1)
#!/usr/bin/python2.7
# -*- coding: windows-1252 -*-
print('Mein Fuß tut weh!')
• A byte string (with international chars) was printed.
Only character strings should be printed.
– On OS X, which has the UTF-8 locale by default rather than
Windows-1252, the second word will be printed as "Fu?"
instead of "Fuß".
What's wrong with this code? (E1)
#!/usr/bin/python2.7
# -*- coding: windows-1252 -*-
print(u'Mein Fuß tut weh!')
• This is the smallest possible fix.
What's wrong with this code? (E1)
#!/usr/bin/python2.7
# -*- coding: windows-1252 -*-
from __future__ import unicode_literals
print('Mein Fuß tut weh!')
• A better fix, since it avoids adding u'…' everywhere.
What's wrong with this code? (E2)
#!/usr/bin/python3.4
# -*- coding: windows-1252 -*-
print('Mein Fuß tut weh!')
• Nothing!
– Python 3.x interprets string literals as character strings
by default.
What's wrong with this code? (F)
#!/usr/bin/python2.7
# -*- coding: utf-8 -*-
import codecs
with codecs.open('hurts.txt', 'r', 'utf-8') as f:
    status = f.read().strip()
print('Schädigung: ' + status)
• Mixing a byte string literal with character input.
– Python 2.x interprets string literals as bytes by default.
What's wrong with this code? (F)
#!/usr/bin/python2.7
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import codecs
with codecs.open('hurts.txt', 'r', 'utf-8') as f:
    status = f.read().strip()
print('Schädigung: ' + status)
• Fixed. All strings are character strings now.
Summary: Special Considerations
• Python 2.x
– String literals are byte strings by default rather than characters.
– Implicitly converts between byte strings and ASCII character strings.
• HTML, CSS, JavaScript
– Must declare an encoding in HTML.
• XML files
– Must declare an encoding in XML. Must honor such a declaration.
– Feed bytes to XML parsers rather than characters.
• Text files
– Must always assume an encoding. Usually UTF-8.
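One way to apply these points in Python 3 is to decode bytes as they come in, work only on character strings, and encode again on the way out. A sketch under that assumption; the function name greet_all is hypothetical:

```python
def greet_all(raw, encoding="utf-8"):
    """Decode bytes at the boundary, work on characters, encode on the way out."""
    text = raw.decode(encoding)                         # bytes -> characters
    lines = ["Hello " + line.strip() for line in text.splitlines()]
    return ("\n".join(lines) + "\n").encode(encoding)   # characters -> bytes

result = greet_all("Jürgen\nZoë\n".encode("utf-8"))
print(result.decode("utf-8"))
```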
Don't Forget
1. You cannot interpret a byte sequence as a
character sequence without knowing the
character encoding.
2. Bytes and characters are not the same thing.
Do not mix them.
Thank You
More broken programs…
What's wrong with this code? (#1)
// Java
Reader r = new FileReader("names.txt");
• No character encoding is specified!
– Java will fall back to the OS default character encoding,
which depends on its locale.
– Therefore a customer running this program on a Japanese OS
will read different text than a customer on an English OS!
What's wrong with this code? (#1)
// Java
Reader r = new InputStreamReader(
    new FileInputStream("names.txt"), "UTF-8");
• Fixed. Will always read as UTF-8.
What's wrong with this code? (#2)
// C#
TextReader r = new StreamReader("names.txt");
• Nothing!
– C#'s StreamReader defaults to UTF-8 (after checking for a
byte order mark) if no encoding is specified.
– You must always read the documentation. Don't assume.
What's wrong with this code? (#2)
// C#
TextReader r = new StreamReader(
    "names.txt", Encoding.UTF8);
• Nevertheless, always explicitly specifying the encoding is still a
good idea.

More Related Content

What's hot

Farhana shaikh webinar_huffman coding
Farhana shaikh webinar_huffman codingFarhana shaikh webinar_huffman coding
Farhana shaikh webinar_huffman coding
Farhana Shaikh
 
Extracting text from PDF (iOS)
Extracting text from PDF (iOS)Extracting text from PDF (iOS)
Extracting text from PDF (iOS)
Kaz Yoshikawa
 
Python introduction towards data science
Python introduction towards data sciencePython introduction towards data science
Python introduction towards data science
deepak teja
 
ITU - MDD - Textural Languages and Grammars
ITU - MDD - Textural Languages and GrammarsITU - MDD - Textural Languages and Grammars
ITU - MDD - Textural Languages and Grammars
Tonny Madsen
 
introduction to python
 introduction to python introduction to python
introduction to python
Jincy Nelson
 
Python programming introduction
Python programming introductionPython programming introduction
Python programming introduction
Siddique Ibrahim
 
F# in MonoDevelop
F# in MonoDevelopF# in MonoDevelop
F# in MonoDevelop
Tomas Petricek
 
Introduction to python programming
Introduction to python programmingIntroduction to python programming
Introduction to python programming
Srinivas Narasegouda
 
Abap slide class4 unicode-plusfiles
Abap slide class4 unicode-plusfilesAbap slide class4 unicode-plusfiles
Abap slide class4 unicode-plusfilesMilind Patil
 
1.1.2 HEXADECIMAL
1.1.2 HEXADECIMAL1.1.2 HEXADECIMAL
1.1.2 HEXADECIMAL
Buxoo Abdullah
 

What's hot (12)

Farhana shaikh webinar_huffman coding
Farhana shaikh webinar_huffman codingFarhana shaikh webinar_huffman coding
Farhana shaikh webinar_huffman coding
 
Extracting text from PDF (iOS)
Extracting text from PDF (iOS)Extracting text from PDF (iOS)
Extracting text from PDF (iOS)
 
Python introduction towards data science
Python introduction towards data sciencePython introduction towards data science
Python introduction towards data science
 
ITU - MDD - Textural Languages and Grammars
ITU - MDD - Textural Languages and GrammarsITU - MDD - Textural Languages and Grammars
ITU - MDD - Textural Languages and Grammars
 
introduction to python
 introduction to python introduction to python
introduction to python
 
Python programming introduction
Python programming introductionPython programming introduction
Python programming introduction
 
CLI313
CLI313CLI313
CLI313
 
F# in MonoDevelop
F# in MonoDevelopF# in MonoDevelop
F# in MonoDevelop
 
Introduction to python programming
Introduction to python programmingIntroduction to python programming
Introduction to python programming
 
Abap slide class4 unicode-plusfiles
Abap slide class4 unicode-plusfilesAbap slide class4 unicode-plusfiles
Abap slide class4 unicode-plusfiles
 
1.1.2 HEXADECIMAL
1.1.2 HEXADECIMAL1.1.2 HEXADECIMAL
1.1.2 HEXADECIMAL
 
mdc_ppt
mdc_pptmdc_ppt
mdc_ppt
 

Viewers also liked

An Introduction To Python - Strings & I/O
An Introduction To Python - Strings & I/OAn Introduction To Python - Strings & I/O
An Introduction To Python - Strings & I/O
Blue Elephant Consulting
 
Python iteration
Python iterationPython iteration
Python iterationdietbuddha
 
파이선 문법 조금만더
파이선 문법 조금만더파이선 문법 조금만더
파이선 문법 조금만더Woojing Seok
 
Strings
StringsStrings
The Awesome Python Class Part-3
The Awesome Python Class Part-3The Awesome Python Class Part-3
The Awesome Python Class Part-3
Binay Kumar Ray
 
Python Programming Essentials - M35 - Iterators & Generators
Python Programming Essentials - M35 - Iterators & GeneratorsPython Programming Essentials - M35 - Iterators & Generators
Python Programming Essentials - M35 - Iterators & Generators
P3 InfoTech Solutions Pvt. Ltd.
 
Promoting Polymorphism
Promoting PolymorphismPromoting Polymorphism
Promoting Polymorphism
Kevlin Henney
 
How to Become a Thought Leader in Your Niche
How to Become a Thought Leader in Your NicheHow to Become a Thought Leader in Your Niche
How to Become a Thought Leader in Your Niche
Leslie Samuel
 

Viewers also liked (9)

An Introduction To Python - Strings & I/O
An Introduction To Python - Strings & I/OAn Introduction To Python - Strings & I/O
An Introduction To Python - Strings & I/O
 
Python iteration
Python iterationPython iteration
Python iteration
 
파이선 문법 조금만더
파이선 문법 조금만더파이선 문법 조금만더
파이선 문법 조금만더
 
Strings
StringsStrings
Strings
 
Unicode basics in python
Unicode basics in pythonUnicode basics in python
Unicode basics in python
 
The Awesome Python Class Part-3
The Awesome Python Class Part-3The Awesome Python Class Part-3
The Awesome Python Class Part-3
 
Python Programming Essentials - M35 - Iterators & Generators
Python Programming Essentials - M35 - Iterators & GeneratorsPython Programming Essentials - M35 - Iterators & Generators
Python Programming Essentials - M35 - Iterators & Generators
 
Promoting Polymorphism
Promoting PolymorphismPromoting Polymorphism
Promoting Polymorphism
 
How to Become a Thought Leader in Your Niche
How to Become a Thought Leader in Your NicheHow to Become a Thought Leader in Your Niche
How to Become a Thought Leader in Your Niche
 

Similar to Unicode 101

Software Internationalization Crash Course
Software Internationalization Crash CourseSoftware Internationalization Crash Course
Software Internationalization Crash Course
Will Iverson
 
How To Build And Launch A Successful Globalized App From Day One Or All The ...
How To Build And Launch A Successful Globalized App From Day One  Or All The ...How To Build And Launch A Successful Globalized App From Day One  Or All The ...
How To Build And Launch A Successful Globalized App From Day One Or All The ...
agileware
 
Unicode and character sets
Unicode and character setsUnicode and character sets
Unicode and character setsrenchenyu
 
Unicode, PHP, and Character Set Collisions
Unicode, PHP, and Character Set CollisionsUnicode, PHP, and Character Set Collisions
Unicode, PHP, and Character Set Collisions
Ray Paseur
 
Encoding Nightmares (and how to avoid them)
Encoding Nightmares (and how to avoid them)Encoding Nightmares (and how to avoid them)
Encoding Nightmares (and how to avoid them)
Kenneth Farrall
 
UTF-8: The Secret of Character Encoding
UTF-8: The Secret of Character EncodingUTF-8: The Secret of Character Encoding
UTF-8: The Secret of Character EncodingBert Pattyn
 
Understand unicode & utf8 in perl (2)
Understand unicode & utf8 in perl (2)Understand unicode & utf8 in perl (2)
Understand unicode & utf8 in perl (2)
Jerome Eteve
 
Character Encoding issue with PHP
Character Encoding issue with PHPCharacter Encoding issue with PHP
Character Encoding issue with PHP
Ravi Raj
 
Except UnicodeError: battling Unicode demons in Python
Except UnicodeError: battling Unicode demons in PythonExcept UnicodeError: battling Unicode demons in Python
Except UnicodeError: battling Unicode demons in Python
Aram Dulyan
 
Understanding Character Encodings
Understanding Character EncodingsUnderstanding Character Encodings
Understanding Character Encodings
Mobisoft Infotech
 
PHP for Grown-ups
PHP for Grown-upsPHP for Grown-ups
PHP for Grown-ups
Manuel Lemos
 
Chapter 01 Java Programming Basic Java IDE JAVA INTELLIEJ
Chapter 01 Java Programming Basic Java IDE JAVA INTELLIEJChapter 01 Java Programming Basic Java IDE JAVA INTELLIEJ
Chapter 01 Java Programming Basic Java IDE JAVA INTELLIEJ
IMPERIALXGAMING
 
Comprehasive Exam - IT
Comprehasive Exam - ITComprehasive Exam - IT
Comprehasive Exam - IT
guest6ddfb98
 
HTTP 완벽가이드 16장
HTTP 완벽가이드 16장HTTP 완벽가이드 16장
HTTP 완벽가이드 16장
HyeonSeok Choi
 
Common mistakes in C programming
Common mistakes in C programmingCommon mistakes in C programming
Common mistakes in C programming
Larion
 
Lecture 1 introduction to language processors
Lecture 1  introduction to language processorsLecture 1  introduction to language processors
Lecture 1 introduction to language processors
Rebaz Najeeb
 
Encodings - Ruby 1.8 and Ruby 1.9
Encodings - Ruby 1.8 and Ruby 1.9Encodings - Ruby 1.8 and Ruby 1.9
Encodings - Ruby 1.8 and Ruby 1.9
Dimelo R&D Team
 
Msc prev completed
Msc prev completedMsc prev completed
Msc prev completed
mshoaib15
 
Msc prev updated
Msc prev updatedMsc prev updated
Msc prev updated
mshoaib15
 
Unit 1 c - all topics
Unit 1   c - all topicsUnit 1   c - all topics
Unit 1 c - all topics
veningstonk
 

Similar to Unicode 101 (20)

Software Internationalization Crash Course
Software Internationalization Crash CourseSoftware Internationalization Crash Course
Software Internationalization Crash Course
 
How To Build And Launch A Successful Globalized App From Day One Or All The ...
How To Build And Launch A Successful Globalized App From Day One  Or All The ...How To Build And Launch A Successful Globalized App From Day One  Or All The ...
How To Build And Launch A Successful Globalized App From Day One Or All The ...
 
Unicode and character sets
Unicode and character setsUnicode and character sets
Unicode and character sets
 
Unicode, PHP, and Character Set Collisions
Unicode, PHP, and Character Set CollisionsUnicode, PHP, and Character Set Collisions
Unicode, PHP, and Character Set Collisions
 
Encoding Nightmares (and how to avoid them)
Encoding Nightmares (and how to avoid them)Encoding Nightmares (and how to avoid them)
Encoding Nightmares (and how to avoid them)
 
UTF-8: The Secret of Character Encoding
UTF-8: The Secret of Character EncodingUTF-8: The Secret of Character Encoding
UTF-8: The Secret of Character Encoding
 
Understand unicode & utf8 in perl (2)
Understand unicode & utf8 in perl (2)Understand unicode & utf8 in perl (2)
Understand unicode & utf8 in perl (2)
 
Character Encoding issue with PHP
Character Encoding issue with PHPCharacter Encoding issue with PHP
Character Encoding issue with PHP
 
Except UnicodeError: battling Unicode demons in Python
Except UnicodeError: battling Unicode demons in PythonExcept UnicodeError: battling Unicode demons in Python
Except UnicodeError: battling Unicode demons in Python
 
Understanding Character Encodings
Understanding Character EncodingsUnderstanding Character Encodings
Understanding Character Encodings
 
PHP for Grown-ups
PHP for Grown-upsPHP for Grown-ups
PHP for Grown-ups
 
Chapter 01 Java Programming Basic Java IDE JAVA INTELLIEJ
Chapter 01 Java Programming Basic Java IDE JAVA INTELLIEJChapter 01 Java Programming Basic Java IDE JAVA INTELLIEJ
Chapter 01 Java Programming Basic Java IDE JAVA INTELLIEJ
 
Comprehasive Exam - IT
Comprehasive Exam - ITComprehasive Exam - IT
Comprehasive Exam - IT
 
HTTP 완벽가이드 16장
HTTP 완벽가이드 16장HTTP 완벽가이드 16장
HTTP 완벽가이드 16장
 
Common mistakes in C programming
Common mistakes in C programmingCommon mistakes in C programming
Common mistakes in C programming
 
Lecture 1 introduction to language processors
Lecture 1  introduction to language processorsLecture 1  introduction to language processors
Lecture 1 introduction to language processors
 
Encodings - Ruby 1.8 and Ruby 1.9
Encodings - Ruby 1.8 and Ruby 1.9Encodings - Ruby 1.8 and Ruby 1.9
Encodings - Ruby 1.8 and Ruby 1.9
 
Msc prev completed
Msc prev completedMsc prev completed
Msc prev completed
 
Msc prev updated
Msc prev updatedMsc prev updated
Msc prev updated
 
Unit 1 c - all topics
Unit 1   c - all topicsUnit 1   c - all topics
Unit 1 c - all topics
 

Recently uploaded

FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
Neo4j
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
Neo4j
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
Ralf Eggert
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems S.M.S.A.
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
Matthew Sinclair
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
Matthew Sinclair
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
Alpen-Adria-Universität
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
Quotidiano Piemontese
 
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfSAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
Peter Spielvogel
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
James Anderson
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
Safe Software
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance
 
UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5
DianaGray10
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Paige Cruz
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance
 
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptxSecstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
nkrafacyberclub
 

Recently uploaded (20)

FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
 
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfSAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptxSecstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
 

Unicode 101

  • 1. Unicode 101 How to avoid corrupting international text ß �! David Foster
  • 2. Goal Learn just enough to: – Avoid corrupting international text in your code
  • 3. Out of Scope • Internationalization (i18n) – Extending a program to emit messages in multiple languages • Localization (l10n) – Extending a program to emit messages in a specific language, such as German • Manipulating Unicode characters within strings
  • 4. Problems • Customer A writes some text to a file or app. Customer B reads it back, but it is different. In particular it has a bunch of ??? or ���. – ß ➔ � • UnicodeEncodeError: 'ascii' codec can't encode character 'ua000' in position 0: ordinal not in range(128)
  • 5. Bytes vs. Characters 77 10 1 10 5 11 0 32 70 11 7 19 5 15 9 M e i n F u ß Byte Stream Decode utf-8 Character Stream Character Encoding
  • 6. Bytes vs. Characters 77 10 1 10 5 11 0 32 70 11 7 19 5 15 9 M e i n F u ß Byte Stream Decode utf-8 Character Stream Character Encoding ︎Multiple bytes wide! ☝ ︎Often forgotten! ☟
  • 7. What is the character encoding? • There is usually some signal (sometimes out-of- band) that specifies the encoding that should be used to interpret a byte stream as characters. – HTTP: Content-Type: text/html; charset=UTF-8 – HTML: <meta charset="UTF-8"/> – XML: <?xml encoding="UTF-8"> – Python: # -*- coding: utf-8 -*- – POSIX: LANG=en_US.UTF-8
  • 8. What is the character encoding? • Unfortunately some types of files don't contain any information about their encoding. – Text files (*.txt) • Usually the OS default character encoding is assumed, which depends on its locale. Yikes. – JSON files (*.json) • Usually UTF-8 is assumed, but other Unicode encodings are permitted by RFC 4627. – Java source files (*.java) • Encoding is derived from the -encoding compiler flag.
  • 9. Big Mistake #1 You cannot interpret a byte sequence as a character sequence without knowing the character encoding.
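To make the point concrete, the Python 3 sketch below decodes one and the same byte sequence under three different assumed encodings. The choice of encodings is illustrative; only the first matches how the bytes were produced.

```python
raw = bytes([70, 117, 195, 159])  # "Fuß" encoded as UTF-8

as_utf8 = raw.decode("utf-8")        # the encoding the bytes were written in
as_cp1252 = raw.decode("cp1252")     # a common Windows default
as_utf16 = raw.decode("utf-16-le")   # pairs of bytes become single characters

assert as_utf8 == "Fuß"
assert as_cp1252 == "Fu\u00c3\u0178"   # renders as "FuÃŸ"
assert as_utf16 == "\u7546\u9fc3"      # two unrelated CJK code points
```

None of the three calls fails, which is what makes the mistake easy to miss: a wrong guess often produces plausible-looking garbage rather than an error.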
  • 10. What's wrong with this code? (A1) #!/usr/bin/python2.7 with open("names.txt", "r") as f: for name in f: print('Hello ' + name.strip())
  • 11. What's wrong with this code? (A1) #!/usr/bin/python2.7 with open("names.txt", "r") as f: for name in f: print('Hello ' + name.strip()) • No character encoding is specified! – Python will fall back to the OS default character encoding, which depends on its locale. – Therefore a customer running this program on a Japanese OS will read different text than one running it on an English OS! • Reads byte strings instead of character strings!
  • 12. What's wrong with this code? (A1) #!/usr/bin/python2.7 import codecs with codecs.open("names.txt", "r", "utf-8") as f: for name in f: print(u'Hello ' + name.strip()) • Fixed. Will always read character strings, and as UTF-8.
  • 13. What's wrong with this code? (A2) #!/usr/bin/python3.4 with open("names.txt", "r") as f: for name in f: print('Hello ' + name.strip())
  • 14. What's wrong with this code? (A2) #!/usr/bin/python3.4 with open("names.txt", "r") as f: for name in f: print('Hello ' + name.strip()) • No character encoding is specified!
  • 15. What's wrong with this code? (A2) #!/usr/bin/python3.4 with open("names.txt", "r", encoding="utf-8") as f: for name in f: print('Hello ' + name.strip()) • Fixed. Will always read as UTF-8.
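The effect of the fix can be seen by reading the same file twice, once with the right encoding and once with the encoding a differently configured machine might default to. This is a sketch only: the file contents, the `greetings` helper and the choice of cp1252 as the "wrong" default are assumptions made for the example.

```python
import os
import tempfile

# Write a small UTF-8 file to stand in for names.txt (hypothetical contents).
path = os.path.join(tempfile.mkdtemp(), "names.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("Jürgen\nZoë\n")

def greetings(path, encoding):
    """Read one name per line, decoding the file with the given encoding."""
    with open(path, "r", encoding=encoding) as f:
        return ["Hello " + name.strip() for name in f]

# Explicit UTF-8: the result is the same on every machine.
correct = greetings(path, "utf-8")

# Same bytes, wrong assumption: what a cp1252 locale default would produce.
corrupted = greetings(path, "cp1252")
```

Here `correct` holds the intended names, while `corrupted` holds mojibake such as "JÃ¼rgen", without any exception being raised.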
  • 16. What's wrong with this code? (B) <!DOCTYPE html> <html> <head> <title>Krankenzimmer</title> </head> <body>Mein Fuß tut weh!</body> </html>
  • 17. What's wrong with this code? (B) <!DOCTYPE html> <html> <head> <title>Krankenzimmer</title> </head> <body>Mein Fuß tut weh!</body> </html> • No character encoding is specified!
  • 18. What's wrong with this code? (B) <!DOCTYPE html> <html> <head> <meta charset="UTF-8"/> <title>Krankenzimmer</title> </head> <body>Mein Fuß tut weh!</body> </html> • Fixed. Declares self as UTF-8 encoded.
  • 19. What's wrong with this code? (C) <?xml version="1.0"?> <messages> <message>Mein Fuß tut weh!</message> </messages>
  • 20. What's wrong with this code? (C) <?xml version="1.0"?> <messages> <message>Mein Fuß tut weh!</message> </messages> • No character encoding is specified!
  • 21. What's wrong with this code? (C) <?xml version="1.0" encoding="UTF-8"?> <messages> <message>Mein Fuß tut weh!</message> </messages> • Fixed. Declares self as UTF-8 encoded.
  • 22. What's wrong with this code? (D) // C# // TextReader is a character stream // OpenText always assumes UTF-8 encoding using (TextReader r = File.OpenText("names.xml")) { XmlDocument doc = new XmlDocument(); doc.Load(r); ... }
  • 23. What's wrong with this code? (D) // C# // TextReader is a character stream // OpenText always assumes UTF-8 encoding using (TextReader r = File.OpenText("names.xml")) { XmlDocument doc = new XmlDocument(); doc.Load(r); ... } • The encoding declaration in the XML is ignored! UTF-8 is always forced.
  • 24. What's wrong with this code? (D) // C# // Stream is a byte stream using (Stream s = File.OpenRead("names.xml")) { XmlDocument doc = new XmlDocument(); doc.Load(s); ... } • Fixed. XmlDocument will internally determine the encoding based on the declaration in the byte stream.
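The same rule applies outside C#. As a rough Python 3 analogue (not the code from the slides), the sketch below feeds bytes to `xml.etree.ElementTree` so the parser can honor the declaration, and then shows what goes wrong when UTF-8 is forced first. The sample document and its ISO-8859-1 encoding are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# An XML document that declares, and is actually encoded in, ISO-8859-1.
xml_bytes = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>'
    "<messages><message>Mein Fuß tut weh!</message></messages>"
).encode("iso-8859-1")

# Right: hand the parser bytes so it can read the encoding declaration.
root = ET.fromstring(xml_bytes)
message = root.find("message").text

# Wrong: force UTF-8 before the parser ever sees the declaration.
try:
    xml_bytes.decode("utf-8")
    forced_utf8_failed = False
except UnicodeDecodeError:
    forced_utf8_failed = True
```

In this Python version the forced decode fails loudly. A reader that substitutes replacement characters instead of raising would corrupt the text silently.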
  • 25. Big Mistake #2 Bytes and characters are not the same thing. Do not mix them.
  • 26. Unfortunately many languages blur the line between byte strings and character strings. – Python 2.x • All strings are byte strings by default. • Byte and ASCII character strings are implicitly convertible. – C / C++ • String functions in the C standard library manipulate byte strings by default.
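For contrast, Python 3 keeps the two types strictly apart, which makes the distinction easy to see. A minimal sketch with illustrative variable names:

```python
text = "Fuß"                  # character string (str)
data = text.encode("utf-8")   # byte string (bytes)

# The lengths differ: 3 characters, but 4 bytes.
assert len(text) == 3
assert len(data) == 4

# Python 3 refuses to mix the two types implicitly.
try:
    "Mein " + data
    mixed_ok = True
except TypeError:
    mixed_ok = False

# Convert explicitly at the boundary instead.
combined = "Mein " + data.decode("utf-8")
```

In Python 2.x the concatenation would be attempted through an implicit ASCII conversion, which is where errors like the UnicodeEncodeError on the Problems slide tend to come from.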
  • 27. What's wrong with this code? (E1) #!/usr/bin/python2.7 # -*- coding: windows-1252 -*- print('Mein Fuß tut weh!')
  • 28. What's wrong with this code? (E1) #!/usr/bin/python2.7 # -*- coding: windows-1252 -*- print('Mein Fuß tut weh!') • A byte string (with international chars) was printed. Only character strings should be printed. – On OS X, which has the UTF-8 locale by default rather than Windows-1252, the second word will be printed as "Fu?" instead of "Fuß".
  • 29. What's wrong with this code? (E1) #!/usr/bin/python2.7 # -*- coding: windows-1252 -*- print(u'Mein Fuß tut weh!') • This is the smallest possible fix.
  • 30. What's wrong with this code? (E1) #!/usr/bin/python2.7 # -*- coding: windows-1252 -*- from __future__ import unicode_literals print('Mein Fuß tut weh!') • A better fix, since it avoids adding u'…' everywhere.
  • 31. What's wrong with this code? (E2) #!/usr/bin/python3.4 # -*- coding: windows-1252 -*- print('Mein Fuß tut weh!')
  • 32. What's wrong with this code? (E2) #!/usr/bin/python3.4 # -*- coding: windows-1252 -*- print('Mein Fuß tut weh!') • Nothing! – Python 3.x interprets string literals as character strings by default.
  • 33. What's wrong with this code? (F) #!/usr/bin/python2.7 # -*- coding: utf-8 -*- import codecs with codecs.open('hurts.txt', 'r', 'utf-8') as f: status = f.read().strip() print('Schädigung: ' + status)
  • 34. What's wrong with this code? (F) #!/usr/bin/python2.7 # -*- coding: utf-8 -*- import codecs with codecs.open('hurts.txt', 'r', 'utf-8') as f: status = f.read().strip() print('Schädigung: ' + status) • Mixing a byte string literal with character input. – Python 2.x interprets string literals as bytes by default.
  • 35. What's wrong with this code? (F) #!/usr/bin/python2.7 # -*- coding: utf-8 -*- from __future__ import unicode_literals import codecs with codecs.open('hurts.txt', 'r', 'utf-8') as f: status = f.read().strip() print('Schädigung: ' + status) • Fixed. All strings are character strings now.
  • 36. Summary: Special Considerations • Python 2.x – String literals are byte strings by default rather than characters. – Implicitly converts between byte strings and ASCII character strings. • HTML, CSS, JavaScript – Must declare an encoding in HTML. • XML files – Must declare an encoding in XML. Must honor such a declaration. – Feed bytes to XML parsers rather than characters. • Text files – Must always assume an encoding. Usually UTF-8.
  • 37. Don't Forget 1. You cannot interpret a byte sequence as a character sequence without knowing the character encoding. 2. Bytes and characters are not the same thing. Do not mix them.
  • 40. What's wrong with this code? (#1) // Java Reader r = new FileReader("names.txt");
  • 41. What's wrong with this code? (#1) // Java Reader r = new FileReader("names.txt"); • No character encoding is specified! – Java will fall back to the OS default character encoding, which depends on its locale. – Therefore a customer running this program on a Japanese OS will read different text than one running it on an English OS!
  • 42. What's wrong with this code? (#1) // Java 11+ (older versions: new InputStreamReader(new FileInputStream("names.txt"), "UTF-8")) Reader r = new FileReader( "names.txt", StandardCharsets.UTF_8); • Fixed. Will always read as UTF-8.
  • 43. What's wrong with this code? (#2) // C# TextReader r = new StreamReader("names.txt");
  • 44. What's wrong with this code? (#2) // C# TextReader r = new StreamReader("names.txt"); • Nothing! – C#'s StreamReader always uses UTF-8 encoding if no encoding is specified. – You must always read the documentation. Don't assume.
  • 45. What's wrong with this code? (#2) // C# TextReader r = new StreamReader( "names.txt", Encoding.UTF8); • Nevertheless, always explicitly specifying the encoding is still a good idea.