Unicode 101

How to avoid corrupting
international text
ß
�!
David Foster

Goal
Learn just enough to:
– Avoid corrupting international text in your code

Out of Scope
• Internationalization (i18n)
– Extending a program to emit messages in
multiple languages
• Localization (l10n)
– Extending a program to emit messages in a
specific language, such as German
• Manipulating Unicode characters within strings

Problems
• Customer A writes some text to a file or app.
Customer B reads it back, but it is different.
In particular it has a bunch of ??? or ��.
– ß ➔ �
• UnicodeEncodeError: 'ascii' codec can't
encode character u'\ua000' in position
0: ordinal not in range(128)
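Both symptoms can be reproduced in a few lines. The following is a minimal Python 3 sketch, not taken from the slides: the variable names are illustrative, and ISO-8859-1 merely stands in for "some encoding other than UTF-8".

```python
text = "Fuß"

# Strict encoding to ASCII fails outright: ß has no ASCII byte.
try:
    text.encode("ascii")
    error_name = None
except UnicodeEncodeError as exc:
    error_name = type(exc).__name__

# A lenient encoder substitutes "?" for whatever it cannot represent.
question_marks = text.encode("ascii", errors="replace")

# A lenient decoder substitutes U+FFFD (the replacement character)
# for bytes that are not valid in the encoding it was told to use.
latin1_bytes = text.encode("iso-8859-1")
replacement = latin1_bytes.decode("utf-8", errors="replace")
```

Once a character has been replaced by "?" or U+FFFD, the original text cannot be recovered from the output.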

Bytes vs. Characters
Byte Stream: 77 101 105 110 32 70 117 195 159
Decode utf-8 (the Character Encoding)
Character Stream: M e i n (space) F u ß

• ß is multiple bytes wide! (195 159 in UTF-8)
• The character encoding is often forgotten!
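The "Mein Fuß" byte stream can be checked directly. A small Python 3 sketch (variable names are illustrative):

```python
text = "Mein Fuß"

# Encoding turns a character string into a byte string.
data = text.encode("utf-8")
byte_values = list(data)

# Eight characters, but nine bytes: ß occupies two bytes in UTF-8.
char_count = len(text)
byte_count = len(data)

# Decoding with the same encoding restores the original characters.
decoded = data.decode("utf-8")
```

Any code that treats the byte count as the character count, or slices the byte string at an arbitrary offset, can cut ß in half.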

What is the character encoding?
• There is usually some signal (sometimes out-of-
band) that specifies the encoding that should be
used to interpret a byte stream as characters.
– HTTP: Content-Type: text/html; charset=UTF-8
– HTML: <meta charset="UTF-8"/>
– XML: <?xml version="1.0" encoding="UTF-8"?>
– Python: # -*- coding: utf-8 -*-
– POSIX: LANG=en_US.UTF-8
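As one illustration of reading such a signal, the sketch below pulls the charset out of an HTTP Content-Type header value using the standard library's email.message.Message. The helper name charset_from_content_type and its default argument are hypothetical, and a real HTTP client would usually do this step for you.

```python
from email.message import Message


def charset_from_content_type(header_value, default="utf-8"):
    """Return the charset named in a Content-Type header value.

    Falls back to `default` (an assumption of this sketch) when the
    header carries no charset parameter.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset(default)


declared = charset_from_content_type("text/html; charset=UTF-8")
undeclared = charset_from_content_type("text/html")
```

The returned name can then be passed to bytes.decode() when turning the response body into characters.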

What is the character encoding?
• Unfortunately some types of files don't contain any
information about their encoding.
– Text files (*.txt)
• Usually the OS default character encoding is assumed,
which depends on its locale. Yikes.
– JSON files (*.json)
• Usually UTF-8 is assumed. RFC 4627 also permitted other Unicode
encodings; its successor, RFC 8259, requires UTF-8 for JSON
exchanged between systems.
– Java source files (*.java)
• Encoding is derived from the -encoding compiler flag.

Big Mistake #1
You cannot interpret a
byte sequence as a
character sequence
without knowing the
character encoding.
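To see why, decode the same bytes under two different assumptions. A short Python 3 sketch (variable names are illustrative):

```python
# The bytes a UTF-8 aware program would write for "Mein Fuß".
data = "Mein Fuß".encode("utf-8")

# Reader A knows the encoding and recovers the text.
as_utf8 = data.decode("utf-8")

# Reader B guesses Windows-1252 and silently gets different characters:
# the two bytes of ß are interpreted as two unrelated characters.
as_cp1252 = data.decode("windows-1252")
```

No exception is raised in the second case, which is what makes a wrong guess so easy to miss.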

What's wrong with this code? (A1)
#!/usr/bin/python2.7
with open("names.txt", "r") as f:
    for name in f:
        print('Hello ' + name.strip())

• No character encoding is specified!
– Python will fall back to the OS default character encoding,
which depends on its locale.
– Therefore a customer running this program on a
Japanese OS will read different text than a customer on an English OS!
• Reads byte strings instead of character strings!

import codecs
with codecs.open("names.txt", "r", "utf-8") as f:
    for name in f:
        print(u'Hello ' + name.strip())
• Fixed. Will always read character strings, and as UTF-8.

# Python 3
with open("names.txt", "r", encoding="utf-8") as f:
    for name in f:
        print('Hello ' + name.strip())
• Fixed. Will always read as UTF-8.
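The fixed Python 3 version can be exercised end to end. The sketch below writes a temporary file and reads it back, naming UTF-8 on both sides; the helper greet_names and the sample names are hypothetical.

```python
import os
import tempfile


def greet_names(path):
    """Read one name per line, always as UTF-8, and build greetings."""
    with open(path, "r", encoding="utf-8") as f:
        return ["Hello " + name.strip() for name in f]


# Write the sample file with an explicit encoding as well, so the
# bytes on disk do not depend on the OS locale.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w", encoding="utf-8") as f:
    f.write("Jürgen\nZoë\n")

greetings = greet_names(path)
os.remove(path)
```

Specifying the encoding when writing matters just as much as when reading: the two sides have to agree.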

What's wrong with this code? (B)
<!DOCTYPE html>
<html>
<head>
<title>Krankenzimmer</title>
</head>
<body>Mein Fuß tut weh!</body>
</html>

• No character encoding is declared!

<!DOCTYPE html>
<html>
  <head>
    <meta charset="UTF-8"/>
    <title>Krankenzimmer</title>
  </head>
  <body>Mein Fuß tut weh!</body>
</html>
• Fixed. Declares self as UTF-8 encoded.

What's wrong with this code? (C)
<?xml version="1.0"?>
<messages>
  <message>Mein Fuß tut weh!</message>
</messages>

• No character encoding is declared!

<?xml version="1.0" encoding="UTF-8"?>
<messages>
  <message>Mein Fuß tut weh!</message>
</messages>
• Fixed. Declares self as UTF-8 encoded.

What's wrong with this code? (D)
// C#
// TextReader is a character stream
// OpenText always assumes UTF-8 encoding
using (TextReader r = File.OpenText("names.xml"))
{
    XmlDocument doc = new XmlDocument();
    doc.Load(r);
    ...
}

• The encoding declaration in the XML is ignored!
UTF-8 is always forced.

// C#
// Stream is a byte stream
using (Stream s = File.OpenRead("names.xml"))
{
    XmlDocument doc = new XmlDocument();
    doc.Load(s);
    ...
}
• Fixed. XmlDocument will internally determine the
encoding based on the declaration in the byte stream.
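The same principle applies outside C#. As a rough Python 3 counterpart, the sketch below hands raw bytes to the standard library's xml.etree.ElementTree, which reads the encoding declaration itself. The document content is made up for illustration, and ISO-8859-1 is chosen only to show that a non-UTF-8 declaration is honored.

```python
import xml.etree.ElementTree as ET

xml_text = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>'
    "<messages><message>Mein Fuß tut weh!</message></messages>"
)

# Simulate the bytes on disk: the file really is ISO-8859-1 encoded,
# so ß is the single byte 0xDF rather than the UTF-8 pair 0xC3 0x9F.
xml_bytes = xml_text.encode("iso-8859-1")

# Feed bytes, not characters: the parser picks the decoder based on
# the declaration at the start of the byte stream.
root = ET.fromstring(xml_bytes)
message = root.find("message").text
```

Decoding the bytes yourself before parsing would repeat the mistake from the C# example, since the choice of decoder would no longer come from the document.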

Big Mistake #2
Bytes and characters
are not the same thing.
Do not mix them.

Unfortunately many languages blur the line
between byte strings and character strings.
– Python 2.x
• All strings are byte strings by default.
• Byte and ASCII character strings are implicitly convertible.
– C / C++
• String functions in the C standard library manipulate
byte strings by default.
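For contrast, Python 3 keeps the two types apart and refuses to combine them implicitly. A minimal sketch (the sample strings are illustrative):

```python
label = "Schädigung: "                  # character string (str)
status_bytes = "schwer".encode("utf-8")  # byte string (bytes)

# Concatenating str and bytes raises TypeError instead of guessing
# an encoding.
try:
    label + status_bytes
    mixed_ok = True
except TypeError:
    mixed_ok = False

# The fix is to decode explicitly at the boundary, then work only
# with characters.
message = label + status_bytes.decode("utf-8")
```

An error at the point of mixing is easier to diagnose than corrupted text discovered later by a customer.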

What's wrong with this code? (E1)
# -*- coding: windows-1252 -*-
print('Mein Fuß tut weh!')

• A byte string (with international chars) was printed.
Only character strings should be printed.
– On OS X, which has the UTF-8 locale by default rather than
Windows-1252, the second word will be printed as "Fu?"
instead of "Fuß".

print(u'Mein Fuß tut weh!')
• This is the smallest possible fix.

from __future__ import unicode_literals
• A better fix, since it avoids adding u'…' everywhere.

• Under Python 3.x, nothing is wrong with this code!
– Python 3.x interprets string literals as character strings
by default.

What's wrong with this code? (F)
# -*- coding: utf-8 -*-
import codecs
with codecs.open('hurts.txt', 'r', 'utf-8') as f:
    status = f.read().strip()
print('Schädigung: ' + status)

• Mixing a byte string literal with character input.
– Python 2.x interprets string literals as bytes by default.

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import codecs
• Fixed. All strings are character strings now.

Summary: Special Considerations
• Python 2.x
– String literals are byte strings by default rather than characters.
– Implicitly converts between byte strings and ASCII character strings.
• HTML, CSS, JavaScript
– Must declare an encoding in HTML.
• XML files
– Must declare an encoding in XML. Must honor such a declaration.
– Feed bytes to XML parsers rather than characters.
• Text files
– Must always assume an encoding. Usually UTF-8.

Don't Forget
1. You cannot interpret a byte sequence as a
character sequence without knowing the
character encoding.
2. Bytes and characters are not the same thing.
Do not mix them.

What's wrong with this code? (#1)
// Java
Reader r = new FileReader("names.txt");

• No character encoding is specified!
– Java will fall back to the OS default character encoding,
which depends on its locale.
– Therefore a customer running this program on a
Japanese OS will read different text than a customer on an English OS!

// Java
Reader r = new InputStreamReader(
    new FileInputStream("names.txt"), "UTF-8");
• Fixed. Will always read as UTF-8.

// C#
TextReader r = new StreamReader("names.txt");

• Nothing!
– C#'s StreamReader always uses UTF-8 encoding if no
encoding is specified.
– You must always read the documentation. Don't assume.

// C#
TextReader r = new StreamReader(
    "names.txt", Encoding.UTF8);
• Nevertheless, always explicitly specifying the encoding is still a
good idea.
