Avoid Encoding Nightmares with Unicode and UTF-8

ENCODING NIGHTMARES
And how to avoid them

PHILADELPHIA SOFTWARE
LOCALIZATION MEETUP
 Welcome to our kickoff event!
 For more information, visit the meetup site at:
 https://www.meetup.com/Philadelphia-Software-Localization-Meetup/

PLAN OF TALK
 Encoding Nightmares
 Character Encoding and the Modern Tower of
Babel
 Rise of Unicode
 Rules of Thumb to Avoid Nightmares
 Tricks of the Trade
 Discussion

DZONGKHA (BHUTANESE) AS
WINDOWS-1252

ENCODING NIGHTMARES CAN LEAD
TO …
 Confusion
 Missed deadlines
 Software Bugs
 Data corruption
 Embarrassment

CHARACTER ENCODING
AND THE MODERN TOWER
OF BABEL

BINARY LANGUAGE
 The Bit, Two States (0, 1)
 Represented by switches “on” (1) or
“off” (0) (Yes, No)
 Grouped Together, Represent More
States
 n bits = 2^n states
 8 bits = 1 byte = 256 states
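
The doubling rule on this slide can be checked with a few lines of Python. This is only an illustrative sketch; the function name `state_count` is made up for the example.

```python
def state_count(bits: int) -> int:
    """Number of distinct values that the given number of bits can represent."""
    return 2 ** bits

# Each extra bit doubles the number of states.
for n in (1, 7, 8, 16):
    print(f"{n} bits -> {state_count(n)} states")
```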

BINARY CHARACTER ENCODING
 ASCII Character Encoding
 Associates each binary string with an English letter, digit, punctuation mark, or control code
 How many are needed?
 Uses 128 distinct 7-bit binary numbers (0 to 127), each mapped to a member of the ASCII character set
 Defined in the ASCII “code page”
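
As a rough illustration of the mapping, the sketch below prints the number and 7-bit pattern behind a few ASCII characters. The helper name `ascii_bits` is an assumption for this example, not part of any library.

```python
def ascii_bits(ch: str) -> str:
    """Return the 7-bit binary pattern that ASCII assigns to one character."""
    code = ch.encode("ascii")[0]  # raises UnicodeEncodeError for non-ASCII input
    return format(code, "07b")

# Show the code number and bit pattern for a few sample characters.
for ch in "Az0 ":
    print(repr(ch), ord(ch), ascii_bits(ch))
```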

EUROPEAN LANGUAGES NEED
MORE SPACE
 German, French, other
languages needed more
than 128 characters
 Started to use the 8th bit (doubles the possibilities)
 256 slots in these 8-bit character maps
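
A small sketch of why the 8th bit mattered: the German word below cannot be stored as ASCII, but fits in one byte per character under ISO 8859-1 (Python's `latin-1` codec). The helper `fits_ascii` and the sample word are chosen only for illustration.

```python
def fits_ascii(text: str) -> bool:
    """True if every character in text has an ASCII code (0 to 127)."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False

word = "Größe"
latin1_bytes = word.encode("latin-1")  # one byte per character

print(fits_ascii(word))               # ASCII cannot hold the ö or the ß
print(latin1_bytes.hex(" "))          # the accented letters use byte values above 127
```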

CHINESE, JAPANESE, KOREAN (CJK)
NEED EVEN MORE
 In Chinese, 2,000 distinct characters is often considered a minimum threshold for literacy. Several thousand characters are in common use, and tens of thousands more appear in rare or historical literature.
 Japanese uses about 2,000 common characters borrowed from Chinese (kanji), mixed with its own two phonetic scripts (hiragana and katakana)
 Modern Korean is written mostly in the phonetic Hangul script and relies much less on the broader set of Chinese characters

DOUBLE BYTE CHARACTER
ENCODINGS
 Two Bytes, 16 Bits
 2^16 = 65,536 possible characters
 Some bit patterns are used as signals, so these encodings can’t actually store the full 65,536
https://r12a.github.io/scripts/tutorial/part2/ (Creative Commons license)
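
To make the "signal" idea concrete, here is a sketch using Shift JIS, one of the legacy Japanese encodings. An ASCII letter still takes one byte, while each kanji takes two, and the value of the first byte tells the decoder that a second byte follows. The helper `byte_lengths` and the sample string are assumptions for the example.

```python
def byte_lengths(text: str, encoding: str) -> list[int]:
    """Number of bytes each character occupies in the given encoding."""
    return [len(ch.encode(encoding)) for ch in text]

sample = "A日本"  # one ASCII letter followed by two kanji
print(byte_lengths(sample, "shift_jis"))

# The lead byte of a two-byte character lies outside the ASCII range,
# which is how a decoder knows to read one more byte.
lead_byte = "日".encode("shift_jis")[0]
print(hex(lead_byte))
```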

NUMBER OF ENCODINGS
EXPLODE
ISO 646, ASCII, EBCDIC, CP37, CP930, CP1047, ISO 8859-1, ISO 8859-2, ISO 8859-3, ISO 8859-4, ISO 8859-5, ISO 8859-6, ISO 8859-7, ISO 8859-8, ISO 8859-9, ISO 8859-10, ISO 8859-11, ISO 8859-12, ISO 8859-13, ISO 8859-14, ISO 8859-15, CP437, CP720, CP737, CP850, CP852, CP855, CP857, CP858, CP860, CP861, CP862, CP863, CP865, CP866, CP869, CP872, Windows-1250, Windows-1251, Windows-1252, Windows-1253, Windows-1254, Windows-1255, Windows-1256, Windows-1257, Windows-1258, Mac OS Roman, Shift JIS, GB 2312, GBK, Taiwan Big5, KOI8-R, KOI8-U, KOI7 ….

1980S: THE COMPUTING TOWER
OF BABEL
 Same binary sequence represents entirely different characters
 Sharing documents across
borders becomes very difficult
 Unintelligible Files (common
experience during early days of
web)
 Hard to create a document
containing multiple languages.
 Double-byte encodings
increase likelihood of and add
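
The first point on this slide, that one binary sequence means different characters under different encodings, can be seen by decoding a single byte three ways. This is a minimal sketch; the byte value and the three code pages are picked only as examples, and `decode_as` is a hypothetical helper.

```python
RAW = bytes([0xE9])  # a single byte with the 8th bit set

def decode_as(raw: bytes, encodings: list[str]) -> dict[str, str]:
    """Decode the same bytes under several encodings and collect the results."""
    return {enc: raw.decode(enc) for enc in encodings}

# Western European, Cyrillic, and original IBM PC code pages.
views = decode_as(RAW, ["latin-1", "cp1251", "cp437"])
for encoding, text in views.items():
    print(f"{encoding:8} -> {text}")
```

Nothing in the byte itself says which of the three readings the author intended, which is why a file without reliable encoding metadata can be unintelligible.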

THE NIGHTMARE
If you open and
save a file with the
wrong character
encoding, you can
change it
permanently.
Important data
may then be
irretrievable.
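
A sketch of how the damage becomes permanent, assuming a hypothetical application that reads a UTF-8 file as ASCII, substitutes a replacement character for every byte it cannot interpret, and then saves the result. The sample text is invented for the example.

```python
original = "naïve café"
utf8_bytes = original.encode("utf-8")          # what is stored on disk

# Step 1: the app opens the file with the wrong encoding and replaces
# every byte it cannot interpret with U+FFFD (the replacement character).
misread = utf8_bytes.decode("ascii", errors="replace")

# Step 2: the app saves what it has in memory, overwriting the file.
resaved_bytes = misread.encode("utf-8")

# Step 3: reopening with the correct encoding no longer helps.
recovered = resaved_bytes.decode("utf-8")
print(recovered)
print(recovered == original)
```

The original bytes for the accented letters are gone from the saved file, so no later choice of encoding can bring them back.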

WHAT IS UNICODE?
 Global, unified “solution” to the character encoding Tower of Babel
 One big encoding table for all the world’s characters
 All linguistic symbols have a unique, defined “code point”
 Capacity for more than 1 million characters

UNICODE CONSORTIUM
 Non-profit corporation with
global members from industry,
government, academia, and
other NGOs
 Approves new characters for inclusion in the official Unicode Standard
 Works closely with W3C and
ISO

MORE ON UNICODE
 Abstract characters, not
glyphs
 Broken into planes (each with 65,536 code points):
 Basic Multilingual Plane + 16 other planes
 Room for more than 1 million individual characters
 A code point is NOT a specific binary encoding of that number (UTF-8 encodes it differently from UTF-16)
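
The difference between a code point and its binary encoding can be sketched as follows. The helper `describe` and the sample characters are assumptions for the example; the plane number is simply the code point divided by 65,536.

```python
def describe(ch: str) -> dict[str, object]:
    """Report a character's code point, its plane, and two binary encodings."""
    cp = ord(ch)
    return {
        "code_point": f"U+{cp:04X}",
        "plane": cp >> 16,                       # 65,536 code points per plane
        "utf8": ch.encode("utf-8").hex(),
        "utf16": ch.encode("utf-16-be").hex(),
    }

for ch in "Aé€😀":
    print(ch, describe(ch))

# 17 planes of 65,536 code points each.
print(17 * 65536)
```

The code point stays the same in every row, while the UTF-8 and UTF-16 byte sequences differ in both value and length.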

VERSION 9.0 (JUNE 21 2016)
 Adds exactly 7,500 characters, for
a total of 128,172 characters:
 Osage, a Native American language
 Nepal Bhasa, a language of Nepal
 Fulani and other African languages
 Tangut, a major historic script of China
 72 emoji characters, such as new
smilies and people, animals and nature,
and food and drink

STILL NOT UBIQUITOUS!
 Pre-Unicode encodings very much still in use.
 Legacy operating systems
 Popular applications
 MS Office Products
 And even within Unicode, nightmares still
possible (UTF-?)

LIMIT YOUR APPLICATIONS
 Every app in the chain has the potential to corrupt the file.
 Make sure nobody opens the file “just to take a look.”

USE UTF-8
 For websites and
mobile apps, almost
always the right
choice
 If a resource uses a different encoding, use iconv or a similar tool to convert it
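
Where iconv is not at hand, the same conversion can be sketched in Python. This assumes the source encoding is already known (here Windows-1252); the function name `convert_to_utf8` and the file names are hypothetical.

```python
import tempfile
from pathlib import Path

def convert_to_utf8(src: Path, dst: Path, source_encoding: str) -> None:
    """Re-encode src, known to be in source_encoding, as UTF-8 at dst."""
    text = src.read_bytes().decode(source_encoding)  # strict: raises on bad bytes
    dst.write_bytes(text.encode("utf-8"))

# Demo with hypothetical file names in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    legacy = Path(tmp) / "legacy.txt"
    fixed = Path(tmp) / "fixed.txt"
    legacy.write_bytes(b"caf\xe9")             # "café" stored as Windows-1252
    convert_to_utf8(legacy, fixed, "cp1252")
    converted = fixed.read_bytes()

print(converted)
```

Writing to a separate destination file keeps the original untouched, so a wrong guess about the source encoding does not destroy anything. The command-line counterpart is iconv with its `-f` (from) and `-t` (to) options.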

KNOW YOUR METADATA
<!-- Older HTML 4 style declaration -->
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<!-- Shorter HTML5 declaration -->
<head>
<meta charset="UTF-8">
</head>

KNOW THE DIFFERENCE BETWEEN
CHARACTERS
AND GLYPHS
 Technically, Unicode encodes characters, not glyphs or fonts
 Characters can be thought of as the abstract base shape, while glyphs and fonts are particular appearances of those characters, including combinations of “root” characters that appear as one symbol, like the é
 This distinction can be important when you are diagnosing a character display problem, but the boundary can be fuzzy. Ä, for example, is a complete character with its own unique code point, but it can also be stored as two code points that combine the base character A with a combining umlaut
 You may have the correct encoding, but the particular font you are using may not have the appropriate glyphs to display the encoded character
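
The two ways of storing Ä can be compared with Python's standard `unicodedata` module. This is a sketch; the helper `same_text` is an assumption for the example and simply normalizes both strings to the composed form (NFC) before comparing.

```python
import unicodedata

precomposed = "\u00C4"   # Ä as a single code point
decomposed = "A\u0308"   # A followed by a combining diaeresis

def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to composed form (NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(len(precomposed), len(decomposed))   # different numbers of code points
print(precomposed == decomposed)           # a naive comparison treats them as different
print(same_text(precomposed, decomposed))  # equal once normalized
```

Both strings look identical on screen, so a mismatch like this tends to surface as a failed search, a duplicate key, or an unexpected sort order rather than as visible garbage.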

CHECK AND CONVERT ENCODING
 Some text editors can guess a file’s encoding, and standalone utilities (like iconv) can convert it
 Libraries are available (Mozilla Universal Charset Detector, International Components for Unicode)
 Detectors can often guess correctly, but they are imperfect
 Some tools allow you to check large sets of files in batches
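
A full detector is beyond a short example, but one narrow batch check needs only the standard library: testing whether each file decodes as strict UTF-8. This sketch does not guess what a failing file's real encoding is, and the function names are made up for the example.

```python
from pathlib import Path

def is_valid_utf8(data: bytes) -> bool:
    """True if the bytes decode as UTF-8 without errors."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

def find_non_utf8(paths: list[Path]) -> list[Path]:
    """Return the files whose contents are not valid UTF-8."""
    return [p for p in paths if not is_valid_utf8(p.read_bytes())]

print(is_valid_utf8("café".encode("utf-8")))   # proper UTF-8
print(is_valid_utf8(b"caf\xe9"))               # same word saved as Windows-1252
```

A call such as `find_non_utf8(list(Path("content").rglob("*.txt")))`, where `content` is a hypothetical folder, would list the files that need a closer look. Pure ASCII files also pass, since ASCII is a subset of UTF-8.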

UTF-8 WITH BOM?
 BOM = Byte Order Mark
 Essentially a signal to the receiver of a message that the string is Unicode
 Can be prepended to files by otherwise “neutral” apps like Windows Notepad
 Can trip up various programming languages and introduce garbage (PHP, for example)
 Could show up in a text editor (if misinterpreted) as this series of characters: ï»¿
 Use an editor (such as Sublime Text) or an encoding converter to convert to straight UTF-8
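
The three stray characters are simply the UTF-8 BOM bytes read as Windows-1252. A minimal sketch of detecting and removing the BOM follows; the helper `strip_utf8_bom` and the sample file content are hypothetical, while `codecs.BOM_UTF8` and the `utf-8-sig` codec are part of Python's standard library.

```python
import codecs

def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 byte order mark, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

file_bytes = codecs.BOM_UTF8 + b"<?php echo 'hi';"   # hypothetical file content

print(codecs.BOM_UTF8.hex(" "))            # the three BOM bytes
print(codecs.BOM_UTF8.decode("cp1252"))    # how they look when misread
print(strip_utf8_bom(file_bytes))

# The "utf-8-sig" codec drops a leading BOM while decoding.
print(file_bytes.decode("utf-8-sig"))
```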

SPREADSHEET TIP
 Careful with CSV and Excel
 Excel often mangles CSV
encoding
 Use Google Docs (or a Mac) to save the CSV as Excel and then convert back to CSV
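
When the CSV is generated by your own code, another commonly used workaround is to write it as UTF-8 with a BOM, which Excel generally treats as a hint that the file is UTF-8. Whether that holds for a particular Excel version is an assumption to verify. The sketch below uses only the standard library; `rows_to_csv_bytes` and the sample rows are made up for the example.

```python
import csv
import io

def rows_to_csv_bytes(rows: list[list[str]]) -> bytes:
    """Serialize rows as CSV and encode as UTF-8 with a leading BOM."""
    buffer = io.StringIO(newline="")        # let the csv module control line endings
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue().encode("utf-8-sig")

data = rows_to_csv_bytes([["name", "city"], ["Zoë", "Köln"]])
print(data[:3].hex(" "))                    # the BOM bytes at the start of the file
print(data.decode("utf-8-sig"))
```

This is the one place where a BOM tends to help rather than hurt, so it is worth keeping such files separate from anything that will be fed to a BOM-sensitive program.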

TOOLS
 Will post to our
discussion page at
the Meetup site.
 Add your own!

DISCUSSION
 Questions?
 Tips?
 Horror Stories?

THANK YOU!
Merci – Gracias – Danke
Grazie – Obrigado
‫شكرا‬
谢谢
감사합니다
ありがとう
www.mtmlinguasoft.com
