This document discusses character encoding and localization nightmares. It begins with examples of encoding issues causing problems like corrupted documents and data loss. It then explains the history of character encoding as the need for encoding non-English languages grew. This led to a "Tower of Babel" effect with different encodings representing different characters. The document outlines the rise of Unicode as a unified standard and provides four rules of thumb to avoid encoding nightmares, such as limiting applications and using UTF-8 encoding when possible. It concludes with tricks and tools for checking, converting, and working with encodings.
2. PHILADELPHIA SOFTWARE
LOCALIZATION MEETUP
Welcome to our kickoff event!
For more information, visit the meetup site at:
https://www.meetup.com/Philadelphia-Software-Localization-Meetup/
3. PLAN OF TALK
Encoding Nightmares
Character Encoding and the Modern Tower of
Babel
Rise of Unicode
Rules of Thumb to Avoid Nightmares
Tricks of the Trade
Discussion
10. BINARY LANGUAGE
The Bit, Two States (0, 1)
Represented by switches “on” (1) or
“off” (0) (Yes, No)
Grouped Together, Represent More
States
n bits = 2^n states
8 bits = 1 byte = 256 states
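A minimal Python sketch of the arithmetic above. The helper name `states` is made up for illustration.

```python
def states(n_bits: int) -> int:
    """Number of distinct values that n_bits bits can represent (2^n)."""
    return 2 ** n_bits


# 1 bit is a single switch; 7 bits is the ASCII range; 8 bits is one byte.
for n in (1, 7, 8, 16):
    print(f"{n:>2} bits -> {states(n):>6} states")
```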
11. BINARY CHARACTER ENCODING
ASCII Character Encoding
Associate Binary string with
English, letters, numbers, etc.
How Many Needed?
Used 128 distinct binary
numbers (0 to 127), each mapped to a
member of the ASCII character
set
Defined in the ASCII “Code
Page”
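As a small illustration of that mapping, the Python sketch below looks up the ASCII numbers behind a sample string. The sample text is arbitrary.

```python
text = "Hello, 2016!"

# ord() returns the number assigned to each character.
codes = [ord(ch) for ch in text]

# Encoding as ASCII produces exactly one byte per character.
ascii_bytes = text.encode("ascii")

# Every ASCII code fits in 7 bits, i.e. is below 128.
fits_in_7_bits = all(code < 128 for code in codes)

# The letter "A" is number 65, written here as a 7-bit binary string.
a_in_binary = format(ord("A"), "07b")

print(codes)
print(ascii_bytes, fits_in_7_bits, a_in_binary)
```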
12. EUROPEAN LANGUAGES NEED
MORE SPACE
German, French, other
languages needed more
than 128 characters
Started to use the 8th
bit (doubles the
possibilities)
256 spaces in these 8
bit character maps
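The sketch below shows what using the 8th bit means in practice: a byte value above 127 is not valid ASCII, and each 8-bit character map assigns it its own character. The four code pages are an arbitrary selection from Python's built-in codecs.

```python
raw = bytes([0xE9])  # 233: only representable once the 8th bit is in use

# Strict ASCII rejects any byte with the 8th bit set.
try:
    raw.decode("ascii")
    valid_ascii = True
except UnicodeDecodeError:
    valid_ascii = False

# The same byte, looked up in four different 8-bit character maps.
decoded = {name: raw.decode(name) for name in ("latin-1", "cp437", "cp1251", "koi8-r")}

print(valid_ascii)
for name, ch in decoded.items():
    print(f"{name:>8}: {ch}")
```

In this sample each code page yields a different character for the same byte, which is the root of the problems described later in the talk.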
13. CHINESE, JAPANESE, KOREAN (CJK)
NEED EVEN MORE
In Chinese, about 2,000 distinct characters
is often considered a minimum
threshold for literacy. Large dictionaries
list 40,000 or more characters, most of
them found mainly in rare, historical
literature.
Japanese uses about 2,000 characters
in everyday writing, mixing its own
phonetic scripts with
ideographic characters borrowed from
Chinese
Modern Korean tends toward more
phonetic language and relies much less
on the broader set of Chinese
characters
14. DOUBLE BYTE CHARACTER
ENCODINGS
Two Bytes, 16 Bits
2^16 = 65,536 possible
characters
some bits used as signals, so
can’t actually store 65,000 total
https://r12a.github.io/scripts/tutorial/part2/ (Creative
Commons license)
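A short Python sketch of a double-byte encoding in action, using Shift JIS as the example (an assumption: the slide does not name a specific encoding). The helper `byte_length` is hypothetical.

```python
def byte_length(ch: str, encoding: str = "shift_jis") -> int:
    """How many bytes one character occupies in the given encoding."""
    return len(ch.encode(encoding))


for ch in ("A", "日"):
    encoded = ch.encode("shift_jis")
    print(ch, byte_length(ch), encoded.hex())

# The first byte of the two-byte character has its high bit set. That bit is
# one of the "signals" telling the decoder that a second byte follows.
lead_byte = "日".encode("shift_jis")[0]
print(lead_byte >= 0x80)
```

ASCII characters still take one byte, so byte values reserved as lead-byte signals cannot also stand for characters on their own. That is why the usable total is below 65,536.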
15. NUMBER OF ENCODINGS
EXPLODE
ISO 646, ASCII, EBCDIC, CP37, CP930, CP1047, ISO 8859-1,
ISO 8859-2, ISO 8859-3, ISO 8859-4, ISO 8859-5, ISO 8859-6,
ISO 8859-7, ISO 8859-8, ISO 8859-9, ISO 8859-10, ISO 8859-11,
ISO 8859-12, ISO 8859-13, ISO 8859-14, ISO 8859-15, CP437,
CP720, CP737, CP850, CP852, CP855, CP857, CP858, CP860, CP861,
CP862, CP863, CP865, CP866, CP869, CP872, Windows-1250,
Windows-1251, Windows-1252, Windows-1253, Windows-1254,
Windows-1255, Windows-1256, Windows-1257, Windows-1258,
Mac OS Roman, Shift JIS, GB 2312, GBK, Taiwan Big5, KOI8-R,
KOI8-U, KOI7 …
16. 1980S: THE COMPUTING TOWER
OF BABEL
Same binary sequence
represents entirely different
characters
Sharing documents across
borders becomes very difficult
Unintelligible Files (common
experience during early days of
web)
Hard to create a document
containing multiple languages.
Double-byte encodings
increase likelihood of and add
17. THE NIGHTMARE
If you open and
save a file with the
wrong character
encoding, you can
change it
permanently.
Important data
may then be
irretrievable.
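One way this can happen, sketched in Python under an assumption: the application reading the file expects plain ASCII and silently substitutes a placeholder for every byte it cannot interpret. The sample text is arbitrary.

```python
original = "Málaga, Köln, 東京"
utf8_bytes = original.encode("utf-8")  # what is actually on disk

# "Open": the app assumes ASCII and replaces unreadable bytes with U+FFFD.
opened = utf8_bytes.decode("ascii", errors="replace")

# "Save": the placeholders are written back over the original bytes.
saved = opened.encode("utf-8")

# Reading the saved file correctly no longer helps.
recovered = saved.decode("utf-8")
print(recovered)
print(recovered == original)
```

Once the placeholders are saved, the original bytes are gone from the file, and no later choice of encoding brings them back.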
19. WHAT IS UNICODE?
Global, unified “solution” to
character encoding Tower of Babel
One big encoding table for all
world’s characters
All linguistic symbols have a
unique, defined “code point”
Capacity for more than 1 million characters
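To make "code point" concrete, the Python sketch below prints the code point and official name of a few sample characters, using the standard `unicodedata` module. The sample characters are arbitrary.

```python
import unicodedata

# Each character has exactly one code point, conventionally written U+XXXX.
for ch in ("A", "é", "中", "😀"):
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")

# Code points run from U+0000 to U+10FFFF, which is where the
# "more than 1 million" capacity comes from.
MAX_CODE_POINT = 0x10FFFF
total_code_points = MAX_CODE_POINT + 1
print(f"{total_code_points:,}")
```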
20. UNICODE CONSORTIUM
Non-profit corporation with
global members from industry,
government, academia, and
other NGOs
Approves new characters for
registration as official Unicode characters
Works closely with W3C and
ISO
21. MORE ON UNICODE
Abstract characters, not
glyphs
Broken Into Planes (each
with 65,536 characters):
Basic Multilingual Plane +
16 other planes
Room for more than 1
million individual characters
NOT a specific binary
encoding of that number
(UTF-8 differs from UTF-16)
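The Python sketch below illustrates both points: which plane a character falls in, and how the same code point becomes different bytes in UTF-8 and UTF-16. The helper `plane` is a made-up name, and big-endian UTF-16 is chosen only to keep the output free of a byte order mark.

```python
def plane(ch: str) -> int:
    """Plane number of a character: 0 is the Basic Multilingual Plane."""
    return ord(ch) >> 16  # each plane holds 0x10000 (65,536) code points


for ch in ("A", "é", "中", "😀"):
    utf8 = ch.encode("utf-8")
    utf16 = ch.encode("utf-16-be")
    print(f"U+{ord(ch):04X} plane {plane(ch)}  utf-8={utf8.hex()}  utf-16={utf16.hex()}")
```

The code point stays the same in every row; only the byte sequences differ between the two encodings.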
22. VERSION 9.0 (JUNE 21 2016)
Adds exactly 7,500 characters, for
a total of 128,172 characters:
Osage, a Native American language
Nepal Bhasa, a language of Nepal
Fulani and other African languages
Tangut, a major historic script of China
72 emoji characters, such as new
smilies and people, animals and nature,
and food and drink
23. STILL NOT UBIQUITOUS!
Pre-Unicode encodings very much still in use.
Legacy operating systems
Popular applications
MS Office Products
And even within Unicode, nightmares still
possible (UTF-?)
25. LIMIT YOUR APPLICATIONS
Every app in chain
has potential to
corrupt.
Make sure nobody
opens the file “just to
take a look.”
26. USE UTF-8
For websites and
mobile apps, almost
always the right
choice
If resource uses
different encoding,
use ICONV or similar
tool to convert
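On the command line, a conversion typically looks like `iconv -f ISO-8859-1 -t UTF-8 in.txt > out.txt`. For cases where a script is more convenient, here is a rough Python equivalent. The function name `convert_to_utf8` and the file names are hypothetical, and the source encoding must already be known.

```python
import tempfile
from pathlib import Path


def convert_to_utf8(src: Path, dst: Path, source_encoding: str) -> None:
    """Re-encode a text file as UTF-8, writing to a separate output file."""
    # Strict decoding raises UnicodeDecodeError if the bytes are invalid
    # in the claimed encoding.
    text = src.read_bytes().decode(source_encoding)
    dst.write_bytes(text.encode("utf-8"))


# Usage example with a throwaway Latin-1 file.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "legacy.txt"
    dst = Path(tmp) / "converted.txt"
    src.write_bytes("Grüße aus Köln".encode("latin-1"))
    convert_to_utf8(src, dst, "latin-1")
    print(dst.read_bytes().decode("utf-8"))
```

Writing to a separate output file keeps the original intact in case the assumed source encoding turns out to be wrong.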
28. KNOW THE DIFFERENCE BETWEEN
CHARACTERS
AND GLYPHS
Technically, Unicode encodes characters, not
glyphs or fonts
Characters can be thought of as the base shape,
while glyphs and fonts are particular
appearances of those characters, including
combinations of “root” characters that appear as
one symbol, like the é
This distinction can be important when you are
diagnosing a character display problem, but the
boundary can be fuzzy. Ä, for example, is
a complete character with its own unique code
point, but it can also be stored as two code
points, which combine the base character A with
a combining umlaut
You may have the correct encoding, but the
particular font you are using to display the
characters may not have the appropriate glyphs
to display the encoded character.
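The two ways of storing Ä can be seen directly in Python. This sketch uses the standard `unicodedata` module to convert between the single-code-point form and the base-plus-combining-mark form.

```python
import unicodedata

composed = "\u00C4"     # Ä as one code point
decomposed = "A\u0308"  # A followed by a combining diaeresis (umlaut)

# They usually look identical on screen but are different strings.
print(composed == decomposed, len(composed), len(decomposed))

# Normalization converts between the two forms.
nfc = unicodedata.normalize("NFC", decomposed)  # compose where possible
nfd = unicodedata.normalize("NFD", composed)    # decompose where possible
print(nfc == composed, nfd == decomposed)
```

Normalizing both sides to the same form before comparing or searching avoids mismatches between text that looks identical.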
30. CHECK AND CONVERT ENCODING
Some text editors and standalone utilities (like
iconv) can guess or convert the encoding
Libraries available (Mozilla Universal Charset
Detector, International Components for Unicode)
Can often guess correctly, but they are imperfect
Some tools allow you to check large sets of files
in batches
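To show why guessing is imperfect, here is a deliberately naive detector in Python. It is not how the Mozilla detector or ICU work: it simply returns the first candidate encoding that decodes the bytes without error. The function name `guess_encoding` and the candidate list are assumptions.

```python
from typing import Optional, Sequence


def guess_encoding(
    data: bytes,
    candidates: Sequence[str] = ("ascii", "utf-8", "cp1252"),
) -> Optional[str]:
    """Return the first candidate that decodes data cleanly, else None."""
    for name in candidates:
        try:
            data.decode(name)  # strict decoding: any invalid byte raises
        except UnicodeDecodeError:
            continue
        return name
    return None


print(guess_encoding(b"plain text"))
print(guess_encoding("café".encode("utf-8")))
print(guess_encoding("café".encode("cp1252")))
```

A clean decode only proves the bytes are valid in that encoding, not that it is the one the author used. Most byte sequences are valid in several 8-bit encodings at once, so the order of the candidates decides the answer.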
31. UTF-8 WITH BOM?
BOM = Byte Order Mark
Essentially a signal to the receiver of the message
that the string is Unicode
Can be prepended to binary strings by
otherwise “neutral” apps like Windows Notepad
Can trip up various programming languages
and introduce garbage (PHP, for example)
Could show up in a text editor (if
misinterpreted) as a series of characters (ï»¿ when read as Windows-1252)
Use an editor (such as Sublime Text) or
encoding converter to convert to straight UTF-8
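A small Python sketch of what the UTF-8 BOM looks like at the byte level and one way to remove it. The helper name `strip_bom` is hypothetical. `codecs.BOM_UTF8` and the `utf-8-sig` codec are part of the standard library.

```python
import codecs

with_bom = codecs.BOM_UTF8 + "hello".encode("utf-8")  # bytes EF BB BF + text

# Plain UTF-8 decoding keeps the BOM as an invisible U+FEFF character.
as_utf8 = with_bom.decode("utf-8")

# Misread as Windows-1252, the three BOM bytes become visible garbage.
as_cp1252 = with_bom.decode("cp1252")

# The "utf-8-sig" codec drops a leading BOM if one is present.
as_utf8_sig = with_bom.decode("utf-8-sig")


def strip_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM ("straight" UTF-8)."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


print(repr(as_utf8), repr(as_cp1252), repr(as_utf8_sig))
print(strip_bom(with_bom))
```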

32. SPREADSHEET TIP
Careful with CSV and Excel
Excel often mangles CSV
encoding
Use Google Docs (or a Mac) to
save the CSV as Excel and then
convert back to CSV
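Whichever route the file takes, it is worth checking that the CSV survived the round trip. The Python sketch below compares a CSV before and after and reports the rows that changed. The helper names and the sample data are hypothetical, and the "mangled" file is simulated by decoding UTF-8 bytes as Windows-1252 and saving them again.

```python
import csv
import io
from itertools import zip_longest


def read_rows(data: bytes, encoding: str = "utf-8") -> list:
    """Parse CSV bytes into a list of rows."""
    text = data.decode(encoding)
    return list(csv.reader(io.StringIO(text, newline="")))


def changed_rows(before: list, after: list) -> list:
    """Indexes of rows that differ between two parsed CSV files."""
    return [i for i, (a, b) in enumerate(zip_longest(before, after)) if a != b]


original = "name,city\r\nJosé,Köln\r\n".encode("utf-8")

# Simulated damage: a tool read the UTF-8 bytes as Windows-1252 and re-saved.
mangled = original.decode("cp1252").encode("utf-8")

print(changed_rows(read_rows(original), read_rows(mangled)))
print(read_rows(mangled)[1])
```

Rows containing only ASCII compare equal, so the check points straight at the rows with accented or non-Latin characters.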
33. TOOLS
Will post to our
discussion page at
the Meetup site.
Add your own!
2016 June 21
Unicode 9.0 adds exactly 7,500 characters, for a total of 128,172 characters. These additions include six new scripts and 72 new emoji characters.
The new scripts and characters in Version 9.0 add support for lesser-used languages worldwide, including:
Osage, a Native American language
Nepal Bhasa, a language of Nepal
Fulani and other African languages
The Bravanese dialect of Swahili, used in Somalia
The Warsh orthography for Arabic, used in North and West Africa
Tangut, a major historic script of China
Important symbol additions include:
19 symbols for the new 4K TV standard
72 emoji characters, such as new smilies and people, animals and nature, and food and drink
-charset Photoshop=CHARSET"
-charset or -L PDF
XML declaration
<?xml version="1.0" encoding="UTF-8"?>
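A sketch of how the declaration is used in practice, with Python's standard `xml.etree.ElementTree`. The parser reads the declared encoding to decode the bytes, and the document can then be written back out as UTF-8. The sample document is made up, and `xml_declaration=True` assumes Python 3.8 or later.

```python
import xml.etree.ElementTree as ET

# A Latin-1 document: the declaration tells the parser how to read byte F6.
latin1_doc = b'<?xml version="1.0" encoding="ISO-8859-1"?><city>K\xf6ln</city>'
city = ET.fromstring(latin1_doc)
print(city.text)

# Serialize again as UTF-8, with a matching declaration.
utf8_doc = ET.tostring(city, encoding="UTF-8", xml_declaration=True)
print(utf8_doc)
```

If the declaration and the actual bytes disagree, the parser will either fail or produce the wrong characters, so the two need to be changed together.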
Mention special character codes
https://dev.w3.org/html5/html-author/charref
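As a quick illustration of character references, the Python sketch below decodes three ways of writing é in HTML (named, decimal, and hexadecimal) using the standard `html` module. The sample strings are arbitrary.

```python
import html

# Named, decimal, and hexadecimal references to the same character.
references = ["&eacute;", "&#233;", "&#xE9;"]
decoded = [html.unescape(ref) for ref in references]
print(decoded)

# Going the other way: write any character as a numeric reference.
numeric = f"&#{ord('é')};"
print(numeric)
```

Because references use only ASCII characters, they survive being passed through tools that do not handle the page's encoding correctly.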