ENCODING
NIGHTMARES And how to avoid them
PHILADELPHIA SOFTWARE
LOCALIZATION MEETUP
 Welcome to our kickoff event!
 For more information, visit the meetup site at:
 https://www.meetup.com/Philadelphia-Software-Localization-Meetup/
PLAN OF TALK
 Encoding Nightmares
 Character Encoding and the Modern Tower of
Babel
 Rise of Unicode
 Rules of Thumb to Avoid Nightmares
 Tricks of the Trade
 Discussion
TAIWANESE WEBSITE FAIL
DZONGKHA (BHUTANESE) AS
WINDOWS-1252
CORRUPTED DOCUMENT, DATA
LOSS
ENCODING NIGHTMARES CAN LEAD
TO …
 Confusion
 Missed deadlines
 Software Bugs
 Data corruption
 Embarrassment
CHARACTER ENCODING
AND THE MODERN TOWER
OF BABEL
BINARY LANGUAGE
 The Bit, Two States (0, 1)
 Represented by switches “on” (1) or
“off” (0) (Yes, No)
 Grouped Together, Represent More
States
 n bits = 2^n states
 8 bits = 1 byte = 256 states
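The doubling rule on this slide can be checked in a few lines of Python. This is only a minimal sketch; the function name `states` is illustrative and not part of the talk.

```python
def states(n_bits: int) -> int:
    """Number of distinct values that n_bits on/off switches can represent."""
    return 2 ** n_bits

# Each extra bit doubles the number of states.
for n in (1, 2, 7, 8, 16):
    print(f"{n:2d} bits -> {states(n):6,d} states")
```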
BINARY CHARACTER ENCODING
 ASCII Character Encoding
 Associates a binary string with
each English letter, digit, punctuation mark, etc.
 How Many Needed?
 Uses 128 distinct binary
numbers (7 bits), each mapped to a
member of the ASCII character
set
 Defined in the ASCII “Code
Page”
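As a rough illustration of the ASCII mapping, the sketch below prints the number and 7-bit pattern behind a few characters and shows that non-English letters do not fit. The helper name `fits_in_ascii` is an assumption made for this example.

```python
# Every ASCII character maps to a number from 0 to 127, which fits in 7 bits.
for ch in "Az09!":
    code = ord(ch)
    print(ch, code, format(code, "07b"))

def fits_in_ascii(text: str) -> bool:
    """True if every character of text has an ASCII code."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False

print(fits_in_ascii("Hello"), fits_in_ascii("Grüße"))
```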
EUROPEAN LANGUAGES NEED
MORE SPACE
 German, French, other
languages needed more
than 128 characters
 Started to use the 8th
bit (doubles the
possibilities)
 256 slots in these 8-bit
character maps
CHINESE, JAPANESE, KOREAN (CJK)
NEED EVEN MORE
 In Chinese, about 2,000 distinct characters
is often considered a minimum
threshold for literacy. A few thousand
characters are in common use, and large
dictionaries list more than 40,000, most of them
rare or historical.
 Japanese uses about 2,000 kanji
(ideographic characters borrowed from
the Chinese) in everyday writing,
mixed with its own two phonetic scripts,
hiragana and katakana
 Modern Korean is written mostly in the
phonetic Hangul script and relies much less
on the broader set of Chinese
characters
DOUBLE BYTE CHARACTER
ENCODINGS
 Two Bytes, 16 Bits
 216 = 65,536 possible
characters
 some bits used as signals, so
can’t actually store 65,000 total
https://r12a.github.io/scripts/tutorial/part2 / (Creative
Commons license)
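To make the "signal" idea concrete, here is a small sketch using Shift JIS, one of the encodings named later in the talk. Choosing Shift JIS and this sample string is an assumption for illustration; other double-byte encodings differ in detail.

```python
text = "A日本語"
encoded = text.encode("shift_jis")

# The ASCII letter takes one byte; each kanji takes two.
print(len(text), "characters ->", len(encoded), "bytes")

# The first byte of each two-byte character has its high bit set.
# That high bit is the signal that a second byte follows.
print([hex(b) for b in encoded])
```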
NUMBER OF ENCODINGS
EXPLODES
ISO 646, ASCII, EBCDIC, CP37, CP930, CP1047, ISO 8859-1,
ISO 8859-2, ISO 8859-3, ISO 8859-4, ISO 8859-5, ISO 8859-6,
ISO 8859-7, ISO 8859-8, ISO 8859-9, ISO 8859-10, ISO 8859-11,
ISO 8859-13, ISO 8859-14, ISO 8859-15, CP437, CP720, CP737,
CP850, CP852, CP855, CP857, CP858, CP860, CP861, CP862,
CP863, CP865, CP866, CP869, CP872, Windows-1250,
Windows-1251, Windows-1252, Windows-1253, Windows-1254,
Windows-1255, Windows-1256, Windows-1257, Windows-1258,
Mac OS Roman, Shift JIS, GB 2312, GBK, Taiwan Big5, KOI8-R,
KOI8-U, KOI7 ….
1980S: THE COMPUTING TOWER
OF BABEL
 The same binary sequence
represents entirely different
characters in different encodings
 Sharing documents across
borders becomes very difficult
 Unintelligible Files (common
experience during early days of
web)
 Hard to create a document
containing multiple languages.
 Double-byte encodings
increase likelihood of and add
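The Tower of Babel effect is easy to reproduce: the sketch below decodes one and the same byte with several of the legacy encodings listed earlier. The choice of byte (0xE4) and of encodings is just an example.

```python
raw = bytes([0xE4])  # a single byte with the 8th bit set

encodings = ("iso-8859-1", "iso-8859-5", "iso-8859-7", "cp437", "koi8-r")
decoded = {name: raw.decode(name) for name in encodings}

# One byte, a different character in every encoding.
for name, char in decoded.items():
    print(f"{name:10s} -> {char}")
```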
THE NIGHTMARE
If you open and
save a file with the
wrong character
encoding, you can
change it
permanently.
Important data
may then be
irretrievable.
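A minimal sketch of how this happens, assuming a UTF-8 file that is first misread as Windows-1252 and then saved by an app that can only write ASCII. The sample word and the two-step scenario are illustrative, not taken from the talk.

```python
original = "Grüße"
utf8_bytes = original.encode("utf-8")

# Step 1: an app misreads the UTF-8 bytes as Windows-1252 (garbled display).
mojibake = utf8_bytes.decode("cp1252")

# Still recoverable: undo the wrong step while no information has been lost.
recovered = mojibake.encode("cp1252").decode("utf-8")

# Step 2: an app that cannot represent the text substitutes "?" when saving.
damaged = original.encode("ascii", errors="replace").decode("ascii")

print(mojibake, recovered, damaged)
```

After step 2 the file on disk contains only question marks where the special letters used to be, so there is nothing left to convert back.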
RISE OF UNICODE
WHAT IS UNICODE?
 Global, unified “solution” to
character encoding tower of babel
 One big encoding table for all
world’s characters
 All linguistic symbols have a
unique, defined “code point”
 Capacity for more than 1 million characters
UNICODE CONSORTIUM
 Non-profit corporation with
global members from industry,
government, academia, and
other NGOs
 Approves new characters for
inclusion in the Unicode Standard
 Works closely with W3C and
ISO
MORE ON UNICODE
 Abstract characters, not
glyphs
 Broken into 17 planes (each
with 65,536 code points):
 Basic Multilingual Plane +
16 other planes
 Room for more than 1
million individual characters
 A code point is a number, NOT a specific binary
encoding of that number
(UTF-8 differs from UTF-16)
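The difference between a code point and its encoding can be sketched as follows. The helper name `describe` and the sample characters are assumptions for this example.

```python
TOTAL_CODE_POINTS = 17 * 65_536  # 17 planes of 65,536 code points each

def describe(ch: str) -> tuple[str, str, str]:
    """Return the code point, UTF-8 bytes, and UTF-16 (big-endian) bytes as text."""
    return (
        f"U+{ord(ch):04X}",
        ch.encode("utf-8").hex(" "),
        ch.encode("utf-16-be").hex(" "),
    )

# Same code point, different bytes depending on the encoding form.
for ch in ("A", "é", "谢", "😀"):
    print(ch, *describe(ch), sep=" | ")
```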
VERSION 9.0 (JUNE 21 2016)
 Adds exactly 7,500 characters, for
a total of 128,172 characters:
 Osage, a Native American language
 Nepal Bhasa, a language of Nepal
 Fulani and other African languages
 Tangut, a major historic script of China
 72 emoji characters, such as new
smilies and people, animals and nature,
and food and drink
STILL NOT UBIQUITOUS!
 Pre-Unicode encodings very much still in use.
 Legacy operating systems
 Popular applications
 MS Office Products
 And even within Unicode, nightmares still
possible (UTF-?)
4 RULES OF THUMB
LIMIT YOUR APPLICATIONS
 Every app in the chain
has the potential to
corrupt the file.
 Make sure nobody
opens the file “just to
take a look.”
USE UTF-8
 For websites and
mobile apps, almost
always the right
choice
 If a resource uses a
different encoding,
use iconv or a similar
tool to convert it
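Where iconv is not at hand, the same conversion can be sketched in Python. The function name `convert_to_utf8`, the file names, and the Windows-1252 source encoding are assumptions for illustration; the source encoding has to be known, not guessed. On the command line, the rough equivalent would be `iconv -f WINDOWS-1252 -t UTF-8 legacy.txt > utf8.txt`.

```python
import tempfile
from pathlib import Path

def convert_to_utf8(src: Path, dst: Path, source_encoding: str) -> None:
    """Re-encode a text file as UTF-8, failing loudly on invalid bytes."""
    text = src.read_bytes().decode(source_encoding)  # strict decoding
    dst.write_bytes(text.encode("utf-8"))

# Demo on throwaway files so nothing real is overwritten.
with tempfile.TemporaryDirectory() as tmp:
    legacy = Path(tmp) / "legacy.txt"
    converted = Path(tmp) / "utf8.txt"
    legacy.write_bytes("Grüße".encode("cp1252"))
    convert_to_utf8(legacy, converted, "cp1252")
    result = converted.read_bytes()

print(result)
```

Writing to a separate destination file keeps the original untouched, in line with the first rule of thumb.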
KNOW YOUR METADATA
<!-- Older HTML 4 / XHTML form -->
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<!-- Shorter HTML5 form -->
<head>
<meta charset="UTF-8">
</head>
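When diagnosing a page, it can help to read back what the markup actually declares. The following is a simplified sketch, not a real HTML parser: the regular expression and the function name `declared_charset` are assumptions, and it looks only at the first 1024 bytes, which is where the HTML standard expects the declaration.

```python
import re

# Matches both <meta charset="..."> and the older http-equiv form.
META_CHARSET = re.compile(rb'<meta[^>]*charset=["\']?([A-Za-z0-9_\-]+)', re.IGNORECASE)

def declared_charset(html_bytes: bytes, default: str = "utf-8") -> str:
    """Return the charset named in a <meta> tag, or the default if none is found."""
    match = META_CHARSET.search(html_bytes[:1024])
    return match.group(1).decode("ascii").lower() if match else default

old_style = b'<head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"></head>'
new_style = b'<head><meta charset="windows-1252"></head>'
print(declared_charset(old_style), declared_charset(new_style))
```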
KNOW THE DIFFERENCE BETWEEN
CHARACTERS
AND GLYPHS
 Technically, Unicode encodes characters, not
glyphs or fonts
 Characters can be thought of as the abstract base shapes,
while glyphs and fonts are particular
appearances of those characters, including
combinations of “root” characters that appear as
one symbol, like é
 This distinction can be important when you are
diagnosing a character display problem, but the
boundary can be fuzzy. Ä, for example, is
a complete character with its own code
point, but it can also be stored as two code
points: the base character A followed by
a combining umlaut
 You may have the correct encoding, but the
particular font you are using
may not have the glyphs needed
to display the encoded character.
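The two ways of storing Ä can be compared directly. This sketch uses Python's standard `unicodedata` module; the variable names are illustrative.

```python
import unicodedata

precomposed = "\u00C4"  # Ä as a single code point
combining = "A\u0308"   # A followed by COMBINING DIAERESIS

# Both display as the same glyph, yet the stored data differs.
same_glyph_different_data = precomposed != combining

# Normalization converts one form into the other before comparing or searching.
nfc = unicodedata.normalize("NFC", combining)    # composed form
nfd = unicodedata.normalize("NFD", precomposed)  # decomposed form

print(len(precomposed), len(combining), nfc == precomposed, nfd == combining)
```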
TRICKS OF THE TRADE
CHECK AND CONVERT ENCODING
 Some text editors guess the encoding, and standalone
utilities (like iconv) convert it
 Libraries are available for detection (Mozilla Universal Charset
Detector, International Components for Unicode)
 They can often guess correctly, but they are imperfect
 Some tools allow you to check large sets of files
in batches
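A batch check does not need a detection library if the only question is "is this valid UTF-8?". The sketch below uses strict decoding from the standard library; the function name `non_utf8_files` and the `*.txt` pattern are assumptions. It is a validity check, not a detector: pure ASCII files pass, and it cannot say which legacy encoding a failing file uses.

```python
import tempfile
from pathlib import Path

def non_utf8_files(root: Path, pattern: str = "*.txt") -> list[Path]:
    """Return files under root whose bytes are not valid UTF-8."""
    suspects = []
    for path in sorted(root.rglob(pattern)):
        try:
            path.read_bytes().decode("utf-8")
        except UnicodeDecodeError:
            suspects.append(path)
    return suspects

# Demo with one UTF-8 file and one Windows-1252 file.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "good.txt").write_bytes("Grüße".encode("utf-8"))
    (root / "bad.txt").write_bytes("Grüße".encode("cp1252"))
    flagged = [p.name for p in non_utf8_files(root)]

print(flagged)
```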
UTF-8 WITH BOM?
 BOM = Byte Order Mark
 In UTF-8, essentially a signal to the receiver
that the text is Unicode
 Can be prepended to files by
otherwise “neutral” apps like Windows Notepad
 Can trip up various programming languages
and introduce garbage (PHP, for example)
 Could show up in a text editor (if
misinterpreted) as the characters ï»¿
 Use an editor (such as Sublime Text) or
encoding converter to convert to straight UTF-8
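Removing a BOM can also be scripted. This sketch relies on Python's `utf-8-sig` codec, which drops a leading BOM if one is present; the sample PHP snippet is just a stand-in for file content.

```python
import codecs

with_bom = codecs.BOM_UTF8 + "<?php echo 1;".encode("utf-8")

seen_as_cp1252 = with_bom[:3].decode("cp1252")  # what a misreading editor shows
kept = with_bom.decode("utf-8")                 # BOM survives as U+FEFF
stripped = with_bom.decode("utf-8-sig")         # BOM removed if present
clean_bytes = stripped.encode("utf-8")          # "straight" UTF-8, no BOM

print(seen_as_cp1252, repr(kept[0]), clean_bytes)
```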
SPREADSHEET TIP
 Be careful with CSV and Excel
 Excel often mangles CSV
encoding
 Use Google Docs (or a Mac) to
save the CSV as Excel and then
convert back to CSV
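If the CSV is generated by a script, one workaround worth trying is to write it as UTF-8 with a BOM, which Excel generally uses to recognize the encoding. This is a sketch under that assumption, not something from the talk; behavior varies between Excel versions, and the file name and sample rows are made up.

```python
import csv
import tempfile
from pathlib import Path

rows = [["name", "city"], ["Zoë", "Zürich"], ["谢", "北京"]]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "export.csv"
    # "utf-8-sig" writes the BOM first, then plain UTF-8.
    with path.open("w", encoding="utf-8-sig", newline="") as handle:
        csv.writer(handle).writerows(rows)
    raw = path.read_bytes()
    # Reading with the same codec drops the BOM again.
    with path.open(encoding="utf-8-sig", newline="") as handle:
        round_trip = list(csv.reader(handle))

print(raw[:3], round_trip)
```

This is one case where the BOM from the previous slide helps rather than hurts, so it is best limited to files meant for Excel.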
TOOLS
 Will post to our
discussion page at
the Meetup site.
 Add your own!
DISCUSSION
 Questions?
 Tips?
 Horror Stories?
THANK YOU!
Merci – Gracias – Danke
Grazie – Obrigado
‫شكرا‬
谢谢
감사합니다
ありがとう
www.mtmlinguasoft.com

Avoid Encoding Nightmares with Unicode and UTF-8


Editor's Notes

  1. http://chinesehacks.com/resources/software/change-the-character-encoding-for-a-website/
  2. http://www.dzongkha.gov.bt/IT/ie-intsr.en.php (Dzongkha text displays as meaningless Latin characters)
  3. https://community.spiceworks.com/topic/1565360-email-character-encoding-problem
  4. https://creativecommons.org/licenses/by/4.0/ https://r12a.github.io/scripts/tutorial/part2 /
  5. 2016 June 21 Unicode 9.0 adds exactly 7,500 characters, for a total of 128,172 characters. These additions include six new scripts and 72 new emoji characters. The new scripts and characters in Version 9.0 add support for lesser-used languages worldwide, including: Osage, a Native American language Nepal Bhasa, a language of Nepal Fulani and other African languages The Bravanese dialect of Swahili, used in Somalia The Warsh orthography for Arabic, used in North and West Africa Tangut, a major historic script of China Important symbol additions include: 19 symbols for the new 4K TV standard 72 emoji characters, such as new smilies and people, animals and nature, and food and drink
  6. -charset Photoshop=CHARSET"  -charset or -L PDF XML declaration <?xml version="1.0" encoding="UTF-8"?> Mention special character codes https://dev.w3.org/html5/html-author/charref
  7. http://www.alanwood.net/unicode/utilities_editors.html
  8. http://coq.no/character-tables/en