ENCODING NIGHTMARES
And how to avoid them
PHILADELPHIA SOFTWARE
LOCALIZATION MEETUP
 Welcome to our kickoff event!
 For more information, visit the meetup site at:
https://www.meetup.com/Philadelphia-Software-Localization-Meetup/
PLAN OF TALK
 Encoding Nightmares
 Character Encoding and the Modern Tower of
Babel
 Rise of Unicode
 Rules of Thumb to Avoid Nightmares
 Tricks of the Trade
 Discussion
TAIWANESE WEBSITE FAIL
DZONGKHA (BHUTANESE) AS
WINDOWS-1252
CORRUPTED DOCUMENT, DATA
LOSS
ENCODING NIGHTMARES CAN LEAD
TO …
 Confusion
 Missed deadlines
 Software Bugs
 Data corruption
 Embarrassment
CHARACTER ENCODING
AND THE MODERN TOWER
OF BABEL
BINARY LANGUAGE
 The Bit, Two States (0, 1)
 Represented by switches “on” (1) or
“off” (0) (Yes, No)
 Grouped Together, Represent More
States
 n bits = 2^n states
 8 bits = 1 byte = 256 states
BINARY CHARACTER ENCODING
 ASCII Character Encoding
 Associates binary strings with
English letters, numbers, etc.
 How Many Needed?
 Uses 128 distinct 7-bit binary
numbers (0 to 127), each mapped to a
member of the ASCII character
set
 Defined in the ASCII “Code
Page”
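The mapping from characters to numbers can be inspected directly. The following is a minimal Python sketch, not part of the original talk; the helper name `ascii_table_row` is made up for illustration. It prints the numeric code of a few characters and shows that each one fits in 7 bits.

```python
def ascii_table_row(ch: str) -> tuple:
    """Return the character, its ASCII code, and the code as 7 binary digits."""
    code = ord(ch)
    if code > 127:
        # Anything above 127 is outside the ASCII character set.
        raise ValueError(f"{ch!r} is not an ASCII character")
    return ch, code, format(code, "07b")


for ch in "Aa0":
    print(ascii_table_row(ch))
```

Passing a character such as "é" raises an error, which is exactly the gap the 8-bit code pages were created to fill.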
EUROPEAN LANGUAGES NEED
MORE SPACE
 German, French, other
languages needed more
than 128 characters
 Started to use the 8th
bit (doubles the
possibilities)
 256 spaces in these 8
bit character maps
CHINESE, JAPANESE, KOREAN (CJK)
NEED EVEN MORE
 In Chinese, 2,000 distinct characters
is often considered a minimum
threshold for literacy. A few thousand
characters are in common use; comprehensive
dictionaries list more than 40,000, most of them
found only in rare, historical literature.
 Japanese uses about 2,000 ideographic
characters (kanji) borrowed from
Chinese in everyday writing, mixed with
its own phonetic scripts
(hiragana and katakana)
 Modern Korean is written mostly in the
phonetic Hangul script and relies much less
on the broader set of Chinese
characters
DOUBLE BYTE CHARACTER
ENCODINGS
 Two Bytes, 16 Bits
 216 = 65,536 possible
characters
 some bits used as signals, so
can’t actually store 65,000 total
https://r12a.github.io/scripts/tutorial/part2/ (Creative
Commons license)
NUMBER OF ENCODINGS
EXPLODES
ISO 646, ASCII, EBCDIC, CP37, CP930, CP1047, ISO 8859-1,
ISO 8859-2, ISO 8859-3, ISO 8859-4, ISO 8859-5, ISO 8859-6,
ISO 8859-7, ISO 8859-8, ISO 8859-9, ISO 8859-10, ISO 8859-11,
ISO 8859-12, ISO 8859-13, ISO 8859-14, ISO 8859-15,
CP437, CP720, CP737, CP850, CP852, CP855,
CP857, CP858, CP860, CP861, CP862, CP863, CP865, CP866,
CP869, CP872, Windows-1250, Windows-1251, Windows-1252,
Windows-1253, Windows-1254, Windows-1255,
Windows-1256, Windows-1257, Windows-1258, Mac OS
Roman, Shift JIS, GB 2312, GBK, Taiwan Big5, KOI8-R, KOI8-U,
KOI7, …
1980S: THE COMPUTING TOWER
OF BABEL
 The same binary sequence
represents entirely different
characters in different encodings
 Sharing documents across
borders becomes very difficult
 Unintelligible Files (common
experience during early days of
web)
 Hard to create a document
containing multiple languages.
 Double-byte encodings
increase likelihood of and add
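To make the "same binary sequence, different characters" problem concrete, here is a small Python sketch (an illustration added to these notes, not from the slides). It decodes one and the same byte, 0xE9, with four legacy encodings that Python's standard codecs support; the helper name `decode_each` is hypothetical.

```python
ENCODINGS = ["latin-1", "cp1251", "cp437", "koi8-r"]


def decode_each(data: bytes, encodings=ENCODINGS) -> dict:
    """Decode the same bytes once per encoding and collect the results."""
    return {name: data.decode(name) for name in encodings}


views = decode_each(b"\xe9")
for name, text in views.items():
    # One byte on disk, a different character for every code page.
    print(f"{name:8} -> {text}")
```

Nothing in the byte itself says which reading is the intended one, which is why the encoding has to travel with the file as metadata or be agreed on in advance.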
THE NIGHTMARE
If you open and
save a file with the
wrong character
encoding, you can
change it
permanently.
Important data
may then be
irretrievable.
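The following Python sketch imitates that failure under a simple assumption: an application that decodes a UTF-8 file as ASCII, substitutes a replacement character for every byte it cannot interpret, and then saves the result. The function name `open_and_resave` and the sample text are made up for illustration.

```python
def open_and_resave(data: bytes, assumed_encoding: str) -> bytes:
    """Mimic an app that decodes with the wrong encoding, then saves as UTF-8.

    Bytes the app cannot interpret become U+FFFD (the replacement character).
    """
    text = data.decode(assumed_encoding, errors="replace")
    return text.encode("utf-8")


original = "Grüße"
on_disk = original.encode("utf-8")            # what the author saved
damaged = open_and_resave(on_disk, "ascii")   # what the careless app wrote back

print(damaged.decode("utf-8"))
```

Once the file has been saved this way, the bytes for ü and ß are gone; every replacement character looks the same, so no later conversion can tell which letters they used to be.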
RISE OF UNICODE
WHAT IS UNICODE?
 Global, unified “solution” to
character encoding tower of babel
 One big encoding table for all
world’s characters
 All linguistic symbols have a
unique, defined “code point”
 Capacity for more than 1 million characters (1,114,112 code points)
UNICODE CONSORTIUM
 Non-profit corporation with
global members from industry,
government, academia, and
other NGOs
 Approves new characters for
inclusion in the Unicode Standard
 Works closely with W3C and
ISO
MORE ON UNICODE
 Abstract characters, not
glyphs
 Broken into planes (each
with 65,536 code points):
 Basic Multilingual Plane +
16 other planes
 Room for more than 1
million individual characters
 A code point is NOT a specific binary
encoding of that number
(UTF-8 differs from UTF-16)
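The difference between a code point and its binary encoding can be seen with a short Python sketch (added for illustration; the helper name `show_encodings` is hypothetical). It prints the code point of a character next to the bytes that UTF-8 and UTF-16 (big-endian) use to store it.

```python
def show_encodings(ch: str) -> dict:
    """Return the code point of ch and its bytes in two Unicode encodings."""
    return {
        "code_point": f"U+{ord(ch):04X}",
        "utf-8": ch.encode("utf-8").hex(" "),
        "utf-16-be": ch.encode("utf-16-be").hex(" "),
    }


for ch in ["A", "é", "€"]:
    print(ch, show_encodings(ch))
```

The code point stays the same in every row; only the byte sequence changes with the encoding chosen to store it.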
VERSION 9.0 (JUNE 21 2016)
 Adds exactly 7,500 characters, for
a total of 128,172 characters:
 Osage, a Native American language
 Nepal Bhasa, a language of Nepal
 Fulani and other African languages
 Tangut, a major historic script of China
 72 emoji characters, such as new
smilies and people, animals and nature,
and food and drink
STILL NOT UBIQUITOUS!
 Pre-Unicode encodings very much still in use.
 Legacy operating systems
 Popular applications
 MS Office Products
 And even within Unicode, nightmares still
possible (UTF-?)
4 RULES OF THUMB
LIMIT YOUR APPLICATIONS
 Every app in the chain
has the potential to
corrupt the file.
 Make sure nobody
opens the file “just to
take a look.”
USE UTF-8
 For websites and
mobile apps, almost
always the right
choice
 If a resource uses a
different encoding,
use iconv or a similar
tool to convert it
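On the command line, a conversion of this kind typically looks like `iconv -f WINDOWS-1252 -t UTF-8 in.txt > out.txt`. Where iconv is not available, the same idea can be sketched in Python. This is a minimal illustration that assumes the source encoding is already known; the function names `transcode` and `convert_file` are made up for this example.

```python
from pathlib import Path


def transcode(data: bytes, source_encoding: str, target_encoding: str = "utf-8") -> bytes:
    """Decode bytes from a known source encoding and re-encode them.

    Decoding is strict on purpose: a wrong source_encoding should fail
    loudly rather than silently write damaged text.
    """
    return data.decode(source_encoding).encode(target_encoding)


def convert_file(src: Path, dst: Path, source_encoding: str) -> None:
    """Write a UTF-8 copy of src to dst, leaving the original untouched."""
    dst.write_bytes(transcode(src.read_bytes(), source_encoding))


print(transcode(b"caf\xe9", "windows-1252"))
```

Writing to a separate destination file keeps the original bytes available in case the assumed source encoding turns out to be wrong.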
KNOW YOUR METADATA
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<head>
<meta charset="UTF-8">
</head>
KNOW THE DIFFERENCE BETWEEN
CHARACTERS
AND GLYPHS
 Technically, Unicode encodes characters, not
glyphs or fonts
 Characters can be thought of as the underlying units of text,
while glyphs and fonts are particular
appearances of those characters, including
combinations of “root” characters which appear as
one symbol, like the é
 This distinction can be important when you are
diagnosing a character display problem, but the
boundary can be fuzzy. Ä, for example, is
a complete character with its own code
point (U+00C4), but it can also be stored as two code
points: the base character A (U+0041) followed by
a combining diaeresis (U+0308)
 You may have the correct encoding, but the
particular font you are using to display the
characters may not have the appropriate glyphs
to display the encoded character.
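The two ways of storing Ä can be compared with Python's standard `unicodedata` module. This is a small sketch added for illustration; the helper name `same_text` is hypothetical. The two strings look identical on screen but are different sequences of code points until they are normalized.

```python
import unicodedata

composed = "\u00c4"      # Ä as one precomposed code point
decomposed = "A\u0308"   # A followed by a combining diaeresis


def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to the composed form (NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(composed == decomposed)            # a plain comparison sees two different strings
print(same_text(composed, decomposed))   # a normalized comparison treats them as equal
print(len(composed), len(decomposed))    # the stored lengths differ as well
```

When a search or a comparison fails on text that looks identical, checking the normalization form is a reasonable first step.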
TRICKS OF THE TRADE
CHECK AND CONVERT ENCODING
 Some text editors and standalone utilities can
guess the encoding; converters (like iconv) then convert it
 Libraries are available (Mozilla Universal Charset
Detector, International Components for Unicode)
 They can often guess correctly, but they are imperfect
 Some tools allow you to check large sets of files
in batches
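A very naive form of guessing can be sketched in Python by trying candidate encodings in order and keeping the first one that decodes without error. This is only an illustration of why guesses are imperfect; it is not how the libraries named above work (they use statistical models), and the function name `guess_encoding` and the candidate list are assumptions made for this example.

```python
from typing import Optional, Sequence


def guess_encoding(
    data: bytes,
    candidates: Sequence[str] = ("ascii", "utf-8", "windows-1252"),
) -> Optional[str]:
    """Return the first candidate that decodes data without error, else None.

    Order matters: strict encodings go first, because permissive 8-bit
    code pages accept almost any byte sequence.
    """
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


print(guess_encoding(b"hello"))
print(guess_encoding("é".encode("utf-8")))
print(guess_encoding(b"caf\xe9"))
```

A successful decode only proves that the bytes are valid in that encoding, not that the text is what the author wrote, so a guess should be confirmed by a person who can read the language.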
UTF-8 WITH BOM?
 BOM = Byte Order Mark
 Essentially a signal to the receiver of a message
that the string is Unicode
 Can be prepended to files by
otherwise “neutral” apps like Windows Notepad
 Can trip up various programming languages
and introduce garbage (PHP, for example)
 Could show up in a text editor (if
misinterpreted) as the characters ï»¿ at the start of the text
 Use an editor (such as Sublime Text) or an
encoding converter to convert to straight UTF-8
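Detecting and removing a UTF-8 BOM can also be scripted. The following Python sketch uses the standard `codecs.BOM_UTF8` constant (the bytes EF BB BF) and the built-in `utf-8-sig` codec; the function name `strip_utf8_bom` is made up for illustration.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 byte order mark, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


with_bom = codecs.BOM_UTF8 + "<?php echo 'hi';".encode("utf-8")

print(strip_utf8_bom(with_bom))
# The "utf-8-sig" codec drops a leading BOM while decoding.
print(with_bom.decode("utf-8-sig"))
# Misread as Windows-1252, the three BOM bytes show up as visible garbage.
print(with_bom[:3].decode("windows-1252"))
```

Input without a BOM passes through unchanged, so the function is safe to run over a whole batch of files.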
SPREADSHEET TIP
 Be careful with CSV and Excel
 Excel often mangles CSV
encoding
 Use Google Docs (or a Mac) to
save CSV as Excel and then
convert back to CSV
TOOLS
 Will post to our
discussion page at
the Meetup site.
 Add your own!
DISCUSSION
 Questions?
 Tips?
 Horror Stories?
THANK YOU!
Merci – Gracias – Danke
Grazie – Obrigado
‫شكرا‬
谢谢
당신을 감사하십시오
ありがとう
www.mtmlinguasoft.com

Avoid Encoding Nightmares with Unicode and UTF-8

Editor's Notes

  1. http://chinesehacks.com/resources/software/change-the-character-encoding-for-a-website/
  2. http://www.dzongkha.gov.bt/IT/ie-intsr.en.php (Dzongkha text displays as meaningless Latin characters)
  3. https://community.spiceworks.com/topic/1565360-email-character-encoding-problem
  4. https://creativecommons.org/licenses/by/4.0/ https://r12a.github.io/scripts/tutorial/part2/
  5. 2016 June 21 Unicode 9.0 adds exactly 7,500 characters, for a total of 128,172 characters. These additions include six new scripts and 72 new emoji characters. The new scripts and characters in Version 9.0 add support for lesser-used languages worldwide, including: Osage, a Native American language Nepal Bhasa, a language of Nepal Fulani and other African languages The Bravanese dialect of Swahili, used in Somalia The Warsh orthography for Arabic, used in North and West Africa Tangut, a major historic script of China Important symbol additions include: 19 symbols for the new 4K TV standard 72 emoji characters, such as new smilies and people, animals and nature, and food and drink
  6. -charset Photoshop=CHARSET"  -charset or -L PDF XML declaration <?xml version="1.0" encoding="UTF-8"?> Mention special character codes https://dev.w3.org/html5/html-author/charref
  7. http://www.alanwood.net/unicode/utilities_editors.html
  8. http://coq.no/character-tables/en