SlideShare a Scribd company logo
Unicode
Encoding Forms
Which one do I choose? 🤔
Why are there so many? 🤔
What happened to the text? 🤔
Let’s start from the beginning
Once upon a time, when life was simpler
(for Americans), we had ASCII
ASCII
• Designed for teleprinters in the 1960s

• 7 bit (0000000 to 1111111 in binary)

• 128 codes (0 to 127 in decimal)
0 1 0 0 0 0 0 1
0 1 0 0 0 0 1 0
0 1 0 0 0 0 1 1
A
B
C
(65 in decimal)
(66 in decimal)
(67 in decimal)
Why waste this 1 bit?
0 1 0 0 0 0 0 1
0 1 0 0 0 0 1 0
0 1 0 0 0 0 1 1
A
B
C
(65 in decimal)
(66 in decimal)
(67 in decimal)
Meanwhile, 8-bit byte were becoming common
(aka,128 more spaces for characters)
🤔
Everyone had the same idea-
Let’s Extend ASCII
And use all those 256 characters
Let there be… chaos! 😈
Code Pages
• There are too many (more than 220 DOS and Windows
code pages alone)

• IBM, Apple all introduced their own “code pages” 🤛 🤜

• Side-note: ANSI character set has no well-defined
meaning, they are a collection of 8-bit character sets
compatible with ASCII but incompatible with each other
Problems
• Most are compatible with ASCII but incompatible with
each other

• Programs need to know what code page to use in order
to display the contents correctly

• Files created on one machine may be unreadable on
another

• Even 256 characters are not enough %
Internet happened!
And also
🔥
Fax machines were not enough anymore.
Unicode
🕴
to the rescue
Unicode
• Finally everyone agreed on what code point mapped to
what character 🎉

• There’s room for over 1 million code points (characters),
though the majority of common languages fit into the first
65536 code points
How do we serialize multi-
byte characters?

💾
“Character Encoding”
UTF-32
• Simplest encoding

• Unicode supports 1,114,112 code points. We can store
them using 21 bits

• UTF-32 is 32-bit in length. Fixed length encoding

• So we take a 21-bit value and simply zero-pad the value
out to 32 bits
UTF-32
• Example

A in ASCII 01000001

A in UTF-32 00000000 00000000 00000000 01000001

or, 01000001 00000000 00000000 00000000

• So, not compatible with ASCII

• Super wasteful 

• But faster text operations (e.g character count)

• Messes where “null terminated strings” (00000000) are
expected
UTF-32
• Also, now we have to deal with “Endianness” 🤔



00000000 00000000 00000000 01000001

vs,

01000001 00000000 00000000 00000000
Endianness
• Big endian machine: Stores data big-end first. When looking at
multiple bytes, the first byte is the biggest



increasing addresses ———->

00000000 00000000 00000000 01000001

• Little endian machine: Stores data little-end first. When looking at
multiple bytes, the first byte is smallest



increasing addresses ———-> 

01000001 00000000 00000000 00000000

• Endianness does not matter if you have a single byte, because
how we read a single byte is same in all machines 😌
UTF-16
• The oldest encoding for Unicode. Often mislabeled as
"Unicode encoding”

• Variable length encoding. 2 bytes for most common
characters (BMP), 4 bytes for everything else

• The most common characters (BMP) in Unicode fits into first
65,536 code points, so it’s straightforward. Throw away top 5
zeros from 21 bit, you get UTF-16.

A 00000 00000000 01000001 (21 bit) becomes-

A 00000000 01000001 (16 bit)

• Uses “Surrogate pairs” for other characters
UTF-16
• Multi-byte encoding, so has Endianness like UTF-32

• Incompatible with ASCII

A in UTF-16 00000000 01000001

A in ASCII 01000001

• Incompatible with old systems that rely on null (null byte:
00000000) terminated strings

• Uses less space than UTF-32 in practice

• Windows API, .NET and Java environments are founded on
UTF-16, often called “wide character string”
UTF-8
• Nice & simple. This is the kid that everyone loves *

• Backward compatible with ASCII 🙌

• 0 to 127 code points are stored as regular, single-byte
ASCII.

A in ASCII 01000001

A in UTF-8 01000001
* UTF-8 Everywhere Manifesto: https://utf8everywhere.org/
UTF-8
• Code points 128 and above are converted to binary and
stored (encoded) in a series of bytes



A 2-byte example looks like this



110xxxxx 10xxxxxx









In contrast, single byte ASCII characters (<128 decimal
code points) look like 0xxxxxxx
Count byte

Starts with 

11..0

Data byte

Starts with

10
UTF-8
• A 2-byte example looks like this

110xxxxx 10xxxxxx

(Count Byte) (Data Byte)

• The first count byte indicates the number of bytes for the code-point,
including the count byte. These bytes start with 11..0:

110xxxxx (The leading “11” is indicates 2 bytes in sequence, including the
“count” byte)



1110xxxx (1110 -> 3 bytes in sequence)



11110xxx (11110 -> 4 bytes in sequence)

• After count bytes, data bytes starting with 10… and contain information
for the code point

UTF-8
• Example: 

অ 'BENGALI LETTER A' (U+0985)

Binary form: 00001001 10000101



0000 100110 000101

UTF-8 form: 11100000 10100110 10000101

• Variable length encoding. Uses more space than UTF-16
for text with mostly Asian characters (2 bytes vs 3 bytes),
less space than UTF-16 for text with mostly ASCII
characters (1 byte vs 2 bytes)
UTF-8
• No null bytes. Old programs work just fine that treats null
bytes (00000000) as end of string

• We read and write a single byte at a time, so no worry of
Endianness. This is very convenient

• UTF-8 is the encoding you should use if you work on web
How does Notepad know which encoding was used when it opens a file?
BOM

(Byte Order Mask)
BOM Encoding
00 00 FE FF UTF-32, big-endian
FF FE 00 00 UTF-32, little-endian
FE FF UTF-16, big-endian
FF FE UTF-16, little-endian
EF BB BF UTF-8
How do browsers know which encoding was used when it opens a page?
HTTP Content-Type Header
Meta tag
Otherwise, browsers try to “guess” if these information are not present
Practical Considerations
• If you are working on web, use UTF-8

• If your operation is mostly with GUI and calling windows APIs with Unicode
string, use UTF-16

• UTF-16 takes least space than UTF-8 and UTF-16 if most characters are
Asian

• UTF-8 takes least space if most characters are Latin

• If memory is cheap and you need fastest operation, random access to
characters etc, use UTF-32

• If dealing with Endianness & BOM is a problem, then use UTF-8

• When in doubt, use UTF-8 😛
https://avro.im/utf.pdf

More Related Content

What's hot

Cyber security and demonstration of security tools
Cyber security and demonstration of security toolsCyber security and demonstration of security tools
Cyber security and demonstration of security tools
Vicky Fernandes
 
Backdoor
BackdoorBackdoor
Backdoor
phanleson
 
Lesson 1 - Introducing, Installing, and Upgrading Windows 7
Lesson 1 - Introducing, Installing, and Upgrading Windows 7Lesson 1 - Introducing, Installing, and Upgrading Windows 7
Lesson 1 - Introducing, Installing, and Upgrading Windows 7Gene Carboni
 
IP Address
IP AddressIP Address
IP Address
Rahul P
 
Web technologies lesson 1
Web technologies   lesson 1Web technologies   lesson 1
Web technologies lesson 1nhepner
 
Introduction to Software Security and Best Practices
Introduction to Software Security and Best PracticesIntroduction to Software Security and Best Practices
Introduction to Software Security and Best Practices
Maxime ALAY-EDDINE
 
Installing windows server 2016 TP 4
Installing windows server 2016 TP 4Installing windows server 2016 TP 4
Installing windows server 2016 TP 4
Ayman Sheta
 
Ethical hacking
Ethical hackingEthical hacking
Ethical hacking
ANKITA VISHWAKARMA
 
Introduction to Web Technology
Introduction to Web TechnologyIntroduction to Web Technology
Introduction to Web Technology
Aashish Jain
 
Computer Malware
Computer MalwareComputer Malware
Computer Malwareaztechtchr
 
Ip addressing
Ip addressingIp addressing
Ip addressing
Tanvir Amin
 
Computer forensics 1
Computer forensics 1Computer forensics 1
Computer forensics 1Jinalkakadiya
 
Architecture of Linux
 Architecture of Linux Architecture of Linux
Architecture of Linux
SHUBHA CHATURVEDI
 
Character sets and alphabets
Character sets and alphabetsCharacter sets and alphabets
Character sets and alphabets
RazinaShamim
 
Ip address presentation
Ip address presentationIp address presentation
Ip address presentation
muhammad amir
 
Cyber forensics question bank
Cyber forensics   question bankCyber forensics   question bank
Cyber forensics question bank
ArthyR3
 
IP Configuration
IP ConfigurationIP Configuration
IP ConfigurationStephen Raj
 
Linux distributions
Linux    distributionsLinux    distributions
Linux distributions
RJ Mehul Gadhiya
 
iP Address ,
 iP Address , iP Address ,
iP Address ,
Er Bhagat Sharma
 

What's hot (20)

Cyber security and demonstration of security tools
Cyber security and demonstration of security toolsCyber security and demonstration of security tools
Cyber security and demonstration of security tools
 
Backdoor
BackdoorBackdoor
Backdoor
 
Lesson 1 - Introducing, Installing, and Upgrading Windows 7
Lesson 1 - Introducing, Installing, and Upgrading Windows 7Lesson 1 - Introducing, Installing, and Upgrading Windows 7
Lesson 1 - Introducing, Installing, and Upgrading Windows 7
 
IP Address
IP AddressIP Address
IP Address
 
Web technologies lesson 1
Web technologies   lesson 1Web technologies   lesson 1
Web technologies lesson 1
 
Introduction to Software Security and Best Practices
Introduction to Software Security and Best PracticesIntroduction to Software Security and Best Practices
Introduction to Software Security and Best Practices
 
Installing windows server 2016 TP 4
Installing windows server 2016 TP 4Installing windows server 2016 TP 4
Installing windows server 2016 TP 4
 
Ethical hacking
Ethical hackingEthical hacking
Ethical hacking
 
Introduction to Web Technology
Introduction to Web TechnologyIntroduction to Web Technology
Introduction to Web Technology
 
Computer Malware
Computer MalwareComputer Malware
Computer Malware
 
Ip addressing
Ip addressingIp addressing
Ip addressing
 
Computer forensics 1
Computer forensics 1Computer forensics 1
Computer forensics 1
 
Architecture of Linux
 Architecture of Linux Architecture of Linux
Architecture of Linux
 
Character sets and alphabets
Character sets and alphabetsCharacter sets and alphabets
Character sets and alphabets
 
Ip address presentation
Ip address presentationIp address presentation
Ip address presentation
 
Cyber forensics question bank
Cyber forensics   question bankCyber forensics   question bank
Cyber forensics question bank
 
Notes on a Standard: Unicode
Notes on a Standard: UnicodeNotes on a Standard: Unicode
Notes on a Standard: Unicode
 
IP Configuration
IP ConfigurationIP Configuration
IP Configuration
 
Linux distributions
Linux    distributionsLinux    distributions
Linux distributions
 
iP Address ,
 iP Address , iP Address ,
iP Address ,
 

Similar to Unicode Encoding Forms

expect("").length.toBe(1)
expect("").length.toBe(1)expect("").length.toBe(1)
expect("").length.toBe(1)
Philip Hofstetter
 
What character is that
What character is thatWhat character is that
What character is that
Anders Karlsson
 
004 NUMBER SYSTEM (1).pdf
004 NUMBER SYSTEM (1).pdf004 NUMBER SYSTEM (1).pdf
004 NUMBER SYSTEM (1).pdf
MaheShiva
 
4 character encoding-unicode
4 character encoding-unicode4 character encoding-unicode
4 character encoding-unicodeirdginfo
 
Abap slide class4 unicode-plusfiles
Abap slide class4 unicode-plusfilesAbap slide class4 unicode-plusfiles
Abap slide class4 unicode-plusfilesMilind Patil
 
Lesson4.2 u4 l1 binary squences
Lesson4.2 u4 l1 binary squencesLesson4.2 u4 l1 binary squences
Lesson4.2 u4 l1 binary squences
Lexume1
 
Unicode Fundamentals
Unicode Fundamentals Unicode Fundamentals
Unicode Fundamentals
SamiHsDU
 
Topic 1 Data Representation
Topic 1 Data RepresentationTopic 1 Data Representation
Topic 1 Data Representationekul
 
Topic 1 Data Representation
Topic 1 Data RepresentationTopic 1 Data Representation
Topic 1 Data Representation
Kyle
 
Unicode - Hacking The International Character System
Unicode - Hacking The International Character SystemUnicode - Hacking The International Character System
Unicode - Hacking The International Character System
Websecurify
 
SD & D Representing Text
SD & D Representing TextSD & D Representing Text
SD & D Representing Text
Forrester High School
 
Using unicode with php
Using unicode with phpUsing unicode with php
Using unicode with php
Elizabeth Smith
 
Comprehasive Exam - IT
Comprehasive Exam - ITComprehasive Exam - IT
Comprehasive Exam - IT
guest6ddfb98
 
Topic 2.3 (1)
Topic 2.3 (1)Topic 2.3 (1)
Topic 2.3 (1)
nabilbesttravel
 
UTF-8: The Secret of Character Encoding
UTF-8: The Secret of Character EncodingUTF-8: The Secret of Character Encoding
UTF-8: The Secret of Character EncodingBert Pattyn
 
How Unidecoder Transliterates UTF-8 to ASCII
How Unidecoder Transliterates UTF-8 to ASCIIHow Unidecoder Transliterates UTF-8 to ASCII
How Unidecoder Transliterates UTF-8 to ASCII
Simon Courtois
 
1.1.1 BINARY SYSTEM
1.1.1 BINARY SYSTEM1.1.1 BINARY SYSTEM
1.1.1 BINARY SYSTEM
Buxoo Abdullah
 

Similar to Unicode Encoding Forms (20)

expect("").length.toBe(1)
expect("").length.toBe(1)expect("").length.toBe(1)
expect("").length.toBe(1)
 
What character is that
What character is thatWhat character is that
What character is that
 
004 NUMBER SYSTEM (1).pdf
004 NUMBER SYSTEM (1).pdf004 NUMBER SYSTEM (1).pdf
004 NUMBER SYSTEM (1).pdf
 
4 character encoding-unicode
4 character encoding-unicode4 character encoding-unicode
4 character encoding-unicode
 
Abap slide class4 unicode-plusfiles
Abap slide class4 unicode-plusfilesAbap slide class4 unicode-plusfiles
Abap slide class4 unicode-plusfiles
 
Lesson4.2 u4 l1 binary squences
Lesson4.2 u4 l1 binary squencesLesson4.2 u4 l1 binary squences
Lesson4.2 u4 l1 binary squences
 
Unicode Fundamentals
Unicode Fundamentals Unicode Fundamentals
Unicode Fundamentals
 
Topic 1 Data Representation
Topic 1 Data RepresentationTopic 1 Data Representation
Topic 1 Data Representation
 
Topic 1 Data Representation
Topic 1 Data RepresentationTopic 1 Data Representation
Topic 1 Data Representation
 
Unicode - Hacking The International Character System
Unicode - Hacking The International Character SystemUnicode - Hacking The International Character System
Unicode - Hacking The International Character System
 
SD & D Representing Text
SD & D Representing TextSD & D Representing Text
SD & D Representing Text
 
Unicode basics in python
Unicode basics in pythonUnicode basics in python
Unicode basics in python
 
Using unicode with php
Using unicode with phpUsing unicode with php
Using unicode with php
 
Comprehasive Exam - IT
Comprehasive Exam - ITComprehasive Exam - IT
Comprehasive Exam - IT
 
Using unicode with php
Using unicode with phpUsing unicode with php
Using unicode with php
 
Unicode
UnicodeUnicode
Unicode
 
Topic 2.3 (1)
Topic 2.3 (1)Topic 2.3 (1)
Topic 2.3 (1)
 
UTF-8: The Secret of Character Encoding
UTF-8: The Secret of Character EncodingUTF-8: The Secret of Character Encoding
UTF-8: The Secret of Character Encoding
 
How Unidecoder Transliterates UTF-8 to ASCII
How Unidecoder Transliterates UTF-8 to ASCIIHow Unidecoder Transliterates UTF-8 to ASCII
How Unidecoder Transliterates UTF-8 to ASCII
 
1.1.1 BINARY SYSTEM
1.1.1 BINARY SYSTEM1.1.1 BINARY SYSTEM
1.1.1 BINARY SYSTEM
 

Recently uploaded

Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
Ana-Maria Mihalceanu
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
KatiaHIMEUR1
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptxSecstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
nkrafacyberclub
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
名前 です男
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Paige Cruz
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
Matthew Sinclair
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
SOFTTECHHUB
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
KAMESHS29
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
Uni Systems S.M.S.A.
 
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Nexer Digital
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
DianaGray10
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
Neo4j
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
Alpen-Adria-Universität
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 
GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...
ThomasParaiso2
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems S.M.S.A.
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
Neo4j
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance
 

Recently uploaded (20)

Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptxSecstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
 
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 

Unicode Encoding Forms

  • 2. Which one do I choose? 🤔
  • 3. Why are there so many? 🤔
  • 4. What happened to the text? 🤔
  • 5. Let’s start from the beginning Once upon a time, when life was simpler (for Americans), we had ASCII
  • 6. ASCII • Designed for teleprinters in the 1960s • 7 bit (0000000 to 1111111 in binary) • 128 codes (0 to 127 in decimal) 0 1 0 0 0 0 0 1 0 1 0 0 0 0 1 0 0 1 0 0 0 0 1 1 A B C (65 in decimal) (66 in decimal) (67 in decimal)
  • 7. Why waste this 1 bit? 0 1 0 0 0 0 0 1 0 1 0 0 0 0 1 0 0 1 0 0 0 0 1 1 A B C (65 in decimal) (66 in decimal) (67 in decimal) Meanwhile, 8-bit byte were becoming common (aka,128 more spaces for characters) 🤔
  • 8. Everyone had the same idea- Let’s Extend ASCII And use all those 256 characters
  • 9. Let there be… chaos! 😈
  • 10. Code Pages • There are too many (more than 220 DOS and Windows code pages alone) • IBM, Apple all introduced their own “code pages” 🤛 🤜 • Side-note: ANSI character set has no well-defined meaning, they are a collection of 8-bit character sets compatible with ASCII but incompatible with each other
  • 11. Problems • Most are compatible with ASCII but incompatible with each other • Programs need to know what code page to use in order to display the contents correctly • Files created on one machine may be unreadable on another • Even 256 characters are not enough %
  • 12. Internet happened! And also 🔥 Fax machines were not enough anymore.
  • 14. Unicode • Finally everyone agreed on what code point mapped to what character 🎉
 • There’s room for over 1 million code points (characters), though the majority of common languages fit into the first 65536 code points
  • 15. How do we serialize multi- byte characters?
 💾
  • 17. UTF-32 • Simplest encoding • Unicode supports 1,114,112 code points. We can store them using 21 bits • UTF-32 is 32-bit in length. Fixed length encoding • So we take a 21-bit value and simply zero-pad the value out to 32 bits
  • 18. UTF-32 • Example
 A in ASCII 01000001
 A in UTF-32 00000000 00000000 00000000 01000001
 or, 01000001 00000000 00000000 00000000 • So, not compatible with ASCII • Super wasteful • But faster text operations (e.g character count) • Messes where “null terminated strings” (00000000) are expected
  • 19. UTF-32 • Also, now we have to deal with “Endianness” 🤔
 
 00000000 00000000 00000000 01000001
 vs,
 01000001 00000000 00000000 00000000
  • 20. Endianness • Big endian machine: Stores data big-end first. When looking at multiple bytes, the first byte is the biggest
 
 increasing addresses ———->
 00000000 00000000 00000000 01000001 • Little endian machine: Stores data little-end first. When looking at multiple bytes, the first byte is smallest
 
 increasing addresses ———-> 
 01000001 00000000 00000000 00000000 • Endianness does not matter if you have a single byte, because how we read a single byte is same in all machines 😌
  • 21. UTF-16 • The oldest encoding for Unicode. Often mislabeled as "Unicode encoding” • Variable length encoding. 2 bytes for most common characters (BMP), 4 bytes for everything else • The most common characters (BMP) in Unicode fits into first 65,536 code points, so it’s straightforward. Throw away top 5 zeros from 21 bit, you get UTF-16.
 A 00000 00000000 01000001 (21 bit) becomes-
 A 00000000 01000001 (16 bit) • Uses “Surrogate pairs” for other characters
  • 22. UTF-16 • Multi-byte encoding, so has Endianness like UTF-32 • Incompatible with ASCII
 A in UTF-16 00000000 01000001
 A in ASCII 01000001 • Incompatible with old systems that rely on null (null byte: 00000000) terminated strings • Uses less space than UTF-32 in practice • Windows API, .NET and Java environments are founded on UTF-16, often called “wide character string”
  • 23. UTF-8 • Nice & simple. This is the kid that everyone loves * • Backward compatible with ASCII 🙌 • 0 to 127 code points are stored as regular, single-byte ASCII.
 A in ASCII 01000001
 A in UTF-8 01000001 * UTF-8 Everywhere Manifesto: https://utf8everywhere.org/
  • 24. UTF-8 • Code points 128 and above are converted to binary and stored (encoded) in a series of bytes
 
 A 2-byte example looks like this
 
 110xxxxx 10xxxxxx
 
 
 
 
 In contrast, single byte ASCII characters (<128 decimal code points) look like 0xxxxxxx Count byte
 Starts with 
 11..0
 Data byte
 Starts with
 10
  • 25. UTF-8 • A 2-byte example looks like this
 110xxxxx 10xxxxxx
 (Count Byte) (Data Byte) • The first count byte indicates the number of bytes for the code-point, including the count byte. These bytes start with 11..0:
 110xxxxx (The leading “11” is indicates 2 bytes in sequence, including the “count” byte)
 
 1110xxxx (1110 -> 3 bytes in sequence)
 
 11110xxx (11110 -> 4 bytes in sequence) • After count bytes, data bytes starting with 10… and contain information for the code point

  • 26. UTF-8 • Example: 
 অ 'BENGALI LETTER A' (U+0985)
 Binary form: 00001001 10000101
 
 0000 100110 000101
 UTF-8 form: 11100000 10100110 10000101
 • Variable length encoding. Uses more space than UTF-16 for text with mostly Asian characters (2 bytes vs 3 bytes), less space than UTF-16 for text with mostly ASCII characters (1 byte vs 2 bytes)
  • 27. UTF-8 • No null bytes. Old programs work just fine that treats null bytes (00000000) as end of string • We read and write a single byte at a time, so no worry of Endianness. This is very convenient • UTF-8 is the encoding you should use if you work on web
  • 28. How does Notepad know which encoding was used when it opens a file?
  • 29. BOM
 (Byte Order Mask) BOM Encoding 00 00 FE FF UTF-32, big-endian FF FE 00 00 UTF-32, little-endian FE FF UTF-16, big-endian FF FE UTF-16, little-endian EF BB BF UTF-8
  • 30. How do browsers know which encoding was used when it opens a page?
  • 32. Meta tag Otherwise, browsers try to “guess” if these information are not present
  • 33. Practical Considerations • If you are working on web, use UTF-8 • If your operation is mostly with GUI and calling windows APIs with Unicode string, use UTF-16 • UTF-16 takes least space than UTF-8 and UTF-16 if most characters are Asian • UTF-8 takes least space if most characters are Latin • If memory is cheap and you need fastest operation, random access to characters etc, use UTF-32 • If dealing with Endianness & BOM is a problem, then use UTF-8 • When in doubt, use UTF-8 😛