Understanding Character Encodings
Basics of Character Encodings that all Programmers should Know.
Pritam Barhate, Cofounder and CTO Mobisoft Infotech.
Introduction
• I am Pritam Barhate. I am cofounder and CTO of Mobisoft Infotech, which is an
iPhone & iPad Application Development Company.
• Mobisoft also has expertise in creating scalable cloud based API backends
for iOS and Android applications.
• For more information about various services provided by our company,
please check our website.
Definitions
• Character
A character is a sign or a symbol in a writing system. In computing, a
character can be a letter, a digit, a punctuation mark, a mathematical symbol, or a
control character (these are not visible; for example, carriage return).
• Script
A script is a collection of letters and other written signs used to represent
textual information in one or more writing systems. For example, Latin is a
script which supports multiple languages like English, French and German.
The Need for Character Sets
• Computers only understand binary data. To represent the characters required
by human languages, the concept of character sets was introduced.
• In a character set, each character is represented by a number.
• In early computing, English was the main language used. To represent the
characters used in English, the ASCII character set was used.
• ASCII uses 7 bits to represent 128 characters, which include: digits 0 to 9,
lowercase letters a to z, uppercase letters A to Z, basic punctuation symbols,
control codes that originated with Teletype machines, and a space.
• For example, in ASCII, the character O is represented by the decimal number 79. The
same value can be written in binary as 1001111 or in hexadecimal (hex) as 4F.
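The mapping described above can be checked with a few lines of Python. This is only an illustrative sketch; the variable names are made up for the example.

```python
# Look up the number that represents the character "O".
ch = "O"
code = ord(ch)                      # numeric value of the character

decimal_form = str(code)            # "79"
binary_form = format(code, "07b")   # 7-bit binary, as used by ASCII
hex_form = format(code, "X")        # uppercase hexadecimal

print(decimal_form, binary_form, hex_form)

# The mapping also works in the other direction.
assert chr(79) == "O"
```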
Evolution!
• Naturally, ASCII was insufficient to represent the characters of all human
languages. This led to the creation of multiple character sets as computing
became more widespread throughout the world. At the same time, different
companies created computing platforms that supported different character
sets.
• Platform providers made various attempts to extend the character sets they
supported while maintaining backward compatibility. This created a lot of
confusion when data created by a program using encoding X was read by a
program that used encoding Y as its default.
• Finally, Unicode emerged as a common standard that supports the majority of
the scripts written by people throughout the world.
Some Commonly Used Character Sets
• ASCII
• Unicode
• Windows 1252
• ISO 8859
So if I use Unicode I am OK, Right?
UTF-8, UTF-16 and UTF-32
• Since Unicode defines more than a hundred thousand characters, one byte is
not sufficient to represent every character.
• Hence multiple bytes are required to represent Unicode characters.
• UTF-32 uses a fixed 32 bits (4 bytes) per character, which is sufficient to
represent every character in Unicode. However, much of the data is in English
and Western European languages, so using a uniform 32 bits everywhere wastes
a lot of space.
• Hence, UTF-8 and UTF-16 are more prevalent. These encodings use a
variable number of bytes per character: 1 to 4 bytes in UTF-8, and 2 or 4
bytes in UTF-16.
• UTF-8 is the most prevalent encoding on the web.
• Modern Windows versions and Java (SE 5.0 onwards) use UTF-16 for their
native string APIs.
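To make the size trade-off concrete, here is a small Python sketch that counts how many bytes a few sample characters need in each encoding. The helper name encoded_sizes and the sample characters are chosen for illustration only. The "-le" codec variants are used so that no byte order mark is added to the counts.

```python
def encoded_sizes(ch):
    """Return the number of bytes ch needs in UTF-8, UTF-16 and UTF-32."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-le")),   # no BOM
        "utf-32": len(ch.encode("utf-32-le")),   # no BOM
    }

# An ASCII letter, an accented Latin letter, the euro sign, and an emoji.
for sample in ["A", "é", "€", "😀"]:
    print(repr(sample), encoded_sizes(sample))
```

ASCII text stays at one byte per character in UTF-8, while UTF-32 spends four bytes on every character regardless of which one it is.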
How do Encodings Affect Programming?
• When data is transferred between computing systems, you need to carefully
check which encoding incoming data uses and which encoding you use when
providing data to other systems.
• A mismatch generally results in part or all of the data being interpreted as
garbage.
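A minimal Python sketch of such a mismatch: text is written as UTF-8 by one system and read as Windows 1252 by another. The sample word is arbitrary.

```python
original = "café"

# System A sends the text as UTF-8 bytes.
data = original.encode("utf-8")

# System B wrongly assumes Windows 1252 and gets garbage for the accented letter.
wrong = data.decode("cp1252")

# Decoding with the encoding that was actually used recovers the text.
right = data.decode("utf-8")

print(wrong)   # cafÃ©
print(right)   # café
```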
Tips to Handle Encoding Properly in your Programs - I
• When saving web pages that will be read by browsers, make sure that you
save the file in the correct encoding. Most text editors have the encoding
setting buried somewhere in preferences/options. Take the time to learn how
your favorite text editor / IDE handles file encodings.
• In static HTML files, as well as dynamically generated HTML, always set the
proper charset attribute value on the meta tag. Make sure that you are
actually outputting the data in the character set you declared. For example:
<head><meta charset="UTF-8"></head>
Tips to Handle Encoding Properly in your Programs - II
• While outputting data for dynamic web pages and APIs, make sure that the right
Content-Type header is sent to the client with the response.
For example: Content-Type: text/html; charset=utf-8
• While creating a database in MySQL or PostgreSQL, make sure that you
specify the correct character encoding for your tables.
• For new programs that use MySQL, when you create the database choose
utf8mb4 - utf8mb4_unicode_ci for the ‘Default Collation’ setting. MySQL's older
utf8 character set stores at most 3 bytes per character, so it cannot hold
characters outside the Basic Multilingual Plane, such as emoji.
• For new systems that use PostgreSQL, when you create the database
choose UTF-8 as the encoding and en_US.UTF-8 as the collation.
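The Content-Type tip above can be sketched in Python as follows. The function build_html_response is a hypothetical helper written for this example, not part of any framework; the point is only that the declared charset and the bytes actually sent must agree.

```python
def build_html_response(page_text):
    """Return (headers, body) for an HTML page, encoded as UTF-8."""
    # Encode once, then describe exactly that encoding in the header.
    body = page_text.encode("utf-8")
    headers = {
        "Content-Type": "text/html; charset=utf-8",
        # Content-Length counts bytes, not characters.
        "Content-Length": str(len(body)),
    }
    return headers, body

page = '<meta charset="UTF-8"><p>café</p>'
headers, body = build_html_response(page)
print(headers)
```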
Tips to Handle Encoding Properly in your Programs - III
• While importing data from text-based files, such as CSV, make sure that you
detect the encoding of the file and then read the data using that encoding.
Then, while saving the data in your system, make sure that you convert it
to the character set that your system uses. If you are using a library
to do the CSV reading and writing for you, take the time to understand
how the library handles character encodings. Choose a library only after
making sure that it has robust support for various character encodings.
• Take the time to understand how the programming language and libraries you
are using handle character encoding while parsing files and parsing API
output such as JSON / XML.
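As a rough sketch of the CSV import steps above, the Python code below tries a short list of candidate encodings, parses the CSV, and re-encodes the result as UTF-8. Trying candidates in order is a deliberately naive stand-in for a real detection library, and the function names and the candidate list are assumptions made for this example.

```python
import csv
import io

def decode_with_fallback(raw, candidates=("utf-8-sig", "cp1252")):
    """Decode raw bytes with the first candidate encoding that fits."""
    for enc in candidates:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings matched")

def import_csv(raw):
    """Return (rows, source_encoding, utf8_bytes) for raw CSV bytes."""
    text, enc = decode_with_fallback(raw)
    rows = list(csv.reader(io.StringIO(text, newline="")))
    # Convert to the encoding our own system uses before storing.
    return rows, enc, text.encode("utf-8")

# A file exported by a tool that writes Windows 1252.
legacy = "name,city\r\nZoë,Kraków\r\n".encode("cp1252")
rows, source_encoding, stored = import_csv(legacy)
print(source_encoding, rows)
```

A guess like this can silently pick the wrong encoding when several candidates decode without error, so it is safest when the set of possible source encodings is known in advance.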
Useful PHP Functions to Handle Character Encodings
• Most PHP functions that read files (for example, file_get_contents)
return the data as a string.
• Unfortunately, in PHP strings are just byte arrays. They don’t store any
encoding information. You can use the following functions to detect and
convert the encoding of a string.
mb_detect_encoding — Detect character encoding
mb_convert_encoding — Convert character encoding
• However, encoding detection is a heuristic, and the implementations of these
functions have been criticized online. So use them with caution.
How to Handle Character Encoding in Java
• In Java, InputStreamReader.getEncoding() returns the name of the encoding
the reader is using (the one you passed in, or the platform default). It does
not detect the encoding of the data.
• The juniversalchardet library, which is a Java port of Mozilla's
universalchardet library, can be used to detect character encodings in Java.
• Once you have a byte array of data and you have detected the character
encoding for it, you can use the following String constructor to create a string
with proper encoding: public String(byte[] bytes, Charset charset)
• To save a file with the desired character encoding, you need to pass the proper
character set to the OutputStreamWriter class, using the constructor:
public OutputStreamWriter(OutputStream out, Charset cs)
How to Handle Character Encoding in Objective-C
• You can use the UniversalDetector library to detect character encodings. This
library is a wrapper for uchardet, which is based on the C++ implementation of
the universal charset detection library by Mozilla.
• Once you have detected the encoding of a byte buffer, you can pass that
encoding to the NSString initializer to create the correct string from the data. For
example: -(instancetype)initWithData:(NSData *)data
encoding:(NSStringEncoding)encoding
• To write a string to a file with the desired encoding: - (BOOL)writeToFile:(NSString
*)path atomically:(BOOL)useAuxiliaryFile encoding:(NSStringEncoding)enc
error:(NSError * _Nullable *)error
How to Handle Character Encoding in C#
• To detect character encodings in C#, you can use the chardetsharp library,
which is a port of Mozilla’s CharDet Character Set Detector.
• Once you have detected the character set of the byte array, you can use
the matching System.Text.Encoding object to obtain a correct C# string object
from the data: Encoding.GetString(byte[] bytes)
• To write a C# string to a file with a desired encoding, you can pass the
encoding to the file API: File.WriteAllText(string path, string contents,
Encoding encoding)
Thank You!

Understanding Character Encodings

  • 1. Understanding Character Encodings Basics of Character Encodings that all Programmers should Know. Pritam Barhate, Cofounder and CTO Mobisoft Infotech.
  • 2. Introduction • I am Pritam Barhate. I am cofounder and CTO of Mobisoft Infotech, which is a iPhone & iPad Application Development Company. • Mobisoft also has expertise in creating scalable cloud based API backends for iOS and Android applications. • For more information about various services provided by our company, please check our website.
  • 3. Definitions • Character A character is a sign or a symbol in a writing system. In computing a character can be, a letter, a digit, a punctuation or mathematical symbol or a control character (these are not visible, for example, carriage return). • Script A script is a collection of letters and other written signs used to represent textual information in one or more writing systems. For example, Latin is a script which supports multiple languages like English, French and German.
  • 4. The Need for Character Sets • Computers only understand binary data. To represents the characters as required by human languages, the concept of character sets was introduced. • In character sets each character in a human language is represented by a number. • In early computing English was the only language used. To represent, the characters used in English, ASCII character set was used. • ASCII used 7 bits to represent 128 characters which included: numbers 0 to 9, lowercase letters a to z, uppercase letters A to Z, basic punctuation symbols, control codes that originated with Teletype machines, and a space. • For example, in ASCII, character O is represented by decimal number 79. Same can be written in binary format as: 1001111 or in hexadecimal (hex) format as: 4F.
  • 5. Evolution! • Naturally ASCII was insufficient to represent characters in all human languages. This led to creation to multiple character sets as computing evolved and became more widespread throughout the world. However, as computing evolved, multiple companies created different computing platforms which had support for different character sets. • Various attempts were made by different platform providers to extend the currently supported character sets while maintaining backward compatibility. This created a lot of confusion when data created by program using X encoding was to be used in a program which uses Y encoding as its default. • Finally Unicode emerged as a common standard to support majority of written scripts by people throughout the world.
  • 6. Some Commonly Used Character Sets • ASCII • Unicode • Windows 1252 • ISO 8859
  • 7. So if I use Unicode I am OK, Right?
  • 8. UTF-8, UTF-16 and UTF-32 • Since Unicode contains well over 100,000 characters, one byte is not sufficient to represent every character. • Hence multiple bytes are required to represent Unicode characters. • Though 32 bits (4 bytes) are sufficient to represent every character in Unicode, which is what UTF-32 does, most data is in English and Western European languages. Using a uniform 32 bits everywhere therefore wastes a lot of space. • Hence, UTF-8 and UTF-16 are more prevalent. These encodings use a variable number of bytes per character as needed: one to four bytes in UTF-8, and two or four bytes in UTF-16. • UTF-8 is the most prevalent encoding used on the web. • All modern Windows OSes and Java SE 5.0 onwards use UTF-16 internally to represent strings.
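To make the space trade-off concrete, here is a small Python sketch that counts how many bytes a few sample characters need in each encoding. The function name `encoded_lengths` and the choice of sample characters are illustrative assumptions, not part of the original slides.

```python
# Sample characters, written as escapes so the source file stays pure ASCII:
# Latin capital A, e with acute accent, euro sign, and an emoji.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

def encoded_lengths(ch):
    """Return the byte counts of ch in UTF-8, UTF-16 and UTF-32."""
    # The "-le" codec variants are used so that no byte order mark is added.
    return tuple(len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le"))

for ch in samples:
    # Print the code point rather than the character itself, so the output
    # does not depend on the terminal's own encoding.
    print(f"U+{ord(ch):04X}", encoded_lengths(ch))
```

For ASCII-heavy text UTF-8 is the most compact of the three, while UTF-32 always spends four bytes per character.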
  • 9. How do Encodings Affect Programming? • When data is transferred between computing systems, you need to check carefully which encoding the incoming data uses and which encoding you use when providing data to other systems. • An error here generally results in part or all of the data being interpreted as garbage.
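The "garbage" described on this slide is easy to reproduce. The sketch below, in Python and with an arbitrarily chosen sample word, encodes text as UTF-8 and then decodes the same bytes with the wrong encoding (Windows-1252) and with the right one.

```python
text = "caf\u00e9"                 # the word "cafe" with an acute accent on the e
data = text.encode("utf-8")        # the accented letter becomes two bytes: C3 A9

# Decoding with the wrong encoding does not fail; it silently yields garbage,
# because each of the two bytes is read as a separate Windows-1252 character.
wrong = data.decode("cp1252")

# Decoding with the encoding that was actually used restores the text.
right = data.decode("utf-8")

# A stricter codec such as ASCII refuses the bytes instead of guessing.
try:
    data.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True
```

Note that the wrong decoding raises no error, which is why such bugs often go unnoticed until the text is displayed.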
  • 10. Tips to Handle Encoding Properly in your Programs - I • When saving web pages that will be read by browsers, make sure that the file is saved in the encoding you intend. Most text editors have the encoding setting buried somewhere deep in preferences/options. Take the time to learn how your favorite text editor / IDE handles file encodings. • In static HTML files, as well as in dynamically generated HTML, always set a proper charset attribute value on the meta tag. Make sure that you are actually outputting the data in the character set you declare. For example: <head><meta charset="UTF-8"></head>
  • 11. Tips to Handle Encoding Properly in your Programs - II • While outputting data for dynamic web pages and APIs, make sure that the right Content-Type header is sent to the client with the response. For example: Content-Type: text/html; charset=utf-8 • While creating a database in MySQL or PostgreSQL, make sure that you specify the correct character encoding for your tables. • For new programs that use MySQL, when you create the database, choose utf8mb4 - utf8mb4_unicode_ci for the ‘Default Collation’ setting rather than utf8 - utf8_unicode_ci. MySQL's utf8 stores at most 3 bytes per character, so it cannot hold characters outside the Basic Multilingual Plane, such as emoji. • For new systems that use PostgreSQL, when you create the database, choose UTF-8 as the encoding and en_US.UTF-8 as the collation.
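One way to keep the declared charset and the actual bytes in agreement is to derive both from a single value. The following Python sketch assumes a hypothetical helper named `build_html_response`; it is not tied to any particular web framework and only illustrates the idea.

```python
def build_html_response(html, encoding="utf-8"):
    """Encode html and return (headers, body) so that the declared charset
    always matches the encoding actually used for the bytes."""
    body = html.encode(encoding)
    headers = {
        "Content-Type": f"text/html; charset={encoding}",
        # Content-Length counts bytes, not characters.
        "Content-Length": str(len(body)),
    }
    return headers, body

# Sample page with two non-ASCII letters, written as escapes.
page = (
    '<html><head><meta charset="UTF-8"></head>'
    "<body>na\u00efve caf\u00e9</body></html>"
)
headers, body = build_html_response(page)
print(headers["Content-Type"])
```

Because the header is built from the same `encoding` argument that encodes the body, the two cannot drift apart. The meta tag inside the page is still written by hand and has to match as well.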
  • 12. Tips to Handle Encoding Properly in your Programs - III • While importing data from text-based files, such as CSV, make sure that you detect the encoding of the file and then read the data using that encoding. Then, before saving the data in your system, convert it to the character set that your system uses. If you are using a library to do the CSV reading and writing for you, take the time to understand how the library handles character encodings. Choose a library only after making sure that it has robust support for various character encodings. • Take the time to understand how the programming language and libraries you are using handle character encoding while parsing files and API output such as JSON / XML.
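The import steps above (find the encoding, decode with it, convert to the system's own character set) can be sketched in Python as follows. The function names `read_csv_bytes` and `write_csv_utf8` are hypothetical. The trial-decoding approach is a deliberately crude stand-in for real detection, since Windows-1252 accepts almost any byte sequence; a dedicated detection library would be a better fit for production use.

```python
import csv
import io

def read_csv_bytes(data, candidates=("utf-8-sig", "cp1252")):
    """Decode CSV bytes with the first candidate encoding that succeeds,
    then parse the rows. Returns (encoding_used, rows)."""
    for enc in candidates:
        try:
            text = data.decode(enc)
        except UnicodeDecodeError:
            continue  # these bytes are not valid in this encoding, try the next
        return enc, list(csv.reader(io.StringIO(text, newline="")))
    raise ValueError("none of the candidate encodings could decode the data")

def write_csv_utf8(rows):
    """Serialize rows as CSV and return them as UTF-8 bytes."""
    buf = io.StringIO(newline="")
    csv.writer(buf).writerows(rows)
    return buf.getvalue().encode("utf-8")

# Simulate a legacy file exported as Windows-1252.
legacy = "name,city\r\nJos\u00e9,M\u00e1laga\r\n".encode("cp1252")
used, rows = read_csv_bytes(legacy)
converted = write_csv_utf8(rows)
print(used, len(rows))
```

Trying UTF-8 first is a reasonable ordering because non-ASCII text in a legacy single-byte encoding is rarely valid UTF-8, whereas the reverse check almost always passes.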
  • 13. Useful PHP Functions to Handle Character Encodings • Most of the PHP functions that read files (for example, file_get_contents) return the data as a string. • Unfortunately, PHP strings are just byte arrays. They don’t store any encoding information. You can use the following functions to detect and convert the encoding of a string. mb_detect_encoding: detect character encoding. mb_convert_encoding: convert character encoding. • However, the implementations of these functions have been criticized online. Detection in particular is a guess based on the bytes and can be wrong, so use them with caution.
  • 14. How to Handle Character Encoding in Java • In Java, InputStreamReader.getEncoding() returns the name of the encoding the reader is using (the one it was constructed with, or the platform default). It does not inspect the data to detect its encoding. • To detect the encoding of the data itself, you can use the juniversalchardet library, which is a Java port of the universalchardet library from Mozilla. • Once you have a byte array of data and you have detected its character encoding, you can use the following String constructor to create a string with the proper encoding: public String(byte[] bytes, Charset charset) • To save a file with a desired character encoding, you need to pass the proper character set to the OutputStreamWriter class. Here is an example of how to do it.
  • 15. How to Handle Character Encoding in Objective-C • You can use the UniversalDetector library to detect character encodings. This library is a wrapper for uchardet, which is based on the C++ implementation of the universal charset detection library by Mozilla. • Once you have detected the encoding of a byte buffer, you can pass that encoding to an NSString initializer to create the correct string from the data. For example: -(instancetype)initWithData:(NSData *)data encoding:(NSStringEncoding)encoding • To write a string to a file with a desired encoding: - (BOOL)writeToFile:(NSString *)path atomically:(BOOL)useAuxiliaryFile encoding:(NSStringEncoding)enc error:(NSError * _Nullable *)error
  • 16. How to Handle Character Encoding in C# • To detect character encodings in C#, you can use the chardetsharp library, which is a port of Mozilla’s CharDet Character Set Detector. • Once you have detected the character set of the byte array, you can use something similar to this example to obtain a correct C# string object from the data. • To write a C# string to a file with a desired encoding, you can follow the examples provided here.

Editor's Notes

  1. TODO: Need to find out CSV libraries with robust support for character encodings for popular languages. Include that list in this presentation.
  2. The mycong page itself has issues while showing the data correctly. Need to create appropriate samples for these.