Understand Unicode & UTF-8 in Perl
Avoid common issues and gain guru status.
(You too can be John)
Characters and Glyphs
A character: 'é'

Combination of 2 code points:

e (LATIN SMALL LETTER E)

Followed by:

U+0301 (COMBINING ACUTE ACCENT)
Characters and Glyphs
A character: 'é'

Or a single precomposed code point:

é (LATIN SMALL LETTER E WITH ACUTE)
So what is Unicode (in this context)?
A collection of (mainly) characters, each identified by a
unique number called its code point and carrying a set of
properties.
Example: E (U+0045)
  Name         LATIN CAPITAL LETTER E
  Block        Basic Latin
  Category     Letter, Uppercase [Lu]
  Combine      0
  BIDI         Left-to-Right [L]
  Lower case   U+0065
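
The properties listed above can be looked up from Perl itself. A minimal sketch, assuming the core Unicode::UCD module is available (the deck does not mention it):

```perl
use strict;
use warnings;
use Unicode::UCD qw(charinfo);

# Look up U+0045 in the Unicode database that ships with Perl.
my $info = charinfo(0x45);

printf "%-12s %s\n",   'Name',       $info->{name};
printf "%-12s %s\n",   'Block',      $info->{block};
printf "%-12s %s\n",   'Category',   $info->{category};
printf "%-12s %s\n",   'Combine',    $info->{combining};
printf "%-12s %s\n",   'BIDI',       $info->{bidi};
printf "%-12s U+%s\n", 'Lower case', $info->{lower};
```

charinfo returns short codes (for example "Lu" and "L") rather than the long descriptions shown in the table.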
What is a String?
An ordered sequence of characters, i.e. an ordered
sequence of Unicode code points.
In Perl:

my $s = "he";
or
my $s = "\N{U+0068}\N{U+0065}";
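
To check that the two spellings really are the same string, a short sketch (the variable names are ours, not from the deck):

```perl
use strict;
use warnings;

my $literal = "he";
my $escaped = "\N{U+0068}\N{U+0065}";

# Both spellings produce the same two-character string.
my $same = $literal eq $escaped ? 'same' : 'different';

# List the code point of each character, in order.
my $code_points = join ' ', map { sprintf 'U+%04X', ord } split //, $literal;

print "$same: $code_points\n";   # prints "same: U+0068 U+0065"
```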
What is a String? - The glyph pitfall
The same visible text can be more than one sequence of
code points. There's more than one way to write it.
In Perl:
my $s = "é";   # needs "use utf8;" in a UTF-8 source file
is
my $s = "\N{U+00E9}"; OR..
my $s = "\N{U+0065}\N{U+0301}";

In practice, software prefers the first form (phew),
but not always. See Unicode::Normalize
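
A minimal sketch of how Unicode::Normalize deals with the pitfall above. It assumes that comparing in NFC is acceptable for your data; other normalization forms exist.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

my $precomposed = "\N{U+00E9}";            # é as one code point
my $decomposed  = "\N{U+0065}\N{U+0301}";  # e + COMBINING ACUTE ACCENT

# Different code point sequences, so a plain comparison fails.
my $raw_equal = ($precomposed eq $decomposed) ? 1 : 0;

# Normalizing both sides to the same form makes them comparable.
my $nfc_equal = (NFC($precomposed) eq NFC($decomposed)) ? 1 : 0;

printf "lengths: %d and %d\n", length($precomposed), length($decomposed);
printf "raw equal: %d, NFC equal: %d\n", $raw_equal, $nfc_equal;
```

Normalizing at the boundary (right after decoding) keeps the rest of the program from having to care which form the input used.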
How does Perl represent Strings?
Short answer: It's not your business.

Long answer: It depends :(

Only "latin1 characters" -> Latin1. Anything
outside that -> UTF-8.

Feeling fiddly, bug fixing? Use the utf8::* functions.
Bedtime read: perldoc perlunicode
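
For the curious, a sketch of what the utf8::* functions expose. It only illustrates that the internal representation can differ while the string stays the same; it is not something to rely on in application logic.

```perl
use strict;
use warnings;

my $latin = "\xE9";       # é, fits in Latin-1
my $wide  = "\x{262D}";   # ☭, does not fit in Latin-1

# utf8::is_utf8 reports the internal storage flag, nothing more.
my $latin_flag = utf8::is_utf8($latin) ? 1 : 0;
my $wide_flag  = utf8::is_utf8($wide)  ? 1 : 0;

# Switching the internal representation does not change the string.
my $upgraded = $latin;
utf8::upgrade($upgraded);
my $still_equal = ($upgraded eq $latin && length($upgraded) == 1) ? 1 : 0;

print "latin flag: $latin_flag, wide flag: $wide_flag, ",
      "equal after upgrade: $still_equal\n";
```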
Not my business? So what's this fuss about UTF-8 encoding?
How strings are represented internally is not
your business.
How they are transmitted from/to the outside
world is.
The outside world doesn't understand 'Strings'.
It understands 'bytes'.

An encoding is a bijection:
Unicode code points <-> bytes
UTF-8 encoding
Unicode code points <-> bytes

Variable number of bytes per code point (1 to 4).
Examples:

a <-> \x{61},
☭ <-> \x{E2}\x{98}\x{AD} (gdrive FAIL)

Sometimes, the bytes begin with a BOM.
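
The two examples above can be reproduced with the core Encode module. A small sketch; hex_bytes is a helper name of our own:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Helper (our own name): show a byte string as space-separated hex.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord } split //, $bytes;
}

my $a_bytes      = encode('UTF-8', "a");
my $sickle_bytes = encode('UTF-8', "\N{U+262D}");

print hex_bytes($a_bytes), "\n";        # 61
print hex_bytes($sickle_bytes), "\n";   # E2 98 AD

# Decoding the bytes gives back the original one-character string.
my $round_trip = decode('UTF-8', $sickle_bytes);
```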
The encoding law
Never transfer Strings. Always transfer Bytes.

But inside Perl: You want to work with Strings
as much as possible.

Sending: Encode as LATE as possible.

Receiving: Decode as EARLY as possible.
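
A sketch of the law in practice, assuming the incoming bytes are known to be UTF-8 (here they are hard-coded rather than read from a real source):

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Bytes as they might arrive from the outside world: "café" in UTF-8.
my $bytes_in = "caf\xC3\xA9";

# Receiving: decode as early as possible.
my $string = decode('UTF-8', $bytes_in);

# Inside Perl: work on characters, not bytes.
my $char_count = length $string;   # 4 characters (the input was 5 bytes)
my $shouted    = uc $string;       # "CAFÉ"

# Sending: encode as late as possible.
my $bytes_out = encode('UTF-8', $shouted);
```

Running uc on the undecoded bytes would have left the two bytes of "é" untouched, which is the kind of bug the law is meant to prevent.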
Common outside worlds: STDOUT
Latin-1 encoding by default :(
-> You can only output 'Latin-1 compliant
Strings'. And your shell should expect Latin-1.
Anything else triggers a "Wide character in print" warning.

In the modern world:
# Set STDOUT to encode as UTF-8
binmode STDOUT, ':encoding(UTF-8)';
Common outside worlds: A text file
If you know the file encoding:
 open(my $fh, "<:encoding(UTF-8)", "filename")
     or die "Cannot open filename: $!";

If you don't know:
Maybe you can count on the BOM (byte order mark).

But you don't want that. You want to know for
sure -> set a convention.
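
A sketch of the "set a convention" approach: write and read with the same explicit layer. The temporary file stands in for "filename" and is our own addition.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Temporary file standing in for "filename".
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh or die "Cannot close $path: $!";

# Writing: the layer encodes the string to UTF-8 bytes.
open(my $out, '>:encoding(UTF-8)', $path) or die "Cannot write $path: $!";
print {$out} "\N{U+262D}\n";
close $out or die "Cannot close $path: $!";

# Reading: the same layer decodes the bytes back to a string.
open(my $in, '<:encoding(UTF-8)', $path) or die "Cannot read $path: $!";
my $line = <$in>;
close $in or die "Cannot close $path: $!";
chomp $line;

# On a Unix-like system: 3 bytes for the character plus 1 for the newline.
my $size_on_disk = -s $path;
```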
Common outside worlds: XML file
Encoding specified in the preamble:
<?xml version="1.0" encoding="utf-8"?>
If not specified -> UTF-8 is assumed.

Feed your XML parser with BYTES.
Write XML files in binary mode.

XML::LibXML calls bytes 'strings'. People
are confused. Trust no one.
Common outside worlds: WWW
From a given page, browsers send parameters
in the encoding of the page.

Correctly encode your binary responses.

Decode $c->params()

In Catalyst:
Catalyst::Plugin::Unicode::Encoding
Common outside worlds: Your own
Every time you communicate with a system,
you will send/receive bytes. Never strings.

Think about encoding/decoding your strings
to/from bytes, according to what your system
expects/provides.

Sometimes, it's done automagically through
some library options.
Bug avoiding guidelines.
Test everything with Unicode characters.

English keyboard? chartables.de, unicode
lorem ipsum.

Unit test => "\N{U+262D}"

Never i/o strings. Never. i/o is about bytes.
Choose encodings explicitly.
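
What such a unit test might look like with Test::More. store_and_load is a hypothetical stand-in for whatever code you want to exercise; here it only encodes and decodes.

```perl
use strict;
use warnings;
use Test::More tests => 2;
use Encode qw(encode decode);

# Hypothetical code under test: pushes a string through bytes and back.
sub store_and_load {
    my ($string) = @_;
    my $bytes = encode('UTF-8', $string);   # what would go to storage
    return decode('UTF-8', $bytes);         # what would come back
}

# Test data with a Latin-1 character and a character outside Latin-1.
my $input  = "caf\N{U+00E9} \N{U+262D}";
my $output = store_and_load($input);

is($output, $input, 'non-ASCII text survives the round trip');
is(length($output), 6, 'length is counted in characters, not bytes');
```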
Bonus: Escaping
What if you want to represent your nice shiny
UTF-8 bytes as part of something else?

You need to escape them!

Example in a URI, escaping parameters
(URI::Escape):

http://foo.com/?q=%E2%98%AD
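
To make the two steps visible (encode, then escape), here is a hand-rolled sketch using only core Perl. In real code you would more likely reach for URI::Escape, whose uri_escape_utf8 does both steps in one call.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $query = "\N{U+262D}";

# Step 1: encode the string to UTF-8 bytes.
my $bytes = encode('UTF-8', $query);

# Step 2: percent-escape every byte outside the URI "unreserved" set.
(my $escaped = $bytes) =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord($1))/ge;

my $url = "http://foo.com/?q=$escaped";
print "$url\n";   # http://foo.com/?q=%E2%98%AD

# Receiving side, in reverse order: unescape to bytes, then decode.
(my $unescaped = $escaped) =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
my $decoded = decode('UTF-8', $unescaped);
```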
Bonus: Escaping for email headers
Encode AND escape for email subjects
(encode with MIME-Q):

Encode::encode('MIME-Q', "a\N{U+262D}c");
=?UTF-8?Q?a=E2=98=ADc?=

It encodes and escapes at the same time.
Beware of confusion.

Keep strings as strings for as long as you can.
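
A round-trip sketch for the email header case, using the MIME-Q and MIME-Header encodings that ship with Encode. The exact header text may vary between Encode versions, so the sketch checks properties rather than a fixed output.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $subject = "a\N{U+262D}c";

# Encode and escape in one step, for use in a header.
my $header_value = encode('MIME-Q', $subject);

# The result should contain only ASCII bytes.
my $ascii_only = ($header_value !~ /[^\x00-\x7F]/) ? 1 : 0;

# Decoding the header form gives back the original string.
my $decoded = decode('MIME-Header', $header_value);

print "$header_value\n";
```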
Conclusion
Make sure you distinguish between Strings and
Bytes. In Perl, it must come from discipline.

Make sure you always encode/decode on i/o as
explicitly as possible. Don't let confused others
confuse you.

Always wonder: what does this thing operate
on, Bytes or Strings? When in doubt, investigate.
