expect("💩").length
.toBe(1)
No. It’s not “just strings”
Where’s the problem?
• "".length == 0
• "a".length == 1
• "ä".length == 1
See? No problem here
What gives?
The question is: What are
we counting?
But first: Even more
fundamental
What is a string?
• Computers are really good at dealing with
numbers
• Humans are really good at dealing with text
• Marketing demands that computers do text
• We need to translate between text and
numbers
When all you have is a hammer,
every problem looks like a nail
In the beginning
• Array of characters
• C says char*
• char is defined as the “smallest addressable
unit that can contain basic character set”.
Integer type. Might be signed or unsigned
• char ends up being a byte
• Assign some meaning to each byte value
Interacting with the
world
• Just dump the contents of the memory into
a file
• Read back the same contents and put it in
memory
• Problem solved.
• Until you need to do this across machines
Interoperability
• char and the mapping from number to
letter is inherently implementation
dependent
• So is by definition the file you dump your
char* into
• Can’t move files between machines
ASCII
• “American Standard Code for Information
Interchange”
• Published 1963
• Uses 7 bits per character (circumventing
the signedness-issue)
• Perfectly fine for what everybody is using
(English)
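A minimal sketch of the 7-bit idea in JavaScript: every ASCII character maps to a number below 128. The helper name `isAscii` is made up for this illustration.

```javascript
// Hypothetical helper: true if every UTF-16 code unit fits into 7 bits.
function isAscii(text) {
  for (let i = 0; i < text.length; i++) {
    if (text.charCodeAt(i) > 0x7f) return false;
  }
  return true;
}

// The numbers behind "Hi!" in ASCII.
const codes = Array.from("Hi!", (ch) => ch.charCodeAt(0));
console.log(codes);                                // [72, 105, 33]
console.log(isAscii("Hi!"));                       // true
console.log(isAscii("\u00fcml\u00e4\u00fcte"));    // false ("ümläüte")
```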
But I need ümläüte
• Machines were used where people speak strange
languages (i.e. not English)
• ASCII is 7-bit.
• Most machines have 8-bit bytes.
• Using that additional bit gives us another 128
characters!
• Depending on your country, these upper 128 characters
had different meanings
• No problem as texts usually don’t leave their country
remember “chcp 850”?
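To see why the upper half was a problem, here is a small sketch that decodes the same byte with two different code pages. It assumes a runtime whose `TextDecoder` supports legacy encodings (browsers, or Node.js built with full ICU). The two Windows code pages are picked for illustration only; code page 850 itself is not available through `TextDecoder`.

```javascript
// One byte, two code pages: the number stays the same, the letter does not.
const oneByte = new Uint8Array([0xe4]);

const western = new TextDecoder("windows-1252").decode(oneByte);
const cyrillic = new TextDecoder("windows-1251").decode(oneByte);

console.log(western);  // "ä"
console.log(cyrillic); // "д"
```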
Then the Internet
happened
Thüs wäs nöt pюssїҌlҿ!
I apologize to all Russians for butchering their script.
Unicode 1.0
• 16 bits per character
• Published in 1991, revised in 1992
• Jumped on by everybody who wanted “to
do it right”
• APIs were made Unicode compliant by
extending the size of a character to 16 bits.
It’s the wild west 90ies
Still just dumping
memory
• wchar is 16 bits
• Endianness? See if we care!
• To save to a file: Dump memory contents.
• To load from a file: Read file into memory
• Note they didn’t dare to extend char to 16
bits
• Let’s call this “Unicode”
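What "Endianness? See if we care!" means in practice, as a rough sketch: the same 16-bit value lands in memory (and therefore in the dumped file) in two different byte orders. `dump16` is a made-up helper name.

```javascript
// Hypothetical helper: store one 16-bit value and return the resulting bytes.
function dump16(value, littleEndian) {
  const buffer = new ArrayBuffer(2);
  new DataView(buffer).setUint16(0, value, littleEndian);
  return Array.from(new Uint8Array(buffer));
}

// "ä" is U+00E4.
const littleEndianBytes = dump16(0x00e4, true);  // e4 00
const bigEndianBytes = dump16(0x00e4, false);    // 00 e4
console.log(littleEndianBytes, bigEndianBytes);

// Reading a little-endian dump on a big-endian machine yields a different value.
const misread = new DataView(new Uint8Array(littleEndianBytes).buffer).getUint16(0, false);
console.log(misread.toString(16)); // e400
```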
16 bits everywhere
• Windows API (XxxxXxxW uses wchar
which is 16 bit wide)
• Java uses 16 bits
• Objective C uses 16 bits
• And of course, JavaScript uses 16 bits
• C and by extension Unix stayed away from
this.
That’s perfect. By using
16 bit characters, we
can store all of Unicode!
65K characters are
enough for everybody
640K are enough for
everybody
It didn’t work out so
well
• By just dumping memory, there’s no way to
know how to read it back
• You have no clue whether you have just read
Unicode or old-style data.
• Remember: it’s just numbers.
• Heuristics suck (try typing “Bush hid the
facts” in Windows Notepad, saving,
reloading)
BOM
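The byte order mark (U+FEFF written at the start of a file) gives a reader something better than heuristics. A minimal sketch of BOM sniffing, with a made-up helper name and UTF-32 left out for brevity:

```javascript
// Hypothetical helper: guess the encoding from a leading byte order mark.
// Returns null when no known BOM is present.
function sniffBom(bytes) {
  if (bytes[0] === 0xef && bytes[1] === 0xbb && bytes[2] === 0xbf) return "utf-8";
  if (bytes[0] === 0xff && bytes[1] === 0xfe) return "utf-16le";
  if (bytes[0] === 0xfe && bytes[1] === 0xff) return "utf-16be";
  return null;
}

console.log(sniffBom(new Uint8Array([0xff, 0xfe, 0xe4, 0x00]))); // "utf-16le"
console.log(sniffBom(new Uint8Array([0xfe, 0xff, 0x00, 0xe4]))); // "utf-16be"
console.log(sniffBom(new Uint8Array([0x42, 0x75, 0x73, 0x68]))); // null, plain "Bush"
```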
We learned
• Unicode Transformation Format has
happened
• specifically UTF-8 happened
• Unicode 2.0 happened
• Programming environments learned
Unicode 2.0+
• Theoretically unlimited code space
• Doesn’t talk about bits any more
• The terminology is code point.
• Currently about 1.1M code points (U+0000 to U+10FFFF)
• The old characters (U+0000 to U+FFFF) are on
the BMP (Basic Multilingual Plane)
Unicode Transformation
Format
• Specifies how to store Unicode on disk
• Specifies exact byte encoding for every
Unicode code point
• Available with 8-, 16- and 32-bit code units
(UTF-8, UTF-16, UTF-32)
• Not every byte sequence is a valid UTF
byte sequence (finally!)
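A small sketch of that last point: a strict decoder can reject bytes that are not valid UTF-8 instead of guessing. `isValidUtf8` is a made-up helper built on the standard `TextDecoder` with `fatal: true`.

```javascript
// A decoder that throws instead of silently inserting replacement characters.
const strictUtf8 = new TextDecoder("utf-8", { fatal: true });

// Hypothetical helper: true if the bytes form a valid UTF-8 sequence.
function isValidUtf8(bytes) {
  try {
    strictUtf8.decode(bytes);
    return true;
  } catch (error) {
    return false;
  }
}

console.log(isValidUtf8(new Uint8Array([0xc3, 0xa4]))); // true: "ä"
console.log(isValidUtf8(new Uint8Array([0xc3, 0x28]))); // false: 0x28 is not a continuation byte
console.log(isValidUtf8(new Uint8Array([0xff])));       // false: 0xff never occurs in UTF-8
```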
UTF-8
• Uses 8-bit code units to store code points
• Is the same as ASCII for whatever’s in ASCII
• Uses multiple bytes to encode code points
outside of ASCII
• The old algorithms don’t work any more
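To make the variable width concrete, a short sketch using the standard `TextEncoder` (which always produces UTF-8). `utf8Bytes` is a made-up helper name.

```javascript
const utf8 = new TextEncoder();

// Hypothetical helper: the UTF-8 bytes of a string as a plain array.
function utf8Bytes(text) {
  return Array.from(utf8.encode(text));
}

console.log(utf8Bytes("a").length);         // 1 byte, same as ASCII
console.log(utf8Bytes("\u00e4").length);    // 2 bytes for "ä"
console.log(utf8Bytes("\u20ac").length);    // 3 bytes for "€"
console.log(utf8Bytes("\u{1F4A9}").length); // 4 bytes for the pile of poo
```

This is why "one byte is one character" algorithms stop working: byte offsets and character positions no longer line up.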
UTF-16
• Combines the worst of both worlds
• Uses one 16-bit code unit to encode a code point
on the BMP
• Uses surrogate pairs to encode a code point
outside of the BMP
• Wastes memory for ASCII, has byte-ordering
issues and still breaks the old algorithms.
• Is the only way for these 16bit bandwagon
jumpers to support Unicode 2.0 and later
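How a surrogate pair is computed, as a minimal sketch. `toSurrogatePair` is a made-up helper and assumes its input is a code point above U+FFFF.

```javascript
// Hypothetical helper: split a code point outside the BMP into two 16-bit code units.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;        // 20 bits remain
  const high = 0xd800 + (offset >> 10);      // upper 10 bits
  const low = 0xdc00 + (offset & 0x3ff);     // lower 10 bits
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1f4a9); // PILE OF POO
console.log(high.toString(16), low.toString(16));            // d83d dca9
console.log(String.fromCharCode(high, low) === "\u{1F4A9}"); // true
```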
UTF-32
• 4 bytes per code point
• Byte ordering issues
• Still breaking the old algorithms due to
combining marks
So. What is a String?
It’s not a collection of
bytes
• A string is a sequence of graphemes
• Or a sequence of Unicode Code Points
• A byte array is a sequence of bytes
• Both are incompatible with each other
• You can encode a string into a byte array
• You can decode a byte array into a string
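The encode and decode steps, sketched with the standard `TextEncoder` and `TextDecoder`. UTF-8 is just the encoding chosen for this example.

```javascript
const text = "\u00fcml\u00e4\u00fcte"; // "ümläüte"

// Encode: string -> byte array (UTF-8).
const bytes = new TextEncoder().encode(text);

// Decode: byte array -> string. The encoding has to be known.
const roundTrip = new TextDecoder("utf-8").decode(bytes);

console.log(text.length);        // 7
console.log(bytes.length);       // 10
console.log(roundTrip === text); // true
```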
Back to counting
Intermission: Combining
Marks
• ä is not the same as ä
• ä can be “LATIN SMALL LETTER A WITH
DIAERESIS”
• it can also be “LATIN SMALL LETTER A”
followed by “COMBINING DIAERESIS”
• both look exactly the same
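The two spellings of ä in code, as a short sketch. It uses `String.prototype.normalize`, which JavaScript gained with ES2015.

```javascript
const precomposed = "\u00e4"; // LATIN SMALL LETTER A WITH DIAERESIS
const decomposed = "a\u0308"; // LATIN SMALL LETTER A + COMBINING DIAERESIS

console.log(precomposed === decomposed);                  // false
console.log(precomposed.length, decomposed.length);       // 1 2
console.log(precomposed === decomposed.normalize("NFC")); // true
console.log(precomposed.normalize("NFD") === decomposed); // true
```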
Counting Lengths
• You could count the length in graphemes
• Or the length in Unicode code points
• Or the length of your binary blob once
your string has been encoded.
• Or something in between because you
were trigger-happy in the 90ies
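All four ways of counting side by side, as a rough sketch. `measure` is a made-up helper; it assumes a runtime that provides `Intl.Segmenter` (recent browsers and Node.js versions).

```javascript
// Hypothetical helper: four different "lengths" of the same string.
function measure(text) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return {
    graphemes: Array.from(segmenter.segment(text)).length,
    codePoints: Array.from(text).length,
    utf8Bytes: new TextEncoder().encode(text).length,
    utf16Units: text.length, // the "something in between"
  };
}

console.log(measure("a\u0308"));   // graphemes 1, codePoints 2, utf8Bytes 3, utf16Units 2
console.log(measure("\u{1F4A9}")); // graphemes 1, codePoints 1, utf8Bytes 4, utf16Units 2
```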
Which brings us back to
JS, Java and friends
• Live back in 1996
• Strings specified as being stored in UCS-2
(Fixed 16 bits per character)
• Leak their implementation in the API
• Don’t know about Unicode 2.0
Cheating abounds
• Applications still want to support Unicode
2.0
• We need to display these piles of poo!
• String APIs use UTF-16, encoding
characters outside of the BMP as surrogate
pairs
• String APIs often don’t know about UTF-16
Let’s talk ECMAScript
String methods are leaky
• String.length returns the number of UTF-16 code
units: neither byte length nor code point length
for strings with characters outside the BMP
• substr() can break strings
• charAt() can return non-existing code points
(lone surrogates)
• and let’s not talk about to*Case
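A few of these leaks reproduced, as a minimal sketch. `slice` is used for cutting here; `substr` counts code units the same way. The `ß` case is just one example of why case conversion is its own topic.

```javascript
const poo = "\u{1F4A9}"; // PILE OF POO, outside the BMP

// length counts UTF-16 code units.
console.log(poo.length); // 2

// charAt returns half of a surrogate pair.
console.log(poo.charAt(0) === "\ud83d"); // true

// Cutting by code units can split a pair and leave a lone surrogate behind.
const cut = ("a" + poo).slice(0, 2);
console.log(cut === "a\ud83d"); // true

// Case conversion can change the length of a string.
console.log("\u00df".toUpperCase()); // "SS"
```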
Samples
Et tu RegEx?
• Character classes don’t
work right
• Counting characters
doesn’t work right
• Can break strings
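The three regex complaints in code, as a sketch. The `u` flag shown for comparison arrived with ES2015; without it, patterns work on UTF-16 code units.

```javascript
const poo = "\u{1F4A9}";

// Counting characters: "." matches one code unit unless the u flag is set.
console.log(/^.$/.test(poo));         // false
console.log(/^.$/u.test(poo));        // true
console.log(poo.match(/./g).length);  // 2
console.log(poo.match(/./gu).length); // 1

// Character classes: without u, the class holds two separate surrogates.
console.log(new RegExp("^[" + poo + "]$").test(poo));      // false
console.log(new RegExp("^[" + poo + "]$", "u").test(poo)); // true

// Breaking strings: replacing "one character" leaves a lone surrogate.
const broken = poo.replace(/./, "?");
console.log(broken === "?\udca9"); // true
```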
No Normalization
Real-World example
«("💩")» is 5 graphemes long. I counted
6 underscores :-)
Also: Mixed Content Warning? What gives?
Some did it ok
• Python 3.3 (PEP 393)
• Ruby 1.9 (avoids political issues by giving a
lot of freedom)
• Perl (awesome libraries since forever)
• Swift after a very rough start
• ICU, ICU4C (http://icu-project.org/)
ES2015
still some 🐞 around
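Some of what ES2015 added, as a short sketch. The last lines show one remaining gap chosen for illustration (it is not necessarily the bug the slide had in mind): iteration works on code points, not graphemes.

```javascript
// Code point aware construction and inspection.
const poo = String.fromCodePoint(0x1f4a9);
console.log(poo === "\u{1F4A9}");             // true, \u{...} escapes are new too
console.log(poo.codePointAt(0).toString(16)); // 1f4a9

// String iteration (spread, for...of) walks code points, not code units.
console.log([...poo].length); // 1

// Still not graphemes: a combining sequence counts as two.
const decomposed = "a\u0308";
console.log([...decomposed].length); // 2
```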
This was just the tip of
the iceberg!
• Localization issues (Collation, Case change)
• Security issues (Encoding, Homographs)
• Broken Software (including “US UTF-8”)
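One taste of the collation issue, as a rough sketch. It assumes a runtime with German and Swedish locale data for `Intl.Collator` (browsers, or Node.js built with full ICU).

```javascript
const words = ["z", "\u00e4", "a"]; // "z", "ä", "a"

// Default sort compares UTF-16 code units: ä (U+00E4) lands after z.
const byCodeUnit = [...words].sort();

// German sorts ä next to a, Swedish sorts it after z.
const german = [...words].sort(new Intl.Collator("de").compare);
const swedish = [...words].sort(new Intl.Collator("sv").compare);

console.log(byCodeUnit.join(" ")); // a z ä
console.log(german.join(" "));     // a ä z
console.log(swedish.join(" "));    // a z ä
```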
Highly recommended
Literature
Pop Quiz
"👨‍❤️‍💋‍👨".length
[..."👨‍❤️‍💋‍👨"].length
Thank you!
• @pilif on twitter
• https://github.com/pilif/
In case you answered
11 and 8, I salute you
• U+1F468 (MAN) 👨
• U+200D (ZERO WIDTH JOINER)
• U+2764 (HEAVY BLACK HEART) ❤
• U+FE0F (VARIATION SELECTOR-16)
• U+200D (ZERO WIDTH JOINER)
• U+1F48B (KISS MARK) 💋
• U+200D (ZERO WIDTH JOINER)
• U+1F468 (MAN) 👨
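The quiz answer can be checked by building the string from exactly the code points listed above. A minimal sketch:

```javascript
// The eight code points of the kiss emoji sequence, in order.
const kiss = String.fromCodePoint(
  0x1f468, // MAN
  0x200d,  // ZERO WIDTH JOINER
  0x2764,  // HEAVY BLACK HEART
  0xfe0f,  // VARIATION SELECTOR-16
  0x200d,  // ZERO WIDTH JOINER
  0x1f48b, // KISS MARK
  0x200d,  // ZERO WIDTH JOINER
  0x1f468  // MAN
);

console.log(kiss.length);      // 11 UTF-16 code units
console.log([...kiss].length); // 8 code points
```

Three of the eight code points sit outside the BMP and take two code units each, which is where 11 comes from.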
