What character is THAT?




  Anders Karlsson
 anders@skysql.com
Agenda
• About Anders Karlsson
• Part 1 - The gruesome background
  • The history of character sets and
    collations
  • The “classic” 7 and 8 bit ASCII
    character sets
• Part 2 – UNICODE Rocks!
  • What is UNICODE and encodings
  • Why UTF-8 is smart. Or not so smart
• Part 3 - MySQL and UNICODE
• Questions? Answers?
About Anders Karlsson

• Senior Sales Engineer at SkySQL
• Former Database Architect at Recorded Future, Sales
  Engineer and Consultant with
  Oracle, Informix, TimesTen, MySQL / Sun / Oracle etc.
• Has been in the RDBMS business for 20+ years
• Has also worked as Tech Support engineer, Porting
  Engineer and in many other roles
• Outside SkySQL I build websites
  (www.papablues.com), develop Open Source software
  (MyQuery, mycleaner etc), am a keen
  photographer, have an affection for English Real Ales
  and a great interest in computer history
22/11/2012            SkySQL Ab 2011 Confidential         3
Part 1 – The history which we
are not to ignore (but which has
already been ignored several
times)
The history of Character Sets and
collations


• At first there were no characters, only numbers
• Then on the 7th day we realized characters and
  words were a good thing, but that computers
  can only handle numbers, so we needed a way
  of representing characters as numbers
• So we got different mappings from characters to
  numbers: ASCII, EBCDIC, FIELDATA, Baudot
  etc, in different variations (in particular
  EBCDIC)
ASCII – The mother of character sets

• For anyone not being a masochist (i.e.
  anyone not using a mainframe), the character
  set of choice soon became 7-bit ASCII
  (American Standard Code for Information
  Interchange), first published in 1963
• 7-bits was enough for US English characters
  and control characters, with some legroom
  (note that ASCII is US English, not UK
  English, centric)
• The 8th bit was used for parity in transmission
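
As a rough illustration of the parity idea above, here is a minimal Python sketch. The slide only says "parity"; even parity is an assumption made here, and the function name with_even_parity is made up for this example.

```python
def with_even_parity(ch: str) -> int:
    """Return the 8-bit value of a 7-bit ASCII character with an
    even-parity bit placed in bit 8 (assumed scheme, see lead-in)."""
    code = ord(ch)
    if code > 0x7F:
        raise ValueError("not a 7-bit ASCII character")
    ones = bin(code).count("1")
    parity = ones % 2  # 1 when the 7 data bits hold an odd number of 1s
    return (parity << 7) | code


# 'A' has two 1-bits, so the parity bit stays 0.
# 'C' has three 1-bits, so the parity bit is set.
print(format(with_even_parity("A"), "08b"))
print(format(with_even_parity("C"), "08b"))
```

With bit 8 spent on parity like this, there is no room left for a 129th character, which is where the trouble on the next slides comes from.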
All ASCII hell breaks loose

• As the original 7-bit US ASCII didn’t support
  anything but US English, variations started to
  appear.
• Any decent computer supported 8-bit
  characters, but the assumption was still
  that bit 8 was a parity bit.
• So 7-bit local variations were
  developed, Swedish 7-bit ASCII for example
  (anyone coding in C knows and hates this)
And then we get 8-bit ASCII hell!

• Extended 8-bit ASCII solves a few problems, but also
  introduces a few new ones. Most of the new problems
  came from an attempt to make 8-bit Extended ASCII
  compatible with 7-bit ASCII variations
• The Extended 8-bit “ASCII” character sets are largely
  standardized as ISO 8859 (with variations). Most
  common is ISO 8859-1 (latin-1)
• 8859-15 is a not so popular 8859-1 update, including a
  Euro-sign among a few other
  things. Whether the Euro-sign really is a useful
  addition is yet to be determined
• Another 8859-1 variation is Windows CP1252,
  which is an enhanced 8859-1 character set
Oh, then we have collations!
• A “collation” determines how characters in a
  character set are to be sorted!
• 7-bit ASCII was great (numeric order same as
  character order)
   – Or was it? Really? Upper / Lower case?
• 7-bit localized ASCII was not so great. To say the
  least. Swedish 7-BIT ASCII was not correctly
  sorted (å last in the alphabet, after ä and ö)
• 8-bit Extended ASCII didn’t help much (Swedish
  again being in the wrong order, but not the same
  wrong order as with 7-bit “Swedish ASCII”)
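
To make the sorting problem concrete, here is a small Python sketch. It compares plain numeric (code point) order with a hand-made Swedish ordering. The SWEDISH_ALPHABET table and the swedish_key function are hypothetical helpers written for this example, not a real collation implementation.

```python
words = ["öl", "ål", "äl", "Zebra", "apa"]

# Plain code point order: upper case sorts before lower case, and
# ä (U+00E4) sorts before å (U+00E5), which is wrong for Swedish.
by_code_point = sorted(words)

# Hand-made Swedish ordering (an assumption for this sketch):
# a-z followed by å, ä, ö, compared without regard to case.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"
RANK = {ch: i for i, ch in enumerate(SWEDISH_ALPHABET)}


def swedish_key(word: str) -> list:
    # Unknown characters sort after everything in the table.
    return [RANK.get(ch, len(RANK)) for ch in word.lower()]


by_swedish = sorted(words, key=swedish_key)

print(by_code_point)
print(by_swedish)
```

Same characters, same encoding, two different orders. The numeric values alone cannot tell you which one is right.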
Collation basics

         • Don’t ever think that the character
           set determines the sorting!
            – The same character set used in
              different countries may be sorted
              differently
            – Different sorting models may be used in
              the same country (A good example is
              case sensitivity)
         • Also, collations are not only about
           sorting, it’s also about comparisons
           and a few other things
Interoperating with ASCII
• As long as we were all using 1 single computer
  or a bunch of similar computers in a LAN, the
  issues were limited
• As usual, the Internet turned this beautiful
  environment into something truly evil!
• Internet got started in the US
  – Which means, again, that the founders were
    convinced that 7-bit ASCII would be OK. That this
    had been an incorrect assumption 30 years before
    Facebook came around made no difference. Of
    course not.
Interoperability necessities
• For us to be able to communicate we need to
  be able to tell what character set we expect
  here at the client side, the server has to tell
  what it delivers, and then we need a way to
  align all this.
• The trick: <meta http-equiv=Content-Type
  content="text/html; charset=iso-8859-1">
  – Or maybe not? This tells what I get, but doesn’t
    allow me to say what I want!
• Actually, this didn’t help as much as we hoped
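
What happens when the two sides disagree about the character set can be sketched in a few lines of Python. The sample word is just an illustration. The point is that the bytes are identical and only the interpretation differs.

```python
text = "Smörgåsbord"

# The server sends UTF-8 bytes...
utf8_bytes = text.encode("utf-8")

# ...but the client believes it is getting ISO 8859-1.
wrong = utf8_bytes.decode("iso-8859-1")

# When both sides agree, the text survives.
right = utf8_bytes.decode("utf-8")

print(len(text), len(utf8_bytes))  # characters versus bytes
print(wrong)
print(right)
```

Each of ö and å takes two bytes in UTF-8, and a client reading them as ISO 8859-1 shows two garbage characters for each one.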
Part 1 Conclusion

• The many different local variations of
  characters served us well, for a while
• Now we have a global IT environment with
  many different character sets and
  collations, and we can’t deal with multiple
  local versions anymore
• And we have languages whose character set
  will not fit in 8 bits anyway
• And then we need to sort and compare all this!
Part 2 – UNICODE and Ken
Thompson saves the
world, without Batman and not by
tracking down the penguin
UNICODE – One Character set for all

• Yes, that is what UNICODE (or ISO/IEC 10646)
  sets out to do – A common character set for
  ALL languages (close to 240.000 characters are
  defined in UNICODE 4.1 today, MySQL is
  somewhat at UNICODE 3.0). Sort of.
• This means that UNICODE has character codes
  that cannot fit in 1 byte. This is a big surprise
  to anyone on the other side of the pond, but
  there you go
• But there is a remedy: UNICODE Encodings!
UNICODE Encodings

• A UNICODE encoding is a standardized way of
  representing a character in the UNICODE
  character set
• UNICODE encodings represent select parts of
  the full UNICODE character set
• UNICODE encodings are part of the UNICODE
  standard itself (and this is a VERY good thing!
  If this wasn’t the case, both Apple and
  Microsoft would have invented their own
  encodings I’m sure)
UNICODE Encodings

• Among the UNICODE encodings are
  – UCS-2 – 2 bytes wide (i.e. only 64k different
    characters can be represented)
  – UTF-16 – 2 or 4 bytes wide. This is then a variable
    length scheme with a very complex setup. When
    only 2 bytes are used, they are the same as UCS-2
  – UTF-32 – 4 bytes fixed size
• To be honest, besides UTF-16 / UCS-2 that is
  common in Windows and related frameworks
  (like COM), none of these are very popular
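
A quick way to see the size differences between these encodings is to let Python encode a few sample characters. This is only a sketch: the helper name encoded_sizes is made up here, and the big-endian codec variants are used so that no byte order mark is added to the output.

```python
def encoded_sizes(ch: str) -> dict:
    """Number of bytes one character takes in each encoding."""
    encodings = ("utf-8", "utf-16-be", "utf-32-be")
    return {enc: len(ch.encode(enc)) for enc in encodings}


samples = {
    "A": "A",                  # plain ASCII
    "a-ring": "\u00e5",        # å, Latin-1 range
    "euro": "\u20ac",          # €, still inside the BMP
    "emoji": "\U0001F600",     # outside the BMP
}

for name, ch in samples.items():
    print(name, encoded_sizes(ch))
```

UTF-32 is always 4 bytes. UTF-16 is 2 bytes until a character falls outside the first 64k, at which point it needs 4 bytes, which is exactly the case UCS-2 cannot represent.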
UTF-8 – Some smart dudes at work!

• The problem that UNICODE has is that it has
  to represent all those characters. This should
  break some applications for sure.
• Well, Encodings solve that too, and the
  mother of all encodings is UTF-8.
  Invented not by Albert Einstein or
  Batman but by Ken Thompson!
• Let’s now have a round of applause
  for Ken Thompson!
The details of UTF-8
• UNICODE characters 0 – 127 are the same as
  in standard 7-bit ASCII (remember that?)
• UTF-8 works the same: For characters 0 –
  127, the most significant (first) bit of the first
  (and only) byte is 0
• Beyond 7-bit ASCII characters, the number of
  “leading” 1’s in the first byte tells how many
  bytes make up the character
• All other bytes start with a 1 and a 0
• And the rest of the bits make up the character
The details of UTF-8
• So in the first byte, it is one of two things:
   – A leading 0 meaning a single byte character
   – A number of 1’s (at least 2, as 1 byte characters
     are indicated by a leading 0) followed by a 0
      • This means that the first byte in a character NEVER
        starts with the sequence 10
   – All other bytes start with 10
   – 1 UTF-8 byte can contain up to 7 bits of data
   – 2 UTF-8 bytes contain from 8 to 11 bits of data
   – 3 UTF-8 bytes contain from 12 to 16 bits of data
   – 4 UTF-8 bytes contain from 17 to 21 bits of data
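
The bit layout is easier to follow when written out by hand. Below is a minimal Python sketch of a UTF-8 encoder for a single code point. The function name utf8_encode is made up for this example, and the sketch deliberately skips the surrogate range check that a real encoder performs. Python's own str.encode is used only to cross-check the result.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point following the leading-bits scheme."""
    if code_point < 0x80:
        # Up to 7 bits: 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:
        # Up to 11 bits: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:
        # Up to 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x110000:
        # Up to 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (code_point >> 18),
                      0x80 | ((code_point >> 12) & 0x3F),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    raise ValueError("code point out of range")


for cp in (0x41, 0xE5, 0x20AC, 0x1F600):
    encoded = utf8_encode(cp)
    # Cross-check against the built-in encoder.
    assert encoded == chr(cp).encode("utf-8")
    print(hex(cp), [format(b, "08b") for b in encoded])
```

Printing the bytes in binary shows the pattern directly: one leading byte announcing the length, then continuation bytes that all start with 10.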
Some useful aspects of UTF-8

• You can always find the leading byte of a
  character in a word, starting from any byte
  – Just move “backward” until a byte not having a
    leading 10 is found
• Byte values 0 – 127 are ONLY present as
  character values 0 – 127, nowhere else!
  – All other byte values have the highest bit set
  – So strlen(), strcmp() etc. still work, but on a byte
    by byte, not character by character, level
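
Both properties can be sketched in Python on a raw byte string. The helper names is_continuation, char_start and count_characters are made up for this example, and the sketch assumes the input is valid UTF-8.

```python
def is_continuation(byte: int) -> bool:
    """True for bytes of the form 10xxxxxx."""
    return (byte & 0xC0) == 0x80


def char_start(data: bytes, index: int) -> int:
    """Move backward from index to the leading byte of its character."""
    while index > 0 and is_continuation(data[index]):
        index -= 1
    return index


def count_characters(data: bytes) -> int:
    """Count characters by counting the bytes that start one."""
    return sum(1 for b in data if not is_continuation(b))


data = "på €".encode("utf-8")   # p, å (2 bytes), space, € (3 bytes)

print(len(data))                # byte length, what strlen() would see
print(count_characters(data))   # character length
print(char_start(data, 6))      # last byte of € leads back to its first byte
```

The byte count and the character count differ, which is the strlen() caveat from the slide in practice.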
So, are we all OK with UTF-8 now?
• Let’s see. Using UTF-8 we can represent
  binary values with up to 21 bits, which is
  2.097.152 characters! Which is
  more than enough! (But 640K
  RAM was ALSO more than enough)
• If we limit ourselves to 3 bytes UTF-8
  we can represent 65.536 different
  characters, the same as if we use
  UCS-2 (which is fixed 2-byte format).
  65.536 characters is what is in the
  UNICODE Basic Multilingual Plane
Why we actually need 4-bytes UTF-8

• Beyond the BMP come a couple of other
  “planes”. The one that causes most issues is
  the one that adds a bunch of
  Chinese, Japanese and Korean characters
• For these, we need to go beyond the BMP and
  hence beyond the nice and cosy 65.536
  characters. Duh!
• And this is why the MySQL assumption that
  UTF-8 means a maximum of 3 bytes might not
  be such a good idea after all
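
Checking whether a string contains characters outside the BMP is a one-line test on the code point. The Python sketch below shows it. The function name chars_outside_bmp is made up for this example, and U+20000 is used as a sample character from the supplementary CJK range.

```python
BMP_MAX = 0xFFFF  # last code point of the Basic Multilingual Plane


def chars_outside_bmp(text: str) -> list:
    """Characters that need 4 bytes in UTF-8."""
    return [ch for ch in text if ord(ch) > BMP_MAX]


inside = "\u6f22"        # a common CJK character, inside the BMP
outside = "\U00020000"   # a CJK character beyond the BMP

print(len(inside.encode("utf-8")))
print(len(outside.encode("utf-8")))
print([hex(ord(ch)) for ch in chars_outside_bmp("abc" + inside + outside)])
```

A check like this could be run on data before loading it into a column that only accepts the 3-byte form of UTF-8.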
Part 3 – MySQL and UNICODE
So, how does MySQL handle all this?
• MySQL supports a whole range of UNICODE
  Encodings and collations! Good!
• MySQL understands the case when we have
  one character set stored in a column in a table
  and another one on the client side, and nicely
  does a conversion for us! Good!
• Not all UNICODE Encodings are valid on the
  Client side! Not so good
• Actually, anything beyond UTF-8, when it
  comes to UNICODE on the client side, is
  troublesome
Lessons in MySQL and UNICODE

• Lesson 1: Learn about UNICODE and
  understand how it works
• Lesson 2: Stick with UTF-8! Most others do
  that too. Including Java, JavaScript, JSON, the
  web and many, many others!
• Lesson 3: UCS-2 may seem like a good idea, it
  is fixed length after all. It’s not (a good idea
  that is, fixed length it is)
• Lesson 4: Don’t forget about collations! They
  are important!
Collations – The Sequel

• Collations determine how strings are sorted
  – Order by
  – Indexes
  – WHERE col1 > 'Über'
• Collations determine how strings are
  compared
  – Is A = Ä or not? Y = Ü?
• Watch out in particular for COLLATIONS used for
  PRIMARY KEYs
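
To illustrate how the answer to "Is A = Ä?" depends on the comparison rules, here is a rough Python sketch. It is only a stand-in for the idea: the fold function below (a made-up helper) strips accents and case using the standard unicodedata module, and it does not reproduce the rules of any actual MySQL collation.

```python
import unicodedata


def fold(text: str) -> str:
    """Drop accents and case (a crude, assumed comparison rule)."""
    decomposed = unicodedata.normalize("NFD", text)
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return without_marks.casefold()


def equal_binary(a: str, b: str) -> bool:
    return a == b


def equal_insensitive(a: str, b: str) -> bool:
    return fold(a) == fold(b)


print(equal_binary("A", "Ä"), equal_insensitive("A", "Ä"))
print(equal_binary("Y", "Ü"), equal_insensitive("Y", "Ü"))
```

With a rule like this on a unique key, 'A' and 'Ä' would count as duplicates, which is why the collation on a PRIMARY KEY deserves a second look.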
Storing UTF-8 data in MySQL

• Most Storage Engines are happy to use utf-8
• The MySQL Interpretation of UTF-8 is 1 – 3
  bytes, or 65.536 different characters!
  – This means that
     • A CHAR(10) column requires 30 bytes fixed space!
     • A VARCHAR(10) column is limited to 30 bytes
• MySQL 5.5 and up also supports 4-byte UTF-8
  by using the character set utf8mb4
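
The arithmetic behind those numbers, next to what a typical value really uses, can be sketched like this in Python. The function names reserved_bytes and used_bytes are made up for this example. The 3 and 4 bytes per character are the worst cases for utf8 and utf8mb4 given above.

```python
def reserved_bytes(char_length: int, max_bytes_per_char: int) -> int:
    """Worst-case space for a column of char_length characters."""
    return char_length * max_bytes_per_char


def used_bytes(value: str) -> int:
    """Bytes a value actually takes when stored as UTF-8."""
    return len(value.encode("utf-8"))


print(reserved_bytes(10, 3))   # CHAR(10) with 3-byte UTF-8
print(reserved_bytes(10, 4))   # CHAR(10) with 4-byte UTF-8 (utf8mb4)
print(used_bytes("Anders"))    # plain ASCII, 1 byte per character
print(used_bytes("Malmö"))     # one 2-byte character among ASCII
```

The gap between the reserved and the used figure is the space a fixed-width layout gives away for mostly-ASCII data.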
Storing UTF-8 data in MySQL
• VARCHAR columns are actually fixed in some
  Storage Engines, most notably those engines
  developed sometime around the time of the
  American Civil War, when variable length data
  was still in its infancy
• UTF-8 can potentially waste A LOT of space
• Extra space for UTF-8 also affects byte size
  limits, such as VARCHAR and INDEX sizes
• UTF-8 data sorting is way more complex than
  a simple binary sort (so in some ways, things
  were better in the old 7-bit ASCII days)
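
How a byte limit turns into a character limit can be sketched in Python as below. The 767-byte figure is an assumption used for illustration (the classic InnoDB index key prefix limit); check the limit of the engine and version actually in use. The function name max_indexable_chars is made up for this example.

```python
# Worst-case bytes per character: latin1 is single byte, MySQL's
# utf8 stops at 3 bytes and utf8mb4 goes up to 4.
MAX_BYTES_PER_CHAR = {"latin1": 1, "utf8": 3, "utf8mb4": 4}


def max_indexable_chars(limit_bytes: int, charset: str) -> int:
    """Characters guaranteed to fit within limit_bytes."""
    return limit_bytes // MAX_BYTES_PER_CHAR[charset]


LIMIT = 767  # assumed example limit, see the note above

for charset in MAX_BYTES_PER_CHAR:
    print(charset, max_indexable_chars(LIMIT, charset))
```

The same byte limit that holds a whole latin1 column may only cover a prefix of the column once the character set is changed to a multi-byte one.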
Some simple demos, Questions
and Answers.
And I haven’t even begun to talk
about byte ordering and byte order
marks.
THANK YOU!


Anders Karlsson
anders@skysql.com
http://karlssonondatabases.blogspot.com

More Related Content

Viewers also liked

Character Encoding & Unicode - How to (╯°□°)╯︵ ┻━┻ with dignity
Character Encoding & Unicode - How to (╯°□°)╯︵ ┻━┻ with dignityCharacter Encoding & Unicode - How to (╯°□°)╯︵ ┻━┻ with dignity
Character Encoding & Unicode - How to (╯°□°)╯︵ ┻━┻ with dignity
Travis Fischer
 
learn-python
learn-pythonlearn-python
learn-python
Minwoo Park
 
Unicode and character sets
Unicode and character setsUnicode and character sets
Unicode and character setsrenchenyu
 
Unicode, PHP, and Character Set Collisions
Unicode, PHP, and Character Set CollisionsUnicode, PHP, and Character Set Collisions
Unicode, PHP, and Character Set Collisions
Ray Paseur
 
Character Sets
Character SetsCharacter Sets
Character Sets
Leo Hernandez
 
Digital Image Processing and Edge Detection
Digital Image Processing and Edge DetectionDigital Image Processing and Edge Detection
Digital Image Processing and Edge Detection
Seda Yalçın
 
Ascii and Unicode (Character Codes)
Ascii and Unicode (Character Codes)Ascii and Unicode (Character Codes)
Ascii and Unicode (Character Codes)
Project Student
 
Bank Account Of Life
Bank Account Of LifeBank Account Of Life
Bank Account Of LifeNafass
 
Strategies for Friendly English and Successful Localization
Strategies for Friendly English and Successful LocalizationStrategies for Friendly English and Successful Localization
Strategies for Friendly English and Successful Localization
John Collins
 
Translated Strings and Foreign Language Support in JavaScript Web Apps - OSCO...
Translated Strings and Foreign Language Support in JavaScript Web Apps - OSCO...Translated Strings and Foreign Language Support in JavaScript Web Apps - OSCO...
Translated Strings and Foreign Language Support in JavaScript Web Apps - OSCO...
Ken Tabor
 
แรงดันในของเหลว
แรงดันในของเหลวแรงดันในของเหลว
แรงดันในของเหลวtewin2553
 
Putting Out Fires with Content Strategy (InfoDevDC meetup)
Putting Out Fires with Content Strategy (InfoDevDC meetup)Putting Out Fires with Content Strategy (InfoDevDC meetup)
Putting Out Fires with Content Strategy (InfoDevDC meetup)
John Collins
 
Conspiracy Profile
Conspiracy ProfileConspiracy Profile
Conspiracy Profilecharlyheus
 
Designing for Multiple Mobile Platforms
Designing for Multiple Mobile PlatformsDesigning for Multiple Mobile Platforms
Designing for Multiple Mobile Platforms
Robert Douglas
 
Conspiracy Profile
Conspiracy ProfileConspiracy Profile
Conspiracy Profilecharlyheus
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
Mike Long
 
My trans kit checklist gw1 ds1_gw3
My trans kit checklist gw1 ds1_gw3My trans kit checklist gw1 ds1_gw3
My trans kit checklist gw1 ds1_gw3David Sommer
 
2008 Fourth Quarter Real Estate Commentary
2008 Fourth Quarter Real Estate Commentary2008 Fourth Quarter Real Estate Commentary
2008 Fourth Quarter Real Estate Commentaryalghanim
 
Sample of instructions
Sample of instructionsSample of instructions
Sample of instructionsDavid Sommer
 

Viewers also liked (20)

Character Encoding & Unicode - How to (╯°□°)╯︵ ┻━┻ with dignity
Character Encoding & Unicode - How to (╯°□°)╯︵ ┻━┻ with dignityCharacter Encoding & Unicode - How to (╯°□°)╯︵ ┻━┻ with dignity
Character Encoding & Unicode - How to (╯°□°)╯︵ ┻━┻ with dignity
 
learn-python
learn-pythonlearn-python
learn-python
 
Unicode and character sets
Unicode and character setsUnicode and character sets
Unicode and character sets
 
Unicode, PHP, and Character Set Collisions
Unicode, PHP, and Character Set CollisionsUnicode, PHP, and Character Set Collisions
Unicode, PHP, and Character Set Collisions
 
Character Sets
Character SetsCharacter Sets
Character Sets
 
Digital Image Processing and Edge Detection
Digital Image Processing and Edge DetectionDigital Image Processing and Edge Detection
Digital Image Processing and Edge Detection
 
Ascii and Unicode (Character Codes)
Ascii and Unicode (Character Codes)Ascii and Unicode (Character Codes)
Ascii and Unicode (Character Codes)
 
Bank Account Of Life
Bank Account Of LifeBank Account Of Life
Bank Account Of Life
 
Strategies for Friendly English and Successful Localization
Strategies for Friendly English and Successful LocalizationStrategies for Friendly English and Successful Localization
Strategies for Friendly English and Successful Localization
 
Translated Strings and Foreign Language Support in JavaScript Web Apps - OSCO...
Translated Strings and Foreign Language Support in JavaScript Web Apps - OSCO...Translated Strings and Foreign Language Support in JavaScript Web Apps - OSCO...
Translated Strings and Foreign Language Support in JavaScript Web Apps - OSCO...
 
แรงดันในของเหลว
แรงดันในของเหลวแรงดันในของเหลว
แรงดันในของเหลว
 
Putting Out Fires with Content Strategy (InfoDevDC meetup)
Putting Out Fires with Content Strategy (InfoDevDC meetup)Putting Out Fires with Content Strategy (InfoDevDC meetup)
Putting Out Fires with Content Strategy (InfoDevDC meetup)
 
Conspiracy Profile
Conspiracy ProfileConspiracy Profile
Conspiracy Profile
 
Designing for Multiple Mobile Platforms
Designing for Multiple Mobile PlatformsDesigning for Multiple Mobile Platforms
Designing for Multiple Mobile Platforms
 
Conspiracy Profile
Conspiracy ProfileConspiracy Profile
Conspiracy Profile
 
Silmeyiniz
SilmeyinizSilmeyiniz
Silmeyiniz
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
My trans kit checklist gw1 ds1_gw3
My trans kit checklist gw1 ds1_gw3My trans kit checklist gw1 ds1_gw3
My trans kit checklist gw1 ds1_gw3
 
2008 Fourth Quarter Real Estate Commentary
2008 Fourth Quarter Real Estate Commentary2008 Fourth Quarter Real Estate Commentary
2008 Fourth Quarter Real Estate Commentary
 
Sample of instructions
Sample of instructionsSample of instructions
Sample of instructions
 

Similar to What character is that

Lecture_ASCII and Unicode.ppt
Lecture_ASCII and Unicode.pptLecture_ASCII and Unicode.ppt
Lecture_ASCII and Unicode.ppt
Alula Tafere
 
Abap slide class4 unicode-plusfiles
Abap slide class4 unicode-plusfilesAbap slide class4 unicode-plusfiles
Abap slide class4 unicode-plusfilesMilind Patil
 
Unicode Encoding Forms
Unicode Encoding FormsUnicode Encoding Forms
Unicode Encoding Forms
Mehdi Hasan
 
Data Communication & Computer Networks : Data Types
Data Communication & Computer Networks : Data TypesData Communication & Computer Networks : Data Types
Data Communication & Computer Networks : Data Types
Dr Rajiv Srivastava
 
Lesson4.2 u4 l1 binary squences
Lesson4.2 u4 l1 binary squencesLesson4.2 u4 l1 binary squences
Lesson4.2 u4 l1 binary squences
Lexume1
 
ElixirConf 2017 - Writing an Editor in Elixir - Ian Duggan
ElixirConf 2017 - Writing an Editor in Elixir - Ian DugganElixirConf 2017 - Writing an Editor in Elixir - Ian Duggan
ElixirConf 2017 - Writing an Editor in Elixir - Ian Duggan
ijcd
 
Pipiot - the double-architecture shellcode constructor
Pipiot - the double-architecture shellcode constructorPipiot - the double-architecture shellcode constructor
Pipiot - the double-architecture shellcode constructor
Moshe Zioni
 
Computer repair -_a_complete_illustrated_guide_to_pc_hardware
Computer repair -_a_complete_illustrated_guide_to_pc_hardwareComputer repair -_a_complete_illustrated_guide_to_pc_hardware
Computer repair -_a_complete_illustrated_guide_to_pc_hardware
Shripal Oswal
 
When 7-bit ASCII ain't enough - about NLS, Collation, Charsets, Unicode and such
When 7-bit ASCII ain't enough - about NLS, Collation, Charsets, Unicode and suchWhen 7-bit ASCII ain't enough - about NLS, Collation, Charsets, Unicode and such
When 7-bit ASCII ain't enough - about NLS, Collation, Charsets, Unicode and such
Kim Berg Hansen
 
multilanguage.pdf
multilanguage.pdfmultilanguage.pdf
multilanguage.pdf
ssusera9b90d
 
Encoding Nightmares (and how to avoid them)
Encoding Nightmares (and how to avoid them)Encoding Nightmares (and how to avoid them)
Encoding Nightmares (and how to avoid them)
Kenneth Farrall
 
MySQL 8.0 & Unicode: Why, what & how
MySQL 8.0 & Unicode: Why, what & howMySQL 8.0 & Unicode: Why, what & how
MySQL 8.0 & Unicode: Why, what & how
Bernt Marius Johnsen
 
Topic 2.3 (1)
Topic 2.3 (1)Topic 2.3 (1)
Topic 2.3 (1)
nabilbesttravel
 
Pl ams 2015_unicode_dveeden
Pl ams 2015_unicode_dveedenPl ams 2015_unicode_dveeden
Pl ams 2015_unicode_dveeden
Daniël van Eeden
 
Unicode
UnicodeUnicode
Unicode 101
Unicode 101Unicode 101
Unicode 101
davidfstr
 
Comprehasive Exam - IT
Comprehasive Exam - ITComprehasive Exam - IT
Comprehasive Exam - IT
guest6ddfb98
 

Similar to What character is that (20)

Lecture_ASCII and Unicode.ppt
Lecture_ASCII and Unicode.pptLecture_ASCII and Unicode.ppt
Lecture_ASCII and Unicode.ppt
 
Abap slide class4 unicode-plusfiles
Abap slide class4 unicode-plusfilesAbap slide class4 unicode-plusfiles
Abap slide class4 unicode-plusfiles
 
Unicode Encoding Forms
Unicode Encoding FormsUnicode Encoding Forms
Unicode Encoding Forms
 
Data Communication & Computer Networks : Data Types
Data Communication & Computer Networks : Data TypesData Communication & Computer Networks : Data Types
Data Communication & Computer Networks : Data Types
 
Using unicode with php
Using unicode with phpUsing unicode with php
Using unicode with php
 
Lesson4.2 u4 l1 binary squences
Lesson4.2 u4 l1 binary squencesLesson4.2 u4 l1 binary squences
Lesson4.2 u4 l1 binary squences
 
Strings and encodings
Strings and encodingsStrings and encodings
Strings and encodings
 
ElixirConf 2017 - Writing an Editor in Elixir - Ian Duggan
ElixirConf 2017 - Writing an Editor in Elixir - Ian DugganElixirConf 2017 - Writing an Editor in Elixir - Ian Duggan
ElixirConf 2017 - Writing an Editor in Elixir - Ian Duggan
 
Pipiot - the double-architecture shellcode constructor
Pipiot - the double-architecture shellcode constructorPipiot - the double-architecture shellcode constructor
Pipiot - the double-architecture shellcode constructor
 
Computer repair -_a_complete_illustrated_guide_to_pc_hardware
Computer repair -_a_complete_illustrated_guide_to_pc_hardwareComputer repair -_a_complete_illustrated_guide_to_pc_hardware
Computer repair -_a_complete_illustrated_guide_to_pc_hardware
 
When 7-bit ASCII ain't enough - about NLS, Collation, Charsets, Unicode and such
When 7-bit ASCII ain't enough - about NLS, Collation, Charsets, Unicode and suchWhen 7-bit ASCII ain't enough - about NLS, Collation, Charsets, Unicode and such
When 7-bit ASCII ain't enough - about NLS, Collation, Charsets, Unicode and such
 
multilanguage.pdf
multilanguage.pdfmultilanguage.pdf
multilanguage.pdf
 
Encoding Nightmares (and how to avoid them)
Encoding Nightmares (and how to avoid them)Encoding Nightmares (and how to avoid them)
Encoding Nightmares (and how to avoid them)
 
MySQL 8.0 & Unicode: Why, what & how
MySQL 8.0 & Unicode: Why, what & howMySQL 8.0 & Unicode: Why, what & how
MySQL 8.0 & Unicode: Why, what & how
 
Unicode
UnicodeUnicode
Unicode
 
Topic 2.3 (1)
Topic 2.3 (1)Topic 2.3 (1)
Topic 2.3 (1)
 
Pl ams 2015_unicode_dveeden
Pl ams 2015_unicode_dveedenPl ams 2015_unicode_dveeden
Pl ams 2015_unicode_dveeden
 
Unicode
UnicodeUnicode
Unicode
 
Unicode 101
Unicode 101Unicode 101
Unicode 101
 
Comprehasive Exam - IT
Comprehasive Exam - ITComprehasive Exam - IT
Comprehasive Exam - IT
 

Recently uploaded

GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
Cheryl Hung
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
Dorra BARTAGUIZ
 
Assure Contact Center Experiences for Your Customers With ThousandEyes
Assure Contact Center Experiences for Your Customers With ThousandEyesAssure Contact Center Experiences for Your Customers With ThousandEyes
Assure Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
mikeeftimakis1
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
Product School
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
Ana-Maria Mihalceanu
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
Ralf Eggert
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
Elena Simperl
 
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
UiPathCommunity
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
ControlCase
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Aggregage
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
BookNet Canada
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 

Recently uploaded (20)

GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
 
Assure Contact Center Experiences for Your Customers With ThousandEyes
Assure Contact Center Experiences for Your Customers With ThousandEyesAssure Contact Center Experiences for Your Customers With ThousandEyes
Assure Contact Center Experiences for Your Customers With ThousandEyes
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 

What character is that

  • 1. What character is THAT? Anders Karlsson anders@skysql.com
  • 2. Agenda • About Anders Karlsson • Part 1 - The gruesome background • The history of character sets and collations • The “classic” 7 and 8 bit ASCII character sets • Part 2 – UNICODE Rocks! • What is UNICODE and encodings • Why UTF-8 is smart. Or not so smart • Part 3 - MySQL and UNICODE • Questions? Answers?
  • 3. About Anders Karlsson • Senior Sales Engineer at SkySQL • Former Database Architect at Recorded Future, Sales Engineer and Consultant with Oracle, Informix, TimesTen, MySQL / Sun / Oracle etc. • Has been in the RDBMS business for 20+ years • Has also worked as Tech Support engineer, Porting Engineer and in many other roles • Outside SkySQL I build websites (www.papablues.com), develop Open Source software (MyQuery, mycleaner etc), am a keen photographer, and have an affection for English Real Ales and a great interest in computer history 22/11/2012 SkySQL Ab 2011 Confidential 3
  • 4. Part 1 – The history which we are not to ignore (but which has already been ignored several times)
  • 5. The history of Character Sets and collations • At first there were no characters, only numbers • Then on the 7th day we realized characters and words were a good thing, but that computers can only handle numbers, so we needed a way of representing characters as numbers • So we got different mappings from characters to numbers: ASCII, EBCDIC, FIELDATA, Baudot etc., in different variations (in particular EBCDIC)
  • 6. ASCII – The mother of character sets • For anyone not being a masochist (i.e. anyone not using a mainframe), the character set of choice soon became 7-bit ASCII (American Standard Code for Information Interchange), first published in 1963 • 7 bits were enough for US English characters and control characters, with some legroom (note that ASCII is US English, not UK English, centric) • The 8th bit was used for parity in transmission
  • 7. All ASCII hell breaks loose • As the original 7-bit US ASCII didn’t support anything but US English, variations started to appear. • Any decent computer was supporting 8-bit characters, but the assumption was still that bit 8 was a parity bit. • So 7-bit local variations were developed, Swedish 7-bit ASCII for example (anyone coding in C knows and hates this)
  • 8. And then we get 8-bit ASCII hell! • Extended 8-bit ASCII solves a few problems, but also introduces a few new ones. Most of the new problems came from attempts to make 8-bit Extended ASCII compatible with the 7-bit ASCII variations • The Extended 8-bit “ASCII” character sets are largely standardized as ISO 8859 (with variations). Most common is ISO 8859-1 (latin-1) • 8859-15 is a not so popular 8859-1 update, including a Euro sign among a few other things. Whether the Euro sign really is a useful addition is yet to be determined • Another 8859-1 variation is Windows CP1252, which is an enhanced 8859-1 character set
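To make the difference between these 8-bit character sets concrete, here is a minimal Python sketch that decodes the same byte value under ISO 8859-1, ISO 8859-15 and Windows CP1252. The helper name `decode_byte` is made up for this illustration; the codec names are the ones shipped with Python's standard library.

```python
# Decode one byte value under three "extended ASCII" character sets.
# decode_byte is a hypothetical helper, not part of any library.
CODECS = ("latin-1", "iso8859-15", "cp1252")

def decode_byte(value):
    raw = bytes([value])
    result = {}
    for codec in CODECS:
        try:
            result[codec] = raw.decode(codec)
        except UnicodeDecodeError:
            # Some code pages leave a few byte values undefined
            result[codec] = None
    return result

# 0x41 is plain 7-bit ASCII, so all three agree on "A".
# 0xA4 is the currency sign in 8859-1 and CP1252, but the Euro sign in 8859-15.
# 0x80 is a control character in the ISO sets, but the Euro sign in CP1252.
for value in (0x41, 0xA4, 0x80):
    print(hex(value), decode_byte(value))
```

The bytes on disk are identical in all three cases; only the declared character set decides which character the reader sees.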
  • 9. Oh, then we have collations! • A “collation” determines how characters in a character set are to be sorted! • 7-bit ASCII was great (numeric order same as character order) – Or was it? Really? Upper / Lower case? • 7-bit localized ASCII was not so great. To say the least. Swedish 7-BIT ASCII was not correctly sorted (å last in the alphabet, after ä and ö) • 8-bit Extended ASCII didn’t help much (Swedish again being in the wrong order, but not the same wrong order as with 7-bit “Swedish ASCII”)
  • 10. Collation basics • Don’t ever think that the character set determines the sorting! – The same character set used in different countries may be sorted differently – Different sorting models may be used in the same country (A good example is case sensitivity) • Also, collations are not only about sorting, they are also about comparisons and a few other things
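The point that the character set does not determine the sort order can be sketched in a few lines of Python. The alphabet table and the `swedish_key` function below are simplified assumptions made for this example; a real Swedish collation has more rules than this.

```python
# Sorting the same strings by code point and by a (simplified) Swedish alphabet.
# SWEDISH_ALPHABET and swedish_key are illustrative assumptions only.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"
RANK = {ch: i for i, ch in enumerate(SWEDISH_ALPHABET)}

def swedish_key(word):
    # Case-insensitive; characters outside the table sort after all letters
    return [(RANK.get(ch, len(RANK)), ch) for ch in word.lower()]

words = ["öl", "ärta", "ål", "zebra"]

by_code_point = sorted(words)                 # plain numeric order of the characters
by_swedish = sorted(words, key=swedish_key)   # å, ä, ö come last, in that order

print(by_code_point)  # ['zebra', 'ärta', 'ål', 'öl']
print(by_swedish)     # ['zebra', 'ål', 'ärta', 'öl']
```

Both results use exactly the same characters and the same encoding; only the collation differs.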
  • 11. Interoperating with ASCII • As long as we were all using 1 single computer or a bunch of similar computers in a LAN, the issues were limited • As usual, the Internet turned this beautiful environment into something truly evil! • The Internet got started in the US – Which means, again, that the founders were convinced that 7-bit ASCII would be OK. That this had been an incorrect assumption 30 years before Facebook came around made no difference. Of course not.
  • 12. Interoperability necessities • For us to be able to communicate, the client has to be able to tell what character set it expects, the server has to tell what it delivers, and then we need a way to align all this. • The trick: <meta http-equiv=Content-Type content="text/html; charset=iso-8859-1"> – Or maybe not? This tells what I get, but doesn’t allow me to say what I want! • Actually, this didn’t help as much as we hoped
  • 13. Part 1 Conclusion • The many different local variations of characters served us well, for a while • Now we have a global IT environment with many different character sets and collations, and we can’t deal with multiple local versions anymore • And we have languages whose character set will not fit in 8 bits anyway • And then we need to sort and compare all this!
  • 14. Part 2 – UNICODE and Ken Thompson save the world, without Batman and not by tracking down the penguin
  • 15. UNICODE – One Character set for all • Yes, that is what UNICODE (or ISO/IEC 10646) sets out to do – A common character set for ALL languages (close to 240.000 characters are defined in UNICODE 4.1 today, MySQL is somewhat at UNICODE 3.0). Sort of. • This means that UNICODE has character codes that cannot fit in 1 byte. This is a big surprise to anyone on the other side of the pond, but there you go • But there is a remedy: UNICODE Encodings!
  • 16. UNICODE Encodings • A UNICODE encoding is a standardized way of representing a character in the UNICODE character set • UNICODE encodings represent select parts of the full UNICODE character set • UNICODE encodings are part of the UNICODE standard itself (and this is a VERY good thing! If this wasn’t the case, both Apple and Microsoft would have invented their own encodings I’m sure)
  • 17. UNICODE Encodings • Among the UNICODE encodings are – UCS-2 – 2 bytes wide (i.e. only 64k different characters can be represented) – UTF-16 – 2 or 4 bytes wide. This is then a variable length scheme with a very complex setup. When only 2 bytes are used, they are the same as UCS-2 – UTF-32 – 4 bytes fixed size • To be honest, besides UTF-16 / UCS-2 that is common in Windows and related frameworks (like COM), none of these are very popular
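The widths listed above can be checked with a short Python sketch that encodes a few characters and counts the bytes. The helper name `encoded_sizes` is an assumption for this example; the big-endian codec variants are used so that no byte order mark is added to the output.

```python
# Byte counts for the same character in three UNICODE encodings.
# encoded_sizes is a hypothetical helper written for this illustration.
ENCODINGS = ("utf-8", "utf-16-be", "utf-32-be")

def encoded_sizes(ch):
    return {name: len(ch.encode(name)) for name in ENCODINGS}

samples = {
    "A": "A",                   # U+0041, 7-bit ASCII
    "a-ring": "\u00e5",         # U+00E5, Latin-1 range
    "euro": "\u20ac",           # U+20AC, inside the Basic Multilingual Plane
    "emoji": "\U0001F600",      # U+1F600, outside the Basic Multilingual Plane
}

for label, ch in samples.items():
    print(label, encoded_sizes(ch))
```

UTF-32 is always 4 bytes, UTF-16 is 2 bytes until the character leaves the Basic Multilingual Plane, and UTF-8 grows from 1 to 4 bytes.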
  • 18. UTF-8 – Some smart dudes at work! • The problem that UNICODE has is that it has to represent all those characters. This should break some applications for sure. • Well, Encodings solve that too, and the mother of all encodings is UTF-8. Invented not by Albert Einstein or Batman but by Ken Thompson! • Let’s now have a round of applause for Ken Thompson!
  • 19. The details of UTF-8 • UNICODE characters 0 – 127 are the same as in standard 7-bit ASCII (remember that?) • UTF-8 works the same: For characters 0 – 127, the most significant (first) bit of the first (and only) byte is 0 • Beyond 7-bit ASCII characters, the number of “leading” 1’s in the first byte tells how many bytes make up the character • All other bytes start with a 1 and a 0 • And the rest of the bits make up the character
  • 20. The details of UTF-8 • So in the first byte, it is one of two things: – A leading 0 meaning a single byte character – A number of 1’s (at least 2, as 1 byte characters are indicated by a leading 0) followed by a 0 • This means that the first byte in a character NEVER starts with the sequence 10 – All other bytes start with 10 – 1 UTF-8 byte can contain up to 7 bits of data – 2 UTF-8 bytes contain from 8 to 11 bits of data – 3 UTF-8 bytes contain from 12 to 16 bits of data – 4 UTF-8 bytes contain from 17 to 21 bits of data
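The bit layout described on the two slides above can be written out as a small hand-rolled encoder. This is a sketch for illustration only (the function name `utf8_encode` is made up, and in practice you would simply call `str.encode("utf-8")`), but it shows where the leading 1's and the `10` continuation prefix come from.

```python
# A hand-written UTF-8 encoder for a single code point, for illustration.
# utf8_encode is a hypothetical name; real code should use str.encode("utf-8").
def utf8_encode(code_point):
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not an encodable UNICODE code point")
    if code_point < 0x80:
        # 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:
        # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])

# Show the bit patterns for one character of each length
for cp in (0x41, 0xE5, 0x20AC, 0x1F600):
    encoded = utf8_encode(cp)
    print(hex(cp), " ".join(format(b, "08b") for b in encoded))
```

Counting the `x` positions in each comment gives the number of data bits that each length can carry.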
  • 21. Some useful aspects of UTF-8 • You can always find the leading byte of a character in a word, starting from any byte – Just move “backward” until a byte not having a leading 10 is found • Byte values 0 – 127 are ONLY present as character values 0 – 127, nowhere else! – All other byte values have the highest bit set – So strlen(), strcmp() etc. still work, but on a byte by byte, not character by character, level
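The "move backward" trick can be sketched as follows. The function name `char_start` is an assumption for this example, and the input is assumed to be valid UTF-8.

```python
# Find the first byte of the character that contains a given byte offset.
# char_start is a hypothetical helper; the input is assumed to be valid UTF-8.
def char_start(data, index):
    # Continuation bytes look like 10xxxxxx, i.e. (byte & 0xC0) == 0x80
    while index > 0 and (data[index] & 0xC0) == 0x80:
        index -= 1
    return index

text = "a\u00e5\u20ac"            # "a", a-ring (2 bytes), Euro sign (3 bytes)
data = text.encode("utf-8")       # 6 bytes in total

print(len(text), len(data))       # 3 characters, 6 bytes
print(char_start(data, 5))        # last byte of the Euro sign -> 3
print(char_start(data, 2))        # second byte of a-ring -> 1
print(char_start(data, 0))        # already at a leading byte -> 0
```

The difference between `len(text)` and `len(data)` is the same byte-versus-character distinction that applies to strlen() in C.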
  • 22. So, are we all OK with UTF-8 now? • Let’s see. Using UTF-8 we can represent binary values with up to 21 bits, which is 2.097.152 characters! Which is more than enough! (But 640K RAM was ALSO more than enough) • If we limit ourselves to 3 bytes UTF-8 we can represent 65.536 different characters, the same as if we use UCS-2 (which is fixed 2-byte format). 65.536 characters is what is in the UNICODE Basic Multilingual Plane
  • 23. Why we actually need 4-bytes UTF-8 • Beyond the BMP come a couple of other “planes”. The one that causes most issues is the one that adds a bunch of Chinese, Japanese and Korean characters • For these, we need to go beyond the BMP and hence beyond the nice and cosy 65.536 characters. Duh! • And this is why the MySQL assumption that UTF-8 means a maximum of 3 bytes might not be such a good idea after all
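A quick way to see which strings survive a 3-byte UTF-8 limit is to check whether every character lies inside the Basic Multilingual Plane. The sketch below uses a made-up helper name, `fits_in_three_byte_utf8`, and only checks the code point range; it is not how any database performs the check.

```python
# Check whether a string can be stored with at most 3 UTF-8 bytes per character,
# i.e. whether every character lies in the Basic Multilingual Plane.
# fits_in_three_byte_utf8 is a hypothetical helper for this illustration.
def fits_in_three_byte_utf8(text):
    return all(ord(ch) <= 0xFFFF for ch in text)

bmp_text = "\u00dcber \u4e2d\u6587"     # Latin and common CJK characters, all in the BMP
beyond_bmp = "\U00020000"               # a CJK ideograph from a supplementary plane

print(fits_in_three_byte_utf8(bmp_text))    # True
print(fits_in_three_byte_utf8(beyond_bmp))  # False
print(len(beyond_bmp.encode("utf-8")))      # 4
```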
  • 24. Part 3 – MySQL and UNICODE
  • 25. So, how does MySQL handle all this? • MySQL supports a whole range of UNICODE Encodings and collations! Good! • MySQL understands the case when we have one character set stored in a column in a table and another one on the client side, and nicely does a conversion for us! Good! • Not all UNICODE Encodings are valid on the Client side! Not so good • Actually, anything beyond UTF-8, when it comes to UNICODE on the client side, is troublesome
  • 26. Lessons in MySQL and UNICODE • Lesson 1: Learn about UNICODE and understand how it works • Lesson 2: Stick with UTF-8! Most others do that too. Including Java, JavaScript, JSON, the web and many, many others! • Lesson 3: UCS-2 may seem like a good idea, it is fixed length after all. It’s not (a good idea that is, fixed length it is) • Lesson 4: Don’t forget about collations! They are important!
  • 27. Collations – The Sequel • Collations determine how strings are sorted – Order by – Indexes – WHERE col1 > ‘Über’ • Collations determine how strings are compared – Is A = Ä or not? Y = Ü? • Watch in particular for the collations used for PRIMARY KEYs
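The question "Is A = Ä or not?" depends entirely on the comparison rules in use. The Python sketch below contrasts three rules: binary, case-insensitive, and accent-insensitive. The function names are assumptions for this example, and the accent stripping via UNICODE normalization is a stand-in technique, not the algorithm MySQL collations use; a language-specific collation may make different choices.

```python
# Three different answers to "are these two strings equal?".
# The helpers are illustrative; this is not how MySQL implements collations.
import unicodedata

def strip_accents(text):
    # Decompose each character, then drop the combining marks
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def equal_binary(a, b):
    return a == b

def equal_case_insensitive(a, b):
    return a.casefold() == b.casefold()

def equal_accent_insensitive(a, b):
    return strip_accents(a).casefold() == strip_accents(b).casefold()

print(equal_binary("A", "\u00c4"))                     # False
print(equal_accent_insensitive("A", "\u00c4"))         # True
print(equal_case_insensitive("\u00fcber", "\u00dcber"))  # True
```

For a PRIMARY KEY this matters because two values that compare equal under the column's collation count as duplicates.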
  • 28. Storing UTF-8 data in MySQL • Most Storage Engines are happy to use utf-8 • The MySQL Interpretation of UTF-8 is 1 – 3 bytes, or 65.536 different characters! – This means that • A CHAR(10) column requires 30 bytes fixed space! • A VARCHAR(10) column is limited to 30 bytes • MySQL 5.5 and up also supports 4-byte UTF-8 by using the character set utf8mb4
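The space arithmetic on this slide can be sketched in a few lines. The table of maximum bytes per character covers MySQL's latin1, utf8 and utf8mb4 character sets; the helper names and the 767-byte limit used in the example are assumptions chosen for illustration, so substitute the limit that applies to your storage engine.

```python
# Worst-case byte sizes for character columns under different MySQL character sets.
# max_bytes and chars_that_fit are hypothetical helpers for this illustration.
MAX_BYTES_PER_CHAR = {"latin1": 1, "utf8": 3, "utf8mb4": 4}

def max_bytes(char_length, charset):
    # Worst-case storage for a column declared with char_length characters
    return char_length * MAX_BYTES_PER_CHAR[charset]

def chars_that_fit(byte_limit, charset):
    # How many characters are guaranteed to fit inside a byte-size limit
    return byte_limit // MAX_BYTES_PER_CHAR[charset]

print(max_bytes(10, "utf8"))         # 30
print(max_bytes(10, "utf8mb4"))      # 40

# 767 bytes is used here only as an example of a byte-size limit
print(chars_that_fit(767, "utf8"))     # 255
print(chars_that_fit(767, "utf8mb4"))  # 191
```

Switching a column from utf8 to utf8mb4 therefore shrinks the number of characters that fit under any fixed byte limit.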
  • 29. Storing UTF-8 data in MySQL • VARCHAR columns are actually fixed in some Storage Engines, most notably those engines developed sometime around the time of the American Civil War, when variable length data was still in its infancy • UTF-8 can potentially waste A LOT of space • Extra space for UTF-8 also affects byte size limits, such as VARCHAR and INDEX sizes • UTF-8 data sorting is way more complex than a simple binary sort (so in some ways, things were better in the old 7-bit ASCII days)
  • 30. Some simple demos, Questions and Answers. And I haven’t even begun to talk about byte ordering and byte order marks.