What character is that

What character is THAT?

Anders Karlsson
anders@skysql.com

Agenda
• About Anders Karlsson
• Part 1 - The gruesome background
• The history of character sets and
collations
• The “classic” 7 and 8 bit ASCII
character sets
• Part 2 – UNICODE Rocks!
• What is UNICODE and encodings
• Why UTF-8 is smart. Or not so smart
• Part 3 - MySQL and UNICODE
• Questions? Answers?

About Anders Karlsson

• Senior Sales Engineer at SkySQL
• Former Database Architect at Recorded Future, Sales
Engineer and Consultant with
Oracle, Informix, TimesTen, MySQL / Sun / Oracle etc.
• Has been in the RDBMS business for 20+ years
• Has also worked as Tech Support engineer, Porting
Engineer and in many other roles
• Outside SkySQL I build websites
(www.papablues.com), develop Open Source software
(MyQuery, mycleaner etc), am a keen
photographer, have an affection for English Real Ales
and a great interest in computer history
22/11/2012 SkySQL Ab 2011 Confidential 3

Part 1 – The history which we
are not to ignore (but which has
already been ignored several
times)

The history of Characters Sets and
collations

• At first there were no characters, only numbers
• Then on the 7th day we realized characters and
words were a good thing, but that computers
can only handle numbers, so we needed a way
of representing characters as numbers
• So we got different mappings from characters to
numbers: ASCII, EBCDIC, FIELDATA, Baudot
etc, in different variations (in particular
EBCDIC)
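
A small sketch of what "different mappings" means in practice: the same letter becomes a different number depending on the character set. The choice of cp037 (one of the EBCDIC variants that ships with Python) is an assumption made for this example, not something from the talk.

```python
# The same character maps to different numbers in different character sets.
# cp037 is one EBCDIC variant bundled with Python (chosen for illustration).
letter = "A"

ascii_code = letter.encode("ascii")[0]
ebcdic_code = letter.encode("cp037")[0]

print(f"ASCII : {ascii_code:#04x}")   # 0x41
print(f"EBCDIC: {ebcdic_code:#04x}")  # 0xc1
```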

ASCII – The mother of character sets

• For anyone not being a masochist (i.e.
anyone not using a mainframe), the character
set of choice soon became 7-bit ASCII
(American Standard Code for Information
Interchange), first published in 1963
• 7-bits was enough for US English characters
and control characters, with some legroom
(note that ASCII is US English, not UK
English, centric)
• The 8th bit was used for parity in transmission

All ASCII hell breaks loose

• As the original 7-bit US ASCII didn’t support
anything but US English, variations started to
appear.
• Any decent computer supported 8-bit
characters, but the assumption was still
that bit 8 was a parity bit.
• So 7-bit local variations were
developed, Swedish 7-bit ASCII for example
(anyone coding in C knows and hates this)

And then we get 8-bit ASCII hell!

• Extended 8-bit ASCII solves a few problems, but also
introduces a few new ones. Most of the new problems
came from an attempt to make 8-bit Extended ASCII
compatible with the 7-bit ASCII variations
• The Extended 8-bit “ASCII” character sets are largely
standardized as ISO 8859 (with variations). Most
common is ISO 8859-1 (latin-1)
• 8859-15 is a not so popular 8859-1 update, including a
Euro-sign among a few other
things. Whether the Euro-sign really is a useful
addition is yet to be determined
• Another 8859-1 variation is Windows CP1252,
which is an enhanced 8859-1 character set
Oh, then we have collations!
• A “collation” determines how characters in a
character set are to be sorted!
• 7-bit ASCII was great (numeric order same as
character order)
– Or was it? Really? Upper / Lower case?
• 7-bit localized ASCII was not so great. To say the
least. Swedish 7-bit ASCII was not correctly
sorted (å came last, after ä and ö, while the Swedish
alphabet ends å, ä, ö)
• 8-bit Extended ASCII didn’t help much (Swedish
again being in the wrong order, but not the same
wrong order as with 7-bit “Swedish ASCII”)
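
A quick sketch of sorting by numeric code only, which is what you get without a real collation. It assumes ISO 8859-1 as the 8-bit character set; the variable names are made up for the example.

```python
# Sorting by code value only, which is what a plain binary sort does.
ascii_sorted = sorted(["b", "A", "a", "B"])
print(ascii_sorted)    # ['A', 'B', 'a', 'b']

# The Swedish alphabet ends with å, ä, ö (in that order).
swedish_tail = ["\u00e5", "\u00e4", "\u00f6"]   # å, ä, ö
by_code_point = sorted(swedish_tail, key=lambda ch: ch.encode("iso-8859-1"))
print(by_code_point)   # ['ä', 'å', 'ö']
```

In ISO 8859-1, ä is 0xE4 and å is 0xE5, so the binary order puts ä first.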

Collation basics

• Don’t ever think that the character
set determines the sorting!
– The same character set used in
different countries may be sorted
differently
– Different sorting models may be used in
the same country (A good example is
case sensitivity)
• Also, collations are not only about
sorting, they are also about comparisons
and a few other things
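
As a rough illustration, two sorting models applied to the same strings in the same character set. Using Python's casefold() as the case-insensitive key is an assumption for the sketch; a real collation does considerably more.

```python
words = ["apple", "Banana", "cherry", "Apple"]

# Model 1: binary, case-sensitive (upper case sorts before lower case).
binary_order = sorted(words)

# Model 2: case-insensitive, using casefold() as the sort key.
# sorted() is stable, so "apple" stays ahead of "Apple".
case_insensitive_order = sorted(words, key=str.casefold)

print(binary_order)            # ['Apple', 'Banana', 'apple', 'cherry']
print(case_insensitive_order)  # ['apple', 'Apple', 'Banana', 'cherry']

# Comparison follows the chosen model as well.
binary_equal = "apple" == "Apple"
folded_equal = "apple".casefold() == "Apple".casefold()
print(binary_equal, folded_equal)  # False True
```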

Interoperating with ASCII
• As long as we were all using 1 single computer
or a bunch of similar computers in a LAN, the
issues were limited
• As usual, the Internet turned this beautiful
environment into something truly evil!
• Internet got started in the US
– Which means, again, that the founders were
convinced that 7-bit ASCII would be OK. That this
had been an incorrect assumption 30 years before
Facebook came around made no difference. Of
course not.

Interoperability necessities
• For us to be able to communicate we need to
be able to tell what character set we expect
here at the client side, the server has to tell
what it delivers, and then we need a way to
align all this.
• The trick: <meta http-equiv="Content-Type"
content="text/html; charset=iso-8859-1">
– Or maybe not? This tells what I get, but doesn’t
allow me to say what I want!
• Actually, this didn’t help as much as we hoped

Part 1 Conclusion

• The many different local variations of
characters served us well, for a while
• Now we have a global IT environment with
many different character sets and
collations, and we can’t deal with multiple
local versions anymore
• And we have languages whose character set
will not fit in 8 bits anyway
• And then we need to sort and compare all this!

Part 2 – UNICODE and Ken
Thompson save the
world, without Batman and not by
tracking down the Penguin

UNICODE – One Character set for all

• Yes, that is what UNICODE (or ISO/IEC 10646)
sets out to do – A common character set for
ALL languages (close to 240.000 characters are
defined in UNICODE 4.1 today, MySQL is
somewhat at UNICODE 3.0). Sort of.
• This means that UNICODE has character codes
that cannot fit in 1 byte. This is a big surprise
to anyone on the other side of the pond, but
there you go
• But there is a remedy: UNICODE Encodings!

UNICODE Encodings

• A UNICODE encoding is a standardized way of
representing a character in the UNICODE
character set
• Some UNICODE encodings (UCS-2, for example) represent
only select parts of the full UNICODE character set
• UNICODE encodings are part of the UNICODE
standard itself (and this is a VERY good thing!
If this wasn’t the case, both Apple and
Microsoft would have invented their own
encodings I’m sure)

UNICODE Encodings

• Among the UNICODE encodings are
– UCS-2 – 2 bytes wide (i.e. only 64k different
characters can be represented)
– UTF-16 – 2 or 4 bytes wide. This is then a variable
length scheme with a very complex setup. When
only 2 bytes are used, they are the same as UCS-2
– UTF-32 – 4 bytes fixed size
• To be honest, besides UTF-16 / UCS-2 that is
common in Windows and related frameworks
(like COM), none of these are very popular

UTF-8 – Some smart dudes at work!

• The problem that UNICODE has is that it has
to represent all those characters. This should
break some applications for sure.
• Well, Encodings solve that too, and the
mother of all encodings is UTF-8.
Invented not by Albert Einstein or
Batman but by Ken Thompson!
• Let’s now have a round of applause
for Ken Thompson!

The details of UTF-8
• UNICODE characters 0 – 127 are the same as
in standard 7-bit ASCII (remember that?)
• UTF-8 works the same: For characters 0 –
127, the most significant (first) bit of the first
(and only) byte is 0
• Beyond 7-bit ASCII characters, the number of
“leading” 1’s in the first byte tells how many
bytes make up the character
• All other bytes start with a 1 and a 0
• And the rest of the bits make up the character
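
The bit patterns described above can be inspected directly. This is a small sketch; utf8_bits is a helper name made up for the example.

```python
def utf8_bits(ch):
    """Return the UTF-8 bytes of ch as a list of 8-character bit strings."""
    return [f"{byte:08b}" for byte in ch.encode("utf-8")]

print(utf8_bits("A"))           # ['01000001']
print(utf8_bits("\u00e5"))      # å: ['11000011', '10100101']
print(utf8_bits("\u20ac"))      # €: ['11100010', '10000010', '10101100']
print(utf8_bits("\U00020000"))  # ['11110000', '10100000', '10000000', '10000000']
```

The first byte starts with 0, 110, 1110 or 11110, and every following byte starts with 10.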

The details of UTF-8
• So in the first byte, it is one of two things:
– A leading 0 meaning a single byte character
– A number of 1’s (at least 2, as 1 byte characters
are indicated by a leading 0) followed by a 0
• This means that the first byte in a character NEVER
starts with the sequence 10
– All other bytes start with 10
– 1 UTF-8 byte can contain up to 7 bits of data
– 2 UTF-8 bytes contain from 8 to 11 bits of data
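
To tie the rules together, here is a minimal hand-written encoder checked against Python's built-in one. It is a sketch only: encode_utf8 is a made-up name and it does not reject surrogates or other special ranges.

```python
def encode_utf8(code_point):
    """Encode one code point as UTF-8 bytes (illustrative, no validation
    of surrogates or other special ranges)."""
    if code_point < 0x80:                 # up to 7 bits: 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:                # up to 11 bits: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:              # up to 16 bits: 1110xxxx + 2 more
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x200000:             # up to 21 bits: 11110xxx + 3 more
        return bytes([0xF0 | (code_point >> 18),
                      0x80 | ((code_point >> 12) & 0x3F),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    raise ValueError("code point does not fit in 21 bits")

for cp in (0x41, 0xE5, 0x20AC, 0x20000):
    assert encode_utf8(cp) == chr(cp).encode("utf-8")
    print(f"U+{cp:04X} -> {encode_utf8(cp).hex()}")
```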

Some useful aspects of UTF-8

• You can always find the leading byte of a
character in a word, starting from any byte
– Just move “backward” until a byte not having a
leading 10 is found
• Byte values 0 – 127 are ONLY present as
character values 0 – 127, nowhere else!
– All other byte values have the highest bit set
– So strlen(), strcmp() etc. still work, but on a byte
by byte, not character by character, level
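
Both properties are easy to try out. The sketch below assumes valid UTF-8 input; char_start is a helper name made up for the example.

```python
def char_start(data, index):
    """Return the index of the leading byte of the character that
    contains data[index], assuming data is valid UTF-8."""
    while index > 0 and (data[index] & 0xC0) == 0x80:  # 10xxxxxx: continuation
        index -= 1
    return index

data = "a\u20acb".encode("utf-8")   # "a€b" -> 61 e2 82 ac 62
starts = [char_start(data, i) for i in range(len(data))]
print(starts)                       # [0, 1, 1, 1, 4]

# Bytes below 0x80 only ever come from ASCII characters, so byte-oriented
# searches for ASCII delimiters never match inside a multi-byte character.
ascii_bytes = [b for b in data if b < 0x80]
print(ascii_bytes)                  # [97, 98]

# Byte length versus character length:
print(len(data), len(data.decode("utf-8")))  # 5 3
```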

So, are we all OK with UTF-8 now?
• Let’s see. Using UTF-8 we can represent
binary values with up to 21 bits, which is
2.097.152 characters! Which is
more than enough! (But 640K
RAM was ALSO more than enough)
• If we limit ourselves to 3 bytes UTF-8
we can represent 65.536 different
characters, the same as if we use
UCS-2 (which is fixed 2-byte format).
65.536 characters is what is in the
UNICODE Basic Multilingual Plane
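
The numbers above follow from counting the payload bits per sequence length. A small sketch of the arithmetic:

```python
# Payload bits: bits left in the leading byte plus 6 per continuation byte.
payload_bits = {
    1: 7,              # 0xxxxxxx
    2: 5 + 6,          # 110xxxxx 10xxxxxx
    3: 4 + 6 + 6,      # 1110xxxx 10xxxxxx 10xxxxxx
    4: 3 + 6 + 6 + 6,  # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
}
for length, bits in payload_bits.items():
    print(length, "byte(s):", bits, "bits,", 2 ** bits, "values")

# The last BMP code point fits in 3 bytes, the first one beyond it needs 4.
print(len(chr(0xFFFF).encode("utf-8")))   # 3
print(len(chr(0x10000).encode("utf-8")))  # 4
```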

Why we actually need 4-bytes UTF-8

• Beyond the BMP come a couple of other
“planes”. The one that causes most issues is
the one that adds a bunch of
Chinese, Japanese and Korean characters
• For these, we need to go beyond the BMP and
hence beyond the nice and cosy 65.536
characters. Duh!
• And this is why the MySQL assumption that
UTF-8 means a maximum of 3 bytes might not
be such a good idea after all
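
One way to see whether your data is affected is to look for characters outside the BMP before storing them. This is a sketch only; beyond_bmp is a made-up helper name and the sample string is chosen for illustration.

```python
def beyond_bmp(text):
    """Return the characters in text that lie outside the BMP, i.e. the
    ones that need 4 bytes in UTF-8."""
    return [ch for ch in text if ord(ch) > 0xFFFF]

sample = "abc\u4e2d\U00020000"   # U+4E2D is in the BMP, U+20000 is not
found = [f"U+{ord(ch):04X}" for ch in beyond_bmp(sample)]
print(found)                     # ['U+20000']
```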

So, how does MySQL handle all this?
• MySQL supports a whole range of UNICODE
Encodings and collations! Good!
• MySQL understands the case when we have
one character set stored in a column in a table
and another one on the client side, and nicely
does a conversion for us! Good!
• Not all UNICODE Encodings are valid on the
Client side! Not so good
• Actually, anything beyond UTF-8, when it
comes to UNICODE on the client side, is
troublesome
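
Conceptually, the conversion is a decode from the column character set followed by an encode to the client character set. The Python sketch below only imitates that idea and is not how the server is implemented. It assumes a latin1 column and a utf8 client, and uses Python's cp1252 codec as a rough stand-in for MySQL's latin1.

```python
# Assumption for the sketch: the column is stored as latin1 (Python's cp1252
# is used as a rough stand-in) and the client asked for utf8.
stored = "Gr\u00f6na \u00e4pplen".encode("cp1252")   # "Gröna äpplen" on disk

# Conversion: decode from the column character set, encode to the client's.
sent_to_client = stored.decode("cp1252").encode("utf-8")

print(stored.hex())
print(sent_to_client.hex())
print(len(stored), len(sent_to_client))   # 12 14
```

The text is the same on both sides, only the bytes differ.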

Lessons in MySQL and UNICODE

• Lesson 1: Learn about UNICODE and
understand how it works
• Lesson 2: Stick with UTF-8! Most others do
that too. Including Java, JavaScript, JSON, the
web and many, many others!
• Lesson 3: UCS-2 may seem like a good idea, it
is fixed length after all. It’s not (a good idea
that is, fixed length it is)
• Lesson 4: Don’t forget about collations! They
are important!

Collations – The Sequel

• Collations determine how strings are sorted
– Order by
– Indexes
– WHERE col1 > ‘Über’
• Collations determine how strings are
compared
– Is A = Ä or not? Y = Ü?
• Watch out in particular for collations used for
PRIMARY KEYs
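
Why this matters for keys can be sketched with a made-up accent- and case-insensitive comparison key. loose_key is hypothetical and is not the algorithm of any particular MySQL collation.

```python
import unicodedata

def loose_key(text):
    """Hypothetical accent- and case-insensitive comparison key: decompose,
    drop combining marks, then casefold."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed
                       if not unicodedata.combining(ch))
    return stripped.casefold()

print("A" == "\u00c4")                        # A vs Ä, binary: False
print(loose_key("A") == loose_key("\u00c4"))  # A vs Ä, loose: True

# With such a collation on a unique key, these two values would collide.
keys = ["Muller", "M\u00fcller"]              # Muller, Müller
distinct = len({loose_key(k) for k in keys})
print(distinct)                               # 1
```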

Storing UTF-8 data in MySQL

• Most Storage Engines are happy to use utf-8
• The MySQL Interpretation of UTF-8 is 1 – 3
bytes, or 65.536 different characters!
– This means that
• A CHAR(10) column requires 30 bytes fixed space!
• A VARCHAR(10) column is limited to 30 bytes
• MySQL 5.5 and up also supports 4-byte UTF-8
by using the character set utf8mb4
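
The space arithmetic is simple worst-case multiplication. In the sketch below, max_bytes is a made-up helper, and latin1 is added only as a 1-byte point of comparison.

```python
# Worst-case bytes per character for each character set.
MAX_BYTES_PER_CHAR = {"latin1": 1, "utf8": 3, "utf8mb4": 4}

def max_bytes(char_length, charset):
    """Worst-case byte size of char_length characters in charset."""
    return char_length * MAX_BYTES_PER_CHAR[charset]

for charset in MAX_BYTES_PER_CHAR:
    print(charset, max_bytes(10, charset))
```

For a 10-character column this gives 10, 30 and 40 bytes respectively.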

Storing UTF-8 data in MySQL
• VARCHAR columns are actually fixed in some
Storage Engines, most notably those engines
developed sometime around the time of the
American Civil War, when variable length data
was still in its infancy
• UTF-8 can potentially waste A LOT of space
• Extra space for UTF-8 also affects byte size
limits, such as VARCHAR and INDEX sizes
• UTF-8 data sorting is way more complex than
a simple binary sort (so in some ways, things
were better in the old 7-bit ASCII days)
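
The effect on byte size limits can be sketched the same way. The 767-byte figure below is an assumed example value (the index key prefix limit of older InnoDB row formats); check the limit that applies to your own version and storage engine. max_indexable_chars is a made-up helper name.

```python
def max_indexable_chars(limit_bytes, bytes_per_char):
    """How many characters are guaranteed to fit within a byte limit."""
    return limit_bytes // bytes_per_char

LIMIT = 767  # assumed example value, see the note above
for charset, width in (("latin1", 1), ("utf8", 3), ("utf8mb4", 4)):
    print(charset, max_indexable_chars(LIMIT, width))
```

With that limit the result is 767, 255 and 191 characters.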

Some simple demos, Questions
and Answers.
And I haven’t even begun to talk
about byte ordering and byte order
marks.

THANK YOU!

Anders Karlsson
anders@skysql.com
http://karlssonondatabases.blogspot.com
