What character is that

Character sets and collations are an important part of the database setup. In this presentation I show you the history of character sets and how they are used today, how UTF-8 works and how MySQL handles all this.


  1. 1. What character is THAT? Anders Karlsson anders@skysql.com
  2. 2. Agenda • About Anders Karlsson • Part 1 - The gruesome background • The history of character sets and collations • The “classic” 7 and 8 bit ASCII character sets • Part 2 – UNICODE Rocks! • What is UNICODE and encodings • Why UTF-8 is smart. Or not so smart • Part 3 - MySQL and UNICODE • Questions? Answers?
  3. 3. About Anders Karlsson • Senior Sales Engineer at SkySQL • Former Database Architect at Recorded Future, Sales Engineer and Consultant with Oracle, Informix, TimesTen, MySQL / Sun / Oracle etc. • Has been in the RDBMS business for 20+ years • Has also worked as Tech Support engineer, Porting Engineer and in many other roles • Outside SkySQL I build websites (www.papablues.com), develop Open Source software (MyQuery, mycleaner etc), am a keen photographer, have an affection for English Real Ales and a great interest in computer history • 22/11/2012 SkySQL Ab 2011 Confidential
  4. 4. Part 1 – The history which we are not to ignore (but which has already been ignored several times)
  5. 5. The history of Character Sets and collations • At first there were no characters, only numbers • Then on the 7th day we realized characters and words were a good thing, but that computers can only handle numbers, so we needed a way of representing characters as numbers • So we got different mappings from characters to numbers: ASCII, EBCDIC, FIELDATA, Baudot etc, in different variations (in particular EBCDIC)
  6. 6. ASCII – The mother of character sets • For anyone not being a masochist (i.e. anyone not using a mainframe), the character set of choice soon became 7-bit ASCII (American Standard Code for Information Interchange), first published in 1963 • 7 bits was enough for US English characters and control characters, with some legroom (note that ASCII is US English, not UK English, centric) • The 8th bit was used for parity in transmission
  7. 7. All ASCII hell breaks loose • As the original 7-bit US ASCII didn’t support anything but US English, variations started to appear. • Any decent computer supported 8-bit characters, but the assumption was still that bit 8 was a parity bit. • So 7-bit local variations were developed, Swedish 7-bit ASCII for example (anyone coding in C knows and hates this)
  8. 8. And then we get 8-bit ASCII hell! • Extended 8-bit ASCII solves a few problems, but also introduces a few new ones. Most of the new problems came from an attempt to make 8-bit Extended ASCII compatible with 7-bit ASCII variations • The Extended 8-bit “ASCII” character sets are largely standardized as ISO 8859 (with variations). Most common is ISO 8859-1 (latin-1) • 8859-15 is a not so popular 8859-1 update, including a Euro sign among a few other things. Whether the Euro sign really is a useful addition is yet to be determined • Another 8859-1 variation is Windows CP1252, which is an enhanced 8859-1 character set
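To make the difference between these 8-bit sets concrete, here is a minimal Python sketch that decodes the same byte under each of them. It relies only on Python's built-in codecs; the helper name `decode_everywhere` is made up for this illustration.

```python
# The three 8-bit character sets named on the slide, by their Python codec names.
CHARSETS = ("latin-1", "iso8859-15", "cp1252")

def decode_everywhere(raw):
    """Decode the same bytes under each character set (hypothetical helper)."""
    return {name: raw.decode(name) for name in CHARSETS}

# Byte 0xA4 is the generic currency sign in latin-1 and CP1252,
# but the Euro sign in ISO 8859-15.
for name, text in decode_everywhere(bytes([0xA4])).items():
    print(f"0xA4 as {name}: {text!r}")

# The Euro sign itself lands on a different byte in each set that has it,
# and has no byte at all in latin-1.
print("€".encode("iso8859-15"))  # b'\xa4'
print("€".encode("cp1252"))      # b'\x80'
try:
    "€".encode("latin-1")
except UnicodeEncodeError:
    print("latin-1 has no Euro sign")
```

The same byte stream is therefore only meaningful together with the name of the character set that produced it.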
  9. 9. Oh, then we have collations! • A “collation” determines how characters in a character set are to be sorted! • 7-bit ASCII was great (numeric order same as character order) – Or was it? Really? Upper / Lower case? • 7-bit localized ASCII was not so great. To say the least. Swedish 7-bit ASCII was not correctly sorted (å last in the alphabet, after ä and ö) • 8-bit Extended ASCII didn’t help much (Swedish again being in the wrong order, but not the same wrong order as with 7-bit “Swedish ASCII”)
  10. 10. Collation basics • Don’t ever think that the character set determines the sorting! – The same character set used in different countries may be sorted differently – Different sorting models may be used in the same country (A good example is case sensitivity) • Also, collations are not only about sorting, they are also about comparisons and a few other things
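The point that one character set can be sorted in several ways can be sketched in a few lines of Python. This is a toy illustration, not how any database implements collations: `SWEDISH_ALPHABET` and `swedish_key` are made-up names, and the key only handles the 29 letters listed.

```python
# Toy "collation": position in the Swedish alphabet, which ends in å, ä, ö.
# (Assumption for illustration; real collations cover far more than this.)
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"

def swedish_key(word):
    """Sort key: alphabet positions of the case-folded letters."""
    return [SWEDISH_ALPHABET.index(ch) for ch in word.casefold()]

words = ["banana", "Apple", "apple", "Banana"]
letters = ["ö", "ä", "å", "z"]

# Plain code point ("binary") order: all uppercase sorts before lowercase.
binary_words = sorted(words)
# Case-insensitive order over the very same strings.
caseless_words = sorted(words, key=str.casefold)

# Code point order puts ä before å; Swedish order puts å before ä.
binary_letters = sorted(letters)
swedish_letters = sorted(letters, key=swedish_key)

print(binary_words)     # ['Apple', 'Banana', 'apple', 'banana']
print(caseless_words)   # ['Apple', 'apple', 'banana', 'Banana']
print(binary_letters)   # ['z', 'ä', 'å', 'ö']
print(swedish_letters)  # ['z', 'å', 'ä', 'ö']
```

Same strings, same character set, four different orders: the ordering comes from the key function, which plays the role of the collation here.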
  11. 11. Interoperating with ASCII • As long as we were all using 1 single computer or a bunch of similar computers in a LAN, the issues were limited • As usual, the Internet turned this beautiful environment into something truly evil! • The Internet got started in the US – Which means, again, that the founders were convinced that 7-bit ASCII would be OK. That this had been an incorrect assumption 30 years before Facebook came around made no difference. Of course not.
  12. 12. Interoperability necessities • For us to be able to communicate we need to be able to tell what character set we expect here at the client side, the server has to tell what it delivers, and then we need a way to align all this. • The trick: <meta http-equiv=Content-Type content="text/html; charset=iso-8859-1"> – Or maybe not? This tells what I get, but doesn’t allow me to say what I want! • Actually, this didn’t help as much as we hoped
  13. 13. Part 1 Conclusion • The many different local variations of characters served us well, for a while • Now we have a global IT environment with many different character sets and collations, and we can’t deal with multiple local versions anymore • And we have languages whose character set will not fit in 8 bits anyway • And then we need to sort and compare all this!
  14. 14. Part 2 – UNICODE and Ken Thompson save the world, without Batman and not by tracking down the penguin
  15. 15. UNICODE – One Character set for all • Yes, that is what UNICODE (or ISO/IEC 10646) sets out to do – A common character set for ALL languages (close to 240.000 characters are defined in UNICODE 4.1 today, MySQL is somewhat at UNICODE 3.0). Sort of. • This means that UNICODE has character codes that cannot fit in 1 byte. This is a big surprise to anyone on the other side of the pond, but there you go • But there is a remedy: UNICODE Encodings!
  16. 16. UNICODE Encodings • A UNICODE encoding is a standardized way of representing a character in the UNICODE character set • UNICODE encodings represent select parts of the full UNICODE character set • UNICODE encodings are part of the UNICODE standard itself (and this is a VERY good thing! If this wasn’t the case, both Apple and Microsoft would have invented their own encodings I’m sure)
  17. 17. UNICODE Encodings • Among the UNICODE encodings are – UCS-2 – 2 bytes wide (i.e. only 64k different characters can be represented) – UTF-16 – 2 or 4 bytes wide. This is then a variable length scheme with a very complex setup. When only 2 bytes are used, they are the same as UCS-2 – UTF-32 – 4 bytes fixed size • To be honest, besides UTF-16 / UCS-2, which are common in Windows and related frameworks (like COM), none of these are very popular
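The widths of these encodings are easy to check with Python's built-in codecs. The sketch below uses the big-endian variants (`utf-16-be`, `utf-32-be`) so that no byte order mark is added to the output; the helper name `encoded_sizes` is made up for this illustration, and UTF-8 is included for comparison.

```python
def encoded_sizes(ch):
    """Bytes needed for one character in each encoding (hypothetical helper)."""
    return {name: len(ch.encode(name))
            for name in ("utf-8", "utf-16-be", "utf-32-be")}

samples = {
    "A": "ASCII letter",
    "å": "Latin-1 letter",
    "€": "BMP symbol",
    "\U0001F600": "character outside the BMP",
}

for ch, label in samples.items():
    print(f"U+{ord(ch):04X} {label}: {encoded_sizes(ch)}")
```

UTF-32 is always 4 bytes, UTF-16 is 2 bytes until a character falls outside the Basic Multilingual Plane, where it needs 4.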
  18. 18. UTF-8 – Some smart dudes at work! • The problem that UNICODE has is that it has to represent all those characters. This should break some applications for sure. • Well, Encodings solve that too, and the mother of all encodings is UTF-8. Invented not by Albert Einstein or Batman but by Ken Thompson! • Let’s now have a round of applause for Ken Thompson!
  19. 19. The details of UTF-8 • UNICODE characters 0 – 127 are the same as in standard 7-bit ASCII (remember that?) • UTF-8 works the same: For characters 0 – 127, the most significant (first) bit of the first (and only) byte is 0 • Beyond 7-bit ASCII characters, the number of “leading” 1’s in the first byte tells how many bytes make up the character • All other bytes start with a 1 and a 0 • And the rest of the bits make up the character
  20. 20. The details of UTF-8 • So the first byte is one of two things: – A leading 0, meaning a single-byte character – A number of 1’s (at least 2, as 1-byte characters are indicated by a leading 0) followed by a 0 • This means that the first byte in a character NEVER starts with the sequence 10 – All other bytes start with 10 – 1 UTF-8 byte can contain up to 7 bits of data – 2 UTF-8 bytes contain from 8 to 11 bits of data – 3 UTF-8 bytes contain from 12 to 16 bits of data – 4 UTF-8 bytes contain from 17 to 21 bits of data
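The bit layout described on these two slides can be written out as a small hand-rolled encoder. This is a minimal sketch for illustration only: `utf8_encode` is a made-up name, and unlike a real codec it does not reject the surrogate range. Python's own `str.encode("utf-8")` is used as the reference.

```python
def utf8_encode(cp):
    """Encode one code point by hand (illustrative; no surrogate check)."""
    if cp < 0x80:        # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:       # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp < 0x110000:    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("code point out of range")

for ch in ("A", "å", "€", "\U0001F600"):
    encoded = utf8_encode(ord(ch))
    # The hand-rolled result should match the built-in codec.
    assert encoded == ch.encode("utf-8")
    print(f"U+{ord(ch):04X}", " ".join(f"{b:08b}" for b in encoded))
```

Printing the bytes in binary shows the pattern directly: the count of leading 1's in the first byte equals the number of bytes, and every following byte starts with 10.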
  21. 21. Some useful aspects of UTF-8 • You can always find the leading byte of a character in a word, starting from any byte – Just move “backward” until a byte not having a leading 10 is found • Byte values 0 – 127 are ONLY present as character values 0 – 127, nowhere else! – All other byte values have the highest bit set – So strlen(), strcmp() etc. still work, but on a byte by byte, not character by character, level
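Both properties can be shown in a few lines of Python. The helper name `char_start` is made up for this sketch; it implements the "move backward" rule from the slide by skipping bytes whose top two bits are 10.

```python
def char_start(data, i):
    """Index of the first byte of the character that byte i belongs to."""
    while i > 0 and (data[i] & 0xC0) == 0x80:  # 10xxxxxx: continuation byte
        i -= 1
    return i

text = "på €"
data = text.encode("utf-8")  # p | c3 a5 | space | e2 82 ac

print(len(text), "characters,", len(data), "bytes")  # 4 characters, 7 bytes
print(char_start(data, 6))  # last byte of the Euro sign -> 4
print(char_start(data, 2))  # second byte of å -> 1
print(char_start(data, 0))  # already a leading byte -> 0

# Bytes below 128 only ever appear as the ASCII characters themselves,
# so searching for an ASCII byte never matches inside a multi-byte character.
print(data.index(b" "))     # 3
```

The byte count versus character count difference is the same one a C `strlen()` call would show on UTF-8 data.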
  22. 22. So, are we all OK with UTF-8 now? • Let’s see. Using UTF-8 we can represent binary values with up to 21 bits, which is 2.097.152 characters! Which is more than enough! (But 640K RAM was ALSO more than enough) • If we limit ourselves to 3-byte UTF-8 we can represent 65.536 different characters, the same as if we use UCS-2 (which is a fixed 2-byte format). 65.536 characters is what is in the UNICODE Basic Multilingual Plane
  23. 23. Why we actually need 4-byte UTF-8 • Beyond the BMP come a couple of other “planes”. The one that causes most issues is the one that adds a bunch of Chinese, Japanese and Korean characters • For these, we need to go beyond the BMP and hence beyond the nice and cosy 65.536 characters. Duh! • And this is why the MySQL assumption that UTF-8 means a maximum of 3 bytes might not be such a good idea after all :-)
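A quick way to see which strings fall outside the 3-byte limit is to look for code points above U+FFFF. The following is a minimal sketch; `needs_four_byte_utf8` is a made-up helper name, and U+20000 is used as a sample character from the supplementary CJK range.

```python
def needs_four_byte_utf8(text):
    """True if any character lies outside the Basic Multilingual Plane."""
    return any(ord(ch) > 0xFFFF for ch in text)

bmp_text = "Über"
cjk_extension = "\U00020000"  # a CJK ideograph beyond the BMP

print(needs_four_byte_utf8(bmp_text))       # False
print(needs_four_byte_utf8(cjk_extension))  # True
print(len(cjk_extension.encode("utf-8")))   # 4
```

A check like this can be run on application data before deciding whether a 3-byte UTF-8 column is enough.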
  24. 24. Part 3 – MySQL and UNICODE
  25. 25. So, how does MySQL handle all this? • MySQL supports a whole range of UNICODE Encodings and collations! Good! • MySQL understands the case when we have one character set stored in a column in a table and another one on the client side, and nicely does a conversion for us! Good! • Not all UNICODE Encodings are valid on the Client side! Not so good :-( • Actually, anything beyond UTF-8, when it comes to UNICODE on the client side, is troublesome
  26. 26. Lessons in MySQL and UNICODE • Lesson 1: Learn about UNICODE and understand how it works • Lesson 2: Stick with UTF-8! Most others do that too. Including Java, JavaScript, JSON, the web and many, many others! • Lesson 3: UCS-2 may seem like a good idea, it is fixed length after all. It’s not (a good idea that is, fixed length it is) • Lesson 4: Don’t forget about collations! They are important!
  27. 27. Collations – The Sequel • Collations determine how strings are sorted – Order by – Indexes – WHERE col1 > ‘Über’ • Collations determine how strings are compared – Is A = Ä or not? Y = Ü? • Watch in particular for collations used for PRIMARY KEYs
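The "Is A = Ä or not?" question can be mimicked outside the database with a rough accent- and case-insensitive comparison. This sketch uses Python's standard `unicodedata` module; `fold` and `loosely_equal` are made-up names, and the approach is only an approximation of what a real collation does. Language-specific rules, such as the slide's Y = Ü question, are not modeled.

```python
import unicodedata

def fold(text):
    """Strip combining accents and case (rough stand-in for a collation)."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed
                       if not unicodedata.combining(ch))
    return stripped.casefold()

def loosely_equal(a, b):
    """Compare two strings ignoring accents and case."""
    return fold(a) == fold(b)

print("A" == "Ä")               # False: exact (binary) comparison
print(loosely_equal("A", "Ä"))  # True: accents ignored
print(loosely_equal("Y", "Ü"))  # False under this generic rule
```

If a key column were compared with a rule like `loosely_equal`, "A" and "Ä" would count as the same key value, which is why the choice of collation matters for PRIMARY KEYs.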
  28. 28. Storing UTF-8 data in MySQL • Most Storage Engines are happy to use utf-8 • The MySQL interpretation of UTF-8 is 1 – 3 bytes, or 65.536 different characters! – This means that • A CHAR(10) column requires 30 bytes fixed space! • A VARCHAR(10) column is limited to 30 bytes • MySQL 5.5 and up also supports 4-byte UTF-8 by using the character set utf8mb4
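The space arithmetic on this slide can be sketched as follows. The per-character maximums (1 byte for latin1, 3 for utf8, 4 for utf8mb4) are MySQL's; the helper name `worst_case_bytes` is made up, and the calculation deliberately ignores length prefixes and any engine-specific storage details.

```python
# Maximum bytes per character for three MySQL character sets.
MAX_BYTES_PER_CHAR = {"latin1": 1, "utf8": 3, "utf8mb4": 4}

def worst_case_bytes(n_chars, charset):
    """Worst-case data bytes for a column of n_chars characters."""
    return n_chars * MAX_BYTES_PER_CHAR[charset]

for charset in MAX_BYTES_PER_CHAR:
    print(f"CHAR(10) in {charset}: up to {worst_case_bytes(10, charset)} bytes")

# Actual UTF-8 size depends on the data: mostly-ASCII text stays small.
sample = "Über"
print(len(sample), "characters ->", len(sample.encode("utf-8")), "bytes")
```

The gap between the worst case and the actual size of typical data is where the wasted space comes from when a column has to be reserved at its maximum width.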
  29. 29. Storing UTF-8 data in MySQL • VARCHAR columns are actually fixed in some Storage Engines, most notably those engines developed sometime around the time of the American Civil War, when variable length data was still in its infancy • UTF-8 can potentially waste A LOT of space • Extra space for UTF-8 also affects byte size limits, such as VARCHAR and INDEX sizes • UTF-8 data sorting is way more complex than a simple binary sort (so in some ways, things were better in the old 7-bit ASCII days)
  30. 30. Some simple demos, Questions and Answers. And I haven’t even begun to talk about byte ordering and byte order marks.
  31. 31. THANK YOU! Anders Karlsson anders@skysql.com http://karlssonondatabases.blogspot.com
