Copyright © 2017 Oracle and/or its affiliates. All rights reserved.
Safe Harbor Statement
The following is intended to outline our general product direction. It is
intended for information purposes only, and may not be incorporated
into any contract. It is not a commitment to deliver any material, code,
or functionality, and should not be relied upon in making purchasing
decisions. The development, release, and timing of any features or
functionality described for Oracle’s products remains at the sole
discretion of Oracle.
Agenda
Why Unicode
What is character set/collation etc.
How to migrate and some issues to consider
Why Unicode?
● The whole world is moving towards Unicode as digital devices are used by more and more people across all cultures around the globe.
  – Approximate number of users, in billions, of the seven most used writing systems:
    Latin¹: ~5, Chinese: ~1.5, Arabic: ~0.7, Devanagari: ~0.5, Cyrillic: ~0.25, Bengali: ~0.22, Kana: ~0.12
● One driving force is emojis
  – Smileys, hearts, roses etc., and all the stuff people send to each other when communicating these days.
  – “Useful” example: Unicode character 0x1F574, MAN IN BUSINESS SUIT LEVITATING: 🕴
¹ This is way more letters than just ASCII!
Why Unicode in a database?
● You may use one character set for all your data, for all purposes.
  – E.g. if your application uses utf8mb4 for a table of names, it can be used by Russians, Chinese, Japanese etc.
  – Even esoteric extinct writing systems are covered, e.g. the Phaistos disc (look it up...)
  – But not Klingon, nor Tengwar
What is Unicode?
● Unicode is a computing industry standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems. (Wikipedia)
● ISO/IEC 10646
● Unicode covers most existing and extinct writing systems known to man in one standard.
● The standard defines 17 planes; blocks of characters are allocated within the planes.
Six planes allocated (... so far, Unicode 9.0.0)
● 0x0000-0xFFFF: Basic Multilingual Plane (BMP)
● 0x10000-0x1FFFF: Supplementary Multilingual Plane (SMP)
● 0x20000-0x2FFFF: Supplementary Ideographic Plane (SIP)
● 0xE0000-0xEFFFF: Supplementary Special-purpose Plane (SSP)
● 0xF0000-0xFFFFF: Supplementary Private Use Plane A (SPUA-A)
● 0x100000-0x10FFFF: Supplementary Private Use Plane B (SPUA-B)
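Since each plane spans 0x10000 code points, the plane number is simply the code point shifted right by 16 bits. A minimal Python sketch of that lookup follows; the names `PLANE_NAMES`, `plane_of` and `plane_name` are made up for this illustration, and the table only holds the six planes listed above.

```python
# Planes listed on the slide (Unicode 9.0.0); every other plane is treated as unassigned.
PLANE_NAMES = {
    0: "Basic Multilingual Plane (BMP)",
    1: "Supplementary Multilingual Plane (SMP)",
    2: "Supplementary Ideographic Plane (SIP)",
    14: "Supplementary Special-purpose Plane (SSP)",
    15: "Supplementary Private Use Plane A (SPUA-A)",
    16: "Supplementary Private Use Plane B (SPUA-B)",
}


def plane_of(code_point: int) -> int:
    """Return the plane number (0..16) of a Unicode code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError(f"not a Unicode code point: {code_point:#x}")
    # Each plane holds 0x10000 code points, so the plane is the value above bit 16.
    return code_point >> 16


def plane_name(code_point: int) -> str:
    """Return the plane name from the slide, or 'unassigned' for the other planes."""
    return PLANE_NAMES.get(plane_of(code_point), "unassigned")


for ch in ("A", "人", "\U0001F574"):
    print(f"U+{ord(ch):04X} -> plane {plane_of(ord(ch))}: {plane_name(ord(ch))}")
```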
What is a CHARACTER SET?
● A character set is defined by:
  – A repertoire of characters/graphemes
  – A value given to each character/grapheme (codepoint)
  – An encoding which defines the binary representation of the values
What is Encoding?
● The binary representation of a character. Unicode defines 3 encodings:
  – UTF-8 (1-4 bytes per character)
  – UTF-16 (2 or 4 bytes per character)
  – UTF-32 (4 bytes per character)
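The three encodings can be compared directly with Python's built-in codecs. This is only a sketch: `SAMPLES` and `encoded_hex` are names chosen for the illustration, and the big-endian codec variants are used so that no byte order mark is prepended.

```python
# Sample characters: ASCII, Latin-1 range, Cyrillic, CJK, and one outside the BMP.
SAMPLES = ["A", "Ä", "д", "人", "\U0001F574"]


def encoded_hex(ch: str) -> dict:
    """Return the hex form of one character in UTF-8, UTF-16 and UTF-32.

    The "-be" codecs are big-endian and do not emit a byte order mark.
    """
    return {
        "UTF-8": ch.encode("utf-8").hex().upper(),
        "UTF-16": ch.encode("utf-16-be").hex().upper(),
        "UTF-32": ch.encode("utf-32-be").hex().upper(),
    }


for ch in SAMPLES:
    print(f"U+{ord(ch):04X}", encoded_hex(ch))
```

UTF-8 grows from one to four bytes across these samples, UTF-16 needs a surrogate pair (four bytes) only for the last one, and UTF-32 always uses four bytes.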
Character set examples
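+-----------+----------------------+----------+----------+------------+
| Character | Character set        | Value    | Encoding | Encoded as |
+-----------+----------------------+----------+----------+------------+
| A         | ASCII                | 41       | 1:1      | 41         |
|           | ISO-8859-1 (Latin-1) | 41       | 1:1      | 41         |
|           | Unicode              | 0041     | UTF-8    | 41         |
|           |                      |          | UTF-16   | 0041       |
+-----------+----------------------+----------+----------+------------+
| Ä         | ISO-8859-1 (Latin-1) | C4       | 1:1      | C4         |
|           | Unicode              | 00C4     | UTF-8    | C384       |
|           |                      |          | UTF-16   | 00C4       |
+-----------+----------------------+----------+----------+------------+
| д         | KOI8-R               | C4       | 1:1      | C4         |
|           | ISO-8859-5           | D4       | 1:1      | D4         |
|           | Unicode              | 0434     | UTF-8    | D0B4       |
|           |                      |          | UTF-16   | 0434       |
+-----------+----------------------+----------+----------+------------+
| 人        | GB-18030             | C8CB     | 1:1      | C8CB       |
|           | Unicode              | 4EBA     | UTF-8    | E4BABA     |
|           |                      |          | UTF-16   | 4EBA       |
|           | Big5                 | A448     | 1:1      | A448       |
|           | JIS X 0208 (SJIS)    | 906C     | 1:1      | 906C       |
+-----------+----------------------+----------+----------+------------+
| 🕴        | Unicode              | 1F574    | UTF-8    | F09F95B4   |
|           |                      |          | UTF-16   | D83DDD74   |
|           | GB-18030             | 9439EE36 | 1:1      | 9439EE36   |
+-----------+----------------------+----------+----------+------------+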
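The legacy rows of the table can be reproduced with Python's standard codecs, which also shows that the same character gets a different byte value in each character set. This is a sketch only; `LEGACY_ROWS` and `encode_hex` are illustrative names, and the codec names are Python's, not MySQL's.

```python
# (character, Python codec name) pairs mirroring the non-Unicode rows of the table.
LEGACY_ROWS = [
    ("A", "ascii"),
    ("Ä", "latin-1"),
    ("д", "koi8_r"),
    ("д", "iso8859_5"),
    ("人", "gb18030"),
    ("人", "big5"),
    ("人", "shift_jis"),
    ("\U0001F574", "gb18030"),
]


def encode_hex(ch: str, codec: str) -> str:
    """Encode one character with the given codec and return upper-case hex."""
    return ch.encode(codec).hex().upper()


for ch, codec in LEGACY_ROWS:
    print(f"{ch!r:14} {codec:10} {encode_hex(ch, codec)}")

# A character outside a character set's repertoire cannot be encoded at all.
try:
    "д".encode("latin-1")
except UnicodeEncodeError as exc:
    print("not in Latin-1:", exc.reason)
```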
What is collation
● Collation is the assembly of written information into a standard order (Wikipedia)
● Collation may consider
  – Case (e.g. 'A' vs. 'a')
  – Accents (e.g. 'E' vs. 'É')
  – Locale-specific rules (e.g. 'A' vs. 'Å' vs. 'AA' in Danish and Norwegian)
  – Numeric characters (e.g. '2' vs. 'ⅱ')
  – Punctuation (e.g. 'blackbird' vs. 'black-bird')
  – Etc.
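To make the case and accent dimensions concrete, here is a rough Python sketch of three comparison keys. It is not the Unicode Collation Algorithm and not how MySQL implements collations; the function names are invented for the illustration and merely echo the ai/as and ci/cs distinctions.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Drop combining marks after canonical decomposition (e.g. 'É' becomes 'E')."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def key_ai_ci(text: str) -> str:
    """Accent-insensitive, case-insensitive comparison key."""
    return strip_accents(text).casefold()


def key_as_ci(text: str) -> str:
    """Accent-sensitive, case-insensitive comparison key."""
    return unicodedata.normalize("NFC", text).casefold()


def key_as_cs(text: str) -> str:
    """Accent-sensitive, case-sensitive comparison key."""
    return unicodedata.normalize("NFC", text)


for a, b in (("A", "a"), ("E", "É")):
    print(a, b,
          "ai_ci:", key_ai_ci(a) == key_ai_ci(b),
          "as_ci:", key_as_ci(a) == key_as_ci(b),
          "as_cs:", key_as_cs(a) == key_as_cs(b))
```

Locale-specific rules, numerics and punctuation handling need real collation tables and are outside what this sketch covers.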
What is a COLLATION in (My)SQL?
● In MySQL, a COLLATION is a set of rules for a given character set which defines an order and affects:
  – ORDER BY
  – LIKE
  – Primary keys and indexes
  – Unique constraints
  – Comparison operators
  – Some string functions
● All strings in MySQL have a character set and a collation
Character sets in MySQL
+----------+---------------------------------+---------------------+--------+
| Charset  | Description                     | Default collation   | Maxlen |
+----------+---------------------------------+---------------------+--------+
| ascii    | US ASCII                        | ascii_general_ci    |      1 |
| latin1   | cp1252 West European            | latin1_swedish_ci   |      1 |
| utf8     | UTF-8 Unicode                   | utf8_general_ci     |      3 |
| utf8mb4  | UTF-8 Unicode                   | utf8mb4_0900_ai_ci  |      4 |
+----------+---------------------------------+---------------------+--------+
Get all by typing:
mysql> show character set;
The rest of them are:
armscii8, big5, binary, cp1250, cp1251, cp1256, cp1257, cp850, cp852, cp866, cp932, dec8, eucjpms,
euckr, gb18030, gb2312, gbk, geostd8, greek, hebrew, hp8, keybcs2, koi8r, koi8u, latin2, latin5, latin7,
macce, macroman, sjis, swe7, tis620, ucs2, ujis, utf16, utf16le, utf32
What's in MySQL 8.0
● Default character set: utf8mb4 with default collation: utf8mb4_0900_ai_ci
● Three language-independent collations: utf8mb4_0900_ai_ci, utf8mb4_0900_as_ci, utf8mb4_0900_as_cs
  – May be used for German dictionary order, English, French¹, Irish Gaelic, Indonesian, Italian, Luxembourgian, Malay, Dutch, Portuguese, Swahili and Zulu
● A lot of new collations based on Unicode v. 9.0.0
  – UCA (Unicode Collation Algorithm)
  – DUCET (Default Unicode Collation Entry Table)
  – CLDR v.30 (Common Locale Data Repository)
● All utf8mb4_*_0900_* collations are NO PAD
¹ Canadian French may not use the utf8mb4_0900_as_cs/utf8mb4_0900_as_ci collations due to differences from the standard accent order.
New in MySQL 8.0
● We have gone to great lengths to make the new utf8mb4_*_0900_* collations correct and complete.
● Accent insensitive/case insensitive and accent sensitive/case sensitive collations have been made for:
  – Classical Latin (la), Croatian (hr), Czech (cs), Danish/Norwegian (da), Esperanto (eo), Estonian (et), German phone book order (de_pb), Hungarian (hu), Icelandic (is), Latvian (lv), Lithuanian (lt), Polish (pl), Romanian (ro), Russian (ru), Slovak (sk), Slovenian (sl), Modern Spanish (es), Traditional Spanish (es_trad), Swedish (sv), Turkish (tr), Vietnamese (vi)
● Accent/case sensitive and accent/case/kana sensitive collations have been made for:
  – Japanese (ja)
MySQL 8.0 collation name scheme
● <charset>[_<language> [_<variant>]]_<unicodeversion>(_<attribute>)+
  – <charset> = utf8mb4
  – <language>, an ISO 639-1 language code (or ISO 639-2 if needed)
  – <variant>, a variant of the standard collation for the language.
    As of today: utf8mb4_de_pb_0900_* and utf8mb4_es_trad_0900_*.
  – <unicodeversion> = 0900
  – <attribute>: accent sensitivity (ai, as), case sensitivity (ci, cs), kana sensitivity (ks) and possible future ones.
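The naming scheme is regular enough to be parsed mechanically. The following Python sketch does so with a regular expression; `CollationName` and `parse_collation` are hypothetical helpers, and the pattern only accepts the attributes named above (ai, as, ci, cs, ks), so older names such as utf8mb4_general_ci are deliberately rejected.

```python
import re
from typing import NamedTuple, Optional, Tuple


class CollationName(NamedTuple):
    charset: str
    language: Optional[str]
    variant: Optional[str]
    unicode_version: str
    attributes: Tuple[str, ...]


# <charset>[_<language>[_<variant>]]_<unicodeversion>(_<attribute>)+
_PATTERN = re.compile(
    r"^(?P<charset>utf8mb4)"
    r"(?:_(?P<language>[a-z]{2,3})(?:_(?P<variant>[a-z]+))?)?"
    r"_(?P<version>\d{4})"
    r"(?P<attrs>(?:_(?:ai|as|ci|cs|ks))+)$"
)


def parse_collation(name: str) -> Optional[CollationName]:
    """Split a collation name that follows the 8.0 scheme; return None otherwise."""
    match = _PATTERN.match(name)
    if match is None:
        return None
    attributes = tuple(match.group("attrs").strip("_").split("_"))
    return CollationName(
        match.group("charset"),
        match.group("language"),
        match.group("variant"),
        match.group("version"),
        attributes,
    )


for name in ("utf8mb4_0900_ai_ci", "utf8mb4_de_pb_0900_ai_ci",
             "utf8mb4_ja_0900_as_cs_ks", "utf8mb4_general_ci"):
    print(name, "->", parse_collation(name))
```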
Why not ...
● Fix utf8mb4_general_ci instead of introducing utf8mb4_0900_ai_ci or fix utf8mb4_german2_ci instead of introducing utf8mb4_de_pb_0900_ai_ci?
  – Because that might break existing applications using the old collations (the most serious issue for large databases: indexes would have to be rebuilt).
    Policy: Collations don't change!
● Have a simpler name scheme?
  – Because we prepare for
    ● More languages
    ● New Unicode versions (Unicode 10.0.0 is expected in 2018)
  – ISO-639-1/ISO-639-2 language codes are well defined
How to migrate?
● When migrating from 5.7 tables:
  – Just convert the table:
    ALTER TABLE foo CONVERT TO CHARACTER SET utf8mb4;
● This will change the default character set of the table (so that future new columns get utf8mb4) and the character set of all applicable columns.
● In principle, all character data in MySQL may be converted to utf8mb4 without loss of data.
That was easy ..... is that all there is to it ... ?
… not quite … column by column
● If you have more complex tables with different character sets:
  – Change the default character set of the table:
    ALTER TABLE foo DEFAULT CHARACTER SET utf8mb4;
  – Modify all relevant columns:
    ALTER TABLE foo MODIFY bar VARCHAR(100) CHARACTER SET utf8mb4;
Generally we recommend doing it column by column.
  – ALTER TABLE … CONVERT … will e.g. change TEXT to MEDIUMTEXT when you convert from latin1 to utf8mb4, and that won't necessarily be what you want.
… not quite … the schema too
● A schema (a.k.a. database) in MySQL has a default character set which will be the default character set of new tables in the schema
  – mysql> show create schema bar;
+----------+----------------------------------------------------------------+
| Database | Create Database                                                |
+----------+----------------------------------------------------------------+
| bar      | CREATE DATABASE `bar` /*!40100 DEFAULT CHARACTER SET latin1 */ |
+----------+----------------------------------------------------------------+
1 row in set (0.00 sec)
● Change the default character set of the schema (database):
  ALTER SCHEMA bar DEFAULT CHARACTER SET utf8mb4;
… not quite … collation differences
Collations are not equal, so converting from one collation to another may break UNIQUE constraints (e.g. PRIMARY KEY).
● Default collation:
  – latin1_swedish_ci vs. utf8mb4_0900_ai_ci
    E.g. 'o'='ö' is false in the first, but true in the second.
  – Possible solution: Stick to Swedish or another suitable collation depending on your application:
    ALTER TABLE foo CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_sv_0900_ai_ci;
  – Generally, if you don't care about case insensitivity (you just got it by default), utf8mb4_0900_as_cs should be safe.
● There's a huge number of possibilities depending on your data and the collations used, partly because pre-MySQL 8.0 collations were not complete (and in some cases not correct).
… not quite … index and key issues
● If you change the collation of a column, indexes on that column will be regenerated.
  – This takes time for large data, and the table is locked during that time.
  – And the conversion may fail due to changed space consumption.
● Max key length is 3072 bytes¹, which implies that the max length of a utf8mb4 varchar column which is also a key is 768 characters (worst case scenario: 4 bytes per character).
  – mysql> create table foo (v varchar(1000) character set latin1 primary key);
    Query OK, 0 rows affected (0.01 sec)
    mysql> alter table foo modify v varchar(1000) character set utf8mb4;
    ERROR 1071 (42000): Specified key was too long; max key length is 3072 bytes
¹ For default InnoDB row format and default innodb_page_size in MySQL 8.0. See the documentation for details.
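The 768-character limit is plain worst-case arithmetic: the key budget divided by the maximum bytes per character. A small Python sketch of that calculation follows; the constants are taken from the slides (3072 bytes, and the Maxlen values from the character set table), while the function names are illustrative.

```python
MAX_KEY_BYTES = 3072  # from the slide: default InnoDB row format and page size in MySQL 8.0
MAXLEN = {"latin1": 1, "utf8": 3, "utf8mb4": 4}  # worst-case bytes per character


def max_indexable_chars(charset: str) -> int:
    """Longest fully indexed column, in characters, that fits the key budget."""
    return MAX_KEY_BYTES // MAXLEN[charset]


def fits_in_key(length_chars: int, charset: str) -> bool:
    """True if a column of this declared length fits in a key in the worst case."""
    return length_chars * MAXLEN[charset] <= MAX_KEY_BYTES


print(max_indexable_chars("utf8mb4"))   # worst case for utf8mb4
print(fits_in_key(1000, "latin1"))      # the create table above
print(fits_in_key(1000, "utf8mb4"))     # the failing alter table above
```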
Upgrade example
mysql> show create table cities;
+--------+----------------------
| Table  | Create Table
+--------+----------------------
| cities | CREATE TABLE `cities` (
  `name` varchar(1024) NOT NULL,
  `population` int(11) DEFAULT NULL,
  PRIMARY KEY (`name`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
+--------+----------------------
1 row in set (0.00 sec)
mysql> select * from cities;
+------------+------------+
| name       | population |
+------------+------------+
| København  |    1246611 |
| Orebro     |     107380 |
| Oslo       |     666759 |
| Stockholm  |     935619 |
| Örebro     |     107380 |
+------------+------------+
5 rows in set (0.00 sec)
mysql> alter table cities modify column name varchar(1024) charset utf8mb4;
ERROR 1062 (23000): Duplicate entry 'Örebro' for key 'PRIMARY'
Checking before altering the table
with counted as
  (select name, count(name) over w as cnt
   from cities
   window w as (partition by convert(name using utf8mb4)))
select name from counted where cnt > 1;  -- see footnote ¹
+---------+
| name    |
+---------+
| Orebro  |
| Örebro  |
+---------+
2 rows in set (0.00 sec)
¹ Only MySQL 8.0, not 5.7 or earlier!
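The idea behind the query, grouping values by how they compare under the target collation and reporting groups with more than one member, can also be sketched outside the database. The Python below uses a crude accent- and case-insensitive fold as a stand-in for utf8mb4_0900_ai_ci; that substitution is an assumption of this sketch, and `fold` and `colliding_keys` are made-up names.

```python
import unicodedata
from collections import defaultdict


def fold(name: str) -> str:
    """Rough stand-in for an accent- and case-insensitive collation key."""
    decomposed = unicodedata.normalize("NFD", name)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).casefold()


def colliding_keys(names):
    """Return groups of values that would compare equal under the folded key."""
    groups = defaultdict(list)
    for name in names:
        groups[fold(name)].append(name)
    return [group for group in groups.values() if len(group) > 1]


cities = ["København", "Orebro", "Oslo", "Stockholm", "Örebro"]
print(colliding_keys(cities))
```

For real data the check should be run in the server with the actual target collation, as in the query above, since the fold here does not implement the collation's full rules.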
⚠ 文字化け (Mojibake)
… or what you see is not what you get...
mysql> create table foo(v varchar(10) character set latin1);
mysql> insert into foo values('å');
mysql> set names latin1;
mysql> insert into foo values('å');
mysql> set names utf8mb4;
mysql> select * from foo;
+------+
| v |
+------+
| å |
| å |
+------+
2 rows in set (0.00 sec)
mysql> select hex(v) from foo;
+--------+
| hex(v) |
+--------+
| E5 |
| C3A5 |
+--------+
2 rows in set (0.00 sec)
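The two hex values come from the same character being encoded under two different assumptions. A short Python sketch reproduces them; it uses Python's `latin-1` codec as a stand-in for MySQL's latin1, which is an assumption (MySQL's latin1 is based on cp1252, but the two agree for this character).

```python
text = "å"

# The client and server agree on the encoding: one byte is stored.
as_latin1 = text.encode("latin-1")

# The client sends UTF-8 bytes while claiming latin1: two bytes are stored as-is.
as_utf8 = text.encode("utf-8")

# Reading those two bytes back as latin1 yields two unrelated characters.
misread = as_utf8.decode("latin-1")

print(as_latin1.hex().upper(), as_utf8.hex().upper(), repr(misread))
```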
Fixing æ–‡å—化㠑
mysql> select v from foo;
+-------------------------------+
| v |
+-------------------------------+
| æ–‡å—化㠑 |
+-------------------------------+
1 row in set (0.01 sec)
mysql> alter table foo modify column v varchar(128) charset binary;
Query OK, 1 row affected (0.14 sec)
Records: 1 Duplicates: 0 Warnings: 0
mysql> alter table foo modify column v varchar(128) charset utf8mb4;
Query OK, 1 row affected (0.14 sec)
Records: 1 Duplicates: 0 Warnings: 0
mysql> select v from foo;
+--------------+
| v |
+--------------+
| 文字化け |
+--------------+
1 row in set (0.00 sec)
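The two ALTER statements above first reinterpret the stored characters as raw bytes and then reinterpret those bytes as utf8mb4. The same round trip can be sketched in Python, again with the `latin-1` codec standing in for MySQL's latin1 (an assumption; Python's `cp1252` codec cannot decode every byte value, so it is not used here). `repair_mojibake` is a hypothetical helper name.

```python
def repair_mojibake(garbled: str) -> str:
    """Undo 'UTF-8 bytes read as latin1' by reversing the two steps."""
    raw = garbled.encode("latin-1")  # back to the bytes that were really stored
    return raw.decode("utf-8")       # interpret them as what they always were


original = "文字化け"
# Simulate the damage: UTF-8 bytes labelled and read as latin1.
garbled = original.encode("utf-8").decode("latin-1")

print(len(original), len(garbled))
print(repair_mojibake(garbled) == original)
```

This only works when the stored bytes are intact UTF-8; text that was garbled and then converted again generally needs the conversions undone in reverse order.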
Space consumption
● utf8mb4 uses
  – 1 byte for ASCII characters (0x00-0x7F),
  – 2 bytes for most alphabets/abjads (0x80-0x7FF),
  – 3 bytes for Indic scripts, Hangul, Kana, the most used CJK Ideographs (0x800-0xFFFF),
  – 4 bytes for the rest: archaic scripts, emojis, rarely used CJK extensions etc. (0x10000-)
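These ranges translate directly into a size estimate for stored text. The Python sketch below maps a code point to its UTF-8 length using the boundaries listed above and sums that over a string; `utf8_length` and `utf8_bytes_needed` are illustrative names, and the result can be cross-checked against Python's own encoder.

```python
def utf8_length(code_point: int) -> int:
    """Number of bytes UTF-8 needs for one code point, per the ranges above."""
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    return 4


def utf8_bytes_needed(text: str) -> int:
    """Total UTF-8 size of a string in bytes."""
    return sum(utf8_length(ord(ch)) for ch in text)


for sample in ("Oslo", "København", "文字化け", "\U0001F574"):
    size = utf8_bytes_needed(sample)
    # Cross-check against the built-in encoder.
    assert size == len(sample.encode("utf-8"))
    print(f"{sample}: {len(sample)} characters, {size} bytes")
```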
Speed issues
● Operations on multibyte character sets are inherently slower than on single-byte character sets (e.g. utf8mb4 vs. latin1)
● We have done a lot of code improvements.
  – New code for the new utf8mb4 collations
  – New collations are NO PAD (which gives faster algorithms)
  – But expect a performance degradation in the order of 10-20% for sorting when you migrate from e.g. latin1 to utf8mb4, depending on your data of course.
● Some collations are inherently slower than others (e.g. utf8mb4_0900_ai_ci vs. utf8mb4_ja_0900_as_cs_ks)
Truly usable for global purposes.....
Q&A
● Check out my blogs at
  http://mysqlserverteam.com/author/bernt/
● The 8.0 documentation (if everything else fails … 😠)
  https://dev.mysql.com/doc/refman/8.0/en/charset.html
● The Unicode documents (for those truly interested … 😇)
  http://unicode.org/
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
James Anderson
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
Neo4j
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
SOFTTECHHUB
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
Kumud Singh
 

Recently uploaded (20)

RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
 
20 Comprehensive Checklist of Designing and Developing a Website
20 Comprehensive Checklist of Designing and Developing a Website20 Comprehensive Checklist of Designing and Developing a Website
20 Comprehensive Checklist of Designing and Developing a Website
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
 
Building RAG with self-deployed Milvus vector database and Snowpark Container...
Building RAG with self-deployed Milvus vector database and Snowpark Container...Building RAG with self-deployed Milvus vector database and Snowpark Container...
Building RAG with self-deployed Milvus vector database and Snowpark Container...
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
 
Large Language Model (LLM) and it’s Geospatial Applications
Large Language Model (LLM) and it’s Geospatial ApplicationsLarge Language Model (LLM) and it’s Geospatial Applications
Large Language Model (LLM) and it’s Geospatial Applications
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
 
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
 

Unicode and Collations in MySQL 8.0

  • 1.
  • 2. Copyright © 2017 Oracle and/or its affiliates. All rights reserved. Safe Harbor Statement: The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle.
  • 3. Agenda
       ● Why Unicode
       ● What is a character set, a collation, etc.
       ● How to migrate, and some issues to consider
  • 4. Why Unicode?
       ● The whole world is moving towards Unicode as digital devices are used by more and more people across all cultures around the globe.
         – Approximate number of users, in billions, of the most used writing systems: Latin¹: ~5, Chinese: ~1.5, Arabic: ~0.7, Devanagari: ~0.5, Cyrillic: ~0.25, Bengali: ~0.22, Kana: ~0.12
       ● One driving force is emoji.
         – Smileys, hearts, roses etc., and all the other things people send each other when communicating these days.
         – “Useful” example: Unicode character 0x1F574, MAN IN BUSINESS SUIT LEVITATING: 🕴
       ¹ This is way more letters than just ASCII!
  • 5. Why Unicode in a database?
       ● You may use one character set for all your data, for all purposes.
         – E.g. if your application uses utf8mb4 for a table of names, it can be used by Russians, Chinese, Japanese etc.
         – Even esoteric, extinct writing systems are covered, e.g. the script of the Phaistos disc (look it up...)
         – But not Klingon, nor Tengwar
  • 6. What is Unicode?
       ● Unicode is a computing industry standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems. (Wikipedia)
       ● ISO/IEC 10646
       ● Unicode covers most existing and extinct writing systems known to man in one standard.
       ● The standard defines 17 planes; blocks of characters are allocated within the planes.
  • 7. Six planes allocated (... so far, Unicode 9.0.0)
       ● 0x0000-0xFFFF: Basic Multilingual Plane (BMP)
       ● 0x10000-0x1FFFF: Supplementary Multilingual Plane (SMP)
       ● 0x20000-0x2FFFF: Supplementary Ideographic Plane (SIP)
       ● 0xE0000-0xEFFFF: Supplementary Special-purpose Plane (SSP)
       ● 0xF0000-0xFFFFF: Supplementary Private Use Plane A (SPUA-A)
       ● 0x100000-0x10FFFF: Supplementary Private Use Plane B (SPUA-B)
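The plane ranges above imply that the plane number is simply the code point divided by 0x10000. A minimal Python sketch of that lookup follows; the names `PLANE_NAMES`, `plane_of` and `plane_name` are made up for illustration, and the table only covers the six planes listed on the slide.

```python
# Plane names as listed on the slide (Unicode 9.0.0).
# Planes missing from this table had no allocations at that time.
PLANE_NAMES = {
    0: "Basic Multilingual Plane (BMP)",
    1: "Supplementary Multilingual Plane (SMP)",
    2: "Supplementary Ideographic Plane (SIP)",
    14: "Supplementary Special-purpose Plane (SSP)",
    15: "Supplementary Private Use Plane A (SPUA-A)",
    16: "Supplementary Private Use Plane B (SPUA-B)",
}


def plane_of(char: str) -> int:
    """Return the plane number (0-16) of a single character."""
    # Each plane holds 0x10000 code points, so the plane is the
    # code point shifted right by 16 bits.
    return ord(char) >> 16


def plane_name(char: str) -> str:
    """Return the slide's name for the character's plane."""
    return PLANE_NAMES.get(plane_of(char), "plane not listed on the slide")


if __name__ == "__main__":
    for ch in ("A", "人", "\U0001F574"):
        print(f"U+{ord(ch):04X} -> plane {plane_of(ch)}: {plane_name(ch)}")
```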
  • 8. What is a CHARACTER SET?
       ● A character set is defined by:
         – A repertoire of characters/graphemes
         – A value given to each character/grapheme (code point)
         – An encoding which defines the binary representation of the values
  • 9. What is Encoding?
       ● The binary representation of a character. Unicode defines 3 encodings:
         – UTF-8 (1-4 bytes per character)
         – UTF-16 (2 or 4 bytes per character)
         – UTF-32 (4 bytes per character)
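To make the three encodings concrete, here is a small Python sketch that prints the bytes of one character in each of them. The helper name `encoded_hex` is hypothetical, and big-endian byte order without a byte order mark is an assumption made only to keep the output easy to read.

```python
def encoded_hex(char: str) -> dict:
    """Return the hex bytes of one character in UTF-8, UTF-16 and UTF-32.

    Assumption: big-endian variants are used so that no byte order
    mark is prepended to the output.
    """
    return {
        "UTF-8": char.encode("utf-8").hex().upper(),
        "UTF-16": char.encode("utf-16-be").hex().upper(),
        "UTF-32": char.encode("utf-32-be").hex().upper(),
    }


if __name__ == "__main__":
    for ch in ("A", "Ä", "д", "人", "\U0001F574"):
        print(f"U+{ord(ch):04X}", encoded_hex(ch))
```

A character outside the BMP, such as U+1F574, needs four bytes in all three encodings; in UTF-16 it is stored as a surrogate pair.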
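  • 10. Character set examples
       +-----------+----------------------+----------+----------+------------+
       | Character | Character set        | Value    | Encoding | Encoded as |
       +-----------+----------------------+----------+----------+------------+
       | A         | ASCII                | 41       | 1:1      | 41         |
       |           | ISO-8859-1 (Latin-1) | 41       | 1:1      | 41         |
       |           | Unicode              | 0041     | UTF-8    | 41         |
       |           |                      |          | UTF-16   | 0041       |
       | Ä         | ISO-8859-1 (Latin-1) | C4       | 1:1      | C4         |
       |           | Unicode              | 00C4     | UTF-8    | C384       |
       |           |                      |          | UTF-16   | 00C4       |
       | д         | KOI8-R               | C4       | 1:1      | C4         |
       |           | ISO-8859-5           | D4       | 1:1      | D4         |
       |           | Unicode              | 0434     | UTF-8    | D0B4       |
       |           |                      |          | UTF-16   | 0434       |
       | 人        | GB-18030             | C8CB     | 1:1      | C8CB       |
       |           | Unicode              | 4EBA     | UTF-8    | E4BABA     |
       |           |                      |          | UTF-16   | 4EBA       |
       |           | Big5                 | A448     | 1:1      | A448       |
       |           | JIS X 0208 (SJIS)    | 906C     | 1:1      | 906C       |
       | 🕴        | Unicode              | 1F574    | UTF-8    | F09F95B4   |
       |           |                      |          | UTF-16   | D83DDD74   |
       |           | GB-18030             | 9439EE36 | 1:1      | 9439EE36   |
       +-----------+----------------------+----------+----------+------------+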
  • 11. What is collation?
       ● Collation is the assembly of written information into a standard order (Wikipedia)
       ● Collation may consider
         – Case (e.g. 'A' vs. 'a')
         – Accents (e.g. 'E' vs. 'É')
         – Locale-specific rules (e.g. 'A' vs. 'Å' vs. 'AA' in Danish and Norwegian)
         – Numeric characters (e.g. '2' vs. 'ⅱ')
         – Punctuation (e.g. 'blackbird' vs. 'black-bird')
         – Etc.
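The case and accent aspects listed above can be illustrated with a rough Python sketch. This is not the Unicode Collation Algorithm that the MySQL collations implement; it only approximates equality by case folding and by dropping combining marks after NFD normalization, and the names `strip_accents` and `comparison_key` are hypothetical.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Drop combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def comparison_key(text: str, accent_sensitive: bool, case_sensitive: bool) -> str:
    """Build a key so that two strings compare equal when their keys match.

    Rough approximation only: locale rules, numeric characters and
    punctuation handling are deliberately left out.
    """
    if not accent_sensitive:
        text = strip_accents(text)
    if not case_sensitive:
        text = text.casefold()
    return text


if __name__ == "__main__":
    for a, b in (("A", "a"), ("E", "É")):
        same = comparison_key(a, False, False) == comparison_key(b, False, False)
        print(f"{a!r} vs {b!r}: equal when accent and case insensitive -> {same}")
```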
  • 12. What is a COLLATION in (My)SQL?
       ● In MySQL, a COLLATION is a set of rules for a given character set which defines an order and affects:
         – ORDER BY
         – LIKE
         – Primary keys and indexes
         – Unique constraints
         – Comparison operators
         – Some string functions
       ● All strings in MySQL have a character set and a collation
  • 13. Character sets in MySQL
       +----------+---------------------------------+---------------------+--------+
       | Charset  | Description                     | Default collation   | Maxlen |
       +----------+---------------------------------+---------------------+--------+
       | ascii    | US ASCII                        | ascii_general_ci    |      1 |
       | latin1   | cp1252 West European            | latin1_swedish_ci   |      1 |
       | utf8     | UTF-8 Unicode                   | utf8_general_ci     |      3 |
       | utf8mb4  | UTF-8 Unicode                   | utf8mb4_0900_ai_ci  |      4 |
       +----------+---------------------------------+---------------------+--------+
       Get all by typing: mysql> show character set;
       The rest of them are: armscii8, big5, binary, cp1250, cp1251, cp1256, cp1257, cp850, cp852, cp866, cp932, dec8, eucjpms, euckr, gb18030, gb2312, gbk, geostd8, greek, hebrew, hp8, keybcs2, koi8r, koi8u, latin2, latin5, latin7, macce, macroman, sjis, swe7, tis620, ucs2, ujis, utf16, utf16le, utf32
  • 14. What's in MySQL 8.0
       ● Default character set: utf8mb4 with default collation: utf8mb4_0900_ai_ci
       ● Three language-independent collations: utf8mb4_0900_ai_ci, utf8mb4_0900_as_ci, utf8mb4_0900_as_cs
         – may be used for German dictionary order, English, French¹, Irish Gaelic, Indonesian, Italian, Luxembourgian, Malay, Dutch, Portuguese, Swahili and Zulu
       ● A lot of new collations based on Unicode v. 9.0.0
         – UCA (Unicode Collation Algorithm)
         – DUCET (Default Unicode Collation Entry Table)
         – CLDR v.30 (Common Locale Data Repository)
       ● All utf8mb4_*_0900_* collations are NO PAD
       ¹ Canadian French may not use the utf8mb4_0900_as_cs/utf8mb4_0900_as_ci collations because its accent order differs from the standard one.
  • 15. New in MySQL 8.0
       ● We have gone to great lengths to make the new utf8mb4_*_0900_* collations correct and complete.
       ● Accent insensitive/case insensitive and accent sensitive/case sensitive collations have been made for:
         – Classical Latin (la), Croatian (hr), Czech (cs), Danish/Norwegian (da), Esperanto (eo), Estonian (et), German phone book order (de_pb), Hungarian (hu), Icelandic (is), Latvian (lv), Lithuanian (lt), Polish (pl), Romanian (ro), Russian (ru), Slovak (sk), Slovenian (sl), Modern Spanish (es), Traditional Spanish (es_trad), Swedish (sv), Turkish (tr), Vietnamese (vi)
       ● Accent/case sensitive and accent/case/kana sensitive collations for: Japanese (ja)
  • 16. MySQL 8.0 collation name scheme
       ● <charset>[_<language>[_<variant>]]_<unicodeversion>(_<attribute>)+
         – <charset> = utf8mb4
         – <language>: an ISO 639-1 language code (or ISO 639-2 if needed)
         – <variant>: a variant of the standard collation for the language. As of today: utf8mb4_de_pb_0900_* and utf8mb4_es_trad_0900_*.
         – <unicodeversion> = 0900
         – <attribute>: accent sensitivity (ai, as), case sensitivity (ci, cs), kana sensitivity (ks) and possible future ones.
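The name scheme above can be read mechanically. The following Python sketch splits a collation name into its parts with a regular expression; `parse_collation` is a hypothetical helper, and the pattern assumes exactly what the slide states (charset utf8mb4, a four-digit Unicode version, and the attributes ai, as, ci, cs, ks).

```python
import re

# Pattern derived from the slide's scheme:
# <charset>[_<language>[_<variant>]]_<unicodeversion>(_<attribute>)+
_COLLATION_PATTERN = re.compile(
    r"(?P<charset>utf8mb4)"
    r"(?:_(?P<language>[a-z]{2,3})(?:_(?P<variant>[a-z]+))?)?"
    r"_(?P<unicodeversion>\d{4})"
    r"(?P<attributes>(?:_(?:ai|as|ci|cs|ks))+)"
)


def parse_collation(name: str):
    """Split a new-style collation name into its parts.

    Returns None for names that do not follow the scheme,
    for example older collations such as utf8mb4_general_ci.
    """
    match = _COLLATION_PATTERN.fullmatch(name)
    if match is None:
        return None
    parts = match.groupdict()
    parts["attributes"] = parts["attributes"].strip("_").split("_")
    return parts


if __name__ == "__main__":
    for name in ("utf8mb4_0900_ai_ci", "utf8mb4_de_pb_0900_as_cs",
                 "utf8mb4_ja_0900_as_cs_ks", "utf8mb4_general_ci"):
        print(name, "->", parse_collation(name))
```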
  • 17. Why not ...
       ● Fix utf8mb4_general_ci instead of introducing utf8mb4_0900_ai_ci, or fix utf8mb4_german2_ci instead of introducing utf8mb4_de_pb_0900_ai_ci?
         – Because that might break existing applications using the old collations. (The most serious issue for large databases: indexes would have to be rebuilt.) Policy: Collations don't change!
       ● Have a simpler name scheme?
         – Because we prepare for
           ● More languages
           ● New Unicode versions (Unicode 10.0.0 is expected in 2018)
         – ISO-639-1/ISO-639-2 language codes are well defined
  • 18. How to migrate?
       ● When migrating from 5.7 tables:
         – Just convert the table: ALTER TABLE foo CONVERT TO CHARACTER SET utf8mb4;
       ● This will change the default character set of the table (so that future new columns get utf8mb4) and the character set of all applicable columns.
       ● In principle, all character data in MySQL may be converted to utf8mb4 without loss of data.
       That was easy... is that all there is to it?
  • 19. … not quite … column by column
       ● If you have more complex tables with different character sets:
         – Change the default character set of the table: ALTER TABLE foo DEFAULT CHARACTER SET utf8mb4;
         – Modify all relevant columns: ALTER TABLE foo MODIFY bar VARCHAR(100) CHARACTER SET utf8mb4;
       Generally we recommend doing it column by column.
         – ALTER TABLE … CONVERT … will e.g. change TEXT to MEDIUMTEXT when you convert from latin1 to utf8mb4, and that won't necessarily be what you want.
  • 20. … not quite … the schema too
       ● A schema (aka database) in MySQL has a default character set, which will be the default character set of new tables in the schema
         – mysql> show create schema bar;
           +----------+----------------------------------------------------------------+
           | Database | Create Database                                                |
           +----------+----------------------------------------------------------------+
           | bar      | CREATE DATABASE `bar` /*!40100 DEFAULT CHARACTER SET latin1 */ |
           +----------+----------------------------------------------------------------+
           1 row in set (0.00 sec)
       ● Change the default character set of the schema (database): ALTER SCHEMA bar DEFAULT CHARACTER SET utf8mb4;
  • 21. … not quite … collation differences
       Collations are not equal, so converting from one collation to another may break UNIQUE constraints (e.g. PRIMARY KEY).
       ● Default collation:
         – latin1_swedish_ci vs. utf8mb4_0900_ai_ci. E.g. 'o'='ö' is false in the first, but true in the other.
         – Possible solution: Stick to Swedish or another suitable collation depending on your application: ALTER TABLE foo CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_sv_0900_ai_ci;
         – Generally, if you don't care about case insensitivity (just got it by default), utf8mb4_0900_as_cs should be safe.
       ● There's a huge number of possibilities depending on your data and the collations used, partly because pre-MySQL 8.0 collations were not complete (and in some cases not correct).
  • 22. … not quite … index and key issues
       ● If you change the collation of a column, indexes on that column will be regenerated.
         – This takes time for large data, and the table is locked during that time.
         – And the conversion may fail due to changed space consumption.
       ● Max key length is 3072 bytes¹, which implies that the max length of a utf8mb4 varchar column which is also a key is 768 characters (worst case scenario: 4 bytes per character).
         – mysql> create table foo (v varchar(1000) character set latin1 primary key);
           Query OK, 0 rows affected (0.01 sec)
           mysql> alter table foo modify v varchar(1000) character set utf8mb4;
           ERROR 1071 (42000): Specified key was too long; max key length is 3072 bytes
       ¹ For default InnoDB row format and default innodb_page_size in MySQL 8.0. See the documentation for details.
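The 768-character figure is plain arithmetic: 3072 bytes divided by the worst case of 4 bytes per character. A tiny Python sketch of that calculation follows; `max_key_chars` and `fits_in_key` are hypothetical helper names, and the 3072-byte default is the limit quoted on the slide, which may differ for other row formats or page sizes.

```python
def max_key_chars(max_key_bytes: int = 3072, max_bytes_per_char: int = 4) -> int:
    """Longest indexable column, in characters, for a given key byte limit."""
    return max_key_bytes // max_bytes_per_char


def fits_in_key(declared_chars: int, max_bytes_per_char: int = 4,
                max_key_bytes: int = 3072) -> bool:
    """True if a VARCHAR(declared_chars) key stays within the byte limit."""
    return declared_chars * max_bytes_per_char <= max_key_bytes


if __name__ == "__main__":
    print("utf8mb4 limit:", max_key_chars(), "characters")
    # The slide's example: VARCHAR(1000) as a key, before and after conversion.
    print("varchar(1000) latin1 :", fits_in_key(1000, max_bytes_per_char=1))
    print("varchar(1000) utf8mb4:", fits_in_key(1000, max_bytes_per_char=4))
```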
  • 23. Upgrade example
       mysql> show create table cities;
       +--------+----------------------
       | Table  | Create Table
       +--------+----------------------
       | cities | CREATE TABLE `cities` (
         `name` varchar(1024) NOT NULL,
         `population` int(11) DEFAULT NULL,
         PRIMARY KEY (`name`)
       ) ENGINE=InnoDB DEFAULT CHARSET=latin1
       +--------+----------------------
       1 row in set (0.00 sec)
       mysql> select * from cities;
       +------------+------------+
       | name       | population |
       +------------+------------+
       | København  |    1246611 |
       | Orebro     |     107380 |
       | Oslo       |     666759 |
       | Stockholm  |     935619 |
       | Örebro     |     107380 |
       +------------+------------+
       5 rows in set (0.00 sec)
       mysql> alter table cities modify column name varchar(1024) charset utf8mb4;
       ERROR 1062 (23000): Duplicate entry 'Örebro' for key 'PRIMARY'
  • 24. Checking before altering the table¹
       with counted as (
         select name, count(name) over w as cnt
         from cities
         window w as (partition by convert(name using utf8mb4))
       )
       select name from counted where cnt > 1;
       +---------+
       | name    |
       +---------+
       | Orebro  |
       | Örebro  |
       +---------+
       2 rows in set (0.00 sec)
       ¹ Only MySQL 8.0, not 5.7 or earlier!
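On servers where the window-function query is not available, a similar pre-check could be done on exported key values outside the database. The Python sketch below is only an approximation of utf8mb4_0900_ai_ci: it folds case and drops combining marks instead of applying the real collation, so it may miss or over-report collisions. The names `ai_ci_key` and `colliding_values` are hypothetical.

```python
import unicodedata
from collections import defaultdict


def ai_ci_key(text: str) -> str:
    """Approximate an accent- and case-insensitive comparison key."""
    decomposed = unicodedata.normalize("NFD", text)
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    return base.casefold()


def colliding_values(values):
    """Group values that would compare equal under the approximate key."""
    groups = defaultdict(list)
    for value in values:
        groups[ai_ci_key(value)].append(value)
    return [group for group in groups.values() if len(group) > 1]


if __name__ == "__main__":
    # Key values taken from the cities table in the upgrade example.
    cities = ["København", "Orebro", "Oslo", "Stockholm", "Örebro"]
    print(colliding_values(cities))
```

Any group reported here is a candidate for a duplicate-key error during conversion and is worth checking against the real collation before running ALTER TABLE.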
  • 25. ⚠ 文字化け (Mojibake) … or what you see is not what you get...
       mysql> create table foo(v varchar(10) character set latin1);
       mysql> insert into foo values('å');
       mysql> set names latin1;
       mysql> insert into foo values('å');
       mysql> set names utf8mb4;
       mysql> select * from foo;
       +------+
       | v    |
       +------+
       | å    |
       | Ã¥   |
       +------+
       2 rows in set (0.00 sec)
       mysql> select hex(v) from foo;
       +--------+
       | hex(v) |
       +--------+
       | E5     |
       | C3A5   |
       +--------+
       2 rows in set (0.00 sec)
  • 26. Fixing æ–‡å—化㠑
       mysql> select v from foo;
       +-------------------------------+
       | v                             |
       +-------------------------------+
       | æ–‡å—化㠑                    |
       +-------------------------------+
       1 row in set (0.01 sec)
       mysql> alter table foo modify column v varchar(128) charset binary;
       Query OK, 1 row affected (0.14 sec)
       Records: 1 Duplicates: 0 Warnings: 0
       mysql> alter table foo modify column v varchar(128) charset utf8mb4;
       Query OK, 1 row affected (0.14 sec)
       Records: 1 Duplicates: 0 Warnings: 0
       mysql> select v from foo;
       +--------------+
       | v            |
       +--------------+
       | 文字化け     |
       +--------------+
       1 row in set (0.00 sec)
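The mechanism behind mojibake, and behind the binary round trip used to fix it, can be reproduced outside the database. In this Python sketch the UTF-8 bytes of a string are misread as a single-byte character set and then reinterpreted correctly. Python's `latin-1` codec (ISO-8859-1) stands in for MySQL's latin1, which is really cp1252, so the garbled text can differ for bytes in the 0x80-0x9F range; `make_mojibake` and `repair_mojibake` are hypothetical names.

```python
def make_mojibake(text: str) -> str:
    """Misread the UTF-8 bytes of text as if they were Latin-1."""
    return text.encode("utf-8").decode("latin-1")


def repair_mojibake(garbled: str) -> str:
    """Undo make_mojibake: recover the raw bytes, then decode them as UTF-8.

    This mirrors the idea of the two ALTER TABLE steps above
    (to binary, then to utf8mb4): keep the bytes, change the label.
    """
    return garbled.encode("latin-1").decode("utf-8")


if __name__ == "__main__":
    garbled = make_mojibake("å")
    print(repr(garbled), "->", repr(repair_mojibake(garbled)))
```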
  • 27. Space consumption
       ● utf8mb4 uses
         – 1 byte for ASCII characters (0x00-0x7F),
         – 2 bytes for most alphabets/abjads (0x80-0x7FF),
         – 3 bytes for Indic scripts, Hangul, Kana, the most used CJK Ideographs (0x800-0xFFFF),
         – 4 bytes for the rest: archaic scripts, emojis, rarely used CJK extensions etc. (0x10000-)
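The ranges above translate directly into a per-character byte count. The Python sketch below applies them to estimate how many bytes a string occupies in utf8mb4; the helper names `utf8mb4_bytes` and `storage_bytes` are hypothetical, and any per-row or per-column storage overhead is ignored.

```python
def utf8mb4_bytes(char: str) -> int:
    """Bytes needed for one character, using the ranges from the slide."""
    code_point = ord(char)
    if code_point <= 0x7F:
        return 1  # ASCII
    if code_point <= 0x7FF:
        return 2  # most alphabets/abjads
    if code_point <= 0xFFFF:
        return 3  # rest of the BMP
    return 4      # supplementary planes


def storage_bytes(text: str) -> int:
    """Total bytes of the character data (no storage overhead included)."""
    return sum(utf8mb4_bytes(c) for c in text)


if __name__ == "__main__":
    for sample in ("Stockholm", "Örebro", "文字化け", "\U0001F574"):
        print(f"{sample!r}: {len(sample)} characters, {storage_bytes(sample)} bytes")
```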
  • 28. Speed issues
       ● Operations on multibyte character sets are inherently slower than on single-byte character sets (e.g. utf8mb4 vs. latin1)
       ● We have done a lot of code improvements.
         – New code for the new utf8mb4 collations
         – New collations are NO PAD (which gives faster algorithms)
         – But expect a performance degradation in the order of 10-20% for sorting when you migrate from e.g. latin1 to utf8mb4, depending on your data of course.
       ● Some collations are inherently slower than others (e.g. utf8mb4_ja_0900_as_cs_ks vs. utf8mb4_0900_ai_ci)
  • 29. Truly usable for global purposes.....
  • 30. Q&A
       ● Check out my blogs at http://mysqlserverteam.com/author/bernt/
       ● The 8.0 documentation (if everything else fails … 😠) https://dev.mysql.com/doc/refman/8.0/en/charset.html
       ● The Unicode documents (for those truly interested … 😇) http://unicode.org/