Character Encoding - MySQL DevRoom - FOSDEM 2015

Character encoding
Breaking and unbreaking your data
Maciej Dobrzanski
maciek@psce.com | @mushupl
Brussels, 1 Feb 2015
01.02.2015 Follow us on Twitter @dbasquare www.psce.com

Character Encoding
• Binary representation of glyphs
• Each character can be represented by 1 or more bytes
• Popular schemes
• ASCII
• Unicode
• UTF-8, UTF-16, UTF-32
• Language specific character sets
• US (Latin US)
• Europe (Latin 1, Latin 2)
• Asia (EUC-KR, GB18030)
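A minimal Python sketch of the "1 or more bytes" point, using only the standard library codecs. The sample characters and the helper name `encoded_length` are illustrative choices, not part of the talk.

```python
# Byte length of the same characters under different encodings.
samples = ["A", "é", "ł", "한", "😀"]
encodings = ["ascii", "latin-1", "iso8859-2", "euc-kr", "utf-8", "utf-16-le", "utf-32-le"]

def encoded_length(char, encoding):
    """Return how many bytes `char` needs in `encoding`, or None if it cannot be represented."""
    try:
        return len(char.encode(encoding))
    except UnicodeEncodeError:
        return None

for char in samples:
    sizes = {enc: encoded_length(char, enc) for enc in encodings}
    print(repr(char), sizes)
```

A `None` in the output means the character does not exist in that character set at all, which is where data loss starts.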

Character Encoding
• Character set defines the visual interpretation of binary information
• One glyph can be associated with several numeric codes
• One numeric code may be used to represent several different glyphs
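Both statements can be illustrated in a few lines of Python. The byte 0xB1 and the glyph "é" are arbitrary examples picked for this sketch.

```python
# One numeric code, several glyphs: the same byte read under two character sets.
raw = bytes([0xB1])
as_latin1 = raw.decode("latin-1")    # Latin 1 (ISO-8859-1): plus-minus sign
as_latin2 = raw.decode("iso8859-2")  # Latin 2 (ISO-8859-2): a with ogonek
print(as_latin1, as_latin2)

# One glyph, several numeric codes: the same character written in three encodings.
glyph = "é"
codes = {enc: glyph.encode(enc).hex() for enc in ("latin-1", "utf-8", "utf-16-be")}
print(codes)
```

The bytes never change on their own; only the character set used to read them decides which glyph appears.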

Please state the nature of the emergency
• Application configuration
• Database configuration
• Table/column definitions

Problem #1: We are all born Swedish
• MySQL uses latin1 by default
• MySQL 5.7 too
• Is anyone actually aware of that?
• Why Swedish?
• latin1_swedish_ci is the default collation

Problem #1
• Let’s build an application
mysql> SELECT @@global.character_set_server, @@session.character_set_client;
+-------------------------------+--------------------------------+
| @@global.character_set_server | @@session.character_set_client |
+-------------------------------+--------------------------------+
| latin1 | latin1 |
+-------------------------------+--------------------------------+
1 row in set (0.00 sec)
mysql> CREATE SCHEMA fosdem;
Query OK, 1 row affected (0.00 sec)
mysql> USE fosdem;
mysql> CREATE TABLE locations (city VARCHAR(30) NOT NULL);
Query OK, 0 rows affected (0.15 sec)
mysql> SHOW CREATE TABLE locations\G
*************************** 1. row ***************************
Table: locations
Create Table: CREATE TABLE `locations` (
`city` varchar(30) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1

Problem #1
• Let’s fix this
• Or can we ignore it?
• Ruby may not like it
# grep character-set-server /etc/mysql/my.cnf
character-set-server = utf8
mysql> SELECT @@global.character_set_server, @@session.character_set_client;
+-------------------------------+--------------------------------+
| @@global.character_set_server | @@session.character_set_client |
+-------------------------------+--------------------------------+
| utf8 | utf8 |
+-------------------------------+--------------------------------+
...we are fixing our tables here...
mysql> SHOW CREATE TABLE locations\G
*************************** 1. row ***************************
Table: locations
Create Table: CREATE TABLE `locations` (
`city` varchar(30) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8

Problem #1: The good news
• It’s usually fixable

Problem #2: Settings, defaults, inheritance
• Where do you set character sets in MySQL?
• Session settings
• character_set_server
• character_set_client
• character_set_connection
• character_set_database
• character_set_results
• Schema level defaults
• Table level defaults
• Column charsets

Problem #2
• Having fixed our problem #1, we continue to develop our application
mysql> SELECT @@session.character_set_server, @@session.character_set_client;
+--------------------------------+--------------------------------+
| @@session.character_set_server | @@session.character_set_client |
+--------------------------------+--------------------------------+
| utf8 | utf8 |
+--------------------------------+--------------------------------+
mysql> USE fosdem;
mysql> CREATE TABLE people (first_name VARCHAR(30) NOT NULL, last_name VARCHAR(30) NOT NULL);

Problem #2
• Why is the table character set latin1?
mysql> SELECT @@session.character_set_server, @@session.character_set_client;
+--------------------------------+--------------------------------+
| @@session.character_set_server | @@session.character_set_client |
+--------------------------------+--------------------------------+
| utf8 | utf8 |
+--------------------------------+--------------------------------+
mysql> USE fosdem;
mysql> SHOW CREATE TABLE people\G
*************************** 1. row ***************************
Table: people
Create Table: CREATE TABLE `people` (
`first_name` varchar(30) NOT NULL,
`last_name` varchar(30) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1

Problem #2
• What’s all this, then?
mysql> SHOW SESSION VARIABLES LIKE 'character_set_%';
+--------------------------+----------------------------+
| Variable_name | Value |
+--------------------------+----------------------------+
| character_set_client | utf8 |
| character_set_connection | utf8 |
| character_set_database | latin1 |
| character_set_filesystem | binary |
| character_set_results | utf8 |
| character_set_server | utf8 |
| character_set_system | utf8 |
| character_sets_dir | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
mysql> SHOW CREATE DATABASE fosdem\G
*************************** 1. row ***************************
Database: fosdem
Create Database: CREATE DATABASE `fosdem` /*!40100 DEFAULT CHARACTER SET latin1 */
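A check like the one above is easy to script. This is only a sketch: `find_charset_mismatches` is a hypothetical helper, and it assumes the rows of SHOW SESSION VARIABLES have already been fetched into a plain dict by whatever client library the application uses.

```python
def find_charset_mismatches(variables, expected="utf8"):
    """Return the character_set_* variables whose value differs from `expected`.

    character_set_filesystem (binary) and character_set_system (always utf8)
    are skipped because they are not meant to follow the application charset.
    """
    ignored = {"character_set_filesystem", "character_set_system"}
    return {
        name: value
        for name, value in variables.items()
        if name.startswith("character_set_") and name not in ignored and value != expected
    }

# Values copied from the session shown on the slide.
session_variables = {
    "character_set_client": "utf8",
    "character_set_connection": "utf8",
    "character_set_database": "latin1",
    "character_set_filesystem": "binary",
    "character_set_results": "utf8",
    "character_set_server": "utf8",
    "character_set_system": "utf8",
    "character_sets_dir": "/usr/share/mysql/charsets/",
}
print(find_charset_mismatches(session_variables))
```

For the session above it reports only character_set_database, which is the schema default that was left behind when the server setting changed.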

Problem #2: The bad news
• It may not be enough to configure the server correctly
• A mismatch between client and server can permanently break data
• Implicit conversion inside MySQL server
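What the implicit conversion does to the bytes can be simulated outside the database. The sketch below assumes Python's `latin-1` codec is a close enough stand-in for MySQL's latin1 (which is really based on cp1252), and the city names are made-up sample data.

```python
original = "Kraków"
wire_bytes = original.encode("utf-8")        # the application really sends UTF-8

# Case 1: the connection is declared latin1, the column is utf8.
# The server reads every byte as one latin1 character and converts it to utf8.
misread = wire_bytes.decode("latin-1")
stored = misread.encode("utf-8")
print(stored.decode("utf-8"))                # what a correctly configured client reads back

# This damage is consistent, so it can be undone by reversing the steps.
repaired = stored.decode("utf-8").encode("latin-1").decode("utf-8")
print(repaired)

# Case 2: the connection is declared correctly, but the column is latin1.
# Characters with no latin1 equivalent are replaced, and the original is gone.
lossy = "Łódź".encode("latin-1", errors="replace")
print(lossy)
```

Case 1 is the double encoding that the later repair options target; case 2 is the kind of damage no conversion can bring back.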

• Where do you set character sets in MySQL?
• Session settings
• character_set_server
• character_set_client
• character_set_connection
• character_set_database
• character_set_results
• Schema level defaults – affect new tables
• Table level defaults – affect new columns
• Column charsets
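The inheritance chain can be modelled in a few lines. This is a toy model for illustration only, not how the server is implemented; the class and function names are invented for the sketch.

```python
from dataclasses import dataclass, field

@dataclass
class Schema:
    default_charset: str

@dataclass
class Table:
    default_charset: str
    columns: dict = field(default_factory=dict)

    def add_column(self, name, charset=None):
        # A column without an explicit charset copies the table default at creation time.
        self.columns[name] = charset or self.default_charset

    def alter_default_charset(self, charset):
        # Mirrors ALTER TABLE ... DEFAULT CHARSET: existing columns are left untouched.
        self.default_charset = charset

def create_schema(server_charset, charset=None):
    # A schema without an explicit charset copies character_set_server.
    return Schema(charset or server_charset)

def create_table(schema, charset=None):
    # A table without an explicit charset copies the schema default.
    return Table(charset or schema.default_charset)

schema = create_schema(server_charset="latin1")
test = create_table(schema)
test.add_column("a")
test.alter_default_charset("utf8")
test.add_column("b")
print(schema.default_charset, test.columns)
```

Each level copies its default once, at creation time, so changing a setting higher up never reaches objects that already exist.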

master [localhost] {msandbox} ((none)) > SELECT @@global.character_set_server, @@session.character_set_client;
+-------------------------------+--------------------------------+
| @@global.character_set_server | @@session.character_set_client |
+-------------------------------+--------------------------------+
| latin1 | utf8 |
+-------------------------------+--------------------------------+
master [localhost] {msandbox} ((none)) > CREATE SCHEMA fosdem\G
Query OK, 1 row affected (0.00 sec)
master [localhost] {msandbox} ((none)) > SHOW CREATE SCHEMA fosdem\G
*************************** 1. row ***************************
Database: fosdem
Create Database: CREATE DATABASE `fosdem` /*!40100 DEFAULT CHARACTER SET latin1 */

master [localhost] {msandbox} ((none)) > USE fosdem;
Database changed
master [localhost] {msandbox} (fosdem) > CREATE TABLE test (a VARCHAR(300), INDEX (a));
master [localhost] {msandbox} (fosdem) > SHOW CREATE TABLE test\G
*************************** 1. row ***************************
Table: test
Create Table: CREATE TABLE `test` (
`a` varchar(300) DEFAULT NULL,
KEY `a` (`a`)

master [localhost] {msandbox} (fosdem) > ALTER TABLE test DEFAULT CHARSET = utf8;
Records: 0 Duplicates: 0 Warnings: 0
*************************** 1. row ***************************
Table: test
`a` varchar(300) CHARACTER SET latin1 DEFAULT NULL,
KEY `a` (`a`)

master [localhost] {msandbox} (fosdem) > ALTER TABLE test ADD b VARCHAR(10);
*************************** 1. row ***************************
Table: test
`a` varchar(300) CHARACTER SET latin1 DEFAULT NULL,
`b` varchar(10) DEFAULT NULL,
KEY `a` (`a`)

I f**ckd up. What do I do?
• Let’s start with what you shouldn’t do
• Keep calm and don’t start by changing something
• Analyze the situation
• Why did the problem occur in the first place?
• Reassess the damage
• Is it consistent?
• Are all rows broken in the same way?
• Are some rows bad, but others are okay?
• Are all bad in several different ways?
• Is it actually repairable?
• No character mapping occurred during writes (e.g. unicode over latin1/latin1)
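The "are all rows broken in the same way?" question can be explored on a sample of exported values. The heuristic below is only a sketch under stated assumptions: the values are already decoded as a utf8 client would see them, the damage is the classic UTF-8-read-as-latin1 double encoding, and the function name and labels are invented here.

```python
def classify(value):
    """Guess whether `value` is double-encoded, returning (label, suggested_value)."""
    try:
        # Reverse the suspected damage: turn the characters back into latin1 bytes
        # and check whether those bytes form valid UTF-8.
        candidate = value.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text contains non-latin1 characters or the bytes are not
        # valid UTF-8, so this row does not look double-encoded.
        return "looks-intact", value
    if candidate == value:
        return "ascii-only", value       # pure ASCII reads the same either way
    return "double-encoded", candidate

for sample in ["Brussels", "Kraków", "KrakÃ³w", "Łódź"]:
    print(sample, classify(sample))
```

A mix of labels within one column means the data is not consistently broken and a blanket conversion would damage the intact rows. Rows that already lost characters to "?" pass as ASCII and cannot be detected this way.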

I f**ckd up. What else I shouldn’t do, then?
• Do not rush things as you may easily go from bad to worse
• Do not start fixing this on a replication slave
• You can’t fix this by fixing tables one by one on a live database
• Unless you really have everything in one table
• Do not use: ALTER TABLE … DEFAULT CHARSET = …
• It only changes the default character set for new columns
• Do not use: ALTER TABLE … CONVERT TO CHARACTER SET …
• It’s not for fixing broken encoding
• Do not use: ALTER TABLE … MODIFY col_name … CHARACTER SET …

I f**ckd up. So how do I fix it?
• What needs to be fixed?
• Schema default character set
• ALTER SCHEMA fosdem DEFAULT CHARSET = utf8
• Tables with text columns: CHAR, VARCHAR, TEXT, TINYTEXT, LONGTEXT
• What about ENUM?
• Use INFORMATION_SCHEMA to grab a list
• What about other tables?
• They too (eventually), but it’s not critical
SELECT CONCAT(c.table_schema, '.', c.table_name) AS candidate_table
FROM information_schema.columns c
WHERE c.table_schema = 'fosdem'
AND c.column_type REGEXP '^(.*CHAR|.*TEXT|ENUM)(\\(.+\\))?$'
GROUP BY candidate_table;

• Option 1 – requires downtime
• Dump and restore
• Dump the data preserving the bad configuration and drop the old database
bash# mysqldump -u root -p --skip-set-charset --default-character-set=latin1 fosdem > fosdem.sql
mysql> DROP SCHEMA fosdem;
• Correct table definitions in the dump file
• Edit DEFAULT CHARSET in all CREATE TABLE statements
• Create the database again and import the data back
mysql> CREATE SCHEMA fosdem DEFAULT CHARSET utf8;
bash# mysql -u root -p --default-character-set=utf8 fosdem < fosdem.sql

• Option 2 – requires downtime
• Perform a two step conversion with ALTER TABLE
• Original encoding -> VARBINARY/BLOB -> Target encoding
• Conversion from/to BINARY/BLOB removes character set context
• How?
• Stop applications
• On each table, for each text column perform:
ALTER TABLE tbl MODIFY col_name VARBINARY(255);
ALTER TABLE tbl MODIFY col_name VARCHAR(255) CHARACTER SET utf8;
• You may specify multiple columns per ALTER TABLE
• Fix the problems (application and/or db configs)
• Restart applications
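With many tables it helps to generate the two statements rather than type them. The helper below is a hypothetical sketch: it assumes every column is a latin1 VARCHAR whose length is passed in, and it does not restate attributes such as NOT NULL or DEFAULT, which MODIFY would drop unless they are added to the generated definitions.

```python
def build_two_step_alters(table, columns, target_charset="utf8"):
    """Build the VARCHAR -> VARBINARY -> VARCHAR statements for one table.

    `columns` is a list of (column_name, length) pairs; all columns of a table
    are combined into a single ALTER TABLE per step.
    """
    to_binary = ", ".join(
        f"MODIFY `{name}` VARBINARY({length})" for name, length in columns
    )
    to_text = ", ".join(
        f"MODIFY `{name}` VARCHAR({length}) CHARACTER SET {target_charset}"
        for name, length in columns
    )
    return [
        f"ALTER TABLE `{table}` {to_binary};",
        f"ALTER TABLE `{table}` {to_text};",
    ]

for statement in build_two_step_alters("people", [("first_name", 30), ("last_name", 30)]):
    print(statement)
```

The generated statements should be reviewed against SHOW CREATE TABLE before they are run, since the real column definitions may carry more than a type and a length.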

• Option 3 – online character set fix; no downtime*
• Thanks to our plugin for pt-online-schema-change
• and a tiny patch for pt-online-schema-change that goes with the plugin 
• How?
• Start pt-online-schema-change on all tables – one by one
• Do not rotate tables (--no-swap-tables) or drop pt-osc triggers
• Wait until all tables have been converted
• Stop applications
• Fix the problems (application and/or db configs)
• Rotate tables – takes just 1 minute
• Restart applications
• Et voilà

GOTCHAs!
• Data space requirements may change during conversion
• latin1 uses 1 byte per character, utf8 has to assume up to 3 bytes
• VARCHAR/TEXT fit up to 64KB – it won’t fit 65536 multi-byte characters
• Key length limit is 767 bytes
• Data type and/or index length changes may be required
• Test and plan this ahead
• There may be more problems than you think
• Detect irrecoverable problems with a simple stored function
DELIMITER ;;
-- Returns 1 when the latin1 bytes of value_before equal the bytes of value_after
CREATE FUNCTION `cnv_test_conversion` (`value_before` LONGTEXT, `value_after` LONGTEXT) RETURNS tinyint(1)
DETERMINISTIC
BEGIN
RETURN (IFNULL(CONVERT(CONVERT(`value_before` USING latin1) USING binary), '') =
IFNULL(CONVERT(`value_after` USING binary), ''));
END;;
DELIMITER ;

GOTCHAs!
master [localhost] {msandbox} (fosdem) > ALTER TABLE test MODIFY a VARCHAR(300) CHARACTER SET utf8;
Query OK, 0 rows affected, 1 warning (1.23 sec)
master [localhost] {msandbox} (fosdem) > SHOW WARNINGS\G
*************************** 1. row ***************************
Level: Warning
Code: 1071
Message: Specified key was too long; max key length is 767 bytes
*************************** 1. row ***************************
Table: test
`a` varchar(300) DEFAULT NULL,
`b` varchar(10) DEFAULT NULL,
KEY `a` (`a`(255))
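The 255-character prefix in the output above follows directly from the byte limits. A short worked calculation, using the 767-byte limit and the per-character sizes quoted on the previous slide (the function names are just for this sketch):

```python
MAX_KEY_BYTES = 767                       # key length limit quoted in the warning
BYTES_PER_CHAR = {"latin1": 1, "utf8": 3}  # worst case the server has to reserve

def index_bytes(length, charset):
    """Worst-case size in bytes of an index on a VARCHAR(length) column."""
    return length * BYTES_PER_CHAR[charset]

def max_indexable_chars(charset):
    """Longest prefix, in characters, that still fits within the key limit."""
    return MAX_KEY_BYTES // BYTES_PER_CHAR[charset]

print(index_bytes(300, "latin1"))      # fits: below the limit
print(index_bytes(300, "utf8"))        # does not fit: above the limit
print(max_indexable_chars("utf8"))     # the prefix length the server fell back to
```

The same arithmetic can be run over every indexed text column before the conversion to find indexes that will be silently shortened.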

How to do it right?
• Set character-set-server during initial configuration
• When creating new schemas, always specify the desired charset
• CREATE SCHEMA fosdem DEFAULT CHARSET = utf8
• ALTER SCHEMA fosdem DEFAULT CHARSET = utf8
• When creating new tables, also explicitly specify the charset
• CREATE TABLE people (…) DEFAULT CHARSET = utf8
• And don’t forget to configure applications too
• You can try to force charset on the clients
• init-connect = "SET NAMES utf8"
• It might also break applications that don’t want to talk to MySQL using utf8

Oh, and one more thing…

• We are sharing WebScaleSQL packages with the MySQL Community!
• Check out http://www.psce.com/blog for details
• Follow @dbasquare to receive updates
WebScaleSQL
What is WebScaleSQL?
WebScaleSQL is a collaboration among engineers from several companies
such as Facebook, Twitter, Google or Linkedin, that face the same challenges
in deploying MySQL at scale, and seek greater performance from a database
technology tailored for their needs.
