Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
MyRocks in MariaDB
Sergei Petrunia
sergey@mariadb.com
- What is MyRocks
- MyRocks’ way into MariaDB
- Using MyRocks
3
What is MyRocks
● InnoDB’s limitations
● LSM Trees
● RocksDB + MyRocks
4
Data vs Disk

Put some data into the database

How much is written to disk?
INSERT INTO tbl1 VALUES ('foo')

amplific...
5
Amplification

Write amplification

Space amplification

Read amplification
amplification =
size of data
size of data...
6
Amplification in InnoDB
● B*-tree
● Read amplification
– Assume random data lookup
– Locate the page, read it
– Root pag...
7
Write amplification in InnoDB
● Locate the page to update
● Read; modify; write
– Write more if we caused a page split
●...
8
Space amplification in InnoDB
● Page fill factor <100%
– to allow for updates
● Compression is done per-page
– Compressi...
9
InnoDB amplification summary
● Read amplification is ok
● Write and space amplification is an issue
– Saturated IO
– SSD...
10
What is MyRocks
● InnoDB’s limitations
● LSM Trees
● RocksDB + MyRocks
11
Log-structured merge tree
● Writes go to
– MemTable
– Log (for durablility)
● When MemTable is full, it is
flushed to d...
12
Log-structured merge tree
● Writing produces
more SSTs
● SSTs are immutable
● SSTs may have
multiple versions of
data
M...
13
Reads in LSM tree
● Need to merge the
data on read
– Read amplification
suffers
● Should not have too
many SSTs.
MemTab...
14
Compaction
● Merges multiple SSTs into one
● Removes old data versions
● Reduces the number of files
● Write amplificat...
15
LSM Tree summary
● LSM architecture
– Data is stored in the log, then SST files
– Writes to SST files are sequential
– ...
16
What is MyRocks
● InnoDB’s limitations
● LSM Trees
● RocksDB + MyRocks
17
RocksDB
● “An embeddable key-value store for
fast storage environments”
● LSM architecture
● Developed at Facebook (201...
18
MyRocks
● A MySQL storage engine
● Uses RocksDB for storage
● Implements a MySQL storage engine
on top
– Secondary inde...
19
Size amplification benchmark
● Benchmark data and
chart from Facebook
● Linkbench run
● 24 hours
20
Write amplification benchmark
● Benchmark data and
chart from Facebook
● Linkbench
21
QPS
● Benchmark data and
chart from Facebook
● Linkbench
22
QPS on in-memory workload
● Benchmark data and
chart from Facebook
● Sysbench read-write,
in-memory
● MyRocks doesn’t
a...
23
QPS
● Benchmark data and
chart from Facebook
● Linkbench
24
Another write amplification test
● InnoDB
vs
● MyRocks serving 2x
data
InnoDB 2x RocksDB
0
4
8
12
Flash GC
Binlog / Rel...
25
A real efficiency test
For a certain workload we could reduce the number of MySQL
servers by 50%
– Yoshinori Matsunobu
- What is MyRocks
- MyRocks’ way into MariaDB
- Using MyRocks
27
The source of MyRocks
● MyRocks is a part of Facebook’s MySQL tree
– github.com/facebook/mysql-5.6
● No binaries or pac...
28
MyRocks in MariaDB
● MyRocks is a part of MariaDB 10.2 and MariaDB 10.3
– MyRocks itself is the same across 10.2 and 10...
- What is MyRocks
- MyRocks’ way into MariaDB
- Using MyRocks
30
Using MyRocks
● Installation
● Migration
● Tuning
● Replication
● Backup
31
MyRocks packages and binaries
Packages are available for modern distributions
● RPM-based (RedHat, CentOS)
● Deb-based ...
32
Installing MyRocks
● Install the plugin
– RPM
– DEB
– Bintar/windows:
apt install mariadb-plugin-rocksdb
MariaDB> insta...
33
Using MyRocks
● Installation
● Migration
● Tuning
● Replication
● Backup
34
Data loading – good news
● It’s a write-optimized storage engine
● The same data in MyRocks takes less space on disk
th...
35
Data loading – limitations
● Limitation: Transaction must fit in memory
mysql> ALTER TABLE big_table ENGINE=RocksDB;
ER...
36
Sorted bulk loading
MariaDB> set rocksdb_bulk_load=1;
... # something that inserts the data in PK order
MariaDB> set ro...
37
Unsorted bulk loading
MariaDB> set rocksdb_bulk_load_allow_unsorted=0;
MariaDB> set rocksdb_bulk_load=1;
... # somethin...
38
Other speedups for lading
● rocksdb_commit_in_the_middle
– Auto commit every rocksdb_bulk_load_size rows
● @@unique_che...
39
Creating indexes
● Index creation doesn’t need any special settings
mysql> ALTER TABLE myrocks_table ADD INDEX(...);
● ...
40
max_row_locks as a safety setting
● Avoid run-away memory usage and OOM killer:
MariaDB> set rocksdb_max_row_locks=1000...
41
Using MyRocks
● Installation
● Migration
● Tuning
● Replication
● Backup
42
Essential MyRocks tuning
● rocksdb_flush_log_at_trx_commit=1
– Same as innodb_flush_log_at_trx_commit=1
– Together with...
43
Indexes
● Primary Key is the clustered index (like in InnoDB)
● Secondary indexes include PK columns (like in InnoDB)
●...
44
Charsets and collations
● Index entries are compared with memcmp(), “mem-comparable”
● “Reversible collations”: binary,...
45
Charsets and collations (2)
● Using indexes on “Other” (collation) produces a warning
MariaDB> create table t3 (...
a v...
46
Using MyRocks
● Installing
● Migration
● Tuning
● Replication
● Backup
47
MyRocks only supports RBR
● MyRocks’ highest isolation level is “snapshot isolation”,
that is REPEATABLE-READ
● Stateme...
48
Parallel replication
● Facebook/MySQL-5.6 is based on MySQL 5.6
– Parallel replication for data in different databases
...
49
Parallel replication (2)
● Conservative parallel
replication works with MyRocks
– Different execution paths for
log_sla...
50
mysql.gtid_slave_pos
●
mysql.gtid_slave_pos stores slave’s position
● Use a transactional storage engine for it
– Datab...
51
Per-engine mysql.gtid_slave_pos
● The idea:
– Have mysql.gtid_slave_pos_${engine} for each
storage engine
– Slave posit...
52
Per-engine mysql.gtid_slave_pos
● Available in MariaDB 10.3
● Thanks for the patch to
– Kristian Nielsen (implementatio...
53
Replication takeaways
● Must use binlog_format=row
● Parallel replication works
(slave_parallel_mode=conservative)
● Se...
54
Using MyRocks
● Installation
● Migration
● Tuning
● Replication
● Backup
55
myrocks_hotbackup tool
● Takes a hot (non-blocking) backup of RocksDB data
mkdir /path/to/checkpoint-dir
myrocks_hotbac...
56
myrocks_hotbackup limitations
● Targets MyRocks-only installs
– Copies MyRocks data and supplementary files
– Does *not...
57
Further backup considerations
● mariabackup does not work with MyRocks
currently
● myrocks_hotbackup is similar (actual...
58
Using MyRocks
● Installation
● Migration
● Tuning
● Replication
● Backup
✔
✔
✔
✔
✔
59
Documentation
● https://mariadb.com/kb/en/library/myrocks/
● https://github.com/facebook/mysql-5.6/wiki
● https://maria...
60
Conclusions
● MyRocks is a write-optimized storage engine based
on LSM-trees
– Developed and used by Facebook
● Availab...
Thanks!
Q&A
You’ve finished this document.
Download and read it offline.
Upcoming SlideShare
What to Upload to SlideShare
Next
Upcoming SlideShare
What to Upload to SlideShare
Next
Download to read offline and view in fullscreen.

Share

M|18 How to use MyRocks with MariaDB Server

Download to read offline

M|18 How to use MyRocks with MariaDB Server

M|18 How to use MyRocks with MariaDB Server

  1. 1. MyRocks in MariaDB Sergei Petrunia sergey@mariadb.com
  2. 2. - What is MyRocks - MyRocks’ way into MariaDB - Using MyRocks
  3. 3. 3 What is MyRocks ● InnoDB’s limitations ● LSM Trees ● RocksDB + MyRocks
  4. 4. 4 Data vs Disk  Put some data into the database  How much is written to disk? INSERT INTO tbl1 VALUES ('foo')  amplification = size of data size of data on disk
  5. 5. 5 Amplification  Write amplification  Space amplification  Read amplification amplification = size of data size of data on disk foo
  6. 6. 6 Amplification in InnoDB ● B*-tree ● Read amplification – Assume random data lookup – Locate the page, read it – Root page is cached ● ~1 disk read for the leaf page – Read amplification is ok
  7. 7. 7 Write amplification in InnoDB ● Locate the page to update ● Read; modify; write – Write more if we caused a page split ● One page write per update! – page_size / sizeof(int) = 16K/8 ● Write amplification is an issue.
  8. 8. 8 Space amplification in InnoDB ● Page fill factor <100% – to allow for updates ● Compression is done per-page – Compressing bigger portions would be better – Page alignment ● Compressed to 3K ? Still on 4k page. ● => Space amplification is an issue
  9. 9. 9 InnoDB amplification summary ● Read amplification is ok ● Write and space amplification is an issue – Saturated IO – SSDs wear out faster – Need more space on SSD ● => Low storage efficiency
  10. 10. 10 What is MyRocks ● InnoDB’s limitations ● LSM Trees ● RocksDB + MyRocks
  11. 11. 11 Log-structured merge tree ● Writes go to – MemTable – Log (for durablility) ● When MemTable is full, it is flushed to disk ● SST= SortedStringTable – Is written sequentially – Allows for lookups MemTableWrite Log SST MemTable
  12. 12. 12 Log-structured merge tree ● Writing produces more SSTs ● SSTs are immutable ● SSTs may have multiple versions of data MemTableWrite Log SST MemTable SST ...
  13. 13. 13 Reads in LSM tree ● Need to merge the data on read – Read amplification suffers ● Should not have too many SSTs. MemTable Read Log SST MemTable SST ...SST
  14. 14. 14 Compaction ● Merges multiple SSTs into one ● Removes old data versions ● Reduces the number of files ● Write amplification++ SST SST . . .SST SST
  15. 15. 15 LSM Tree summary ● LSM architecture – Data is stored in the log, then SST files – Writes to SST files are sequential – Have to look in several SST files to read the data ● Efficiency – Write amplification is reduced – Space amplification is reduced – Read amplification increases
  16. 16. 16 What is MyRocks ● InnoDB’s limitations ● LSM Trees ● RocksDB + MyRocks
  17. 17. 17 RocksDB ● “An embeddable key-value store for fast storage environments” ● LSM architecture ● Developed at Facebook (2012) ● Used at Facebook and many other companies ● It’s a key-value store, not a database – C++ Library (local only) – NoSQL-like API (Get, Put, Delete, Scan) – No datatypes, tables, secondary indexes, ...
  18. 18. 18 MyRocks ● A MySQL storage engine ● Uses RocksDB for storage ● Implements a MySQL storage engine on top – Secondary indexes – Data types – SQL transactions – … ● Developed* and used by Facebook – *-- with some MariaDB involvement MyRocks RocksDB MySQL InnoDB
  19. 19. 19 Size amplification benchmark ● Benchmark data and chart from Facebook ● Linkbench run ● 24 hours
  20. 20. 20 Write amplification benchmark ● Benchmark data and chart from Facebook ● Linkbench
  21. 21. 21 QPS ● Benchmark data and chart from Facebook ● Linkbench
  22. 22. 22 QPS on in-memory workload ● Benchmark data and chart from Facebook ● Sysbench read-write, in-memory ● MyRocks doesn’t always beat InnoDB
  23. 23. 23 QPS ● Benchmark data and chart from Facebook ● Linkbench
  24. 24. 24 Another write amplification test ● InnoDB vs ● MyRocks serving 2x data InnoDB 2x RocksDB 0 4 8 12 Flash GC Binlog / Relay log Storage engine
  25. 25. 25 A real efficiency test For a certain workload we could reduce the number of MySQL servers by 50% – Yoshinori Matsunobu
  26. 26. - What is MyRocks - MyRocks’ way into MariaDB - Using MyRocks
  27. 27. 27 The source of MyRocks ● MyRocks is a part of Facebook’s MySQL tree – github.com/facebook/mysql-5.6 ● No binaries or packages ● No releases ● Extra features ● “In-house” experience
  28. 28. 28 MyRocks in MariaDB ● MyRocks is a part of MariaDB 10.2 and MariaDB 10.3 – MyRocks itself is the same across 10.2 and 10.3 – New related feature in 10.3: “Per-engine gtid_slave_pos” ● MyRocks is a loadable plugin with its own Maturity level. ● Releases – April, 2017: MyRocks is Alpha (in MariaDB 10.2.5 RC) – January, 2018: MyRocks is Beta – February, 2018: MyRocks is RC ● Release pending
  29. 29. - What is MyRocks - MyRocks’ way into MariaDB - Using MyRocks
  30. 30. 30 Using MyRocks ● Installation ● Migration ● Tuning ● Replication ● Backup
  31. 31. 31 MyRocks packages and binaries Packages are available for modern distributions ● RPM-based (RedHat, CentOS) ● Deb-based (Debian, Ubuntu) ● Binary tarball – mariadb-10.2.12-linux-systemd-x86_64.tar.gz ● Windows ● ...
  32. 32. 32 Installing MyRocks ● Install the plugin – RPM – DEB – Bintar/windows: apt install mariadb-plugin-rocksdb MariaDB> install soname 'ha_rocksdb'; MariaDB> create table t1( … ) engine=rocksdb; yum install MariaDB-rocksdb-engine ● Restart the server ● Check ● Use MariaDB> SHOW PLUGINS;
  33. 33. 33 Using MyRocks ● Installation ● Migration ● Tuning ● Replication ● Backup
  34. 34. 34 Data loading – good news ● It’s a write-optimized storage engine ● The same data in MyRocks takes less space on disk than on InnoDB – 1.5x, 3x, or more ● Can load the data faster than InnoDB ● But…
  35. 35. 35 Data loading – limitations ● Limitation: Transaction must fit in memory mysql> ALTER TABLE big_table ENGINE=RocksDB; ERROR 2013 (HY000): Lost connection to MySQL server during query ● Uses a lot of memory, then killed by OOM killer. ● Need to use special settings for loading data – https://mariadb.com/kb/en/library/loading-data-into-myrocks/ – https://github.com/facebook/mysql-5.6/wiki/data-loading
  36. 36. 36 Sorted bulk loading MariaDB> set rocksdb_bulk_load=1; ... # something that inserts the data in PK order MariaDB> set rocksdb_bulk_load=0; Bypasses ● Transactional locking ● Seeing the data you are inserting ● Compactions ● ...
  37. 37. 37 Unsorted bulk loading MariaDB> set rocksdb_bulk_load_allow_unsorted=0; MariaDB> set rocksdb_bulk_load=1; ... # something that inserts data in any order MariaDB> set rocksdb_bulk_load=0; ● Same as previous but doesn’t require PK order
  38. 38. 38 Other speedups for lading ● rocksdb_commit_in_the_middle – Auto commit every rocksdb_bulk_load_size rows ● @@unique_checks – Setting to FALSE will bypass unique checks
  39. 39. 39 Creating indexes ● Index creation doesn’t need any special settings mysql> ALTER TABLE myrocks_table ADD INDEX(...); ● Compression is enabled by default – Lots of settings to tune it further
  40. 40. 40 max_row_locks as a safety setting ● Avoid run-away memory usage and OOM killer: MariaDB> set rocksdb_max_row_locks=10000; MariaDB> alter table t1 engine=rocksdb; ERROR 4067 (HY000): Status error 10 received from RocksDB: Operation aborted: Failed to acquire lock due to max_num_locks limit ● This setting is useful after migration, too.
  41. 41. 41 Using MyRocks ● Installation ● Migration ● Tuning ● Replication ● Backup
  42. 42. 42 Essential MyRocks tuning ● rocksdb_flush_log_at_trx_commit=1 – Same as innodb_flush_log_at_trx_commit=1 – Together with sync_binlog=1 makes the master crash- safe ● rocksdb_block_cache_size=... – Similar to innodb_buffer_pool_size – 500Mb by default (will use up to that amount * 1.5?) – Set higher if have more RAM ● Safety: rocksdb_max_row_locks=...
  43. 43. 43 Indexes ● Primary Key is the clustered index (like in InnoDB) ● Secondary indexes include PK columns (like in InnoDB) ● Non-unique secondary indexes are cheap – Index maintenance is read-free ● Unique secondary indexes are more expensive – Maintenance is not read-free – Reads can suffer from read amplification – Can use @@unique_check=false, at your own risk
  44. 44. 44 Charsets and collations ● Index entries are compared with memcmp(), “mem-comparable” ● “Reversible collations”: binary, latin1_bin, utf8_bin – Can convert values to/from their mem-comparable form ● “Restorable collations” – 1-byte characters, one weght per character – e.g. latin1_general_ci, latin1_swedish_ci, ... – Index stores mem-comparable form + restore data ● Other collations (e.g. utf8_general_ci) – Index-only scans are not supported – Table row stores the original value ‘A’ a 1 ‘a’ a 0
  45. 45. 45 Charsets and collations (2) ● Using indexes on “Other” (collation) produces a warning MariaDB> create table t3 (... a varchar(100) collate utf8_general_ci, key(a)) engine=rocksdb; Query OK, 0 rows affected, 1 warning (0.20 sec) MariaDB> show warningsG *************************** 1. row *************************** Level: Warning Code: 1815 Message: Internal error: Indexed column test.t3.a uses a collation that does not allow index-only access in secondary key and has reduced disk space efficiency in primary key.
  46. 46. 46 Using MyRocks ● Installing ● Migration ● Tuning ● Replication ● Backup
  47. 47. 47 MyRocks only supports RBR ● MyRocks’ highest isolation level is “snapshot isolation”, that is REPEATABLE-READ ● Statement-Based Replication – Slave runs the statements sequentially (~ serializable) – InnoDB uses “Gap Locks” to provide isolation req’d by SBR – MyRocks doesn’t support Gap Locks ● Row-Based Replication must be used – binlog_format=row
  48. 48. 48 Parallel replication ● Facebook/MySQL-5.6 is based on MySQL 5.6 – Parallel replication for data in different databases ● MariaDB has more advanced parallel slave (controlled by @@slave_parallel_mode) – Conservative (group-commit based) – Optimistic (apply transactions in parallel, on conflict roll back the transaction with greater seq_no) – Aggressive
  49. 49. 49 Parallel replication (2) ● Conservative parallel replication works with MyRocks – Different execution paths for log_slave_updates=ON|OFF – ON might be faster ● Optimistic and aggressive do not provide speedups over conservative.
  50. 50. 50 mysql.gtid_slave_pos ● mysql.gtid_slave_pos stores slave’s position ● Use a transactional storage engine for it – Database contents will match slave position after crash recovery – Makes the slave crash-safe. ● mysql.gtid_slave_pos uses a different engine? – Cross-engine transaction (slow).
  51. 51. 51 Per-engine mysql.gtid_slave_pos ● The idea: – Have mysql.gtid_slave_pos_${engine} for each storage engine – Slave position is the biggest position in all tables. – Transaction affecting only MyRocks will only update mysql.gtid_slave_pos_rocksdb ● Configuration: --gtid-pos-auto-engines=engine1,engine2,...
  52. 52. 52 Per-engine mysql.gtid_slave_pos ● Available in MariaDB 10.3 ● Thanks for the patch to – Kristian Nielsen (implementation) – Booking.com (request and funding) ● Don’t let your slave be 2-3x slower: – 10.3: my.cnf: gtid-pos-auto-engines=INNODB,ROCKSDB – 10.2: MariaDB> ALTER TABLE mysql.gtid_slave_pos ENGINE=RocksDB;
  53. 53. 53 Replication takeaways ● Must use binlog_format=row ● Parallel replication works (slave_parallel_mode=conservative) ● Set your mysql.gtid_slave_pos correctly – 10.2: preferred engine – 10.3: per-engine
  54. 54. 54 Using MyRocks ● Installation ● Migration ● Tuning ● Replication ● Backup
  55. 55. 55 myrocks_hotbackup tool ● Takes a hot (non-blocking) backup of RocksDB data mkdir /path/to/checkpoint-dir myrocks_hotbackup --user=... --port=... --checkpoint_dir=/path/to/checkpoint-dir | ssh "tar -xi ..." myrocks_hotbackup --move_back --datadir=/path/to/restore-dir --rocksdb_datadir=/path/to/restore-dir/#rocksdb --rocksdb_waldir= /path/to/restore-dir/#rocksdb --backup_dir=/where/tarball/was/unpacked ● Making a backup ● Starting from backup
  56. 56. 56 myrocks_hotbackup limitations ● Targets MyRocks-only installs – Copies MyRocks data and supplementary files – Does *not* work for mixed MyRocks+InnoDB setups ● Assumes no DDL is running concurrently with the backup – May get a wrong frm file ● Not very user-friendly
  57. 57. 57 Further backup considerations ● mariabackup does not work with MyRocks currently ● myrocks_hotbackup is similar (actually simpler) than xtrabackup ● Will look at making mariabackup work with MyRocks
  58. 58. 58 Using MyRocks ● Installation ● Migration ● Tuning ● Replication ● Backup ✔ ✔ ✔ ✔ ✔
  59. 59. 59 Documentation ● https://mariadb.com/kb/en/library/myrocks/ ● https://github.com/facebook/mysql-5.6/wiki ● https://mariadb.com/kb/en/library/differences- between-myrocks-variants/
  60. 60. 60 Conclusions ● MyRocks is a write-optimized storage engine based on LSM-trees – Developed and used by Facebook ● Available in MariaDB 10.2 and MariaDB 10.3 – Currently RC, Stable very soon ● It has some limitations and requires some tuning – Easy to tackle if you were here for this talk :-)
  61. 61. Thanks! Q&A
  • MGHossainHimu

    Oct. 13, 2019

M|18 How to use MyRocks with MariaDB Server

Views

Total views

2,173

On Slideshare

0

From embeds

0

Number of embeds

0

Actions

Downloads

25

Shares

0

Comments

0

Likes

1

×