Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

M|18 How to use MyRocks with MariaDB Server

414 views

Published on

M|18 How to use MyRocks with MariaDB Server

Published in: Data & Analytics
  • Be the first to comment

  • Be the first to like this

M|18 How to use MyRocks with MariaDB Server

  1. 1. MyRocks in MariaDB Sergei Petrunia sergey@mariadb.com
  2. 2. - What is MyRocks - MyRocks’ way into MariaDB - Using MyRocks
  3. 3. 3 What is MyRocks ● InnoDB’s limitations ● LSM Trees ● RocksDB + MyRocks
  4. 4. 4 Data vs Disk  Put some data into the database  How much is written to disk? INSERT INTO tbl1 VALUES ('foo')  amplification = size of data size of data on disk
  5. 5. 5 Amplification  Write amplification  Space amplification  Read amplification amplification = size of data size of data on disk foo
  6. 6. 6 Amplification in InnoDB ● B*-tree ● Read amplification – Assume random data lookup – Locate the page, read it – Root page is cached ● ~1 disk read for the leaf page – Read amplification is ok
  7. 7. 7 Write amplification in InnoDB ● Locate the page to update ● Read; modify; write – Write more if we caused a page split ● One page write per update! – page_size / sizeof(int) = 16K/8 ● Write amplification is an issue.
  8. 8. 8 Space amplification in InnoDB ● Page fill factor <100% – to allow for updates ● Compression is done per-page – Compressing bigger portions would be better – Page alignment ● Compressed to 3K ? Still on 4k page. ● => Space amplification is an issue
  9. 9. 9 InnoDB amplification summary ● Read amplification is ok ● Write and space amplification is an issue – Saturated IO – SSDs wear out faster – Need more space on SSD ● => Low storage efficiency
  10. 10. 10 What is MyRocks ● InnoDB’s limitations ● LSM Trees ● RocksDB + MyRocks
  11. 11. 11 Log-structured merge tree ● Writes go to – MemTable – Log (for durablility) ● When MemTable is full, it is flushed to disk ● SST= SortedStringTable – Is written sequentially – Allows for lookups MemTableWrite Log SST MemTable
  12. 12. 12 Log-structured merge tree ● Writing produces more SSTs ● SSTs are immutable ● SSTs may have multiple versions of data MemTableWrite Log SST MemTable SST ...
  13. 13. 13 Reads in LSM tree ● Need to merge the data on read – Read amplification suffers ● Should not have too many SSTs. MemTable Read Log SST MemTable SST ...SST
  14. 14. 14 Compaction ● Merges multiple SSTs into one ● Removes old data versions ● Reduces the number of files ● Write amplification++ SST SST . . .SST SST
  15. 15. 15 LSM Tree summary ● LSM architecture – Data is stored in the log, then SST files – Writes to SST files are sequential – Have to look in several SST files to read the data ● Efficiency – Write amplification is reduced – Space amplification is reduced – Read amplification increases
  16. 16. 16 What is MyRocks ● InnoDB’s limitations ● LSM Trees ● RocksDB + MyRocks
  17. 17. 17 RocksDB ● “An embeddable key-value store for fast storage environments” ● LSM architecture ● Developed at Facebook (2012) ● Used at Facebook and many other companies ● It’s a key-value store, not a database – C++ Library (local only) – NoSQL-like API (Get, Put, Delete, Scan) – No datatypes, tables, secondary indexes, ...
  18. 18. 18 MyRocks ● A MySQL storage engine ● Uses RocksDB for storage ● Implements a MySQL storage engine on top – Secondary indexes – Data types – SQL transactions – … ● Developed* and used by Facebook – *-- with some MariaDB involvement MyRocks RocksDB MySQL InnoDB
  19. 19. 19 Size amplification benchmark ● Benchmark data and chart from Facebook ● Linkbench run ● 24 hours
  20. 20. 20 Write amplification benchmark ● Benchmark data and chart from Facebook ● Linkbench
  21. 21. 21 QPS ● Benchmark data and chart from Facebook ● Linkbench
  22. 22. 22 QPS on in-memory workload ● Benchmark data and chart from Facebook ● Sysbench read-write, in-memory ● MyRocks doesn’t always beat InnoDB
  23. 23. 23 QPS ● Benchmark data and chart from Facebook ● Linkbench
  24. 24. 24 Another write amplification test ● InnoDB vs ● MyRocks serving 2x data InnoDB 2x RocksDB 0 4 8 12 Flash GC Binlog / Relay log Storage engine
  25. 25. 25 A real efficiency test For a certain workload we could reduce the number of MySQL servers by 50% – Yoshinori Matsunobu
  26. 26. - What is MyRocks - MyRocks’ way into MariaDB - Using MyRocks
  27. 27. 27 The source of MyRocks ● MyRocks is a part of Facebook’s MySQL tree – github.com/facebook/mysql-5.6 ● No binaries or packages ● No releases ● Extra features ● “In-house” experience
  28. 28. 28 MyRocks in MariaDB ● MyRocks is a part of MariaDB 10.2 and MariaDB 10.3 – MyRocks itself is the same across 10.2 and 10.3 – New related feature in 10.3: “Per-engine gtid_slave_pos” ● MyRocks is a loadable plugin with its own Maturity level. ● Releases – April, 2017: MyRocks is Alpha (in MariaDB 10.2.5 RC) – January, 2018: MyRocks is Beta – February, 2018: MyRocks is RC ● Release pending
  29. 29. - What is MyRocks - MyRocks’ way into MariaDB - Using MyRocks
  30. 30. 30 Using MyRocks ● Installation ● Migration ● Tuning ● Replication ● Backup
  31. 31. 31 MyRocks packages and binaries Packages are available for modern distributions ● RPM-based (RedHat, CentOS) ● Deb-based (Debian, Ubuntu) ● Binary tarball – mariadb-10.2.12-linux-systemd-x86_64.tar.gz ● Windows ● ...
  32. 32. 32 Installing MyRocks ● Install the plugin – RPM – DEB – Bintar/windows: apt install mariadb-plugin-rocksdb MariaDB> install soname 'ha_rocksdb'; MariaDB> create table t1( … ) engine=rocksdb; yum install MariaDB-rocksdb-engine ● Restart the server ● Check ● Use MariaDB> SHOW PLUGINS;
  33. 33. 33 Using MyRocks ● Installation ● Migration ● Tuning ● Replication ● Backup
  34. 34. 34 Data loading – good news ● It’s a write-optimized storage engine ● The same data in MyRocks takes less space on disk than on InnoDB – 1.5x, 3x, or more ● Can load the data faster than InnoDB ● But…
  35. 35. 35 Data loading – limitations ● Limitation: Transaction must fit in memory mysql> ALTER TABLE big_table ENGINE=RocksDB; ERROR 2013 (HY000): Lost connection to MySQL server during query ● Uses a lot of memory, then killed by OOM killer. ● Need to use special settings for loading data – https://mariadb.com/kb/en/library/loading-data-into-myrocks/ – https://github.com/facebook/mysql-5.6/wiki/data-loading
  36. 36. 36 Sorted bulk loading MariaDB> set rocksdb_bulk_load=1; ... # something that inserts the data in PK order MariaDB> set rocksdb_bulk_load=0; Bypasses ● Transactional locking ● Seeing the data you are inserting ● Compactions ● ...
  37. 37. 37 Unsorted bulk loading MariaDB> set rocksdb_bulk_load_allow_unsorted=0; MariaDB> set rocksdb_bulk_load=1; ... # something that inserts data in any order MariaDB> set rocksdb_bulk_load=0; ● Same as previous but doesn’t require PK order
  38. 38. 38 Other speedups for lading ● rocksdb_commit_in_the_middle – Auto commit every rocksdb_bulk_load_size rows ● @@unique_checks – Setting to FALSE will bypass unique checks
  39. 39. 39 Creating indexes ● Index creation doesn’t need any special settings mysql> ALTER TABLE myrocks_table ADD INDEX(...); ● Compression is enabled by default – Lots of settings to tune it further
  40. 40. 40 max_row_locks as a safety setting ● Avoid run-away memory usage and OOM killer: MariaDB> set rocksdb_max_row_locks=10000; MariaDB> alter table t1 engine=rocksdb; ERROR 4067 (HY000): Status error 10 received from RocksDB: Operation aborted: Failed to acquire lock due to max_num_locks limit ● This setting is useful after migration, too.
  41. 41. 41 Using MyRocks ● Installation ● Migration ● Tuning ● Replication ● Backup
  42. 42. 42 Essential MyRocks tuning ● rocksdb_flush_log_at_trx_commit=1 – Same as innodb_flush_log_at_trx_commit=1 – Together with sync_binlog=1 makes the master crash- safe ● rocksdb_block_cache_size=... – Similar to innodb_buffer_pool_size – 500Mb by default (will use up to that amount * 1.5?) – Set higher if have more RAM ● Safety: rocksdb_max_row_locks=...
  43. 43. 43 Indexes ● Primary Key is the clustered index (like in InnoDB) ● Secondary indexes include PK columns (like in InnoDB) ● Non-unique secondary indexes are cheap – Index maintenance is read-free ● Unique secondary indexes are more expensive – Maintenance is not read-free – Reads can suffer from read amplification – Can use @@unique_check=false, at your own risk
  44. 44. 44 Charsets and collations ● Index entries are compared with memcmp(), “mem-comparable” ● “Reversible collations”: binary, latin1_bin, utf8_bin – Can convert values to/from their mem-comparable form ● “Restorable collations” – 1-byte characters, one weght per character – e.g. latin1_general_ci, latin1_swedish_ci, ... – Index stores mem-comparable form + restore data ● Other collations (e.g. utf8_general_ci) – Index-only scans are not supported – Table row stores the original value ‘A’ a 1 ‘a’ a 0
  45. 45. 45 Charsets and collations (2) ● Using indexes on “Other” (collation) produces a warning MariaDB> create table t3 (... a varchar(100) collate utf8_general_ci, key(a)) engine=rocksdb; Query OK, 0 rows affected, 1 warning (0.20 sec) MariaDB> show warningsG *************************** 1. row *************************** Level: Warning Code: 1815 Message: Internal error: Indexed column test.t3.a uses a collation that does not allow index-only access in secondary key and has reduced disk space efficiency in primary key.
  46. 46. 46 Using MyRocks ● Installing ● Migration ● Tuning ● Replication ● Backup
  47. 47. 47 MyRocks only supports RBR ● MyRocks’ highest isolation level is “snapshot isolation”, that is REPEATABLE-READ ● Statement-Based Replication – Slave runs the statements sequentially (~ serializable) – InnoDB uses “Gap Locks” to provide isolation req’d by SBR – MyRocks doesn’t support Gap Locks ● Row-Based Replication must be used – binlog_format=row
  48. 48. 48 Parallel replication ● Facebook/MySQL-5.6 is based on MySQL 5.6 – Parallel replication for data in different databases ● MariaDB has more advanced parallel slave (controlled by @@slave_parallel_mode) – Conservative (group-commit based) – Optimistic (apply transactions in parallel, on conflict roll back the transaction with greater seq_no) – Aggressive
  49. 49. 49 Parallel replication (2) ● Conservative parallel replication works with MyRocks – Different execution paths for log_slave_updates=ON|OFF – ON might be faster ● Optimistic and aggressive do not provide speedups over conservative.
  50. 50. 50 mysql.gtid_slave_pos ● mysql.gtid_slave_pos stores slave’s position ● Use a transactional storage engine for it – Database contents will match slave position after crash recovery – Makes the slave crash-safe. ● mysql.gtid_slave_pos uses a different engine? – Cross-engine transaction (slow).
  51. 51. 51 Per-engine mysql.gtid_slave_pos ● The idea: – Have mysql.gtid_slave_pos_${engine} for each storage engine – Slave position is the biggest position in all tables. – Transaction affecting only MyRocks will only update mysql.gtid_slave_pos_rocksdb ● Configuration: --gtid-pos-auto-engines=engine1,engine2,...
  52. 52. 52 Per-engine mysql.gtid_slave_pos ● Available in MariaDB 10.3 ● Thanks for the patch to – Kristian Nielsen (implementation) – Booking.com (request and funding) ● Don’t let your slave be 2-3x slower: – 10.3: my.cnf: gtid-pos-auto-engines=INNODB,ROCKSDB – 10.2: MariaDB> ALTER TABLE mysql.gtid_slave_pos ENGINE=RocksDB;
  53. 53. 53 Replication takeaways ● Must use binlog_format=row ● Parallel replication works (slave_parallel_mode=conservative) ● Set your mysql.gtid_slave_pos correctly – 10.2: preferred engine – 10.3: per-engine
  54. 54. 54 Using MyRocks ● Installation ● Migration ● Tuning ● Replication ● Backup
  55. 55. 55 myrocks_hotbackup tool ● Takes a hot (non-blocking) backup of RocksDB data mkdir /path/to/checkpoint-dir myrocks_hotbackup --user=... --port=... --checkpoint_dir=/path/to/checkpoint-dir | ssh "tar -xi ..." myrocks_hotbackup --move_back --datadir=/path/to/restore-dir --rocksdb_datadir=/path/to/restore-dir/#rocksdb --rocksdb_waldir= /path/to/restore-dir/#rocksdb --backup_dir=/where/tarball/was/unpacked ● Making a backup ● Starting from backup
  56. 56. 56 myrocks_hotbackup limitations ● Targets MyRocks-only installs – Copies MyRocks data and supplementary files – Does *not* work for mixed MyRocks+InnoDB setups ● Assumes no DDL is running concurrently with the backup – May get a wrong frm file ● Not very user-friendly
  57. 57. 57 Further backup considerations ● mariabackup does not work with MyRocks currently ● myrocks_hotbackup is similar (actually simpler) than xtrabackup ● Will look at making mariabackup work with MyRocks
  58. 58. 58 Using MyRocks ● Installation ● Migration ● Tuning ● Replication ● Backup ✔ ✔ ✔ ✔ ✔
  59. 59. 59 Documentation ● https://mariadb.com/kb/en/library/myrocks/ ● https://github.com/facebook/mysql-5.6/wiki ● https://mariadb.com/kb/en/library/differences- between-myrocks-variants/
  60. 60. 60 Conclusions ● MyRocks is a write-optimized storage engine based on LSM-trees – Developed and used by Facebook ● Available in MariaDB 10.2 and MariaDB 10.3 – Currently RC, Stable very soon ● It has some limitations and requires some tuning – Easy to tackle if you were here for this talk :-)
  61. 61. Thanks! Q&A

×