
M|18 How Facebook Migrated to MyRocks


  1. MyRocks Deployment at Facebook and Roadmaps
     Yoshinori Matsunobu, Production Engineer, Facebook. Feb 2018
  2. Agenda
     ▪ MySQL at Facebook
     ▪ MyRocks overview
     ▪ Production deployment
     ▪ MyRocks configuration and monitoring
     ▪ Future plans
  3. MySQL "User Database (UDB)" at Facebook
     ▪ Stores the social graph
     ▪ Massively sharded
     ▪ Low latency
     ▪ Automated operations
     ▪ Pure flash storage (constrained by space, not by CPU/IOPS)
  4. What is MyRocks
     ▪ MySQL on top of RocksDB (RocksDB storage engine)
     ▪ Open source, distributed by MariaDB and Percona as well
     ▪ (Diagram: MySQL clients talk to the SQL layer — SQL/connector, parser, optimizer, replication — which sits on top of pluggable storage engines such as InnoDB and RocksDB)
     ▪ http://myrocks.io/
  5. MyRocks Initial Goal at Facebook
     ▪ (Chart: resource usage relative to the machine limit in the main database)
       ▪ InnoDB: Space 90%, CPU 20%, IO 15%
       ▪ MyRocks: Space 45%, CPU 21%, IO 15%
  6. MyRocks features
     ▪ Clustered index (same as InnoDB)
     ▪ Bloom filters and column families
     ▪ Transactions, including consistency between the binlog and RocksDB
     ▪ Faster data loading, deletes, and replication
     ▪ Dynamic options
     ▪ TTL
     ▪ Online logical and binary backup
  7. MyRocks vs InnoDB
     ▪ MyRocks pros
       ▪ Much smaller space (half compared to compressed InnoDB)
       ▪ Better cache hit rate
       ▪ Faster writes = faster replication
       ▪ Much smaller bytes written
     ▪ MyRocks cons (improvements in progress)
       ▪ Lacks several features: no SBR (statement-based replication), gap locks, foreign keys, fulltext indexes, or spatial indexes. Need a case-sensitive collation for performance
       ▪ Reads are slower, especially if your data fits in memory
       ▪ More dependent on the filesystem and OS. Lacks solid direct I/O. Must use the newer 4.6 kernel
  8. MyRocks migration: technical challenges
     ▪ Initial migration
       ▪ Creating MyRocks instances without downtime
       ▪ Loading into MyRocks tables within a reasonable time
       ▪ Verifying data consistency between InnoDB and MyRocks
     ▪ Continuous monitoring
       ▪ Resource usage: space, IOPS, CPU, and memory
       ▪ Query plan outliers
       ▪ Stalls and crashes
  9. MyRocks migration: technical challenges (2)
     ▪ When running MyRocks on the master
       ▪ RBR (row-based binary logging)
       ▪ Removing queries that rely on InnoDB gap locks
  10. Creating the first MyRocks instance without downtime
     ▪ Pick one of the InnoDB slave instances, then run a logical dump and restore
     ▪ Stopping one slave does not affect services
     ▪ (Diagram: Master (InnoDB) with Slave1-3 (InnoDB); Slave4 (MyRocks) is built via stop & dump & load)
  11. Faster data loading
     ▪ Normal write path in MyRocks/RocksDB: write requests go to the WAL and MemTable, are flushed to Level 0 SSTs, then compacted down through Level 1 ... Level max SSTs
     ▪ Faster write path: "SET SESSION rocksdb_bulk_load=1;" writes directly to the bottommost (Level max) SSTs
     ▪ Original data must be sorted by primary key
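A minimal session sketch of the fast write path above; the file path and table name are illustrative, and the input must already be sorted by primary key:

```sql
-- Bulk load into a MyRocks table, bypassing the MemTable/WAL/compaction path.
SET SESSION rocksdb_bulk_load=1;
LOAD DATA INFILE '/tmp/users_sorted.csv' INTO TABLE users;  -- hypothetical data file and table
SET SESSION rocksdb_bulk_load=0;  -- finalize: the generated SST files are ingested here
```

Turning the session variable back off is what completes the load, so errors in the sorted input surface at that point.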
  12. Data migration steps
     ▪ Dst) CREATE TABLE ... ENGINE=ROCKSDB; (creating MyRocks tables with proper column families)
     ▪ Dst) ALTER TABLE ... DROP INDEX ...; (dropping secondary keys)
     ▪ Src) STOP SLAVE;
     ▪ mysqldump --host=innodb-host --order-by-primary --rocksdb-bulk-load | mysql --host=myrocks-host --init-command="SET sql_log_bin=0"
     ▪ Dst) ALTER TABLE ... ADD INDEX ...; (adding secondary keys)
     ▪ Src, Dst) START SLAVE;
  13. Data verification
     ▪ MyRocks/RocksDB is relatively new database technology
       ▪ Might have more bugs than the robust InnoDB
     ▪ Ensuring data consistency helps avoid showing conflicting results
  14. Verification tests
     ▪ Index count check between the primary key and secondary keys
       ▪ If any index is broken, it can be detected
       ▪ SELECT 'PRIMARY', COUNT(*) FROM t FORCE INDEX (PRIMARY) UNION SELECT 'idx1', COUNT(*) FROM t FORCE INDEX (idx1)
       ▪ Can't be used if there is no secondary key
     ▪ Index stats check
       ▪ Checking that "rows" shown in SHOW TABLE STATUS is not far from the actual row count
     ▪ Checksum tests with InnoDB
       ▪ Comparing an InnoDB instance and a MyRocks instance
       ▪ Creating a transaction-consistent snapshot at the same GTID position, scanning, then comparing checksums
     ▪ Shadow correctness check
       ▪ Capturing read traffic
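The checksum test above can be sketched roughly as follows, run on both the InnoDB and the MyRocks instance once replication has reached the same GTID position (the table name is illustrative, and the exact snapshot mechanics at Facebook may differ):

```sql
STOP SLAVE;                                   -- freeze this replica at a known GTID
START TRANSACTION WITH CONSISTENT SNAPSHOT;   -- fix a consistent view of the data
CHECKSUM TABLE t EXTENDED;                    -- full-scan checksum; compare across instances
COMMIT;
START SLAVE;
```

If the checksums from the two engines match for every table, the migrated data is byte-for-byte equivalent at that replication position.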
  15. Shadow traffic tests
     ▪ We have a shadow test framework
       ▪ A MySQL audit plugin captures read/write queries from production instances
       ▪ They are replayed against shadow master instances
     ▪ Shadow master tests
       ▪ Client errors
       ▪ Rewriting queries that rely on gap locks
         ▪ gap_lock_raise_error=1, gap_lock_write_log=1
  16. Creating the second MyRocks instance without downtime
     ▪ (Diagram: Master (InnoDB) with Slave1-2 (InnoDB); Slave4 (MyRocks) seeds Slave3 (MyRocks) via myrocks_hotbackup, an online binary backup tool)
  17. Promoting MyRocks as a master
     ▪ (Diagram: the master is now MyRocks; Slave1-3 remain InnoDB, Slave4 is MyRocks)
  18. Promoting MyRocks as a master (continued)
     ▪ (Diagram: the master and all four slaves now run MyRocks)
  19. Our MyRocks configurations
     ▪ Using the latest 4.6 Linux kernel, which fixed many filesystem (XFS) and memory allocation issues
     ▪ Building MySQL/MyRocks with jemalloc
     ▪ Using a 16KB block size
     ▪ Five major column families, depending on data characteristics
     ▪ Using ZSTD compression in the bottommost level, LZ4 for the rest
     ▪ Not creating bloom filters in the bottommost level, to save space and memory
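A hedged my.cnf sketch of the settings above. The option names follow MyRocks conventions, but the exact values and any per-column-family overrides used in production are not shown in the slides:

```ini
[mysqld]
# 16KB RocksDB block size
rocksdb_block_size=16384
# LZ4 in the upper levels, ZSTD at the bottommost level;
# optimize_filters_for_hits skips bloom filters at the bottommost
# level, trading a little read work for space and memory
rocksdb_default_cf_options=compression=kLZ4Compression;bottommost_compression=kZSTD;optimize_filters_for_hits=true
```

Per-column-family settings can be layered on top of these defaults with rocksdb_override_cf_options, which is presumably how the five column families get their data-dependent tuning.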
  20. Monitoring
     ▪ Column family statistics
     ▪ Lock contention
     ▪ Bloom filter hit rates
     ▪ Tombstones (delete markers)
  21. SHOW ENGINE ROCKSDB STATUS
     ▪ Column family statistics, including size and read/write amplification per level
     ▪ Memory usage
     Example output:
     *************************** 7. row ***************************
       Type: CF_COMPACTION
       Name: default
     Status:
     ** Compaction Stats [default] **
     Level  Files   Size(MB)   Score  Read(GB)  Rn(GB)  Rnp1(GB)  Write(GB)  Wnew(GB)  Moved(GB)  W-Amp  Rd(MB/s)  Wr(MB/s)  Comp(sec)  Comp(cnt)  Avg(sec)  KeyIn  KeyDrop
     ------------------------------------------------------------------------------------------------------------------------------------------------------------------
     L0     2/0     51.58      0.5    0.0       0.0     0.0       0.3        0.3       0.0        0.0    0.0       40.3      7          10         0.669     0      0
     L3     6/0     109.36     0.9    0.7       0.7     0.0       0.6        0.6       0.0        0.9    43.8      40.7      16         3          5.172     7494K  297K
     L4     61/0    1247.31    1.0    2.0       0.3     1.7       2.0        0.2       0.0        6.9    49.7      48.5      41         9          4.593     15M    176K
     L5     989/0   12592.86   1.0    2.0       0.3     1.8       1.9        0.1       0.0        7.4    8.1       7.4       258        8          32.209    17M    726K
     L6     4271/0  127363.51  0.0    0.0       0.0     0.0       0.0        0.0       0.0        0.0    0.0       0.0       0          0          0.000     0      0
     Sum    5329/0  141364.62  0.0    4.7       1.2     3.5       4.7        1.2       0.0        17.9   15.0      15.0      321        30         10.707    41M    1200K
  22. Monitoring lock contention
     ▪ "Snapshot conflict" errors with REPEATABLE READ isolation
       ▪ Caused by implementation differences in MyRocks (PostgreSQL-style snapshot isolation)
       ▪ We use tx-isolation=READ-COMMITTED to solve the issue
     ▪ MyRocks supports basic lock contention metrics
       ▪ Queries that hit deadlock errors
       ▪ The number of deadlocks and lock wait timeouts
       ▪ The number of lock waits and lock wait time
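The isolation-level workaround above can be applied server-wide in my.cnf (tx-isolation) or, as a sketch, per session:

```sql
-- Avoid MyRocks "snapshot conflict" rollbacks under REPEATABLE READ
-- by running write transactions at READ COMMITTED instead
SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED;
```

Under READ COMMITTED each statement takes a fresh snapshot, so a concurrent write to the same row no longer forces the whole transaction to retry.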
  23. SHOW GLOBAL STATUS
     mysql> show global status like 'rocksdb%';
     +---------------------------------------+-------------+
     | Variable_name                         | Value       |
     +---------------------------------------+-------------+
     | rocksdb_rows_deleted                  | 216223      |
     | rocksdb_rows_inserted                 | 1318158     |
     | rocksdb_rows_read                     | 7102838     |
     | rocksdb_rows_updated                  | 1997116     |
     ....
     | rocksdb_bloom_filter_prefix_checked   | 773124      |
     | rocksdb_bloom_filter_prefix_useful    | 308445      |
     | rocksdb_bloom_filter_useful           | 10108448    |
     ....
  24. Lessons learned
     ▪ Linux kernel VM allocation stalls
       ▪ We use buffered I/O for MyRocks/RocksDB
       ▪ Linux fixed VM allocation stalls in the newer 4.6 kernel
     ▪ Don't delete files too fast on flash
       ▪ Avoid multi-second TRIM stalls
       ▪ We stopped mounting with the discard option for the binlog and WAL
     ▪ Hard to generate core files
       ▪ MyRocks + jemalloc uses too much VIRT size
  25. Our current production status
     ▪ We COMPLETED the InnoDB to MyRocks migration in UDB
     ▪ We saved 50% space in UDB compared to compressed InnoDB
     ▪ We started working on migrating other large database tiers
  26. Development roadmaps
     ▪ Helping MariaDB and Percona Server release with a stable MyRocks
     ▪ Matching read performance vs InnoDB
       ▪ https://smalldatum.blogspot.com
     ▪ Supporting mixed engines
     ▪ Better replication
     ▪ Supporting bigger instance sizes
  27. Mixed engines
     ▪ Currently our production use case is either a "MyRocks only" or an "InnoDB only" instance
     ▪ There are several internal/external use cases that want to use InnoDB and MyRocks within the same instance, though a single transaction does not span engines
     ▪ Online logical/binary backup support and benchmarks are concerns
       ▪ The current plan is extending xtrabackup to integrate myrocks_hotbackup
     ▪ Considering backporting gtid_pos_auto_engines from MariaDB
  28. Better replication
     ▪ Removing the engine log
       ▪ Both internal and external benchmarks (e.g. Amazon, Alibaba) show that QPS improves significantly with the binlog disabled
       ▪ The real problem is having two logs, the binlog and the engine log, which requires 2PC and ordered commits
       ▪ One log: use a single log as the source of truth for commits, either the binlog, a binlog-like service, or the RocksDB WAL
       ▪ We heavily rely on binlogs (for semisync and binlog consumers); how much performance we gain by not writing the WAL is TBD
     ▪ Parallel replication apply
       ▪ Batching
       ▪ Not using transactions on slaves
  29. Supporting bigger instance sizes
     ▪ Problem statement: a shared-nothing database is not a general-purpose database
       ▪ MySQL Cluster, Spider, Vitess
       ▪ Good if you have specific purposes. Might cause issues if people lack expertise in atomic transactions, joins, and secondary keys
     ▪ Suggestion: we now have 256GB+ RAM and 10TB+ flash on commodity servers. Why not run one big instance and put everything there?
       ▪ Bigger instances may help general-purpose small to mid-size applications
       ▪ They don't have to worry about sharding. Atomic transactions, joins, and secondary keys just work
       ▪ e.g. Amazon Aurora (supporting up to 60TB instances) and Alibaba PolarDB (~100TB instances)
  30. Future plans to support bigger instances (1)
     ▪ Parallel transactional mysqldump
     ▪ Parallel query
       ▪ e.g. how do we make mysqldump finish within 24 hours on a 20TB table?
     ▪ Parallel binary copy
       ▪ e.g. how quickly can we create a 60TB replica instance in a remote region?
     ▪ Parallel DDL, parallel loading
     ▪ Resumable DDL
       ▪ e.g. if a DDL is expected to take 10 days, what happens if mysqld restarts after 8 days?
  31. Future plans to support bigger instances (2)
     ▪ Better join algorithms
     ▪ Much faster replication
     ▪ Handling 10x the connection requests and queries
     ▪ Good resource control
     ▪ H/W perspective: shared storage and elastic computing units
       ▪ Read replicas can scale out from the same shared storage
  32. Summary
     ▪ We finished deploying MyRocks in our production user database (UDB)
     ▪ You can start by deploying slaves, with consistency checks
     ▪ We have added many status counters for instance monitoring
     ▪ More interesting features will come this year
  33. (c) 2009 Facebook, Inc. or its licensors. "Facebook" is a registered trademark of Facebook, Inc. All rights reserved.
