
Percona Live, February 2011


Slide deck from talk at Percona Live in February 2011


  1. High Value Transaction Processing (Mark Callaghan)
  2. What do I mean by value? ▪ Low price? ▪ High price/performance? ▪ Valuable data
  3. OLTP in the datacenter ▪ Sharding ▪ Availability ▪ Legacy applications ▪ Used by many applications
  4. Sharding ▪ Sharding is easy, resharding is hard ▪ Joins within a shard are still frequent and useful ▪ Some all-shards joins must use Hive ▪ Provides some fault-isolation benefits
  5. Availability ▪ Sources of downtime ▪ Schema change (but now we have OSC) ▪ Manual failover ▪ Misbehaving applications ▪ Oops
  6. Used by many applications If your company is successful, then ▪ Your database will be accessed by many different applications ▪ Application authors might not be MySQL experts ▪ Application owners might have different priorities than the DB team
  7. Legacy applications If your company is successful, then you will have ▪ Applications written many years ago by people who are gone ▪ Design decisions that are not good for your current size ▪ Not enough resources or time to rewrite applications
  8. Our busy OLTP deployment ▪ Query response time ▪ 4 ms reads, 5 ms writes ▪ Network bytes per second ▪ 38 GB peak ▪ Queries per second ▪ 13M peak ▪ Rows read per second ▪ 450M peak ▪ Rows changed per second ▪ 3.5M peak ▪ InnoDB page IO per second ▪ 5.2M peak
  9. Recent improvements ▪ Joint work by Facebook, Percona and Oracle/MySQL ▪ Prevent InnoDB stalls ▪ Stalls from caches ▪ Stalls from mutexes ▪ IO efficiency ▪ Improve monitoring ▪ Improve XtraBackup
  10. How do you measure performance? ▪ Response time variance leads to bad user experiences ▪ Optimizations that defer work must handle steady-state loads ▪ When designing a server the choices are: ▪ No concurrency (and no mutexes) ▪ One mutex ▪ More than one mutex
  11. This has good average performance
  12. Which metric matters?
  13. Stalls from caches Caches that defer expensive operations must eventually complete them at the same rate at which they are deferred. ▪ InnoDB purge ▪ InnoDB insert buffer ▪ Async writes are not async ▪ Fuzzy checkpoint constraint enforcement
  14. InnoDB purge stalls ▪ InnoDB purge removes delete-marked rows ▪ Done by the main background thread in the 5.1 plugin ▪ Optionally done by a separate thread in 5.5 ▪ Purge is single-threaded and might be stalled by disk reads ▪ The further it gets behind, the more likely it is that it won’t catch up ▪ Multiple purge threads are needed; turning the main background thread into a dedicated purge thread isn’t enough
      /* main-thread purge loop: keep purging until nothing is left */
      do {
          n_pages_purged = trx_purge();
      } while (n_pages_purged);
  15. InnoDB insert buffer stalls ▪ The insert buffer can fill faster than it is drained ▪ Drain rate is 5% of innodb_io_capacity ▪ bugs.mysql.com/59214 ▪ Fixed in the Facebook patch and XtraDB ▪ Patch pending for MySQL 5.5
  16. Performance drops when ibuf is full
  17. Otherwise, the insert buffer is awesome
  18. Fuzzy checkpoint constraint ▪ TotalLogSize = #log_files × innodb_log_file_size ▪ AsyncLimit = 0.70 × TotalLogSize ▪ SyncLimit = 0.75 × TotalLogSize ▪ OldestDirtyLSN is the smallest oldest_modification LSN of all dirty pages in the buffer pool ▪ Age = CurrentLSN - OldestDirtyLSN ▪ The constraint: ▪ If Age > SyncLimit then flush_dirty_pages_sync() ▪ Else if Age > AsyncLimit then flush_dirty_pages_async()
  19. Async page writes are not async ▪ Async page write requests submitted to enforce the fuzzy checkpoint constraint are not async ▪ User transactions may do this via log_preflush_pool_modified_pages ▪ The caller does a large write for the doublewrite buffer ▪ The caller then submits in-place write requests for the background write threads ▪ The caller then waits for the background write threads to finish ▪ bugs.mysql.com/55004 ▪ Fixed in the Facebook patch
  20. Fuzzy checkpoint constraint enforcement Prior to the InnoDB plugin 5.1.38, page writes done to enforce the fuzzy checkpoint constraint were not submitted by the main background thread. ▪ InnoDB added innodb_adaptive_flushing in the 5.1.38 plugin ▪ Percona added innodb_adaptive_checkpoint ▪ The Facebook patch added innodb_background_checkpoint
  21. Sysbench QPS at 20 second intervals with checkpoint stalls
  22. Stalls from mutexes ▪ Extending InnoDB files ▪ Opening InnoDB tables ▪ Purge/undo lock conflicts ▪ TRUNCATE table and LOCK_open ▪ DROP table and LOCK_open ▪ Buffer pool invalidate ▪ LOCK_open and kernel_mutex ▪ Excessive calls to fcntl ▪ Deadlock detection overhead ▪ innodb_thread_concurrency
  23. Stalls from extending InnoDB files ▪ A global mutex is held while the writes that extend an InnoDB file are done ▪ All reads on the file are blocked until the writes are done ▪ bugs.mysql.com/56433 ▪ To be fixed real soon in the Facebook patch
  24. Stalls from opening InnoDB tables ▪ Opening table handler instances is serialized on LOCK_open. Index cardinality stats might then be computed using random reads ▪ bugs.mysql.com/49463 and bugs.mysql.com/53046 ▪ Fixed in the Facebook patch and MySQL 5.5 ▪ When stats are recomputed many uses of that table will stall ▪ Fixed in the Facebook patch ▪ Index stats could be recomputed too frequently ▪ bugs.mysql.com/56340 ▪ Fixed in the Facebook patch, MySQL 5.1 and MySQL 5.5
  25. Stalls from purge/undo lock conflicts ▪ Purge and undo are not concurrent on the same InnoDB table ▪ Purge gets a share lock on the table ▪ Undo gets an exclusive lock on the table ▪ REPLACE statements that use insert-then-undo can generate undo ▪ bugs.mysql.com/54538 ▪ Fixed in MySQL 5.1.55 and MySQL 5.5
  26. TRUNCATE table and LOCK_open ▪ LOCK_open is held when the truncate is done by InnoDB ▪ When file-per-table is used the file must be removed and that can take too long ▪ The InnoDB buffer pool LRU must be scanned ▪ New queries cannot be started ▪ bugs.mysql.com/41158 and bugs.mysql.com/56696 ▪ Fixed in MySQL 5.5 courtesy of meta-data locking
  27. DROP table and LOCK_open ▪ LOCK_open is held when the drop is done by InnoDB ▪ When file-per-table is used the file must be removed and that can take too long ▪ The InnoDB buffer pool LRU must be scanned ▪ New queries cannot be started ▪ bugs.mysql.com/56655 ▪ Fixed in the Facebook patch ▪ Do most InnoDB processing in the background drop queue ▪ Fixed in MySQL 5.5 courtesy of meta-data locking
  28. TRUNCATE/DROP table and invalidate ▪ Pages for table removed from buffer pool and adaptive hash ▪ InnoDB buffer pool mutex locked while the LRU is scanned ▪ This is slow with a large buffer pool ▪ Most threads in InnoDB will block waiting for the buffer pool mutex ▪ bugs.mysql.com/51325 and bugs.mysql.com/56332 ▪ I hope Yasufumi can fix it
  29. LOCK_open and kernel_mutex conflicts ▪ Thread A ▪ Gather table statistics while holding LOCK_open ▪ Block on kernel_mutex while starting a transaction ▪ Thread B ▪ Hold kernel_mutex while doing deadlock detection ▪ All other threads block on LOCK_open or kernel_mutex ▪ bugs.mysql.com/51557 ▪ Fixed in MySQL 5.5
  30. Stalls from excessive calls to fcntl ▪ fcntl ▪ My Linux kernels get the big kernel lock on fcntl calls ▪ MySQL called fcntl too often ▪ Doubled peak QPS by hacking MySQL to call fcntl less ▪ Almost 200,000 QPS without using HandlerSocket ▪ bugs.mysql.com/54790 ▪ Fixed in Facebook patch, then reverted because it broke SSL tests ▪ Not sure where or when this will be fixed
  31. Sysbench read-only with fcntl fix
  32. Stalls from deadlock detection overhead ▪ InnoDB deadlock detection was very inefficient. The worst case was when all threads waited on the same row lock. ▪ Added an option to disable it in the Facebook patch and rely on the lock wait timeout ▪ MySQL made it more efficient in MySQL 5.1 ▪ bugs.mysql.com/49047
  33. Stalls from innodb_thread_concurrency ▪ When there are 1000+ sleeping threads it can take too long to wake up a specific thread ▪ Changed innodb_thread_concurrency to use FIFO scheduling in addition to the existing LIFO; FIFO + LIFO = FLIFO ▪ Fixed in the Facebook patch
  34. Sysbench TPS with FLIFO
  35. IO efficiency High priority problems for me are: ▪ Reducing IOPs used for my workload ▪ Supporting very large databases Significant improvements: ▪ Switch from mysqldump to XtraBackup ▪ Run innosim to confirm storage performance ▪ Tune InnoDB ▪ Improve schemas and queries
  36. mysqldump vs XtraBackup ▪ mysqldump is slower for backup ▪ Clustered index is scanned row-at-a-time in key order (lots of random reads) ▪ Backup accounts for half of the disk reads for servers I watch ▪ Single-table restore is easy with mysqldump ▪ Possible with XtraBackup thanks to work by Vamsi from Facebook ▪ Incremental backup ▪ Not possible with mysqldump ▪ XtraBackup has incremental (scan all data, write only the changed blocks) ▪ Vamsi from Facebook added support for really incremental, scan & write only the changed blocks
  37. innosim storage benchmark ▪ InnoDB IO simulator that models ▪ Doublewrite buffer ▪ Dirty page writes ▪ Transaction log and binlog fsync and IO ▪ User transactions that do read, write and commit ▪ Search for “facebook innosim” ▪ Source code on launchpad
  38. Tune InnoDB ▪ It is not easy to support many concurrent disk reads ▪ innodb_thread_concurrency tickets are not released when waiting for a read ▪ If innodb_thread_concurrency is too high then writers suffer ▪ If innodb_thread_concurrency is too low then readers suffer ▪ Smaller pages are better for some but not all tables ▪ A large log file can reduce the dirty page flush rate ▪ A large buffer pool can reduce the page read rate
  39. IOPs is a function of size and concurrency
  40. Smaller pages aren’t always better
  41. Checkpoint IO rate by log file size
  42. Page read rate by buffer pool size
  43. Improve schemas ▪ Make your performance-critical queries index-only ▪ Primary key columns are included in the secondary index ▪ Understand how the insert buffer makes index maintenance cheaper ▪ Figure out how to do schema changes with minimal downtime ▪ We used the Online Schema Change tool (thanks Vamsi) ▪ You can also do the schema change on a slave first and then promote it
  44. Monitoring ▪ Per table, index and account via information_schema tables ▪ Efficient and always enabled ▪ Easy to use ▪ Enhanced slow query log ▪ The Facebook patch added options to do sampling for the slow query log ▪ Sample from all queries and from all queries that have an error ▪ Error is limited to errno; error text must wait for the 5.5 plugin ▪ Aggregate by query text and URL from the query comment
  45. Open Problems ▪ Parallel replication apply ▪ Support max concurrent queries ▪ Automate slave failover when a master fails ▪ Use InnoDB compression for OLTP ▪ Multi-master replication with conflict resolution
  46. Parallel replication apply ▪ Replication apply is single-threaded. This causes lag on IO-bound slaves even when SQL is simple ▪ mk-slave-prefetch can help but something better is needed ▪ Is a thread running BEGIN; replay-slave-sql; ROLLBACK better? ▪ I want: ▪ N replay queues ▪ Binlog events (SBR or RBR) hashed to queues by database names ▪ Each queue replayed in parallel
  47. Max concurrent queries ▪ Use large values for max concurrent connections per account ▪ Enforce smaller values for max concurrent queries ▪ We have begun testing an implementation ▪ Enforce at statement entry ▪ Account for threads that block (row lock, disk IO, network IO)
  48. Automate slave failover ▪ Global transaction IDs from the Google patch are awesome ▪ But I don’t have the skills to port or support it ▪ A unique ID per binlog group or event might be sufficient ▪ Add an attribute to binlog event metadata ▪ Preserve it on the slave similar to the server ID
  49. InnoDB compression for OLTP ▪ Change InnoDB to not log page images for compressed pages ▪ Logging them increases the log IO rate ▪ Increasing the log IO rate then increases the checkpoint IO rate ▪ Change InnoDB to use QuickLZ instead of zlib for compression ▪ Add an option to limit compression to the PK index ▪ Add per-table compression statistics
  50. MySQL in the datacenter ▪ Previously dominated the market ▪ Now it must learn to share ▪ PostgreSQL continues to improve for OLTP ▪ HBase, Cassandra, MongoDB are getting traction today
  51. Why NoSQL ▪ Do less, but do it better ▪ Some offer write-optimized data stores ▪ Some don’t require sharding ▪ Interesting HA models ▪ Cassandra doesn’t have the notion of failover ▪ HBase doesn’t require failover when a server dies ▪ Healthy development communities improve code quickly
  52. What comes next ▪ Batch extraction is not the answer for MySQL/NoSQL integration ▪ NoSQL deployments will be reminded that ▪ Some of your problems are independent of technology ▪ You need better monitoring ▪ There is downtime when you need to modify the clustered index ▪ Database ops is hard with legacy apps and multi-user deployments ▪ In a few years someone will document the many stalls in HBase
  53. The End. Thank you
