Monitor some of the things

Baron Schwartz's slides about monitoring, presented at DevOps Days NYC

  1. 2013-10-18 MONITOR SOME OF THE THINGS
  2. ME
      • Cofounder of @VividCortex
      • Author of High Performance MySQL (3rd Edition, covers version 5.5: "Optimization, Backups, Replication, and more", by Baron Schwartz, Peter Zaitsev & Vadim Tkachenko)
      • @xaprb on Twitter
      • baron@vividcortex.com
      • http://www.linkedin.com/in/xaprb
  3. RANT, RECAPPED
      • The sky is falling
      • Tools drive processes, and we need better tools designed for methods
      • Pay attention to CAPS (Capacity, Availability, Performance, Scalability)
      • Monitoring tools need to be a lot smarter
      • Measure and monitor “work getting done”
  4. HARD CAPACITY
      • Disk volume
      • CPU cycles
      • max_connections
      • File descriptors, sockets, TCP port numbers, etc.
      • %used, absolute quantity available
  5. SOFT CAPACITY
      • Neil Gunther’s Universal Scalability Law
      • %used, absolute quantity available
      • Throughput, concurrency, errors
  6. AVAILABILITY
      • Availability is absence of downtime
      • %used, absolute quantity available
      • Throughput, concurrency, errors
      • MTBF, MTTR, MTTD, %availability
  7. TASK PERFORMANCE
      • Task performance is consistently fast response time.
      • Measure an SLA in percentile response time per task, over observation intervals
      • %used, absolute quantity available
      • Throughput, concurrency, errors
      • MTBF, MTTR, MTTD, %availability
      • Response time, 95% response time
  8. RESOURCE PERFORMANCE
      • Resource performance is ability to run tasks consistently fast.
      • %used, absolute quantity available
      • Throughput, concurrency, errors
      • MTBF, MTTR, MTTD, %availability
      • Response time, 95% response time
      • Throughput, concurrency, busy time, total response time, backlog/queue
  9. SCALABILITY
      • Universal Scalability Law again
      • %used, absolute quantity available
      • Throughput, concurrency, errors
      • MTBF, MTTR, MTTD, %availability
      • Response time, 95% response time
      • Throughput, concurrency, busy time, total response time, backlog/queue
  10. STALL DETECTION
      • Overloaded or underperforming?
      • %used, absolute quantity available
      • Throughput, concurrency, errors
      • MTBF, MTTR, MTTD, %availability
      • Response time, 95% response time
      • Throughput, concurrency, busy time, total response time, backlog/queue
      • Utilization, saturation, errors, sources of load/demand
  11. GIT ‘ER DONE: MONITOR WORK AND RESOURCES
  12. WHAT NOT TO DO
      • Don’t use top-N lists from Google
      • Don’t just do what’s included in some Nagios plugin
  13. №1 TOP 10 LIST
      1. MySQL availability
      2. Presence of insecure users and databases
      3. Aborted connects
      4. Error log
      5. Deadlocks
      6. Change in server configuration
      7. Slow query log
      8. Slave lag
      9. Percentage of maximum allowed connections
      10. Percentage of full table scans
  14. №2 TOP 10 LIST
      1. Threads_connected
      2. Created_tmp_disk_tables
      3. Handler_read_first
      4. Innodb_buffer_pool_wait_free
      5. Key_reads
      6. Max_used_connections
      7. Open_tables
      8. Select_full_join
      9. Slow_queries
      10. Uptime
  15. №1 PLUGIN
      1. threadcache-hitrate (Hit rate of the thread-cache)
      2. slave-io-running (Slave io running: Yes)
      3. slave-sql-running (Slave sql running: Yes)
      4. qcache-hitrate (Query cache hitrate)
      5. qcache-lowmem-prunes (Query cache entries pruned because of low memory)
      6. keycache-hitrate (MyISAM key cache hitrate)
      7. bufferpool-hitrate (InnoDB buffer pool hitrate)
      8. bufferpool-wait-free (InnoDB buffer pool waits for clean page available)
      9. log-waits (InnoDB log waits because of a too small log buffer)
      10. tablecache-hitrate (Table cache hitrate)
      11. table-lock-contention (Table lock contention)
      12. index-usage (Usage of indices)
      13. tmp-disk-tables (Percent of temp tables created on disk)
      14. long-running-procs (long running processes)
  16. №2 PLUGIN
      1. connection-time
      2. uptime
      3. threads-connected
      4. threadcache-hitrate
      5. q[uery]cache-hitrate
      6. q[uery]cache-lowmem-prunes
      7. [myisam-]keycache-hitrate
      8. [innodb-]bufferpool-hitrate
      9. [innodb-]bufferpool-wait-free
      10. [innodb-]log-waits
      11. tablecache-hitrate
      12. table-lock-contention
      13. index-usage
      14. tmp-disk-tables
      15. slow-queries
      16. long-running-procs
      17. slave-lag
      18. slave-io-running
      19. slave-sql-running
      20. sql
      21. open-files
      22. encode
      23. cluster-ndb-running
  17. №3 PLUGIN
  18. SURFACE AREA (photo: HTTP://WWW.FLICKR.COM/PHOTOS/NASAMARSHALL/5926864640/)
  19. DUPLICATE SIGNALS
      • Queries
      • Com_admin_commands
      • Com_assign_to_keycache
      • Com_alter_db
      • Com_alter_db_upgrade
      • Com_alter_event
      • Com_alter_function
      • Com_alter_procedure
      • Com_alter_server
      • Com_alter_table
      • Com_alter_tablespace
      • Com_alter_user
      • Com_analyze
      • Com_begin
      • Com_binlog
      • Com_ad_nauseum
  20. DESIRABLE METRICS
      • %used, absolute quantity available
      • Throughput, concurrency, errors
      • MTBF, MTTR, MTTD, %availability
      • Response time, 95% response time
      • Throughput, concurrency, busy time, total response time, backlog/queue
      • Utilization, saturation, errors, sources of load/demand
  21. Desirable / Easy (figure labels only)
  22. Desirable / Easy (figure labels only)
  23. IRRELEVANT EXAMPLE PLEASE?
  24. RESOURCE LIMITS
      • Threads_connected near max_connections?
      • %table cache used?
      • Open file handles?
      • Long-running queries/transactions?
  25. ERRORS
      • Deadlocks?
      • Aborted connects?
  26. AVAILABILITY
      • Ability to connect and run a query?
      • Uptime is small?
      • Replication is running?
  27. PERFORMANCE
      • You can get throughput (Queries) and concurrency (Threads_running) from MySQL
      • But in a Nagios check, no context to know whether they’re good or bad
      • You generally can’t get response time, busy time, utilization, backlog, etc.
      • You can aggregate thread states, thread times, users, databases, query abstracts...
  28. NAGIOS IS BEST AT LIVING IN THE MOMENT
  29. THOU SHALT NOT
      • Cache hit ratios
          • Thread cache hit ratio
          • Buffer pool cache hit ratio
          • Table cache hit ratio
          • Key cache hit ratio
          • Query cache hit ratio
      • Rates of “bad” queries
          • % temp tables on disk
          • % full table scans
          • % slow queries
      • Unfixable things
          • Replication delay
  30. WHY NOT?
      • Those are properties of the workload and application
      • They are not conditions to alert/warn about
      • They are not fixable / actionable in the service
  31. ALERTS ARE BETTER TOGETHER
  32. QUESTION: WHAT IS BETTER?
  33. №1 ALERT!!!!! Disk CRIT 100% /dev/sda2
  34. №2 ALERT!!!!! Replication CRIT Slave I/O Thread No
  35. №3 ALERT!!!!! Replication CRIT Slave SQL Thread No
  36. №4 ALERT!!!!! Replication CRIT Seconds_Behind_Master NULL
  37. №5 ALERT!!!!! MySQL CRIT oldest transaction: 86400 seconds
  38. - OR -
  39. №1 ALERT!!!!! CRIT
      * Disk /dev/sda2 full
      * Replication stopped
      * Oldest transaction 86400 seconds
      * 4999 threads in status “Waiting for table metadata lock”
  40. HOLLER AT ME: QUESTIONS? @xaprb / baron@vividcortex.com
  41. RESOURCES
      • Chapter 3 of High Performance MySQL, 3rd Edition
      • Percona White Papers
      • Causes of Downtime in Production MySQL Servers
      • Preventing MySQL Emergencies
      • Goal-Driven Performance Optimization
      • Forecasting MySQL Scalability with the Universal Scalability Law
      • Method R: Optimizing Oracle Performance, Cary Millsap
      • The Goal, Eli Goldratt
      • The USE Method (Brendan Gregg) & his new book
      • Guerrilla Capacity Planning, Neil J. Gunther
      • Fundamental Performance & Scalability Instrumentation
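
Slides 33 through 39 contrast five separate alerts with one combined alert. The slides do not say how the combination is done, so the following is only a minimal sketch of one way to fold several check results into a single message. The `CheckResult` type, the `combine_alerts` function, and the severity ordering are all hypothetical names invented for illustration, loosely modeled on Nagios-style OK/WARN/CRIT states.

```python
from dataclasses import dataclass

# Assumed severity ordering, modeled on Nagios-style states.
SEVERITY = {"OK": 0, "WARN": 1, "CRIT": 2}


@dataclass
class CheckResult:
    name: str    # hypothetical check name, e.g. "Disk"
    status: str  # one of the SEVERITY keys
    detail: str  # human-readable description of what was observed


def combine_alerts(results):
    """Fold all non-OK results into one message; return None if everything is OK."""
    problems = [r for r in results if SEVERITY[r.status] > 0]
    if not problems:
        return None
    # The combined alert takes the worst severity among its parts.
    worst = max(problems, key=lambda r: SEVERITY[r.status]).status
    lines = [f"ALERT {worst}"]
    lines += [f" * {r.detail}" for r in problems]
    return "\n".join(lines)


results = [
    CheckResult("Disk", "CRIT", "Disk /dev/sda2 full"),
    CheckResult("Replication", "CRIT", "Replication stopped"),
    CheckResult("Transactions", "CRIT", "Oldest transaction 86400 seconds"),
    CheckResult("Connections", "OK", "Able to connect and run a query"),
]
message = combine_alerts(results)
print(message)
```

The point of the grouping is that the reader sees the related symptoms side by side (full disk, stopped replication, a day-old transaction) instead of receiving them as unrelated pages.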

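Slide 7 recommends measuring an SLA as a percentile response time per task, over observation intervals. As a rough sketch of what that computation could look like, the code below buckets response-time samples into fixed intervals and takes the 95th percentile of each bucket. The function names, the 60-second interval, and the nearest-rank percentile definition are assumptions made for illustration; the slides do not prescribe any of them.

```python
import math
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def p95_per_interval(samples, interval_seconds=60):
    """samples: iterable of (unix_timestamp, response_time_ms) pairs for one task."""
    buckets = defaultdict(list)
    for ts, rt in samples:
        buckets[int(ts // interval_seconds)].append(rt)
    # Key each result by the start time of its observation interval.
    return {b * interval_seconds: percentile(rts, 95) for b, rts in sorted(buckets.items())}


# Made-up sample data: 20 requests in the first minute, 3 in the second.
samples = [(i, 10 * (i + 1)) for i in range(20)] + [(60, 5), (61, 500), (62, 7)]
print(p95_per_interval(samples))
```

An SLA check would then compare each interval's value against a target, which gives a threshold on the work getting done rather than on a server counter.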