Monitoring MySQL with OpenTSDB

A monitoring system is arguably the most crucial system to have in place when administering and tuning the performance of any database system. DBAs also find themselves choosing among a variety of monitoring systems and plugins, ranging from small cron scripts to complex data collection systems. In this talk, I'll discuss how Box shifted from the Cacti monitoring system and assorted shell scripts to OpenTSDB, and the changes we made to our servers and to our daily interaction with monitoring to become more agile in identifying and addressing changes in database behavior.

  • This talk covers OpenTSDB, how OpenTSDB changed monitoring at Box, and how we leverage its abilities for day-to-day management of MySQL databases.
  • You probably have the Percona Cacti graphs and monitoring plugins.
  • You add some other Nagios checks for fun edge cases.
  • And you use different tools from the Percona Toolkit, like pt-stalk, the Poor Man's Profiler (PMP), and pt-query-digest.
  • Suddenly, finding problems and correlating issues is difficult. Maybe you don't have a NOC yet. Maybe you do, and they need better graphs.
  • "It's bigger on the inside" (just kidding). Fast! Easy to build graphs on the fly. Hella easy to scale: just add nodes (HBase or TSDs). Very easy to put data into it; the next slides talk about this.
  • Running threads follows the CPU spikes perfectly. Box has a "long query" killer that gets more aggressive as more threads stack up. We should get a look at the queries on the server.
  • Zoom in to get the exact time interval.
  • Now we know the exact time of a high stack-up. Go check Box Anemometer to see what query is there.
  • This is the URL for that graph. You can paste it to anyone so they see the same interactive graph.
  • If you prefer text, that's also an option via the API. You can build cool tools using the API: week-over-week graphs, simpler anomaly detection. The URL is pretty simple: effectively, just use "q?" and add "&ascii".
  • Get an audit log: logins, types of statements issued, etc.
  • Get performance information about row and index change activity and row read activity.
  • Generate daily reports: Are auto-increment columns nearing a boundary on a table? How many records are in a table? How large is a table's datafile?
  • Using pt-tcp-model. It allows us to identify when a server stops doing work. 5-minute interval.
  • Aggregate graphs are the default. Drill down only when there are problems in the aggregate.
  • Aggregates are the default: a shift in thinking from looking at specific important servers.
  • Zooming in on a time slice was painfully manual. I wrote a patch to add mouse-zooming and upstreamed it. This cemented OpenTSDB as a powerful monitoring tool for Box overnight.
  • Auto-suggest for metrics is spotty. We wrote a quick cron job that dumps the full metric list into JSON.
  • "Graphs aren't pretty": a few changes to the base gnuplot options solved this. There's also a "Smooth" option in the interface now.
  • Migrating from POC: we had a single-node setup for the longest time, until that fell over... a lot.
  • Plan for 3+ machines: that's enough to run all the needed bits for a lightweight distributed HBase and TSD setup.
  • Data pruning: ~4 bytes per metric sample (before HDFS replication) adds up quickly. mysql_tcollector collects 370 metrics, or ~1.5 KB per server per collection; at a 30-second interval that is ~4.2 MB/day per server. Either have a plan to prune old data, or build out extra capacity and predict storage needs per server/metric added.
  • Monitoring MySQL with OpenTSDB
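
The storage figures in the data-pruning note above can be checked with a little arithmetic. This is a minimal sketch, not anything from the talk: `bytes_per_day` is a hypothetical helper, and the optional replication factor is an assumption (the notes quote sizes before HDFS replication; 3 is the usual HDFS default).

```shell
#!/usr/bin/env bash
# Rough per-server storage estimate using the figures from the notes.
# Usage: bytes_per_day METRICS BYTES_PER_SAMPLE INTERVAL_SECONDS [REPLICATION]
bytes_per_day() {
    local metrics=$1 bytes_per_sample=$2 interval=$3
    local replication=${4:-1}                 # 1 = before HDFS replication
    local samples_per_day=$(( 86400 / interval ))
    echo $(( metrics * bytes_per_sample * samples_per_day * replication ))
}

# 370 metrics, ~4 bytes each, collected every 30 seconds
bytes_per_day 370 4 30
```

With the numbers from the notes this prints 4262400, in line with the quoted ~4.2 MB/day per server; passing 3 as the fourth argument shows the same figure after three-way replication.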

    1. Monitoring MySQL with OpenTSDB. Percona Live 2013. Geoffrey Anderson, Box Inc. @geodbz
    2. Who: Geoffrey Anderson • Database Operations Engineer @ Box, Inc. • a.k.a. DBA • Tooling for MySQL and HBase • #DBHangOps
    3. The Situation
    4. Then You Get More Servers
    5. Enter OpenTSDB
    6. OpenTSDB is... • Distributed • Scalable • Time Series Database • Runs on HBase • Created by Benoit Sigoure. Diagram labels: HBase, TSD for Querying, mydb.example.com, HAProxy, fe1.example.com, TSD for Storing, Push Metrics, Query via API.
    7. Why OpenTSDB? • FAST • EASY to Scale • EASY to Populate • EASY to collect data • EASY to Query
    8. Collecting Data
    9. Example: mysql_collector.sh

           #!/usr/bin/env bash
           # Emit every SHOW GLOBAL STATUS counter as an OpenTSDB data point.
           timestamp=$(date +%s)
           mysql -ss -e "SHOW GLOBAL STATUS" | while read -r var val
           do
               # Format: metric name, timestamp, value, tags (key=val)
               echo "mysql.$var $timestamp $val host=$HOSTNAME"
           done

       Sample run:

           ganderson@mydb.example.com:~$ ./mysql_collector.sh
           mysql.Aborted_connects 1366399993 0 host=mydb.example.com
           mysql.Binlog_cache_disk_use 1366399993 0 host=mydb.example.com
           mysql.Binlog_cache_use 1366399993 0 host=mydb.example.com
           mysql.Binlog_stmt_cache_disk_use 1366399993 0 host=mydb.example.com
           mysql.Binlog_stmt_cache_use 1366399993 0 host=mydb.example.com
           mysql.Bytes_received 1366399993 19453687 host=mydb.example.com
           mysql.Bytes_sent 1366399993 1238166682 host=mydb.example.com
           mysql.Com_admin_commands 1366399993 1 host=mydb.example.com
           mysql.Com_assign_to_keycache 1366399993 0 host=mydb.example.com
           ...
    10. Example: mysql_collector.sh (same script and output as the previous slide, now annotated). Each output line consists of: Metric name, Timestamp, Value, "Tags" (key=val).
    11. Example: adding a cron for OpenTSDB

            * * * * * mysql_collector.sh | nc opentsdb.example.com 4242
    12. ganderson@mydb.example.com:tcollector$ tree

            .
            |-- collectors
            |   |-- 0                        <- Run forever
            |   |   |-- ifstat.py
            |   |   |-- iostat.py
            |   |   |-- procnettcp.py
            |   |   |-- procstats.py
            |   |-- 15                       <- Run every 15 seconds
            |   |   `-- dfstat.py
            |   |-- 30                       <- Run every 30 seconds
            |   |   |-- mysql_collector.sh
            |   |-- 300                      <- Run every 5 minutes
            |   |   `-- ptTcpModel.sh
            |   `-- etc
            |       |-- config.py
            |-- config
            |-- startstop
            `-- tcollector.py
    13. Querying Data
    14. http://opentsdb.example.com/#start=2013/04/10-07:32:29&end=2013/04/10-07:57:57&m=sum:proc.stat.cpu.percentage_idle{host=db22}&o=axis x1y1&m=sum:db.threads_running{host=db22}&o=axis x1y2&ylabel=CPU idle&y2label=Threads Running&yrange=[0:]&wxh=1475x600&png
    15. http://opentsdb.example.com/q?start=2013/04/10-07:32:29&end=2013/04/10-07:57:57&m=sum:proc.stat.cpu.percentage_idle{host=db22}&o=axis x1y1&m=sum:db.threads_running{host=db22}&o=axis x1y2&ylabel=CPU idle&y2label=Threads Running&yrange=[0:]&ascii
    16. Leveraging OpenTSDB For MySQL
    17. user_statistics monitoring
    18. table_statistics monitoring
    19. Table Info from I_S

            SELECT *, DATA_LENGTH + INDEX_LENGTH AS TOTAL_LENGTH
            FROM INFORMATION_SCHEMA.TABLES
            WHERE TABLE_SCHEMA NOT IN ('performance_schema', 'information_schema')
    20. Query Throughput
    21. And other "common" metrics • Various MySQL status counters: QPS (questions), threads connected, temporary tables on disk, etc. • Various server statistics: %CPU idle, free disk space, I/O utilization, network traffic, etc.
    22. Future collectors • pt-query-digest / MySQL slow query statistics • Data from "SHOW ENGINE INNODB STATUS" (that is missing from counters) • PERFORMANCE_SCHEMA (MySQL 5.6+) • Query statistics • Processlist information • Background thread information
    23. How does this change things?
    24. In all seriousness, though... • Easily see aggregate graphs • Easily build graphs on-the-fly • Full granularity forever • API request for raw data • Cluster-wide Nagios checks with check_tsd
    25. Challenges Switching • Aggregates are the default • Mouse-zooming (patched!) • Auto-suggest for metrics • "The graphs aren't pretty" • Migrating from proof of concept • Plan for 3+ machines • Data pruning may be required
    26. Some Quick Numbers: OpenTSDB @ Box • 21,294 metrics • 72 tag keys • 5,145,745 tag values • 90% of interactive graphs return in <300 ms
    27. Next Steps
    28. Enjoy #PerconaLive 2013. We're hiring! https://www.box.com/about-us/careers/ geoff@box.com
    29. Image credits: http://upload.wikimedia.org/wikipedia/commons/7/7b/Batelco_Network_Operations_Centre_(NOC).JPG http://www.flickr.com/photos/hoyvinmayvin/5873697252/ http://www.percona.com/doc/percona-monitoring-plugins http://www.2cto.com/uploadfile/2012/0731/20120731112415744.jpg http://media.tumblr.com/tumblr_lvfspoenWU1qi19a2.png http://img.izismile.com/img/img4/20110527/640/you_can_be_a_superhero_640_01.jpg http://openclipart.org/image/250px/svg_to_png/26427/Anonymous_notebook.png http://images.alphacoders.com/768/2560-1600-76893.jpg http://www.flickr.com/photos/in365/4861180503/ http://openclipart.org/image/250px/svg_to_png/130915/Prohibido_3D.png http://www.flickr.com/photos/61114149@N02/5566484951/ http://opentsdb.net/img/tsd-sample.png http://images2.wikia.nocookie.net/__cb20080911160202/bttf/images/5/57/WhatdidItellyou-HQ.jpg http://www.flickr.com/photos/lisakayaks/3028350539/ http://www.flickr.com/photos/25566302@N00/1472400115 http://www.flickr.com/photos/grandmaitre/5846058698/ http://www.flickr.com/photos/7518432@N06/2673347604/
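
The collector and cron examples on slides 9 to 11 can be tightened in two ways, sketched here under stated assumptions. First, SHOW GLOBAL STATUS also returns non-numeric values (ON/OFF flags, empty strings), which are not valid data point values, so the sketch filters them out. Second, when writing straight to a TSD with nc, the TSD's telnet-style interface expects each line to start with `put`; tcollector adds that prefix itself, so collectors run under tcollector should not. The function names `status_to_datapoints` and `to_put_commands` are hypothetical, not part of tcollector or OpenTSDB.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical.

# Read "Variable_name<whitespace>Value" pairs on stdin (the shape printed by
# `mysql -ss -e "SHOW GLOBAL STATUS"`) and print one data point per numeric
# value: "mysql.<name> <timestamp> <value> host=<host>".
# Usage: status_to_datapoints TIMESTAMP HOST
status_to_datapoints() {
    local timestamp=$1 host=$2 var val
    local numeric='^-?[0-9]+(\.[0-9]+)?$'
    while read -r var val; do
        # Skip values such as ON/OFF or empty strings.
        if [[ $val =~ $numeric ]]; then
            echo "mysql.$var $timestamp $val host=$host"
        fi
    done
}

# Prefix each data point with "put " for direct delivery to a TSD.
to_put_commands() {
    sed 's/^/put /'
}
```

A cron-driven pipeline in the spirit of slide 11 might then look like `mysql -ss -e "SHOW GLOBAL STATUS" | status_to_datapoints "$(date +%s)" "$HOSTNAME" | to_put_commands | nc opentsdb.example.com 4242`, where the host name and port are the ones used on the slide.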

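
The "q?" plus "&ascii" form shown on slide 15 lends itself to small scripts. The following is a rough sketch rather than anything from the talk: `build_ascii_query_url` and `peak_datapoint` are hypothetical helpers, and the second assumes the ASCII output has one data point per line in the order metric, timestamp, value, tags.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical.

# Build a single-metric ASCII query URL, filtered by host tag.
# Usage: build_ascii_query_url BASE_URL START END METRIC HOST
build_ascii_query_url() {
    local base=$1 start=$2 end=$3 metric=$4 host=$5
    echo "${base}/q?start=${start}&end=${end}&m=sum:${metric}{host=${host}}&ascii"
}

# Read ASCII data points on stdin and print "<timestamp> <value>" for the
# largest value seen, e.g. to find when threads stacked up the most.
peak_datapoint() {
    awk 'NR == 1 || $3 + 0 > max { max = $3 + 0; ts = $2 }
         END { if (NR > 0) print ts, max }'
}
```

One way to use it is `curl -sg "$(build_ascii_query_url http://opentsdb.example.com 2013/04/10-07:32:29 2013/04/10-07:57:57 db.threads_running db22)" | peak_datapoint`; the `-g` flag stops curl from treating the braces in the URL as a glob pattern.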