HBaseCon 2013: OpenTSDB at Box

  1. OpenTSDB at Box #HBaseCon2013 Jonathan Creasy Geoffrey Anderson @geodbz
  2. Jonathan Creasy • SysAdmin @ Box, Inc. • Hadoop for Analytics
  3. Geoffrey Anderson • DBA @ Box, Inc. • Tooling for MySQL and HBase • #DBHangOps
  4. The Situation
  5. These are problematic because... • Storing: RRDs, flat files • Pre-defined: graphs, data to collect • Poll model
  6. Enter OpenTSDB
  7. OpenTSDB is... • Distributed • Scalable • Time Series Database • Runs on HBase • Created by Benoit Sigoure (Diagram labels: mydb.example.com, fe1.example.com, Push Metrics, HAProxy, TSD for Storing, HBase, TSD for Querying, Query via API)
  8. Why OpenTSDB? • FAST • EASY to Scale • EASY to Populate • EASY to collect data • EASY to Query
  9. Collecting Data
  10. Example: mysql_collector.sh

        #!/usr/bin/env bash
        # One timestamp (seconds since the epoch) for the whole run
        timestamp=$(date +%s)
        # Emit each status variable as: metric timestamp value tag
        mysql -ss -e "SHOW GLOBAL STATUS" | while read var val
        do
          echo "mysql.$var $timestamp $val host=$HOSTNAME"
        done

        ganderson@mydb.example.com:~$ ./mysql_collector.sh
        mysql.Aborted_connects 1366399993 0 host=mydb.example.com
        mysql.Binlog_cache_disk_use 1366399993 0 host=mydb.example.com
        mysql.Binlog_cache_use 1366399993 0 host=mydb.example.com
        mysql.Binlog_stmt_cache_disk_use 1366399993 0 host=mydb.example.com
        mysql.Binlog_stmt_cache_use 1366399993 0 host=mydb.example.com
        mysql.Bytes_received 1366399993 19453687 host=mydb.example.com
        mysql.Bytes_sent 1366399993 1238166682 host=mydb.example.com
        mysql.Com_admin_commands 1366399993 1 host=mydb.example.com
        mysql.Com_assign_to_keycache 1366399993 0 host=mydb.example.com
        ...

  11. Example: mysql_collector.sh (same script and output as slide 10), with the fields of each output line labeled: Metric name, Timestamp, Value, “Tags” (key=val)
  12. Example: adding a cron for OpenTSDB

        * * * * * mysql_collector.sh | nc opentsdb.example.com 4242
  13. ganderson@mydb.example.com:tcollector$ tree

        .
        |-- collectors
        |   |-- 0                        <- Run forever
        |   |   |-- ifstat.py
        |   |   |-- iostat.py
        |   |   |-- procnettcp.py
        |   |   |-- procstats.py
        |   |-- 15                       <- Run every 15 seconds
        |   |   `-- dfstat.py
        |   |-- 30                       <- Run every 30 seconds
        |   |   |-- mysql_collector.sh
        |   |-- 300                      <- Run every 5 minutes
        |   |   `-- ptTcpModel.sh
        |   `-- etc
        |       |-- config.py
        |-- config
        |-- startstop
        `-- tcollector.py
  14. Querying Data
  15. http://opentsdb.example.com/#start=2013/06/05-17:00:00
        &end=2013/06/05-19:00:00
        &m=sum:hadoop.hbase.regionserver.requests{server_type=dwh-data}
        &o=axis x1y1
        &m=sum:proc.stat.cpu.percentage_iowait{server_type=dwh-data,dc=lv7,host=data08}
        &o=axis x1y2
        &ylabel=HBase Requests
        &y2label=CPU IOWait
        &yrange=[0:]
        &wxh=1475x600
  16. http://opentsdb.example.com/q?start=2013/06/05-17:00:00
        &end=2013/06/05-19:00:00
        &m=sum:hadoop.hbase.regionserver.requests{server_type=dwh-data}
        &o=axis x1y1
        &m=sum:proc.stat.cpu.percentage_iowait{server_type=dwh-data,dc=lv7,host=data08}
        &o=axis x1y2
        &ylabel=HBase Requests
        &y2label=CPU IOWait
        &yrange=[0:]
        &wxh=1475x600
        &ascii
  17. How does this change things?
  18. In all seriousness, though... • Easily see aggregate graphs • Easily build graphs on-the-fly • Full granularity forever • API request for raw data • Cluster-wide nagios checks with check_tsd
  19. Challenges Switching • Aggregates are the default • Mouse-zooming (patched!) • Auto-suggest for metrics • “The graphs aren’t pretty” • Migrating from proof of concept • Plan for 5+ machines • Data pruning may be required
  20. Some Quick Numbers OpenTSDB @ Box • 24,448 metrics • 79 tag keys • 5,371,701 tag values • 150,000 data points per second
  21. Collection Philosophy: To store metric data for anything that is measurable
  22. Next Steps
  23. Enjoy #Hbasecon2013! https://www.box.com/about-us/careers/ jcreasy@box.com geoff@box.com We’re Hiring!
  24. Image credits • http://upload.wikimedia.org/wikipedia/commons/7/7b/Batelco_Network_Operations_Centre_(NOC).JPG • http://www.flickr.com/photos/hoyvinmayvin/5873697252/ • http://www.percona.com/doc/percona-monitoring-plugins • http://www.2cto.com/uploadfile/2012/0731/20120731112415744.jpg • http://media.tumblr.com/tumblr_lvfspoenWU1qi19a2.png • http://img.izismile.com/img/img4/20110527/640/you_can_be_a_superhero_640_01.jpg • http://openclipart.org/image/250px/svg_to_png/26427/Anonymous_notebook.png • http://images.alphacoders.com/768/2560-1600-76893.jpg • http://www.flickr.com/photos/in365/4861180503/ • http://openclipart.org/image/250px/svg_to_png/130915/Prohibido_3D.png • http://www.flickr.com/photos/61114149@N02/5566484951/ • http://opentsdb.net/img/tsd-sample.png • http://images2.wikia.nocookie.net/__cb20080911160202/bttf/images/5/57/WhatdidItellyou-HQ.jpg • http://www.flickr.com/photos/lisakayaks/3028350539/ • http://www.flickr.com/photos/25566302@N00/1472400115 • http://www.flickr.com/photos/grandmaitre/5846058698/ • http://www.flickr.com/photos/7518432@N06/2673347604/
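
The data point layout labeled on slide 11 (metric name, timestamp, value, tags) and the push to port 4242 on slide 12 can be sketched in Python as follows. This is a minimal illustration rather than the speakers' code: the helper names `format_point`, `to_put_command`, and `send_points` are hypothetical. It also assumes that the TSD's line-based interface expects every data point to begin with the `put` command, a prefix that tcollector adds on its own but that raw collector output piped through `nc` would not carry.

```python
import socket
import time


def format_point(metric, value, tags, timestamp=None):
    """Return one data point as 'metric timestamp value key=val ...'."""
    if not tags:
        # OpenTSDB requires at least one tag on every data point.
        raise ValueError("at least one tag is required")
    if timestamp is None:
        timestamp = int(time.time())  # seconds since the epoch, like `date +%s`
    tag_text = " ".join(f"{key}={val}" for key, val in sorted(tags.items()))
    return f"{metric} {timestamp} {value} {tag_text}"


def to_put_command(point_line):
    """Prefix a formatted data point with 'put' (assumed TSD line protocol)."""
    return f"put {point_line}\n"


def send_points(point_lines, host, port=4242):
    """Send formatted data points to a TSD over a plain TCP connection."""
    payload = "".join(to_put_command(line) for line in point_lines)
    with socket.create_connection((host, port), timeout=10) as sock:
        sock.sendall(payload.encode("ascii"))


# Rebuild one of the lines shown on slide 10.
example = format_point(
    "mysql.Bytes_sent",
    1238166682,
    {"host": "mydb.example.com"},
    timestamp=1366399993,
)
```

`send_points` is not called here because it needs a reachable TSD; with the placeholder host from slide 12 the call would look like `send_points([example], "opentsdb.example.com")`.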

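The text query on slide 16 (the graph URL with `q?` in place of `#` and `&ascii` appended) could be assembled programmatically along these lines. This is a sketch under assumptions: `build_query_url` is a hypothetical helper, only the `start`, `end`, and `m` parameters from the slide are covered, and it assumes the TSD accepts percent-encoded parameter values as any standard HTTP server would.

```python
from urllib.parse import urlencode


def build_query_url(base_url, start, end, metrics, ascii_output=True):
    """Build a /q URL.

    metrics is a list of (aggregator, metric_name, tags_dict) tuples,
    one per 'm=' parameter on the slide.
    """
    params = [("start", start), ("end", end)]
    for aggregator, metric, tags in metrics:
        spec = f"{aggregator}:{metric}"
        if tags:
            tag_text = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
            spec += "{" + tag_text + "}"
        params.append(("m", spec))
    query = urlencode(params)  # percent-encodes '/', ':', '{', '}' and '='
    if ascii_output:
        query += "&ascii"  # bare flag, as on slide 16
    return f"{base_url}/q?{query}"


url = build_query_url(
    "http://opentsdb.example.com",
    "2013/06/05-17:00:00",
    "2013/06/05-19:00:00",
    [("sum", "hadoop.hbase.regionserver.requests", {"server_type": "dwh-data"})],
)
```

Fetching such a URL and comparing two time windows is one way the week-over-week tooling mentioned in the editor's notes might be built.
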
Editor's Notes

  1. Will be talking about OpenTSDB. How OpenTSDB changed monitoring at Box.
  2. Running Ganglia. Get pushed metrics. Have to define aggregates. RRD format.
  3. Cacti has an easy centralized interface. Lots of templates accessible. Uses polling model.
  4. Graphite! Get pushed metrics from various services. Need to define the graphs you want. Need to define aggregations.
  5. RRDs auto-downsample. Flat files can be hard to manage at scale. Pre-defined: need to know what you want to monitor; painful to set up new collections/graphs. Poll: doesn’t scale horizontally well; falls behind and data gets dropped; occasionally drop important metrics to make it catch up. http://monami.sourceforge.net/gfx/ganglia.png http://www.cacti.net/images/cacti_promo_main.png http://graphite.wdfiles.com/local--files/screen-shots/graphite_cli_800.png http://nagios.sourceforge.net/images/screens/new/service-detail.png
  6. Suddenly finding problems and correlating issues is difficult. Maybe you don’t have a NOC yet. Maybe you do, and they need better graphs.
  7. IT’S BIGGER ON THE INSIDE (just kidding). Fast! Easy to build graphs on the fly. Hella easy to scale: just add nodes (HBase or TSDs). Very easy to put data into it. NEXT SLIDES TALK ABOUT THIS YO
  8. Running threads follows the CPU spikes PERFECTLY. Box has a “long query” killer that gets more aggressive as more threads stack up. Should get a look at queries on the server.
  9. Running threads follows the CPU spikes PERFECTLY. Box has a “long query” killer that gets more aggressive as more threads stack up. Should get a look at queries on the server.
  10. Running threads follows the CPU spikes PERFECTLY. Box has a “long query” killer that gets more aggressive as more threads stack up. Should get a look at queries on the server.
  11. If you prefer text, that’s also an option via API. You can build cool tools using the API: week-over-week graphs; simplifies anomaly detection. URL is pretty simple: effectively just use “q?” and add “&ascii”.
  12. Aggregates are the default: a shift in thinking from looking at specific important servers. Zooming in on a time slice was painfully manual; I wrote up a patch to add mouse-zooming and upstreamed it. This cemented OpenTSDB as a powerful monitoring tool for Box, overnight. Auto-suggest for metrics is spotty; we wrote a quick cron job that dumps the full metric list into JSON. “Graphs aren’t pretty”: a few changes to the base GNUPlot options solved this. There’s also a “Smooth” option in the interface now. Migrating from POC: we had a single-node setup for the longest time until that fell over... a lot. Plan for 3+ machines: it’s enough to run all the needed bits for a light-weight distributed HBase and TSD setup. Data pruning: ~4 bytes per metric before HDFS replication add up quickly. mysql_tcollector: 370 metrics, ~1.5k per server, x 30s interval = ~4.2MB/day. Either have a plan to prune old data, or build out extra capacity and predict storage needs per server/metric added.
  13. New metrics available with every code push
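
The storage estimate in note 12 can be checked with a few lines of arithmetic. This sketch takes its inputs from the note (370 metrics, about 4 bytes per data point, a 30-second interval). The function name and the replication factor of 3 are assumptions made for illustration and are not figures from the talk.

```python
SECONDS_PER_DAY = 86400


def daily_bytes_per_server(metric_count, bytes_per_point, interval_seconds,
                           replication=1):
    """Estimate bytes stored per server per day for one collector."""
    collections_per_day = SECONDS_PER_DAY // interval_seconds
    return metric_count * bytes_per_point * collections_per_day * replication


# Figures from note 12: 370 metrics x ~4 bytes is about 1.5k per collection.
per_collection = 370 * 4

# Before HDFS replication: 2880 collections per day at a 30s interval.
before_replication = daily_bytes_per_server(370, 4, 30)

# Assumed replication factor of 3 (hypothetical, not stated in the talk).
after_replication = daily_bytes_per_server(370, 4, 30, replication=3)
```

Before replication this comes to a little over 4 MB per server per day, in line with the note's rough figure, and it scales linearly with the number of servers, metrics, and replicas.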