HBaseCon 2013: OpenTSDB at Box
 

Presented by: Jonathan Creasy (Box) and Geoffrey Anderson (Box)


  • Will be talking about OpenTSDB. How OpenTSDB changed monitoring at Box.
  • Running Ganglia. Gets pushed metrics. Have to define aggregates. RRD format.
  • Cacti has an easy centralized interface. Lots of templates accessible. Uses polling model.
  • Graphite! Gets pushed metrics from various services. Need to define the graphs you want. Need to define aggregations.
  • RRDs auto-downsample. Flat files can be hard to manage at scale.
    Pre-defined: need to know what you want to monitor. Painful to set up new collections/graphs.
    Poll: doesn't scale horizontally well. Falls behind and data gets dropped. Occasionally drop important metrics to make it catch up.
    http://monami.sourceforge.net/gfx/ganglia.png
    http://www.cacti.net/images/cacti_promo_main.png
    http://graphite.wdfiles.com/local--files/screen-shots/graphite_cli_800.png
    http://nagios.sourceforge.net/images/screens/new/service-detail.png
  • Suddenly, finding problems and correlating issues is difficult. Maybe you don't have a NOC yet. Maybe you do, and they need better graphs.
  • IT'S BIGGER ON THE INSIDE (just kidding). Fast! Easy to build graphs on the fly. Hella easy to scale: just add nodes (HBase or TSDs). Very easy to put data into it (the next slides talk about this).
  • Running threads follows the CPU spikes PERFECTLY. Box has a "long query" killer that gets more aggressive as more threads stack up. Should get a look at queries on the server.
  • If you prefer text, that's also an option via the API. You can build cool tools using the API: week-over-week graphs, which simplify anomaly detection. The URL is pretty simple: effectively just use "q?" and add "&ascii".
  • Aggregates are the default: a shift in thinking from looking at specific important servers.
    Zooming in on a time slice was painfully manual. I wrote up a patch to add mouse-zooming and upstreamed it. This cemented OpenTSDB as a powerful monitoring tool for Box, overnight.
    Auto-suggest for metrics is spotty. We wrote a quick cron job that dumps the full metric list into JSON.
    "Graphs aren't pretty": a few changes to the base Gnuplot options solved this. There's also a "Smooth" option in the interface now.
    Migrating from POC: we had a single-node setup for the longest time until that fell over... a lot.
    Plan for 3+ machines: it's enough to run all the needed bits for a lightweight distributed HBase and TSD setup.
    Data pruning: ~4 bytes per metric before HDFS replication adds up quickly. mysql_tcollector: 370 metrics is ~1.5k per server per collection; at a 30s interval that is ~4.2MB/day. Either have a plan to prune old data, or build out extra capacity and predict storage needs per server/metric added.
  • New metrics available with every code push
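The storage estimate in the data-pruning note above can be reproduced with a few lines of arithmetic. This is a minimal sketch, assuming the note's figure of roughly 4 bytes per data point and a 30-second collection interval; the function names and the `replication` parameter are hypothetical and only illustrate how the per-server number scales across a fleet.

```python
SECONDS_PER_DAY = 86_400


def bytes_per_server_per_day(metric_count, bytes_per_point=4, interval_seconds=30):
    """Estimate raw bytes written per server per day, before HDFS replication."""
    collections_per_day = SECONDS_PER_DAY // interval_seconds
    return metric_count * bytes_per_point * collections_per_day


def fleet_bytes_per_day(server_count, metric_count, replication=1, **kwargs):
    """Scale the per-server estimate by server count and a replication factor."""
    per_server = bytes_per_server_per_day(metric_count, **kwargs)
    return server_count * replication * per_server


if __name__ == "__main__":
    # 370 metrics from the MySQL collector, using the figures from the note.
    per_server = bytes_per_server_per_day(370)
    print(f"{per_server / 1e6:.2f} MB/day per server")
```

With 370 metrics this comes to 4,262,400 bytes per day, which matches the ~4.2MB/day in the note. Passing a `replication` value greater than 1 gives a rough on-disk figure once HDFS replication is included.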

HBaseCon 2013: OpenTSDB at Box Presentation Transcript

  • OpenTSDB at Box #HBaseCon2013 Jonathan Creasy Geoffrey Anderson @geodbz
  • Jonathan Creasy • SysAdmin @ Box, Inc. • Hadoop for Analytics
  • Geoffrey Anderson • DBA @ Box, Inc. • Tooling for MySQL and HBase • #DBHangOps
  • The Situation
  • Storing: RRDs, flat files • Pre-defined: graphs, data to collect • Poll model • These are problematic because...
  • Enter OpenTSDB
  • OpenTSDB is... • Distributed • Scalable • Time Series Database • Runs on HBase • Created by Benoit Sigoure
    [Diagram labels: HBase; TSD for Querying; TSD for Storing; HAProxy; mydb.example.com; fe1.example.com; Push Metrics; Query via API]
  • Why OpenTSDB? • FAST • EASY to Scale • EASY to Populate • EASY to collect data • EASY to Query
  • Collecting Data
  • Example: mysql_collector.sh
        #!/usr/bin/env bash
        # One timestamp (seconds since the epoch) for the whole collection run
        timestamp=$(date +%s)
        # Emit one data point per MySQL status variable
        mysql -ss -e "SHOW GLOBAL STATUS" | while read var val
        do
            echo "mysql.$var $timestamp $val host=$HOSTNAME"
        done
    Sample run:
        ganderson@mydb.example.com:~$ ./mysql_collector.sh
        mysql.Aborted_connects 1366399993 0 host=mydb.example.com
        mysql.Binlog_cache_disk_use 1366399993 0 host=mydb.example.com
        mysql.Binlog_cache_use 1366399993 0 host=mydb.example.com
        mysql.Binlog_stmt_cache_disk_use 1366399993 0 host=mydb.example.com
        mysql.Binlog_stmt_cache_use 1366399993 0 host=mydb.example.com
        mysql.Bytes_received 1366399993 19453687 host=mydb.example.com
        mysql.Bytes_sent 1366399993 1238166682 host=mydb.example.com
        mysql.Com_admin_commands 1366399993 1 host=mydb.example.com
        mysql.Com_assign_to_keycache 1366399993 0 host=mydb.example.com
        ...
    Each output line is: metric name, timestamp, value, "tags" (key=val)
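The same line format can be produced from Python, which is the language of most of the tcollector collectors shown later in the deck. This is a minimal sketch, not the collector used at Box: `format_datapoint` and `status_to_lines` are hypothetical names, and skipping non-numeric status values is an assumption added here (the shell script on the slide emits every row as-is).

```python
import time


def format_datapoint(metric, timestamp, value, tags):
    """Return one line in the 'metric timestamp value key=val ...' layout."""
    tag_text = " ".join(f"{key}={val}" for key, val in sorted(tags.items()))
    return f"{metric} {int(timestamp)} {value} {tag_text}"


def status_to_lines(status_rows, host, timestamp=None, prefix="mysql"):
    """Turn (variable, value) pairs into data-point lines.

    Rows whose value does not parse as a number are skipped (an assumption
    of this sketch, since a time series value has to be numeric).
    """
    if timestamp is None:
        timestamp = time.time()
    lines = []
    for name, value in status_rows:
        try:
            float(value)
        except ValueError:
            continue  # e.g. status variables that hold "ON"/"OFF" or text
        lines.append(
            format_datapoint(f"{prefix}.{name}", timestamp, value, {"host": host})
        )
    return lines


if __name__ == "__main__":
    rows = [("Aborted_connects", "0"), ("Bytes_sent", "1238166682")]
    for line in status_to_lines(rows, host="mydb.example.com"):
        print(line)
```

In a real collector the `rows` list would come from the output of `SHOW GLOBAL STATUS`; it is hard-coded here so the sketch runs without a database.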
  • Example: adding a cron for OpenTSDB
        * * * * * mysql_collector.sh | nc opentsdb.example.com 4242
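The pipe into `nc` on this slide amounts to opening a TCP connection to the TSD and writing the data-point lines to it. The sketch below does the same from Python. Two things here are assumptions rather than something the slide shows: the helper names are hypothetical, and each line is prefixed with `put`, which is the command OpenTSDB's telnet-style interface expects for incoming data points (the slide does not show where that prefix gets added).

```python
import socket


def to_put_commands(lines):
    """Prefix each non-empty data-point line with 'put' and end it with a newline."""
    return "".join(f"put {line.strip()}\n" for line in lines if line.strip())


def send_to_tsd(lines, host="opentsdb.example.com", port=4242, timeout=5.0):
    """Write the data points to a TSD over a plain TCP connection."""
    payload = to_put_commands(lines).encode("ascii")
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(payload)
    return len(payload)


if __name__ == "__main__":
    sample = ["mysql.Aborted_connects 1366399993 0 host=mydb.example.com"]
    # Printed rather than sent, so the sketch runs without a TSD to talk to.
    print(to_put_commands(sample), end="")
```

The host and port defaults are the ones from the cron line on the slide.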
  • ganderson@mydb.example.com:tcollector$ tree
        .
        |-- collectors
        |   |-- 0                        (run forever)
        |   |   |-- ifstat.py
        |   |   |-- iostat.py
        |   |   |-- procnettcp.py
        |   |   |-- procstats.py
        |   |-- 15                       (run every 15 seconds)
        |   |   `-- dfstat.py
        |   |-- 30                       (run every 30 seconds)
        |   |   |-- mysql_collector.sh
        |   |-- 300                      (run every 5 minutes)
        |   |   `-- ptTcpModel.sh
        |   `-- etc
        |       |-- config.py
        |-- config
        |-- startstop
        `-- tcollector.py
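The directory layout on this slide encodes the schedule: the numeric directory name is the interval in seconds, and `0` means the collector is started once and left running. A small sketch of that mapping follows; the function names are hypothetical and this is not tcollector's own code, only an illustration of the convention.

```python
from pathlib import PurePosixPath


def collector_interval(path):
    """Return the interval in seconds taken from the parent directory name.

    Returns None when the parent directory is not numeric (for example 'etc').
    """
    parent = PurePosixPath(path).parent.name
    return int(parent) if parent.isdecimal() else None


def describe_schedule(path):
    """Describe how often a collector at this path would be run."""
    interval = collector_interval(path)
    if interval is None:
        return "not a collector"
    if interval == 0:
        return "run forever"
    return f"run every {interval} seconds"


if __name__ == "__main__":
    for path in (
        "collectors/0/ifstat.py",
        "collectors/15/dfstat.py",
        "collectors/30/mysql_collector.sh",
        "collectors/300/ptTcpModel.sh",
        "collectors/etc/config.py",
    ):
        print(f"{path}: {describe_schedule(path)}")
```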
  • Querying Data
  • http://opentsdb.example.com/#start=2013/06/05-17:00:00
        &end=2013/06/05-19:00:00
        &m=sum:hadoop.hbase.regionserver.requests{server_type=dwh-data}
        &o=axis x1y1
        &m=sum:proc.stat.cpu.percentage_iowait{server_type=dwh-data,dc=lv7,host=data08}
        &o=axis x1y2
        &ylabel=HBase Requests
        &y2label=CPU IOWait
        &yrange=[0:]
        &wxh=1475x600
  • http://opentsdb.example.com/q?start=2013/06/05-17:00:00
        &end=2013/06/05-19:00:00
        &m=sum:hadoop.hbase.regionserver.requests{server_type=dwh-data}
        &o=axis x1y1
        &m=sum:proc.stat.cpu.percentage_iowait{server_type=dwh-data,dc=lv7,host=data08}
        &o=axis x1y2
        &ylabel=HBase Requests
        &y2label=CPU IOWait
        &yrange=[0:]
        &wxh=1475x600
        &ascii
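A query URL like the one on this slide can be assembled programmatically, which is the starting point for tools such as the week-over-week graphs mentioned in the talk. The sketch below builds a `/q` URL with `&ascii` and parses one line of the text response. The function names are hypothetical, only the `start`, `end` and `m` parameters from the slide are handled, and the parser assumes the text output uses the same `metric timestamp value key=val` layout as the input lines.

```python
from urllib.parse import urlencode


def build_query_url(base, start, end, metrics, ascii_output=True):
    """Build a /q URL from (aggregator, metric, tags) triples."""
    params = [("start", start), ("end", end)]
    for aggregator, metric, tags in metrics:
        tag_text = ",".join(f"{key}={val}" for key, val in sorted(tags.items()))
        spec = f"{aggregator}:{metric}"
        if tag_text:
            spec += f"{{{tag_text}}}"
        params.append(("m", spec))
    # Keep the characters used on the slide unescaped so the URL stays readable.
    url = f"{base}/q?{urlencode(params, safe='/:{}=,')}"
    return url + "&ascii" if ascii_output else url


def parse_ascii_line(line):
    """Split one text-format line into (metric, timestamp, value, tags)."""
    metric, timestamp, value, *tag_parts = line.split()
    tags = dict(part.split("=", 1) for part in tag_parts)
    return metric, int(timestamp), float(value), tags


if __name__ == "__main__":
    url = build_query_url(
        "http://opentsdb.example.com",
        "2013/06/05-17:00:00",
        "2013/06/05-19:00:00",
        [("sum", "hadoop.hbase.regionserver.requests", {"server_type": "dwh-data"})],
    )
    print(url)
```

A week-over-week comparison could call `build_query_url` twice with the `start` and `end` values shifted by seven days and compare the parsed values.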
  • How does this change things?
  • In all seriousness, though... • Easily see aggregate graphs • Easily build graphs on-the-fly • Full granularity forever • API request for raw data • Cluster-wide nagios checks with check_tsd
  • Challenges Switching • Aggregates are the default • Mouse-zooming (patched!) • Auto-suggest for metrics • “The graphs aren’t pretty” • Migrating from proof of concept • Plan for 5+ machines • Data pruning may be required
  • Some Quick Numbers OpenTSDB @ Box • 24,448 metrics • 79 tag keys • 5,371,701 tag values • 150,000 data points per second
  • Collection Philosophy: to store metric data for anything that is measurable
  • Next Steps
  • Enjoy #Hbasecon2013! https://www.box.com/about-us/careers/ jcreasy@box.com geoff@box.com We’re Hiring!
  • Image credits • http://upload.wikimedia.org/wikipedia/commons/7/7b/Batelco_Network_Operations_Centre_(NOC).JPG • http://www.flickr.com/photos/hoyvinmayvin/5873697252/ • http://www.percona.com/doc/percona-monitoring-plugins • http://www.2cto.com/uploadfile/2012/0731/20120731112415744.jpg • http://media.tumblr.com/tumblr_lvfspoenWU1qi19a2.png • http://img.izismile.com/img/img4/20110527/640/you_can_be_a_superhero_640_01.jpg • http://openclipart.org/image/250px/svg_to_png/26427/Anonymous_notebook.png • http://images.alphacoders.com/768/2560-1600-76893.jpg • http://www.flickr.com/photos/in365/4861180503/ • http://openclipart.org/image/250px/svg_to_png/130915/Prohibido_3D.png • http://www.flickr.com/photos/61114149@N02/5566484951/ • http://opentsdb.net/img/tsd-sample.png • http://images2.wikia.nocookie.net/__cb20080911160202/bttf/images/5/57/WhatdidItellyou-HQ.jpg • http://www.flickr.com/photos/lisakayaks/3028350539/ • http://www.flickr.com/photos/25566302@N00/1472400115 • http://www.flickr.com/photos/grandmaitre/5846058698/ • http://www.flickr.com/photos/7518432@N06/2673347604/