OpenTSDB at Box
#HBaseCon2013
Jonathan Creasy
Geoffrey Anderson @geodbz
Jonathan Creasy
• SysAdmin @ Box, Inc.
• Hadoop for Analytics
Geoffrey Anderson
• DBA @ Box, Inc.
• Tooling for MySQL and HBase
• #DBHangOps
The Situation
• Storing
  • RRDs
  • Flat files
• Pre-defined
  • Graphs
  • Data to collect
• Poll model
These are problematic because...
Enter OpenTSDB
OpenTSDB is...
• Distributed
• Scalable
• Time Series Database
• Runs on HBase
• Created by Benoit Sigoure
[Architecture diagram. Labels: HBase; TSD for Querying; TSD for Storing; HAProxy; hosts mydb.example.com and fe1.example.com; arrows "Push Metrics" and "Query via API"]
• FAST
• EASY to Scale
• EASY to Populate
• EASY to collect data
• EASY to Query
Why OpenTSDB?
Collecting Data
#!/usr/bin/env bash
# Emit every MySQL global status variable as one OpenTSDB data point per line.
timestamp=$(date +%s)
mysql -ss -e "SHOW GLOBAL STATUS" | while read -r var val
do
    echo "mysql.$var $timestamp $val host=$HOSTNAME"
done
ganderson@mydb.example.com:~$ ./mysql_collector.sh
mysql.Aborted_connects 1366399993 0 host=mydb.example.com
mysql.Binlog_cache_disk_use 1366399993 0 host=mydb.example.com
mysql.Binlog_cache_use 1366399993 0 host=mydb.example.com
mysql.Binlog_stmt_cache_disk_use 1366399993 0 host=mydb.example.com
mysql.Binlog_stmt_cache_use 1366399993 0 host=mydb.example.com
mysql.Bytes_received 1366399993 19453687 host=mydb.example.com
mysql.Bytes_sent 1366399993 1238166682 host=mydb.example.com
mysql.Com_admin_commands 1366399993 1 host=mydb.example.com
mysql.Com_assign_to_keycache 1366399993 0 host=mydb.example.com
...
Example: mysql_collector.sh
Line format: metric name, timestamp, value, “tags” (key=val)
* * * * * mysql_collector.sh | sed 's/^/put /' | nc opentsdb.example.com 4242
Example: adding a cron for OpenTSDB
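When lines are piped straight to a TSD on port 4242, its line-based interface expects each data point as a `put` command with a numeric value and at least one tag. Some MySQL status variables report text such as ON or OFF, which the TSD would reject. The filter below is a minimal sketch of one way to handle both points; the function name `to_put_commands` is hypothetical and not part of OpenTSDB or tcollector.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read "metric timestamp value tag=val ..." lines on stdin,
# keep only lines with a numeric value and at least one tag,
# and prefix each kept line with "put".
to_put_commands() {
    awk 'NF >= 4 && $3 ~ /^-?[0-9]+(\.[0-9]+)?$/ { print "put " $0 }'
}
```

Usage would look like `./mysql_collector.sh | to_put_commands | nc opentsdb.example.com 4242`. tcollector adds the `put` prefix itself, so a filter like this applies only when writing to the TSD directly.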
ganderson@mydb.example.com:tcollector$ tree
.
|-- collectors
|   |-- 0                        <- Run forever
|   |   |-- ifstat.py
|   |   |-- iostat.py
|   |   |-- procnettcp.py
|   |   |-- procstats.py
|   |-- 15                       <- Run every 15 seconds
|   |   `-- dfstat.py
|   |-- 30                       <- Run every 30 seconds
|   |   |-- mysql_collector.sh
|   |-- 300                      <- Run every 5 minutes
|   |   `-- ptTcpModel.sh
|   `-- etc
|       |-- config.py
|-- config
|-- startstop
`-- tcollector.py
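The directory layout above encodes the schedule: the name of a collector's parent directory is its interval in seconds, and 0 means the collector is started once and left running. As a small illustration of that convention only, here is a sketch that reads the schedule off a collector path; `describe_schedule` is a hypothetical helper, not something tcollector ships.

```shell
#!/usr/bin/env bash
# Hypothetical helper: describe how tcollector would schedule a collector,
# based only on the name of its parent directory.
describe_schedule() {
    local dir
    dir=$(basename "$(dirname "$1")")
    case $dir in
        0)            echo "run forever" ;;
        *[!0-9]*|'')  echo "not a collector directory" ;;
        *)            echo "run every $dir seconds" ;;
    esac
}
```

For example, `describe_schedule collectors/300/ptTcpModel.sh` reports a 300-second interval, which matches the "every 5 minutes" callout.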
Querying Data
http://opentsdb.example.com
/#start=2013/06/05-17:00:00
&end=2013/06/05-19:00:00
&m=sum:hadoop.hbase.regionserver.requests
{server_type=dwh-data}
&o=axis x1y1
&m=sum:proc.stat.cpu.percentage_iowait
{server_type=dwh-data,dc=lv7,host=data08}
&o=axis x1y2
&ylabel=HBase Requests
&y2label=CPU IOWait
&yrange=[0:]
&wxh=1475x600
http://opentsdb.example.com
/q?start=2013/06/05-17:00:00
&end=2013/06/05-19:00:00
&m=sum:hadoop.hbase.regionserver.requests
{server_type=dwh-data}
&o=axis x1y1
&m=sum:proc.stat.cpu.percentage_iowait
{server_type=dwh-data,dc=lv7,host=data08}
&o=axis x1y2
&ylabel=HBase Requests
&y2label=CPU IOWait
&yrange=[0:]
&wxh=1475x600
&ascii
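With `&ascii` the query returns plain text, one data point per line, in the same "metric timestamp value tags" layout the collectors emit. That makes it easy to script against. The sketch below builds such a URL and summarizes the returned points; both function names are hypothetical, and the summary (point count and maximum) is just one example of what a tool might compute.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build an ASCII query URL.
# Arguments: base URL, start time, end time, metric expression (aggregator:metric).
build_ascii_url() {
    local base=$1 start=$2 end=$3 expr=$4
    printf '%s/q?start=%s&end=%s&m=%s&ascii\n' "$base" "$start" "$end" "$expr"
}

# Hypothetical helper: read ASCII query output on stdin and print the number
# of data points and the largest value seen (the value is the third field).
summarize_points() {
    awk '
        NF >= 3 { v = $3 + 0; if (n == 0 || v > max) max = v; n++ }
        END     { printf "points=%d max=%s\n", n, (n ? max : "n/a") }
    '
}
```

A possible pipeline, assuming `curl` is available: `curl -s "$(build_ascii_url http://opentsdb.example.com 2013/06/05-17:00:00 2013/06/05-19:00:00 sum:proc.stat.cpu.percentage_iowait)" | summarize_points`. Tag filters in braces would need URL encoding or curl's `-g` option.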
How does this change things?
In all seriousness, though...
• Easily see aggregate graphs
• Easily build graphs on-the-fly
• Full granularity forever
• API request for raw data
• Cluster-wide nagios checks with check_tsd
Challenges Switching
• Aggregates are the default
• Mouse-zooming (patched!)
• Auto-suggest for metrics
• “The graphs aren’t pretty”
• Migrating from proof of concept
• Plan for 5+ machines
• Data pruning may be required
Some Quick Numbers
OpenTSDB @ Box
• 24,448 metrics
• 79 tag keys
• 5,371,701 tag values
• 150,000 data points per second
To store metric data for anything that is measurable
Collection Philosophy
Next Steps
Enjoy #Hbasecon2013!
https://www.box.com/about-us/careers/
jcreasy@box.com
geoff@box.com
We’re Hiring!

HBaseCon 2013: OpenTSDB at Box

Editor's Notes

  • #2 Will be talking about OpenTSDB. How OpenTSDB changed monitoring at Box.
  • #6 Running Ganglia. Gets pushed metrics. Have to define aggregates. RRD format.
  • #7 Cacti has an easy centralized interface. Lots of templates accessible. Uses polling model.
  • #8 Graphite! Gets pushed metrics from various services. Need to define the graphs you want. Need to define aggregations.
  • #9 RRDs auto-downsample. Flat files can be hard to manage at scale. Pre-defined: need to know what you want to monitor; painful to set up new collections/graphs. Poll: doesn’t scale horizontally well; falls behind and data gets dropped; occasionally drop important metrics to make it catch up. Image sources: http://monami.sourceforge.net/gfx/ganglia.png, http://www.cacti.net/images/cacti_promo_main.png, http://graphite.wdfiles.com/local--files/screen-shots/graphite_cli_800.png, http://nagios.sourceforge.net/images/screens/new/service-detail.png
  • #10 Suddenly finding problems and correlating issues is difficult. Maybe you don’t have a NOC yet. Maybe you do, and they need better graphs.
  • #12 IT’S BIGGER ON THE INSIDE (just kidding). Fast! Easy to build graphs on the fly. Hella easy to scale: just add nodes (HBase or TSDs). Very easy to put data into it: NEXT SLIDES TALK ABOUT THIS YO.
  • #19 to #21 Running threads follows the CPU spikes PERFECTLY. Box has a “long query” killer that gets more aggressive as more threads stack up. Should get a look at queries on the server.
  • #23 If you prefer text, that’s also an option via API. You can build cool tools using the API: week-over-week graphs; simplifies anomaly detection. URL is pretty simple: effectively just use “q?” and add “&ascii”.
  • #25 Notes per bullet:
    • Aggregates are the default: a shift in thinking from looking at specific important servers.
    • Zooming in on a time slice was painfully manual. I wrote up a patch to add mouse-zooming and upstreamed it. This cemented OpenTSDB as a powerful monitoring tool for Box, overnight.
    • Auto-suggest for metrics is spotty: we wrote a quick cron job that dumps the full metric list into JSON.
    • “Graphs aren’t pretty”: a few changes to the base GNUPlot options solved this. There’s also a “Smooth” option in the interface now.
    • Migrating from POC: we had a single-node setup for the longest time until that fell over... a lot.
    • Plan for 3+ machines: it’s enough to run all the needed bits for a light-weight distributed HBase and TSD setup.
    • Data pruning: ~4 bytes per metric before HDFS replication add up quickly. mysql_tcollector: 370 metrics, ~1.5k per server, x 30s interval = ~4.2MB/day. Either have a plan to prune old data, or build out extra capacity and predict storage needs per server/metric added.
  • #33 New metrics available with every code push
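
The storage estimate in the note for slide #25 can be reproduced with a few lines of shell arithmetic. This is a sketch under the note's own figures (370 metrics, about 4 bytes per data point, one collection every 30 seconds); the function name `daily_bytes` and the optional replication argument are assumptions added for illustration. A factor of 3 is HDFS's usual default, not a figure given in the talk.

```shell
#!/usr/bin/env bash
# Hypothetical helper: estimate bytes stored per server per day.
# Arguments: metrics per collection, bytes per data point,
#            collection interval in seconds, replication factor (default 1).
# Integer division is used, so intervals that do not divide 86400 round down.
daily_bytes() {
    local metrics=$1 bytes_per_point=$2 interval=$3 replication=${4:-1}
    echo $(( metrics * bytes_per_point * (86400 / interval) * replication ))
}

daily_bytes 370 4 30     # 4262400 bytes, roughly the ~4.2MB/day in the note
daily_bytes 370 4 30 3   # 12787200 bytes with an assumed replication factor of 3
```

Multiplying the result by the number of servers and the retention period in days gives a rough capacity target to compare against a pruning plan.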